The major focus of my current research is how to extract structured data from scanned-in paper documents.
It’s common for reporters seeking electronic documents to receive PDF files that are essentially formatted as image files. Even if these images are converted to text using optical character recognition (OCR), the underlying structure of the data (which value comes from which field) is often lost. That’s not an accident: FOIA officers unfortunately often deliberately release structured data in a format that frustrates simple analysis.
Large-scale document digitization is now an established niche market, but typically at prices that are prohibitive for newsrooms. Paying to have paperwork retyped, or automatically extracted under human supervision, is a service priced for those who can pay: litigators, medical offices, and other deep-pocketed clients.
There’s been some important work on this and related problems in the small but thriving community of journalistic software developers. Tabula allows users to extract values from tables that appear within the body of a PDF, although it won’t work on scanned-in documents. DocHive was successful at extracting values at precise locations from scanned-in PDFs, but an interface for creating templates remains in development. Alex Byrnes was able to systematically capture common variables in TV documents in this repository, though ad buy costs were not captured.
My work is focused on documents that represent forms, invoices, and contracts with a fairly consistent, repetitive structure. Users should see a consistent interface regardless of whether the documents are scanned or text-based PDFs (though text-based PDFs will have fewer errors!).
There are several threads of inquiry that I’m pursuing:
Word Location. Tesseract, the open-source OCR library, supports word “bounding boxes.” PDF libraries for text-based PDFs support something similar (see Pdfminer, among others). The starting point for this project was the observation that we can throw word bounding boxes into a spatial database and use spatial queries to extract the data we’re interested in. But this process is subject to OCR errors, words being missed altogether, and bounding boxes being incorrectly sized.
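To make the bounding-box idea concrete, here’s a minimal sketch: plain SQL on box coordinates stands in for a true spatial index, and the words, coordinates, and region are all made-up illustrations rather than real OCR output.

```python
import sqlite3

# Each OCR'd word arrives with a bounding box: (text, x0, y0, x1, y1),
# in page coordinates with the origin at the top left. (Fabricated data.)
words = [
    ("INVOICE", 40, 20, 180, 50),
    ("Total:", 40, 700, 90, 715),
    ("$1,234.56", 100, 700, 180, 715),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word (text TEXT, x0 REAL, y0 REAL, x1 REAL, y1 REAL)")
conn.executemany("INSERT INTO word VALUES (?, ?, ?, ?, ?)", words)

def words_in_region(conn, x0, y0, x1, y1):
    """Return words whose boxes fall entirely inside a query rectangle,
    ordered left to right."""
    rows = conn.execute(
        "SELECT text FROM word "
        "WHERE x0 >= ? AND y0 >= ? AND x1 <= ? AND y1 <= ? ORDER BY x0",
        (x0, y0, x1, y1),
    )
    return [r[0] for r in rows]

# Pull whatever landed in the "total" region of the form.
print(words_in_region(conn, 0, 690, 600, 720))  # ['Total:', '$1,234.56']
```

A real deployment would use a spatial extension (and fuzzier matching, to tolerate the box-sizing errors mentioned above), but the query shape is the same: “give me the words inside this rectangle.”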
Page Filters. It’s common to receive document dumps where only certain pages are of interest. A naive way of filtering is by word location and size: consider only the pages that say “SUMMARY” in big letters at the top.
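That naive filter can be sketched in a few lines. The thresholds and page data below are illustrative assumptions, not measurements from any real document set:

```python
def is_summary_page(page_words, keyword="SUMMARY", top=100, min_height=20):
    """page_words: list of (text, x0, y0, x1, y1) word boxes for one page.
    Keep the page only if the keyword appears near the top in large type.
    The `top` and `min_height` cutoffs are assumed, tunable values."""
    for text, x0, y0, x1, y1 in page_words:
        if text.upper() == keyword and y0 < top and (y1 - y0) >= min_height:
            return True
    return False

pages = [
    [("SUMMARY", 200, 30, 400, 70), ("Account", 40, 120, 110, 135)],
    [("Detail", 40, 30, 100, 45), ("SUMMARY", 40, 500, 120, 515)],  # too low
]
summary_pages = [i for i, p in enumerate(pages) if is_summary_page(p)]
print(summary_pages)  # [0]
```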
Page Clustering. It should be possible to create clusters of similar pages with unsupervised machine learning, and perhaps improve on this with user input.
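One very simple unsupervised approach, sketched here as a stand-in for a real clustering algorithm: represent each page by its set of OCR’d words and greedily group pages whose word sets overlap enough (Jaccard similarity above a threshold). The pages and threshold are invented for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def cluster_pages(pages, threshold=0.5):
    """Greedy single-pass clustering: assign each page to the first cluster
    whose representative page is similar enough, else start a new cluster."""
    clusters = []  # list of (representative word set, [page indices])
    for i, words in enumerate(pages):
        for rep, members in clusters:
            if jaccard(words, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((words, [i]))
    return [members for _, members in clusters]

pages = [
    {"invoice", "total", "date", "amount"},
    {"invoice", "total", "date", "vendor"},   # similar to page 0
    {"summary", "account", "balance"},        # a different page layout
]
print(cluster_pages(pages))  # [[0, 1], [2]]
```

This is where user input could help: a person confirming or splitting a proposed cluster is much cheaper than labeling every page.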
Template automation. The starting point of this project assumed user-created templates, but a better solution would be a supervised learning approach to picking templates.
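As a toy version of “picking a template”: score each known template by how many of its characteristic keywords appear on the page, and choose the best scorer. This keyword-overlap scorer is my own stand-in for a trained classifier; the template names and keyword sets are fabricated examples.

```python
def pick_template(page_words, templates):
    """templates: dict mapping template name -> set of keywords collected
    from labeled example pages. Return the template whose keywords best
    cover the page (fraction of template keywords found on the page)."""
    def score(name):
        keywords = templates[name]
        return len(page_words & keywords) / len(keywords)
    return max(templates, key=score)

templates = {
    "invoice": {"invoice", "total", "date", "amount", "vendor"},
    "contract": {"agreement", "party", "term", "signature"},
}
print(pick_template({"invoice", "total", "due", "amount"}, templates))  # invoice
```

A real supervised approach would train on many labeled pages and use word positions as well as word identities, but the interface is the same: page in, template name out.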
During my fellowship year I’m reading the literature, hacking on code, sitting in on some awesome classes and harassing people from the internet for help. If you’ve got any suggestions, or want to work on similar problems, please get in touch!