The major focus of my current research is how to extract structured data from scanned-in paper documents.

It’s common for reporters seeking electronic documents to receive pdf files that are essentially formatted as image files. Even if those images are converted to text using optical character recognition (OCR), the underlying structure of the data (which value came from which field) is often lost. That’s not an accident: FOIA officers often deliberately release structured data in a format that frustrates simple analysis.
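To make the problem concrete, here is a minimal sketch of what plain OCR gives you. It assumes the pdf2image and pytesseract libraries (plus the poppler and tesseract binaries) are installed, and "scanned_form.pdf" is a hypothetical image-only pdf of the kind described above:

```python
from pdf2image import convert_from_path
import pytesseract

# Rasterize each page of the scanned pdf into an image.
pages = convert_from_path("scanned_form.pdf", dpi=300)

for page_number, page_image in enumerate(pages, start=1):
    # OCR returns a flat blob of characters: labels and values run together,
    # so the original field structure has to be reconstructed some other way.
    text = pytesseract.image_to_string(page_image)
    print(f"--- page {page_number} ---")
    print(text)
```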

Large-scale document digitization is now an established niche market, but typically at prices that are prohibitive for newsrooms. Paying to have paperwork retyped, or automatically extracted under human supervision, is a service priced for those who can pay: litigators, medical offices, and other deep-pocketed clients.

There’s been some important work on this and related problems in the small but thriving journalistic software development community. Tabula lets users extract values from tables that appear within the body of a pdf, although it won’t work on scanned-in documents. DocHive has succeeded at extracting values from precise locations in scanned-in pdfs, but an interface for creating templates remains in development. Alex Byrnes was able to systematically capture common variables from TV documents in this repository, though ad buy costs were not captured.
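For a sense of the Tabula workflow, here is a quick, hedged illustration using the tabula-py wrapper (which requires a Java runtime); "budget_report.pdf" is a hypothetical text-based pdf, since Tabula will not find anything in an image-only scan:

```python
import tabula

# read_pdf returns one pandas DataFrame per table Tabula detects in the document.
tables = tabula.read_pdf("budget_report.pdf", pages="all")

for i, table in enumerate(tables):
    print(f"Table {i}: {table.shape[0]} rows x {table.shape[1]} columns")
```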

My work focuses on documents that represent forms, invoices and contracts with a fairly consistent, repetitive structure. Users should see a consistent interface regardless of whether the documents are scanned or text-based pdfs (though text-based pdfs will have fewer errors!).
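A rough sketch of that "consistent interface" idea follows: callers ask for a page’s text the same way whether the pdf has a real text layer or is only a scan. The function name and the OCR fallback are my own illustration, not a finished design, and it assumes pdfplumber, pdf2image, and pytesseract are available:

```python
import pdfplumber
import pytesseract
from pdf2image import convert_from_path


def page_text(pdf_path: str, page_number: int) -> str:
    """Return the text of one page, falling back to OCR for image-only scans."""
    with pdfplumber.open(pdf_path) as pdf:
        text = pdf.pages[page_number].extract_text() or ""
    if text.strip():
        return text  # text-based pdf: fewer errors, as noted above
    # Scanned page: rasterize just this page and OCR it instead.
    images = convert_from_path(pdf_path, dpi=300,
                               first_page=page_number + 1,
                               last_page=page_number + 1)
    return pytesseract.image_to_string(images[0])
```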

There are several threads of inquiry that I’m pursuing.

During my fellowship year I’m reading the literature, hacking on code, sitting in on some awesome classes, and harassing people on the internet for help. If you’ve got any suggestions, or want to work on similar problems, please get in touch!