I won a JSK Fellowship and spent the 2015-16 academic year hanging out at Stanford, pursuing my own research and learning from an amazing group of journalists from all over the world.
It was a heady, weird time to be in the Bay Area, where I split time between applied machine learning and ocean/oil/remote sensing classes. I had worked in Palo Alto right after the Dotcom Crash in 2001, and a few things had changed. I watched in bafflement one day as a self-driving car got stuck doing a U-turn through a busy bike lane, and a human driver had to take over. A robotic mall security guard knocked a kid over. In computer vision and computational linguistics classes, the state of the art was being reset, annually, by deep learning techniques that longtime professors in the field had never heard of. It was bracing to hear a fifty-something professor explain why he'd given up on the techniques that had made him a leader in the field for decades in favor of neural networks.
A major focus of my research as a JSK Fellow was how to extract structured data from scanned-in paper documents. As a reporter, I'd written dozens of one-off document readers for increasingly complex text-based PDFs. At Sunlight, I was the first to extract the Senate's payroll as structured data from thousand-page PDFs of the "Report of the Secretary of the Senate."
I got interested in applying similar logic to scanned-in documents. The original project repo was called "What Word Where," which summed up the project's approach to creating its own flavor of "word vectors": a word, plus its quantized bounding box, so that documents could be indexed not only by words, but by geographic attributes. Storing a large-scale corpus in this manner means that documents can be selected by geographic queries on their component words.
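To make the idea concrete, here is a minimal sketch of that kind of index. It is not the code from the original repo: the `WordBox` class, the 20-cell grid, and the function names are all hypothetical, and it assumes word-level bounding boxes have already come out of an OCR step. Each word is keyed by its text plus the grid cell its box center falls in, so a query can ask for documents with a given word in a given region of the page.

```python
from collections import defaultdict
from dataclasses import dataclass

GRID = 20  # hypothetical choice: number of cells along each page axis


@dataclass(frozen=True)
class WordBox:
    """A word and its bounding box in page coordinates (hypothetical type)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def quantize(box, page_width, page_height, grid=GRID):
    """Map the center of a word's box to a (col, row) grid cell."""
    cx = (box.x0 + box.x1) / 2 / page_width
    cy = (box.y0 + box.y1) / 2 / page_height
    # Clamp so a box touching the far edge still lands in the last cell.
    col = min(int(cx * grid), grid - 1)
    row = min(int(cy * grid), grid - 1)
    return col, row


def build_index(docs, grid=GRID):
    """Index documents by (word, col, row).

    docs maps doc_id -> (page_width, page_height, [WordBox, ...]).
    """
    index = defaultdict(set)
    for doc_id, (width, height, words) in docs.items():
        for word in words:
            col, row = quantize(word, width, height, grid)
            index[(word.text.lower(), col, row)].add(doc_id)
    return index


def find(index, word, cols, rows):
    """Return ids of documents with `word` in any of the given cells."""
    hits = set()
    for col in cols:
        for row in rows:
            hits |= index.get((word.lower(), col, row), set())
    return hits
```

With an index like this, "every document with the word Invoice in the upper right of the page" becomes something like `find(index, "invoice", range(10, 20), range(0, 5))`. Quantizing to a coarse grid is what lets slightly misaligned scans of the same form land on the same key.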
My work focused on documents that represent forms, invoices and contracts that have a fairly consistent, repetitive structure. Much of what I worked on is still available on GitHub; I'm currently using a similar approach, but with different, less tightly coupled tooling for extractions. I'm a big fan of pdfplumber too.