In 2016, the IRS began releasing the raw XML, as submitted, of nonprofit corporations' electronically filed annual tax returns: Forms 990, 990PF, and 990EZ. I spent much of 2017 building systems for processing this data, including projects for The Chronicle of Philanthropy / The Chronicle of Higher Education, and for ProPublica's Nonprofit Explorer. IRSx, the main Python parsing library developed for ProPublica, is available as open source software. For documentation and more details of that project, see here. It's especially exciting that Nonprofit Explorer now also contains federal audits for nonprofits that have received U.S. government grants of $750,000 a year or more.
I'd used nonprofit tax forms as a government reporter, but spent more of my time at Sunlight tracking political money, and some of the unlikely ways it was used against a non-candidate, or the solar industry.
Nonprofit health is a trillion-dollar-a-year segment of the U.S. economy. Nonprofit education is worth hundreds of billions. What became available more recently is the complete record of the tax forms that nonprofits have filed electronically. Not all nonprofits are required to file electronically (and shady operators go out of their way to file on paper), but the largest generally do.
My work has focused on extracting, standardizing, and loading the full record set from the main form and all lettered schedules into relational databases for further analysis. I've standardized variables for tax schema years 2013 and forward in a variable-specific way, but I've also been working with the Nonprofit Open Data Collective to standardize variables across forms.
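
To give a rough sense of what "extracting and loading into relational databases" means in practice, here is a minimal sketch using only Python's standard library. It is not IRSx and not the code behind any of the projects above. The sample XML fragment, the element names in it, the table layout, and the helper names (create_tables, load_filing) are all assumptions made up for illustration and have not been checked against any IRS schema version. The idea it shows is simply that values appearing once per return become one row in a filing table, while repeating groups become one row each in a child table.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Namespace used in the sample fragment below (assumed for illustration).
NS = {"efile": "http://www.irs.gov/efile"}

# A tiny made-up return; element names are illustrative, not authoritative.
SAMPLE_XML = """<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <TaxYr>2015</TaxYr>
    <Filer><EIN>000000000</EIN></Filer>
  </ReturnHeader>
  <ReturnData>
    <IRS990>
      <CYTotalRevenueAmt>1250000</CYTotalRevenueAmt>
      <Form990PartVIISectionAGrp>
        <PersonNm>Jane Doe</PersonNm>
        <ReportableCompFromOrgAmt>90000</ReportableCompFromOrgAmt>
      </Form990PartVIISectionAGrp>
      <Form990PartVIISectionAGrp>
        <PersonNm>John Roe</PersonNm>
        <ReportableCompFromOrgAmt>0</ReportableCompFromOrgAmt>
      </Form990PartVIISectionAGrp>
    </IRS990>
  </ReturnData>
</Return>"""


def text(node, path):
    """Return the text of the first element matching path, or None."""
    found = node.find(path, NS)
    return found.text if found is not None else None


def to_int(value):
    """Convert a string to int, passing missing values through as None."""
    return int(value) if value is not None else None


def create_tables(conn):
    # One row per return, plus one row per repeating group in the return.
    conn.execute(
        "CREATE TABLE filing (ein TEXT, tax_year INTEGER, total_revenue INTEGER)"
    )
    conn.execute(
        "CREATE TABLE officer "
        "(ein TEXT, tax_year INTEGER, person_name TEXT, compensation INTEGER)"
    )


def load_filing(conn, xml_string):
    root = ET.fromstring(xml_string)
    ein = text(root, "efile:ReturnHeader/efile:Filer/efile:EIN")
    tax_year = to_int(text(root, "efile:ReturnHeader/efile:TaxYr"))
    form = root.find("efile:ReturnData/efile:IRS990", NS)

    # Values that occur once per return go into the parent table.
    revenue = to_int(text(form, "efile:CYTotalRevenueAmt"))
    conn.execute("INSERT INTO filing VALUES (?, ?, ?)", (ein, tax_year, revenue))

    # Each repeating group becomes its own row, keyed back to the return.
    for grp in form.findall("efile:Form990PartVIISectionAGrp", NS):
        conn.execute(
            "INSERT INTO officer VALUES (?, ?, ?, ?)",
            (
                ein,
                tax_year,
                text(grp, "efile:PersonNm"),
                to_int(text(grp, "efile:ReportableCompFromOrgAmt")),
            ),
        )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_tables(conn)
    load_filing(conn, SAMPLE_XML)
    for row in conn.execute("SELECT * FROM officer ORDER BY person_name"):
        print(row)
```

Keying every child row on the EIN and tax year is a simplification for this sketch; a real pipeline would more likely key on a per-filing identifier, since one organization can file more than one return for a given year (amended returns, for example).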