Easy Extracts of US Census Data

By Matthew Wigginton Conway 11 Dec 2015

Here at Conveyal, we work with US Census data a lot. Historically, retrieving this data has been a bit difficult, as you have to get the block-level geometries from one place, data on demographics from another, and data on employment from a third. You then have to extract all the files and join them in GIS. It’s even more complicated if you hail from an urban area that encompasses multiple states (like the home of Conveyal: Washington, DC). Finally, you have to interpret the column codes to map them to something meaningful for analysis. Who knew, for example, that CNS05 meant “number of jobs in manufacturing”?

To solve this problem, we decided to create a seamless data source for US Census data. We retrieved the 11 million block-level geometries for all US states and territories, as well as LODES data for states where it is available. We merged all of these state-level datasets into a single national file, and then split it up into 63,645 Web Mercator tiles at zoom 11, stored on S3 in GeoBuf format. Each tile includes all blocks whose envelope overlaps that tile. We use our seamless-census tool to perform this processing step. We also gave all those cryptic columns human-readable names; since we’re not using shapefiles, column names are not limited to ten characters.
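To make the tiling concrete, here is a minimal Python sketch of how a block's envelope could be mapped to zoom 11 tiles. It uses the standard Web Mercator ("slippy map") tile numbering, which is an assumption about how the tiles are indexed; the function names are made up for illustration and are not taken from seamless-census.

```python
import math

ZOOM = 11  # the zoom level named in the text
MAX_LAT = 85.0511  # approximate latitude limit of Web Mercator


def lonlat_to_tile(lon, lat, zoom=ZOOM):
    """Return the (x, y) tile containing a WGS84 point (hypothetical helper)."""
    n = 2 ** zoom  # number of tiles along each axis
    lat = max(min(lat, MAX_LAT), -MAX_LAT)  # clamp to the projection's range
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that points on the far edge still land in a valid tile.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def tiles_for_envelope(west, south, east, north, zoom=ZOOM):
    """List every tile that a (west, south, east, north) envelope overlaps."""
    x_min, y_min = lonlat_to_tile(west, north, zoom)  # tile y grows southward
    x_max, y_max = lonlat_to_tile(east, south, zoom)
    return [(x, y)
            for x in range(x_min, x_max + 1)
            for y in range(y_min, y_max + 1)]
```

Under this scheme a block whose envelope straddles a tile boundary is returned for every tile it touches, which matches the rule that each tile includes all blocks whose envelope overlaps it.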

Census blocks in Maryland, Virginia and the District of Columbia, divided up into tiles

With the tiles in place, it’s relatively easy to extract data for an arbitrary geographic bounding box, even one that crosses state lines. We select the tiles that overlap the area of interest, download them, and run the features through a final geographic filter to weed out any overselection. We then write the remaining features to a new GeoBuf file. We wrote a tool to do that, which is in the same GitHub repository. It’s also possible to perform extracts programmatically from Java using our library, and it wouldn’t be hard to implement the extractor in another language.
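The filtering step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the seamless-census implementation: features are plain dictionaries with hypothetical "id" and "envelope" keys, the final filter is a simple envelope test (the real tool may test the actual geometry), and duplicates are dropped because a block that overlaps several tiles would otherwise appear once per tile.

```python
def envelopes_overlap(a, b):
    """True if two (west, south, east, north) envelopes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def extract(tile_features, bbox):
    """Filter already-downloaded tiles down to the features inside bbox.

    tile_features: an iterable with one list of feature dicts per tile.
    bbox: the area of interest as (west, south, east, north).
    """
    seen = set()  # ids already emitted, to avoid cross-tile duplicates
    result = []
    for features in tile_features:
        for feature in features:
            if feature["id"] in seen:
                continue
            if envelopes_overlap(feature["envelope"], bbox):
                seen.add(feature["id"])
                result.append(feature)
    return result
```

Given two tiles that share one block, the shared block comes back once, and any block lying entirely outside the bounding box is discarded.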

Job density in and around Washington, DC, made with a seamless file extract. Note that the file contains parts of three states.

There’s no reason why we should keep this to ourselves, either. This is open data and it should be accessible to the world, so we’ve gone ahead and made the S3 bucket (lodes-data) where we store the tiles public. It’s a requester-pays bucket, so anyone using it pays the (minuscule) S3 bandwidth costs directly to Amazon. Just use the credentials from your AWS account to access it; the bandwidth you use will be added to your AWS bill. The data are the 2015 TIGER/Line blocks for every state, and 2013 LODES data for all segments and job types. Massachusetts, Puerto Rico and the US Virgin Islands have no data available, and Kansas uses the 2011 data (rather than 2013) since newer data is not available. We haven’t put demographic data from the decennial census in yet.
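For readers who have not used a requester-pays bucket before, the following Python sketch shows the one extra parameter such a request needs. It assumes the boto3 library and AWS credentials configured in the usual way. The object key is left as a parameter because the layout of keys inside the bucket is not described here, and the function names are hypothetical.

```python
def requester_pays_request(key, bucket="lodes-data"):
    """Build the arguments for an S3 GetObject call on a requester-pays bucket.

    key is the object's path inside the bucket; its layout is not documented
    in this post, so the caller has to supply it.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        # Acknowledges that the caller's AWS account pays for the transfer.
        "RequestPayer": "requester",
    }


def download_object(key):
    """Fetch one object and return its raw bytes (needs boto3 and credentials)."""
    import boto3  # imported here so the helper above works without boto3

    s3 = boto3.client("s3")
    response = s3.get_object(**requester_pays_request(key))
    return response["Body"].read()
```

Without the RequestPayer argument, S3 rejects requests to a requester-pays bucket with an access-denied error, even when the credentials are valid.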

The format we’ve devised isn’t specific to the US Census, either. We could use exactly the same infrastructure to handle extracts from any large dataset that can be represented as vector data, and then it could be accessed using the same tooling.

The extractor is also available as a Java class (see SeamlessSource and its subclasses in the seamless-census repository), so it’s easy to integrate with programs written in Java. The extractor is fairly simple, so it shouldn’t be difficult to port it to other languages where a geobuf library exists.