Infrastructure for Fast Geographic OSM Extracts

By Andrew Byrd, 28 Apr 2015

Conveyal is developing software that performs fast geographic extracts of up-to-date OpenStreetMap data anywhere in the world and delivers them on demand in compact binary formats. The results are promising, and we are now using Vanilla Extract to fetch road network data on demand for use in Transport Analyst.

Our needs

At Conveyal we make heavy use of OpenStreetMap (OSM) data. We don’t just render visual maps; we use the data for finding and analyzing paths through transportation networks. OpenStreetMap data frequently contains small errors that have no serious effect on the appearance or readability of maps but prevent routing algorithms from finding complete or correct paths. A good example is pedestrian bridges across railway lines or highways: when these bridges are missing from the data, or present but not connected to the adjacent streets, the city is split into disconnected halves.

Fixing a pedestrian bridge to improve a routing result

In our ideal OSM workflow, any user who spotted such a routing problem could edit the public OSM data using common web tools like iD. Those changes would immediately propagate to our local replica of OSM, and from there into the routing engine, providing rapid feedback about the effects of edits.

Existing tools make this possible for small or moderately sized extracts of OSM data from predetermined bounding boxes, but in our case this data flow must be integrated with our web-based transportation analysis tool, whose users can specify an arbitrary zone of interest anywhere in the world.

Existing systems

The core OSM infrastructure is built on relational database management systems (RDBMS) such as PostgreSQL and contains huge amounts of interlinked data. A compressed dump of this database is about 25 GB in size, not including the voluminous indexes that allow retrieval of elements based on geographic location or cross-entity references. It is constantly being modified by users around the world.

When someone needs to maintain their own up-to-date copy of this data, they have several options. The first is to set up their own RDBMS with a similar schema. General-purpose RDBMSs support very flexible queries and can robustly tolerate simultaneous modification by large numbers of concurrent users with almost no risk of corruption. These are exceptionally solid pieces of software and quite appropriate for a large central database. However, they make a trade-off, accepting heavy resource consumption to obtain these advantages. Initializing an RDBMS with OSM data can take several days and 300-400 gigabytes of storage. They are overkill for simple replication and geographic extraction.

The second option is to stick to flat-file dumps of the database, or subsets of it, in the OSM XML or PBF formats. Tools such as Osmosis, osmconvert, and osmfilter fit into such workflows by iterating over the entire file, essentially copying it and applying modifications or filters during the copy. This is a reasonable way to maintain an extract of a single city on a fairly long update cycle, but the approach becomes unwieldy when working with large numbers of regions or arbitrary zones.

The third option is to use a purpose-built replication server with a query API, the primary example of which is Overpass. The publicly accessible hosted instances of Overpass generously allow users to perform up to 10,000 queries totaling 5 GB of data per day. Unfortunately, we were not able to fit these public servers or our own local Overpass instance into our workflow. While Overpass represents a major step forward in how people retrieve OSM data, it appears to be specialized in fetching small or moderately sized blocks of data and applying flexible filtering or transformations for direct consumption by browser-based editors and other webapps. Most importantly, according to the project mailing list, there has been an official decision not to include PBF support. We strongly favor compact binary OSM data exchange formats and depend on them for rapid turnaround.

Our solution: Vanilla Extract

We gradually came to the conclusion that there was a place for a new single-purpose, no-frills piece of infrastructure. The result is Vanilla Extract (VEX for short), which aims to do only a few things but do them well: replicate the entire planet locally in a space- and time-efficient manner, provide fast, scalable extracts of every object within a bounding rectangle in PBF format, and keep the whole clone up to date on a minutely basis.

The current incarnation of Vanilla Extract is written in C and relies heavily on sparse memory-mapped files. We defer to the sophisticated, carefully tuned code in the operating system itself to move blocks of OSM objects on and off long-term storage as needed, and to handle gaps in the OSM identifier space and the spatial index grid. This keeps the code straightforward to read and understand while providing performance that is entirely sufficient for our purposes, greatly exceeding anything we were able to achieve with existing tools.
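
To make the idea concrete, here is a minimal sketch of that approach. The fixed-width node record, file name, and ID bound below are assumptions for illustration, not VEX's actual on-disk format: a single sparse file is mapped into memory and indexed directly by OSM node ID, so gaps in the ID space cost nothing on disk and the kernel's page cache decides what stays in RAM.

    /* Minimal sketch, not VEX's actual layout: a huge sparse file mapped
     * into memory and indexed directly by OSM node ID. Unwritten pages
     * (gaps in the ID space) occupy no disk space, and the OS page cache
     * handles moving data between RAM and long-term storage. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical fixed-width node record: lat/lon as fixed-point integers. */
    typedef struct {
        int32_t lat;  /* degrees * 1e7 */
        int32_t lon;  /* degrees * 1e7 */
    } node_t;

    #define MAX_NODE_ID 4000000000ULL  /* assumed upper bound on node IDs */

    int main(void) {
        int fd = open("nodes.dat", O_RDWR | O_CREAT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Extend the file to cover the whole ID space; on filesystems that
         * support sparse files this allocates no real blocks yet. */
        if (ftruncate(fd, (off_t)(MAX_NODE_ID * sizeof(node_t))) != 0) {
            perror("ftruncate"); return 1;
        }

        node_t *nodes = mmap(NULL, MAX_NODE_ID * sizeof(node_t),
                             PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (nodes == MAP_FAILED) { perror("mmap"); return 1; }

        /* Store and retrieve a node by writing straight into the mapping. */
        uint64_t id = 123456789ULL;
        nodes[id].lat = (int32_t)(45.5231 * 1e7);   /* roughly Portland */
        nodes[id].lon = (int32_t)(-122.6765 * 1e7);
        printf("node %llu: %.7f, %.7f\n", (unsigned long long)id,
               nodes[id].lat / 1e7, nodes[id].lon / 1e7);

        munmap(nodes, MAX_NODE_ID * sizeof(node_t));
        close(fd);
        return 0;
    }

On a 64-bit system the 32 GB mapping above only consumes real disk blocks for the pages that are actually written, which is what makes direct indexing by ID practical despite the gaps in OSM's identifier space.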

Cloning from PBF and extraction to PBF are implemented. Minutely updates are still under development, but the data structures are designed to make them low-impact operations. Replicating the planet takes a couple of hours on a solid-state disk and consumes under 70 GB of storage. Extracting a city the size of Portland, Oregon from this replica takes between 3 and 30 seconds, depending on how much of the data is currently paged in from long-term storage. Memory consumption during extraction should be essentially independent of bounding box size, since the implementation amounts to sequential iteration over spatial index bins, but the process benefits greatly from leaving the OS plenty of free memory for its caches.
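
As a rough sketch of that iteration (the grid resolution, helper functions, and bounding box below are assumptions for illustration, not VEX's actual parameters), extraction reduces to walking the grid bins that intersect the requested rectangle, one after another:

    /* Illustrative sketch of extraction as sequential iteration over
     * spatial index bins. The grid resolution and helper names are
     * assumptions for this example, not VEX's actual layout. */
    #include <stdio.h>

    #define BIN_DEG 0.1  /* assumed bin size in degrees */

    /* Map a coordinate to its grid bin (coordinates shifted to be positive). */
    static int bin_x(double lon) { return (int)((lon + 180.0) / BIN_DEG); }
    static int bin_y(double lat) { return (int)((lat +  90.0) / BIN_DEG); }

    /* Placeholder for whatever actually streams a bin's contents to PBF. */
    static void copy_bin_to_output(int x, int y) {
        printf("copying bin (%d, %d)\n", x, y);
    }

    /* Visit every bin intersecting the bounding box. Memory use does not
     * depend on the box size because bins are handled one at a time. */
    static void extract(double min_lon, double min_lat,
                        double max_lon, double max_lat) {
        int x0 = bin_x(min_lon), x1 = bin_x(max_lon);
        int y0 = bin_y(min_lat), y1 = bin_y(max_lat);
        for (int y = y0; y <= y1; y++)
            for (int x = x0; x <= x1; x++)
                copy_bin_to_output(x, y);
    }

    int main(void) {
        /* Rough bounding box around Portland, Oregon. */
        extract(-122.85, 45.43, -122.47, 45.65);
        return 0;
    }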

Experiments with recent versions of the excellent MapDB library indicate that we may be able to achieve comparable performance using this general-purpose storage backend in Java. It is likely that we will make the transition to MapDB for greater maintainability and integration with existing Conveyal code. A working prototype of Vanilla Extract in Java is included in our MapDB-based OSM library, which also contains example code to produce and consume our streamlined OSM data exchange format.