Reproducibility

Moving the discussion forward

Jens von Bergmann


CensusMapper has come a long way. In the latest iteration we opened up an API for convenient and pinpointed census data access for everyone. It’s a step that was overdue. The CensusMapper concept is built entirely around APIs, but they were geared toward mapping needs. With changes to only a couple of lines of code we adapted these to be useful for more general data needs.

But more importantly for us, this small change has already had a huge impact on how we at CensusMapper handle data analysis internally.

Why is the API such a big deal?

An “Application Programming Interface” is a formal protocol that specifies how different processes should interact. The CensusMapper API provides a stable, predictable and convenient interface for accessing census data.

In the past we went straight into the CensusMapper database if we wanted to work with census data. In plain language, if we wanted a certain subset of the data to work with, we wrote a line or two of SQL and extracted a CSV or JSON dump for further processing. This works great if you have direct access to the database and is maximally flexible. But it also means that the data extracts lack consistency, so the scripts we used for data processing, analysis and visualization always needed some adjustments to work with a new data extract. And that introduces friction. The API changes that. Now every data extract is formatted in the exact same way and our processing tools are reusable out of the box. The API has already made our workflows a lot more efficient, more than making up for the lack of flexibility compared to diving directly into the database.

Opening up the API

Then we opened up the API. That forced us to further formalize the protocol. We also initiated a public R package that acts as glue between R, a popular platform for statistical analysis, and the CensusMapper API. This enables anyone to easily use these tools, and to contribute to their development, which Dmitry and Aaron did. Open source can be a powerful tool to drive innovation and ensure high quality, and this is exactly what happened with the cancensus R package. Through the collaborative work of others, cancensus development was much faster than I could have managed alone and, most importantly, it became a much better package than I could have built on my own.

I have been an advocate for open data and open source for quite some time now, but it was amazing to see this at work in some of my own products.

cancensus

The API on its own is already quite useful. One can easily download census data in CSV format using the web tool, and the geographic data in GeoJSON. Part of the CensusMapper paradigm dictates that the geographic data and the census data need to remain separate, so the user will have to join these after accessing the data. When mapping data, CensusMapper takes care of this gluing operation in a way that is completely hidden from the user. Someone viewing CensusMapper maps would not know that each map tile requires two API calls to display the data: one for the geographic data and one for the census variables. Similarly, the cancensus R package takes care of this gluing operation internally. The user just needs to specify what format they want the geographic data in, or that they don’t need geographic data for a particular analysis, although this is generally not advisable for various reasons when working with census data.
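For readers who download the two pieces by hand, the gluing step described above can be sketched in a few lines of Python. This is a minimal illustration, not how CensusMapper or cancensus implement it: the column name `GeoUID`, the property name `id`, and the toy values are all assumptions made up for the example.

```python
import csv
import io

# Toy census extract (made-up values): one row per region, keyed by a region id.
CENSUS_CSV = """GeoUID,Population
A1,100
A2,250
"""

# Toy GeoJSON (made-up geometry): the properties carry only the region id.
GEOJSON = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"id": "A1"},
         "geometry": {"type": "Point", "coordinates": [0.0, 0.0]}},
        {"type": "Feature", "properties": {"id": "A3"},
         "geometry": {"type": "Point", "coordinates": [1.0, 1.0]}},
    ],
}


def join_census_to_geo(csv_text, geojson, csv_key="GeoUID", geo_key="id"):
    """Attach census columns to each feature whose id matches a CSV row."""
    rows = {row[csv_key]: row for row in csv.DictReader(io.StringIO(csv_text))}
    joined = {"type": "FeatureCollection", "features": []}
    for feature in geojson["features"]:
        census = rows.get(feature["properties"][geo_key], {})
        # Copy the census columns (minus the key) into the feature properties.
        extra = {k: v for k, v in census.items() if k != csv_key}
        props = {**feature["properties"], **extra}
        joined["features"].append({**feature, "properties": props})
    return joined


joined = join_census_to_geo(CENSUS_CSV, GEOJSON)
```

Regions without a matching census row (here `A3`) keep their geometry and simply gain no census columns, which makes gaps in the data easy to spot after the join.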

Reproducibility

Reproducibility is a big word. In this context I mean the ability for anyone to reproduce an analysis or visualization with minimal work. Internally at MountainMath, all our analysis is done via script, so we can re-run it at will and get the same result. This includes the extraction of the data out of the CensusMapper database in case we use census data. In that sense, pretty much everything we do is reproducible. By us. But not by other people, since they don’t have internal access to our database. And if we were a bigger shop with more people and various access privileges, it probably would not be reproducible by all people in the company.

The API changes that. There are still restrictions at the API level: some of our datasets come with access restrictions, so only select users can access them. But a good portion is open, and any analysis based on that data can now easily be shared. And access permissions are easily managed to open up other portions as needed.

Open Access

Reproducibility fits neatly into the open access paradigm. Academia is slowly moving in that direction. In Canada, researchers on tri-council grants are now required to publish their results in open-access journals. And while the transition is slow, the direction is clear and the push (and public shaming) to publish in open access journals gets louder. In the interest of full disclosure, I admit that in my past academic life I have published in open access journals as well as deep in evil empire territory. But the direction toward open access at all levels is clear.

What has received less attention is that tri-council grants also require that any data acquired via the grant be made public. And while academics are generally quite willing to email out a PDF of their research published in paywalled journals, I have had very little success compelling them to share raw data, even where there are no valid legal objections to doing so.

But increasingly one can find data and, if appropriate, code published together with research articles. One example of this is a recent interesting study by Miles Corak on Economic Opportunity in Canada that makes the data used in the study available. In cases like this, where the data is derived from a StatCan custom tabulation, providing a clean spreadsheet with detailed explanation is the best possible way to share the data.

But too often people publish analysis results while simply citing the standard census release data from Stats Canada as the data source. I have spent countless hours trying to reproduce other people’s work by guessing exactly what variables were used, how the analysis was performed and what statistics were used, and still could not get the results to match. In important cases I have contacted people, who mostly were more than willing to share further details, but this process is less than ideal.

Analysis + Code + Data API

Ideally an analysis should be reproducible by anyone. Traditionally that meant explaining the methods used in great detail, often providing the formulas in the printed publication. This introduces all kinds of friction, from typos in the formulas to necessarily skipped steps and incomplete descriptions, to the hassle of having to rewrite all the code in order to reproduce the analysis. Sharing the code and the data removes these barriers. But modern datasets are often massive. Even census data, sizable at about half a billion fields per standard census data release though not “big” data by today’s standards, becomes somewhat unwieldy when shared. Plus it becomes impossible to verify the authenticity of the shared data.

That’s where APIs come in. They remove the need to share the data; the code calling the API is enough. And because API calls can pinpoint the required data much better, one only grabs the data needed for the analysis instead of downloading a large census dataset and then cutting it down to the required chunk.
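To make the idea concrete, here is a small Python sketch of what "sharing the call instead of the data" can look like. Everything in it is an assumption for illustration: the base URL is a placeholder, and the parameter names (`dataset`, `regions`, `variables`, `level`) are invented rather than taken from the CensusMapper API.

```python
from urllib.parse import urlencode

# Placeholder endpoint for illustration only, NOT the real CensusMapper URL.
BASE_URL = "https://api.example.org/census"


def build_request_url(dataset, regions, variables, level):
    """Describe a pinpointed data request as a single URL.

    Regions and variables are sorted so that the same request always
    yields the same URL, regardless of the order they were listed in.
    """
    params = {
        "dataset": dataset,
        "regions": ",".join(sorted(regions)),
        "variables": ",".join(sorted(variables)),
        "level": level,
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Hypothetical identifiers: two regions, two variables, one aggregation level.
url = build_request_url("census-2016", ["R2", "R1"], ["v2", "v1"], "tract")
print(url)
```

The few arguments passed to `build_request_url` fully describe the extract, so publishing them with the analysis code is enough for someone else to fetch the same slice of data.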

The CensusMapper API can provide this service, with the exception that it is not an “authoritative” source. We have spent quite some time verifying the accuracy of our data imports and that the API calls accurately reproduce StatCan data, but, as much as I hate to admit it, we lack the authority that StatCan has (even though StatCan themselves are not free of data quality issues). Ideally the API that we have would be built and hosted by StatCan. And while StatCan has for years promised to deliver just such an API under their “New Dissemination Model”, this has failed to materialize and prompted us to open up ours. StatCan would also have the resources to remove API quotas to unlock the full potential of such an API. That will never remove the need for third party APIs like ours, which will also offer processed data like our common tiling offering 2011 and 2016 data on the same geographies down to the dissemination block level. On the contrary, I expect it would accelerate the development of more third party APIs as it would make it a lot simpler to build derived products.

Fully reproducible analysis

At CensusMapper we love maps. As much as we hate to admit it, a map isn’t an analysis. It may facilitate the communication of the result of an analysis, but it does not replace it. In some cases, for example our Diversity Index Map, the map performs its own mini-analysis on the fly: it computes the diversity index based on the ethnicities in each area and displays the result. The map story gives some context. A more complex example is our Surprise Maps that dynamically perform some statistical hypothesis testing as explained in more detail in a previous post. But while these can count in some limited way as analysis, the more general point remains that maps aren’t analysis.
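As a rough idea of what such an on-the-fly computation might look like, the Python sketch below computes a diversity index from group counts in one area. The post does not say which formula the map uses, so the Shannon entropy used here is an assumption, and the group names and counts are made up.

```python
import math


def diversity_index(counts):
    """Shannon entropy of the group shares in one area (assumed formula).

    Returns 0.0 when a single group makes up the whole area and grows
    as the population is spread more evenly over more groups.
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        if count > 0:  # empty groups contribute nothing
            share = count / total
            entropy -= share * math.log(share)
    return entropy


# Made-up counts for two hypothetical areas.
uniform_area = {"group_a": 500, "group_b": 500}
single_group_area = {"group_a": 1000, "group_b": 0}

print(diversity_index(uniform_area))
print(diversity_index(single_group_area))
```

Whatever the exact formula, the point stands that the index is a per-area summary derived from a handful of census variables, which is why it can be computed live while the map renders.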

Transparency

For our outward facing mini (or sometimes not so mini) analyses here on our blog we have now migrated our blogging platform to Hugo-based blogdown, which means our blog posts are now written in R markdown. And it has immediately transformed how we write our blog posts. The analysis is now embedded directly into the post. Code, explanations and visualizations are all part of one and the same document. For legibility we do suppress the display of some of the code snippets, trying to strike a balance between general readability and providing details to the more technically minded reader. The code blocks are clearly identifiable as such, and I expect most readers to skip right over them, while readers who want to know in more detail how our analysis was performed can easily get that information. And anyone who wants to check, expand or modify the analysis can just download the R markdown, view the suppressed code blocks, and immediately re-run the analysis on their machine and make whatever changes they want.

Even simple things like the numbers in our post are actually just the result of computations done in R, eliminating all friction between analysis and the report.

[Figures: R markdown code; blog post output]

In short, our blog posts and our analysis are one and the same thing, although they still reside in different places. The blog is served as HTML at doodles.mountainmath.ca, and the R markdown resides in our public GitHub repository. That separation is just technical though; our HTML website is a derived product. Updating the GitHub repo is what triggers an automated deployment to our website via Netlify, so the two will never be out of sync. The GitHub repo keeps the entire history of commits, reflecting all updates to the blog and providing full transparency.

Moving the discussion forward

We believe that this is the process that is needed nowadays to “move the discussion forward”. In our hard to navigate information age we too often end up being torn between opposing opinions with no way to evaluate them, having to resort to choosing whom to trust. People play up their credentials and titles to end up ahead in the battle for trust. (My pet peeve: I have a policy not to retweet or follow anyone who feels the need to add their academic credentials to their Twitter name.)

One way out of this is reproducible and transparent analysis. This point of view values the analysis itself over the author; it pitches the analysis as a stepping stone toward greater insight. Platforms like GitHub can serve as collaborative environments where people can discuss issues they find with particular parts of an analysis and collaboratively solve them. Or fork the analysis and move it further on their own. Not everyone has the ability to do this, but enough people do. It only takes a moderate technical skill level to download RStudio, grab the R markdown file from the web and run it. And one does not have to be an expert in R to quickly get a grasp of what is going on and make minor modifications to play with the data. And certainly anyone participating in data discussions will have the ability to do so.

Other Data Sources

The Census is just one data source; there are many others. After seeing how cancensus has sped up my work cycle, I have now written API wrappers for pretty much every data source I am using. Some of these are public, and I have started to publish the packages I wrote to access them, e.g. one to pry data out of the CMHC database. And I am by no means the only one doing this. A quick search reveals multiple packages people have written to access CANSIM data, both for R and for Python.

Give us Data Analysis!

In Vancouver we have heard cries for more data. And data is lacking for many important analyses, and we need to keep pushing for the release of more and better data. But what’s even more needed than data is people to analyse data and turn it into information, to provide answers to questions. I am hoping that the opening up of the CensusMapper API can make it easier for others to add their skills and expertise. And that the general idea of reproducible and transparent analysis can catch on and develop into a constructive cycle that propels the debate forward.
