Civic Data Analysis

I would like to propose a system for collaborative analysis of datasets important to large publics.

We are in the age of big data – sensors and trackers are everywhere, whether in physical locations or in software applications, and they generate huge volumes of data. A large variety of tools have been built to deal with the data explosion, in particular systems for data storage and computation across a cluster of computers. However, many of these tools require a deep understanding of programming and computer systems, and are difficult for casual data analysts to use. More recently, web-based frontends to these complex systems have been developed for nontechnical analysts, but they are expensive and must be set up by IT staff on a company’s own servers. In short, there are few freely available, easily accessible (e.g. web-based) tools for recreational data analysts, probably because this demographic is too small to be the focus of a for-profit venture.

When we turn to datasets of broad public interest, though, it seems likely that there is a widespread desire among Americans – if not a means of monetizing this desire – to analyze the data for themselves and draw their own conclusions. For example, anonymized U.S. census data is freely available, and there are numerous interesting questions that could be asked of it. What is the average age of the residents of every state? What about average income? It seems likely that there are analyses of census data that could yield shocking results about inequality or other matters and could spur citizens to action. I see such analyses as a type of civic journalism, one that is spare on prose and lets the data speak for itself. There are numerous other datasets that could be of similar civic value, including the Reference Energy Disaggregation Dataset on home energy usage, congressional voting records, and anonymized healthcare records.
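To give a sense of how simple these analyses can be, here is a minimal sketch of the average-age and average-income questions above. It assumes a hypothetical CSV of anonymized census rows with `state`, `age`, and `income` columns (real census microdata uses different column names and many more fields; the sample values below are invented for illustration):

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample of anonymized census rows. Real census
# microdata has different column names and many more fields;
# this tiny CSV is invented purely for illustration.
SAMPLE = """state,age,income
MA,34,52000
MA,61,48000
NY,29,61000
NY,45,73000
TX,38,44000
"""

def averages_by_state(csv_text):
    """Return {state: (average_age, average_income)} for the given CSV."""
    totals = defaultdict(lambda: [0, 0, 0])  # [age_sum, income_sum, count]
    for row in csv.DictReader(io.StringIO(csv_text)):
        t = totals[row["state"]]
        t[0] += int(row["age"])
        t[1] += int(row["income"])
        t[2] += 1
    return {state: (ages / n, incomes / n)
            for state, (ages, incomes, n) in totals.items()}

print(averages_by_state(SAMPLE))
```

A platform like the one proposed here would let users run this kind of aggregation through a web interface rather than by writing code, but the underlying computation is no more complicated than this.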

So it seems the time is right for a collaborative web-based data analysis platform. The existing system most similar to what I propose is called DataHub (http://datahub.csail.mit.edu/www/), a research project from MIT CSAIL. It is a sort of GitHub for data – it allows users to upload datasets and other users to create their own copies of these datasets that they can play with and modify independently. It also has a powerful plugin system that allows users to write applications that can operate on datasets – for example, programs to clean up datasets (e.g. by identifying typographical errors and correcting them), to convert datasets from one format to another (unstructured to tabular), to run specialized analyses like machine learning algorithms and visualizations on the data, and more. This is exciting – as more and more people use the system, the number of applications available for it and the power of the analyses that can be performed will grow. As a computer science research project, DataHub focuses more on technical ideas like minimizing data duplication, and is less concerned with potential societal impacts. This is where I would like to come in.

I think the piece missing from DataHub that would be particularly useful for civic datasets is a comment section for every data analysis, which would allow other users to chime in and discuss methodological issues with, or potential implications of, the analysis. In addition, I think a news-like component would be interesting, with very popular analyses or datasets surfaced on the front page and potentially even articles written about them. This would support the idea of data-driven civic journalism.

Just to provide a visual of what the part of the system devoted to data analysis might look like, I’ve included a screenshot of Paxata, a commercial system for dataset cleaning:

[Screenshot: Paxata dataset-cleaning interface]

1 thought on “Civic Data Analysis”

  1. In some ways, this project is similar to what Yu is proposing around civic coding projects (http://civicmediaclass.mit.edu/2015/03/17/delicious-for-civic-coding/), but instead focuses on the aggregation and analysis of civic data. The Sunlight Foundation has a long history of working to open and generate datasets of public interest (http://sunlightfoundation.com/api/). My question for you would be to look closely at those projects and the documentation and news reports around them and try to understand what’s missing from their approach. I also want you to think deeply and look for evidence of your claim that there is a strong interest among citizens in analyzing data for themselves to draw original conclusions. I would argue that one of the biggest challenges to the rhetoric of the open data movement has been the lack of interest in using that data. More efforts are trying to argue that this simply requires easier-to-use tools like Socrata, which sells open data portals to local governments. The other place you should investigate is the annotation community. This is still an unsolved problem for texts broadly speaking. Look at tools like Annotation Studio from MIT (http://www.annotationstudio.org/) and hypothes.is (http://hypothes.is/). OpenCongress from Sunlight has some of these features, though very rudimentary, around congressional lawmaking.
