Civic Data Analysis with DataHub

I propose a platform for civic data analysis based on the MIT CSAIL project DataHub. DataHub is a GitHub for data, allowing users to follow other users and their data analysis projects and fork others’ projects to extend them. It provides an ecosystem of composable applications for data ingestion, cleaning, visualization, and statistical analysis. I argue that DataHub’s centralization of datasets and analyses on a single platform, support for extension of others’ analyses, and emphasis on composability of applications make it uniquely suited – unlike Socrata and other competing platforms – for creating excitement around civic data analysis, which is imperative for a large community to form. Simply put, DataHub lowers the barriers to gleaning insights from civic data dramatically, eliminating the standard challenges of installing a variety of software applications, reformatting data, and more. I suggest a few extensions to DataHub as it appears today, including comment sections on datasets and analyses, news articles about interesting analyses, and a news feed surfacing analyses and datasets that may be of interest to users. I believe these extensions will further solidify the platform as a one-stop shop to be inspired by others’ work and quickly go from idea to execution on one’s own analyses.

Full Paper

Gladwell’s False Dichotomy

As mentioned in class, I think Gladwell is trying to create a false dichotomy between in-person, strong-tie based organizing and its social media counterpart. There is no reason why the two cannot complement one another as part of a single organizing effort. I completely agree that strong ties are required to inspire the vast majority of people to take large risks; I personally would be very unlikely to participate in a sit-in unless at least a few people I knew were going along with me and believed just as strongly in the cause. But that does not mean that social media does not have the power to tap into strong ties that might otherwise lie dormant.

Some of the most important supporting examples here come from after 2010, when Gladwell wrote the article. I think it is worth returning to the example of Egypt. Attending a protest in Tahrir Square, where injury or death due to police brutality or fighting among protesters was entirely possible, is certainly on par with participating in a sit-in at Woolworth’s. Yet it is unreasonable to say that Twitter, Facebook, and social media organizing did not play a significant role in bringing hundreds of thousands of Egyptians to the streets. The beauty of Twitter and Facebook here was that they connected people who already tended toward activism and for whom weak ties were a sufficient inspiration to participate. These people in turn went out to their close friends and relatives and convinced them to join. But there is no way such strong ties alone could have led to the organization of such a large protest in such a small timeframe. Networks of strong ties are by nature much sparser (e.g. each person only has a few close friends) than those of weak ties, and information flow through them is much slower. So while the Woolworth’s sit-ins may have occurred primarily due to word of mouth through strong ties, they took far longer to get off the ground than they would have in the Internet era. In effect, social networks in one fell swoop seeded many networks of strong ties with the idea of protesting, parallelizing the spread of the message.

I would also like to argue that social networks alone, without any help from strong ties, can effect significant change if the number of people who need to take bold action is small. For example, a large-scale social media effort to find an organ donor for a sick person may reach many people who, because there are no strong ties involved, choose to ignore the call. But if just one of the thousands or millions of people who are reached is selfless enough to donate, the effort is a success. In essence, social networks dramatically increase the likelihood of reaching such outliers who, contrary to Gladwell’s accurate generalization, respond strongly to requests from weak ties. In fact, such a response probably could not be obtained just by resorting to strong ties – sometimes even people who are close to us may be either incapable of or unwilling to do something that a complete stranger might. Now, if Gladwell means to focus solely on large-scale organizing then this sort of search for outliers is irrelevant, as the few outliers who may exist are not sufficient to achieve the goals at hand. But he speaks more broadly about change and the inability of weak ties to do anything nontrivial, and I think on those fronts these points directly refute him.

Wael Ghonim and the Egyptian Revolution of 2011

By the start of the second decade of the 2000s, Egyptians were tiring of dictator Hosni Mubarak’s rule. Unemployment was high, police brutality was on the rise, and elections were rigged. As access to the Internet and social media in Egypt grew, citizens gained a newfound ability to voice their grievances, and the discontent that had been brewing in Egypt for decades finally came to a head.

The story of Wael Ghonim is particularly telling. A 29-year-old marketing executive at Google, Ghonim was a member of Egypt’s upper class and prior to 2010 had little interest in politics. In June of 2010, though, the story of Khaled Said, a 28-year-old who had been beaten to death by the police, finally inspired him to action. Capitalizing on the groundswell of emotion over this incident across the country, he created a Facebook page entitled “We Are All Khaled Said.” Within two minutes, the page had 300 likes, and within three months, it had 250,000. Egypt’s online revolution was underway. Ghonim’s fluency with online marketing had helped him realize that a Facebook page would spread far more quickly than a Facebook group. Furthermore, he recognized that to maintain credibility with the page’s followers, he should use the pronoun “I” rather than “we” in posts so as to avoid appearing like yet another organization or political party. This was a revolution about personal freedom and liberation from the oppressive institutions and parties of old.

Ghonim skillfully channeled the emotions of an unsettled populace through social media, but he also understood that social media would not be sufficient if lasting change was to be made. To organize physical protests of significant scale, he knew that the vast population of Egyptians without Internet access would need to be reached. He canvassed his page’s followers for ideas on how to best spread the word. They suggested flyers and text messaging, which turned out to be very effective. These interactions suggest a comparison to the classic principles of political organizing. In well-funded political campaigns, like those for higher office in the United States, a large staff of paid organizers can be hired, and these organizers can then recruit further volunteer staff and go door-to-door, send texts, or otherwise interact with constituents. The difference in many social movements is that there is no money, so this initial bootstrapping process with paid staffers cannot occur. So instead a core of educated, motivated activists must form through social media, just as was the case in Egypt. Through online discussions, this core can devise strategies for reaching large offline populations, and these strategies will be very similar to their counterparts in traditional political organizing, except carried out entirely by volunteer activists rather than paid staffers. The end result is the same: a large group of citizens galvanized around a common set of issues.

Needless to say, Ghonim’s efforts were largely successful. Leveraging his existing base of followers and using the Twitter hashtag “#jan25” and a Facebook page that was inclusive of a large number of interest groups, Ghonim helped organize a huge anti-Mubarak protest in Tahrir Square on January 25, 2011. Around two and a half weeks and a few protests later, President Mubarak resigned and the floodgates of democracy were opened. Social media had unleashed the incredible power of the Egyptian people.

Primary source: http://www.nytimes.com/2012/02/19/books/review/how-an-egyptian-revolution-began-on-facebook.html?_r=0

CodeCademy & the Importance of Programming

I consider CodeCademy and other websites that teach computer programming to be fantastic examples of inclusive civic technology. Our readings from last week talked about how digital inequality is now defined not by differences in access to technology but by differences in literacy, so I feel that inclusive civic technology should seek to promote digital literacy. Full participation in the digital public sphere requires considerable technical sophistication. For example, it may be important for digital citizens to understand how to encrypt messages in order to communicate in a secure manner with others, especially in countries where governments spy actively on their people. This may sound extreme, but almost all complex digital interactions require at least some understanding of files and filesystems, the Internet and routing, email protocols, and more – think uploading photos to Facebook, using BitTorrent, or setting up a desktop mail client. Even if they are not in a better position to use technology, those with greater knowledge of computer systems are at the very least more likely to know when new technology could be useful and to actually build it.

I claim that learning how to program is the first step to gaining a thorough high-level picture of the mechanics of common computer systems. Sure, beginning programmers on CodeCademy may start with high-level interpreted languages like Ruby or Python that provide little insight into the underlying capabilities of the hardware. But the right Ruby or Python tutorial – and I believe CodeCademy has many of these – shows a learner the power of programming and the range of tools that can be programmed and inspires them to dig deeper. When a Python programmer tries to open a connection to another server and gets some opaque error, they will likely be motivated to learn more about TCP and the other protocols that power computer networks to understand exactly what could have caused the problem. The basic libraries of Python and Ruby touch a number of central concepts, including networking, filesystems, user interfaces, numerical methods, and more. If a curious programmer expanded out from this basic core, they would soon find themselves with a fairly thorough grasp of many key ideas. They might be motivated to learn a language like C that exposes more of the bare hardware to them, giving them an intuition for what can be done in a computer system. I think this intuition – for the capabilities of a machine or the capabilities of software running on a particular operating system – is what helps a digital citizen challenge the conventional wisdom about what is possible, enabling more effective use of technology or creation of new technology. For example, a less educated citizen may see that his BitTorrent download is taking a long time and may blame his slow network connection, not thinking that maybe BitTorrent is not using much bandwidth because it is not opening concurrent network connections or that a different torrent altogether may be faster because it will connect to different peers. A more educated citizen might recognize the power of the Bitcoin blockchain to form the basis of some new means of peer-to-peer exchange that is impervious to government spying.
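
To make the connection-error scenario above concrete, here is a minimal Python sketch that tries to open a TCP connection and reports roughly which step failed. The helper name diagnose_connection and the labels it returns are my own illustrative choices, not something taken from a CodeCademy tutorial.

```python
import socket


def diagnose_connection(host, port, timeout=3.0):
    """Try to open a TCP connection and describe what went wrong, if anything.

    Hypothetical helper for illustration; the returned labels are informal.
    """
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.gaierror:
        # The host name could not be resolved to an address (a DNS problem).
        return "name resolution failed"
    except socket.timeout:
        # No reply arrived in time, e.g. a firewall silently dropping packets.
        return "timed out"
    except ConnectionRefusedError:
        # The machine answered, but nothing is listening on that port.
        return "connection refused"
    except OSError as exc:
        # Any other operating-system-level network error.
        return f"other network error: {exc}"
```

Calling diagnose_connection("example.com", 80) on a machine with Internet access would normally report a successful connection. Each of the other branches corresponds to a different layer of the network stack, which is exactly the kind of thing a curious beginner ends up reading about.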

In all, the basic understanding of programming that websites like CodeCademy provide can motivate many previously nontechnical citizens to learn about the underpinnings of the digital public sphere and empower them to be content creators and influencers.

Civic Data Analysis

I would like to propose a system for collaborative analysis of datasets important to large publics.

We are in the age of big data – sensors and trackers are everywhere, whether in physical locations or in software applications, and they generate huge volumes of data. A large variety of tools have been built to deal with the data explosion, in particular systems for data storage and computation across a cluster of computers. However, many of these tools require deep understanding of programming and computer systems, and are difficult for casual data analysts to use. More recently, there have been web-based frontends for these complex systems developed for nontechnical analysts, but they are expensive and need to be set up by an IT staff on a company’s own computers. In short, there are few freely available, easily accessible (e.g. web-based) tools for recreational data analysts, probably because this demographic is too small for it to be the focus of a for-profit venture.

When we turn to datasets of broad public interest, though, it seems likely that there is a widespread desire among Americans – if not a means of monetizing this desire – to analyze the data for themselves and draw their own conclusions. For example, anonymized U.S. census data is freely available, and there are numerous interesting questions that could be asked of it. What is the average age of the residents in every state? What about average income? It seems likely that there are analyses of census data that could yield shocking results about inequality or other matters and could spur citizens to action. I see such analyses as a type of civic journalism, one that is spare on prose and lets the data speak for itself. There are numerous other datasets that could be of similar civic value, including the Reference Energy Disaggregation Dataset on home energy usage, congressional voting records, and anonymized healthcare records.
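
As a rough illustration of how simple the census questions above are once the data is in a usable form, here is a short Python sketch that computes a per-state average. The records and the function name average_by_state are made up for the example; real census data would have far more rows and fields.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records, not real census data.
records = [
    {"state": "MA", "age": 34, "income": 52000},
    {"state": "MA", "age": 58, "income": 71000},
    {"state": "TX", "age": 41, "income": 48000},
    {"state": "TX", "age": 23, "income": 30000},
    {"state": "TX", "age": 65, "income": 39000},
]


def average_by_state(rows, field):
    """Group rows by state and return the mean of `field` for each state."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["state"]].append(row[field])
    return {state: mean(values) for state, values in grouped.items()}
```

Here average_by_state(records, "age") answers the first question and average_by_state(records, "income") the second. The hard part for a casual analyst is rarely this computation; it is getting the data downloaded, cleaned, and loaded in the first place.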

So it seems the time is right for a collaborative web-based data analysis platform. The existing system most similar to what I propose is called DataHub (http://datahub.csail.mit.edu/www/), a research project from MIT CSAIL. It is a sort of GitHub for data – it allows users to upload datasets and other users to create their own copies of these datasets that they can play with and modify independently. It also has a powerful plugin system that allows users to write applications that can operate on datasets – for example, programs to clean up datasets (e.g. by identifying typographical errors and correcting them), to convert datasets from one format to another (unstructured to tabular), to run specialized analyses like machine learning algorithms and visualizations on the data, and more. This is exciting – as more and more people use the system, the number of applications available for it and the power of the analyses that can be performed will grow. As a computer science research project, DataHub focuses more on technical ideas like minimizing data duplication, and is less concerned with potential societal impacts. This is where I would like to come in.
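
To give a feel for what forking a dataset and chaining composable applications could look like, here is a toy Python sketch. It is emphatically not DataHub’s actual API: every name in it (fork, clean_names, to_table, compose) and the sample row are hypothetical and exist only to illustrate the idea.

```python
import copy


def fork(dataset):
    """Return an independent copy, so edits do not affect the original."""
    return copy.deepcopy(dataset)


def clean_names(rows):
    """Toy cleaning app: trim whitespace and normalize capitalization."""
    return [{**row, "city": row["city"].strip().title()} for row in rows]


def to_table(rows):
    """Toy conversion app: turn a list of dicts into a header plus row tuples."""
    header = sorted(rows[0])
    return [tuple(header)] + [tuple(row[key] for key in header) for row in rows]


def compose(*apps):
    """Chain apps so the output of one becomes the input of the next."""
    def pipeline(data):
        for app in apps:
            data = app(data)
        return data
    return pipeline


# Hypothetical dataset uploaded by some other user.
original = [{"city": " boston ", "population": 650000}]

# Fork it, then run a cleaning app followed by a conversion app.
mine = fork(original)
table = compose(clean_names, to_table)(mine)
```

The point of the sketch is that each application only needs to agree on the shape of its input and output, so applications written by different people can be combined freely, and the original uploader’s data is never modified.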

I think the piece missing from DataHub that would be particularly useful for civic datasets is a comment section for every data analysis, which would allow other users to chime in and discuss methodological issues with or potential implications of analyses. In addition, I think a news-like component would be interesting, with very popular analyses or datasets surfaced on the front page and potentially even articles written about them. This would support the idea of data-driven civic journalism.

Just to provide a visual of what the part of the system devoted to data analysis might look like, I’ve included a screenshot of Paxata, a commercial system for dataset cleaning:

[Screenshot: Paxata]

We the People

A common complaint about the U.S. government is that lawmakers, cooped up in their offices in Washington, D.C., are ignorant of many of the most pressing issues facing the American people. They spend hours debating bills that don’t address fundamental problems, and there is no way for citizens to collectively inform them that they are off track.

As more and more people began to draw attention to the potential of the Internet to connect citizens and lawmakers and rectify this problem, the federal government decided to take action. The White House created a system called We the People that allows citizens to submit petitions that they want lawmakers to acknowledge and act upon. If a petition receives enough signatures (100,000 now, formerly 25,000 and 5,000), the White House promises to at the very least write a response to it, and where possible take further action.

We the People certainly has some good qualities. It has a nice user interface that surfaces petitions that have recently cleared 150 signatures – the minimum required for them to be publicly visible – enabling such new petitions to gain further traction; allows users to search over the text of petitions or filter them by popularity or issue; facilitates easy creation of new petitions; and supports easy sharing of petitions on Twitter or Facebook. In fact, the source code of the website is on GitHub and users can submit suggested changes to the code. From the perspective of Benkler’s “Networked Public Sphere,” the system successfully takes advantage of the Web to unite people from across the country behind common issues. My only criticism on this front is that it doesn’t have facilities for users to comment on and discuss the texts of petitions so that they can be improved, which would make the interactions among participants richer than they are now, since interactions now mostly involve agreement by signing.

The main criticisms of We the People revolve around the responsiveness of the government to petitions that have cleared the signature threshold. In most cases the White House provides a few words affirming its general commitment to the principles outlined in the petition, without any concrete details or plans of action. For example, one petition requested that during his visit to India President Obama ask Indian Prime Minister Narendra Modi why the Indian constitution does not recognize Sikhs. It received a response that praised Obama for “underscoring that India’s success depended on the nation not being splintered along the lines of religious faith” and made no mention of Sikhs. It seems to me that these frequent non-responses have only led to more and more extreme petitions (e.g. “begin a Justice Department investigation of Congressman Boehner for illegal activities under the Logan Act”), creating a growing divide between the requests in the petitions and what the government is willing to do.

One other criticism of the system is that since Web literacy is not yet universal in America, the petitions reflect the needs and interests of a subset of the American people. In particular, niche issues that are relevant to those in the tech community are far more likely to clear the signature threshold than, say, issues relevant to poor youth. For example, one petition asked the government to fire Carmen Ortiz, the U.S. Attorney who prosecuted hacktivist Aaron Swartz for his downloading of large volumes of copyrighted content on MIT’s network and who some say caused his subsequent suicide. It received over 60,000 signatures and got a response from the government, even though Aaron Swartz was known primarily among those in the tech community. The only major success story stemming from a We the People petition – the Unlocking Consumer Choice and Wireless Competition Act, which legalized cell phone unlocking – is also fittingly in the realm of technology.

On the whole, I have mixed opinions about We the People but think it is at least a step in the right direction in terms of greater government engagement with the American people through the Web.

Hacker News

I consider Hacker News (news.ycombinator.com) to be a high-quality digital public sphere. Run by the prestigious seed-stage incubator Y Combinator, Hacker News surfaces user-submitted links and posts using an upvote and downvote system and allows users to comment on submitted content. Here is a screenshot of part of the front page:

[Screenshot: Hacker News front page]

In general, the content focuses on matters relating to the software engineering community, including news of startup funding and acquisitions, hot technical topics such as machine learning, and occasionally relevant political issues like net neutrality.

In terms of impact, I think Hacker News has done incredibly well. In general, with at least 200,000 unique visitors each day, including successful and well-connected individuals from across the tech community, it provides high visibility to important causes in the community. Many web startups report that they need to beef up their server infrastructure to prepare for launches on Hacker News, which can drive tens of thousands of people to their websites if they reach the Hacker News front page. A front-page appearance can put a startup on the radar of important venture capitalists and set the stage for a round of funding. Hacker News’ impact isn’t limited to putting startups in the limelight. It has also been used to publicize campaigns to raise money for prominent software developers struggling with illnesses and raise awareness of (supposed) injustices like the prosecution of hacktivist Aaron Swartz. A search for “Thank you HN” on the site yields posts by individuals thanking the community for helping them find jobs, and even one post from a father who said that Hacker News’ popularization of an article about his son’s undiagnosed medical issues led him to parents of other children with similar problems and eventually to a diagnosis.

In terms of productive discussion, Hacker News has been successful as well. The comment sections of front-page links are always filled with lively and informed discussion. When users launch their side projects or startups with “Show HN” posts, other users provide useful feedback in the comments. If a project has been done before, you can be sure that at least a few people who see the post will know and point that out. In general, people with a wide variety of experience across the technical stack frequent Hacker News, so new products will get critiqued from a number of angles, including front-end design, technical sophistication, product-market fit, and more. On expository articles about technical topics, users from industry, academia, and other communities will post their own experiences with those topics, adding further color to the articles and providing additional data to either support or refute their claims and generalizations.

This is not to say that Hacker News cannot be improved or has gone without criticism. Some have claimed that community members are too harsh in criticizing posted projects that are not very polished or clearly lacking in some areas. The authors of such projects, who are likely new to software development, may be so discouraged by the negative feedback that they become reluctant to work on software again. The maintainers of Hacker News have acknowledged this issue and implemented changes to the algorithm that surfaces comments to mitigate the problem. Another issue is that community members are generally of a similar political and ideological bent, so posts that espouse conservative values are unlikely to reach the front page. So, while the upvote and downvote system brings many benefits – such as the filtering of disrespectful or technically inaccurate content – it can also shelter users from opposing views.

In all, despite some minor shortcomings, Hacker News is a strong community and certainly the premier forum for topics related to software engineering and technology.

Facebook and Voter Turnout

One of the most important civic crises in the U.S. is the declining interest in voting. In the November 2014 midterm elections, only 36.4% of eligible voters participated, the lowest rate since World War II. Even participation rates for presidential elections are about 5-10% lower today than they were 50 years ago. The reasons for this decline are hotly debated. Many researchers claim that the rise of personal technology like televisions and computers has isolated individuals from their communities and instilled a belief that political processes have little effect on them.

In the November 2012 and 2014 elections, Facebook attempted to reverse this trend by allowing users to post stock “I voted” statuses celebrating their participation in the elections. The following user interface was used in 2014:

[Screenshot: Facebook’s 2014 “I voted” interface]

With a user base of over 150 million Americans by November 2014, Facebook was in the perfect position to influence voter turnout. Facebook’s hope was that if people saw many of their friends post these statuses, they would be compelled to vote as well. To make voting as easy as possible, Facebook also displayed a map showing the polling locations closest to users.

Facebook evaluated the effectiveness of similar efforts in 2010, when it estimated it had about 61 million users in the U.S. That year, the “I’m a Voter” button that Facebook displayed to its users did not actually post a status; it simply logged the information to Facebook’s servers. For some users, the button was presented along with a list of the users’ friends who had also clicked it. In other cases, this list was not presented. Facebook reported that users clicked on the “I’m a Voter” button in about 20% of the cases where the friends list was presented; the figure was only 18% in the case without the list. Facebook claimed that seeing that their friends voted inspired additional people to vote in the case with the friends list, and it estimated that in total its efforts increased voter turnout by around 300,000 voters. These results seem credible, but it is unclear whether Facebook properly controlled for the possibility that users were simply more likely to report whether they voted if their friends did as well; their actual rates of voting may have been unaffected.

In 2012, besides showing the aforementioned button, Facebook also placed news stories higher on certain users’ news feeds in the run-up to the election. Since these news stories were predominantly about the election, the affected users saw much more election-related content on Facebook prior to the election. When election day rolled around and users began reporting whether they voted by their “I’m a Voter” button clicks, Facebook found that 67% of affected users decided to vote, while only 64% of unaffected users voted. This experiment has fewer apparent confounding factors than the 2010 button experiment. However, it strays into grayer ethical territory. Facebook was heavily criticized recently for manipulating the news feeds of various groups of users to trigger emotions like happiness, sadness, and anger. While news articles are unlikely to cause negative emotions, some users may nevertheless be opposed on principle to such active distortion of their news feeds.

Overall, I found Facebook’s efforts commendable and am impressed by their results. Their work tackled an important civic crisis at a very large scale and provides great data for future civic renewal efforts.

Most data for this post was obtained from the following article: http://www.vox.com/2014/11/4/7154641/midterm-elections-2014-voted-facebook-friends-vote-polls

Citizen Journalism on Quora

I define citizen journalism to be any sort of writing about matters of significance beyond the self performed by individuals not paid for their work. With this definition in mind, I find Quora to be a compelling example of citizen journalism. Quora is a question and answer website that allows users to submit questions of interest and categorize them, request that certain individuals answer specific questions, and, of course, answer questions themselves. This model is not completely novel – it was tried for the first time about a decade ago by Yahoo Answers, but Quora has managed to attain an unprecedented level of quality, presumably by employing more sophisticated spam detection algorithms and by targeting educated social networks.

In my mind, one of the great merits of Quora is that it provides individual narratives about topics that are most often spoken about in generalizations. One of my favorite questions involved the intelligence of former U.S. president George W. Bush. A former White House staffer wrote an answer describing Bush’s careful analysis of a report about the progress made in space exploration during his tenure and his identification of errors made by experts on the topic. Most popular articles about Bush’s supposed stupidity focus on a few public gaffes, such as his mispronunciation of the word “nuclear”; rarely do we hear such intimate personal stories about public figures. A similarly intimate story was written by a man whose business was based in the World Trade Center at the time of the 9/11 attacks. He describes the horror of hearing of the deaths of some of his colleagues, trying to account for all of his employees, and talking on the phone to the families of some of the deceased.

Another example of quality content on Quora is that which is targeted to small interest groups. For example, many users of Quora are software engineers. One of the most popular types of specialized content is therefore descriptions of the interview processes at top software companies. Although these descriptions are written by employees and thus likely paint a rosier than deserved picture of the interviews, they often provide important details that help candidates better prepare for them. Writers generally elaborate upon question types, number of interviews, and emphasis on cultural versus technical fit. Their answers provide a personal touch to a process that can otherwise seem impersonal and intimidating.

In sum, Quora connects people with niche knowledge and an interest in sharing it with those who are looking for it. Writers on Quora generally do not have an incentive to start blogs because it is unlikely that those seeking their expertise will be able to locate their blogs from among the millions of others on the Web, even with the help of Google. By keeping content within Quora, authors can rely on Quora’s algorithms for understanding users to connect them with the right audiences. In addition, by submitting questions, users can elicit knowledge from others who did not even realize they had it or did not previously have a conscious interest in sharing it.