Admittedly, I was a little late to jump on the podcast bandwagon. It took NPR’s Serial to get me on board, but now I can’t get enough. As a visual learner, I also didn’t think listening to people talk about data and data science would be that interesting…man, was I wrong.

Today I tuned into what is quickly becoming a favorite: The Digital Analytics Power Hour (highly recommend). The topic was “open data,” something we pay a lot of lip service to in international development programs but that never seems to be executed well. “Open data” simply means publicly sharing the information you collect and store, and there’s a huge push now to get governments to open up information like never before (partially thanks to our friends over at Development Gateway). That push is important, but the power of open data can only really be felt when everyone, from a PhD student to a frontline global NGO worker, knows how to use it.

The guests of the podcast, Jon Loyens and Brett Hurt (veterans of several data-driven, private sector companies), flagged some concepts that resonated and got me thinking about some of the major issues with fully unlocking the potential of data for development.

Issue #1: “Data is tribal”

Jon Loyens’ use of this phrase struck me because it accurately describes how often information about development data stays within small groups. Even if some data see the light of day, few people know what to do with it because they don’t understand what it is.

For example, in Malawi, health facility data collection happens all the time. Health workers are surveyed, facilities are assessed, staff are monitored, and health outputs are tracked and uploaded to repositories, but each of these activities is administered by a different group with different goals. If demographic information is collected (e.g. “education”), the responses will vary, and standards for how these data are captured don’t exist. Some responses might be “secondary” while others say “MSCE.” Are these the same or mutually exclusive? Ultimately, when looking through results reports or data files (if available), there is little to no documentation of the criteria used to define these categories, so comparability is limited.
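To make the “education” problem concrete, here is a minimal Python sketch of what a shared, documented codebook could look like. Everything in it is an assumption for illustration: the `EDUCATION_CODEBOOK` categories and the `standardize` helper are hypothetical, not an existing standard. The point is that a response with no documented definition gets flagged for a human to resolve rather than guessed at.

```python
# Hypothetical codebook: the categories below are illustrative only,
# not an official standard from any survey or ministry.
EDUCATION_CODEBOOK = {
    "none": "no_formal_education",
    "primary": "primary",
    "secondary": "secondary",
}


def standardize(responses, codebook):
    """Split raw responses into standardized codes and undocumented leftovers."""
    mapped, unmapped = [], []
    for raw in responses:
        key = raw.strip().lower()  # ignore case and stray whitespace
        if key in codebook:
            mapped.append(codebook[key])
        else:
            # No documented definition, so flag it instead of guessing.
            unmapped.append(raw)
    return mapped, unmapped


mapped, unmapped = standardize(["Secondary", "MSCE", "primary "], EDUCATION_CODEBOOK)
```

Here “MSCE” lands in `unmapped`, because nothing in this made-up codebook says how it relates to “secondary.” That is exactly the question only the collecting group can answer today.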

The same is true for health facility names. In one dataset a site might be named “Monkey Bay District Hospital” and in another, “Monkey Bay Hospital.” Are these the same site? The only way you could know for sure is to ask the group who collected it.
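A rough sketch of why name matching alone can’t settle this, using `difflib` from Python’s standard library. The `name_similarity` helper and the 0.75 threshold are my own assumptions, chosen only for illustration. A high score tells you two names look alike; it does not tell you they are the same site.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio, ignoring case and extra whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


score = name_similarity("Monkey Bay District Hospital", "Monkey Bay Hospital")

# Hypothetical threshold: above it, the pair is only a *candidate* match
# that still needs confirmation from whoever collected the data.
needs_human_review = score >= 0.75
```

Fuzzy matching like this can shrink the list of names someone has to check by hand, but it can’t replace documentation or a shared identifier.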

Knowledge about the data—the context, definition, how to interpret—stays within the “tribe” that collected it. This phenomenon doesn’t happen because people don’t want to be collaborative. I think it happens because developing adequate documentation is super time consuming and no one forces organizations to do it. According to the podcast guests (and I agree), 80% of analytics is janitorial. Not fun, but necessary.

Once the study is over, the report is written, and the funding has dried up, the data are filed away to collect dust in the silo. Meanwhile, in the silo next door, another organization is creating their own data collection tool from scratch with similar, but slightly different, categories.

Imagine a world where all the information collected at the site level in Malawi over the past decade suddenly shared the same linked “key,” like an official site ID, and you could compare vast amounts of data from multiple sectors over time for a single site. <sheds tear>
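Here is a small sketch of what that shared key buys you. The site ID format (`MW-0001`), the field names, and every value below are made up for illustration; `link_by_site` is a hypothetical helper, not part of any real registry. Once each record carries the same ID, linking datasets from different groups takes a few lines, even when the site names disagree.

```python
from collections import defaultdict

# Made-up records from two unrelated data collection efforts.
staffing_survey = [
    {"site_id": "MW-0001", "site_name": "Monkey Bay District Hospital",
     "year": 2014, "nurses": 12},
]
hiv_testing = [
    {"site_id": "MW-0001", "site_name": "Monkey Bay Hospital",
     "year": 2014, "people_tested": 950},
]


def link_by_site(*datasets):
    """Group rows from any number of datasets under their shared site ID."""
    linked = defaultdict(list)
    for dataset in datasets:
        for row in dataset:
            linked[row["site_id"]].append(row)
    return dict(linked)


linked = link_by_site(staffing_survey, hiv_testing)
```

Both rows end up under `linked["MW-0001"]` despite the different names, and nobody has to track down the original data collectors to find out whether they describe the same place.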

Note: I’m certainly not picking on Malawi. This is a massive problem in every country and sector. It just happens to be what’s fresh on my mind. Malawi is currently developing a nationwide site registry that will link all sites with a common ID, so kudos to them on that front.

Issue #2: “People won’t understand the nuance”

Exceptionalism. <big sigh>  

I put issue #2 in quotes because, unfortunately, I have said it myself in the past. When I worked for PEPFAR, we created a data stream that generates massive amounts of information on US government expenditures linked to program outputs (e.g. expenditure per person tested for HIV). It covers 58 countries, geographic regions within countries, thousands of implementing partners, and tons of indicators. The short story is, it’s a big and detailed dataset with many dimensions and provocative information. Once we had the data we used it extensively within PEPFAR, but refused to release the contents publicly. Our justification? “It’s incredibly nuanced and there is potential for people to misuse it.”

I certainly get why this is a problem (I tend to be more of a sharer than a keeper), but there was a real fear at the time that these data could damage perceptions of the program or reputations of our partners. Not because there was evidence of any glaring malfeasance, but because it shed light on areas where PEPFAR really needed to do better with the money available.

Our mistake in the above example came from a generally good, if misguided, motivation: a fear that can be ameliorated with better documentation and tools. There are, however, shadier situations where the failure to share information is due to a fear that people will find evidence of fraud or information mishandling.

In either case, “nuance” is not a valid excuse to keep data locked up tight. On the contrary, choosing not to share information that could be better used by someone else to improve development programs should be viewed as potentially damaging and a challenge to progress.

Issue #3: Data Quality

Though “misuse” can be invoked as a justification for not sharing data, “data quality” is by far the most common one. This warrants a whole discussion on its own, which I won’t get into here. I will just say 3 things:

  1. People are generally nervous about how more information about their activities will impact them, which causes them to recoil at the notion of open data
  2. This fear is compounded when financial information is involved
  3. “All decisions are made on the basis of incomplete data, so either learn to live with this fact or get out of the game.” – Robert Townsend

Issue #4: We don’t understand the power of the semantic web

The semantic web is basically a concept and set of tools for better data documentation and standards that enhance congruency of data sources across the web. Seems basic, but we can’t even fathom the power this unsexy and unglamorous work unlocks.

Data janitorial work—data hygiene as I call it—has the power to democratize big data. We don’t all need to be data scientists to enjoy the insights better linked data will produce. We just need to line things up more effectively and let some amazing learning tools and bright minds step in.

As Brett Hurt states in the podcast, “The NSA gets it. Palantir gets it.  Facebook, Google, they get it.” But the rest of us aren’t there yet. Further, we can’t expect machines to really make our lives easier until we make it easier for machines to understand our data.

For this to work, the development community has to rally around data hygiene as a first principle and actually mean it. Then we have to get over our hesitance to share our work, warts and all.

I look forward to the day when data is communal, instead of tribal. 

- Tyler Smith