yes, data can lie


My office doorway, leading out into the policy analytics lab environment. This was a gift from one of my former vice presidents when he left to become a college president.

A couple of hours after I published “i have questions,” my brilliant friend Tressie tweeted:

The Digital Goofball, I mean, Guru, above is wrong. Data do lie on occasion. They can lie for a whole bunch of reasons, from the simple to the complex. The lies can begin at point of collection and continue on through aggregation and analysis.

Another brilliant friend, Laura, published a new blog post on the same day that begins with a story of a failure in medical screening. The nurse’s failure in this account suggests that her lack of questioning is her normal behavior, or hints at assumptions she makes about patients. Whatever the case, the collected data are suspect.

Data collection is an expensive process to do well. Putting aside Big Data, which generally captures data that are a byproduct of transactions, good data collection requires careful thought and planning. I wrote about “counting to one” almost two years ago and point to it again because it is the basis of what I do and think about: understanding what you are counting, and why. What I didn’t discuss, though it is implied in friend Jeff’s (also brilliant) essay that is linked within that post, is choosing what not to count. Or who not to count. Every choice in collection defines the truth and reality of what the data can represent as information.

Once we move beyond collection into shaping, we encounter the same choices. We shape the data (most people call it “transform,” but I like to shape the data) into forms that fit our understanding and the understanding we wish to share with others. Data are like Play-Doh and can take all sorts of shapes and dimensions. They can be worked and reworked for endless variety. But they can only stretch so far before they break and become separate pieces. This is what happens with data when you stretch the definition and structure too far: the original meaning is lost and the provenance is broken. Small pieces can be lost during this shaping, or blended with other “colors,” creating something new but increasingly more abstract than the original data.

There’s a key word: “provenance.” Familiar from my art history/museum studies days. Relevant today as “data provenance” for both the ownership and meaning of the data all along the way. While one may be able to demonstrate a chain of ownership and handling of data, at some point it is possible to have shaped the data into something that violates the provenance, intentionally or not. Checkpoints along the way of shaping and transformation are needed to reliably maintain the original meaning, or at least the path back to it.
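For those who think better in code, here is one way such checkpoints might look. This is only a minimal sketch under my own assumptions, not a description of any particular tool: the `ProvenanceLog` class, the `fingerprint` helper, and the sample records are all invented for illustration. The idea is simply to note, after every shaping step, how many records remain, which fields survive, and a hash of the content, so the path back to the original stays visible.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(rows):
    """Return a stable SHA-256 digest for a list of dict records."""
    payload = json.dumps(rows, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class ProvenanceLog:
    """Hypothetical log that keeps one checkpoint per shaping step."""
    checkpoints: list = field(default_factory=list)

    def record(self, step, rows, note=""):
        # Capture what the data look like at this point in the shaping.
        self.checkpoints.append({
            "step": step,
            "note": note,
            "row_count": len(rows),
            "fields": sorted({key for row in rows for key in row}),
            "sha256": fingerprint(rows),
        })
        return rows

    def dropped_fields(self):
        """Fields present at the first checkpoint but gone by the last."""
        if not self.checkpoints:
            return []
        first = set(self.checkpoints[0]["fields"])
        last = set(self.checkpoints[-1]["fields"])
        return sorted(first - last)


# Invented example records, not real data.
collected = [
    {"id": 1, "enrolled": True, "source": "form A"},
    {"id": 2, "enrolled": False, "source": "form B"},
    {"id": 3, "enrolled": True, "source": "form A"},
]

log = ProvenanceLog()
log.record("collected", collected, note="as received")

# A shaping step: keep only enrolled records and drop the 'source' field.
enrolled_only = [
    {"id": r["id"], "enrolled": r["enrolled"]}
    for r in collected
    if r["enrolled"]
]
log.record("filtered", enrolled_only, note="kept enrolled rows; dropped 'source'")

for cp in log.checkpoints:
    print(cp["step"], cp["row_count"], cp["fields"], cp["sha256"][:12])
print("fields lost along the way:", log.dropped_fields())
```

A log like this does not stop anyone from over-stretching the Play-Doh, but it does make each choice (what was dropped, what was kept) something you can point to later.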

As data are aggregated, as the counting begins, the lies can take on new dimensions of possibility. This wonderful essay by Dana Boyd covers a number of examples. At the close of our first panel presentation at the Governor’s Data Analytic Summit that I mentioned here, each of us was asked who the ultimate beneficiary of our work would be. We all gave the same answer, but I was naturally a bit blunt about it: “Clearly the citizens of Virginia. If not, we are doing it wrong, and that’s all that really matters.” Later in the day I raised a need I felt: that every session should have some discussion about the ethics of data use and analytics. There was much less response to that than I had hoped for.

Data can be made to lie. They can also just be wrong. Errors do occur, and hopefully they can be corrected before damage is done. The data can also be right, but misused, misunderstood, and misinterpreted. For example, attributing an outcome to, say, skin color, instead of attributing it to differences in treatment because of skin color, is in most cases a complete misunderstanding of the data. Or it is a determined willingness to see an interpretation that fits your desired model, something along the lines of confirmation bias but with clear intentionality.

One has to know the distance between zero and one, what that distance measures, and accept that distance in order to have an honest conversation about data. The fact that I am inspired/driven to write this post should be an indicator that I feel strongly about this topic. “Data don’t lie” is right up there with “the check is in the mail” and “don’t you trust me baby? of course I love you.” It is a crock of shit to say in an all-encompassing way “data don’t lie.” Some of us work damned hard trying to ensure our data don’t unintentionally mislead. We spend hours wrangling with people about nuance, not just nuance of definition, but the nuances of calculation, why one way is more accurate than another. And why it takes longer to do it right.

Even when the data are pure as the driven snow, the provenance is impeccable, and the interpretation admirably circumspect, there is still room for doubt. What was left out? What assumptions were not valid? What don’t we know about what we don’t know?

In other words, we should be cautious with even truthful data. It is never, in my experience, “the whole truth and nothing but the truth” as there are always unknowns.