Some of the inherent biases in conversations about data

This is something I think about a lot. I’ve been working with student-level data for 23 years now. And talking about stuff for longer. There is something uniquely frustrating about conversations in which the person I am talking with does two things: uses terms casually and presumes they convey specific meanings.

It drives me crazy because I tend towards the specific when I talk data. Or most other things. Especially if I am talking to someone who I think is supposed to be equally knowledgeable.

Frustrating, and representative of a larger problem. Meaning and bias.

This is kind of difficult to explain well without going into too much detail, so bear with me.

When we collect data at the state level, the data are much removed from reality. Institutions collect data for the purpose of running the enterprise. Some pieces they collect because we require it, but there are really very few examples of that. Since most data are collected for the institutions' own purposes, they define the manner of collection that fits their needs best – or as best fits the design of their administrative system.

At this level, we are talking already of two or three levels of abstraction from the original students the data represent. Further, the data are biased according to the definitions used by the institution. Some of these are based on local standards, some on national standards, some simply represent standards implemented by the software vendor.

Since we collect from public and private institutions, two-year and four-year, we impose additional standards and abstractions on the data. This is necessary in order for us to use the data and assure some cleanliness in them (we have thousands of edits in the collection system). However, this adds yet another layer of abstraction to the data.

Don’t mistake me, the data are still quite usable. They are simply usable only within realistic limits and are best used for descriptive and trend analyses.

Additional abstraction comes into play in other ways when external taxonomies are applied, such as the Classification of Instructional Programs (CIP). Just because something has the same name, that does not mean it is the same thing. Sometimes differences are minor, but not always. And of course, sometimes it is just a dead horse being beaten.

Strangeness occurs when taxonomies are linked to taxonomies to taxonomies.

In a completely different dataset I have been working with very recently, I find that a column represents both a desired outcome and a qualifying characteristic. Neat trick, huh? The values for that column are based on a crosswalk between two taxonomies (one of which is based on a third). Simply speaking, if there is a one-to-one correspondence, this should not be a problem. However, what if the relationship is one-to-many or, worse yet, many-to-many? Or, even worser, what if the original value crosswalked for the individual in question is one of multiple options? This provides multiple many-to-many outcomes, yet somehow the file is produced with only one outcome.

This may be fine and perfectly correct ... but how do I know? At this point I can’t even guess how much of this result is based solely on a computing algorithm and how much on individual choice in the selection.
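One way to at least see where a crosswalk forces a choice is to check its cardinality before trusting a single-valued column built from it. The following is only a rough sketch in Python, not how the file I mentioned was actually produced. The function names and the codes ("A1", "X", and so on) are made up for illustration, and the sketch assumes the crosswalk is available as a plain list of (source code, target code) pairs.

```python
from collections import defaultdict


def crosswalk_cardinality(pairs):
    """Classify a crosswalk given as (source_code, target_code) pairs.

    Hypothetical helper: assumes the crosswalk is a flat list of pairs.
    """
    targets_by_source = defaultdict(set)
    sources_by_target = defaultdict(set)
    for source, target in pairs:
        targets_by_source[source].add(target)
        sources_by_target[target].add(source)

    # A source code that maps to several targets "fans out";
    # a target code reached from several sources "fans in".
    fans_out = any(len(t) > 1 for t in targets_by_source.values())
    fans_in = any(len(s) > 1 for s in sources_by_target.values())

    if fans_out and fans_in:
        return "many-to-many"
    if fans_out:
        return "one-to-many"
    if fans_in:
        return "many-to-one"
    return "one-to-one"


def ambiguous_sources(pairs):
    """Return the source codes that map to more than one target code."""
    targets_by_source = defaultdict(set)
    for source, target in pairs:
        targets_by_source[source].add(target)
    return {
        source: sorted(targets)
        for source, targets in targets_by_source.items()
        if len(targets) > 1
    }


# Made-up codes, not real CIP or occupation codes.
example_pairs = [("A1", "X"), ("A1", "Y"), ("B2", "Y"), ("C3", "Z")]
kind = crosswalk_cardinality(example_pairs)
needs_a_choice = ambiguous_sources(example_pairs)
```

Any source code that shows up in `needs_a_choice` is a place where somebody, or some rule, had to pick one target out of several. Those are the rows where I would want to know how the pick was made.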

It’s kind of crazy.

A week or so ago, someone tweeted the point that algorithms are not unbiased. I’m sorry I don’t remember who it was, or the exact words. The fact is, though, algorithms are not unbiased. Some human or three wrote the algorithm and there is every possibility that their biases are reflected in the algorithm – or in how it is used.

So, I guess what I am trying to say is this. The data I work with are as limited as they are powerful. The most important thing they tell us is where to start investigating with less abstract data.





