Sunday, October 10, 2021

the bias for data

I just finished taking notes on Ruha Benjamin's Race After Technology, and the process confirmed that this will be on my list of best books from this year's reading. Benjamin's book focuses on the ways algorithms, applications, and programs, which are often presented to the public as neutral, can reinforce or deepen inequality, bias, and structural racism. I'm looking forward to collecting a few of my thoughts for a reading review later this year.

There is one idea I have been thinking about this week, related to Benjamin's comment about the way some will demand more data before committing to action. From her perspective, this comes up even when many experts already agree on the necessary next step toward solving a problem. Benjamin uses the task of improving childhood education as an example of this phenomenon, noting how despite expert agreement that reducing poverty is the most important intervention toward achieving this goal, some demonstrate a certain perversion of knowledge by demanding to collect more data before committing to the intervention.

There are many reasons why some might demand more data in this type of situation, with cynical motivations certainly among the possibilities. However, what I've been thinking about this week concerns the type of mentality that could lead to an innocent mistake - the tendency among data-driven thinkers to view the collection of data, whether in volume or quality terms, as an unassailable strategy. There is, in other words, a bias for data, and this bias manifests in situations where data collection itself becomes both process and outcome, with no consideration given to whether additional information can improve the quality of upcoming decisions. The trend over the past few years has elevated the importance of being "data-driven" to the point where it would be unfathomable for a person, team, or organization to describe itself otherwise, but as with many empty buzzwords its strictest adherents would struggle to explain the drawbacks. The key distinction to me is that although having more data generally increases the odds of making the right decision, there is no guarantee that collecting additional data in a given moment will do so.

The question in my mind as it relates to Benjamin's example is how to separate the cynical intent from the innocent errors made by those who have stumbled blindly into the cult of the data-driven approach. But in a general sense, the distinction may be a trivial one, for those who believe more is always better will never sate their appetite for additional data, which means their behavior will always be indistinguishable from that of those in outright opposition. The good intent behind data collection simply hides the fact that this mentality has more in common with certain sins like gluttony or greed, which are likewise defined by the inability to know when enough is enough. I would prefer that we generally adopt a more careful approach to data collection. I think it's impossible to effectively adopt a data-driven mindset if those in charge of a project cannot identify how much data is necessary to reach a decision; those collecting additional data for its own sake likely do not understand the issue at the core of their specific problem.

I've had this on my mind over the past week because I recently realized that I will soon encounter this type of situation in my work. Over the next few months, I'll join a workgroup seeking to drive progress against a set of inclusion, diversity, and equity goals within the organization. I'm unsure about how to implement or even introduce my approach because I fear it will challenge and possibly threaten some members, particularly those who perceive themselves as data-driven without having given the label a great deal of thought. Should I simply demand to see the plan that would be set in motion given the accumulation of more data? Or is it best to start with the idea, then work out a way to apply the philosophy to the situation? It may be wise to simply point out what I think is plainly evident - if the goal of improving decisions necessitates collecting better data, then the obvious next step is to reach a consensus on the point where we will have enough higher quality data. In healthcare, there is a concept defined by HIPAA known as "minimum necessary", and perhaps invoking this principle could help us find the right starting point.

My fear is that we will make the kinds of mistakes that result from good intentions. For example, we may decide that certain decisions will be driven by the kind of information we can extract only from the highest quality data. This sounds good in the planning stage, but it's easy to imagine that we'll have more success collecting such data from those who are the easiest to collect from. And who would be the easiest to collect from? My strong hunch is that we'll collect from those already affiliated with our organization, excluding the potential supporter groups that we are trying to include in our work, which would only reinforce the existing structural challenges that prompted the establishment of the workgroup in the first place. This concern will remain right at the front of my mind, placed there perhaps by what I've recently read and noted from Race After Technology, which maintains this kind of thinking as its unofficial theme - we must think seriously about how we use so-called neutral tools, and remember that their supposed neutrality is no guarantee we won't use them to reinforce existing biases.