Sunday, May 24, 2020

leftovers - proper corona admin, vol 35.0 - data science has a null hypothesis

I posted Volume 35.0 about two weeks ago but I must admit that I was rudely interrupted - I was just about to go off on a data science rant, but unfortunately my half-finished post went up before I could shake the remaining ink out of my pen. Let's blame that damn editor, always getting in the way (editor's note - not true, on average or otherwise, editing is an essential business).

Anyway, today I'm back to finish off my interrupted rant about data science.

TOA finishes off the rant about data science

We all know I'm prone to go on a data science rant from time to time. I think that's fine - based on my experiences, I've earned it - and of course I think I'm right about the false promises made by the field. I'm not going to rehash my tired arguments today.

Of course, if I believe in one thing, it's trying to prove myself wrong, so I spend some of my leisure time seeking out contrary information. It's a good way to test my own hypotheses. In my endless quest to maintain a correct perception about data science - technology's future since 2001! - I occasionally check in on articles or newsletters to see what's new and exciting in the field. If things change, I'll change my opinion, but amazingly nothing seems to have changed in the past few years. It's still prohibitively expensive, still leveraging semantics to deflect criticism, and still measuring success by intricacy of method rather than impact of result. In my most recent news roundup, I read for the 10,000th time about a data scientist out there moaning - all I need is clean data! I double-checked - it wasn't from The Onion.

The clean data concept always fascinates me. I often wonder - OK, but if you had clean data, the problem would already be solved, right, by someone less skilled... like me? It reminds me of an old joke about economists - an economist sees a $20 bill on the ground, but walks past it, thinking - if it was really a twenty, someone would have already picked it up. They might be rich, but I'm $20 richer. I guess the key question is, at what point will I become too good to pick up a twenty? It's probably right around when I become too good to clean up the data. If you can't do your job without perfect conditions, how good are you at your job? I don't go to Dr. House when I cut my arm, I go when no one has a clue why my tongue is swelling up. Further, it would be odd if he (or any doctor) complained that the patients were always sick.

However, the more I've thought about data cleanliness, the more I've noticed my position on data science shifting. In my most recent news roundup, I realized that the fundamental problem in data science isn't very different from the fundamental problem in most fields - a lack of clear, highly defined questions. (In this sense, data science is a lot like TOA.) Unclean data is an issue, sure, but no data is clean until a question comes along and sets a standard of cleanliness. It's like how my doorknobs used to be clean, but then this pandemic came along and now they're disease vectors.

One of the data scientists I work with sometimes refers to himself as a 'data janitor'; past colleagues have invoked the 'data plumber' comparison. These references to the cleanup aspect of the work are rare instances of self-deprecation missing the point - cleanup is the work, as it is whenever science requires the assistance of data. Most of science doesn't work without clean data and most scientists seem to have accepted that it's their job to ensure the data is clean enough for best practices and techniques. In this current pandemic, I can't help but think of John Snow's famous work during the London cholera outbreak - did he complain to The Lancet that his entire job consisted of cleaning up messy data? Maybe he did, but he still managed to find the time to reach a meaningful conclusion.

One advantage Snow had was a highly defined question - why are people all over London getting sick? It gave him a reference point for a hypothesis, set the definition for clean data, and helped him focus his efforts. When I read about data science or talk to data scientists, it seems like the lack of clear problems is always at the root of every problem. The end result is predictable - a lot of endlessly cool accomplishments, often incorporating complex, intricate, or advanced techniques, yet no obvious progress toward solving our most pressing issues. This article has been hard to get out of my head for a long time - the concept of using 6,000 data variables per map is cool, but what can any data map tell us that isn't crystal-clear in the two photographs included within the piece? And what to make of the mapmaker's comments, which imply confusion about how hypothesis testing works, suggesting 'use data to support the hypothesis'? I don't recall the point of science being to prove yourself right.

It might be a little easier if Atlanta treated poverty with the same urgency that London once demonstrated for cholera, and framed its issues in terms of problems that helped the field direct its efforts. I'm increasingly certain this is the next critical step. You could argue Snow had just enough data to solve his problem, but I think Snow had just enough problems to use his data. Without the focus and direction of a clear problem, data science is going to continue making the worst of its strength - finding patterns - just as individuals have invented correlations since the dawn of time.

But like a promising teenager, data science is on the cusp, and with a little focus might soon convert its potential to results. It's been said that when it comes to technological progress, we overestimate the next year but underestimate the next ten. My glee in reminding people that the autonomous car is just around the corner - but stuck in its own traffic - often closes the door on bringing up the latter half of the aphorism. What comes around that corner a decade from now is going to be interesting for sure, and impossible to predict, because it will be a promising field's answer to a very difficult question - what do you want to be when you grow up? As has been the case with just about everyone I knew in their teen years, the answer will be worth the wait.