Chun: Bias is Mathematically Certain

Investigations into algorithmic bias reveal humanity’s own flaws.

by Steven Chun | 5/16/17 12:45am

When you hear about algorithms — like the one Facebook uses to construct your personal newsfeed or the one Google is fine-tuning to fight the spread of fake news — it’s likely that you’re hearing about predictive analytics. An algorithm is just a series of instructions: Multiply the two, carry the three or go to class at these three times every Monday, Wednesday and Friday. Predictive analytics, as the name might imply, are algorithms meant to collect and use data to generate a prediction about an outcome that cannot be definitively known until the event occurs. Based on what you know, is a given person going to make it to class? If it’s Green Key Friday, then that probability might go down.

By human standards, algorithms are incredible at what they do. Researchers at Stanford University trained a deep convolutional neural network to identify the most common and the most lethal skin cancers as accurately as 21 board-certified dermatologists, and in some cases more accurately. Algorithms have proven adept at predicting everything from Supreme Court outcomes to car collisions. We now know more about the future than at any point in history.

But recently, algorithms have been accused of bias. That’s a complicated accusation to make. One contention is that their authors are predominantly male and that more generally human programmers beget human bias. This is a valid concern, but one that is minimized because programmers rarely ever make value judgments. Rather, the decisions that would usually present opportunities for bias to creep in are instead informed by massive datasets of historical, real-world data. But this contention misses a far more worrisome discovery: When it comes to predicting the future, it is mathematically impossible to avoid bias — that goes for humans and computers. How three Cornell University and Harvard University researchers came to that conclusion is the story of America’s prison system, three different definitions of fairness and a lot of math.

Let’s start from the beginning. Courts and parole boards use an algorithm called Correctional Offender Management Profiling for Alternative Sanctions, or COMPAS, to estimate recidivism risk. The algorithm uses 137 questions, ranging from criminal and family history to drug use. It doesn’t use race or geography as a consideration. Yet a report and accompanying statistical analysis from the public interest journalism site ProPublica found evidence of racial bias in COMPAS. The rate of false positives (individuals the system rated as high risk who did not reoffend over a two-year period) was nearly twice as high for black defendants as for white defendants. Furthermore, the rate of false negatives (individuals rated as low risk who went on to reoffend within two years) was almost twice as high for white defendants as for black defendants. Northpointe, the company that owns and sells COMPAS, disputed the claim. It demonstrated that COMPAS risk ratings were equally accurate for black and white defendants. For example, roughly 60 percent of white defendants with a score of seven and 60 percent of black defendants with a score of seven reoffended. So what gives? How can this algorithm be biased and unbiased at the same time?

The question intrigued Cornell computer scientists Jon Kleinberg and Manish Raghavan and Harvard economics professor Sendhil Mullainathan. They realized that ProPublica and Northpointe were defining “fair” differently. In the resulting paper, they outlined three general conditions of fairness for risk assessment. First, the assessment should be well-calibrated; in short, if it says there’s a 40 percent chance of an event happening, then it should happen roughly 40 percent of the time, and that should hold within each group. Second, false positives (in the case of COMPAS, high scorers who didn’t reoffend) should occur at the same rate across groups. Third, false negatives (low scorers who would go on to reoffend) should occur at the same rate across groups.
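To make the three conditions concrete, here is a minimal Python sketch of how each could be measured for a single group. Everything in it is an assumption made for illustration: the function name `fairness_metrics`, the cutoff `threshold` that separates "high" from "low" scorers, and the toy records. It is not COMPAS's method, and it is a simplification of how the researchers state their conditions. A fair assessment in all three senses would produce matching numbers when the same function is run on every group.

```python
from collections import defaultdict


def fairness_metrics(records, threshold):
    """Summarize one group's risk scores.

    records: iterable of (score, reoffended) pairs (hypothetical data).
    threshold: scores at or above it count as a "high risk" rating
               (an assumed cutoff, chosen for illustration).
    Assumes the group contains both reoffenders and non-reoffenders.
    """
    false_pos = true_neg = false_neg = true_pos = 0
    by_score = defaultdict(lambda: [0, 0])  # score -> [reoffended, total]

    for score, reoffended in records:
        high_risk = score >= threshold
        if high_risk and not reoffended:
            false_pos += 1   # rated high risk, did not reoffend
        elif high_risk and reoffended:
            true_pos += 1
        elif reoffended:
            false_neg += 1   # rated low risk, did reoffend
        else:
            true_neg += 1
        by_score[score][0] += int(reoffended)
        by_score[score][1] += 1

    return {
        # Share of non-reoffenders who were rated high risk.
        "false_positive_rate": false_pos / (false_pos + true_neg),
        # Share of reoffenders who were rated low risk.
        "false_negative_rate": false_neg / (false_neg + true_pos),
        # Calibration check: observed reoffense rate at each score.
        "observed_rate_by_score": {
            score: hits / total
            for score, (hits, total) in sorted(by_score.items())
        },
    }


# Invented records for one group: (score, reoffended).
toy_group = [
    (3, False), (3, False), (3, True), (3, False),
    (7, True), (7, True), (7, False), (7, True),
]

print(fairness_metrics(toy_group, threshold=5))
```

Calibration asks that the observed rate at each score be the same from group to group; the other two conditions ask the same of the two error rates.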

ProPublica highlighted failures of the second and third conditions. Northpointe responded by showing how COMPAS met the first condition. They were both right. The researchers proved that given two groups with differing average rates of positive instances, any assessment of the risk that falls short of perfect prediction would fail at least one of the conditions. There is no way for prediction to be fair in every sense. The authors provided an example. If a disease existed that impacted women more frequently than men, “any test designed to estimate the probability that someone is a carrier” would have an undesirable property based on the three conditions they outlined. Either “the test’s probability estimates are systematically skewed upward or downward for at least one gender,” “the test assigns a higher average risk estimate to healthy people (non-carriers) in one gender than the other” or “the test assigns a higher average risk estimate to carriers of the disease in one gender than the other.”
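A small numerical illustration, not the researchers' proof, shows how the conflict arises. The Python sketch below builds two invented groups whose scores are perfectly calibrated by construction: exactly 20 percent of low scorers and 80 percent of high scorers reoffend in each group. The only difference is how many people land in each bin, which gives the groups different overall reoffense rates. The function name `group_rates`, the two score levels, and the head counts are all assumptions made for the example.

```python
from fractions import Fraction


def group_rates(low_count, high_count,
                low_score=Fraction(1, 5), high_score=Fraction(4, 5)):
    """Error rates for a group whose scores are calibrated by construction.

    low_count / high_count: people given the low or high score
    (hypothetical head counts). Each score equals the share of people
    in that bin who actually reoffend, so calibration holds exactly.
    """
    low_reoffend = low_count * low_score
    high_reoffend = high_count * high_score
    low_clean = low_count - low_reoffend
    high_clean = high_count - high_reoffend

    return {
        "base_rate": (low_reoffend + high_reoffend) / (low_count + high_count),
        # Non-reoffenders who were nonetheless given the high score.
        "false_positive_rate": high_clean / (high_clean + low_clean),
        # Reoffenders who were nonetheless given the low score.
        "false_negative_rate": low_reoffend / (low_reoffend + high_reoffend),
    }


group_a = group_rates(low_count=100, high_count=100)
group_b = group_rates(low_count=150, high_count=50)

for name, rates in (("A", group_a), ("B", group_b)):
    print(name, {key: str(value) for key, value in rates.items()})
```

Under these assumptions both groups pass the calibration test, yet group A, with the higher base rate, has a false positive rate of 1/5 against 1/13 for group B, and a false negative rate of 1/5 against 3/7. That is the same pattern ProPublica and Northpointe each found in COMPAS.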

This proof isn’t constrained to computer programs. It holds for anything or anyone trying to make a prediction of risk. Humans make decisions and predictions through complicated biological algorithms. Our dataset is some subset of our knowledge and experiences. Yet even the most level-headed, reasonable person can’t make an imperfect prediction about groups that differ without giving up at least one of these kinds of fairness.

This realization is quite bleak, but it also seems fundamentally human. Since antiquity, we have tried to make better, fairer decisions only to come to the realization that we’ll never be perfect. The human condition is dogged by a never-ending series of tradeoffs. But equally human is the desire to relentlessly improve in the face of seemingly insurmountable obstacles. So regardless of the decision maker, the question of bias and fairness is far murkier and more complex than anyone anticipated. Over-simplification is anathema to progress. It’s not quite that humans or algorithms are biased — the very act of prediction is biased.