[This blog is cross-posted at the Cyber Alliance Blog]
(This post was originally written in August of 2018.)
Get a bunch of humans together to make some decisions, and some of them are bound to get it wrong. Humans are wildly different from each other, with so many different life experiences leading up to any decision that there is almost never 100% consensus on anything. But it doesn’t stop there. Humans are so different from each other that not only will many of them fail, many of them will fail in wildly different ways.
I swear I’m going somewhere with this, so bear with me.
This diversity of failure is an important property of a system that I think a lot of people overlook. It has two important consequences. First, if we were actually wrong about the best answer in the first place, there’s a good chance that someone will hit upon a better solution. This gives us a chance to discover the improvement and move to it, or just let the evolution of our answer slowly select for it over many iterations. Second, it means that we did not create a systemic problem by failing in only one way. Even if we undershot our answer some of the time, we did not undershoot our answer all of the time - sometimes we overshot as well. Hopefully, overshooting some of the time balances out some of the cost of undershooting.
Let me give an example. Let’s say ten people are going to a potluck, and must decide how much food, and what kind, to bring. If they don’t bring enough, then people will be hungry. If they bring too much, the leftovers will spoil and go to waste. Everyone has some kind of estimate of how much food the party will eat in total. Inevitably, some people are going to overestimate the amount, and some will underestimate it. But on the whole, the errors probably balance each other out and the final amount of food is probably within the bounds of acceptable. In fact, if this party continues to have potlucks, and everyone brings a slightly different amount of food every time, it’ll still continue to work out on average, though the variance says that sometimes there will be too much food left over, and sometimes there won’t be enough.
Now let’s say we design an algorithm to do the same thing. It should calculate the amount of food the party wants and buy it for the potluck. Since it’s the same people every time, its estimate doesn’t change. (Later we’ll deal with an algorithm that’s slightly more sophisticated, but for now, let’s say it’s deterministic.)
If the algorithm were "perfect," it would get this right every time. Unfortunately, the chance of this algorithm being perfect is pretty much nil.
Instead, the algorithm is probably going to be wrong in some direction or another. The point is that it will always be wrong in the same way. There will either always be too much food, or always too little food. This creates a systemic problem that wasn’t there when we failed in a heterogeneous way.
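The contrast between diverse and homogeneous failure can be made concrete with a small simulation. This is a hypothetical sketch, not taken from the post: the demand figure, noise level, and the algorithm's fixed bias are all assumed values chosen for illustration.

```python
import random

random.seed(0)
TRUE_DEMAND = 100   # hypothetical "right" total amount of food, in servings
PARTIES = 1000      # number of simulated potlucks

# Ten humans: each guesses with independent, roughly unbiased noise.
def human_party_total():
    return sum(random.gauss(TRUE_DEMAND / 10, 3) for _ in range(10))

# One deterministic algorithm: always off by the same amount.
ALGO_ESTIMATE = 92  # systematically undershoots by 8 servings

human_errors = [human_party_total() - TRUE_DEMAND for _ in range(PARTIES)]
algo_errors = [ALGO_ESTIMATE - TRUE_DEMAND for _ in range(PARTIES)]

# The humans' misses roughly cancel: average error near zero.
print(sum(human_errors) / PARTIES)
# The algorithm's error is identical every single time: a systemic shortfall.
print(sum(algo_errors) / PARTIES)
```

The individual human parties still vary (sometimes too much food, sometimes too little), but the errors average out across parties; the deterministic algorithm produces the same shortfall at every party, forever.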
“But wait!” I hear you saying. “Wouldn’t the party realize that they were always buying too little food, and just tell the algorithm to buy more?”
That approach works great if the party is in charge of the algorithm. But in my example, this algorithm is actually a service used by hundreds of millions of people. It’s not going to change its approach just for one party of ten - that would be more expensive and complicated for it, and anyway, everyone has to keep using the potluck service that their friends use, so no one is going to switch to a more flexible, competing service. Sometimes it might not be a service at all, but a system mandated by the government - but that’s beside the point.
This doesn’t even get to the problem where the food ordered by the algorithm is almost certainly objectionable to some people due to allergies, religion, or preferences. Ordering through that algorithm rather than allowing for a diversity of failures has now locked certain people out of eating at the potluck at all - not just in one party, but in all of them.
You could argue that a good food-choosing algorithm wouldn’t do this, but personal experience has confirmed to me that even in scenarios where food is trying to cater to everyone involved, it can still be true that there are no consumable food options for a lot of the population. It’s not a given that a food-choosing algorithm would be sensitive to this kind of thing, especially if the responsible parties are not incentivized to care.
Okay, so now that we’ve talked about the potluck example, let’s talk about automated content moderation.
Danielle Citron, a privacy expert and professor at the University of Maryland Francis King Carey School of Law, was invited to give a talk at the Cyber Alliance this week on content moderation and how recent developments have given some cause for concern.
Until recently, Facebook and other social media platforms have taken a First Amendment approach to content moderation, at least in the context of stalking. In general, they would err on the side of allowing the speech. In addition, their moderation was done by humans in a fairly ad-hoc manner - a user would report a post, then human moderators would make a decision on whether to allow the post or not. The main regulation governing this in the U.S. was (and still is) Section 230 of the Communications Decency Act. This act includes protection for “Good Samaritan” under- or over-filtering of content; that is, as long as you can demonstrate that you were trying to do a good job filtering your content, you won’t be punished for being a little overzealous in either direction.
As of 2015, the main method for making difficult filtering decisions was human moderation. Three things changed over the next couple of years that irrevocably altered the playing field: the first change accompanied the terrorist attacks in Paris (2015) and Brussels (2016). The second was the Russian disinformation campaign during the 2016 U.S. presidential election, and the third was the shockwave as the public learned of companies like Cambridge Analytica and their involvement in elections.
EU states began to blame social media companies for allowing extremist content to remain on their sites, and imposed a rule that such content must be removed within 24 hours of its posting. In the U.S., Facebook especially got a lot of bad press in the media, and Mark Zuckerberg had to testify about the Facebook moderation process (among other things) in front of Congress. In the wake of the extremely stringent 24-hour window and the volume of content being uploaded, Facebook and others turned to machine learning to block unwanted content.
The problem with automating something like content moderation is that you run into all the problems I described before - everyone has a different opinion about what the right answer is, and now you only get one of them. There is now one systemic failure mode, rather than a diversity of failure. And if the only downside of an algorithm is user unhappiness rather than a legal restriction, then there is no incentive to ever change the algorithm and explore in the hope of better results. If the algorithm errs too much on the side of censorship, then Facebook has a handful of angry users. But if it errs too much on the side of speech, the company gets punished with huge fines and other angry users.
This concern applies to many scenarios, not only content moderation. In fact, content moderation is probably one of the less concerning ones when you think about algorithms governing parole decisions, recruiting and admissions decisions, marketing and price decisions, and so on. But content moderation has a global reach on social media platforms like Facebook - as Danielle pointed out, individual countries have different laws, but Facebook’s Terms of Service are global. And it sets the tone for how we are able to discuss other issues. We’re coming out of something of a turning point, and the longer the current status quo continues, the more it will seem “normal” and we will cease judging it on its actual merits. So if you were ever going to start thinking about this, now is a good time.
This homogeneous-failure problem would go away if the algorithm never failed - if it could somehow “perfectly” distinguish Bad content from Good content. There are two barriers to this. One is the ability of the ML algorithm to solve whatever problem it is given - lowering error rates. A lot of research goes into trying to improve this process, and in a few years it may be possible that ML is better than each individual human at choosing its answers.
But the bigger problem is that we cannot even properly define the problem, and algorithms are poor decision-makers for ill-defined problems. There are going to be some obvious posts that most people agree are Bad, and some posts that most people agree are Good. But there is a whole universe of things in the middle where there is widespread disagreement - and for good reason. Society’s views shift among individuals, over generational changes, over political events, and naturally over lifetimes. Today’s algorithm will not agree with tomorrow’s standards, and it will certainly never agree with everyone.
In my opinion, this homogenization of failure is the biggest cost of automation, and is a depressingly rarely-recognized problem.
Automation has many benefits in addition to this cost - and sometimes the benefits will outweigh the cost. And, as the state of the art improves, an algorithm may be a very desirable alternative to a single human decision-maker, since a lone human had no diversity of failure to begin with. It’s just that this cost should not be ignored, especially in these high-impact decisions.
I’ll end with a couple possible mitigations to the failure homogenization problem. One simple way is to have a fallback human decision-maker - or better yet, multiple humans. Only bring in the humans on content that is contested or where the algorithm had low confidence, and you’ve at least added an ability to adjust the system in difficult cases.
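A human-fallback pipeline like this is simple to sketch. Everything below is hypothetical - the function names, the toy classifier, and the confidence threshold are all invented for illustration:

```python
# Minimal sketch of a human-fallback moderation pipeline (all names hypothetical).
CONFIDENCE_THRESHOLD = 0.9

def moderate(post, classifier, human_queue):
    """Apply the automated decision only when the model is confident;
    route contested or low-confidence content to human moderators."""
    label, confidence = classifier(post)
    if confidence < CONFIDENCE_THRESHOLD:
        # Humans handle the hard cases and can feed corrections
        # back into the system over time.
        human_queue.append(post)
        return "pending_human_review"
    return label

# Toy classifier: very sure about obvious spam, unsure about everything else.
def toy_classifier(post):
    if "spam" in post:
        return "remove", 0.99
    return "allow", 0.55

queue = []
print(moderate("buy spam now", toy_classifier, queue))            # → remove
print(moderate("nuanced political take", toy_classifier, queue))  # → pending_human_review
```

The key design choice is the threshold: it trades off human workload against how many borderline decisions get made by a single, homogeneous algorithm.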
If the amount of data to be classified is so great that the entire process must be automated, try an ensemble of ML algorithms - a committee, if you will - each designed differently and presumably yielding slightly different results. There are algorithms for combining decision-makers in this way. The process tends to raise the overall accuracy of the system, which is a plus. More importantly, though, it enables humans to judge the algorithms’ performance and to learn new perspectives on the problem they were trying to solve, just like back in the potluck example. It re-injects diversity of results, even if that diversity is only among the algorithms themselves.
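A committee of this kind can be sketched as a simple majority vote in which disagreement itself is surfaced rather than hidden. This is an illustrative toy, not any platform's real system; the models and the tie-breaking margin are assumptions:

```python
# Sketch of a "committee" of differently-designed classifiers (all hypothetical).
def committee_decision(post, models, margin=1):
    """Majority-vote the models, and flag close calls as contested
    so their disagreement stays visible to human overseers."""
    votes = [m(post) for m in models]
    allow = votes.count("allow")
    remove = votes.count("remove")
    decision = "allow" if allow >= remove else "remove"
    # Disagreement among the models is itself useful signal.
    contested = abs(allow - remove) <= margin
    return decision, contested

# Three deliberately different toy models.
models = [
    lambda p: "remove" if "slur" in p else "allow",   # keyword-based
    lambda p: "remove" if len(p) > 280 else "allow",  # length-based
    lambda p: "allow",                                # permissive baseline
]

print(committee_decision("a short friendly post", models))  # → ('allow', False)
```

Real ensemble methods (bagging, boosting, stacked models) are more sophisticated than a raw vote, but the property that matters here is the same: multiple designs fail in different ways, and their disagreements mark exactly the cases worth a second look.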
This post was written by Sarah Scheffler, a second-year Ph.D. student in computer science studying applied cryptography.