In 1998, I inadvertently created a racially biased artificial intelligence algorithm. There are lessons in that story that resonate even more strongly today.
The dangers of bias and errors in AI algorithms are now well known. So why has there been a spate of blunders by tech companies in recent months, especially in the world of AI chatbots and image generators? Early versions of ChatGPT produced racist output. The image generators DALL-E 2 and Stable Diffusion both showed racial bias based on skin color in the images they created.
My own revelation as a white male computer scientist came while teaching a computer science class in 2021. The class had just watched a video poem by Joy Buolamwini, an AI researcher and artist who describes herself as a poet of code. Her 2019 video poem “AI, Ain’t I a Woman?” is a devastating three-minute exposé of racial and gender biases in automatic facial recognition systems — systems developed by tech companies like Google and Microsoft.
The systems often fail women of color, mislabeling them as male. Some failures are particularly egregious: the hair of Black civil rights leader Ida B. Wells is labeled a “coonskin cap”; another Black woman is labeled as having a “walrus mustache.”
Echoes through the years
I had a terrible moment of déjà vu in that computer science class: I suddenly remembered that I, too, had once created a racially biased algorithm. In 1998 I was a PhD student. My project involved tracking a person’s head movements based on input from a video camera. My doctoral adviser had already developed mathematical techniques for accurately tracking the head in certain situations, but the system needed to be much faster and more robust. Earlier in the 1990s, researchers in other labs had shown that skin-colored areas of an image could be extracted in real time. So we decided to focus on skin color as an additional cue for the tracker.
I used a digital camera – still a rarity at the time – to take a few shots of my own hand and face, and I also took shots of the hands and faces of two or three other people who happened to be in the building. It was easy to manually extract some of the skin-colored pixels from these images and construct a statistical model for the skin tones. After some tweaking and debugging, we had a surprisingly robust real-time head-tracking system.
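The article does not spell out exactly which statistics the model used, but a minimal sketch of this kind of skin-color model, assuming a single Gaussian fit in RGB space to hand-labeled skin pixels and a per-pixel threshold, might look like the following. The function names, the Gaussian choice and the threshold are assumptions for illustration, not the original 1998 method.

```python
# Minimal sketch of a statistical skin-color model of the kind described above.
# Assumptions (not from the original system): a single Gaussian in RGB space,
# fit to hand-labeled skin pixels, then thresholded for every pixel of a frame.
import numpy as np

def fit_skin_model(skin_pixels: np.ndarray):
    """skin_pixels: (N, 3) array of RGB values manually labeled as skin."""
    mean = skin_pixels.mean(axis=0)           # (3,) average skin color
    cov = np.cov(skin_pixels, rowvar=False)   # (3, 3) covariance of skin colors
    return mean, np.linalg.inv(cov)

def skin_mask(frame: np.ndarray, mean, inv_cov, threshold=9.0):
    """frame: (H, W, 3) image. Returns a boolean mask of 'skin-colored' pixels."""
    diff = frame.reshape(-1, 3).astype(float) - mean
    # Squared Mahalanobis distance of every pixel's color from the mean skin color.
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)
    return (d2 < threshold).reshape(frame.shape[:2])
```

Under this sketch, a pixel counts as skin only if its color sits within a few standard deviations of the average of the training pixels, so a model fit only to light-skinned subjects quietly pushes darker skin toward the non-skin side.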
Not long after, my adviser asked me to demonstrate the system to some visiting company executives. When they walked into the room, I was immediately flooded with anxiety: the executives were Japanese. In my casual experiment to see whether a simple statistical model would work with our prototype, I had collected data from myself and a handful of others who happened to be in the building. But 100% of these subjects had “white” skin; the Japanese executives did not.
Miraculously, the system worked reasonably well with the executives. But I was shocked to realize that I had created a racially biased system that could easily have failed for other non-white people.
Privilege and priorities
How and why do well-trained scientists with good intentions produce biased AI systems? Sociological theories of privilege provide a useful lens.
Ten years before I created the head-tracking system, the scholar Peggy McIntosh proposed the idea of an “invisible knapsack” carried by white people. Inside the knapsack is a trove of privileges such as “I can do well in a challenging situation without being called a credit to my race” and “I can criticize our government and talk about how much I fear its policies and behavior without being seen as a cultural outsider.”
In the age of AI, that knapsack needs some new items, such as “AI systems won’t give bad results because of my race.” A white scientist’s invisible knapsack would also need: “I can develop an AI system based on my own appearance and know it will work well for most of my users.”
A suggested remedy for white privilege is to be actively anti-racist. For the 1998 head-tracking system, it may seem obvious that the anti-racist remedy is to treat all skin colors equally. Certainly, we can and must ensure that the system’s training data represents the full range of skin tones as evenly as possible.
Unfortunately, this does not guarantee that all skin colors observed by the system will be treated equally. The system must classify every possible color as skin or non-skin. Therefore, there will always be colors right on the borderline between skin and non-skin, a region computer scientists call the decision boundary. A person whose skin color falls on the wrong side of this decision boundary will be classified incorrectly.
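To make the decision-boundary point concrete, here is a purely illustrative continuation of the sketch above. The mean, covariance and threshold are invented numbers, not values from the 1998 system; the point is only that two fairly similar darker tones can land on opposite sides of the same boundary.

```python
import numpy as np

# Purely illustrative numbers; not the 1998 model's actual parameters.
mean = np.array([200.0, 160.0, 140.0])                    # hypothetical "average skin" RGB
inv_cov = np.linalg.inv(np.diag([900.0, 900.0, 900.0]))   # hypothetical covariance
threshold = 9.0                                           # hypothetical decision boundary

def is_skin(rgb):
    """Classify a single RGB color as skin (True) or non-skin (False)."""
    diff = np.asarray(rgb, dtype=float) - mean
    return float(diff @ inv_cov @ diff) < threshold

print(is_skin([150, 120, 110]))  # squared distance ~5.6, inside the boundary  -> True
print(is_skin([130, 105, 95]))   # squared distance ~11.1, outside the boundary -> False
```

Wherever the threshold is placed, some colors end up just on the wrong side of it, and which colors those are depends entirely on the training data behind the model.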
Scientists also face a nasty unconscious dilemma when integrating diversity into machine learning models: diverse, inclusive models underperform narrow models.
A simple analogy can explain this. Imagine you are given a choice between two tasks. Task A is to identify one particular type of tree, say, elms. Task B is to identify five types of trees: elm, ash, locust, beech and walnut. Clearly, if you are given a fixed amount of time to practice, you will perform better on Task A than on Task B.
Similarly, an algorithm that tracks only fair skin will be more accurate than an algorithm that tracks the full range of human skin tones. Even if they are aware of the need for diversity and fairness, scientists can be subconsciously influenced by this competing need for accuracy.
Hidden in the numbers
My creation of a biased algorithm was thoughtless and potentially offensive. Even more worrying, this incident demonstrates how bias can remain concealed deep inside an AI system. To see why, consider a particular set of 12 numbers arranged in a matrix of three rows and four columns. Do they seem racist? The head-tracking algorithm I developed in 1998 is controlled by a matrix like this one, which describes the skin color model. But it is impossible to tell from these numbers alone that the matrix is in fact racist. They are just numbers, determined automatically by a computer program.
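The article does not reveal what those 12 numbers are. One plausible layout for a three-row, four-column matrix that fully specifies a Gaussian skin-color model is an RGB mean vector in the first column and a 3-by-3 covariance matrix in the remaining columns; that layout, and every value below, is an assumption for illustration only.

```python
import numpy as np

# Hypothetical 3x4 matrix: first column = mean RGB of "skin", remaining
# 3x3 block = covariance of the skin colors. Both the layout and the
# values are illustrative assumptions, not the author's actual matrix.
model = np.array([
    [200.0, 900.0,  50.0,  40.0],
    [160.0,  50.0, 800.0,  60.0],
    [140.0,  40.0,  60.0, 700.0],
])
mean, cov = model[:, 0], model[:, 1:]
# Nothing in these 12 numbers announces whose skin they were fit to;
# any bias lives in the training data behind them, not in their appearance.
```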

The problem of bias hiding in plain sight is far more severe in modern machine learning systems. Deep neural networks – currently the most popular and powerful type of AI model – often have millions of numbers in which bias can be encoded. The biased facial recognition systems criticized in “AI, Ain’t I a Woman?” are all deep neural networks.
The good news is that a great deal of progress on AI fairness has already been made, both in academia and in industry. Microsoft, for example, has a research group known as FATE, devoted to Fairness, Accountability, Transparency and Ethics in AI. A leading machine learning conference, NeurIPS, has established detailed ethics guidelines, including an eight-point list of negative social impacts that researchers submitting papers must consider.
Who is in the room
On the other hand, even in 2023, fairness can still fall victim to competitive pressures in academia and industry. The flawed Bard and Bing chatbots from Google and Microsoft are recent evidence of this grim reality. The commercial need to build market share led to the premature release of these systems.
These systems suffer from exactly the same problems as my 1998 head tracker. Their training data is biased. They were designed by an unrepresentative group. They face the mathematical impossibility of treating all categories equally. They must somehow trade accuracy for fairness. And their biases are hidden behind millions of inscrutable numerical parameters.
So how far has the AI field really come since it was possible, more than 25 years ago, for a PhD student to design and publish the results of a racially biased algorithm with no apparent oversight or consequences? It is clear that biased AI systems can still be created unintentionally and easily. It is also clear that the bias in these systems can be harmful, hard to detect and even harder to eliminate.
Today it is a cliché to say that academia and industry need diverse groups of people “in the room” designing these algorithms. It would be helpful if the field could reach that point. But in reality, with North American computer science doctoral programs graduating only about 23% women and 3% Black and Latino students, there will continue to be many rooms and many algorithms in which underrepresented groups are not represented at all.
That’s why the fundamental lessons of my 1998 head-tracker are even more important today: It’s easy to make a mistake, it’s easy for bias to sneak in undetected, and everyone in the room is responsible for preventing it.