Commercial facial-analysis artificial intelligence programmes tend to demonstrate skin-type and gender biases, a study has found. In experiments, the error rates of three commercial programmes in determining the gender of light-skinned men were never worse than 0.8 per cent.
For darker-skinned women, however, the error rates ballooned to more than 20 per cent in one case and more than 34 per cent in the other two. The findings raise questions about how today’s neural networks, which learn to perform computational tasks by looking for patterns in huge data sets, are trained and evaluated. For instance, researchers at a major US technology company claimed an accuracy rate of more than 97 per cent for a face-recognition system they’d designed.
However, the data set used to assess its performance was more than 77 per cent male and more than 83 per cent white. “What’s really important here is the method and how that method applies to other applications,” said Joy Buolamwini, a researcher at Massachusetts Institute of Technology (MIT) in the US. “The same data-centric techniques that can be used to try to determine somebody’s gender are also used to identify a person when you’re looking for a criminal suspect or to unlock your phone,” she added.
“It’s not just about computer vision. I’m really hopeful that this will spur more work into looking at other disparities,” she said. The three programmes that researchers investigated were general-purpose facial-analysis systems, which could be used to match faces in different photos as well as to assess characteristics such as gender, age, and mood.
All three systems treated gender classification as a binary decision – male or female – which made their performance on that task particularly easy to assess statistically. However, the same types of bias probably afflict the programmes’ performance on other tasks, too. To begin investigating the programmes’ biases systematically, Buolamwini first assembled a set of images in which women and people with dark skin are much better represented than they are in the data sets typically used to evaluate face-analysis systems. The final set contained more than 1,200 images.
Next, she worked with a dermatologic surgeon to code the images according to the Fitzpatrick scale of skin tones, a six-point scale, from light to dark, originally developed by dermatologists as a means of assessing risk of sunburn. Then she applied three commercial facial-analysis systems from major technology companies to her newly constructed data set.
Across all three, the error rates for gender classification were consistently higher for females than they were for males, and for darker-skinned subjects than for lighter-skinned subjects. For darker-skinned women, the error rates were 20.8 per cent, 34.5 per cent, and 34.7 per cent. But with two of the systems, the error rates for the darkest-skinned women in the data set were worse – 46.5 per cent and 46.8 per cent. Essentially, for those women, the systems might as well have been guessing gender at random.
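The kind of breakdown described above, where error rates are reported per subgroup and not as a single overall figure, can be sketched in a few lines of Python. This is a minimal illustration only, not the researchers' code: the function name `error_rates_by_group`, the group labels, and the toy records are all hypothetical, and the counts are invented to show how a small, poorly served subgroup can be hidden inside a flattering overall number.

```python
from collections import defaultdict


def error_rates_by_group(records):
    """Return the misclassification rate for each subgroup.

    `records` is an iterable of (group, true_label, predicted_label)
    tuples. The structure is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, truth, predicted in records:
        totals[group] += 1
        if truth != predicted:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Hypothetical toy data (NOT figures from the study): a large subgroup
# that is classified well and a small subgroup that is classified badly.
records = (
    [("lighter-skinned men", "male", "male")] * 99
    + [("lighter-skinned men", "male", "female")] * 1
    + [("darker-skinned women", "female", "female")] * 13
    + [("darker-skinned women", "female", "male")] * 7
)

rates = error_rates_by_group(records)
overall = sum(truth != pred for _, truth, pred in records) / len(records)

for group, rate in rates.items():
    print(f"{group}: {rate:.1%} error")
print(f"overall: {overall:.1%} error")
```

In this invented example the overall error rate stays in single digits even though one subgroup is misclassified about a third of the time, which is why a test set dominated by one demographic can make a system look more accurate than it is for everyone else.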