Study Finds Machine Learning Technique May Worsen Fairness
Researchers with the Center for Advancing Safety of Machine Intelligence (CASMI) have found that automated data cleaning, a commonly used machine learning (ML) technique meant to improve accuracy, is more likely to worsen than to improve fairness for historically disadvantaged groups.
The research team, which included Julia Stoyanovich, associate professor of computer science & engineering and of data science at New York University (NYU) and director of its Center for Responsible AI; Falaah Arif Khan, a PhD student at NYU; and their collaborators Prof. Sebastian Schelter and Shubha Guha from the University of Amsterdam, published its findings in a paper entitled “Automated Data Cleaning Can Hurt Fairness in Machine Learning-Based Decision Making.” The researchers found that automated data cleaning has a negligible impact on both accuracy and fairness in the majority of cases, but that when cleaning does affect fairness, it is more likely to worsen it than to improve it.
“The message in this paper is we need to pay attention to what happens during data cleaning,” Stoyanovich said. Her research in data management systems focuses on how to ensure data is processed responsibly before AI technologies make predictions. Stoyanovich is the principal investigator for the CASMI-funded project, “Incorporating Stability Objectives into the Design of Data-Intensive Pipelines.”
The study examined more than 26,000 models trained on five publicly available datasets from the census, finance, and healthcare domains. Researchers reviewed data on race, sex, and age and applied common strategies for detecting and repairing data errors such as missing values, which can occur when people intentionally omit information about themselves.
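As a minimal, hypothetical sketch of what such an automated step can look like (the library calls and column names below are assumptions for illustration, not details from the paper), a cleaning pipeline typically flags incomplete columns and fills them with a dataset-wide statistic such as the median:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical applicant data; years_experience is missing for some rows.
df = pd.DataFrame({
    "age": [58, 31, 44, 29],
    "years_experience": [None, 7.0, None, 4.0],
})

# Step 1: detect errors -- here, simply count missing values per column.
print(df.isna().sum())

# Step 2: repair -- a common automated default is global median imputation.
imputer = SimpleImputer(strategy="median")
df[["years_experience"]] = imputer.fit_transform(df[["years_experience"]])
print(df)
```

In this toy data, both missing entries receive the same fill value (5.5 years), regardless of who the applicant is or why the value is missing.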
Stoyanovich provided an example of how missing values can impact people. If an older person is applying for a job, they may exclude the year they graduated from college because they don’t want to be discriminated against. By the same token, people who temporarily stay at home to care for family members may leave out the dates they worked at different jobs. This can taint the data pool. If a company uses this data to screen job applicants with AI, the missing dates make those applicants appear less experienced than they are, and the model would likely suggest lower salaries for them.
“This is a problem,” Stoyanovich said. “Very often, these are women who stay at home. This is going to reinforce a gender wage gap.”
One surprising finding was that errors in the data were equally distributed across demographic groups, but when the ML model’s predictions were evaluated, the outcomes were not equally good for everyone, Khan said.
“This sort of implicates two things,” Khan said. “One is that we're not detecting errors correctly. Maybe we're missing more errors that exist for certain groups. Two is that we're not fixing them equally well.”
Data can also be mislabeled, which can lead to bad outcomes. In one health-related dataset, the fraction of false positives was significantly higher for the privileged group than for the disadvantaged group. Researchers say this could be problematic.
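The disparity described here is typically measured by comparing false positive rates across groups. A minimal sketch of that check, with hypothetical column and group names, might look like this:

```python
import pandas as pd

# Hypothetical evaluation output: true labels, model predictions, group membership.
results = pd.DataFrame({
    "y_true": [0, 0, 1, 0, 0, 1, 0, 0],
    "y_pred": [1, 0, 1, 1, 0, 1, 0, 0],
    "group":  ["privileged"] * 4 + ["disadvantaged"] * 4,
})

# False positive rate per group: false positives / actual negatives in that group.
for name, g in results.groupby("group"):
    negatives = g[g["y_true"] == 0]
    fpr = (negatives["y_pred"] == 1).mean()
    print(f"{name}: false positive rate = {fpr:.2f}")
```

A large gap between the two printed rates is the kind of group-level disparity the study flags, even when the overall number of errors looks balanced.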
“When mislabels occur, it's the direction of impact that we need to look at,” Khan said.
Overall, the study found that automated cleaning produced worse outcomes in 23.6% of the cases involving missing values and in 33.3% of the cases involving labeling errors.
“We need to stop, pay attention, and investigate this better so that when we do pre-processing, we’re actually aware of effects like missing values,” Stoyanovich said. “If you know this, you can guess better. For example, instead of taking the median of the entire population, you could take the median of older people or the median of women, and it’ll be a more precise estimate.”
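As a hedged sketch of the group-conditional imputation Stoyanovich describes (the column and group names are illustrative, not taken from the paper), the median can be computed within each group rather than over the whole dataset:

```python
import pandas as pd

# Hypothetical resume data: caregiving gaps appear as missing experience values.
df = pd.DataFrame({
    "sex": ["F", "F", "F", "M", "M", "M"],
    "years_experience": [None, 12.0, 10.0, 3.0, None, 5.0],
})

# Global median imputation ignores group structure ...
global_fill = df["years_experience"].fillna(df["years_experience"].median())

# ... whereas group-conditional imputation fills each group with its own median.
group_fill = df.groupby("sex")["years_experience"].transform(
    lambda s: s.fillna(s.median())
)

print(pd.DataFrame({"global": global_fill, "by_group": group_fill}))
```

In this toy data, the global median (7.5 years) is a poor estimate for both groups, while the group-conditional medians (11 and 4 years) track each group’s own distribution, which is the more precise estimate Stoyanovich refers to.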
This paper builds on Stoyanovich’s previous research on AI systems used in hiring. Her research team published a paper entitled “An External Stability Audit Framework to Test the Validity of Personality Prediction in AI Hiring,” which studied two frequently used AI-powered personality testing products from Humantic AI and Crystal. The paper concluded that neither test should be considered a valid instrument for pre-hire assessment, because both predicted “personality” differently for the same job applicant when the input was changed in ways that should not affect the prediction. For example, both tests produced vastly different “personality scores” for a person’s resume and the same person’s LinkedIn profile. Further, one of the tests produced different “personality scores” for some identical resumes that were saved in different file formats, such as PDF or rich text.
Stoyanovich’s approach is to help regulators create new laws, and she has been successful in spearheading change in New York City. In July, a new law will go into effect requiring companies that use AI systems in hiring to notify job seekers that they will be screened using automated tools. It will also require companies to conduct independent audits to assess bias.
“There’s still ongoing work here,” Stoyanovich said. “The end goal is making sure these tools work well, and that they work equally well for people in different demographic and socioeconomic groups.” Her work on AI hiring regulation was recently covered by the New York Times.