
Researchers Investigating Popular Predictor Tool, Working to Mitigate Bias in Data

Researchers with the Northwestern Center for Advancing Safety of Machine Intelligence (CASMI) are working to understand what is causing bias in a popular predictor tool that is quick to train.

Romila Pradhan, assistant professor of computer and information technology at Purdue University, is the principal investigator of the CASMI-funded project, “Diagnosing, Understanding, and Fixing Data Biases for Trusted Data Science.” Her research team is investigating data from multiple domains (such as finance and criminal justice) to find the source of discrimination observed in a commonly used tree-based machine learning (ML) model called the random forest classifier.

“What we observe is that on the unseen data, we might have some bias,” Pradhan said. “We had a model that worked fairly well in training, but it did not perform as well on test data. It may be favoring one gender over another or one race over another. I want to trace all of this bias back to different tiny steps in the machine learning process and eventually to the training data.”

A random forest classifier makes a prediction by combining the outputs of many decision trees, each of which weighs a different mix of factors. For example, a bank may use a random forest classifier to decide whether someone should be approved for a home loan: one decision tree may focus on credit score and income, while another might look at employment history and down payment. The classifier then aggregates these individual predictions to decide whether the loan should be approved.
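
As a rough sketch of how such a classifier is built and queried, the example below trains scikit-learn's RandomForestClassifier on invented loan-application features (credit score, income, years employed, down payment). The data, the approval rule, and the feature set are hypothetical illustrations, not the model or data used in the research.

```python
# Minimal sketch of a random forest used for a loan-approval decision.
# The features, data, and approval rule are hypothetical illustrations.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Hypothetical applicant features: credit score, income, years employed, down payment.
X = np.column_stack([
    rng.integers(300, 851, n),          # credit score
    rng.normal(60_000, 20_000, n),      # annual income
    rng.integers(0, 30, n),             # years employed
    rng.normal(30_000, 15_000, n),      # down payment
])
# Hypothetical approval labels, for illustration only.
y = (X[:, 0] > 650).astype(int)

# Each of the 100 trees sees a bootstrap sample of the rows and a random subset
# of the features; the forest's prediction aggregates the individual trees' votes.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

applicant = np.array([[720, 55_000, 6, 25_000]])
print(model.predict(applicant))        # 1 = approve, 0 = deny (in this toy setup)
print(model.predict_proba(applicant))  # share of trees voting each way
```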

This research project builds upon previous work involving a different, smaller class of ML models. Pradhan was the lead author of the paper “Interpretable Data-Based Explanations for Fairness Debugging,” which developed a method to identify subsets of training data as root causes of bias or unexpected model behavior. The researchers found that removing problematic subsets of the training data can reduce bias.

Pradhan provided an example of a model that uses demographic and financial information to predict whether an individual would make more or less than $50,000 a year. Under this model, males were 15% more likely than females to be predicted to make more than $50,000 a year. However, the researchers found that removing a pinpointed group of 100 people from the training data (for example, 100 Indiana men between the ages of 35 and 49) immediately reduced the bias from 15% to 4%.
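
The kind of measurement behind those numbers can be illustrated with a short, hypothetical sketch: train a model, compute a simple demographic-parity gap on held-out data, remove a suspect slice of training rows, retrain, and re-measure. The synthetic columns, the subset condition, and the helper functions below are all invented for illustration; this is not the implementation from the paper.

```python
# Hedged sketch of the workflow described above, on synthetic data.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
sex = rng.choice(["male", "female"], n)
df = pd.DataFrame({
    "sex": sex,
    "age": rng.integers(18, 70, n),
    "state": rng.choice(["IN", "IL", "OH"], n),
    # Hypothetical proxy feature that happens to correlate with sex.
    "hours_per_week": rng.normal(38, 10, n) + np.where(sex == "male", 5.0, 0.0),
})
df["income_over_50k"] = (df["hours_per_week"] > 42).astype(int)

features = ["age", "hours_per_week"]

def parity_gap(model, data):
    """Difference in positive-prediction rates between male and female rows."""
    preds = model.predict(data[features])
    return preds[data["sex"] == "male"].mean() - preds[data["sex"] == "female"].mean()

def train(data):
    return RandomForestClassifier(n_estimators=100, random_state=0).fit(
        data[features], data["income_over_50k"])

train_df, test_df = df.iloc[:1500], df.iloc[1500:]
print("gap before:", parity_gap(train(train_df), test_df))

# Drop a candidate subset flagged as a likely source of bias (here, a narrow
# demographic slice), retrain, and re-measure the gap on the same held-out data.
suspect = (train_df["state"] == "IN") & train_df["age"].between(35, 49) & (train_df["sex"] == "male")
print("gap after: ", parity_gap(train(train_df[~suspect]), test_df))
```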

“I'm trying to come up with that kind of explanation for random forest classifiers,” Pradhan said.

The research team primarily assumes that the ML model itself is fair and that the training data is at fault, said Tanmay Surve, a Purdue graduate student whose work focuses on fairness in ML. Faulty training data can stem from several factors, such as human error and the underrepresentation of certain groups.

To find the data subsets that are causing bias, Surve said, the team is using a technique called machine unlearning.

“For example, I have a Facebook account, and I want to delete my Facebook account,” he said. “I also want Facebook to stop using my personal, sensitive information because I'm no longer their customer. This is a very hard problem to solve in real life. The ML model has already been trained on your data. It’s not practical to retrain the model from scratch, excluding my data points. To solve this, machine unlearning works to find ways to remove the influence of a few data points without retraining the model from scratch.”
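
The article does not specify which unlearning technique the team uses. For simple convex models, one common approach approximates the effect of deleting a few training points with a single influence-function-style Newton update instead of a full retrain; the sketch below shows that idea for L2-regularized logistic regression, purely as an illustration of machine unlearning in general.

```python
# Hedged sketch of machine unlearning for L2-regularized logistic regression:
# approximate the effect of deleting a few training points with one Newton step
# instead of retraining from scratch. Illustration only, not the team's method.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_point(theta, x, y):
    """Gradient of the logistic loss for a single (x, y) pair, y in {0, 1}."""
    return (sigmoid(x @ theta) - y) * x

def hessian(theta, X, lam):
    """Hessian of the regularized logistic loss over the rows of X."""
    p = sigmoid(X @ theta)
    W = p * (1 - p)
    return (X * W[:, None]).T @ X + lam * np.eye(X.shape[1])

def unlearn(theta, X, y, remove_idx, lam=1e-2):
    """One Newton step approximating a retrain without the rows in remove_idx.

    Assumes theta is the minimizer of the full regularized loss on (X, y).
    """
    keep = np.setdiff1d(np.arange(len(X)), remove_idx)
    # Sum of gradients contributed by the deleted points at the current optimum.
    g = sum(grad_point(theta, X[i], y[i]) for i in remove_idx)
    # Hessian of the loss that remains after deletion.
    H = hessian(theta, X[keep], lam)
    # At the original optimum the full gradient is ~0, so removing the points
    # leaves a residual gradient of -g; one Newton step corrects for it.
    return theta + np.linalg.solve(H, g)
```

Because this is only a correction around the original optimum, such approximations are usually accurate only when a small number of points are removed, and applying the idea to tree ensembles like random forests requires different machinery.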

Surve said the research mathematically proves which data subsets are causing issues.

Removing problematic data subsets can decrease the accuracy of an ML model by one or two percent. Pradhan said that is a small sacrifice she is willing to make.

“By compromising a little bit on accuracy, you can achieve a lot of fairness,” Pradhan said. “We have observed that we do not lose out much on accuracy. But I also have plans to extend this line of work to incorporate both fairness and accuracy. What happens when we are not only optimizing for fairness, but also for accuracy? How do things change then? That is the long-term plan.”
