Award-Winning Research Highlights Challenges and Opportunities for More Reliable Human-AI Decision-Making
Researchers with the Northwestern Center for Advancing Safety of Machine Intelligence (CASMI) have investigated threats to validity and reliability in human-AI decision-making and have demonstrated how new methods may help address these challenges.
The team from Carnegie Mellon University (CMU) – Luke Guerdan, a PhD student in the CMU Human-Computer Interaction Institute; Amanda Coston, a PhD student in machine learning and public policy; Kenneth Holstein, an Assistant Professor in the Human-Computer Interaction Institute and principal investigator for the CASMI project, “Supporting Effective AI-Augmented Decision-Making in Social Contexts”; and Steven Wu, an Assistant Professor in the Software and Societal Systems Department – won a best paper award in June at the Association for Computing Machinery Conference on Fairness, Accountability, and Transparency (ACM FAccT) for their research paper, “Counterfactual Prediction Under Outcome Measurement Error.”
The researchers first developed a new framework for human-AI decision-making. To build it, the team reviewed real-world deployments of AI-based decision support systems across a variety of fields. The framework, presented in “Ground(less) Truth: A Causal Framework for Proxy Labels in Human-Algorithm Decision-Making,” describes how information sources and decision-makers can influence one another in real-world human-AI decision-making settings. In the paper, the researchers analyzed prior published experimental studies of human-AI decision-making and found that 92% of the studies examined only a narrow subregion of the framework. Specifically, most study designs assumed away key factors that threaten the reliability of AI tools, relative to human decision-makers, in real-world settings.
“If we want to ensure safer human-AI decision-making, it’s urgent that the research knowledge we’re developing as a community is reflective of real-world deployment conditions. We hope researchers can use this framework to reflect on key aspects of human-AI decision-making in the real world that have been neglected in existing research,” Holstein said.
The research team has begun work to address this gap. For example, in a recent paper, “Toward Supporting Perceptual Complementarity in Human-AI Collaboration via Reflection on Unobservables,” the researchers experimentally investigated one key factor described in the framework: the availability of relevant information that human decision-makers can access, but an AI model cannot. Similarly, in an upcoming paper, “Training Towards Critical Use: Exploring How Humans Learn to Make AI-Assisted Decisions in the Presence of Incomplete Information” (led by Anna Kawakami, PhD student at the CMU Human-Computer Interaction Institute), the team experimentally investigated the effects of training interfaces that help human decision-makers learn to use AI tools more critically, taking into account the ways their own knowledge may complement that of an AI model.
Guerdan presented two papers at ACM FAccT on June 14: one introducing the framework and a second investigating its implications for AI model development. The award-winning research focused on three key threats to the reliability of AI models described in the framework. Outcome measurement error refers to the gap between what human decision-makers actually care about (such as a patient’s need for medical care) and the easier-to-measure proxy outcome that an AI model is trained to predict (such as previously incurred healthcare costs). Treatment effects refer to the impacts of decisions on the real-world outcomes a model is trained to predict (where those predictions are, in turn, intended to inform decisions). Finally, selection bias refers to the over- or under-representation of certain subgroups in a model’s training data, a result of the past decision-making that generated that data.
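To make these three threats concrete, the sketch below is a minimal, hypothetical Python example with synthetic data (not the authors’ code or methods). In it, the recorded label is a noisy proxy for the true outcome (outcome measurement error), past decisions shift that proxy (treatment effects), and labels are recorded only for cases selected by those past decisions (selection bias). A model trained naively on such data learns a distorted target.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# True outcome decision-makers care about (e.g., a patient's underlying need for care).
need = rng.normal(size=n)

# Proxy label actually recorded (e.g., healthcare costs): outcome measurement error.
cost = need + rng.normal(scale=1.0, size=n)

# Historical decisions were made using the proxy.
past_decision = cost > 0.5

# Treatment effect: intervening lowers the recorded proxy outcome.
cost_recorded = cost - 0.8 * past_decision

# Selection bias: the proxy is only recorded for cases selected by past decisions.
observed = past_decision

# A model fit only on cost_recorded[observed] targets post-treatment proxy costs
# for a non-representative subgroup, not underlying need in the full population.
print("Mean true need, full population:   ", round(need.mean(), 2))
print("Mean recorded proxy, training data:", round(cost_recorded[observed].mean(), 2))
```

The two printed averages diverge, illustrating why correcting for only one of these issues at a time can still leave the model predicting the wrong quantity.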
Prior approaches had focused on addressing each of these challenges separately when training models. But the researchers found that it was necessary to correct for their combined impacts.
“If you don't address all three of these challenges, it turns out that your model can do a bad job at the task you set out to achieve,” Guerdan said. “You need to carefully address all of these issues in parallel if you want to develop a model that is valid.”
These findings can help in any setting that uses a predictive algorithm to inform decision-making, Guerdan said. Examples include education, where schools can deploy early warning systems to identify at-risk students, and healthcare, where clinicians may use decision support tools to identify high-risk patients.
As a next step, the researchers are building on the work described in these FAccT papers to develop practical tools for AI developers. These include tools for conducting more reliable and informative evaluations of human versus AI decision-making, as well as tools to help developers make more informed design decisions about their models.
“The question is: What should a model predict?” Guerdan said. “People assume that a decision is the same as a prediction, and that if we can predict something more accurately, that's going to make the decision better.
“But often, people care about multiple different factors,” Guerdan emphasized. “So, the question is, how can model developers holistically account for all the different factors that humans care about when they are making decisions about the design, evaluation, and legitimacy of a predictive model?”