Empowering Human Knowledge for More Effective AI-Assisted Decision-Making
In today’s world, humans and machines are making decisions together. Doctors may use artificial intelligence (AI) systems to help diagnose diseases. Banks might deploy AI-based tools when deciding whether to approve a loan. While people often rely on knowledge of cause and effect and their ability to reason, machines use statistical inference to make decisions.
Research from the Northwestern Center for Advancing Safety of Machine Intelligence (CASMI) is exploring key differences between human and AI judgment to support more effective human-AI decision-making. The project, called “Supporting Effective AI-Augmented Decision-Making in Social Contexts,” is carried out by researchers from Carnegie Mellon University (CMU), and its principal investigator is Kenneth Holstein, assistant professor at the CMU Human-Computer Interaction Institute (HCII).
“Our field work aims to understand what AI-augmented decision-making actually looks like in real-world contexts,” Holstein said. “What we find, again and again, is that there are a lot of complexities that aren't typically talked about or studied in the academic research on AI-augmented decision-making.”
Holstein traveled with Charvi Rastogi, CMU Machine Learning Department Ph.D. student, and Anna Kawakami, CMU HCII Ph.D. student, on Nov. 6-9 to Delft, the Netherlands, to present the research team’s findings at a joint convening of the Association for the Advancement of Artificial Intelligence (AAAI) Conference on Human Computation and Crowdsourcing (HCOMP) and the Association for Computing Machinery (ACM) Collective Intelligence (CI) Conference.
Framework for AI-Augmented Decision-Making: Asking Why
A main contribution of the research is a framework that helps researchers and practitioners who are implementing human-AI systems ask why and how they are combining human and AI decision-making.
Rastogi and Leqi Liu, Princeton Language and Intelligence postdoctoral researcher, are joint lead authors of the HCOMP-accepted paper, “A Taxonomy of Human and ML Strengths in Decision-Making to Investigate Human-ML Complementarity.” Holstein and Hoda Heidari, CMU assistant professor in ethics and computational technologies, are co-authors.
Rastogi said the research team reviewed prior literature on human-AI decision-making and concluded that there was a clear need for a deeper, more fine-grained understanding of when humans and AI systems perform better working together than either does alone, and why ‒ a concept the researchers call complementary performance.
“We developed a taxonomy that gives researchers and practitioners a language for why they are putting the two agents together and how they expect the complementarities to manifest,” Rastogi said. “This also works as a good sanity check. Why should one expect the human-AI team to perform better in the first place? Does that make any sense, given the abilities of the human and the AI model in the concerned application domain?”
For example, researchers interviewed radiologists who work with AI tools for diagnosing pneumonia. The radiologists said that while they typically detect pneumonia by examining a patient’s medical record and symptoms, the AI tool examines only the chest X-ray, which is unnecessary in most cases.
The taxonomy weighs the strengths and weaknesses of human and AI decision-making and is divided into four parts: task definition (the objective), input (data collection), internal processing (data processing), and output (what is produced). Each part has distinguishing characteristics that favor either the human or the AI model. For example, humans can see and hear things that a machine cannot, but AI models have access to a much larger swath of data. In such cases, it makes sense to implement human-AI systems that combine all of the information.
“It’s intuitive. It makes sense that, if you have information that I don't, I should talk to you before making a decision, but how and when?” Rastogi said. “Our work provides a formalization with simulations and quantifiable metrics to measure how different human-AI collaboration strategies would have worked out before actually implementing them.”
This formalization comes in the form of an optimization framework that produces optimal weights for deciding how much to rely on the human's or the machine’s decisions. The researchers ran simulations to test what these optimal weights look like in different settings.
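To make the idea of reliance weights concrete, here is a minimal illustrative sketch in Python, not the authors’ actual framework: it simulates a human whose judgments are noisy but unbiased and an AI model that is more precise but systematically off for part of the cases, then searches for the weight on each agent’s prediction that minimizes error on the simulated data. The error model and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated ground truth and two imperfect predictors (illustrative only).
# Assumption: the human is noisier overall but unbiased; the AI is more
# precise but systematically biased on a subgroup of cases.
n = 10_000
truth = rng.normal(size=n)
human_pred = truth + rng.normal(scale=1.0, size=n)          # high variance, unbiased
ai_bias = np.where(truth > 1.0, 0.8, 0.0)                   # AI misjudges a subgroup
ai_pred = truth + ai_bias + rng.normal(scale=0.4, size=n)   # low variance, biased

def team_error(w: float) -> float:
    """Mean squared error of the weighted human-AI combination."""
    combined = w * human_pred + (1 - w) * ai_pred
    return float(np.mean((combined - truth) ** 2))

# Grid-search the reliance weight placed on the human's prediction.
weights = np.linspace(0, 1, 101)
errors = [team_error(w) for w in weights]
best_w = weights[int(np.argmin(errors))]

print(f"human alone MSE: {team_error(1.0):.3f}")
print(f"AI alone MSE:    {team_error(0.0):.3f}")
print(f"best weight on human: {best_w:.2f}, team MSE: {min(errors):.3f}")
```

In this toy setup, an intermediate weight beats either agent on its own because their errors come from different sources, which is the kind of complementary performance the taxonomy is meant to help researchers reason about before building a system.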
“This work is a call to action for researchers and practitioners working on human-AI collaboration,” Rastogi said. “How can we be very specific about the differences between human and AI decision-making? There are increasingly many high-stakes application domains in this space, conferring responsibility on researchers to do their due diligence in understanding the sources of complementarity in their application domain and designing towards effective human-AI decision-making.”
Centering Human Expertise When Supporting Learning in High-Stakes AI-Assisted Decisions
Another main contribution of the research is a study that examines how training can support humans in leveraging their own unique knowledge when weighing an AI model’s predictions ‒ a concept called critical use. The study included a randomized online experiment with crowd workers, social work graduate students, and social workers.
The paper, called “Training Towards Critical Use: Learning to Situate AI Predictions Relative to Human Knowledge,” was accepted to the ACM Collective Intelligence CI Conference. Its authors are Anna Kawakami; Luke Guerdan, CMU HCII Ph.D. student; Yanghuidi Cheng, CMU HCII Master’s student; Kate Glazko, University of Washington Ph.D. student; Matthew Lee, Toyota Research Institute (TRI) staff research scientist; Scott Carter, TRI staff research scientist; Nikos Arechiga, TRI senior research scientist; Haiyi Zhu, CMU associate professor of human-computer interaction; and Holstein.
The online experiment simulates an AI-assisted child maltreatment screening task. Past field research from the team has demonstrated that, in this setting, decision-making improves when experienced workers exercise discretion around AI predictions rather than adopting them uncritically. The 354 participants are asked to assess whether they would screen in or screen out a series of cases alleging child maltreatment; each of these practice cases is based on real, historical cases that past social workers encountered. Each case has an AI risk score, which predicts the long-term likelihood of future involvement with child welfare. This is an imperfect proxy for social workers’ underlying decision-making goals, which typically focus on assessing immediate risk and harm to the child. Participants see the AI risk score, along with the case overview and details, before making a screening decision.
The experiment found that participants learned to use the AI risk score more critically through the training. In other words, with repeated practice, they learned to exercise their own judgment, using not only the AI risk score but also qualitative information about a case that the AI model did not have access to. As a result, they came to disagree with the AI risk score more often, in ways that resembled the decision-making of past workers who had months or even years of experience with AI-augmented decision-making. For example, participants agreed with the AI prediction 70% of the time at first; that decreased to 42% through training. Meanwhile, participants’ patterns of disagreement with AI predictions became more aligned with those of past experienced workers, rising from 30% to nearly 70%.
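As a rough illustration of the kinds of metrics reported above (assumed formulations, not the study’s exact definitions), agreement rate can be read as the fraction of cases where a participant’s screening decision matches the AI recommendation, and alignment as the fraction of the participant’s disagreements that match how experienced workers decided the same cases.

```python
# Illustrative metric sketch (assumed definitions, not the study's exact ones).
# Decisions are binary: 1 = screen in, 0 = screen out.
from typing import Sequence

def agreement_rate(participant: Sequence[int], ai: Sequence[int]) -> float:
    """Fraction of cases where the participant matches the AI recommendation."""
    return sum(p == a for p, a in zip(participant, ai)) / len(ai)

def disagreement_alignment(participant: Sequence[int], ai: Sequence[int],
                           experienced: Sequence[int]) -> float:
    """Among cases where the participant disagrees with the AI, the fraction
    where their decision matches that of experienced workers."""
    disagreements = [(p, e) for p, a, e in zip(participant, ai, experienced) if p != a]
    if not disagreements:
        return 0.0
    return sum(p == e for p, e in disagreements) / len(disagreements)

# Tiny worked example with made-up decisions for five cases.
ai          = [1, 1, 0, 1, 0]
experienced = [1, 0, 0, 0, 1]
participant = [1, 0, 0, 0, 0]

print(agreement_rate(participant, ai))                        # 0.6 (agrees on 3 of 5)
print(disagreement_alignment(participant, ai, experienced))   # 1.0 (both disagreements match)
```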
“In prior field research, where we observed social workers using the same AI tool in their actual jobs, we were surprised to learn that social workers weren't provided with meaningful training on the AI tool,” Kawakami said. “They had seen conceptual information about the AI model, such as what it is trained to predict. But they didn't have any opportunities to actually practice making decisions with the AI model before going into the field and using it.”
Kawakami said the research community should move toward measuring critical use because humans have unique expertise that enables them to use AI predictions more critically and responsibly.
“This paper, at a really high level, is just about centering human expertise when evaluating and supporting learning for AI-assisted decision-making,” she said.