
Outcomes

As CASMI research initiatives progress, we are committed to producing deliverables that researchers and practitioners can use. These outputs will include publications and articles, as well as datasets, open-source code, and other materials.

AI Failure Cards: Understanding and Supporting Grassroots Efforts to Mitigate AI Failures in Homeless Services

Ningjing Tang, Jiayin Zhi, Tzu-Sheng Kuo, Calla Kainaroi, Jeremy J. Northup, Kenneth Holstein, Haiyi Zhu, Hoda Heidari, and Hong Shen
ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) (2024) DOI
In this paper, the authors present AI Failure Cards, a novel method for both improving communities’ understanding of AI failures and for eliciting their current practices and desired strategies for mitigation, with a goal to better support those efforts in the future.

AI Safety: A Domain-Focused Approach to Anticipating Harm

Kristian Hammond, Andrea Lynn Azzo, and Jim Guszcza
Center for Advancing Safety of Machine Intelligence (CASMI), September 2024
CASMI convened a workshop, with support from the Rockefeller Foundation, on June 24-25, 2024, to explore how the needs and requirements associated with the benefits and harms of AI differ across industries. The convening considered AI and its impact from the perspective of human interaction and harm, with the goals of establishing methods for determining the sources of those harms, discovering approaches to mitigating them in the context of current systems, and discussing how to avoid harms when developing new systems.

Anticipating Impacts: Using Large-Scale Scenario Writing to Explore Diverse Implications of Generative AI in the News Environment

Kimon Kieslich, Nicholas Diakopoulos, and Natali Helberger
arXiv (November 2023) DOI
Research on anticipating the impact of generative AI is still in its infancy and mostly limited to the views of technology developers and/or researchers. In this paper, the authors aim to broaden the perspective and capture the expectations of three stakeholder groups (news consumers; technology developers; content creators) about the potential negative impacts of generative AI, as well as mitigation strategies to address these.

Are Vision Transformers Robust to Spurious Correlations?

Soumya Suvra Ghosal and Yixuan Li
International Journal of Computer Vision (IJCV) (2023) DOI
Deep neural networks may be susceptible to learning spurious correlations that hold on average but not in atypical test samples. With the recent emergence of vision transformer (ViT) models, it remains unexplored how spurious correlations manifest in such architectures. In this paper, the authors systematically investigate the robustness of different transformer architectures to spurious correlations on three challenging benchmark datasets. Their study reveals that, for transformers, larger models and more pre-training data significantly improve robustness to spurious correlations.
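
Robustness to spurious correlations is often summarized with worst-group accuracy: the accuracy on the (label, spurious attribute) subgroup where the model does worst. The sketch below is a generic illustration of that metric, not the paper's exact evaluation protocol; the array names are hypothetical.

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, group_ids):
    """Return the lowest per-group accuracy, where a group is a
    (label, spurious-attribute) combination encoded in group_ids."""
    accuracies = []
    for g in np.unique(group_ids):
        mask = group_ids == g
        accuracies.append(np.mean(y_pred[mask] == y_true[mask]))
    return min(accuracies)

# Toy example: the model is perfect on group 0 but weak on group 1.
y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0])
groups = np.array([0, 0, 0, 1, 1, 1])
print(worst_group_accuracy(y_true, y_pred, groups))  # ~0.33
```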

Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making

Shubha Guha, Falaah Arif Khan, Julia Stoyanovich, and Sebastian Schelter
International Conference on Data Engineering (ICDE) (April 2023) DOI
In this paper, the authors interrogate whether data quality issues track demographic characteristics such as sex, race and age, and whether automated data cleaning — of the kind commonly used in production ML systems — impacts the fairness of predictions made by these systems. To the best of the researchers' knowledge, the impact of data cleaning on fairness in downstream tasks has not been investigated in the literature.

CASMI White Paper: Toward a Safety Science in Artificial Intelligence

Alexander Einarsson, Andrea Lynn Azzo, and Kristian Hammond
Center for Advancing Safety of Machine Intelligence (CASMI), May 2023
Inspired by discussions and thoughts shared by a diverse group of researchers and stakeholders in a workshop on the topic, this work presents a series of steps for the AI community to define safety based on the potential harms different AI systems may cause. The steps of (1) identification, (2) mapping, (3) quantification, (4) remediation, and (5) prevention will dampen the harms caused by existing AI systems in the short term, while also informing all stakeholders in AI safety on how to minimize harm caused by novel systems in the long term. This work will be a resource for various groups, from developers of AI systems, to legislative groups and bodies, to anyone who has a vested interest in minimizing the harm AI systems cause to the public.

Characterizing Eye Gaze for Assistive Device Control

Larisa Y.C. Loke, Demiana R. Barsoum, Todd D. Murphey, and Brenna D. Argall
Institute of Electrical and Electronics Engineers (IEEE) International Conference on Rehabilitation Robotics (ICORR) (September 2023) DOI
Eye gaze tracking is increasingly popular due to improved technology and availability. However, in assistive device control, eye gaze tracking is often limited to discrete control inputs. In this paper, the authors present a method for collecting both reactionary and control eye gaze signals to build an individualized characterization for eye gaze interface use.

Counterfactual Prediction Under Outcome Measurement Error

Luke Guerdan, Amanda Coston, Kenneth Holstein, and Zhiwei Steven Wu
ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) (June 2023) DOI
In this work, the authors study intersectional threats to model reliability introduced by outcome measurement error, treatment effects, and selection bias from historical decision-making policies. Researchers develop an unbiased risk minimization method which, given knowledge of proxy measurement error properties, corrects for the combined effects of these challenges. They also develop a method for estimating treatment-dependent measurement error parameters when these are unknown in advance. The authors demonstrate the utility of their approach theoretically and via experiments on real-world data from randomized controlled trials conducted in healthcare and employment domains. As importantly, they demonstrate that models correcting for outcome measurement error or treatment effects alone suffer from considerable reliability limitations. This work underscores the importance of considering intersectional threats to model validity during the design and evaluation of predictive models for decision support.
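
As a point of reference for how known measurement-error rates can be used to undo the distortion they introduce, the textbook correction below inverts a noisy binary proxy with known false-positive and false-negative rates. It is an illustration of the underlying idea only, not the authors' unbiased risk-minimization method.

```python
def corrected_positive_rate(observed_rate, fpr, fnr):
    """Classic correction for a noisy binary label: recover the true
    positive rate from the observed proxy rate given known
    false-positive (fpr) and false-negative (fnr) rates."""
    return (observed_rate - fpr) / (1.0 - fpr - fnr)

# If 30% of proxy labels are positive, with fpr = 0.10 and fnr = 0.20,
# the implied true positive rate is (0.30 - 0.10) / 0.70 ≈ 0.286.
print(corrected_positive_rate(0.30, fpr=0.10, fnr=0.20))
```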

A Debiasing Technique for Place-based Algorithmic Patrol Management

Alexander Einarsson, Simen Oestmo, Lester Wollman, Duncan Purves, and Ryan Jenkins
arXiv (December 2023) DOI
In recent years, there has been a revolution in data-driven policing. With that has come scrutiny of how bias in historical data affects algorithmic decision making. In this exploratory work, the authors introduce a debiasing technique for place-based algorithmic patrol management systems. They show that the technique efficiently eliminates racially biased features while retaining high accuracy in the models. Finally, they provide a lengthy list of potential future research directions in fairness and data-driven policing that this work uncovered.
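
The paper's technique is not reproduced here, but the general idea of screening out features that act as proxies for a protected attribute can be illustrated with a simple correlation filter; the threshold and column names below are hypothetical, and real debiasing requires far more care than this sketch.

```python
import pandas as pd

def drop_proxy_features(df: pd.DataFrame, protected_col: str, threshold: float = 0.4):
    """Drop numeric features whose absolute correlation with the (numeric)
    protected attribute exceeds the threshold. A crude illustration of
    proxy screening, not the debiasing technique described in the paper."""
    corrs = df.corr(numeric_only=True)[protected_col].abs()
    proxies = [c for c in corrs.index
               if c != protected_col and corrs[c] > threshold]
    return df.drop(columns=proxies), proxies
```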

Evaluating Metrics for Impact Quantification

Ryan Jenkins and Lorenzo Nericcio
Center for Advancing Safety of Machine Intelligence (July 2023) DOI
This project proposes concrete metrics to assess the human impact of machine learning applications, thus addressing the gap between ethics and quantitative measurement. The authors identify relevant metrics to measure an application’s impact on human flourishing and propose a “Human Impact Scorecard” that can include both qualitative and quantitative metrics. These scorecards allow for comparisons between applications, thus enabling informed decision-making. The authors illustrate this approach by applying it to three real-world case studies.
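
To make the scorecard idea concrete, it could be represented as a small data structure holding both numeric scores and qualitative notes per dimension of human flourishing; the field names below are hypothetical and do not come from the authors' instrument.

```python
from dataclasses import dataclass, field

@dataclass
class HumanImpactScorecard:
    """Hypothetical container for per-dimension impact assessments."""
    application: str
    quantitative: dict = field(default_factory=dict)  # e.g., {"health": 0.7}
    qualitative: dict = field(default_factory=dict)   # e.g., {"health": "shorter wait times"}

    def overall_score(self) -> float:
        """Unweighted mean of the quantitative entries, for rough comparison."""
        return sum(self.quantitative.values()) / max(len(self.quantitative), 1)

card = HumanImpactScorecard(
    application="triage assistant",
    quantitative={"health": 0.7, "autonomy": 0.4},
    qualitative={"autonomy": "patients defer to the tool's ranking"},
)
print(card.overall_score())  # 0.55
```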

 

Example-based Explanations for Random Forests using Machine Unlearning

Tanmay Surve and Romila Pradhan
arXiv (February 2024) DOI
Tree-based machine learning models, such as decision trees and random forests, have been hugely successful in classification tasks, primarily because of their predictive power and ease of interpretation. Despite their popularity and power, these models have been found to produce unexpected or discriminatory outcomes. In this work, the authors introduce FairDebugger, a system that utilizes recent advances in machine unlearning research to identify the training data subsets responsible for instances of fairness violations in the outcomes of a random forest classifier.

'Explanation' is Not a Technical Term: The Problem of Ambiguity in XAI

Leilani H. Gilpin, Andrew R. Paley, Mohammed A. Alam, Sarah Spurlock, and Kristian J. Hammond
arXiv (June 2022) DOI
In this paper, the authors explore the features of explanations and how to use those features in evaluating explanation utility. The focus is on the requirements for explanations defined by their functional role, the knowledge states of users who are trying to understand them, and the availability of the information needed to generate them. Further, the authors discuss the risk of XAI enabling trust in systems without establishing their trustworthiness and define a critical next step for the field of XAI to establish metrics to guide and ground the utility of system-generated explanations.

An External Stability Audit Framework to Test the Validity of Personality Prediction in AI Hiring

Alene K. Rhea, Kelsey Markey, Lauren D’Arinzo, Hilke Schellmann, Mona Sloane, Paul Squires, Falaah Arif Khan, and Julia Stoyanovich
Data Mining and Knowledge Discovery (September 2022) DOI
AI-based automated hiring systems are seeing ever broader use and have become as varied as the traditional hiring practices they augment or replace. In this paper, the authors focus on automated pre-hire assessment systems, among the fastest-developing of all high-stakes uses of AI. The authors interrogate the validity of such systems using the stability of the outputs they produce, and develop a socio-technical framework for auditing system stability. This contribution is supplemented with an open-source software library that implements the technical components of the audit and can be used to conduct similar stability audits of algorithmic systems.
GitHub Repository for System Stability Audit Framework
This is the open-source library developed to support the paper, "An External Stability Audit Framework to Test the Validity of Personality Prediction in AI Hiring." This repository contains the code base used to audit the stability of personality predictions made by two algorithmic hiring systems, Humantic AI and Crystal. The application of this audit framework demonstrates that both listed systems show substantial instability with respect to key facets of measurement, and hence cannot be considered valid testing instruments. 
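
As a rough illustration of what a stability check looks like in code (this is not the repository's API), one can compare the scores a system assigns to the same candidates across two runs, or across two input formats, using a rank correlation; values far from 1 signal instability. The scores below are made up.

```python
from scipy.stats import spearmanr

def rank_stability(scores_run_a, scores_run_b):
    """Spearman rank correlation between two sets of scores produced for
    the same candidates; values near 1 indicate stable rankings."""
    rho, _ = spearmanr(scores_run_a, scores_run_b)
    return rho

# Hypothetical personality scores for five candidates, resume vs. LinkedIn input.
print(rank_stability([0.81, 0.42, 0.65, 0.90, 0.30],
                     [0.35, 0.80, 0.62, 0.41, 0.88]))  # low rho -> unstable
```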

Is Fine-tuning Needed? Pre-trained Language Models Are Near Perfect for Out-of-Domain Detection

Rheeya Uppaal, Junjie Hu, and Yixuan Li
Association for Computational Linguistics (ACL) (2023) DOI
In this work, the authors present a study investigating the efficacy of directly leveraging pre-trained language models for out-of-distribution (OOD) detection, without any model fine-tuning on the in-distribution (ID) data. They compare the approach with several competitive fine-tuning objectives and offer new insights under various types of distributional shifts. The researchers show that, using distance-based detection methods, pre-trained language models are near-perfect OOD detectors when the distribution shift involves a domain change. Furthermore, they study the effect of fine-tuning on OOD detection and identify how to balance ID accuracy with OOD detection performance.
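
A distance-based detector of the kind referenced here can be sketched as follows: fit a Gaussian to in-distribution embeddings from the frozen pre-trained model, then flag inputs whose Mahalanobis distance to that distribution is large. This is a minimal single-distribution sketch under those assumptions, not the paper's full method.

```python
import numpy as np

def fit_id_statistics(id_embeddings: np.ndarray):
    """Estimate the mean and (regularized) inverse covariance of
    in-distribution (ID) embeddings, shape (n_samples, dim)."""
    mu = id_embeddings.mean(axis=0)
    cov = np.cov(id_embeddings, rowvar=False) + 1e-6 * np.eye(id_embeddings.shape[1])
    return mu, np.linalg.inv(cov)

def ood_score(x: np.ndarray, mu: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance of embedding x to the ID distribution;
    larger scores suggest the input is out-of-distribution."""
    diff = x - mu
    return float(np.sqrt(diff @ cov_inv @ diff))
```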

A Framework for the Design and Evaluation of Machine Learning Applications

Northwestern University Machine Learning Impact Initiative, September 2021
The framework document was compiled by Kristian J. Hammond, Ryan Jenkins, Leilani H. Gilpin, and Sarah Spurlock with assistance from Mohammed A. Alam, Alexander Einarsson, Andong L. Li Zhao, Andrew R. Paley, and Marko Sterbentz. The content reflects materials and meetings that were held as part of the Machine Learning Impact Initiative in 2020 and 2021, with the participation of a network of researchers and practitioners.

A Framework for Generating Dangerous Scenes for Testing Robustness

Shengjie Xu, Lan Mi, and Leilani H. Gilpin
Conference and Workshop on Neural Information Processing Systems (NeurIPS) (2022) DOI
In this work, the authors propose a framework for perturbing autonomous vehicle datasets, the DANGER framework, which generates edge-case images on top of current autonomous driving datasets.
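
As a generic illustration of image-level perturbation for stress-testing (not the DANGER framework's generation procedure), one could composite a hazard cut-out onto an existing driving frame at a random road-level position; all file paths and parameters below are hypothetical.

```python
import random
from PIL import Image

def paste_hazard(frame_path: str, patch_path: str, out_path: str, seed: int = 0):
    """Paste a transparent hazard patch (e.g., a debris cut-out) onto a
    driving frame in its lower half. Illustrative perturbation only."""
    rng = random.Random(seed)
    frame = Image.open(frame_path).convert("RGBA")
    patch = Image.open(patch_path).convert("RGBA")
    x = rng.randint(0, max(frame.width - patch.width, 0))
    y = rng.randint(frame.height // 2, max(frame.height - patch.height, frame.height // 2))
    frame.alpha_composite(patch, (x, y))
    frame.convert("RGB").save(out_path)
```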

Ground(less) Truth: A Causal Framework for Proxy Labels in Human-Algorithm Decision-Making

Luke Guerdan, Amanda Coston, Zhiwei Steven Wu, and Kenneth Holstein
ACM FAccT (June 2023) DOI
In this paper, the authors identify five sources of target variable bias that can impact the validity of proxy labels in human-AI decision-making tasks. They develop a causal framework to disentangle the relationship between each bias and clarify which are of concern in specific human-AI decision-making tasks. Researchers demonstrate how their framework can be used to articulate implicit assumptions made in prior modeling work, and they recommend evaluation strategies for verifying whether these assumptions hold in practice. The authors then leverage their framework to re-examine the designs of prior human subjects experiments that investigate human-AI decision-making, finding that only a small fraction of studies examine factors related to target variable bias. They conclude by discussing opportunities to better address target variable bias in future research.

Large Language Models Need Symbolic AI

Kristian Hammond and David Leake
International Workshop on Neural-Symbolic Learning and Reasoning (NeSy2023) (July 2023) DOI
This position paper argues that both over-optimistic views and disappointments reflect misconceptions of the fundamental nature of large language models (LLMs) as language models. As such, they are statistical models of language production and fluency, with associated strengths and limitations; they are not, and should not be expected to be, knowledge models of the world, nor do they reflect the core role of language beyond the statistics: communication.

A Machine Learning Evaluation Framework for Place-based Algorithmic Patrol Management

Duncan Purves and Ryan Jenkins
arXiv (September 2023) DOI
This report aims to provide a comprehensive framework, including many concrete recommendations, for the ethical and responsible development and deployment of place-based algorithmic patrol management (PAPM) systems. Targeting developers, law enforcement agencies, policymakers, and community advocates, the recommendations emphasize collaboration among these stakeholders to address the complex challenges presented by PAPM.

My Future with My Chatbot: A Scenario-Driven, User-Centric Approach to Anticipating AI Impacts

Kimon Kieslich, Natali Helberger, and Nicholas Diakopoulos
arXiv (January 2024) DOI
In this article, the authors leverage scenario writing at scale as a method for anticipating AI impact that is responsive to challenges. Empirically, the authors asked 106 US citizens to write short fictional stories about the future impact (whether desirable or undesirable) of AI-based personal chatbots on individuals and society and, in addition, asked respondents to explain why these impacts are important and how they relate to their values.

The Possibility of Fairness: Revisiting the Impossibility Theorem in Practice

Andrew Bell, Lucius Bynum, Nazarii Drushchak, Tetiana Herasymova, Lucas Rosenblatt, and Julia Stoyanovich
ACM FAccT (June 2023) DOI
The “impossibility theorem” — which is considered foundational in algorithmic fairness literature — asserts that there must be trade-offs between common notions of fairness and performance when fitting statistical models, except in two special cases: when the prevalence of the outcome being predicted is equal across groups, or when a perfectly accurate predictor is used. However, theory does not always translate to practice. In this work, the authors challenge the implications of the impossibility theorem in practical settings.
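
For readers unfamiliar with the result, one standard formalization (Chouldechova, 2017) ties a binary classifier's error rates and positive predictive value to the outcome prevalence $p$ within a group:

```latex
\[
  \mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr)
\]
% Groups with different prevalences p therefore cannot share the same
% FPR, FNR, and PPV unless the classifier is perfect.
```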

Predictive Performance Comparison of Decision Policies Under Confounding

Luke Guerdan, Amanda Coston, Kenneth Holstein, and Zhiwei Steven Wu
arXiv (April 2024) DOI
In this work, the authors propose a method to compare the predictive performance of decision policies under a variety of modern identification approaches from the causal inference and off-policy evaluation literatures (e.g., instrumental variable, marginal sensitivity model, proximal variable). Key to this method is the insight that there are regions of uncertainty that the authors can safely ignore in the policy comparison. They develop a practical approach for finite-sample estimation of regret intervals under no assumptions on the parametric form of the status quo policy. They verify their framework theoretically and via synthetic data experiments.

Purposeful AI

Tianyi Li and Francisco Iacobelli
CSCW '23 Companion: Companion Publication of the 2023 Conference on Computer Supported Cooperative Work and Social Computing (October 2023) DOI
This special interest group (SIG) proposal aims to initiate a multidisciplinary discussion around the design of AI systems that are purposefully targeted to marginalized populations.

Responsible Model Selection with Virny and VirnyView

Denys Herasymuk, Falaah Arif Khan, and Julia Stoyanovich
ACM Special Interest Group on Management of Data (June 2024) DOI
In this demonstration, the authors present a comprehensive software library for model auditing and responsible model selection, called Virny, along with an interactive tool called VirnyView. The library is modular and extensible; it implements a rich set of performance and fairness metrics, including novel metrics that quantify and compare model stability and uncertainty, and enables performance analysis based on multiple sensitive attributes and their intersections.

Searching for the Non-Consequential: Dialectical Activities in HCI and the Limits of Computers

Haoqi Zhang
ACM CHI Conference on Human Factors in Computing Systems (CHI) (May 2024) DOI
This paper examines the pervasiveness of consequentialist thinking in human-computer interaction (HCI) and foregrounds the value of non-consequential, dialectical activities in human life. Dialectical activities are human endeavors in which the value of the activity is intrinsic to itself, including being a good friend or parent, engaging in art-making or music-making, conducting research, and so on. The author argues that computers, the ultimate consequentialist machinery for reliably transforming inputs into outputs, cannot be the be-all and end-all for promoting human values rooted in dialectical activities.

Seeking in Cycles: How Users Leverage Personal Information Ecosystems to Find Mental Health Information

Ashlee Milton, Juan F. Maestre, Abhishek Roy, Rebecca Umbach, and Stevie Chancellor
ACM CHI Conference on Human Factors in Computing Systems (CHI) (May 2024) DOI
This work proposes theoretical implications for social computing and information retrieval on information seeking in users’ personal information ecosystems. The authors offer design implications to support users in navigating personal information ecosystems to find mental health information.

Separating facts and evaluation: motivation, account, and learnings from a novel approach to evaluating the human impacts of machine learning

Ryan Jenkins, Kristian Hammond, Sarah Spurlock, and Leilani Gilpin
AI & Society (March 2022) DOI
In this paper, the authors outline a new method for evaluating the human impact of machine-learning (ML) applications. In partnership with Underwriters Laboratories Inc., the collaborators developed a framework to evaluate the impacts of a particular use of machine learning that is based on the goals and values of the domain in which that application is deployed.

Simulating Policy Impacts: Developing a Generative Scenario Writing Method to Evaluate the Perceived Effects of Regulation

Julia Barnett, Kimon Kieslich, Nicholas Diakopoulos
arXiv (May 2024) DOI
In this work, the authors develop a method for using large language models (LLMs) to evaluate the efficacy of a given piece of policy at mitigating specified negative impacts. They do so by using GPT-4 to generate scenarios both pre- and post-introduction of policy and translating these vivid stories into metrics based on human perceptions of impacts. The authors leverage an already established taxonomy of impacts of generative AI in the media environment to generate a set of scenario pairs both mitigated and non-mitigated by the transparency legislation of Article 50 of the EU AI Act. They then run a user study (n=234) to evaluate these scenarios across four risk-assessment dimensions: severity, plausibility, magnitude, and specificity to vulnerable populations.

The Situate AI Guidebook: Co-Designing a Toolkit to Support Multi-Stakeholder, Early-stage Deliberations Around Public Sector AI Proposals

Anna Kawakami, Amanda Coston, Haiyi Zhu, Hoda Heidari, and Kenneth Holstein
ACM CHI Conference on Human Factors in Computing Systems (CHI) (May 2024) DOI
Through an iterative co-design process, the authors created the Situate AI Guidebook: a structured process centered around a set of deliberation questions to scaffold conversations around (1) goals and intended use for a proposed AI system, (2) societal and legal considerations, (3) data and modeling constraints, and (4) organizational governance factors. The authors discuss how the guidebook’s design is informed by participants’ challenges, needs, and desires for improved deliberation processes. They further elaborate on implications for designing responsible AI toolkits in collaboration with public sector agency stakeholders and opportunities for future work to expand upon the guidebook.

Studying Up Public Sector AI: How Networks of Power Relations Shape Agency Decisions Around AI Design and Use

Anna Kawakami, Amanda Coston, Hoda Heidari, Kenneth Holstein, and Haiyi Zhu
arXiv (May 2024) DOI
As public sector agencies rapidly introduce new AI tools in high-stakes domains like social services, it becomes critical to understand how decisions to adopt these tools are made in practice. The authors borrow from the anthropological practice of "studying up" those in positions of power, and reorient their study of public sector AI around those who have the power and responsibility to make decisions about the role that AI tools will play in their agency. Through semi-structured interviews and design activities with 16 agency decision-makers, the authors examine how decisions about AI design and adoption are influenced by their interactions with and assumptions about other actors within these agencies (e.g., frontline workers and agency leaders), as well as those above (legal systems and contracted companies) and below (impacted communities).

Supporting Information Integration in Human-AI Augmentation via Reflection on Unobservables

Maria De-Arteaga and Kenneth Holstein
Social Science Research Network (SSRN) (July 2024) DOI
In this work, the authors conducted a series of two online experiments to understand whether and how explicitly communicating potentially relevant unobservables influences how people integrate model outputs and unobservables when making predictions. Their findings indicate that presenting prompts about unobservables can change how humans integrate model outputs and unobservables, but do not always lead to improved performance.

A Taxonomy of Human and ML Strengths in Decision-Making to Investigate Human-ML Complementarity

Charvi Rastogi, Liu Leqi, Kenneth Holstein, and Hoda Heidari
AAAI Conference on Human Computation and Crowdsourcing (HCOMP) (November 2023) DOI
The goal in this work is to provide conceptual tools to advance the way researchers reason and communicate about human-ML complementarity. Drawing upon prior literature in human psychology, machine learning, and human-computer interaction, the authors propose a taxonomy characterizing distinct ways in which human and ML-based decision-making can differ.

Toward Supporting Perceptual Complementarity in Human-AI Collaboration via Reflection on Unobservables

Kenneth Holstein, Maria De-Arteaga, Lakshmi Tumati, and Yanghuidi Cheng
ACM on Human-Computer Interaction (April 2023) DOI
In this work, the authors conducted an online experiment to understand whether and how explicitly communicating potentially relevant unobservables influences how people integrate model outputs and unobservables when making predictions. Their findings indicate that presenting prompts about unobservables can change how humans integrate model outputs and unobservables, but do not necessarily lead to improved performance. Furthermore, the impacts of these prompts can vary depending on decision-makers' prior domain expertise. The authors conclude by discussing implications for future research and design of AI-based decision support tools.

 

Training Towards Critical Use: Learning to Situate AI Predictions Relative to Human Knowledge

Anna Kawakami, Luke Guerdan, Yanghuidi Cheng, Kate Glazko, Matthew Lee, Scott Carter, Nikos Arechiga, Haiyi Zhu, and Kenneth Holstein
ACM Collective Intelligence Conference (CI) (November 2023) DOI
In this paper, the authors introduce a process-oriented notion of appropriate reliance called critical use that centers the human’s ability to situate AI predictions against knowledge that is uniquely available to them but unavailable to the AI model. To explore how training can support critical use, the researchers conduct a randomized online experiment in a complex social decision-making setting: child maltreatment screening.

Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia

Tzu-Sheng Kuo, Aaron Lee Halfaker, Zirui Cheng, Jiwoo Kim, Meng-Hsin Wu, Tongshuang Wu, Kenneth Holstein, and Haiyi Zhu
ACM CHI Conference on Human Factors in Computing Systems (CHI) (May 2024) DOI
How might we empower communities to drive the intentional design and curation of evaluation datasets for AI that impacts them? The authors investigate this question on Wikipedia, an online community with multiple AI-based content moderation tools deployed. This work introduces Wikibench, a system that enables communities to collaboratively curate AI evaluation datasets, while navigating ambiguities and differences in perspective through discussion. A field study on Wikipedia shows that datasets curated using Wikibench can effectively capture community consensus, disagreement, and uncertainty. Furthermore, study participants used Wikibench to shape the overall data curation process, including refining label definitions, determining data inclusion criteria, and authoring data statements.

"Working from the Middle Out: A Domain-Level Approach to Evaluating the Human Impacts of Machine Learning"

Ryan Jenkins, Kristian Hammond, Sarah Spurlock, and Leilani Gilpin
Ryan Jenkins presented the paper as part of the AAAI 2022 Spring Symposia Series - Approaches to Ethical Computing: Metrics for Measuring AI’s Proficiency and Competency for Ethical Reasoning virtual meeting hosted by Stanford University, Palo Alto, California, March 21-23, 2022.


 
