

As CASMI research initiatives progress, we are committed to producing deliverables that can be utilized by researchers and practitioners. These outputs will include publications and articles, as well as data sets, open source code, and other materials.

Anticipating Impacts: Using Large-Scale Scenario Writing to Explore Diverse Implications of Generative AI in the News Environment

Kimon Kieslich, Nicholas Diakopoulos, and Natali Helberger
arXiv (November 2023) DOI
Research on anticipating the impact of generative AI is still in its infancy and mostly limited to the views of technology developers and/or researchers. In this paper, the authors aim to broaden the perspective and capture the expectations of three stakeholder groups (news consumers; technology developers; content creators) about the potential negative impacts of generative AI, as well as mitigation strategies to address these.

Are Vision Transformers Robust to Spurious Correlations?

Soumya Suvra Ghosal and Yixuan Li
International Journal of Computer Vision (IJCV) (2023) DOI
Deep neural networks may be susceptible to learning spurious correlations that hold on average but not in atypical test samples. With the recent emergence of vision transformer (ViT) models, it remains unexplored how spurious correlations manifest in such architectures. In this paper, the authors systematically investigate the robustness of different transformer architectures to spurious correlations on three challenging benchmark datasets. Their study reveals that for transformers, larger models and more pre-training data significantly improve robustness to spurious correlations.

Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making

Shubha Guha, Falaah Arif Khan, Julia Stoyanovich, and Sebastian Schelter
International Conference on Data Engineering (ICDE) (April 2023) DOI
In this paper, the authors interrogate whether data quality issues track demographic characteristics such as sex, race and age, and whether automated data cleaning — of the kind commonly used in production ML systems — impacts the fairness of predictions made by these systems. To the best of the researchers' knowledge, the impact of data cleaning on fairness in downstream tasks has not been investigated in the literature.

CASMI White Paper: Toward a Safety Science in Artificial Intelligence

Alexander Einarsson, Andrea Lynn Azzo, and Kristian Hammond
Center for Advancing Safety of Machine Intelligence (CASMI), May 2023
Inspired by discussions and thoughts shared by a diverse group of researchers and stakeholders in a workshop on the topic, this work presents a series of steps for the AI community to define safety based on the potential harms different AI systems may cause. The steps of 1) identification, 2) mapping, 3) quantification, 4) remediation, and 5) prevention will dampen the harms caused by existing AI systems in the short term, while also informing all stakeholders in AI safety on how to minimize harm caused by novel systems in the long term. This work will be a resource for various groups, from developers of AI systems, to legislative groups and bodies, to anyone who has a vested interest in minimizing the harm AI systems cause to the public.

Characterizing Eye Gaze for Assistive Device Control

Larisa Y.C. Loke, Demiana R. Barsoum, Todd D. Murphey, and Brenna D. Argall
Institute of Electrical and Electronics Engineers (IEEE) International Conference on Rehabilitation Robotics (ICORR) (September 2023) DOI
Eye gaze tracking is increasingly popular due to improved technology and availability. However, in assistive device control, eye gaze tracking is often limited to discrete control inputs. In this paper, the authors present a method for collecting both reactionary and control eye gaze signals to build an individualized characterization for eye gaze interface use.

Counterfactual Prediction Under Outcome Measurement Error

Luke Guerdan, Amanda Coston, Kenneth Holstein, and Zhiwei Steven Wu
ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) (June 2023) DOI
In this work, the authors study intersectional threats to model reliability introduced by outcome measurement error, treatment effects, and selection bias from historical decision-making policies. They develop an unbiased risk minimization method which, given knowledge of proxy measurement error properties, corrects for the combined effects of these challenges. They also develop a method for estimating treatment-dependent measurement error parameters when these are unknown in advance. The authors demonstrate the utility of their approach theoretically and via experiments on real-world data from randomized controlled trials conducted in healthcare and employment domains. Equally important, they demonstrate that models correcting for outcome measurement error or treatment effects alone suffer from considerable reliability limitations. This work underscores the importance of considering intersectional threats to model validity during the design and evaluation of predictive models for decision support.

A Debiasing Technique for Place-based Algorithmic Patrol Management

Alexander Einarsson, Simen Oestmo, Lester Wollman, Duncan Purves, and Ryan Jenkins
arXiv (December 2023) DOI
In recent years, there has been a revolution in data-driven policing. With that has come scrutiny of how bias in historical data affects algorithmic decision making. In this exploratory work, the authors introduce a debiasing technique for place-based algorithmic patrol management systems. They show that the technique efficiently eliminates racially biased features while retaining high accuracy in the models. Finally, they provide a lengthy list of potential future research directions in fairness and data-driven policing that this work uncovered.

Evaluating Metrics for Impact Quantification

Ryan Jenkins and Lorenzo Nericcio
Center for Advancing Safety of Machine Intelligence (July 2023) DOI
This project proposes concrete metrics to assess the human impact of machine learning applications, thus addressing the gap between ethics and quantitative measurement. The authors identify relevant metrics to measure an application’s impact on human flourishing and propose a “Human Impact Scorecard” that can include both qualitative and quantitative metrics. These scorecards allow for comparisons between applications, thus enabling informed decision-making. The authors illustrate this approach by applying it to three real-world case studies.


Example-based Explanations for Random Forests using Machine Unlearning

Tanmay Surve and Romila Pradhan
arXiv (February 2024) DOI
Tree-based machine learning models, such as decision trees and random forests, have been hugely successful in classification tasks primarily because of their predictive power in supervised learning tasks and ease of interpretation. Despite their popularity and power, these models have been found to produce unexpected or discriminatory outcomes. In this work, the authors introduce FairDebugger, a system that utilizes recent advances in machine unlearning research to identify training data subsets responsible for instances of fairness violations in the outcomes of a random forest classifier.

'Explanation' is Not a Technical Term: The Problem of Ambiguity in XAI

Leilani H. Gilpin, Andrew R. Paley, Mohammed A. Alam, Sarah Spurlock, and Kristian J. Hammond
arXiv (June 2022) DOI
In this paper, the authors explore the features of explanations and how to use those features in evaluating explanation utility. The focus is on the requirements for explanations defined by their functional role, the knowledge states of users who are trying to understand them, and the availability of the information needed to generate them. Further, the authors discuss the risk of XAI enabling trust in systems without establishing their trustworthiness and define a critical next step for the field of XAI to establish metrics to guide and ground the utility of system-generated explanations.

An External Stability Audit Framework to Test the Validity of Personality Prediction in AI Hiring

Alene K. Rhea, Kelsey Markey, Lauren D’Arinzo, Hilke Schellmann, Mona Sloane, Paul Squires, Falaah Arif Khan, and Julia Stoyanovich
Data Mining and Knowledge Discovery (September 2022) DOI
AI-based automated hiring systems are seeing ever broader use and have become as varied as the traditional hiring practices they augment or replace. In this paper, the authors focus on automated pre-hire assessment systems, which are among the fastest-developing of all high-stakes uses of AI. The authors interrogate the validity of such systems using the stability of the outputs they produce and develop a socio-technical framework for auditing system stability. This contribution is supplemented with an open-source software library that implements the technical components of the audit and can be used to conduct similar stability audits of algorithmic systems.
GitHub Repository for System Stability Audit Framework
This is the open-source library developed to support the paper, "An External Stability Audit Framework to Test the Validity of Personality Prediction in AI Hiring." This repository contains the code base used to audit the stability of personality predictions made by two algorithmic hiring systems, Humantic AI and Crystal. The application of this audit framework demonstrates that both listed systems show substantial instability with respect to key facets of measurement, and hence cannot be considered valid testing instruments. 

Is Fine-tuning Needed? Pre-trained Language Models Are Near Perfect for Out-of-Domain Detection

Rheeya Uppaal, Junjie Hu, and Yixuan Li
Association for Computational Linguistics (ACL) (2023) DOI
In this work, the authors investigate the efficacy of directly leveraging pre-trained language models for out-of-distribution (OOD) detection, without any model fine-tuning on the in-distribution (ID) data. They compare the approach with several competitive fine-tuning objectives and offer new insights under various types of distributional shifts. The researchers show that, using distance-based detection methods, pre-trained language models are near-perfect OOD detectors when the distribution shift involves a domain change. Furthermore, they study the effect of fine-tuning on OOD detection and identify how to balance ID accuracy with OOD detection performance.

A Framework for the Design and Evaluation of Machine Learning Applications

Northwestern University Machine Learning Impact Initiative, September 2021
The framework document was compiled by Kristian J. Hammond, Ryan Jenkins, Leilani H. Gilpin, and Sarah Spurlock with assistance from Mohammed A. Alam, Alexander Einarsson, Andong L. Li Zhao, Andrew R. Paley, and Marko Sterbentz. The content reflects materials and meetings that were held as part of the Machine Learning Impact Initiative in 2020 and 2021, with the participation of a network of researchers and practitioners.

A Framework for Generating Dangerous Scenes for Testing Robustness

Shengjie Xu, Lan Mi, and Leilani H. Gilpin
Conference and Workshop on Neural Information Processing Systems (NeurIPS) (2022) DOI
In this work, the authors propose a framework for perturbing autonomous vehicle datasets, the DANGER framework, which generates edge-case images on top of current autonomous driving datasets.

Ground(less) Truth: A Causal Framework for Proxy Labels in Human-Algorithm Decision-Making

Luke Guerdan, Amanda Coston, Zhiwei Steven Wu, and Kenneth Holstein
ACM FAccT (June 2023) DOI
In this paper, the authors identify five sources of target variable bias that can impact the validity of proxy labels in human-AI decision-making tasks. They develop a causal framework to disentangle the relationship between each bias and clarify which are of concern in specific human-AI decision-making tasks. Researchers demonstrate how their framework can be used to articulate implicit assumptions made in prior modeling work, and they recommend evaluation strategies for verifying whether these assumptions hold in practice. The authors then leverage their framework to re-examine the designs of prior human subjects experiments that investigate human-AI decision-making, finding that only a small fraction of studies examine factors related to target variable bias. They conclude by discussing opportunities to better address target variable bias in future research.

Large Language Models Need Symbolic AI

Kristian Hammond and David Leake
International Workshop on Neural-Symbolic Learning and Reasoning (NeSy2023) (July 2023) DOI
This position paper argues that both over-optimistic views and disappointments reflect misconceptions of the fundamental nature of large language models (LLMs) as language models. As such, they are statistical models of language production and fluency, with associated strengths and limitations; they are not, and should not be expected to be, knowledge models of the world, nor do they reflect the core role of language beyond the statistics: communication.

A Machine Learning Evaluation Framework for Place-based Algorithmic Patrol Management

Duncan Purves and Ryan Jenkins
arXiv (September 2023) DOI
This report aims to provide a comprehensive framework, including many concrete recommendations, for the ethical and responsible development and deployment of place-based algorithmic patrol management (PAPM) systems. Targeting developers, law enforcement agencies, policymakers, and community advocates, the recommendations emphasize collaboration among these stakeholders to address the complex challenges presented by PAPM.

My Future with My Chatbot: A Scenario-Driven, User-Centric Approach to Anticipating AI Impacts

Kimon Kieslich, Natali Helberger, and Nicholas Diakopoulos
arXiv (January 2024) DOI
In this article, the authors leverage scenario writing at scale as a method for anticipating AI impact that is responsive to challenges. Empirically, the authors asked 106 U.S. citizens to write short fictional stories about the future impact (whether desirable or undesirable) of AI-based personal chatbots on individuals and society and, in addition, to explain why these impacts are important and how they relate to their values.

The Possibility of Fairness: Revisiting the Impossibility Theorem in Practice

Andrew Bell, Lucius Bynum, Nazarii Drushchak, Tetiana Herasymova, Lucas Rosenblatt, and Julia Stoyanovich
ACM FAccT (June 2023) DOI
The “impossibility theorem” — which is considered foundational in algorithmic fairness literature — asserts that there must be trade-offs between common notions of fairness and performance when fitting statistical models, except in two special cases: when the prevalence of the outcome being predicted is equal across groups, or when a perfectly accurate predictor is used. However, theory does not always translate to practice. In this work, the authors challenge the implications of the impossibility theorem in practical settings.

Purposeful AI

Tianyi Li and Francisco Iacobelli
CSCW '23 Companion: Companion Publication of the 2023 Conference on Computer Supported Cooperative Work and Social Computing (October 2023) DOI
This special interest group (SIG) proposal aims to initiate a multidisciplinary discussion around the design of AI systems that are purposefully targeted to marginalized populations.

Separating facts and evaluation: motivation, account, and learnings from a novel approach to evaluating the human impacts of machine learning

Ryan Jenkins, Kristian Hammond, Sarah Spurlock, and Leilani Gilpin
AI & Society (March 2022) DOI
In this paper, the authors outline a new method for evaluating the human impact of machine-learning (ML) applications. In partnership with Underwriters Laboratories Inc., the collaborators developed a framework to evaluate the impacts of a particular use of machine learning that is based on the goals and values of the domain in which that application is deployed.

A Taxonomy of Human and ML Strengths in Decision-Making to Investigate Human-ML Complementarity

Charvi Rastogi, Liu Leqi, Kenneth Holstein, and Hoda Heidari
AAAI Conference on Human Computation and Crowdsourcing (HCOMP) (November 2023) DOI
The goal in this work is to provide conceptual tools to advance the way researchers reason and communicate about human-ML complementarity. Drawing upon prior literature in human psychology, machine learning, and human-computer interaction, the authors propose a taxonomy characterizing distinct ways in which human and ML-based decision-making can differ.

Toward Supporting Perceptual Complementarity in Human-AI Collaboration via Reflection on Unobservables

Kenneth Holstein, Maria De-Arteaga, Lakshmi Tumati, and Yanghuidi Cheng
ACM on Human-Computer Interaction (April 2023) DOI
In this work, the authors conducted an online experiment to understand whether and how explicitly communicating potentially relevant unobservables influences how people integrate model outputs and unobservables when making predictions. Their findings indicate that presenting prompts about unobservables can change how humans integrate model outputs and unobservables, but does not necessarily lead to improved performance. Furthermore, the impacts of these prompts can vary depending on decision-makers' prior domain expertise. The authors conclude by discussing implications for future research and design of AI-based decision support tools.


Training Towards Critical Use: Learning to Situate AI Predictions Relative to Human Knowledge

Anna Kawakami, Luke Guerdan, Yanghuidi Cheng, Kate Glazko, Matthew Lee, Scott Carter, Nikos Arechiga, Haiyi Zhu, and Kenneth Holstein
ACM Collective Intelligence Conference (CI) (November 2023) DOI
In this paper, the authors introduce a process-oriented notion of appropriate reliance called critical use that centers the human’s ability to situate AI predictions against knowledge that is uniquely available to them but unavailable to the AI model. To explore how training can support critical use, the researchers conduct a randomized online experiment in a complex social decision-making setting: child maltreatment screening.

Working from the Middle Out: A Domain-Level Approach to Evaluating the Human Impacts of Machine Learning

Ryan Jenkins, Kristian Hammond, Sarah Spurlock, and Leilani Gilpin
Ryan Jenkins presented the paper as part of the AAAI 2022 Spring Symposia Series - Approaches to Ethical Computing: Metrics for Measuring AI’s Proficiency and Competency for Ethical Reasoning virtual meeting hosted by Stanford University, Palo Alto, California, March 21-23, 2022.

