Outcomes

As CASMI research initiatives progress, we are committed to producing deliverables that can be utilized by researchers and practitioners. These outputs will include publications and articles, as well as data sets, open source code, and other materials.

A Machine Learning Evaluation Framework for Place-based Algorithmic Patrol Management

Duncan Purves and Ryan Jenkins
arXiv (2023) DOI

This report aims to provide a comprehensive framework, including many concrete recommendations, for the ethical and responsible development and deployment of place-based algorithmic patrol management (PAPM) systems. Targeting developers, law enforcement agencies, policymakers, and community advocates, the recommendations emphasize collaboration among these stakeholders to address the complex challenges presented by PAPM.

Large Language Models Need Symbolic AI

Kristian Hammond and David Leake
CEUR Workshop Proceedings (2023) DOI

This position paper argues that both over-optimistic views and disappointments reflect misconceptions of the fundamental nature of large language models (LLMs) as language models. As such, they are statistical models of language production and fluency, with associated strengths and limitations; they are not—and should not be expected to be—knowledge models of the world, nor do they reflect the core role of language beyond the statistics: communication.

Evaluating Metrics for Impact Quantification

Ryan Jenkins and Lorenzo Nericcio
Center for Advancing Safety of Machine Intelligence (2023) DOI

This project proposes concrete metrics to assess the human impact of machine learning applications, addressing the gap between ethics and quantitative measurement. The authors identify relevant metrics to measure an application’s impact on human flourishing and propose a “Human Impact Scorecard” that can include both qualitative and quantitative metrics. These scorecards allow for comparisons between applications, enabling informed decision-making. The authors illustrate this approach by applying it to three real-world case studies.
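To make the scorecard idea concrete, here is a minimal sketch of how such a comparison might be represented in code; the metric names, scoring scale, and example applications are hypothetical illustrations, not taken from the paper.

```python
# Minimal sketch of a "Human Impact Scorecard"-style comparison (illustrative only;
# metric names and the 0-10 scale below are hypothetical, not from the paper).
from dataclasses import dataclass, field

@dataclass
class Scorecard:
    application: str
    quantitative: dict = field(default_factory=dict)  # metric name -> numeric score (0-10)
    qualitative: dict = field(default_factory=dict)   # metric name -> short written assessment

    def overall(self) -> float:
        """Average of quantitative scores, used only for coarse comparison."""
        return sum(self.quantitative.values()) / len(self.quantitative)

loan_model = Scorecard(
    "loan approval model",
    quantitative={"autonomy": 6, "fairness": 4, "physical_safety": 9},
    qualitative={"community_trust": "mixed; limited recourse for denials"},
)
triage_model = Scorecard(
    "hospital triage model",
    quantitative={"autonomy": 5, "fairness": 7, "physical_safety": 6},
    qualitative={"community_trust": "high, with clinician oversight"},
)

for card in (loan_model, triage_model):
    print(card.application, round(card.overall(), 2), card.qualitative)
```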

Is Fine-tuning Needed? Pre-trained Language Models Are Near Perfect for Out-of-Domain Detection

Rheeya Uppaal, Junjie Hu, and Yixuan Li
Association for Computational Linguistics (ACL) (2023) DOI

In this work, the authors present a study investigating the efficacy of directly leveraging pre-trained language models for out-of-distribution (OOD) detection, without any fine-tuning on the in-distribution (ID) data. They compare this approach with several competitive fine-tuning objectives and offer new insights under various types of distributional shift. The researchers show that, with distance-based detection methods, pre-trained language models are near-perfect OOD detectors when the distribution shift involves a domain change. Furthermore, they study the effect of fine-tuning on OOD detection and identify how to balance ID accuracy with OOD detection performance.
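As a rough illustration of the distance-based detection setup (not the paper's exact method, models, or datasets), the sketch below scores held-out examples by Mahalanobis distance to the in-distribution embedding statistics; random vectors stand in for embeddings from a frozen pre-trained encoder.

```python
# Sketch of a distance-based OOD detector over frozen pre-trained embeddings
# (illustrative; random vectors stand in for real encoder outputs, and Mahalanobis
# distance to the ID mean is just one common choice of distance-based score).
import numpy as np

rng = np.random.default_rng(0)
id_embeddings = rng.normal(0.0, 1.0, size=(500, 32))   # stand-in for ID sentence embeddings
ood_embeddings = rng.normal(3.0, 1.0, size=(100, 32))  # stand-in for a domain-shifted test set

mu = id_embeddings.mean(axis=0)
cov = np.cov(id_embeddings, rowvar=False) + 1e-6 * np.eye(32)  # regularize for invertibility
cov_inv = np.linalg.inv(cov)

def mahalanobis(x):
    diff = x - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

# Higher distance => more likely out-of-distribution; threshold e.g. at the
# 95th percentile of ID scores.
threshold = np.percentile(mahalanobis(id_embeddings), 95)
print("flagged as OOD:", (mahalanobis(ood_embeddings) > threshold).mean())
```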

Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making

Shubha Guha, Falaah Arif Khan, Julia Stoyanovich, and Sebastian Schelter
International Conference on Data Engineering (ICDE) (2023) DOI

In this paper, the authors interrogate whether data quality issues track demographic characteristics such as sex, race and age, and whether automated data cleaning — of the kind commonly used in production ML systems — impacts the fairness of predictions made by these systems. To the best of the researchers' knowledge, the impact of data cleaning on fairness in downstream tasks has not been investigated in the literature.
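The toy example below, which is not the paper's pipeline or data, shows the kind of before-and-after comparison at issue: a simple automated cleaning step (mean imputation) applied to synthetic data with group-dependent missingness, checked against a standard group fairness gap.

```python
# Illustrative check (not the paper's pipeline): compare a group fairness metric
# before and after an automated cleaning step, here mean imputation on synthetic
# data whose missingness is concentrated in one demographic group.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000
group = rng.integers(0, 2, n)                       # hypothetical binary demographic attribute
x = rng.normal(group * 0.5, 1.0, n)[:, None]        # feature correlated with group
y = (x[:, 0] + rng.normal(0, 1, n) > 0.5).astype(int)

x_missing = x.copy()
mask = (group == 1) & (rng.random(n) < 0.4)         # missingness concentrated in group 1
x_missing[mask] = np.nan

def positive_rate_gap(features):
    """Absolute difference in predicted positive rates between the two groups."""
    preds = LogisticRegression().fit(features, y).predict(features)
    return abs(preds[group == 0].mean() - preds[group == 1].mean())

cleaned = SimpleImputer(strategy="mean").fit_transform(x_missing)
print("gap on complete data:      ", round(positive_rate_gap(x), 3))
print("gap after mean imputation: ", round(positive_rate_gap(cleaned), 3))
```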

Ground(less) Truth: A Causal Framework for Proxy Labels in Human-Algorithm Decision-Making

Luke Guerdan, Amanda Coston, Zhiwei Steven Wu, and Kenneth Holstein
ACM FAccT (2023) DOI

In this paper, the authors identify five sources of target variable bias that can impact the validity of proxy labels in human-AI decision-making tasks. They develop a causal framework to disentangle the relationships among these biases and clarify which are of concern in specific human-AI decision-making tasks. The researchers demonstrate how their framework can be used to articulate implicit assumptions made in prior modeling work, and they recommend evaluation strategies for verifying whether these assumptions hold in practice. The authors then leverage their framework to re-examine the designs of prior human subjects experiments that investigate human-AI decision-making, finding that only a small fraction of studies examine factors related to target variable bias. They conclude by discussing opportunities to better address target variable bias in future research.

Counterfactual Prediction Under Outcome Measurement Error

Luke Guerdan, Amanda Coston, Kenneth Holstein, and Zhiwei Steven Wu
ACM FAccT (2023) DOI

In this work, the authors study intersectional threats to model reliability introduced by outcome measurement error, treatment effects, and selection bias from historical decision-making policies. The researchers develop an unbiased risk minimization method which, given knowledge of proxy measurement error properties, corrects for the combined effects of these challenges. They also develop a method for estimating treatment-dependent measurement error parameters when these are unknown in advance. The authors demonstrate the utility of their approach theoretically and via experiments on real-world data from randomized controlled trials conducted in the healthcare and employment domains. Just as importantly, they demonstrate that models correcting for outcome measurement error or treatment effects alone suffer from considerable reliability limitations. This work underscores the importance of considering intersectional threats to model validity during the design and evaluation of predictive models for decision support.
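As a generic illustration of loss correction under known measurement error (not the authors' estimator, which additionally handles treatment effects and selection bias), the sketch below builds an unbiased surrogate for the loss on the unobserved true outcome from a noisy proxy label with known flip rates.

```python
# Generic illustration (not the authors' method): with known flip rates for a noisy
# proxy of a binary outcome, a corrected loss can be made unbiased for the loss on
# the unobserved true outcome (classic label-noise loss correction).
def corrected_loss(loss, pred, y_obs, rho1, rho0):
    """Unbiased surrogate for loss(pred, y_true) given only the proxy label y_obs.

    rho1 = P(observe 0 | true 1), rho0 = P(observe 1 | true 0); requires rho1 + rho0 < 1.
    """
    other = 1 - y_obs
    rho_obs, rho_other = (rho1, rho0) if y_obs == 1 else (rho0, rho1)
    return ((1 - rho_other) * loss(pred, y_obs) - rho_obs * loss(pred, other)) / (1 - rho1 - rho0)

# Quick check of unbiasedness: average over the proxy's noise for a true label of 1.
sq = lambda p, y: (p - y) ** 2
pred, rho1, rho0 = 0.7, 0.2, 0.1
expected = (1 - rho1) * corrected_loss(sq, pred, 1, rho1, rho0) + rho1 * corrected_loss(sq, pred, 0, rho1, rho0)
print(round(expected, 6), round(sq(pred, 1), 6))  # both ~0.09
```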

Toward Supporting Perceptual Complementarity in Human-AI Collaboration via Reflection on Unobservables

Kenneth Holstein, Maria De-Arteaga, Lakshmi Tumati, and Yanghuidi Cheng
ACM on Human-Computer Interaction (2023) DOI

In this work, the authors conducted an online experiment to understand whether and how explicitly communicating potentially relevant unobservables influences how people integrate model outputs and unobservables when making predictions. Their findings indicate that presenting prompts about unobservables can change how humans integrate model outputs and unobservables, but does not necessarily lead to improved performance. Furthermore, the impacts of these prompts can vary depending on decision-makers' prior domain expertise. The authors conclude by discussing implications for future research and the design of AI-based decision support tools.

CASMI White Paper: Toward a Safety Science in Artificial Intelligence

Alexander Einarsson, Andrea Lynn Azzo, and Kristian Hammond
Center for Advancing Safety of Machine Intelligence (CASMI), May 2023

Inspired by discussions and thoughts shared by a diverse group of researchers and stakeholders in a workshop on the topic, this work presents a series of steps for the AI community to define safety based on the potential harms different AI systems may cause. The steps of 1) identification, 2) mapping, 3) quantification, 4) remediation, and 5) prevention will mitigate the harms caused by existing AI systems in the short term, while also informing all stakeholders in AI safety about how to minimize harm caused by novel systems in the long term. This work will be a resource for various groups, from developers of AI systems, to legislative groups and bodies, to anyone who has a vested interest in minimizing the harm AI systems cause to the public.

The Possibility of Fairness: Revisiting the Impossibility Theorem in Practice

Andrew Bell, Lucius Bynum, Nazarii Drushchak, Tetiana Herasymova, Lucas Rosenblatt, and Julia Stoyanovich
ACM FAccT (2023) DOI

The “impossibility theorem” — which is considered foundational in algorithmic fairness literature — asserts that there must be trade-offs between common notions of fairness and performance when fitting statistical models, except in two special cases: when the prevalence of the outcome being predicted is equal across groups, or when a perfectly accurate predictor is used. However, theory does not always translate to practice. In this work, the authors challenge the implications of the impossibility theorem in practical settings.
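One standard way to state the underlying constraint (a well-known identity in this literature, not quoted from the paper) links a binary classifier's false positive rate, false negative rate, positive predictive value, and the prevalence p of the outcome:

```latex
% Identity relating error rates, predictive value, and prevalence p for a binary classifier:
%   FPR = (p / (1 - p)) * ((1 - PPV) / PPV) * (1 - FNR)
\[
  \mathrm{FPR} \;=\; \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr)
\]
% If two groups share the same FPR, FNR, and PPV, the identity can only hold when
% their prevalences are equal or the classifier makes no errors, which is exactly
% the pair of exceptions noted above.
```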

An External Stability Audit Framework to Test the Validity of Personality Prediction in AI Hiring

Alene K. Rhea, Kelsey Markey, Lauren D’Arinzo, Hilke Schellmann, Mona Sloane, Paul Squires, Falaah Arif Khan, and Julia Stoyanovich
Data Mining and Knowledge Discovery (2022) DOI

AI-based automated hiring systems are seeing ever broader use and have become as varied as the traditional hiring practices they augment or replace. In this paper, the authors focus on automated pre-hire assessment systems, as some of the fastest-developing of all high-stakes uses of AI. The authors interrogate the validity of such systems using the stability of the outputs they produce and develop a socio-technical framework for auditing system stability. This contribution is supplemented with an open-source software library that implements the technical components of the audit and can be used to conduct similar stability audits of algorithmic systems.

GitHub Repository for System Stability Audit Framework

This is the open-source library developed to support the paper, "An External Stability Audit Framework to Test the Validity of Personality Prediction in AI Hiring." The repository contains the code base used to audit the stability of personality predictions made by two algorithmic hiring systems, Humantic AI and Crystal. The application of this audit framework demonstrates that both systems show substantial instability with respect to key facets of measurement and hence cannot be considered valid testing instruments.
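As a sketch of the kind of stability check such an audit relies on (hypothetical data; not the released library's API), one can resubmit identical inputs and compare the returned scores by rank-order correlation:

```python
# Illustrative stability check in the spirit of the audit (synthetic scores; not the
# API of the released library): compare personality scores returned for the same
# candidates across two otherwise-identical submissions.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
run_a = rng.uniform(0, 100, 50)             # e.g. "openness" scores, first submission
run_b = run_a + rng.normal(0, 15, 50)       # same inputs resubmitted, with drift

rho, _ = spearmanr(run_a, run_b)
print(f"rank-order stability (Spearman rho): {rho:.2f}")
# A valid testing instrument should reproduce its own rankings; a low rho across
# re-submissions of identical inputs signals instability.
```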

'Explanation' is Not a Technical Term: The Problem of Ambiguity in XAI

Leilani H. Gilpin, Andrew R. Paley, Mohammed A. Alam, Sarah Spurlock, and Kristian J. Hammond
arXiv (2022) DOI

In this paper, the authors explore the features of explanations and how to use those features in evaluating explanation utility. The focus is on the requirements for explanations defined by their functional role, the knowledge states of users who are trying to understand them, and the availability of the information needed to generate them. Further, the authors discuss the risk of XAI enabling trust in systems without establishing their trustworthiness, and they define a critical next step for the field: establishing metrics to guide and ground the utility of system-generated explanations.


Separating facts and evaluation: motivation, account, and learnings from a novel approach to evaluating the human impacts of machine learning

Ryan Jenkins, Kristian Hammond, Sarah Spurlock, and Leilani Gilpin
AI & Society (2022) DOI

In this paper, the authors outline a new method for evaluating the human impact of machine-learning (ML) applications. In partnership with Underwriters Laboratories Inc., the collaborators developed a framework to evaluate the impacts of a particular use of machine learning that is based on the goals and values of the domain in which that application is deployed.


"Working from the Middle Out: A Domain-Level Approach to Evaluating the Human Impacts of Machine Learning"

Ryan Jenkins, Kristian Hammond, Sarah Spurlock, and Leilani Gilpin

Ryan Jenkins presented the paper as part of the AAAI 2022 Spring Symposia Series symposium "Approaches to Ethical Computing: Metrics for Measuring AI’s Proficiency and Competency for Ethical Reasoning," a virtual meeting hosted by Stanford University, Palo Alto, California, March 21-23, 2022.


A Framework for the Design and Evaluation of Machine Learning Applications

Northwestern University Machine Learning Impact Initiative, September 2021

The framework document was compiled by Kristian J. Hammond, Ryan Jenkins, Leilani H. Gilpin, and Sarah Spurlock with assistance from Mohammed A. Alam, Alexander Einarsson, Andong L. Li Zhao, Andrew R. Paley, and Marko Sterbentz. The content reflects materials and meetings that were held as part of the Machine Learning Impact Initiative in 2020 and 2021, with the participation of a network of researchers and practitioners.
