Workshop to Explore Safety Measurement in Artificial Intelligence
Artificial intelligence (AI) has captivated the public’s attention. Its popularity has recently been fueled by powerful generative tools that can draft essays and turn text into images or videos. As companies race to release these AI-based systems, many are asking a vital question: how do we determine how safe these systems are?
To delve into this topic, the Northwestern Center for Advancing Safety of Machine Intelligence (CASMI) is hosting a workshop, “Sociotechnical Approaches to Measurement and Validation for Safety in AI,” on July 18-19. The goal is to bring together experts from different fields to discuss ways to assess AI and to explore methods for quantifying a technology’s reliability.
The motivation for the workshop emerged in January, when CASMI convened interdisciplinary thought leaders for a workshop on how to develop a safety science for the field of artificial intelligence. One recurring theme from the two-day event was the complexity of developing measurement structures for evaluating machine learning (ML) models.
“Our last workshop was held to explore the notion of harm,” said Kristian Hammond, Bill and Cathy Osborn Professor of Computer Science and director of CASMI. “The next step is to start considering how we measure the possible harms, given that some of them are around individuals, groups, and society. This ends up being a challenge, but this workshop is aimed at addressing that challenge and coming up with answers.”
Holistic Approaches to Measuring Safety
Experts say a holistic approach is necessary to measure and validate AI safety.
Computational social scientist Abigail Jacobs, an assistant professor of information at the University of Michigan, is working with CASMI to coordinate the workshop. She studies how the structure and governance of technical systems are fundamentally social. She believes measurement modeling, which identifies the hidden ways that ideas about the world get encoded in data and models, provides a useful framework for governments to understand concepts like fairness in AI systems.
“Focusing on measurement and validation in AI lets us take a holistic approach to reveal how systems are designed, if they even work, and what impacts they can have,” Jacobs said. “Measurement and validation help us uncover the hidden assumptions that get built into AI systems. We can then describe how systems actually work and how harms actually play out. But this is an interdisciplinary problem; we need social scientists to uncover how these technical decisions aren’t just technical.”
Potential harms from AI should not be considered in isolation, according to the National Institute of Standards and Technology’s AI Risk Management Framework. This voluntary set of guidelines details an integrated approach to evaluating AI systems and emphasizes measurement as a key part of that process.
The framework says measuring risk in an AI system should involve independent assessors, internal and external experts, people who use the systems, and impacted communities. Ideally, it says, AI organizations and their employees will be multidisciplinary and diverse.
Lessons Learned from Social Media
Social media platforms were initially released without regulation, which allowed the industry to build systems freely but also led to unintended consequences. As discussed in the documentary “The Social Dilemma,” engineers and executives at social media companies had little foresight that the recommendation algorithms they created could contribute to mental health issues, political polarization, and misinformation.
Kristian Lum, a research associate professor at the University of Chicago Data Science Institute, previously worked at a social media company and currently studies fairness, accountability, and transparency. In a presentation at the January CASMI workshop, Lum discussed how reducing inequality can help improve user experience on social media platforms. Doing so can be challenging, however, because different notions of inequality sometimes conflict with one another. Lum emphasized the need for a system-level approach to evaluating models.
“There’s a tendency that ethical ML can be pigeon-holed into an ML problem, but it’s a whole system problem,” Lum said. “It’s important to have metrics available because intuition does not often match reality.”
“Measurement is really hard,” Lum continued. “It’s hard to agree on what the underlying concept is.”
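To see how different notions of inequality can pull against each other, consider a small illustrative sketch in Python. It is not drawn from Lum’s work; the numbers are invented, and the two metrics shown (demographic parity and equal opportunity) are simply common examples of competing fairness notions.

```python
# Illustrative toy example: when two groups have different base rates of
# qualification, a selection rule that equalizes selection rates
# (demographic parity) can still produce unequal true-positive rates
# (an equal-opportunity gap).

# Hypothetical cohorts: group A has 30/100 qualified people, group B has 60/100.
qualified = {"A": 30, "B": 60}
total = {"A": 100, "B": 100}

# A rule that selects exactly 45 people per group, taking qualified people first.
selected = {"A": 45, "B": 45}
selected_qualified = {g: min(selected[g], qualified[g]) for g in total}

selection_rate = {g: selected[g] / total[g] for g in total}
true_positive_rate = {g: selected_qualified[g] / qualified[g] for g in total}

print(selection_rate)      # {'A': 0.45, 'B': 0.45} -> demographic parity holds
print(true_positive_rate)  # {'A': 1.0, 'B': 0.75}  -> equal opportunity does not
```

Deciding which of these gaps matters more is not purely a modeling question, which is part of why Lum argues for evaluating models at the level of the whole system.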
Safety Engineers Offer Guidance
Developing measurement and validation practices for AI safety doesn’t need to start from scratch. Other fields, such as medicine and engineering, have well-defined approaches that could be adapted. Safety engineers, for example, rely on established processes to quantify a system’s reliability.
Computer scientist Alwyn Goodloe has years of experience researching safety-critical aerospace systems. At the January workshop, he reflected on that experience and how its lessons might inform the development of ML safety practices.
Goodloe said the machine learning community needs a way to create requirements that can be used to design, build, test, and maintain models. Such requirements are known as an actionable specification. The challenge in machine learning, he said, is that the specification is effectively a large, opaque dataset.
“It’s hard to figure out what constitutes a specification,” Goodloe said. “How do I assure a system when I don’t know what it’s supposed to do? How do I know what it’s not supposed to do?”
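As a rough sketch of what an actionable specification could look like in practice, requirements can be written as executable checks that run against a model throughout its lifecycle. This is a hypothetical illustration, not something presented at the workshop, and it assumes a made-up model interface with a single predict(features) method.

```python
# Hypothetical sketch: expressing requirements as executable checks.

def check_output_range(model, inputs, low=0.0, high=1.0):
    """Requirement: every prediction falls inside the allowed range."""
    return all(low <= model.predict(x) <= high for x in inputs)

def check_monotonic_in_feature(model, base_input, feature_index, deltas):
    """Requirement: increasing one feature never lowers the score."""
    scores = []
    for d in sorted(deltas):
        x = list(base_input)
        x[feature_index] += d
        scores.append(model.predict(x))
    return all(a <= b for a, b in zip(scores, scores[1:]))

class StubModel:
    """Hypothetical stand-in model so the checks can be run end to end."""
    def predict(self, features):
        return min(1.0, max(0.0, 0.1 * sum(features)))

model = StubModel()
print(check_output_range(model, [[1, 2], [3, 4]]))              # True
print(check_monotonic_in_feature(model, [1, 2], 0, [0, 1, 2]))  # True
```

Because the checks are code, they can be rerun whenever the model or its training data changes, which is part of what makes a specification actionable across design, testing, and maintenance.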
Safety engineers must demonstrate that systems are designed to do what they are intended to do. They also need to document what has happened at every step of the way, from design to implementation.
“I’m working on ultra-critical systems. We have to show they never fail in the operating lifetime of the system,” Goodloe said. “The question that people like us are going to have to address is: until we can build ML systems that don’t fail, can we use ML in these systems?”
“Sociotechnical Approaches to Measurement and Validation for Safety in AI” is the third CASMI workshop aimed at convening diverse groups of experts and practitioners to explore a critical area of research. To learn more about previous events, visit our website.