Research Explores How a Wikipedia-Like Approach Could Improve AI Evaluation
Whenever you use Wikipedia, chances are that the article you are reading has been revised at least a dozen times. Editors with diverse backgrounds and knowledge review all the changes to the crowdsourced encyclopedia with the help of artificial intelligence (AI) tools. These volunteers also assess the AI tools to ensure the reliability of the seventh most visited website in the world.
Ongoing work from the Northwestern Center for Advancing Safety of Machine Intelligence (CASMI) explores how a similar approach could help to improve the way AI systems are evaluated. The research team — Tzu-Sheng Kuo, PhD student in the Carnegie Mellon University (CMU) Human-Computer Interaction Institute (HCII); Aaron Halfaker, Microsoft principal applied research scientist; Zirui Cheng, undergraduate student at Tsinghua University; Jiwoo Kim, undergraduate student at Columbia University; Meng-Hsin Wu, MS student at the CMU HCII; Tongshuang Wu, assistant professor at the CMU HCII; Kenneth Holstein, assistant professor at the CMU HCII; and Haiyi Zhu, associate professor at the CMU HCII — developed a tool that supports community-driven data curation for AI evaluation.
The work is detailed in the research paper, “Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia,” and will be presented at the Association for Computing Machinery’s Conference on Human Factors in Computing Systems (ACM CHI), May 11-16 in Honolulu, Hawaii.
In their research, the team focused on improving the evaluation of AI systems that are currently used on Wikipedia. Since there are more than 6.8 million articles in the English Wikipedia containing more than 4.5 billion words, the editors who patrol the site rely heavily on AI tools to maintain the site’s accuracy, neutrality, and completeness.
“[Wikipedia editors] have already been auditing [the performance of] AI systems on Wikipedia in an ad hoc way, but Wikibench makes this process easier and provides more scaffolding for the labeling and discussion surrounding this process,” Kuo said.
What Is Wikibench?
The research team developed Wikibench, a tool that allows editors from different Wikipedia language communities to collaboratively curate AI evaluation datasets that reflect their communities’ needs and values. Crucially, Wikibench is designed so that editors can easily communicate with one another and capture levels of consensus, disagreement, and uncertainty.
“One of the unique contributions of our work is that our system was designed to be seamlessly integrated into the editors’ everyday workflow,” Zhu said.
“For this research, we’re focusing on AI tools that support Wikipedia editors in content moderation, specifically for vandalism detection,” Holstein said. “This project is helping to bring in the communities of editors in shaping AI evaluations, rather than these being shaped by external experts.”
Wikibench has three main features (called interfaces) that Wikipedia editors can use. The plug-in allows them to select and label new data points while they patrol edits. For example, editors might mark an edit as “damaging” or “not damaging” to an article’s quality. When an editor submits a label, one of two boxes appears: either a green box confirming that the submission was received, or a yellow box noting that the submission differs from the majority opinion.
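To make that feedback loop concrete, here is a minimal sketch in Python of how a submitted label might be compared against the current majority opinion to decide which box to show. The function and variable names (`majority_label`, `submit_label`) are hypothetical; the article does not describe Wikibench’s actual implementation.

```python
from collections import Counter

def majority_label(labels):
    """Return the most common existing label, or None if there are no labels yet."""
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]

def submit_label(existing_labels, new_label):
    """Record a new label and report whether it agrees with the current majority.

    The return value stands in for Wikibench's green box (submission received)
    or yellow box (submission differs from the majority opinion).
    """
    majority = majority_label(existing_labels)
    existing_labels.append(new_label)
    if majority is None or new_label == majority:
        return "green: submission received"
    return "yellow: your label differs from the current majority opinion"

# Example: three editors have already labeled this edit.
labels = ["damaging", "damaging", "not damaging"]
print(submit_label(labels, "not damaging"))  # yellow: differs from the majority
```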
Discussions happen on the entity page, where Wikipedia editors can also label data points already in the dataset and revise labels. This page shows previous edits and the consensus on each (for example, whether an edit was damaging and whether it was made in good or bad faith). It also shows editors their own label alongside a breakdown of how many other editors judged the edit damaging or not damaging, how many judged the intent to be good or bad faith, and any individual comments.
The third feature, the campaign page, lets users select data points already in the dataset and discuss the overall data curation process. The campaign page has bar charts showing the percentages behind majority opinions (for example, 52% labeling user intent as good faith, and 38% labeling edits as not damaging), as well as a table showing how every data point was labeled. The table also has buttons that let Wikipedia editors organize the data.
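As a rough illustration of that campaign-level view, the sketch below aggregates per-editor labels for each data point into percentages of the kind the bar charts display. It assumes the percentages reflect the share of data points whose consensus label matches a given value; the data model and names (`dataset`, `consensus`, `campaign_percentages`) are invented for illustration, not taken from the tool.

```python
from collections import Counter

# Hypothetical dataset: each data point is one patrolled edit with the labels
# individual editors assigned along two dimensions ("damaging" and "intent").
dataset = [
    {"edit_id": 101, "damaging": ["yes", "no", "no"], "intent": ["good faith"] * 3},
    {"edit_id": 102, "damaging": ["yes", "yes"], "intent": ["bad faith", "good faith"]},
    {"edit_id": 103, "damaging": ["no"], "intent": ["good faith"]},
]

def consensus(labels):
    """Majority label for one data point, or "no consensus" on a tie."""
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "no consensus"
    return counts[0][0]

def campaign_percentages(dataset, dimension, value):
    """Percentage of data points whose consensus label for `dimension` equals `value`."""
    matches = sum(1 for point in dataset if consensus(point[dimension]) == value)
    return 100 * matches / len(dataset)

print(campaign_percentages(dataset, "intent", "good faith"))  # share judged good faith
print(campaign_percentages(dataset, "damaging", "no"))        # share judged not damaging
```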
The research team interviewed 12 Wikipedia editors as part of their study. The feedback on how Wikibench works was positive.
“People find it very useful to have different people with different expertise to work together on identifying and labeling these data,” Kuo said. “Otherwise, this process is usually done independently. Because of this collaborative process, they're able to learn from one another’s experiences. For example, someone said that it allows them to discover things they might not have thought of. It also allows the community to establish better consensus and focus on what might require more care.”
How This Approach Could Be Applied Beyond Wikipedia
CASMI researchers plan to attend a Wiki Workshop to discuss their findings with the Wikimedia Foundation, which hosts Wikipedia, and to encourage a community-driven approach to AI evaluation.
Wikipedia has a long history of using AI tools to edit pages. However, prior research has shown that one early AI tool designed to reject low-quality edits tended to treat newcomers’ contributions as low quality.
“One of our collaborators who has studied this for a long time noticed that new people left Wikipedia because they were just not treated well by the community,” Zhu said. “AI did play a role in that, as it frequently misclassified new people’s edits as damaging. Our work is asking, how can we consider the community's needs, values, and perspectives and anticipate the impacts of AI on these communities? We want to consider them early on in the design stages.”
The research team is exploring how to extend this research to other websites. For example, they said that Reddit and Stack Overflow already support user collaboration, so those sites could benefit from empowering their users to curate data. Ultimately, the goal is to support effective AI-augmented decision-making that is responsive to the needs of different community contexts.
“Longer term, we’re interested in changing the way that people think about how AI evaluation and data curation should be done,” Holstein said. “This work provides a case study of how this could work in one online community. Maybe it can work in other places as well.”