
Testing Large Language Models


Testing and ranking large language models (LLMs) has become critical in the fast-paced world of artificial intelligence. These models require rigorous evaluation to ensure they meet high standards of accuracy, fluency, and reasoning. However, as a recent article in The Markup argues, our current approach of using benchmarks to judge AI tools seems to be flawed. 

A major issue with current testing methods is that they don't focus enough on integration: the ability to combine various types of knowledge and reasoning to produce coherent and appropriate responses. True intelligence involves understanding actions, causation, and impacts. It's not enough for a model to simply contain that information; it must integrate it into what it generates.

LLMs, designed to think fast, lack the reflective, adaptive processes humans have. Developers need to create tests that challenge these models and assess how well they integrate new information into what they generate.

Another major challenge is that LLMs tend to learn how to perform well on specific tests without necessarily understanding the abstractions behind them. Knowing that 7 + 5 = 12 is not the same as knowing how math works in general. This specialization makes results less reliable when what is needed is an understanding that goes beyond the specifics.
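
One way to separate memorization from generalization is to probe the same skill with freshly generated instances instead of fixed test items. What follows is a minimal sketch of that idea for simple arithmetic; it assumes a hypothetical ask_model(prompt) function that returns the model's text reply, and it illustrates the probing idea rather than a rigorous evaluation.

```python
# Minimal sketch of a perturbation probe for arithmetic.
# Assumes a hypothetical ask_model(prompt) -> str function (not a real API).
# A model that has merely memorized "7 + 5 = 12" should still answer fresh
# instances of the same pattern if it has the underlying abstraction.

import random
import re

def arithmetic_probe(ask_model, trials: int = 20, seed: int = 0) -> float:
    """Return accuracy on randomly generated two-number addition problems."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        a, b = rng.randint(100, 999), rng.randint(100, 999)
        reply = ask_model(f"What is {a} + {b}? Answer with a number only.")
        # Crude extraction: take the first integer that appears in the reply.
        match = re.search(r"-?\d+", reply.replace(",", ""))
        correct += int(match is not None and int(match.group()) == a + b)
    return correct / trials
```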

Additionally, models often fall prey to recency and framing effects: their answers can shift with the way a question is posed, which can lead to ethical and decision-making problems. In effect, biases in the questions end up creating biases in the answers.
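
A crude way to surface framing effects is to pose the same underlying question under two different framings and check whether the model's answers agree. The sketch below reuses the hypothetical ask_model function from above; the framing pair and the agreement check are deliberately simple placeholders.

```python
# Minimal sketch of a framing-consistency probe.
# Assumes the same hypothetical ask_model(prompt) -> str function.

# Each pair states the same fact two ways; a robust model should not flip
# its answer just because the framing changed.
FRAMING_PAIRS = [
    ("A treatment has a 90% survival rate. Should a patient consider it? Answer yes or no.",
     "A treatment has a 10% mortality rate. Should a patient consider it? Answer yes or no."),
]

def first_word(reply: str) -> str:
    """Crude normalization: the first word of the reply, lowercased."""
    words = reply.strip().lower().split()
    return words[0].strip(".,!") if words else ""

def framing_consistency(ask_model, pairs=FRAMING_PAIRS) -> float:
    """Return the fraction of pairs where both framings get the same answer."""
    agree = sum(
        int(first_word(ask_model(positive)) == first_word(ask_model(negative)))
        for positive, negative in pairs
    )
    return agree / len(pairs)
```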

One notable development in evaluating LLMs is the Massive Multitask Language Understanding (MMLU) benchmark, which poses multiple-choice questions across a wide range of subjects to assess breadth of knowledge and reasoning rather than mere fluency. While fluency is important, true understanding means knowing what to say and when. This is where many current benchmarks fall short, highlighting the need for evaluations that delve deeper than surface-level fluency.
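
For context, MMLU items are multiple-choice questions with four options, and a model's score is the fraction it answers correctly. The sketch below outlines that scoring loop using the same hypothetical ask_model function and a deliberately crude answer-extraction rule; it is not the official evaluation harness.

```python
# Minimal sketch of MMLU-style multiple-choice scoring.
# Assumes the same hypothetical ask_model(prompt) -> str function.

from dataclasses import dataclass

@dataclass
class Item:
    question: str
    choices: list[str]  # four options, presented as A-D
    answer: str         # correct label, e.g. "C"

def score(items: list[Item], ask_model) -> float:
    """Return the fraction of items the model answers correctly."""
    labels = "ABCD"
    correct = 0
    for item in items:
        prompt = (
            item.question + "\n"
            + "\n".join(f"{labels[i]}. {c}" for i, c in enumerate(item.choices))
            + "\nAnswer with a single letter."
        )
        reply = ask_model(prompt).strip().upper()
        # Crude extraction: treat the first character of the reply as the choice.
        correct += int(reply[:1] == item.answer)
    return correct / len(items) if items else 0.0
```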

Testing LLMs is a complex task that demands a comprehensive approach. As we move towards developing artificial general intelligence, it’s crucial to create benchmarks and tests that go beyond fluency. We need to evaluate how well these models integrate knowledge, reason through problems, and make well-reasoned decisions. By tackling these challenges, we can ensure that future LLMs are not just advanced tools but true extensions of human intelligence, capable of both fast and slow thinking. This way, we can build AI that genuinely complements and enhances human capabilities in meaningful ways. 

Kristian Hammond
Bill and Cathy Osborn Professor of Computer Science
Director of the Center for Advancing Safety of Machine Intelligence (CASMI)
Director of the Master of Science in Artificial Intelligence (MSAI) Program
