
Study Finds Simple Method Can Improve Safety for Language Models

Researchers with the Northwestern Center for Advancing Safety of Machine Intelligence (CASMI) have found a simple and effective way to identify unknown and anomalous data for language models, which improves safety and reliability.

The findings were presented at the Association for Computational Linguistics (ACL) conference on July 12 in a paper entitled, “Is Fine-tuning Needed? Pre-trained Language Models Are Near Perfect for Out-of-Domain Detection.” Its authors are affiliated with the University of Wisconsin-Madison: Rheeya Uppaal, PhD student in computer science; Junjie Hu, assistant professor of biostatistics & medical informatics and of computer science; and Sharon Li, assistant professor in computer science and principal investigator of the CASMI-funded project, “Understanding and Reducing Safety Risks of Learning with Large Pre-Trained Models.”

Researchers evaluated eight dataset pairs using RoBERTa, a widely used pre-trained language model. They discovered that fine-tuning – a process in which a model is adjusted for a particular task – was not necessary for out-of-domain detection. In other words, pre-trained language models are already nearly perfect at detecting data that falls outside the domain they were given (for example, movie reviews when the in-domain data is restaurant reviews). The study also found that fine-tuning can worsen out-of-domain detection performance.
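To make the idea concrete, the sketch below shows one common way to do out-of-domain detection with a frozen, pre-trained encoder and no fine-tuning: embed in-domain texts, fit a simple Gaussian over those embeddings, and flag new texts whose Mahalanobis distance from that distribution is large. This is a minimal illustration under those assumptions, not the paper's exact recipe; names such as `embed` and `ood_score` are illustrative.

```python
# Minimal sketch of distance-based out-of-domain detection with a frozen,
# pre-trained RoBERTa encoder (no fine-tuning). In-domain texts define a
# Gaussian in embedding space; a large Mahalanobis distance flags a new
# text as likely out-of-domain.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base").eval()

@torch.no_grad()
def embed(texts):
    """Mean-pool the final hidden states into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    pooled = (hidden * mask).sum(1) / mask.sum(1)         # (B, H)
    return pooled.numpy()

# Fit the in-domain distribution from in-domain examples (restaurant reviews).
in_domain = [
    "The pasta was overcooked but the service was friendly.",
    "Great tacos, terrible parking.",
]
Z = embed(in_domain)
mu = Z.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1]))

def ood_score(text):
    """Higher score = farther from the in-domain distribution."""
    z = embed([text])[0] - mu
    return float(z @ cov_inv @ z)

print(ood_score("The fries were soggy."))             # closer to in-domain
print(ood_score("The cinematography was stunning."))  # likely out-of-domain
```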

“It's very natural to assume that language models adapted to or fine-tuned on downstream tasks can be more useful for out-of-distribution detection, but in this paper, we actually challenge this perspective,” Li said.

This doesn’t mean fine-tuning can be skipped altogether. It’s still needed for any classification task. For example, a model that isn’t fine-tuned might be able to tell the difference between restaurant reviews and movie reviews, but it won’t be able to distinguish positive movie reviews from negative ones. To balance this tradeoff, researchers recommend using a technique called early stopping. 

“You can do this fine-tuning for a little bit, but not excessively,” Li said. “This allows the model to adjust its capability to fit to this downstream task, but not so much that it would lose or significantly deteriorate its out-of-domain detection performance.”
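A minimal sketch of that kind of patience-based early stopping is shown below: fine-tune for a few epochs, watch the validation loss, and stop once it stops improving. The model, dataloaders, and patience value here are placeholder assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of patience-based early stopping during fine-tuning: adapt
# to the downstream task for a limited number of epochs, then stop once the
# validation loss stops improving, keeping the best checkpoint seen so far.
import copy
import torch

def finetune_with_early_stopping(model, train_loader, val_loader,
                                 max_epochs=10, patience=2, lr=2e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    best_loss, best_state, epochs_without_improvement = float("inf"), None, 0

    for epoch in range(max_epochs):
        model.train()
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            loss.backward()
            optimizer.step()

        # Validation pass: check whether fine-tuning is still helping.
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x), y).item() for x, y in val_loader)

        if val_loss < best_loss:
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop fine-tuning early

    model.load_state_dict(best_state)
    return model
```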

The authors of the study say their findings are especially important in high-risk settings such as finance and healthcare.

“There are machine learning models which read patient medical reports before presenting a prognosis or diagnosis,” Uppaal said. “If someone has come into the hospital with rather unusual symptoms, there is a chance that the model will create a diagnosis that is completely wrong, but it might be really confident about its prediction regardless.

“So rather than having the doctor trust the model in isolation and possibly lead to some terrible consequences for the patient, it would be better if the model were to say, ‘Here is a prediction, but I think this data was out-of-distribution, so don't take my prediction too seriously,’” Uppaal continued.

The study found that in most cases, when a pre-trained language model was used without fine-tuning for out-of-distribution detection, there were no false positives. However, with fine-tuning, false positive rates could be as high as 85.1%.

Li said it’s important for everyone to understand the limitations of these technologies. Even large language models like ChatGPT, which the researchers did not study, are prone to making errors.

“Despite all the technical advances we've made, there are still safety considerations that we need to pay attention to and try to understand where they’re capable and where they’re incapable,” Li said. “In situations where they're not supposed to make a certain prediction, do they have that safety feature ready so that they don’t blindly make an error?”

Uppaal said this research furthers crucial work in trying to make language models more reliable and accurate.

“This method provides a really nice way for a model to flag when it can’t be relied upon,” Uppaal said. “Eventually, we want to generate models that, even when encountering vastly different data, can continue to provide accurate and reliable predictions.”
