OpenAI’s Strawberry: A Step Towards Advanced Reasoning in AI
OpenAI is signaling a step forward in its ambitious five-step plan to achieve artificial general intelligence (AGI) with a new reasoning technology under the code name “Strawberry.” Strawberry is designed to extend conversational models with reasoning that goes beyond the “predict the next word” behavior of ChatGPT.
The goal is to enable reasoning capable of problem-solving and planning, not merely responding to text. The idea is to give organizations systems that go beyond parroting their documents and the current prompt, and instead think through a problem even when the information needed is not directly presented.
For example, investment groups might leverage Strawberry to look at a startup opportunity, evaluate the industry and market size, dive deep into the skill set of the team, and estimate the effort required to launch a successful product. This is a task that, right now, demands human guidance, but the aspiration is that this could be addressed directly by Strawberry.
As exciting as this is, I continue to be surprised by the absence of any direct comparison to human cognitive skills in these discussions of reasoning. Human cognition is marvelously rich and, although we do not have a complete model of it, can provide us with some insights as to how to build AI equivalents. But the current AI discourse often ignores the fundamental processes of our own intelligence, even as we are building systems that are inspired by it.
In particular, it is striking that there is almost no discussion of the ideas related to Type 1 and Type 2 reasoning—what psychologist Daniel Kahneman referred to colloquially as fast and slow thinking.
Type 1 reasoning is swift and instinctual, much like how current language models operate. They excel at rapid responses to collections of features that might not even be identifiable. But human reasoning is not just about quick answers; it also involves Type 2 reasoning: a slower, more deliberate thought process. The kind of thinking we use when we analyze and strategize.
One of the more interesting aspects of these two modes of thought is that we tend to use Type 2 reasoning to both debug and train ourselves. For example, if you decide to learn a new language, your Type 2 side might put together a schedule of exercises, generate flash cards, and walk through vocabulary and conversational interactions. All of this is in service of training your Type 1 system by providing examples that will build up the appropriate reflexive responses.
Because of how generative pre-trained transformers (GPTs) and other large language models (LLMs) function, the focus has always been on the power of what is essentially Type 1 reasoning: reflexive, responsive, and well into the realm of the intuitive. Even when other capabilities are added, the focus remains on the final system rather than the thinking that went into training it.
For instance, the initial release of ChatGPT struggled with multi-hop questions: questions that require answering one part of the question in order to answer the other. Consider “Who was president when George Clooney was born?” To answer this, you first need to answer the question of when George Clooney was born (May 6, 1961) and then use that date to answer the real question of who was president at the time (John F. Kennedy). Unfortunately, ChatGPT would just respond to the text and, given that there is not a lot of text linking George Clooney bios to various presidents, its guesses were mediocre at best. It was not trained to 1) recognize multi-hop questions or 2) respond to them by generating known components first so that it could use the answers as part of its own input.
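To make the decomposition concrete, here is a minimal sketch of how the two hops can be made explicit at inference time. It assumes the OpenAI Python SDK's chat-completions interface; the model name, the prompts, and the ask() helper are illustrative, not anything OpenAI has described for Strawberry.

```python
# Minimal sketch of multi-hop decomposition: answer the sub-question first,
# then feed that answer back in as context for the real question.
# Assumes the OpenAI Python SDK; model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()

def ask(question: str) -> str:
    """Send a single question to the model and return its text reply."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

# Hop 1: resolve the known component of the question.
birth_date = ask("When was George Clooney born? Answer with just the date.")

# Hop 2: use that answer as part of the input to the real question.
president = ask(f"Who was president of the United States on {birth_date}?")

print(president)
```

The point of the sketch is simply that the intermediate answer becomes part of the model's own input, which is exactly the behavior the original system lacked.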
Of course, OpenAI responded to this problem by putting together a training corpus that allowed post-ChatGPT-3.5 models to recognize multi-hop questions and respond to them by generating one component in service of the other.
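As a purely hypothetical illustration (OpenAI has not published this data), training examples that teach this behavior might look like question-answer pairs in which the target answer spells out the intermediate hop before giving the final one:

```python
# Hypothetical illustration (not OpenAI's actual corpus) of fine-tuning records
# that teach multi-hop decomposition: the completion states the intermediate
# answer before the final one.
training_examples = [
    {
        "prompt": "Who was president when George Clooney was born?",
        "completion": (
            "George Clooney was born on May 6, 1961. "
            "The president of the United States on that date was John F. Kennedy."
        ),
    },
    # ... more (question, decomposed-answer) pairs covering other multi-hop patterns
]
```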
The current version gives you:
[ChatGPT transcript image omitted.]
If you ask it not to show the answer to the first question, it still gets quite confused:
[ChatGPT transcript image omitted.]
At the end of the process, ChatGPT was able to do a much better job on these questions. But for me, the question is: what was the reasoning that went into unpacking the problem, deciding on a solution, and then collecting and curating the right training data? That is, what did the Type 2 reasoning look like, and how can we draw that sort of reasoning into these systems so that they can train themselves? Unless we consider this, we may still benefit from the Type 1 systems that result from the work, but why not strive to bring the thoughtful, more mindful consideration of what needs to be fixed and how to fix it into play as well?
Strawberry could push AI beyond just mimicking human language into realms of thoughtful analysis. The challenge, however, isn’t purely scientific—it’s deeply human. Perhaps Strawberry will be a step towards an AI that doesn’t just answer questions but contemplates them, unpacks them, and then merges multiple functionalities and data sources into a cohesive whole to solve them. The pivot point will be when machine learning models begin to teach themselves how to think, and that’s when we’ll truly be entering the realm of AGI.
Kristian Hammond
Bill and Cathy Osborn Professor of Computer Science
Director of the Center for Advancing Safety of Machine Intelligence (CASMI)
Director of the Master of Science in Artificial Intelligence (MSAI) Program