
Large Language Models’ Emergent Abilities Are a Mirage


The original version of this story appeared in Quanta Magazine.

Two years ago, in a project called the Beyond the Imitation Game Benchmark, or BIG-bench, 450 researchers compiled a list of 204 tasks designed to test the capabilities of the large language models that power chatbots like ChatGPT. On most tasks, performance improved predictably and smoothly as the models scaled up: the larger the model, the better it got. But on other tasks, the improvement wasn't smooth at all. Performance remained near zero for a while, then abruptly leaped upward. Other studies found similar jumps in ability.

The authors described this as "breakthrough" behavior; other researchers have likened it to a phase transition in physics, such as liquid water freezing into ice. In a paper published in August 2022, researchers noted that these behaviors are not only surprising but unpredictable, and that they should inform evolving conversations around the safety, potential, and risks of AI. They called the abilities "emergent," a word that describes collective behavior that appears only once a system reaches a high level of complexity.

But things may not be so simple. A new paper by a trio of researchers at Stanford University argues that the sudden appearance of these abilities is merely a consequence of the way researchers measure an LLM's performance. The abilities, they argue, are neither unpredictable nor sudden. "The transition is much more predictable than people think," says Sanmi Koyejo, a computer scientist at Stanford and the paper's senior author. "Strong claims about emergence have as much to do with how we choose to measure as with what the models are doing."

We are only now seeing and studying this behavior because these models have only recently grown large enough to exhibit it. Large language models train by analyzing enormous datasets of text – words from online sources, including books, web searches, and Wikipedia – and finding connections between words that often appear together. Size is measured in terms of parameters, roughly analogous to all the ways in which words can be connected. The more parameters, the more connections an LLM can find. GPT-2 had 1.5 billion parameters, while GPT-3.5, the LLM that powers ChatGPT, uses 350 billion. GPT-4, which debuted in March 2023 and now underpins Microsoft Copilot, reportedly uses 1.75 trillion.
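An LLM does not literally tally adjacent word pairs – its "connections" are billions of learned weights – but a toy co-occurrence counter (a hypothetical illustration, not anything from the article or the models themselves) gives a flavor of what "finding connections between words that often appear together" means:

```python
from collections import Counter
from itertools import pairwise  # Python 3.10+

# A tiny made-up corpus standing in for books, web pages, and Wikipedia.
corpus = "the cat sat on the mat and the cat slept on the mat"
words = corpus.split()

# Count how often each ordered pair of adjacent words occurs.
pair_counts = Counter(pairwise(words))

# The most frequent pairs are the strongest word-to-word "connections".
for (first, second), count in pair_counts.most_common(3):
    print(f"{first} -> {second}: {count}")
```

A real model learns vastly richer, longer-range relationships than adjacent pairs, and it is the parameter count that bounds how many such relationships it can capture.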

That rapid growth has brought an astonishing surge in performance and efficacy, and no one disputes that large enough LLMs can complete tasks that smaller models can't, including ones for which they weren't trained. The trio at Stanford who cast emergence as a "mirage" recognize that LLMs become more effective as they scale up; in fact, the added complexity of larger models should make it possible to get better at more difficult and diverse problems. But they argue that whether this improvement looks smooth and predictable or jagged and sharp results from the choice of metric (or even from a paucity of test examples) rather than from the model's inner workings.
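The measurement argument can be sketched numerically. Suppose, purely as an assumption for illustration (the numbers below are invented, not taken from the paper), that a model's per-token accuracy improves smoothly with scale. A metric that awards credit only for a perfectly reproduced multi-token answer will then sit near zero before shooting up, while a per-token metric registers steady progress the whole way:

```python
import numpy as np

# Assumed toy model: per-token accuracy rises smoothly with parameter count.
params = np.logspace(8, 12, num=9)  # 1e8 .. 1e12 parameters
per_token_acc = 1 / (1 + np.exp(-2 * (np.log10(params) - 10)))  # smooth S-curve

answer_len = 10  # tokens in the target answer

# Harsh, all-or-nothing metric: credit only if ALL tokens are correct.
exact_match = per_token_acc ** answer_len

for n, p, em in zip(params, per_token_acc, exact_match):
    print(f"{n:10.0e} params | per-token acc {p:.2f} | exact match {em:.4f}")
```

On the all-or-nothing metric, the same smooth underlying improvement looks like a discontinuous leap – the shape of the "breakthrough" curves in question – even though nothing sudden is happening inside the model.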
