For some time now, companies like OpenAI and Google have been touting advanced “reasoning” capabilities as the next big step in their latest artificial intelligence models. Now, though, a new study from six Apple engineers shows that the mathematical “reasoning” displayed by advanced large language models can be extremely brittle and unreliable in the face of seemingly trivial changes to common benchmark problems.
The fragility highlighted in these new results helps support previous research suggesting that LLMs’ use of probabilistic pattern matching lacks the formal understanding of the underlying concepts necessary for truly reliable mathematical reasoning capabilities. “Current LLMs are not capable of genuine logical reasoning,” the researchers argue based on these results. “Instead, they try to replicate the reasoning steps observed in their training data.”
Mix it up
In “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models,” currently available as a preprint, the six Apple researchers start with GSM8K, a standardized set of more than 8,000 grade-school-level math word problems that is often used as a benchmark for the complex reasoning capabilities of modern LLMs. They then take the novel approach of modifying a portion of that test set to dynamically replace certain names and numbers with new values, so that a question about Sophie getting 31 building blocks for her nephew in GSM8K could become a question about Bill getting 19 building blocks for his brother in the new GSM-Symbolic evaluation.
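To make that templating idea concrete, here’s a minimal sketch of how such symbolic substitution might work in practice. This is an illustration rather than the researchers’ actual code; the question template, name pool, and value ranges are all invented for the example.

```python
import random

# A minimal sketch of GSM-Symbolic-style templating (illustrative only, not
# the paper's code): a GSM8K-style question becomes a template whose names
# and numbers are re-sampled for every run.
TEMPLATE = (
    "{name} picks {n1} kiwis on Monday and {n2} kiwis on Tuesday. "
    "How many kiwis does {name} have in total?"
)
NAMES = ["Sophie", "Bill", "Maria", "Omar"]  # hypothetical name pool

def generate_variant(seed: int) -> tuple[str, int]:
    """Instantiate the template with fresh surface details and return the
    question along with its recomputed ground-truth answer."""
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    n1, n2 = rng.randint(5, 50), rng.randint(5, 50)
    question = TEMPLATE.format(name=name, n1=n1, n2=n2)
    # Only surface details change; the reasoning required stays the same.
    answer = n1 + n2
    return question, answer

for seed in range(3):
    question, answer = generate_variant(seed)
    print(f"{question}  -> {answer}")
```

Because the answer is recomputed for every variant, such a benchmark can be regenerated endlessly without any fixed question ever appearing verbatim in training data.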
This approach helps avoid any potential “data contamination” that can result from the static GSM8K questions being fed directly into an AI model’s training data. At the same time, these incidental changes don’t alter the actual difficulty of the underlying mathematical reasoning at all, meaning that, in theory, models should perform just as well when tested on GSM-Symbolic as on GSM8K.
Instead, when the researchers tested more than 20 state-of-the-art LLMs on GSM-Symbolic, they found average accuracy reduced across the board compared to GSM8K, with drops of between 0.3 percent and 9.2 percent, depending on the model. The results also showed high variance across 50 separate runs of GSM-Symbolic with different names and values. Gaps of up to 15 percent accuracy between the best and worst runs were common within a single model and, for some reason, changing the numbers tended to result in worse accuracy than changing the names.
This kind of variance, both within different GSM-Symbolic runs and compared to the GSM8K results, is more than a little surprising since, as the researchers point out, “the overall reasoning steps needed to solve a question remain the same.” The fact that such small changes lead to such variable results suggests to the researchers that these models are not doing any “formal” reasoning but are instead attempting “a kind of in-distribution pattern-matching, aligning given questions and solution steps with similar ones seen in the training data.”
Don’t get distracted
Still, the overall variance shown for the GSM-Symbolic tests was often relatively small in the grand scheme of things. OpenAI’s ChatGPT-4o, for instance, dropped from 95.2 percent accuracy on GSM8K to a still-impressive 94.9 percent on GSM-Symbolic. That’s a pretty high success rate on either benchmark, regardless of whether or not the model itself is using “formal” reasoning behind the scenes (though total accuracy for many models dropped precipitously when the researchers added just one or two additional logical steps to the problems).
The tested LLMs fared much worse, though, when the Apple researchers modified the GSM-Symbolic benchmark by adding “seemingly relevant but ultimately inconsequential statements” to the questions. For this “GSM-NoOp” (short for “no operation”) benchmark set, a question about how many kiwis someone picks across multiple days might be modified to include the incidental detail that “five of them [the kiwis] were a bit smaller than average.”
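Continuing the sketch from above (again with invented wording, not the paper’s actual generation pipeline), a GSM-NoOp-style distractor can be spliced into a question without changing its ground-truth answer:

```python
# Continuing the sketch: append a clause that sounds relevant but has no
# bearing on the arithmetic (invented wording, not the paper's pipeline).
DISTRACTOR = "Five of the kiwis were a bit smaller than average."

def add_noop_clause(question: str, distractor: str) -> str:
    """Insert an inconsequential statement just before the final question
    sentence; the correct answer is unchanged."""
    body, _, final_question = question.rpartition(". ")
    return f"{body}. {distractor} {final_question}"

base = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. "
        "How many kiwis does Oliver have in total?")
print(add_noop_clause(base, DISTRACTOR))
# -> Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. Five of the
#    kiwis were a bit smaller than average. How many kiwis does Oliver have
#    in total?
```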
Adding in these red herrings led to what the researchers termed “catastrophic performance drops” in accuracy compared to GSM8K, ranging from 17.5 percent to a whopping 65.7 percent, depending on the model tested. These massive drops in accuracy highlight the inherent limits of using simple “pattern matching” to “convert statements to operations without truly understanding their meaning,” the researchers write.