A new research paper from Apple has created a buzz in the technology sector by revealing that artificial intelligence (AI) models may not be as smart as they are often believed to be. The study, which was carried out by Apple’s leading scientists, questions the real reasoning power of well-known AI systems and suggests that their abilities are often overstated.
AI Struggles with Complex Reasoning
Apple’s research team decided to move away from traditional maths and coding tests that AI models frequently encounter. Instead, they chose classic puzzle games such as Tower of Hanoi, River Crossing, and Blocks World. These puzzles are easy to understand but become much harder as the number of steps required increases. The aim was to see if AI models could truly think ahead and solve unfamiliar, complex problems in the way a human might.
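To give a sense of how quickly these puzzles scale, consider Tower of Hanoi: the minimum number of moves needed to solve it with n disks is 2^n − 1, so the solution length roughly doubles with every extra disk. The short sketch below is illustrative only and is not taken from Apple's paper; it simply makes that growth concrete.

```python
# Illustrative sketch: the minimum number of moves for Tower of Hanoi
# with n disks is 2**n - 1, so the required plan length grows
# exponentially with puzzle size.
for n in range(1, 11):
    print(f"{n:2d} disks -> {2**n - 1:5d} moves")
```

With 10 disks the shortest solution already runs to 1,023 moves, which is why a puzzle that is trivial to state becomes a demanding multi-step planning problem.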
The findings were striking. When the puzzles were simple, the AI models performed well, solving them quickly and accurately. This showed that the models are strong at recognising patterns and recalling solutions they have seen before. However, as soon as the puzzles became more complicated, the performance of the AI systems dropped sharply. Even the most advanced models could not solve the tougher puzzles, and their accuracy fell to almost zero.
Even when the researchers provided the correct method or algorithm for solving the puzzles, the AI models did not improve. This suggested that the AI systems do not actually understand the problems or develop genuine reasoning skills. Instead, they rely mainly on memorising patterns and examples from their training data.
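For context, the procedure in question can be remarkably compact. The sketch below shows the standard textbook recursion for Tower of Hanoi, purely as an illustration of the kind of explicit algorithm involved; it is not the exact prompt the researchers supplied, which this article does not reproduce.

```python
def hanoi(n, source, target, spare, moves):
    """Standard recursive Tower of Hanoi: move n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the top n-1 disks onto the spare peg
    moves.append((source, target))              # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)  # stack the n-1 disks back on top of it

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves), moves)  # 7 moves, i.e. 2**3 - 1
```

A conventional program executes this procedure flawlessly at any size; the study's point is that the AI models still broke down on larger instances even with such a method spelled out for them.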
Key Limitations of Reasoning AI Models
Apple’s recent research has brought several critical limitations of current reasoning AI models to the forefront. The most prominent issue is that these models rapidly lose accuracy as problems grow more complex. While they perform well on simple tasks, their accuracy collapses to zero on high-complexity challenges, even when they are given ample computational resources and clear instructions.
A major finding is that these AI models do not genuinely reason; instead, they rely heavily on patterns memorised from their training data. When questions are reworded or irrelevant information is added, their performance drops sharply. This sensitivity to context and vulnerability to distraction highlight that the models struggle to separate important details from noise, unlike humans, who can adapt and filter out irrelevant information.
The study also reveals that, rather than following logical steps, these models tend to match patterns they have seen before. When faced with unfamiliar or more complex puzzles, such as Tower of Hanoi or River Crossing, their reasoning does not scale: accuracy declines as tasks become harder, even when they are given the correct solution algorithm. This counterintuitive scaling limit means that their reasoning effort increases only up to a certain point, after which it declines even though the models still have capacity to process more information.
Another limitation is inconsistency. The models often fail to generalise reasoning across different types of problems and may stick to incorrect early answers, showing poor adaptability in multi-step or multi-turn tasks. In some cases, they also exhibit “overthinking,” where they continue unnecessary reasoning even after reaching a correct solution, which can lead to errors and less effective problem-solving.
Implications for the Future of AI
Apple’s study also points out a major problem with how AI is currently tested. Many standard benchmarks and exams are too simple and do not show the real weaknesses of AI models. As a result, people may get the wrong impression that these systems are much smarter than they really are.
The researchers warn that, despite the rapid progress in AI, the technology is still far from achieving human-like intelligence. Today’s AI models can handle simple and familiar tasks, but they struggle with new and complex challenges that require true reasoning and adaptability.