Apple’s AI Study Reveals the Truth About Reasoning in Large Language Models

Overview

Apple’s latest research into artificial intelligence models has put the spotlight on a critical question—do AI models actually reason, or do they merely mimic reasoning through pattern recognition? By testing popular models like Claude and DeepSeek on controlled puzzles like Tower of Hanoi and Blocks World, Apple dives deep into the mechanics of what they call “reasoning.” The findings offer an eye-opening look at the limitations and behavior of large language models (LLMs), especially when it comes to tasks requiring step-by-step logic and symbolic reasoning.

The Illusion of Thinking: What Are Reasoning Models?

Traditional AI models operate in a simple question-and-answer format, while newer ones—known as Large Reasoning Models (LRMs)—display a visible chain of thought. These reasoning sequences give the illusion that the model is thinking logically through each step. However, Apple’s research suggests that these sequences might not be genuine reasoning at all. Instead, they could be cleverly stitched-together patterns based on training data, designed to look like deliberate problem-solving.

Designing Experiments That Truly Test Reasoning

To examine how real this step-by-step reasoning is, Apple chose symbolic puzzles commonly used in computer science education: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World. These puzzles allow for difficulty escalation without altering the underlying logic, offering a clean framework to assess AI reasoning across increasing complexity.
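Tower of Hanoi is the clearest example of this property: the rules never change, but the minimum number of moves grows exponentially with the number of discs. The sketch below is a simple illustration of that scaling, not Apple’s benchmark code:

```python
# Why Tower of Hanoi scales cleanly: the rules stay fixed, but the minimum
# solution length grows as 2^n - 1, so difficulty rises just by adding discs.
def min_hanoi_moves(discs: int) -> int:
    """Minimum number of moves to solve Tower of Hanoi with `discs` discs."""
    return 2 ** discs - 1

if __name__ == "__main__":
    for n in (3, 5, 7, 10):
        print(f"{n} discs -> {min_hanoi_moves(n):,} moves")
    # 3 discs -> 7 moves, 5 -> 31, 7 -> 127, 10 -> 1,023
```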

What’s critical is that these environments were built from scratch, eliminating the risk of training data overlap. The performance of models—like Claude 3.7 and DeepSeek R1—was compared in both reasoning and non-reasoning modes under uniform parameters, including a generous 64,000-token budget. Each variation was tested 25 times to suppress randomness, and fine-grained metrics such as pass rate and token usage were meticulously recorded.
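In spirit, the evaluation loop is simple: run each puzzle configuration repeatedly, check every attempt against the puzzle rules, and aggregate pass rate and token usage. The sketch below is a rough approximation of that protocol under stated assumptions; `run_model` and `verify_solution` are hypothetical placeholders, not Apple’s actual harness.

```python
# Rough sketch of the evaluation protocol described above: 25 trials per
# puzzle configuration, scored for pass rate and average token usage.
# `run_model` and `verify_solution` are hypothetical callables supplied
# by the caller; they stand in for the model API and the puzzle checker.
from statistics import mean

def evaluate(run_model, verify_solution, puzzle, trials: int = 25,
             token_budget: int = 64_000) -> dict:
    results = []
    for _ in range(trials):
        output, tokens_used = run_model(puzzle, max_tokens=token_budget)
        results.append((verify_solution(puzzle, output), tokens_used))
    return {
        "pass_rate": sum(ok for ok, _ in results) / trials,
        "avg_tokens": mean(tokens for _, tokens in results),
    }
```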

Findings: The Reasoning Breakdown

Apple’s results painted a complex picture:

  • In simpler problems, non-reasoning models often outperformed their reasoning-enabled counterparts as they avoided wasting tokens on lengthy deliberation.
  • For moderate complexity, reasoning models began to excel but at a heavy cost in token usage.
  • At high complexity levels—like Hanoi with 10 or more discs—all models collapsed. Performance and reasoning effort alike dropped to zero.

Interestingly, the models often reduced their internal reasoning when faced with harder problems, a phenomenon Apple calls a “counterintuitive scaling limit.” For instance, DeepSeek R1 used 15,000 tokens on easier puzzles but only 3,000 on harder ones, even though it still had ample room left in its token budget. This drop correlated directly with reduced accuracy: the models don’t ramp up effort when it is most needed; they effectively shut down.

Execution Is the Bottleneck, Not Logic

Apple even provided the models with full solution algorithms to see if just following instructions made things better. It didn’t. Claude 3.7 Thinking could only complete around 100 out of 1,023 required steps in a 10-disc Hanoi problem before failing. In a river crossing scenario, it broke after just 5 moves—despite needing only 11 for a full solution. This suggests a deep issue with sustained symbolic execution, not memory or capability.
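For context, the algorithm involved is tiny. The sketch below is the textbook recursive solution (not necessarily the exact formulation Apple handed to the models); the point is that 10 discs translate into 1,023 individual moves, every one of which has to be executed without a slip.

```python
# The standard recursive Tower of Hanoi algorithm. For 10 discs it yields
# exactly 2^10 - 1 = 1,023 moves, the full sequence a model would have to
# carry out faithfully even when given the algorithm up front.
def hanoi_moves(n: int, source="A", target="C", spare="B"):
    """Yield (disc, from_peg, to_peg) moves that solve an n-disc puzzle."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, source, spare, target)
    yield (n, source, target)          # move the largest remaining disc
    yield from hanoi_moves(n - 1, spare, target, source)

if __name__ == "__main__":
    moves = list(hanoi_moves(10))
    print(len(moves))  # 1023
```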

Expert Reactions: A Divide in Interpretation

Reactions from the AI community were swift and polarized:

  • Gary Marcus called the results “devastating,” underscoring that symbolic reasoning problems like these were solved back in the 1950s with traditional computing logic.
  • Kevin Bryan and others suggested that this may reflect a model design choice, not a limitation—that these models are trained to stop early for efficiency.
  • Simon Willison argued that evaluating LLMs on problems like Hanoi might miss the mark, since they aren’t inherently designed for computational logic.

Even Apple acknowledged the limitations. These puzzles cover only one aspect of human reasoning. Yet, because every move can be objectively measured, they remain valuable tools for testing logical robustness.

The Danger of Pattern-Based Illusions

Another important takeaway came from performance differences on math benchmarks. On familiar datasets like MATH-500, models performed about equally well whether or not reasoning was used. But on newer tests like AIME 24 and 25, reasoning-based models performed better—highlighting that older test answers may have been included in training data, skewing results. Essentially, models succeed when they’ve seen something similar before, but falter when faced with genuine novelty.

Claude 3.7, for instance, could make 100 correct moves in a well-documented 10-disc Hanoi sequence but failed within six steps in river crossing scenarios with less online presence. This underscores the risk of conflating pattern recognition with genuine understanding.

WWDC: A Surprising Shift Away from Deep AI

Apple’s Worldwide Developers Conference (WWDC) surprisingly downplayed AI, opting instead to spotlight iOS design upgrades like its “Liquid Glass” interface. While some AI features—like real-time translation and integrated ChatGPT functionality—were demonstrated, the anticipated groundbreaking Siri update was noticeably absent.

This contrast between Apple’s public product updates and its behind-the-scenes AI research is telling. While Apple might not advertise it, the company is clearly scrutinizing the true capabilities, and the limitations, of today’s AI models.

Conclusion

Apple’s in-depth study has triggered an important industry conversation: Are today’s artificial intelligence models really reasoning, or just imitating it? By using symbolic puzzles and structured tests, Apple showed that most LLMs begin to break down at a predictable complexity point. Whether this limitation stems from design choices, training data gaps, or fundamental architectural flaws is still up for debate. One thing is certain: the glowing promise of reasoning AI needs a reality check. Future models will need better training, longer-context handling, or perhaps an entirely new approach to thinking.

The line between intelligent reasoning and intelligent mimicry is thinner than many realize—and this research leaves us questioning which side current AI really falls on.

