Small models can't learn search tasks despite more data

Small transformer models struggle with search tasks on larger graphs even when given unlimited training data and increased parameters, according to new research that challenges scaling assumptions for AI development.

AI network visualization showing transformer models struggling with complex graph search tasks as connections break down

Research accepted at the International Conference on Learning Representations (ICLR) 2025 demonstrates fundamental limitations in transformer architectures when learning search algorithms. According to the study "Transformers Struggle to Learn to Search," published December 6, 2024, with final revisions on March 16, 2025, small transformer models can master search tasks on simple graphs but fail consistently as input complexity increases, regardless of additional training data or model parameters.

The research team from Purdue University, New York University, Google, and Boston University used graph connectivity problems as a controlled testbed to train small transformers with effectively limitless data. Their findings reveal that transformers learn search through parallel computation across vertices, expanding reachable sets exponentially with each layer, but this approach breaks down systematically on larger inputs.

"When given the right training distribution, the transformer is able to learn to search," according to the paper's abstract. However, "as the input graph size increases, the transformer has greater difficulty in learning the task. This difficulty is not resolved even as the number of parameters is increased, suggesting that increasing model scale will not lead to robust search abilities."

The study emerges amid broader questions about AI capabilities and limitations. According to the study, transformers perform search simultaneously across all vertices in input graphs, storing reachable vertex sets in embeddings. Each layer progressively expands these sets, theoretically enabling exponential growth in searchable vertices relative to layer count.

Through mechanistic interpretability analysis, researchers identified this "exponential path-merging algorithm" where embeddings contain information about vertex reachability. The model copies information between source and target vertices, computing unions of reachable sets at each layer. This parallel approach allows searching over a number of vertices exponential in the number of transformer layers.
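
The mechanism can be illustrated with a short, hypothetical sketch in Python. The code below simulates the set-merging idea with plain data structures rather than attention weights; it is not the authors' implementation, and only shows why the path length the procedure can cover grows roughly exponentially with the number of merge steps.

```python
# Toy simulation of the "exponential path-merging" idea: every vertex keeps a
# set of vertices it is known to reach, and each "layer" merges the sets of
# vertices that are already known to be connected, so the covered path length
# can roughly double per layer. Illustrative only; not the authors' code and
# not the exact algorithm the trained transformer implements.

def path_merge_layers(edges, num_vertices, num_layers):
    # Initially each vertex knows itself and its direct successors.
    reach = {v: {v} for v in range(num_vertices)}
    for u, v in edges:
        reach[u].add(v)

    for _ in range(num_layers):
        # One merge step: every vertex absorbs the reachable set of every
        # vertex it can already reach.
        new_reach = {v: set(reach[v]) for v in range(num_vertices)}
        for v in range(num_vertices):
            for w in reach[v]:
                new_reach[v] |= reach[w]
        reach = new_reach
    return reach

# Directed chain 0 -> 1 -> ... -> 7: two merge steps already cover paths of
# length four, so vertex 4 is known to be reachable from vertex 0.
edges = [(i, i + 1) for i in range(7)]
reach = path_merge_layers(edges, num_vertices=8, num_layers=2)
print(4 in reach[0])  # True
print(7 in reach[0])  # False; a third merge step would be needed
```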

Testing revealed striking performance differences across graph sizes. Models trained on graphs with at most 41 vertices achieved near-perfect accuracy on small inputs but degraded severely on larger graphs. Across 14 random initialization seeds, the fraction of models that successfully learned the task decreased dramatically as graph size increased from 8 to 50 vertices.

The researchers also tested whether chain-of-thought prompting could overcome these limitations. Using depth-first search and selection-inference approaches, they found that while intermediate steps required fewer layers to learn, models still struggled on larger graphs. "Even if the model is permitted to generate intermediate tokens, it is challenging to learn to search on larger graphs," the paper states.
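
The snippet below sketches the kind of intermediate trace a chain-of-thought variant might be trained to emit for graph connectivity, one visited vertex per step. The token format is a hypothetical assumption for illustration and does not reproduce the paper's exact depth-first search or selection-inference encodings.

```python
# Hypothetical sketch of a chain-of-thought trace for graph connectivity: a
# depth-first search that emits one "visit" step per expanded vertex. The
# token format is an illustrative assumption, not the paper's encoding.

def dfs_trace(edges, start, goal):
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)

    trace, stack, visited = [], [start], set()
    while stack:
        v = stack.pop()
        if v in visited:
            continue
        visited.add(v)
        trace.append(f"visit {v}")  # one intermediate token per search step
        if v == goal:
            trace.append("reachable")
            return trace
        stack.extend(reversed(adjacency.get(v, [])))
    trace.append("not reachable")
    return trace

print(dfs_trace([(0, 1), (1, 2), (0, 3)], start=0, goal=2))
# ['visit 0', 'visit 1', 'visit 2', 'reachable']
```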

Graph connectivity represents a fundamental reasoning task equivalent to proof search in simplified logic systems. The researchers selected this domain specifically because "the model must solve this task if there is any chance to generalize to more complex search and reasoning tasks." Their findings suggest broader implications for AI system capabilities in planning, reasoning, and navigation tasks that require systematic search.

Training distribution design proved critical for learning success. The researchers developed three different graph generation approaches: naive, star, and balanced distributions. Only the balanced distribution, which carefully prevented heuristic shortcuts and maintained uniform difficulty levels, enabled robust learning. Models trained on naive distributions showed exponentially declining accuracy as search depth increased.
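
The following sketch conveys the intuition behind a balanced distribution under simplifying assumptions: sample the required search depth uniformly first, then construct a graph containing a start-to-goal path of exactly that length plus distractor branches that cannot act as shortcuts. The paper's actual generation procedure is more elaborate; the function names and parameters here are illustrative.

```python
import random

# Hedged sketch of the idea behind a balanced training distribution: sample
# the search depth uniformly, then build a graph with a start-to-goal path of
# exactly that length plus dead-end distractor edges. Not the paper's
# construction; all names and parameters are illustrative assumptions.

def sample_balanced_instance(num_vertices, max_depth, num_distractors, rng=random):
    depth = rng.randint(1, max_depth)              # uniform over difficulty
    vertices = list(range(num_vertices))
    rng.shuffle(vertices)
    path = vertices[: depth + 1]                   # start-to-goal path of length `depth`
    edges = [(path[i], path[i + 1]) for i in range(depth)]

    off_path = vertices[depth + 1:]
    for _ in range(num_distractors):
        if not off_path:
            break
        u = rng.choice(path[:-1])                  # branch off a non-goal path vertex
        v = off_path.pop()
        edges.append((u, v))                       # dead-end branch, never a shortcut
    rng.shuffle(edges)
    return edges, path[0], path[-1], depth

edges, start, goal, depth = sample_balanced_instance(num_vertices=12, max_depth=4, num_distractors=5)
print(f"start={start} goal={goal} depth={depth} edges={edges}")
```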

The study included natural language proof search experiments where graph edges were expressed as conditional sentences. Performance patterns remained consistent, suggesting limitations apply broadly across input representations rather than specific symbolic encodings.
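
A minimal example of this rephrasing, assuming a simple "If ..., then ..." template and placeholder propositions rather than the authors' exact wording, looks like this:

```python
# Minimal illustration of rephrasing graph edges as conditional sentences.
# The template and the example propositions are assumptions for illustration,
# not the wording used in the study.

def edges_to_sentences(edges, names):
    # Each directed edge (u, v) becomes "If <u>, then <v>."
    return [f"If {names[u]}, then {names[v]}." for u, v in edges]

names = {0: "it is raining", 1: "the street is wet", 2: "driving is slow"}
print(edges_to_sentences([(0, 1), (1, 2)], names))
# Asking whether vertex 2 is reachable from vertex 0 then becomes asking
# whether "driving is slow" can be derived from "it is raining".
```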

Current developments in search and reasoning align with these findings. As PPC Land reported in July 2025, researchers argue that large language models lack true reasoning capabilities, functioning through "universal approximate retrieval" rather than logical processing. The transformer search limitations research provides specific evidence for these broader theoretical concerns.

The economic implications extend beyond academic interest. Industry data shows $57 billion in cloud infrastructure investment during 2024 to support large language model deployment, creating a ten-fold disparity between infrastructure costs and market revenue. Understanding transformer limitations becomes crucial for evaluating return on AI investments.

The research methodology involved streaming training with continuously generated examples rather than fixed datasets. Models used single attention heads per layer and concatenated rather than additive positional embeddings to facilitate mechanistic interpretation. Training continued until models reached near-perfect accuracy on training distributions or demonstrated clear failure to converge.
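
The difference between additive and concatenated positional embeddings can be shown in a few lines of NumPy. The dimensions and random values below are arbitrary illustrative choices, not the configuration used in the study.

```python
import numpy as np

# Minimal contrast between additive and concatenated positional embeddings.
# Dimensions and initialization are illustrative assumptions only.

rng = np.random.default_rng(0)
seq_len, d_token, d_position = 6, 16, 8

token_emb = rng.normal(size=(seq_len, d_token))        # per-token content embeddings
position_emb = rng.normal(size=(seq_len, d_position))  # per-position embeddings

# Common approach: positions share the token dimension and are summed in.
additive = token_emb + rng.normal(size=(seq_len, d_token))

# Paper-style approach: positions occupy their own coordinates.
concatenated = np.concatenate([token_emb, position_emb], axis=-1)

print(additive.shape)      # (6, 16) content and position mixed in the same coordinates
print(concatenated.shape)  # (6, 24) content and position kept separable for interpretation
```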

Scaling experiments tested models with varying parameter counts on fixed graph sizes and varying graph sizes with fixed parameters. Both approaches revealed consistent patterns: larger models learned faster but showed no improvement in ultimate performance on challenging tasks. This contradicts assumptions that additional parameters automatically enable more complex reasoning.

The findings carry particular relevance for marketing technology development. Industry data shows 80% of companies blocking AI language models from accessing their websites, reflecting growing skepticism about AI capabilities. Understanding specific AI limitations helps marketing professionals make informed decisions about technology adoption and resource allocation.

Search capabilities matter for numerous marketing applications including campaign optimization, customer journey mapping, and competitive analysis. Tasks requiring systematic exploration of solution spaces may encounter similar scaling limitations, suggesting caution when deploying AI systems for complex strategic planning.

The research also reveals important insights about AI evaluation. Models can achieve perfect performance on training distributions while failing completely on slightly modified test cases. This pattern also appears in recent studies showing that AI models can fake understanding while failing basic tasks: language models define concepts correctly but cannot apply them consistently.

Mechanistic interpretability techniques developed for this research enable extraction of computation graphs from trained models. The methodology involves identifying important attention operations through perturbation analysis and reconstructing causal pathways from inputs to outputs. These tools may prove valuable for understanding other AI system behaviors.
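
A toy version of perturbation-based attribution, using a single random attention layer rather than a trained model, gives a feel for the approach. All matrices below are placeholders, and the paper's actual procedure for extracting computation graphs is considerably more involved.

```python
import numpy as np

# Toy perturbation-style attribution: knock out one attention connection at a
# time and measure how much the layer output changes. Random placeholder
# matrices only; not the paper's extraction method.

rng = np.random.default_rng(1)
seq_len, dim = 5, 8
x = rng.normal(size=(seq_len, dim))
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))

def attention_output(x, mask=None):
    scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(dim)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)      # suppress masked connections
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ (x @ Wv)

baseline = attention_output(x)
deltas = {}
for i in range(seq_len):
    for j in range(seq_len):
        mask = np.ones((seq_len, seq_len), dtype=bool)
        mask[i, j] = False                         # remove attention from position i to j
        deltas[(i, j)] = np.abs(attention_output(x, mask) - baseline).sum()

# Connections whose removal changes the output most are candidate edges in
# the extracted computation graph.
print("most influential attention edge:", max(deltas, key=deltas.get))
```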

Future research directions include investigating curriculum learning approaches, alternative architectures beyond transformers, and hybrid systems combining neural networks with symbolic reasoning. The authors suggest that different training procedures or architectural modifications might overcome current limitations.

The study's implications extend to broader AI development strategies. Rather than assuming scaling will solve fundamental capability gaps, researchers and practitioners may need alternative approaches for complex reasoning tasks. This aligns with growing industry recognition that different model sizes serve different purposes effectively.

For marketing professionals evaluating AI tools, these findings suggest distinguishing between pattern recognition tasks suitable for current models and reasoning tasks that may require different approaches. Understanding specific AI limitations enables more strategic technology deployment decisions.

The research contributes to evolving understanding of AI capabilities and constraints. As transformer architectures continue dominating AI applications, identifying their fundamental limitations becomes essential for realistic expectations and effective system design.

Summary

Who: Research team from Purdue University, New York University, Google, and Boston University led by Abulhair Saparov investigating transformer search capabilities.

What: Study demonstrating that transformer models cannot learn robust search algorithms on larger graphs despite unlimited training data and increased parameters, challenging fundamental assumptions about AI scaling.

When: Research paper initially published December 6, 2024, with the final version on March 16, 2025, and accepted at the ICLR 2025 conference.

Where: Academic research conducted across multiple institutions, with findings published at a premier machine learning conference and implications for AI development globally.

Why: Investigation addresses critical questions about whether transformer architecture limitations, insufficient data, or inadequate parameters explain AI struggles with search and reasoning tasks essential for planning and navigation applications.

PPC Land explains

Transformer Architecture: The foundational neural network design that powers most modern AI language models, including GPT and similar systems. Transformers process information through attention mechanisms that allow models to focus on relevant parts of input data simultaneously rather than sequentially. The research demonstrates that despite their success in language tasks, transformers have fundamental limitations when learning algorithmic search procedures, suggesting architectural constraints that scaling alone cannot overcome.

Graph Connectivity: A mathematical problem involving determining whether paths exist between vertices in directed networks, used as a controlled testbed for evaluating search capabilities. This domain represents the simplest form of logical reasoning, equivalent to proof search in basic logic systems. The researchers selected graph connectivity specifically because mastery of this task forms a prerequisite for more complex reasoning abilities in planning, navigation, and strategic decision-making applications.

Mechanistic Interpretability: Advanced techniques for understanding how neural networks process information internally by extracting computation graphs from trained models. These methods reveal the specific algorithms that transformers learn, showing how attention operations transfer information between network components. The research developed novel interpretability tools that identify causal pathways from inputs to outputs, enabling precise analysis of how models succeed or fail at search tasks.

Search Algorithms: Systematic procedures for exploring solution spaces to find optimal paths or answers, fundamental to reasoning and planning tasks. The study reveals that transformers learn parallel search strategies, computing reachable vertex sets simultaneously across all graph positions. However, this approach breaks down as problem complexity increases, highlighting limitations in how current AI systems handle systematic exploration compared to traditional algorithmic approaches.

Scaling Laws: Empirical observations about how AI model performance improves with increased parameters, training data, or computational resources. The research challenges conventional scaling assumptions by demonstrating that larger transformers show no improvement on search tasks beyond a certain complexity threshold. These findings suggest fundamental architectural limitations that additional scale cannot resolve, contradicting industry expectations about linear capability improvements through resource investment.

Chain-of-Thought: A prompting technique where models generate intermediate reasoning steps before reaching final conclusions, designed to improve complex problem-solving capabilities. The study tested whether allowing transformers to output intermediate tokens could overcome search limitations, finding that while this approach required fewer layers to learn, models still struggled systematically on larger problems. This suggests that current reasoning enhancement techniques may not address fundamental architectural constraints.

Training Distribution: The statistical properties of data used to train machine learning models, critically affecting what algorithms models learn to implement. The researchers discovered that carefully designed balanced distributions enable search learning, while naive approaches fail completely. This highlights how training data design, not just quantity, determines whether AI systems acquire robust reasoning capabilities versus brittle pattern recognition.

Large Language Models: AI systems with billions of parameters trained on vast text datasets to understand and generate human-like language. The research findings apply broadly to LLMs despite using smaller models, as the fundamental transformer architecture remains consistent across scales. Understanding these search limitations helps explain why LLMs struggle with planning and reasoning tasks that require systematic exploration of solution spaces.

Exponential Path-Merging: The specific algorithm that transformers learn for search tasks, where each layer progressively expands sets of reachable vertices by merging information from connected nodes. This parallel computation approach theoretically enables searching over vertex counts exponential in the number of transformer layers. However, the research shows this strategy becomes increasingly unreliable as graph size increases, revealing fundamental constraints in how transformers represent and manipulate structured information.

Graph Size Scaling: The phenomenon where transformer performance degrades systematically as input complexity increases, regardless of additional training or parameters. Models achieving near-perfect accuracy on small graphs fail catastrophically on larger inputs, suggesting hard limits on the complexity that current architectures can handle. This scaling failure pattern appears consistently across different training approaches and model configurations, indicating fundamental rather than implementation-specific limitations.