Google researchers this week published SAGE (Steerable Agentic Data Generation for Deep Search with Execution Feedback), a framework that automatically generates high-quality training data for AI agents designed to browse websites and answer complex questions requiring multiple search steps. The research paper, made available on January 26 through arXiv, addresses a fundamental challenge facing companies building AI search systems: acquiring training data for agents that must navigate across web pages, synthesize information from multiple sources, and reason through multi-step problems.

According to the paper authored by researchers from Google Cloud AI Research and New York University, collecting human annotations for training these "deep search agents" proves prohibitively expensive. Complex exploration trajectories involving multiple searches and reasoning steps make manual data creation impractical at the scale required for effective model training.

The SAGE framework employs a dual-agent architecture. A data generator agent creates question-answer pairs by iteratively searching through a corpus and gathering information, while a separate search agent attempts to solve the generated questions. The search agent provides execution feedback to the data generator, enabling refinement of questions until they satisfy both correctness and difficulty requirements. This iterative feedback loop represents a departure from static data generation approaches that produce questions without validating whether they genuinely require the intended reasoning complexity.

The system allows researchers to control difficulty by specifying target search steps - the number of times an agent must query a retrieval system before arriving at an answer. Questions requiring 3-7 search steps exhibit substantially different characteristics from simpler queries answerable with one or two lookups. When targeting four-step questions, for example, the data generator might create: "What is the specific date of the initial event that evolved into the national book fair, pioneered by the individual who established a publishing house in Kolkata during the Bangladesh Liberation War?"
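The paper describes this generate-verify-refine cycle at a high level rather than as code. The sketch below is one way the loop might be structured, with the two agents passed in as callables; the function names, the dictionary-style verdict, and the `target_steps` parameter are assumptions made for illustration, not the authors' actual interface.

```python
# Illustrative sketch of a SAGE-style generate-verify-refine loop. The
# callables data_generator and verify stand in for the two agents described
# above; their signatures are assumptions for the sake of the example.

def generate_with_feedback(seed_doc, target_steps, data_generator, verify,
                           max_rounds=3):
    """Refine a question-answer pair until it is both correct and hard enough."""
    feedback = None
    for _ in range(max_rounds + 1):
        # Data generator: searches the corpus starting from seed_doc and
        # proposes a QA pair, conditioned on feedback from the previous round.
        question, answer = data_generator(seed_doc, target_steps, feedback)

        # Search agent: attempts the question without seeing seed_doc and
        # reports whether the answer checks out and how many searches it took.
        verdict = verify(question, answer, target_steps)
        if verdict["correct"] and verdict["hard_enough"]:
            return question, answer          # accept the pair as training data

        feedback = verdict["summary"]        # execution feedback for the next round
    return None                              # discard after the refinement budget
```

In this sketch, difficulty steering happens entirely through `target_steps`: the generator is asked for a question that should need that many searches, and the pair is accepted only if the verifying search agent in fact needs at least that many.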

Traditional question-answering datasets focus on simpler information needs. Natural Questions, created through human annotation, requires an average of 1.3 search steps per question. HotpotQA and Musique, constructed through automatic pipelines leveraging Wikipedia's structure, average 2.1 and 2.7 steps respectively. SAGE-generated questions average 4.9 steps, with researchers successfully producing questions requiring up to 7 distinct search operations.

The research demonstrates why execution feedback proves essential. Without verification from an actual search agent, the data generator produces correct and sufficiently difficult questions only 18% of the time when targeting 3-7 step questions. After three rounds of feedback refinement, success rates climb to 50%. The paper reveals that data generators frequently misjudge difficulty because of misalignments between intended search plans and actual retrieval system behavior.

Analysis of 100 failed question generations identified four common patterns causing "easy data" - questions that require fewer steps than intended. Information co-location, where multiple required facts appear in the same document, accounts for 35% of easy questions. Multi-query collapse, where a retrieval system finds information from multiple documents with a single query, causes 21% of failures. Overly specific questions and superficial complexity contribute 31% and 13% respectively.

For incorrect questions, search agent retrieval failures represent 54% of problems, followed by reasoning errors at 20% and data generator hallucinations at 19%. These findings suggest that substantial portions of initially rejected data reflect search agent limitations rather than fundamental question flaws, pointing toward potential improvements in verification approaches.

The training data's quality manifests in downstream performance. Researchers trained Qwen-2.5-3B and Qwen-2.5-7B models using reinforcement learning with SAGE-generated data, comparing results against models trained on Natural Questions combined with HotpotQA, as well as Musique alone. On in-domain evaluation averaged across questions requiring 2-7 search steps, the 3B model improved from 15.9% accuracy (Natural Questions + HotpotQA baseline) and 22.4% (Musique baseline) to 28.5% - a 27% relative improvement. The 7B model jumped from 29.1% and 29.6% to 38.1%, a 29% relative improvement.

These gains transferred to out-of-domain datasets. On FRAMES, a human-annotated benchmark for retrieval-augmented generation, the 7B model achieved 32.3% accuracy after training on SAGE data compared to 26.2% for Natural Questions + HotpotQA and 25.0% for Musique - a 23% relative improvement over the strongest baseline. Performance on Musique itself reached 22.3%, surpassing the 21.6% achieved by models trained directly on Musique's own training data.

Reasoning strategy analysis reveals that SAGE produces questions requiring more diverse cognitive operations than existing benchmarks. While inference appears in 77% of Musique questions and 81% of SAGE questions, calculation and temporal reasoning show stark differences. Calculation appears in 5% of Musique questions versus 35% of SAGE questions. Temporal reasoning jumps from 8% to 32%. Hypothesis generation, conflict resolution, and self-correction also appear more frequently in SAGE data, creating a more balanced distribution across reasoning categories.

The research demonstrates that agents trained on Wikipedia-based retrieval transfer effectively to Google Search at inference time. On GAIA, a benchmark requiring web search, the 7B model trained on SAGE data achieved 24.0% accuracy compared to 15.6% for Musique-trained models - a 50% relative improvement. Similar patterns emerged on BrowseComp, though improvements on Humanity's Last Exam proved more modest, likely reflecting the specialized scientific focus of that benchmark.

Google's framework operates on the 2018 Wikipedia dump using E5 as the retrieval system. The data generator and search agent both run on gemini-2.5-flash with temperature set to 1. The generator receives an input document randomly sampled from Wikipedia and a target difficulty level specified as a number of search steps. It then iteratively issues search queries, gathering comprehensive information before outputting a question-answer pair grounded in retrieved evidence.
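The paper names E5 and the 2018 Wikipedia dump but does not reproduce indexing code. The snippet below is a minimal, assumed setup showing how an E5-style dense retriever could back the search tool, using the sentence-transformers library; the specific checkpoint and the toy two-passage corpus are illustrative choices, not details from the paper.

```python
# Minimal dense-retrieval sketch in the spirit of the paper's E5-over-Wikipedia
# setup. The checkpoint name and toy corpus are assumptions for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")  # assumed E5 variant

passages = [
    "Kolkata is the capital of the Indian state of West Bengal.",
    "The Bangladesh Liberation War was fought in 1971.",
]
# E5 models expect "passage: " and "query: " prefixes on their inputs.
passage_emb = model.encode([f"passage: {p}" for p in passages],
                           normalize_embeddings=True)

def search(query: str, top_k: int = 3):
    """Return the top-k passages ranked by cosine similarity to the query."""
    q_emb = model.encode([f"query: {query}"], normalize_embeddings=True)[0]
    scores = passage_emb @ q_emb
    order = np.argsort(-scores)[:top_k]
    return [(passages[i], float(scores[i])) for i in order]

print(search("publishing house founded in Kolkata during the Liberation War"))
```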

If the generator exhausts its search budget without producing a question, the system forces output by appending a prompt directing the model to formulate a question using existing information. The search agent receives only the generated question, without access to the original input document, and must independently search to find the answer. Researchers collect multiple execution traces from the search agent to account for variation in problem-solving approaches.

The feedback mechanism provides both correctness signals and difficulty estimates. Correctness derives from pass@K performance - whether any of K attempts (K=4 in the research) produces an answer that an LLM-as-a-judge evaluation deems to match the data generator's proposed answer. Difficulty is measured as the minimum number of search steps among the correct attempts. If this minimum equals or exceeds the target, the question passes the difficulty requirement.
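In code, that feedback signal reduces to two quantities per question: a pass@K flag and a minimum step count over the correct traces. The sketch below mirrors the description above under assumed data structures; the trace record and the injected `judge` comparison function are placeholders, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    answer: str        # final answer from one search-agent attempt
    num_searches: int  # how many retrieval calls that attempt issued

def score_question(traces, proposed_answer, target_steps, judge):
    """Derive correctness and difficulty signals from K execution traces.

    `judge(a, b)` is an injected LLM-as-a-judge comparison; the record layout
    and signature here are assumptions, not the authors' actual code.
    """
    correct = [t for t in traces if judge(t.answer, proposed_answer)]

    if not correct:  # pass@K = 0: the question is filtered out entirely
        return {"correct": False, "hard_enough": False, "difficulty": None}

    difficulty = min(t.num_searches for t in correct)   # easiest successful path
    return {"correct": True,                            # pass@K: at least one match
            "hard_enough": difficulty >= target_steps,
            "summary": None,
            "difficulty": difficulty}
```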

For downstream training, researchers generated 20,000 question-answer pairs for each experimental condition, filtering out questions requiring fewer than two search steps. Training employed Proximal Policy Optimization with outcome-based rewards evaluated by gemini-2.0-flash as judge. The training process applied loss masking to retrieved document content, focusing optimization on the model's reasoning and query formulation rather than memorization of specific passages.
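The paper reports loss masking on retrieved content but its training code is not shown here. The snippet below is a generic sketch of the idea: token positions covered by retrieved documents get zero weight in the policy loss, so gradients flow only through the model's own reasoning and query tokens. The simplified objective stands in for the full PPO loss, and tracking of retrieved-text spans is assumed to happen in the rollout code.

```python
import torch

def masked_policy_loss(token_logprobs, advantages, retrieved_spans):
    """Policy-gradient style loss that ignores retrieved-document tokens.

    token_logprobs, advantages: 1-D tensors over the generated sequence.
    retrieved_spans: (start, end) index pairs marking retrieved text.
    """
    mask = torch.ones_like(token_logprobs)
    for start, end in retrieved_spans:
        mask[start:end] = 0.0              # retrieved tokens contribute no gradient

    # Simplified surrogate objective restricted to unmasked tokens; a full PPO
    # loss would add clipped probability ratios, a value term, and entropy.
    denom = mask.sum().clamp(min=1.0)
    return -(token_logprobs * advantages * mask).sum() / denom
```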

The research acknowledges limitations. The framework relies on a fixed search agent for verification rather than co-evolving both the generator and verifier, potentially missing opportunities for enhanced data quality through iterative agent improvement. The correctness criterion, which accepts a question as soon as one of the K attempts matches, serves as a practical approximation but may admit hallucinated content. Questions where none of the K attempts succeed are filtered out entirely, potentially discarding valid questions that simply exceed current agent capabilities.

The implementation focuses exclusively on generating question-answer pairs for reinforcement learning rather than complete supervised fine-tuning trajectories including intermediate reasoning steps and retrieved documents. Experiments cover only Wikipedia as source corpus and models up to 7B parameters, leaving domain-specific applications and larger model scales unexplored. The approach also hasn't been tested with alternative RL algorithms like GRPO.

The timing of this research coincides with broader industry movement toward agentic AI systems. Google executives have discussed fundamental transformations in web interaction patterns, with CEO Sundar Pichai describing an "agent first web" during a May 2025 interview. The company expanded AI Mode in November 2025, introducing agentic features capable of completing tasks like restaurant reservations directly within search results.

Marie Haynes, a prominent SEO expert, explained in December 2025 how Google increasingly functions as an AI agent making decisions on behalf of users rather than simply presenting ranked links. Her analysis emphasized that "the web the way we know it - I think the web had to exist for like Google's been around for what 25 years or so. I think that we've been working for Google in populating content so that AI could learn."

The research arrives as concerns mount about AI features reducing traffic to publisher websites. Google's Web Guide, launched in July 2025, uses query fan-out techniques similar to AI Mode to reorganize search results through AI-powered grouping - functionality that Cloudflare CEO Matthew Prince characterized as continuing to "break publishers' business models."

Google published technical documentation for AI agent architectures in September 2024, detailing how agents leverage tools to extend beyond traditional language model capabilities through three core components: the model layer, orchestration layer, and tools layer. The September whitepaper emphasized that AI agents fundamentally differ from language models in their ability to perceive, reason about, and influence the external world.

The SAGE research builds on this foundation by addressing a critical bottleneck: acquiring training data at the scale and quality needed to create capable search agents. While Google has introduced various agent-powered features including autonomous checkout and Business Agent for retail, the underlying question of how to train these systems efficiently remains central to their deployment.

Training data quality directly impacts whether AI agents can handle the complex, multi-step reasoning that characterizes real-world information needs. The research demonstrates that synthetic data generation, when paired with proper verification mechanisms, can produce training sets rivaling or exceeding those created through expensive human annotation or existing automatic pipelines.

The paper notes that concurrent work from other research teams explores similar challenges. WebDancer and WebShaper, both announced in 2025, tackle synthetic training data generation for search agents using browsing tools rather than retrieval APIs. These approaches focus on actual web navigation, which proves more expensive due to API costs and more complex to reproduce than corpus-based retrieval.

The methodology SAGE introduces - using execution traces to refine generated questions through iterative feedback - represents a general approach applicable beyond search agents. Any task where difficulty proves hard to specify upfront but easy to measure through execution could potentially benefit from similar verification-driven generation.

The research makes code and data available through GitHub, enabling other researchers to reproduce findings and build on the framework. This open approach contrasts with some concurrent industry work where large-scale training data remains proprietary despite published papers describing generation methods.

For marketing professionals and SEO practitioners, the research offers insights into how AI systems will navigate and synthesize web content. The finding that useful internal links should help agents "jump to another page but that jump should add to the reasoning process" suggests that traditional pillar-cluster content architecture may prove beneficial for AI agent navigation when anchors provide contextually relevant information filling gaps in the model's reasoning.

The emphasis on providing "broad context when mentioning entities or facts" aligns with recommendations that content should explain not just what information means but why it matters within the specific context. If mentioning "10%" in content, the research suggests properly explaining "of what and why it matters" rather than assuming readers or AI systems will infer context from surrounding text.

These technical insights from Google's research infrastructure provide a window into how the company approaches building the AI systems that increasingly mediate between users and web content. The SAGE framework demonstrates that creating effective AI agents requires not just powerful models but sophisticated data generation pipelines that can produce training examples matching the complexity of real-world tasks.

Summary

Who: Google Cloud AI Research and New York University researchers including Fangyuan Xu, Rujun Han, Yanfei Chen, Zifeng Wang, I-Hung Hsu, Jun Yan, Vishy Tirumalashetty, Eunsol Choi, Tomas Pfister, and Chen-Yu Lee.

What: SAGE (Steerable Agentic Data Generation for Deep Search with Execution Feedback) is an automated pipeline that generates high-quality, difficulty-controlled training data for AI search agents through a dual-agent framework where a data generator creates question-answer pairs and a search agent provides execution feedback for iterative refinement. The framework produces questions requiring an average of 4.9 search steps compared to 1.3-2.7 steps in existing datasets, with success rates improving from 18% to 50% through feedback iterations. Training on SAGE-generated data produces 23-29% relative performance improvements over existing training datasets on both in-domain and out-of-domain benchmarks.

When: The research paper was submitted to arXiv on January 26, 2026, with the work conducted throughout 2025. Code and data release is planned through GitHub at https://github.com/carriex/sage.

Where: The research was conducted at Google Cloud AI Research facilities in collaboration with New York University. The framework operates on the 2018 Wikipedia dump using E5 as the retrieval system, though trained agents demonstrate effective transfer to Google Search at inference time. The methodology applies to any corpus-based retrieval system.

Why: The research addresses the prohibitively expensive and time-consuming challenge of collecting human annotations for training AI agents that must perform complex, multi-step reasoning across multiple documents. Existing training datasets focus primarily on simpler questions requiring 1-3 search steps, creating a gap in available data for training agents capable of handling more complex information needs. The automated generation approach with execution feedback enables production of high-quality training data at scale while maintaining control over difficulty levels, advancing development of AI systems that can browse websites, synthesize information across sources, and answer questions requiring sophisticated reasoning strategies including calculation, temporal analysis, conflict resolution, and hypothesis generation.
