Google Cloud AI Research last month published PaperBanana, a multi-agent framework designed to automate the creation of academic illustrations for research papers. According to a paper submitted to arXiv on January 30, 2026, the system employs five specialized artificial intelligence agents that work together to generate methodology diagrams and statistical plots, achieving a 72.7% win rate against baseline AI models in blind human evaluation.

The research addresses a persistent bottleneck in academic publishing. Creating publication-quality diagrams requires significant time and specialized skills that many researchers lack. PaperBanana transforms text descriptions from methodology sections into visual representations that conform to academic standards observed across 292 papers from NeurIPS 2025. The system's architecture draws on principles Google Cloud established in its November 2025 agentic AI framework, which defined five levels of agent sophistication from basic reasoning systems to collaborative multi-agent architectures.

Dawei Zhu from Peking University led the research alongside colleagues from Google Cloud AI Research including Rui Meng, Yale Song, Xiyu Wei, Sujian Li, Tomas Pfister, and Jinsung Yoon. The team built PaperBanana using Gemini-3-Pro as the underlying vision-language model backbone, combined with Nano-Banana-Pro and GPT-Image-1.5 for image generation. This technical stack positions PaperBanana within Google's broader artificial intelligence ecosystem, which has expanded rapidly throughout 2025.

Five agents working in concert

The PaperBanana architecture divides illustration generation across five specialized components, each handling a distinct aspect of the creative process. The Retriever Agent searches a pool of reference examples to identify relevant visual patterns. The Planner Agent translates methodology text into detailed descriptions suitable for image generation. The Stylist Agent applies aesthetic guidelines automatically synthesized from 292 reference papers. The Visualizer Agent renders methodology diagrams with Nano-Banana-Pro or generates Python code for statistical plots. Finally, the Critic Agent performs iterative refinement through up to three rounds of self-critique.

This multi-agent approach reflects broader industry adoption of specialized AI systems. Google Cloud's April 2025 survey found 88% of early adopter organizations implementing AI agents report positive return on investment across multiple business applications. PaperBanana demonstrates how this pattern applies to research workflows - each agent contributes specific expertise rather than attempting to solve the entire problem through a single model.

The system operates through a coordinated workflow. When a researcher inputs methodology text, the Retriever Agent first identifies the two most relevant reference diagrams from a curated pool of 292 examples. These serve as visual anchors. The Planner Agent then generates a comprehensive description including spatial relationships, component layouts, and visual hierarchies. The Stylist Agent overlays aesthetic guidelines derived from the reference corpus - specifying color palettes, font choices, icon styles, and layout principles that characterize high-quality academic illustrations.
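
To make the front half of that pipeline concrete, the sketch below mocks up the retrieve-plan-style stages in Python. The class and function names, and the keyword-overlap retrieval heuristic, are illustrative assumptions rather than the paper's actual implementation, which relies on vision-language models at each step.

```python
# Minimal sketch of the retrieve -> plan -> style stages (illustrative only).
from dataclasses import dataclass

@dataclass
class Reference:
    paper_id: str
    caption: str  # text accompanying the reference diagram

def retrieve_references(method_text: str, pool: list[Reference], k: int = 2) -> list[Reference]:
    """Rank the reference pool by naive keyword overlap and keep the top-k visual anchors."""
    query = set(method_text.lower().split())
    scored = sorted(pool, key=lambda r: -len(query & set(r.caption.lower().split())))
    return scored[:k]

def plan_layout(method_text: str, anchors: list[Reference]) -> str:
    """Stand-in for the Planner Agent: expand the text into a detailed drawing description."""
    return (f"Plan for {method_text!r} with anchors "
            f"{[a.paper_id for a in anchors]}: components, layout, visual hierarchy")

def apply_style(plan: str) -> str:
    """Stand-in for the Stylist Agent: append synthesized aesthetic guidelines."""
    guidelines = "muted palette, sans-serif labels, consistent icon style, left-to-right flow"
    return f"{plan}\nStyle: {guidelines}"

pool = [Reference("ref-001", "transformer encoder decoder attention diagram"),
        Reference("ref-002", "reinforcement learning agent environment loop")]
prompt = apply_style(plan_layout("encoder-decoder model with attention",
                                 retrieve_references("encoder decoder attention", pool)))
print(prompt)
```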

The Visualizer Agent receives these combined inputs and produces an initial draft. For methodology diagrams, it uses Nano-Banana-Pro's image generation capabilities. For statistical visualizations requiring numerical precision, it generates executable Python code using matplotlib or similar libraries. This dual approach acknowledges different requirements: methodology diagrams prioritize conceptual clarity while statistical plots demand exact data representation.
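
The routing between those two paths can be pictured as a simple dispatch, sketched below with placeholder functions. The function names and the routing flag are assumptions; the paper does not publish this interface.

```python
# Illustrative dispatch between image synthesis and code generation.
def render_methodology_diagram(prompt: str) -> str:
    # Placeholder for a call to an image-generation model such as Nano-Banana-Pro.
    return f"<image rendered from: {prompt}>"

def generate_plot_code(prompt: str) -> str:
    # Placeholder for code generation; a real system would emit a full matplotlib script.
    return f"import matplotlib.pyplot as plt  # plot for: {prompt}"

def visualize(prompt: str, kind: str) -> str:
    if kind == "statistical_plot":               # numerical precision -> executable code
        return generate_plot_code(prompt)
    return render_methodology_diagram(prompt)    # conceptual clarity -> image synthesis

print(visualize("accuracy across five models with error bars", kind="statistical_plot"))
```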

The Critic Agent evaluates each generated image across four dimensions. Faithfulness measures whether all components from the text description appear in the visualization. Conciseness assesses whether the diagram avoids unnecessary elements. Readability examines text legibility and visual organization. Aesthetics evaluates overall visual appeal and professional presentation quality. The agent provides specific feedback, which the Visualizer incorporates through iterative refinement - up to three revision cycles.
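
A minimal sketch of such a critique loop appears below. The scoring and revision functions are stand-ins for the vision-language-model calls the paper describes; only the structure - score on four dimensions, revise the weakest, stop after at most three rounds - mirrors the description above.

```python
# Illustrative critique-and-revise loop with placeholder scoring.
DIMENSIONS = ("faithfulness", "conciseness", "readability", "aesthetics")

def critique(draft: str, round_idx: int) -> dict[str, float]:
    # Placeholder: the real Critic queries a vision-language model per dimension.
    base = 0.6 + 0.15 * round_idx                # pretend quality improves each round
    return {dim: min(base, 1.0) for dim in DIMENSIONS}

def revise(draft: str, feedback: dict[str, float]) -> str:
    weakest = min(feedback, key=feedback.get)
    return f"{draft} [revised for {weakest}]"    # placeholder for regeneration

def refine(initial_draft: str, threshold: float = 0.85, max_rounds: int = 3) -> str:
    draft = initial_draft
    for round_idx in range(max_rounds):
        scores = critique(draft, round_idx)
        if min(scores.values()) >= threshold:    # all four dimensions pass
            break
        draft = revise(draft, scores)
    return draft

print(refine("initial methodology diagram draft"))
```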

Benchmark performance exceeds baselines

The research team constructed PaperBananaBench, a specialized evaluation dataset containing 292 test cases extracted from NeurIPS 2025 methodology sections. Each test case includes the original text description and serves as ground truth for evaluation. The benchmark divides into four categories: Agent & Reasoning systems, Vision & Perception applications, Generative & Learning methods, and Science & Applications domains.

PaperBanana achieved measurable improvements across all evaluation dimensions compared to vanilla Nano-Banana-Pro without the agentic framework. Faithfulness increased 2.8 percentage points. Conciseness jumped 37.2 percentage points - the largest gain, indicating the Critic Agent successfully eliminates extraneous visual elements. Readability improved 12.9 percentage points. Aesthetics gained 6.6 percentage points. Overall scores rose 17.0 percentage points.

The blind human evaluation provided more compelling evidence. Reviewers saw pairs of images - one generated by PaperBanana and one by the baseline system - without knowing which was which. They selected their preferred illustration based on quality, clarity, and appropriateness for academic publication. PaperBanana won 72.7% of comparisons, tied in 20.7%, and lost only 6.6%. In other words, human experts judged its illustrations superior in roughly three of every four cases and at least as good in more than nine of ten.

Anecdotal validation emerged through social media. A Twitter thread announcing PaperBanana by user @hasantoxr garnered 551,600 views and claimed human preference rates reaching 75%. The thread highlighted specific examples where PaperBanana-generated diagrams matched or exceeded the quality of figures created by professional designers. Multiple researchers commented on the potential time savings, noting that creating a single high-quality methodology diagram often requires 4-8 hours of work.

Statistical plots and code generation

Beyond methodology diagrams, PaperBanana extends to statistical visualization through a code-based approach. The team evaluated the system on 240 test cases from the ChartMimic dataset, which contains diverse plot types including line charts, bar graphs, scatter plots, and multi-panel figures. Rather than generating plot images directly, the Visualizer Agent produces executable Python code that recreates the visualization.

This architectural choice addresses a fundamental limitation of image generation models. Dense numerical data often gets corrupted when rendered through AI image generators - axis labels become garbled, data points shift from their correct positions, and legends display incorrect values. Code generation solves this by treating visualization as a programming task rather than an image synthesis problem. The generated matplotlib or seaborn code exactly reproduces specified data relationships.

The approach aligns with how data scientists actually work. Professional researchers rarely create statistical plots by manually drawing pixels. They write code that transforms data into visual representations. PaperBanana automates this coding process, converting natural language descriptions like "create a bar chart showing accuracy across five models with error bars" into functioning Python scripts. Users can then modify the code to adjust colors, add annotations, or change formatting - something impossible with static generated images.
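
For that example prompt, a generated script might resemble the matplotlib snippet below. The model names and accuracy values are invented purely to make the example runnable; PaperBanana's actual output will differ.

```python
# One plausible script for "a bar chart showing accuracy across five models
# with error bars" - all data values are hypothetical.
import matplotlib.pyplot as plt

models = ["Model A", "Model B", "Model C", "Model D", "Model E"]
accuracy = [0.71, 0.74, 0.78, 0.81, 0.86]      # hypothetical means
stderr = [0.020, 0.015, 0.020, 0.010, 0.012]   # hypothetical error bars

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(models, accuracy, yerr=stderr, capsize=4, color="#4C72B0")
ax.set_ylabel("Accuracy")
ax.set_ylim(0.6, 0.9)
ax.set_title("Accuracy across five models")
fig.tight_layout()
fig.savefig("accuracy_bar_chart.png", dpi=300)
```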

Performance on statistical plots matched the strong results from methodology diagrams. The system successfully generated code for complex multi-panel figures, handled diverse chart types, and incorporated proper axis labels and legends. Evaluation focused on code correctness - whether the generated script executed without errors and produced a visualization matching the input description. PaperBanana achieved high success rates across all tested plot categories.
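
A plausible form of that correctness check is sketched below: run the generated script in a subprocess and verify it exits cleanly and writes the expected figure file. The paper does not specify its evaluation harness, so the file names and timeout here are assumptions.

```python
# Illustrative execution-based check for generated plotting code.
import subprocess
import sys
from pathlib import Path

def script_executes(script_path: str, expected_output: str, timeout_s: int = 60) -> bool:
    """Return True if the generated script runs without error and writes the expected figure."""
    result = subprocess.run([sys.executable, script_path],
                            capture_output=True, timeout=timeout_s)
    return result.returncode == 0 and Path(expected_output).exists()

# Example: validate a script produced for the bar-chart prompt above.
# script_executes("generated_plot.py", "accuracy_bar_chart.png")
```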

Commercialization and pricing structure

Google launched PaperBanana as a commercial service alongside the research publication. The pricing page at paperbanana.org establishes a credit-based subscription model with four monthly tiers. The Basic plan costs $14.90 USD and includes 10 credits per month. The Pro plan at $29.90 USD provides 30 credits monthly. The Premium tier costs $59.90 USD for 100 credits. The Enterprise option runs $119.90 USD monthly and delivers 250 credits.

Each illustration generation consumes one credit regardless of complexity. This means a Basic subscriber can create 10 diagrams monthly while an Enterprise subscriber receives 250. The pricing structure targets different user segments. Individual researchers conducting occasional work might select Basic or Pro tiers. Academic labs producing multiple papers simultaneously could justify Premium subscriptions. Large research institutions or commercial R&D departments represent the Enterprise market.

The commercialization strategy differs from typical academic research releases. Most papers demonstrate techniques through proof-of-concept implementations but leave productization to others. Google instead launched a functional service immediately upon publication, complete with payment processing and subscription management. This reflects Google Cloud's growing emphasis on converting research advances into revenue-generating products. The company reported $17.6 billion in cloud revenues during Q4 2025, with AI products driving substantial growth.

The service currently supports English and Japanese language prompts only, according to documentation on the project website at https://dwzhu-pku.github.io/PaperBanana/. This limitation mirrors patterns seen across Google's AI deployments, which typically launch in English first before expanding to additional languages. The restriction may pose challenges for researchers working in other languages, particularly in European and Latin American academic communities.

Technical limitations and future development

The research paper acknowledges several constraints that limit PaperBanana's current capabilities. Output format represents the most significant limitation. The system generates raster images at 4K resolution rather than vector graphics. This matters because raster formats become pixelated when scaled, while vector graphics maintain quality at any size. Academic publishers increasingly prefer vector formats like SVG or PDF for figures that need to reproduce clearly in both digital and print editions.

Editability suffers as a consequence of the raster output. Researchers who receive a PaperBanana-generated diagram cannot easily modify specific elements. Changing a single label, adjusting a connection arrow, or recoloring a component requires regenerating the entire image through new prompts. Vector formats support precise editing of individual elements through tools like Adobe Illustrator or Inkscape. The authors note this limitation but suggest that iterative refinement through the Critic Agent partially compensates.

Style standardization creates another tradeoff. PaperBanana deliberately synthesizes aesthetic guidelines from 292 NeurIPS papers to establish consistent visual language. This produces diagrams that look professional and publication-ready. However, it reduces diversity and may impose a house style that doesn't match all journals or disciplines. Computer science papers favor certain diagrammatic conventions that differ from biology, physics, or social science publications. The current implementation doesn't adapt style guidelines across disciplines.

Faithfulness gaps persist despite the multi-agent architecture. The most common error type involves connection mistakes - arrows pointing to wrong components, missing relationships between elements, or spurious connections that don't appear in the text description. According to the paper's error analysis, these fine-grained connectivity issues often escape detection by the Critic Agent, which evaluates images through vision-language models that may miss subtle topological errors. The fact that PaperBanana tied or lost in roughly a quarter of human comparisons suggests these errors still leave room for improvement.

Implications for research workflows

Academic publishing involves numerous time-consuming tasks beyond writing. Literature review, experimental design, data collection, statistical analysis, and manuscript preparation all compete for researchers' limited time. Figure creation represents a particularly frustrating bottleneck because it requires visual design skills orthogonal to scientific expertise. Many researchers struggle with layout principles, color theory, and typography - producing functional but aesthetically mediocre diagrams.

PaperBanana addresses this mismatch by automating the design aspects while preserving researcher control over content. Scientists specify what their methodology involves through text descriptions. The system handles visual representation, applying professional design principles automatically. This division of labor mirrors patterns emerging across creative automation tools, where AI handles technical execution while humans focus on conceptual direction.

The time savings compound across a research career. A typical computer science researcher might publish 3-5 papers annually, each containing 4-8 figures. If PaperBanana reduces figure creation time from 6 hours to 30 minutes per diagram, a single researcher saves approximately 100-200 hours yearly. Multiplied across research teams, departments, and institutions, the aggregate productivity impact becomes substantial. Those hours can redirect toward experimental design, data analysis, or writing - activities that directly advance scientific knowledge.
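
The back-of-the-envelope version of that estimate is below. Every input is an assumption quoted from the paragraph above; the raw range of roughly 66-220 hours brackets the 100-200 hour figure, with a mid-range researcher (4 papers, 6 figures each) landing around 130 hours.

```python
# Worked time-savings estimate using the article's stated assumptions.
papers_per_year = (3, 5)                  # low and high publication counts
figures_per_paper = (4, 8)                # low and high figure counts
hours_saved_per_figure = 6 - 0.5          # 6 hours manual vs 30 minutes automated

low = papers_per_year[0] * figures_per_paper[0] * hours_saved_per_figure
high = papers_per_year[1] * figures_per_paper[1] * hours_saved_per_figure
mid = 4 * 6 * hours_saved_per_figure
print(f"Estimated savings: {low:.0f}-{high:.0f} hours/year (mid-range ~{mid:.0f})")
```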

Quality standardization represents another benefit. Human-created diagrams vary dramatically in quality depending on the creator's design skills. PaperBanana establishes a consistent baseline by applying the same aesthetic principles across all outputs. This consistency matters for journals maintaining visual standards across published articles. Editors spend less time requesting figure revisions for formatting issues. Peer reviewers can focus on scientific content rather than visual presentation.

PaperBanana fits within broader automation patterns transforming knowledge work. Agentic AI systems increasingly handle tasks requiring judgment, creativity, and domain expertise rather than just rote execution. Google Cloud projects this market could reach $1 trillion by 2035-2040, with over 90% of enterprises planning integration within three years. Academic illustration automation represents one application among thousands.

The multi-agent architecture specifically echoes patterns documented in Google's comprehensive framework guideline published in November 2025. That 54-page technical document outlined five levels of agent sophistication, from basic reasoning systems to collaborative multi-agent workflows. PaperBanana operates at Level 3 - the "Collaborative Multi-Agent Systems" tier - where specialized agents function as a team of specialists mirroring human organizational structures.

Marketing automation shows parallel evolution. Major platforms have launched agent-based systems for campaign management, creative generation, and performance optimization. These tools divide complex workflows across specialized components: audience analysis agents, creative generation agents, bidding optimization agents, and performance evaluation agents. The architectural pattern - decomposing complex tasks into specialized sub-problems handled by purpose-built agents - appears consistently across domains.

Data visualization automation has expanded beyond academic contexts. Google's Looker Studio introduced modern charts in January 2025, bringing enhanced styling options and sophisticated settings that automate much of the design work. Business intelligence platforms increasingly generate visualizations automatically from data, applying principles of effective data communication without requiring users to manually configure every aspect. PaperBanana extends this automation to the specialized context of academic methodology diagrams.

The code generation approach for statistical plots connects to broader AI-powered content creation trends. Rather than generating final outputs directly, systems increasingly produce structured intermediate representations that users can modify. Generated code provides transparency - researchers see exactly how the visualization gets constructed - and flexibility - they can adjust parameters or add custom elements. This contrasts with black-box image generation where the creation process remains opaque.

Research methodology and evaluation challenges

The PaperBananaBench dataset construction required careful curation. The research team extracted methodology sections from 292 papers published at NeurIPS 2025, one of computer science's premier conferences. They selected papers containing clear visual methodology descriptions suitable for automated illustration. This sampling strategy ensures test cases represent real academic writing rather than artificially constructed examples. However, it also introduces selection bias toward computer science and machine learning domains.

The four evaluation dimensions - faithfulness, conciseness, readability, and aesthetics - attempt to quantify subjective qualities of visual design. Faithfulness measures completeness: does the diagram include all elements mentioned in the text? This dimension admits relatively objective assessment. A methodology description mentioning "encoder, decoder, and attention mechanism" requires all three components to appear in the diagram. Missing any element constitutes a faithfulness failure.

Conciseness proves harder to evaluate objectively. What counts as an unnecessary element versus a helpful visual aid? PaperBanana scored 37.2 percentage points higher than baselines on conciseness, suggesting the Critic Agent successfully eliminates clutter. But optimal conciseness varies by discipline and audience. Introductory textbook diagrams benefit from additional explanatory elements that expert-oriented research figures should omit.

Readability encompasses multiple factors: text legibility, visual hierarchy, spatial organization, and cognitive load. Small, pixelated text reduces readability. Cluttered layouts increase cognitive load. Unclear visual hierarchies make it difficult to identify primary versus secondary elements. The research team used vision-language models to assess readability, but these models may not capture all factors that affect human comprehension.

Aesthetics represents the most subjective dimension. Beauty varies across individuals, disciplines, and cultural contexts. The research team grounded aesthetic evaluation in the specific visual language of NeurIPS 2025 papers. Diagrams matching that aesthetic received higher scores. This approach provides consistency but may not generalize to other conferences, journals, or fields with different visual conventions.

The blind human evaluation addressed some subjectivity concerns by collecting direct human judgments. Reviewers saw diagram pairs without knowing which system generated each image. They selected their preference based on overall quality for academic publication. This methodology approximates the real decision researchers face when choosing between figure options. The 72.7% win rate suggests human judges consistently prefer PaperBanana outputs when compared against baseline alternatives.

Competitive landscape and alternatives

Academic illustration tools have existed for years, though most require manual operation. Microsoft Visio, OmniGraffle, and Adobe Illustrator serve as industry standards for creating publication-quality diagrams. These provide full control and produce vector outputs, but demand substantial time and design expertise. Researchers must manually position every element, choose colors, design layouts, and refine typography.

Specialized scientific illustration software like BioRender targets specific domains. BioRender provides pre-designed icons for biological structures, making it easier to create molecular diagrams, cellular processes, and anatomical illustrations. Users drag and drop components rather than drawing from scratch. However, these tools still require human decision-making about composition, layout, and style. They accelerate illustration creation but don't automate it.

AI-powered diagram generation has emerged through various experimental systems. Some research projects have explored automatic flowchart generation from text descriptions or algorithm visualization from code. These typically handle narrower domains than PaperBanana's methodology diagrams, which span diverse scientific fields and illustration types. General-purpose image generation models like DALL-E, Midjourney, and Stable Diffusion can create diagrams from text prompts, but they lack the specialized architectural components - reference retrieval, style guidelines, iterative critique - that PaperBanana employs.

The comparison reveals PaperBanana's positioning. It automates more than traditional tools while specializing more than general AI image generators. The multi-agent architecture provides domain-specific capabilities that generic image generation lacks. The reference retrieval ensures consistency with academic conventions. The style guidelines enforce professional presentation standards. The critic agent catches errors that single-pass generation misses. This combination of automation and specialization differentiates PaperBanana from existing alternatives.

Commercial viability depends on whether researchers value time savings enough to justify subscription costs. A Basic plan at $14.90 monthly for 10 illustrations costs $1.49 per diagram. If creating a diagram manually requires 6 hours at an effective hourly rate of $50 (typical for PhD students or postdocs), the manual cost reaches $300. The 200x price difference suggests substantial value creation. However, adoption also depends on quality thresholds - researchers must trust that automated diagrams meet publication standards.
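
The arithmetic behind that comparison is spelled out below; the six-hour estimate and $50 hourly rate are the assumptions stated above, not measured figures.

```python
# Worked cost comparison using the article's stated assumptions.
credit_cost = 14.90 / 10                  # Basic plan: $14.90 for 10 credits
manual_cost = 6 * 50                      # 6 hours at an assumed $50/hour
ratio = manual_cost / credit_cost
print(f"${credit_cost:.2f} per automated diagram vs ${manual_cost} manual (~{ratio:.0f}x)")
```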

Future trajectory and open questions

The paper identifies several directions for future development. Vector output represents the most frequently requested enhancement, according to the researchers. Converting the current raster generation pipeline to produce SVG or PDF would improve scalability and editability. This likely requires architectural changes, since current image generation models produce pixel arrays rather than structured geometric representations.

Domain adaptation could expand PaperBanana beyond computer science into biology, physics, chemistry, and social sciences. Each field has distinct visual conventions for representing concepts. Biology uses particular icon sets for cellular structures. Physics employs specific notation for force diagrams. Chemistry follows strict conventions for molecular representations. The current system synthesizes style guidelines from NeurIPS papers, but extending this to other disciplines requires curated reference pools from those fields.

Interactive refinement might improve the current batch-oriented workflow. Instead of describing a complete diagram upfront, researchers could iterate through conversational exchanges: "add a connection between the encoder and decoder," "make the attention mechanism more prominent," "use warmer colors for the data flow." This mirrors how designers work with clients through iterative feedback cycles. Implementing such interaction requires maintaining conversation context and enabling precise, localized edits.

Quality control mechanisms need strengthening to catch remaining errors. The 72.7% win rate in human evaluation implies 27.3% of cases where either PaperBanana tied with or lost to baselines. Even a 6.6% loss rate means roughly one in fifteen diagrams fails to meet baseline quality. For researchers preparing manuscripts, this error rate may prove unacceptable if it requires manual diagram creation as fallback. Higher reliability through improved critic agents or additional validation steps could increase adoption.

Integration with writing workflows represents another opportunity. Researchers currently must extract methodology text, paste it into PaperBanana, generate diagrams, download results, and import them into manuscript documents. Tighter integration with tools like Overleaf (LaTeX editor) or Microsoft Word could streamline this process. A plugin might allow in-document diagram generation without leaving the writing environment.

The training data question looms over all AI systems. PaperBanana learned from 292 NeurIPS 2025 papers to synthesize style guidelines and train its retrieval system. This dataset size seems small compared to billion-image training sets used for general image generation. Does the specialized, high-quality academic context compensate for limited quantity? Would expanding to thousands of papers across multiple conferences improve performance? The research doesn't extensively explore these scaling questions.

Summary

Who: Google Cloud AI Research team including Dawei Zhu from Peking University, alongside Rui Meng, Yale Song, Xiyu Wei, Sujian Li, Tomas Pfister, and Jinsung Yoon developed and published the PaperBanana system. The multi-agent framework uses Gemini-3-Pro as its underlying vision-language model with Nano-Banana-Pro and GPT-Image-1.5 for image generation.

What: PaperBanana is an automated academic illustration system employing five specialized AI agents (Retriever, Planner, Stylist, Visualizer, and Critic) working collaboratively to generate methodology diagrams and statistical plots from text descriptions. The system achieved a 72.7% win rate against baseline models in blind human evaluation and demonstrated gains of 2.8 percentage points in faithfulness, 37.2 in conciseness, 12.9 in readability, and 6.6 in aesthetics. Google launched it as a commercial service with monthly subscription plans ranging from $14.90 to $119.90, offering 10 to 250 credits.

When: The research paper was submitted to arXiv on January 30, 2026, with simultaneous launch of the commercial service and project website. The system builds on Google Cloud's November 2025 agentic AI framework guideline and leverages architecture patterns documented throughout 2025.

Where: PaperBanana operates as a web-based service accessible at paperbanana.org, with research documentation at https://dwzhu-pku.github.io/PaperBanana/ and the full technical paper available on arXiv under identifier 2601.23265. The system currently supports English and Japanese language prompts and delivers outputs as 4K-resolution raster images.

Why: Academic illustration creation represents a significant bottleneck in research publishing, often requiring 4-8 hours per high-quality diagram and demanding visual design skills orthogonal to scientific expertise. PaperBanana addresses this by automating the design aspects while preserving researcher control over content, potentially saving 100-200 hours annually for a researcher producing 3-5 publications. The system fits within Google Cloud's broader strategy of converting AI research into revenue-generating products as the agentic AI market approaches a projected $1 trillion valuation by 2040.
