Google reveals Gemini multimodal advances in July 2025 podcast
Enhanced video processing capabilities and spatial understanding unlock new AI applications.

Google showcased advanced multimodal capabilities in Gemini during a detailed technical podcast released on July 3, 2025. According to Ani Baddepudi, multimodal Vision product lead for Gemini, "Gemini from the beginning was built to be a multimodal model" with the goal that "these models should be able to see and perceive the world like we do."
The discussion highlights Gemini 2.5's enhanced video understanding performance. According to Baddepudi, "previous Gemini models, they were pretty good at video, but robustness was a bit of an issue. So one of the issues that we had, for example, was if you fed in an hour-long video to a model, the model would focus in on the first 5 or 10 minutes and then trail off for the rest of the video."
Summary
Who: Ani Baddepudi, multimodal Vision product lead for Gemini and newly appointed product lead for Gemini model behavior, representing Google's extensive multimodal research team led by JB with workstream leads across image, video, and spatial understanding capabilities.
What: Google unveiled Gemini 2.5's advanced multimodal capabilities including enhanced video understanding with improved robustness for hour-long content, tokenization efficiency reducing frame representation from 256 to 64 tokens, spatial understanding with reasoning integration, video-to-code generation, document processing with layout preservation, and proactive assistance paradigms for natural AI interaction through the "everything is Vision" concept.
When: The comprehensive technical podcast was released on July 3, 2025, following systematic rollouts of AI Mode capabilities beginning March 5, 2025, and culminating in Workspace integration announced July 2, 2025, representing ongoing development of multimodal capabilities since Gemini 1.0.
Where: The capabilities are deployed across Google's ecosystem including AI Studio, Gemini API, AI Mode in search, Google Workspace accounts in the United States, and international markets including India as of June 24, 2025, with plans for continued global expansion.
Why: The multimodal advancement supports Google's vision for artificial general intelligence capable of understanding and interacting with the world through multiple sensory modalities like human perception, enabling natural human-AI interaction while addressing enterprise productivity needs, maintaining competitive positioning in the AI market with $75 billion infrastructure investments planned for 2025, and creating new opportunities for developers and businesses to build innovative applications using advanced visual understanding capabilities.
Technical architecture enables unified multimodal processing
Gemini's native multimodal design processes multiple data types simultaneously. "We have a single model that's trained to be multimodal from the ground up," Baddepudi explained. "At a high level, what this means is text, images, video, audio, all these modalities are turned into a token representation and the model is trained on all this information together."
The system addresses inherent information loss challenges during conversion. "When we turn images into token representations, we inherently lose some information from the image," according to Baddepudi. "This is a constant research question of, how do we make our image representations less lossy?"
Enhanced tokenization efficiency in Gemini 2.5 reduces computational requirements while maintaining performance quality. "We've now released more efficient tokenization. So these models can do up to six hours of video with [a] 2 million [token] context," Baddepudi noted. The optimization represents each frame with 64 tokens instead of the 256 used previously.
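A rough back-of-the-envelope calculation shows why the smaller frame representation matters. The sketch below uses only the figures quoted above (64 versus 256 tokens per frame, a 2-million-token context) plus one assumption not stated in the podcast: that video is sampled at one frame per second.

```python
# Rough token budget for long-video input, using the figures quoted above.
# Assumption (not from the podcast): video is sampled at 1 frame per second.
FRAMES_PER_SECOND = 1
TOKENS_PER_FRAME_OLD = 256    # earlier Gemini frame representation
TOKENS_PER_FRAME_NEW = 64     # Gemini 2.5 frame representation
CONTEXT_WINDOW = 2_000_000    # tokens

def video_tokens(hours: float, tokens_per_frame: int) -> int:
    """Tokens consumed by the visual frames of an `hours`-long video."""
    frames = int(hours * 3600 * FRAMES_PER_SECOND)
    return frames * tokens_per_frame

six_hours_new = video_tokens(6, TOKENS_PER_FRAME_NEW)  # 1,382,400 tokens
six_hours_old = video_tokens(6, TOKENS_PER_FRAME_OLD)  # 5,529,600 tokens

print(f"6h at 64 tokens/frame:  {six_hours_new:,} tokens "
      f"({CONTEXT_WINDOW - six_hours_new:,} left for prompt, audio, output)")
print(f"6h at 256 tokens/frame: {six_hours_old:,} tokens (exceeds the 2M window)")
```

Under that assumption, six hours of frames fit comfortably inside the 2-million-token window at 64 tokens per frame, while the older 256-token representation would overflow it several times over.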
Video understanding reaches state-of-the-art performance
Gemini's enhanced video capabilities enable new applications across education and development sectors. "One really cool example that we highlight in the blog post is the ability to turn videos into code," according to Baddepudi. "You can turn videos into animations, you can turn videos into websites."
Practical applications include recipe conversion and educational content transformation. "I fed in a YouTube video of a recipe and turn that into a step-by-step recipe. A use case that we see people using a lot is videos of lectures and turning that into lecture web pages and lecture notes and stuff," Baddepudi described.
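For developers, this kind of video-to-structured-output workflow is exposed through the Gemini API. The following is a minimal sketch assuming the google-genai Python SDK; the file name, prompt wording, and model string are placeholders, and the exact upload call may vary between SDK versions.

```python
# Minimal sketch: turn a cooking video into a step-by-step recipe.
# Assumes the google-genai Python SDK; file path and model name are placeholders.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Upload the video via the Files API (large videos may take a moment to process).
video = client.files.upload(file="cooking_demo.mp4")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        video,
        "Watch this cooking video and write a numbered, step-by-step recipe "
        "with an ingredient list and approximate timings for each step.",
    ],
)
print(response.text)
```

The same pattern applies to the lecture-notes use case Baddepudi describes: only the prompt changes, not the call structure.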
The system demonstrates remarkable capability transfer between different functions. "One of the cool things about this 2.5 launch is things like video to code work really well because the 2.5 model's just a lot stronger at code," according to the technical discussion.
Spatial understanding combines reasoning with visual detection
Gemini's spatial capabilities extend beyond traditional computer vision through integrated reasoning. "The cool thing about Gemini being able to do detection is you have this reasoning backbone and world knowledge as well," Baddepudi explained. "Some of the cool things Gemini can do, this is a very simple example, but just [asking] Gemini to detect the person that's the furthest to the left in this image."
Advanced spatial understanding enables complex contextual analysis. Baddepudi provided an example: "I took an image of the fridge in our micro kitchen and I was like, which drink has the fewest calories? And it generated [a] bounding box around the bottle of water."
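Developers can exercise this capability by asking the model to return bounding boxes directly. The sketch below assumes the google-genai SDK and the commonly documented convention of box coordinates normalized to a 0-1000 scale; the prompt wording and JSON shape are illustrative, not an official schema.

```python
# Minimal sketch: ask Gemini for a bounding box around an object in an image.
# Assumes the google-genai SDK; the 0-1000 normalized coordinate convention
# follows Google's documentation but should be verified for your model version.
import json

from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("fridge.jpg", "rb") as f:
    image = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        image,
        "Which drink has the fewest calories? Return only raw JSON (no Markdown) "
        "with keys 'label' and 'box_2d', where box_2d is "
        "[ymin, xmin, ymax, xmax] normalized to 0-1000.",
    ],
)

# The prompt asks for raw JSON; in practice the reply may still need cleanup
# (for example stripping Markdown code fences) before parsing.
detection = json.loads(response.text)
print(detection["label"], detection["box_2d"])
```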
The technology supports robotics and embodied AI development. "If robots and self-driving cars have AI systems like Gemini powering embodied reasoning, that unlocks a ton of use cases," according to Baddepudi. These capabilities represent long-term development priorities for artificial general intelligence applications.
Document processing transforms enterprise workflows
Document understanding capabilities address fundamental business information processing challenges. "A ton of information is stored in documents. So it's very clear that documents is a powerful medium of information that Gemini should be really good at analyzing and reasoning over," Baddepudi stated.
Traditional document workflows required separate optical character recognition before AI processing. "In the past with documents, the way these workflows worked was users would OCR a document and then feed that as text into AI model, AI systems," according to the presentation. Gemini's approach enables direct document understanding while preserving formatting and visual elements.
Complex document analysis demonstrates the system's enterprise capabilities. "I fed in earnings reports from companies over the last 10 quarters with a million-token context, which is tens of thousands of pages with, yeah, 2 million tokens and got it to do a bunch of analysis on these companies for me," Baddepudi described.
The technology preserves document structure during processing. "Gemini is amazing at what we call layout-preserving transcription. So it can transcribe a document and preserve layout, style, structure," according to the technical discussion.
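A hedged sketch of how a developer might request layout-preserving transcription through the API follows, again assuming the google-genai SDK; the file name, model string, and prompt are placeholders rather than an official feature flag.

```python
# Minimal sketch: feed a PDF to Gemini and ask for a layout-preserving transcription.
# Assumes the google-genai SDK; file name, model name, and prompt are placeholders.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

report = client.files.upload(file="q3_earnings_report.pdf")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        report,
        "Transcribe this document to Markdown, preserving headings, tables, "
        "and reading order, then summarize revenue trends across quarters.",
    ],
)
print(response.text)
```

Because the document goes to the model directly, no separate OCR pass is needed, which is the workflow change Baddepudi contrasts with earlier pipelines.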
Proactive assistance represents future interaction paradigm
Google envisions AI systems that move beyond current turn-based interactions toward proactive assistance. "Today, most AI products are turn based. So you query the model or even a system, you get back an answer, query the model again, you get back an answer and you repeat that process," Baddepudi observed.
The proposed model resembles natural human assistance patterns. "One way that I like to think of what new products could look like is imagine you had an expert human looking over your shoulder and seeing what you can see and helping you with things," Baddepudi explained.
Practical demonstrations include cooking assistance through visual monitoring. "I was cooking and previously I would've had to follow a step-by-step recipe and try and pattern match what I'm doing to the recipe," Baddepudi described. "Something cool that Gemini can do is it looks at what you're doing as you're doing it and then proactively, based on visual cues in the video, suggests for things to do."
Interface development represents the primary implementation challenge. "I think the core problem is developing the interfaces. And I think we've moved towards this world of glasses, and we're working on these things at Google as well," according to Baddepudi.
Development strategy balances immediate needs with future capabilities
Google's multimodal development encompasses three distinct categories according to Baddepudi's framework. "The first are use cases that we see are critical today for users and customers. So folks using the API, so developers, Google products that use Gemini for, yeah, multimodal Vision use cases."
Long-term aspirational capabilities focus on artificial general intelligence foundations. "These are things that people aren't asking us for Gemini to be able to do today, but we think are very critical for building powerful AI systems, AGI and so on," Baddepudi explained. Visual reasoning represents a key example: "we see early signs of this with the Gemini 2.5 model, is this ability to reason over pixels."
Emergent capabilities arise from general model improvements rather than targeted development. "We didn't plan particularly for these models to be this amazing at image to code and video to code, but this turned out to be a super strong capability with 2.5," according to Baddepudi.
Unified architecture enables capability transfer across modalities
Traditional systems required separate models for different vision tasks. "In the past, a ton of these, you would've had separate models for Vision capabilities, a separate OCR model, a separate detection, segmentation model and so on," Baddepudi noted. "The cool thing now is all of this is bundled into Gemini and that results in a ton of cool use cases."
Complex applications demonstrate integrated capabilities. "One use case that we're, yeah, super excited about is using Gemini as a peer programmer. So we stream in a video of your IDE to Gemini, ask it questions about your code base, get answers and so on," Baddepudi described. This requires "strong coding capabilities, strong just core Vision, which is spatial understanding OCR, but then also the ability to understand a video and information in a video across time horizons."
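The scenario Baddepudi describes uses live video, but the idea can be approximated without streaming by sending a single screenshot of the editor alongside a question. The sketch below is that simplified approximation, assuming the google-genai SDK; it is not the live, continuous setup discussed in the podcast, and the file name and model string are placeholders.

```python
# Simplified, non-streaming approximation of the "peer programmer" scenario:
# send one IDE screenshot plus a question instead of a continuous video stream.
# Assumes the google-genai SDK; file name and model name are placeholders.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("ide_screenshot.png", "rb") as f:
    screenshot = types.Part.from_bytes(data=f.read(), mime_type="image/png")

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[
        screenshot,
        "This is my editor. Explain what the highlighted function does and "
        "suggest a fix for the failing assertion shown in the test panel.",
    ],
)
print(response.text)
```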
Marketing implications for visual search transformation
The technical advances reflect broader changes in search behavior patterns. AI Mode implementations demonstrate users adapting to more complex multimodal queries when interfaces support natural expression.
Baddepudi emphasized the paradigm shift: "we see a world where everything is Vision and these models can see your screen, see the world just like we can, but they're also domain experts in every field, which I think is a future that I'm super excited by."
The "everything is Vision" concept transforms content strategy requirements for marketing professionals. Visual search capabilities mean traditional text optimization may prove insufficient as users increasingly express information needs through multiple modalities.
Enterprise adoption accelerates through practical applications
Google Workspace integration brings multimodal capabilities to business environments where visual analysis provides immediate productivity benefits. Document processing, visual cataloging, and automated analysis address fundamental enterprise workflow challenges.
Information accessibility represents a core value proposition aligned with Google's mission. "I think it unlocks Vision as a store of information and makes visual information so much more accessible and useful to Google's mission," Baddepudi explained.
Practical applications include library cataloging and inventory management. "I took a video of all the books in this library and asked Gemini to catalog these books by genre, by author. And because Gemini has this world knowledge, reasoning backbone, but then also these visual capabilities in a single model, I was able to do this super well," according to Baddepudi.
Model behavior evolution toward natural interaction
Baddepudi's transition to model behavior development addresses interaction naturalness challenges. "Something that I think is a very important problem is having these models feel like they're natural to interact with," he stated. "Something that I'm passionate about is building AI systems that feel likable, that you can interact naturally with."
The approach involves developing personality while maintaining technical capabilities. "Going into more detail on the model behavior stuff, how this translates is giving the model skills like empathy and being able to understand the user, understand implied intent, giving the model a personality whilst striking the balance of all of these, yeah, amazing raw capabilities that Gemini has," Baddepudi explained.
Timeline
- December 6, 2023: Google announces Gemini AI model as most capable multimodal system to date
- December 20, 2024: Google DeepMind unveils Project Astra as next-generation AI assistant with visual, auditory, and language capabilities
- February 13, 2025: Google adds cross-chat memory capabilities to Gemini Advanced for paying subscribers
- March 5, 2025: Google reveals AI Mode as experimental feature for complex queries using Gemini 2.0 technology
- April 15, 2025: Google expands multimodal search capabilities allowing image-based queries in AI Mode
- May 20, 2025: Google launches AI Mode with virtual try-on using personal photos and advanced shopping features
- June 5, 2025: Google announces $75 billion AI spending plan for 2025 infrastructure investments
- June 24, 2025: Google launches AI Mode in India with multimodal search capabilities
- July 2, 2025: Google extends AI Mode to Workspace accounts in the United States
- July 3, 2025: Google releases detailed technical podcast featuring Ani Baddepudi discussing Gemini 2.5 multimodal capabilities and future development priorities