Gracenote, the content intelligence business unit of Nielsen, this week published research showing that a leading large language model, operating without access to external data, fabricated every measured attribute for nearly one in five of the 2,600 movie and TV titles it was tested against. The study is one of the largest structured comparisons yet conducted between grounded and ungrounded AI responses in the entertainment sector.
The report, titled "Plot Holes in AI: Why Ungrounded LLMs Can't Fix Content Discovery," was released on June 10, 2026. It covers titles from 13 countries and evaluates six core metadata attributes: title, description, actors, genres, release year, and runtime. Those are the fields streaming services most commonly display when presenting a movie or TV show to a viewer - the information a person uses to decide whether to watch.
What the study tested and how
Gracenote used two instances of Claude Sonnet 4.0 to generate responses. Both clients received identical instructions on how to find information; the difference was the data source each was permitted to use. The ungrounded client drew exclusively from its training data. The grounded client queried Gracenote's global video dataset via a Video MCP Server - an implementation of the Model Context Protocol that connects an LLM to Gracenote's continuously updated entertainment knowledge graph.
The study sourced the top 100 movies and one episode from the top 100 TV shows in each market, using a grounded Gemini Pro 3.1 client to compile the list in March 2026. The 13 countries tested were Australia, Brazil, Canada, France, Germany, Japan, Mexico, the Netherlands, South Korea, Spain, Sweden, the United Kingdom, and the United States. Responses were scored on attribute-level accuracy and a composite factual quality metric, producing four bands: zero quality, low quality, medium quality, and high quality.
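To make "attribute-level accuracy" concrete, here is a minimal sketch of a field-by-field comparison between a grounded and an ungrounded response. It uses the "Trucker" example the report documents and covers only three of the six attributes. The `TitleRecord` class, the function names, and the equal weighting in `match_rate` are illustrative assumptions; Gracenote's actual scoring formula is not described in this article.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TitleRecord:
    # Hypothetical record shape: a subset of the six attributes the study measured.
    title: str
    release_year: int
    actors: frozenset


def attribute_matches(reference, candidate):
    """Return {attribute name: bool} using exact equality for each field."""
    return {
        f.name: getattr(reference, f.name) == getattr(candidate, f.name)
        for f in fields(reference)
    }


def match_rate(matches):
    """Share of attributes that matched (equal weighting is an assumption)."""
    return sum(matches.values()) / len(matches)


# Values taken from the article's "Trucker" (2024) example.
grounded = TitleRecord(
    "Trucker", 2024,
    frozenset({"Katherine Gibson", "Dare Taylor", "Chuck Cirino"}),
)
ungrounded = TitleRecord(
    "Trucker", 2024,
    frozenset({"Michelle Monaghan", "Nathan Fillion", "Benjamin Bratt"}),
)

matches = attribute_matches(grounded, ungrounded)
print(matches)
print(f"{match_rate(matches):.0%} of compared attributes match")
```

Under these assumptions the title and year match while the cast does not, which mirrors the pattern the report describes for same-name titles.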
According to Gracenote, less than one-third of all ungrounded LLM responses across the 2,600 titles qualified as high quality. In Mexico and the Netherlands, fewer than 10% did. For the United States specifically, nearly half of all responses scored at low or zero quality.
The hallucination count: 506 titles
The headline figure is stark. According to Gracenote, the ungrounded model hallucinated all measured metadata for 506 of the 2,600 titles tested - that is 19.5% of the total sample. These were not cases of partial error or minor factual slip; the model generated entirely fabricated content for every measured field.
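The headline percentage follows directly from the two counts quoted above. A quick check, with variable names chosen only for illustration:

```python
total_titles = 2600        # titles tested across all 13 markets
fully_hallucinated = 506   # titles where every measured field was fabricated

rate = fully_hallucinated / total_titles
print(f"{rate:.1%}")  # 19.5%

# "Nearly one in five" holds because the rate sits just under 20%.
print(rate < 1 / 5)   # True
```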
The country-level breakdown shows significant variation. The Netherlands produced the highest rate of total hallucinations, at 28.3% of its titles - 56 out of the country's 200-title sample. Germany had the lowest, at 9.7%, with 19 titles fully hallucinated. Australia saw 52 titles with 100% hallucinated information, representing 26.5% of its sample. The United States figure sat at 21.5%, covering 43 titles.
These numbers matter because streaming services are beginning to integrate LLMs into their search and recommendation interfaces. The question is not whether AI will be used for content discovery - that transition is already under way - but whether the models powering those interfaces will have reliable information to work with.
Cast accuracy: 53% for top U.S. movies
Actor matching produced some of the lowest accuracy scores in the study. According to Gracenote, the ungrounded model correctly matched primary actors for only 53% of the top 100 U.S. movies when compared against the grounded data. For the broader U.S. title set, the match rate was 56%. The lowest actor match rate across all 13 markets was 34%, recorded in the Netherlands. The highest was 71%, in South Korea.
Genre accuracy was generally higher. Match rates ranged from 73% in Spain to 86% in the United Kingdom. Even so, the figures represent a meaningful error rate for any system expected to help viewers navigate a catalog.
The discrepancy between actor and genre accuracy reflects a structural difference in how LLMs process these categories. Genre labels tend to be broader and more stable; the model may identify a film as a thriller or a comedy with reasonable probability even when it has pulled the wrong content. Cast details are specific and individuated - a wrong actor is simply wrong, with no partial credit available.
Why similar titles cause failures
According to Gracenote, recency is not always the primary source of error. Titles with similar names proved equally disruptive.
The report documents a case involving the 2025 thriller "Heel," about a couple who kidnap a 19-year-old criminal. The ungrounded model matched the title and year correctly, but returned a description, cast, and genre drawn from "Heels," a drama series that ran on Starz from 2021 to 2023. The composite accuracy score for the Heel response was 50%; the factual assessment score was 10%. The actors attributed to the 2025 film - Stephen Amell, Alexander Ludwig, and Alison Luff - are cast members from the Starz series. The actual cast of the 2025 movie included Stephen Graham, Andrea Riseborough, and Anson Boon.
A second example involved the 2024 horror-thriller "Trucker," directed by Errol Sack. The film was released 16 years after a James Mottern-directed movie of the same name from 2008. The ungrounded model returned the year correctly but provided a description, cast, genre classification, and runtime derived from the 2008 film. Its cast response named Michelle Monaghan, Nathan Fillion, and Benjamin Bratt - none of whom appear in the 2024 title. The actual cast comprised Katherine Gibson, Dare Taylor, and Chuck Cirino. The composite accuracy was 35%; the factual assessment 20%.
These examples illustrate a specific failure mode: the model recognizes the title but selects the wrong version of the content from its probability distribution. Because LLMs synthesize responses rather than retrieving discrete records, they have no native mechanism to distinguish between two films sharing a name across a 16-year gap.
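The contrast with record retrieval can be sketched with a toy in-memory catalog. This is not how Gracenote's Video MCP Server works internally; the `catalog` dictionary, its key shape, and both function names are hypothetical, and only the "Trucker" details come from the report. It shows why a lookup keyed on title plus year separates two films that a title-only match conflates.

```python
# Toy catalog keyed on (lowercased title, release year). Hypothetical structure.
catalog = {
    ("trucker", 2008): {
        "director": "James Mottern",
        "cast": ["Michelle Monaghan", "Nathan Fillion", "Benjamin Bratt"],
    },
    ("trucker", 2024): {
        "director": "Errol Sack",
        "cast": ["Katherine Gibson", "Dare Taylor", "Chuck Cirino"],
    },
}


def lookup_by_title(title):
    """Title-only lookup: returns every record sharing the name."""
    return [rec for (name, _year), rec in catalog.items() if name == title.lower()]


def lookup(title, year):
    """Title-plus-year lookup: returns one record, or None if absent."""
    return catalog.get((title.lower(), year))


print(len(lookup_by_title("Trucker")))      # 2 candidates, so the title alone is ambiguous
print(lookup("Trucker", 2024)["director"])  # Errol Sack
print(lookup("Trucker", 1999))              # None: an explicit miss instead of a guess
```

The last call illustrates the other half of the design choice: a keyed store can report that it has no record, where a generative model may produce a plausible-looking answer instead.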
Recent releases expose training cutoff limits
The ungrounded model's blind spots extended to films released close to or after its training cutoff. According to Gracenote, the model was unable to provide any information about "GOAT," a 2026 animated film starring Caleb McLaughlin and Gabrielle Union that earned nearly $200 million globally before arriving on Netflix on May 14, 2026. When queried about the film, the model responded: "I don't have reliable information about a U.S. movie titled 'GOAT' from 2026."
Other titles the model could not address included the 2026 Chris Pratt thriller "Mercy," the 2026 Rachel McAdams-led horror film "Send Help," the 2025 Iranian drama "It Was Just an Accident," the 2025 German science fiction film "Good Luck, Have Fun, Don't Die," and the 2025 Canadian horror movie "Whistle."
The structural explanation, according to Gracenote, is rooted in how frontier models are built and deployed. A new model's training cutoff is never the date it is released. Data curation, deduplication, synthetic data generation, safety alignment, and post-training procedures introduce a multi-month lag between the latest data a model was exposed to and the moment it becomes available to users. After release, the same model may remain in production deployment for months longer. According to Gracenote, at minimum a six-month delay should be assumed before any recently published content can influence a frontier model's training weights.
Hallucination rates tracked closely with content age in English-speaking countries. According to Gracenote, for titles released before 2025, hallucination rates in Australia, Canada, the United Kingdom, and the United States ranged from 11% to 23%. For titles from 2025, those rates climbed considerably across most markets. For 2026 titles, rates reached 96% in South Korea, 95% in the Netherlands, and 86% in Sweden.
Non-English markets showed a different pattern. Hallucination rates for older titles were higher in countries whose languages are less represented in the foundational training data. Spain, for instance, showed a hallucination rate of 70% even for pre-2025 titles, compared with 12% for the United States.
Quality score breakdown
Gracenote applied a four-tier quality scoring system to the complete output for each of the 2,600 titles: zero quality, low quality, medium quality, and high quality. The scores combined attribute matching accuracy with an overall factual assessment.
Across all markets, the aggregated zero, low, and medium quality results ranged from 77% to 91% of responses, meaning genuinely high-quality outputs were a minority everywhere in the study. The United States breakdown showed 37% of responses at zero quality, 11.5% at low quality, 24% at medium quality, and 27.5% at high quality.
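The U.S. band shares quoted above can be checked against the earlier statement that nearly half of U.S. responses scored at low or zero quality. The dictionary name and keys are illustrative:

```python
# U.S. quality-band shares, in percent, as quoted in the article.
us_bands = {"zero": 37.0, "low": 11.5, "medium": 24.0, "high": 27.5}

total = sum(us_bands.values())
low_or_zero = us_bands["zero"] + us_bands["low"]

print(total)        # 100.0, so the four bands account for every response
print(low_or_zero)  # 48.5, i.e. "nearly half"
```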
Australia placed 25.5% of responses in the zero quality band and 23% in the high quality band. The United Kingdom performed comparatively better on zero quality at 18.4%, with 27.6% reaching high quality. Brazil and Spain registered some of the lowest high-quality rates at 18.2% and 14.9% respectively. Mexico and the Netherlands were the weakest overall performers, with high-quality results at 8.7% and 7.6%.
These figures matter specifically for streaming platforms deploying conversational search or recommendation interfaces. A viewer who asks an AI assistant whether a particular actor appears in a given film, or what genre it belongs to, has a meaningful probability of receiving an incorrect answer from an ungrounded system.
The MCP Server as the proposed fix
Gracenote's study is also, explicitly, a case for its own product. The company launched its Video MCP Server in September 2025, which PPC Land covered at the time. The server connects an LLM to Gracenote's entertainment knowledge graph, which covers 40 million titles across 260 streaming catalogs in 70 languages and 80 countries, allowing the model to retrieve factual content data rather than generate probabilistic guesses from training alone.
The grounding approach used in the study is the same architecture Gracenote is offering commercially. According to Gracenote, platforms can access it either through direct data licensing agreements or by connecting their LLMs to the Video MCP Server.
The commercial stakes around this position became clearer in February 2026. On February 10, Gracenote renewed its multi-year strategic partnership with Google to provide entertainment metadata for Google's products, including AI and Gemini experiences. On February 25, Gracenote signed an agreement with Samsung to power LLM-enabled search, conversational content discovery, and lean-back curation across Samsung's global smart TV platform. Both deals position Gracenote's structured content data as the grounding layer for AI-powered entertainment interfaces at scale.
An earlier Gracenote report points in the same direction. Published on April 8, 2026, and based on a survey of more than 4,000 U.S. consumers conducted in January and February, it found that three in four Americans verify AI chatbot answers before acting on them for TV, movie, or sports content. The June 10 accuracy study provides a quantified explanation for that skepticism.
Why this matters for advertising and media buying
"Viewers don't care where a bad answer comes from. If it's wrong, they blame the service," said Tyler Bell, senior vice president of product at Gracenote. "That's why grounding matters. For companies building the next generation of entertainment discovery, generative AI will only deliver on its promise when it is grounded in verified content intelligence that replaces plausible guesses with accurate facts - reducing friction, deepening engagement and strengthening loyalty."
The connection to advertising is direct. Gracenote has been building out a content targeting infrastructure alongside its AI grounding work. In December 2025, it launched Content Connect, a platform enabling agencies, brands, SSPs, and DSPs to execute program-level CTV ad targeting using standardized show metadata. In September 2025, Index Exchange became the first SSP to embed Gracenote contextual intelligence directly into its platform. A report published by Gracenote on May 14, 2026 and covered by PPC Land found that 86% of U.S. media planners would shift linear TV budgets to CTV if show-level data were available to them.
When AI interfaces become the primary entry point into a streaming catalog - and the early adoption signals suggest that transition is accelerating - the accuracy of the metadata those interfaces return determines not just the viewer experience but the commercial performance of the content itself. A hallucinated cast list or a wrong genre classification affects not only whether a viewer finds the right film, but whether advertisers can reach the intended audience in the right context. According to a 2026 Gracenote survey, 66% of consumers believe AI will be important in providing good entertainment experiences.
The MCP server format that Gracenote uses in this study has become a recognized infrastructure layer across advertising technology more broadly. PPC Land has tracked Amazon's MCP Server rollout, FreeWheel's MCP deployment for premium video, and Meta's own AI connectors as part of a broader shift toward agent-accessible advertising infrastructure.
According to a separate 2025 Gracenote global streaming consumer survey, 84% of streaming video viewers say the overall user experience of a streaming service is important to them. Another 2025 Gracenote survey found that only 64% of streaming video viewers know what they want to watch when they turn on their TVs, placing enormous weight on what the search and discovery layer returns.
Findings scheduled for StreamTV Show presentation
Gracenote will present the report's findings at the StreamTV Show on June 18, 2026, in Denver. Nandita Arora, senior director of product at Gracenote, will join a panel session titled "Reimagining Content Discovery," which will address how AI, personalization, unified search, and new user experience approaches are shaping how streaming services connect viewers with content.
Timeline
- September 3, 2025: Gracenote launches its Video MCP Server, connecting LLMs to its entertainment knowledge graph to mitigate hallucinations in TV platform search
- September 16, 2025: Index Exchange becomes the first SSP to integrate Gracenote contextual intelligence, embedding brand safety segments and Do-Not-Air controls
- October 2025: Gracenote publishes research highlighting the contextual targeting gap in CTV advertising, based on a survey of 600 U.S. brand and agency executives
- December 4, 2025: Nielsen's Gracenote launches Content Connect, enabling agencies, brands, SSPs, and DSPs to execute program-level CTV ad targeting
- December 17, 2025: Gracenote expands its On Sports platform, linking sports documentaries and shoulder content across 160 leagues in more than 50 countries
- February 10, 2026: Gracenote renews its multi-year strategic partnership with Google to support entertainment metadata for AI and Gemini experiences
- February 25, 2026: Gracenote and Samsung Electronics announce an agreement to power LLM-enabled search and discovery across Samsung's global smart TV platform
- March 2026: Gracenote uses a grounded Gemini Pro 3.1 client to compile the list of top titles across 13 markets for the hallucination study
- April 8, 2026: Gracenote publishes "TV Search and Discovery in the AI Era," finding three in four U.S. consumers verify AI chatbot answers about entertainment content
- May 14, 2026: Gracenote publishes "TV Audiences Have Shifted. Ad Dollars Have Not," with 86% of U.S. media planners saying they would shift linear budgets to CTV with show-level data
- June 10, 2026: Gracenote releases "Plot Holes in AI: Why Ungrounded LLMs Can't Fix Content Discovery," documenting 19.5% full hallucination rates across 2,600 titles in 13 countries
- June 18, 2026: Nandita Arora, senior director of product at Gracenote, is scheduled to present findings at the StreamTV Show in Denver
Summary
Who: Gracenote, the content intelligence business unit of Nielsen. Tyler Bell, senior vice president of product, commented on the findings. Nandita Arora, senior director of product, will present them publicly on June 18, 2026.
What: A study comparing the accuracy of an ungrounded large language model with one grounded in Gracenote's entertainment data via an MCP server, across 2,600 movie and TV titles in 13 countries. The ungrounded model hallucinated all measured metadata for 506 titles - 19.5% of the total. For the top 100 U.S. movies, actor accuracy stood at 53%. Less than one-third of all responses across the study qualified as high quality.
When: The report was published on June 10, 2026. The title list was compiled in March 2026. The StreamTV Show presentation is scheduled for June 18, 2026, in Denver.
Where: The study covered 13 countries: Australia, Brazil, Canada, France, Germany, Japan, Mexico, the Netherlands, South Korea, Spain, Sweden, the United Kingdom, and the United States. The grounded client accessed Gracenote's global video data through an MCP server. The ungrounded client was restricted to training data only.
Why: Streaming services are deploying LLMs for conversational search and content recommendations, but ungrounded models do not have reliable access to current or accurate entertainment metadata. The study quantifies the accuracy gap and argues that grounding - connecting LLMs to verified external data sources - is a prerequisite for reliable AI-powered content discovery at scale.