LLM tracking tools face accuracy crisis from personalization features
Personalization in ChatGPT and other AI platforms creates major challenges for LLM tracking tools as responses vary by user location and preferences.
SEO professional Lily Ray revealed on November 3, 2025, that personalization features in large language models create fundamental accuracy problems for tracking tools designed to monitor brand visibility across AI platforms. The issue stems from how ChatGPT and similar systems rewrite queries based on individual user data, producing responses that differ substantially from what tracking tools report.
Ray, Vice President of SEO Strategy & Research at Amsive, shared her findings in a detailed post on X at 11:15 PM on November 3. She documented how ChatGPT transforms user prompts based on stored information about location, interests, and previous interactions. This personalization mechanism causes the same query to generate different results for different users.
"Personalization in LLMs definitely presents a major challenge to LLM tracking tools. (On top of other inconsistencies, like non-deterministic responses.)," Ray wrote in the post, which accumulated 11,700 views within hours of publication.
The technical challenge Ray identified occurs when tracking tools test queries to measure brand visibility. ChatGPT processes the prompt "Where to shop for trusted brands in electronics and computing?" differently depending on who asks. For Ray's account, the system converted the query to "electronics retail trusted brands NYC store" based on her New York City location data. The resulting brand recommendations and citations differed entirely from what tracking tools reported for the same original prompt.
This geographic customization represents one layer of a complex personalization system. ChatGPT accesses user profiles built from conversation history, custom instructions, and location data collected through IP addresses. OpenAI's shopping features, which launched for German users on April 28, 2025, rely heavily on these personalization mechanisms to provide relevant product recommendations.
The system analyzes past conversations to understand user preferences. Custom instructions serve as persistent user-defined preferences across sessions. These systems create powerful personalization capabilities, allowing ChatGPT to internally rewrite generic queries into highly specific searches without requiring users to repeatedly specify preferences. Location data processing further enhances product relevance.
Ray observed that tracking tools provide visibility reports based on broad queries tested from their own accounts. Marketing teams using these reports assume the data represents what their customers see. The reality proves more complicated. Each user receives customized responses influenced by dozens of factors the tracking tools cannot replicate.
"I'm currently spot checking various prompt/answer combinations from LLM tracking tools against the responses I receive for the same prompts in ChatGPT," Ray explained in her post. "The answers almost never match up, in large part because ChatGPT is personalizing and rewriting the prompts based on what it knows about me."
The tracking tool market has expanded rapidly throughout 2025. Amplitude introduced AI Visibility on October 30, enabling marketers to monitor brand presence across ChatGPT, Claude, and Google AI Overview with conversion tracking. Adobe launched LLM Optimizer on October 14, providing businesses tools to monitor, measure, and enhance discoverability in generative AI interfaces. Semrush documented tripling its AI share of voice in one month through systematic optimization, publishing detailed case study results on October 17.
These platforms charge substantial fees for tracking services. Adobe's solution operates as a standalone application and also integrates with Adobe Experience Manager Sites. Ubersuggest offers AI search optimization through plans starting at $29 per month. The tools promise to help brands understand their visibility across AI platforms that generated 1.1 billion referral visits in June 2025, according to Similarweb estimates.
The personalization problem undermines these measurement systems. Marketing professionals invest in tracking tools expecting reliable data about brand mentions and rankings. Instead, they receive snapshots that may bear little resemblance to what actual customers experience. A brand appearing prominently in tracking tool reports might be invisible to users in different locations or with different conversation histories.
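The gap between a tracking report and what an individual user sees can be expressed as a simple overlap score. The sketch below is illustrative only: `brand_overlap` is a hypothetical helper, and the brand lists are invented placeholders rather than data from Ray's post or from any tracking tool. It computes the Jaccard similarity between the brands a tool reports for a prompt and the brands a real account receives for the same prompt.

```python
def brand_overlap(reported, observed):
    """Jaccard similarity of two brand lists: 0.0 = no shared brands, 1.0 = identical sets."""
    reported_set = {name.strip().lower() for name in reported}
    observed_set = {name.strip().lower() for name in observed}
    union = reported_set | observed_set
    if not union:
        return 1.0  # two empty answers count as identical
    return len(reported_set & observed_set) / len(union)


# Placeholder data (assumption): brands named in a tracking tool report
# versus brands named in the answer a logged-in user actually received.
tool_report = ["Brand A", "Brand B", "Brand C"]
user_answer = ["brand c", "Brand D", "Brand E"]

score = brand_overlap(tool_report, user_answer)
print(f"overlap: {score:.2f}")
```

A score near zero across many prompts would be consistent with answers that "almost never match up", while a score near one would suggest the report is a fair proxy for that particular user.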

Google Gemini implemented similar personalization features on August 13, 2025, automatically enabling a "Personal Context" feature that analyzes user conversations to extract preferences, interests, and behavioral patterns. The system activates by default without requiring explicit consent. Users who prefer to maintain conversation privacy must manually disable the functionality through application settings.
Memory features now extend beyond chat into browsing, tracking user activity across sessions to enable personalized recommendations and automated task completion. ChatGPT's Atlas browser, announced in mid-October, uses browser memory to search web history and locate documents based on conversational queries. The system generates suggestions on the homepage based on user patterns and predicted tasks.
Industry experts responding to Ray's post acknowledged the challenge lacks straightforward solutions. Andrea Volpini, founder of WordLift, commented: "There is no reliable way to track LLM users. At best you can try with broad personas. The model reasons on a context-warped manifold, so prediction are unstable. You only get decent accuracy on obvious intents, like 'best jeans brands' as you mentioned the other day."
Gael Breton, founder of Authority Hacker, described the fundamental problem differently: "Honestly the idea of 'tracking mentions' feels backwards to me. A slight change in system prompt, memories etc completely changes the result anyway."
The technical architecture of modern AI systems explains these inconsistencies. Large language models use probabilistic methods to generate responses. The same prompt can produce different outputs even without personalization, a characteristic known as non-deterministic behavior. Personalization adds another layer of variability that compounds the measurement challenge.
Mike Fishbein noted in his response that AI chat applications pull in "memory" to personalize outputs, but similar context can be added to user message prompts to get similar responses. He highlighted local shopping as an area where personalization becomes particularly important. The geographic component Ray identified fits this pattern - tracking tools testing electronics shopping queries from their own locations cannot predict what users in other cities see.
Ben Young questioned whether tracking tools can achieve reliable predictions at scale given that the average user prompt contains 42 words. This length allows for substantial semantic variation and personalization opportunities compared to traditional short keyword queries.
Ray suggested a potential approach: "I'd love to hear how some of the tracking companies plan to tackle this challenge. I imagine there are ways to create 'personas' within LLMs to track specific audiences, but I'm not sure how much that solves the problem either, given no two users are exactly the same."
The persona concept involves creating multiple test accounts with different characteristics - varied locations, conversation histories, and custom instructions - to sample the range of possible responses. This approach increases coverage, but the number of personas multiplies with every variable added. A brand wanting to understand visibility across five major cities, three age demographics, and four interest categories would need 5 × 3 × 4 = 60 different test personas.
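The persona arithmetic can be reproduced in a few lines. This is a minimal sketch, not a description of how any tracking vendor builds personas: the segment values and the `persona_count` helper are hypothetical, chosen only to match the five-by-three-by-four example.

```python
from itertools import product
from math import prod

# Hypothetical segment values (assumption), sized to match the example above.
cities = ["New York", "Chicago", "Houston", "Phoenix", "Los Angeles"]
age_groups = ["18-34", "35-54", "55+"]
interests = ["gaming", "home office", "photography", "smart home"]

# One persona per combination of city, age group, and interest.
personas = [
    {"city": city, "age_group": age, "interest": interest}
    for city, age, interest in product(cities, age_groups, interests)
]


def persona_count(*segment_sizes):
    """Number of personas needed to cover every combination of segments."""
    return prod(segment_sizes)


print(len(personas))              # personas in the full grid
print(persona_count(5, 3, 4, 2))  # grid size after adding a two-valued variable
```

Adding a single two-valued variable, such as custom instructions switched on or off, doubles the grid from 60 to 120 personas, which is why full coverage becomes costly so quickly.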
Krishna, a marketing professional who participated in the discussion, suggested the most valuable product might indeed be close to what Ray described: building personas for target customers and tracking based on those profiles. However, he acknowledged that personalization and probability issues never disappear regardless of methodology.
Jonathan Froidefond questioned whether clients are actually using LLM tracking tools, suggesting "it's too early to be worthwhile." Ray confirmed that her agency has used these tools for months, indicating some marketing organizations have committed to tracking despite the limitations.
David McSweeney proposed an alternative measurement approach: "I suspect they don't have the answer. Here's a thought though - the best 'prompt tracking panel' for any brand would be its own customers (they are the natural demographic). A brand could perhaps offer incentives (discounts, giveaways) for genuine, anonymized data sharing."
This customer-based tracking model would capture actual user experiences rather than attempting to simulate them. Brands could recruit panels of real customers to report what they see when asking questions about product categories. The approach introduces sampling challenges but avoids the personalization mismatch that undermines current tracking tools.
Masterdai took a harder stance, arguing tracking companies cannot solve the problem: "If they try to apply the old SEO approach of exhaustively crawling and tracking search results to LLMs, it simply won't work. Given that current LLMs are black boxes, I believe tracking methods from GEO companies are snake oil."
The comparison to traditional search engine optimization reveals why tracking tools developed for that environment struggle with AI platforms. Google search results vary by location and personalization factors, but the variations remain relatively constrained. A small number of test queries from different locations provides reasonable coverage of what most users see. AI platforms with extensive memory systems and custom instructions create far more variation.
Iqbal Abdullah emphasized the additional complexity from open models: "I have never thought 'LLM tracking tools' would work. On top of the personalization, We also have 'open models' now which can possibly use totally different data which can be switched on and be used in real-time. There is no one way you can always get the same answer."
The open model ecosystem includes alternatives like Llama and Mistral that organizations can deploy with custom training data. Marketing teams tracking brand mentions across commercial platforms face limited scope compared to the actual landscape where AI-powered search occurs.
Ray concluded her assessment with pragmatic advice: "As with all things LLM tracking - the data is meant to be used directionally! And unless we get something like an Open AI Search Console or any type of AI search analytics in GSC, directional data is the best we have."
This directional approach acknowledges that tracking tools cannot provide precise measurements. Brands can identify rough trends - whether mentions are increasing or decreasing, whether competitor comparisons favor their products, whether optimization efforts correlate with visibility changes. The data offers signals rather than definitive metrics.
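Directional use of the data can be sketched as comparing mention rates across repeated samples instead of trusting any single answer. Everything in this example is an assumption for illustration: the responses are invented, "Acme" is a placeholder brand, and the 10-point `min_change` threshold is an arbitrary cutoff for treating small swings as noise.

```python
def mention_rate(responses, brand):
    """Share of sampled responses that mention the brand (case-insensitive)."""
    if not responses:
        return 0.0
    hits = sum(brand.lower() in text.lower() for text in responses)
    return hits / len(responses)


def direction(before, after, brand, min_change=0.10):
    """Label the trend; changes smaller than min_change are treated as noise."""
    delta = mention_rate(after, brand) - mention_rate(before, brand)
    if delta >= min_change:
        return "up"
    if delta <= -min_change:
        return "down"
    return "flat"


# Invented sample responses for the same prompt in two periods (assumption).
last_month = ["Try Acme or Globex.", "Globex is popular.", "Consider Initech.", "Globex."]
this_month = ["Acme leads.", "Acme and Initech.", "Globex.", "Many choose ACME."]

print(mention_rate(last_month, "Acme"), mention_rate(this_month, "Acme"))
print(direction(last_month, this_month, "Acme"))
```

Because each response is both personalized and non-deterministic, the label is only as good as the sample behind it; larger and more varied samples give a steadier signal, but it remains a signal, not a ranking.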
Research published in October 2025 examining 12 months of first-party data from 973 e-commerce websites with $20 billion combined revenue found that ChatGPT referrals generated lower conversion rates and revenue per session than traditional channels. The study analyzed over 50,000 transactions from ChatGPT referrals alongside 164 million transactions from conventional sources between August 2024 and July 2025.
These conversion metrics provide actual business impact data that personalization issues cannot distort. While brands may struggle to understand exactly what different users see in AI responses, they can measure whether AI platform referrals convert into purchases. This outcome-based approach complements visibility tracking with concrete performance indicators.
The personalization challenge Ray identified reflects broader tensions in AI platform development. Companies like OpenAI and Google tout personalization as a key advantage that makes their assistants more useful. Marketing professionals need standardized visibility that tracking tools can measure reliably. These objectives conflict at a fundamental level.
HubSpot acquired XFunnel on October 31 to strengthen its answer engine optimization capabilities. The acquisition announcement noted that AI-driven leads at HubSpot convert three times better than traditional search leads, underscoring the growing importance of visibility within AI-powered search platforms despite measurement challenges.
The technical infrastructure supporting personalization continues expanding. Amazon introduced Help Me Decide on October 23, using large language models to analyze browsing history and recommend products with one tap for millions of users. The system analyzes customer browsing activity, searches, shopping history, and preferences using AWS services including Amazon Bedrock, Amazon OpenSearch, and Amazon SageMaker.
As major platforms invest in personalization capabilities, the measurement challenge intensifies. Marketing professionals must develop strategies that acknowledge the limitations Ray documented while still extracting useful insights from available data. The alternative - abandoning AI visibility efforts entirely - becomes increasingly untenable as consumer search behavior shifts toward conversational platforms.
The discussion Ray initiated on November 3 exposed a problem the tracking tool industry has not adequately addressed. Companies selling these services emphasize their capabilities for measuring brand visibility and competitive positioning. Few acknowledge that personalization reduces their measurements to simplified snapshots rather than comprehensive pictures of actual user experiences.
Marketing organizations investing in AI visibility optimization need realistic expectations about what tracking tools can deliver. The data provides directional insights that inform strategy but lacks the precision that traditional SEO analytics offered. Brands comparing visibility across different queries or tracking changes over time can identify patterns. Claims about absolute visibility or detailed competitive comparisons should be viewed skeptically given the personalization factors Ray documented.
The path forward likely involves multiple complementary approaches. Tracking tools using broad test accounts provide baseline visibility data. Persona-based testing samples variation across key demographic and geographic segments. Customer panels report actual experiences. Referral traffic and conversion analytics measure business outcomes. Together, these methods paint a more complete picture than any single approach provides.
Ray's analysis highlights a maturing phase for AI search optimization. Early adopters implemented strategies based on limited understanding of how these platforms actually work. As practitioners like Ray document the technical realities, marketing professionals can develop more sophisticated approaches that account for platform complexities rather than assuming simplified models.
The measurement challenges do not eliminate the strategic importance of AI platform visibility. Consumer adoption continues growing - ChatGPT processes over 1 billion weekly searches according to recent estimates. Marketing professionals redirecting budgets to mobile apps as AI reshapes search behavior face strategic decisions about channel investment even when measurement capabilities remain imperfect.
The tracking tool companies Ray's discussion implicitly challenges face pressure to address these limitations transparently. Organizations paying for visibility reports deserve clear understanding of what the data represents and what it cannot capture. The persona-based approach Ray suggested offers one path toward more representative sampling, though implementation costs may prove substantial.
Industry standards for AI visibility measurement have not yet emerged. Traditional web analytics evolved over decades to establish common metrics and methodologies. AI platform measurement remains in early stages where basic questions about what to measure and how to interpret results lack consensus answers. Ray's contribution moves the discussion forward by documenting specific technical factors that undermine current approaches.
For marketing professionals, the practical implication involves treating AI visibility data as one input among many rather than a definitive metric. Brands should invest in tracking tools understanding their directional value while developing complementary measurement approaches. The conversion data from actual AI referral traffic provides ground truth that visibility reports approximate but cannot replace.
The personalization challenge Ray identified will likely persist as AI platforms continue developing more sophisticated user modeling capabilities. Marketing measurement methodologies must evolve to accommodate platforms where user experiences vary substantially based on factors tracking tools cannot easily replicate. This evolution requires both technical innovation in measurement approaches and realistic expectations about achievable precision.
Timeline
- April 28, 2025: OpenAI announces ChatGPT shopping features for German users with personalization based on location and conversation history
- June 5, 2025: Microsoft releases research showing AI-powered search increases purchase rates by 53%
- July 28, 2025: Similarweb unveils GenAI Intelligence Toolkit combining visibility tracking with traffic measurement
- August 13, 2025: Google Gemini implements Personal Context feature analyzing user conversations by default
- September 16, 2025: OpenAI announces ChatGPT search improvements with better shopping intent detection
- October 14, 2025: Adobe launches LLM Optimizer for enterprise AI visibility tracking
- October 17, 2025: Semrush publishes case study revealing tripled AI share of voice
- October 23, 2025: Amazon introduces Help Me Decide using AI to analyze browsing history for product recommendations
- October 30, 2025: Amplitude introduces AI Visibility tool with competitive tracking across ChatGPT, Claude, and Google AI Overview
- October 31, 2025: HubSpot acquires XFunnel to strengthen answer engine optimization capabilities
- November 3, 2025: Lily Ray publishes analysis documenting how personalization undermines LLM tracking tool accuracy
Summary
Who: Lily Ray, Vice President of SEO Strategy & Research at Amsive, identified fundamental accuracy problems with LLM tracking tools. Industry experts including Andrea Volpini, Gael Breton, Ben Young, Mike Fishbein, and others participated in the technical discussion about measurement limitations.
What: Personalization features in ChatGPT, Google Gemini, and similar AI platforms create substantial variation in how the same query produces different responses for different users. AI systems rewrite queries based on location data, conversation history, custom instructions, and user preferences, causing tracking tools to report visibility data that may not match actual customer experiences. The average user prompt contains 42 words and can be interpreted numerous ways depending on personalization factors.
When: Ray published her findings on X at 11:15 PM on November 3, 2025, after conducting systematic spot checks comparing tracking tool reports against responses she received for identical prompts in ChatGPT. The issue emerged as major platforms expanded personalization capabilities throughout 2025.
Where: The problem affects all major AI platforms including ChatGPT, Google Gemini, Claude, and others that implement personalization features. Ray specifically documented geographic personalization where ChatGPT converted a general electronics shopping query into a New York City-specific search based on her location. The challenge impacts tracking tools across the industry including platforms from Amplitude, Adobe, Semrush, Similarweb, and emerging competitors.
Why: AI platforms implement personalization to improve user experience and provide more relevant responses. ChatGPT accesses user profiles built from conversation history, location data collected through IP addresses, and custom instructions to internally rewrite generic queries into highly specific searches. This creates value for users but undermines tracking tools that assume standardized responses. Marketing professionals need visibility data to optimize brand presence across AI platforms, but measurement systems developed for traditional search engines cannot account for the extensive personalization that modern AI systems employ. The tracking tool market has expanded rapidly throughout 2025 despite these fundamental limitations, creating a disconnect between promised capabilities and actual measurement precision.