LLMs achieve 90% accuracy in replicating consumer purchase intent

Research demonstrates that a semantic similarity method enables language models to simulate consumer surveys with unprecedented fidelity, offering a scalable alternative to market research that costs corporations billions annually.

Large language models can now replicate human consumer purchase decisions with 90% accuracy when using a semantic similarity rating method, according to research published October 27, 2025. The findings challenge traditional market research approaches, which cost corporations billions annually, while pointing to scalable alternatives for product concept testing.

Researchers from PyMC Labs and Colgate-Palmolive Company tested 57 personal care product surveys involving 9,300 human responses against synthetic consumers powered by GPT-4o and Gemini-2.0-flash. The semantic similarity rating approach achieved 90% of human test-retest reliability in correlation attainment while maintaining realistic response distributions, with Kolmogorov-Smirnov similarity scores exceeding 0.85.

The breakthrough addresses a fundamental limitation in using LLMs for consumer research. When asked directly for numerical ratings, language models produce unrealistic response distributions that skew heavily toward neutral middle values. The new method instead elicits textual responses from LLMs and maps these to five-point Likert scale distributions using embedding similarity to reference statements.

Technical methodology behind consumer simulation

According to the research paper, the semantic similarity rating method retrieves embedding vectors for both synthetic consumer responses and five reference statements corresponding to each Likert scale rating. OpenAI's text-embedding-3-small model calculates cosine similarity between response vectors and reference statement vectors, generating probability mass functions rather than single numerical ratings.
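A minimal sketch of that pipeline, assuming the OpenAI Python SDK and a softmax-style normalization of the cosine similarities (the paper's exact mapping from similarities to probabilities may differ; the reference statements and temperature value here are illustrative, not the study's):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative reference statements, one per Likert point; the study used
# six curated statement sets that are not reproduced here.
REFERENCES = [
    "I would definitely not buy this product.",   # 1
    "I would probably not buy this product.",     # 2
    "I might or might not buy this product.",     # 3
    "I would probably buy this product.",         # 4
    "I would definitely buy this product.",       # 5
]

def embed(texts):
    """Fetch embedding vectors from text-embedding-3-small."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def ssr_pmf(response_text, temperature=0.05):
    """Map one free-text response to a probability mass function over 1-5."""
    vecs = embed([response_text] + REFERENCES)
    r, refs = vecs[0], vecs[1:]
    # Cosine similarity between the response and each reference statement.
    sims = refs @ r / (np.linalg.norm(refs, axis=1) * np.linalg.norm(r))
    # Assumed normalization: softmax over similarities.
    logits = sims / temperature
    p = np.exp(logits - logits.max())
    return p / p.sum()

print(ssr_pmf("Sounds useful, but I'd want to know about side effects."))
```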

The approach averaged results across six different reference statement sets for each response. Synthetic consumers received demographic attributes including age, gender, income level, location, and ethnicity when available. Each survey prompted LLMs to evaluate product concepts shown as images containing descriptions and visual elements.
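A sketch of how those two pieces might fit together; the persona text is hypothetical, and `pmf_for_set` stands in for the per-set scoring sketched above:

```python
import numpy as np

# Hypothetical persona preamble; the attributes mirror those the study
# reports conditioning on (age, gender, income, location, ethnicity).
persona = (
    "You are a 42-year-old woman from Ohio with a middle income. "
    "Looking at the product concept below, describe in a few sentences "
    "how likely you would be to buy it and why."
)

def averaged_pmf(response_text, reference_sets, pmf_for_set):
    """Average the 5-point PMFs produced by each reference statement set.

    pmf_for_set(text, refs) -> array of shape (5,); e.g. the ssr_pmf
    sketch above, adapted to take a reference set as an argument.
    """
    pmfs = [pmf_for_set(response_text, refs) for refs in reference_sets]
    return np.stack(pmfs).mean(axis=0)  # the study averaged over six sets
```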

Direct Likert rating methods achieved only 80% correlation attainment with mean distributional similarities of 0.26 for GPT-4o and 0.39 for Gemini-2.0-flash. The models defaulted to "safe" regression toward scale centers, predominantly responding with neutral ratings while rarely selecting extreme positive or negative values. This behavior created distributions fundamentally incompatible with actual consumer response patterns.

Textual elicitation followed by semantic similarity rating increased correlation attainment to 90% for both models while dramatically improving distributional similarity. GPT-4o achieved 0.88 similarity and Gemini-2.0-flash reached 0.80, compared to 0.72 and 0.59 respectively for follow-up Likert ratings where the same LLM mapped its own textual response to numerical values.

Demographic conditioning reveals behavioral patterns

Synthetic consumers prompted with demographic information replicated human response patterns across age and income categories with notable accuracy. Mean purchase intent followed concave patterns with age, where younger and older participants rated purchase intent lower than middle-aged cohorts. GPT-4o synthetic consumers mirrored this pattern while Gemini-2.0-flash showed similar behavior for younger consumers.

Income level effects emerged clearly in both human and synthetic populations. Real survey respondents, who reported income by selecting among six reference statements, showed marked increases in purchase intent only for the highest income categories and "None of these" selections. Synthetic personas prompted with budgetary problems responded with appropriately reduced purchase intent, with GPT-4o demonstrating particular sensitivity to dramatic income statement wording.

Product category and price tier differences appeared consistently across human and synthetic surveys. Both populations rated Category IV products highest and Category I lowest. Products from premium price tiers received more positive ratings while entry-level tiers scored lowest. Synthetic consumers replicated negative reactions to concepts developed by specific sources.

Experiments removing demographic markers from synthetic consumer prompts revealed the importance of detailed persona construction. Gemini-2.0-flash without demographic information achieved distributional similarity of 0.91 but correlation attainment dropped to only 50% compared to 92% with full demographic details. The models rated products more uniformly positive without persona constraints, failing to leverage actual product concept information meaningfully.

Qualitative feedback exceeds human panel detail

Synthetic consumer responses provided substantially richer qualitative feedback than traditional survey participants. Human respondents typically offered brief comments like "It's good" or repeated concept information verbatim. Synthetic consumers generated detailed explanations addressing specific product features, price sensitivity, and usage concerns.

According to the research, one synthetic consumer wrote about ease of use and safety being appealing but wanting more information about effectiveness and potential side effects. Another highlighted trust in the brand alongside specific product benefits. Critical feedback emerged naturally, with responses addressing premium pricing concerns, questioning marketing claims, or expressing skepticism about new approaches.

The depth of synthetic feedback suggests applications beyond quantitative purchase intent measurement. Marketing professionals increasingly deploy AI for campaign optimization and audience targeting, with McKinsey data indicating agentic AI attracted $1.1 billion in equity investment during 2024. Synthetic consumer responses could inform product development iterations before committing to expensive production and launch activities.

Machine learning comparison highlights LLM advantages

Researchers trained 300 LightGBM classifiers on demographic and product features to benchmark zero-shot LLM performance against supervised learning approaches. The gradient-boosted decision trees achieved correlation attainment of only 65% despite training on in-sample data, substantially below 88% for semantic similarity rating.

LightGBM outperformed follow-up Likert ratings on distributional similarity with 0.80 versus 0.72, but semantic similarity rating achieved 0.88. The supervised model required access to training data from actual surveys while LLM-based methods operated without any survey-specific training. This demonstrates that language models leverage semantic understanding of product descriptions more effectively than feature-based classifiers.

The comparison reveals fundamental differences between approaches. LightGBM operated on coarse-grained features like demographics, product categories, and price tiers. Large language models processed complete product concept descriptions including detailed feature lists, positioning statements, and visual elements. This richer information enabled more nuanced evaluations matching human responses.
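For comparison, a minimal sketch of such a feature-based baseline using LightGBM's scikit-learn interface; the features, labels, and tiny dataset are illustrative, not the study's 300-classifier setup:

```python
import lightgbm as lgb
import pandas as pd

# Toy dataset with the kinds of coarse features described above.
df = pd.DataFrame({
    "age": [34, 58, 23, 45, 61, 29],
    "income_bracket": pd.Categorical(["mid", "high", "low", "mid", "high", "low"]),
    "product_category": pd.Categorical(["I", "IV", "II", "III", "IV", "I"]),
    "price_tier": pd.Categorical(["entry", "premium", "mid", "mid", "premium", "entry"]),
    "purchase_intent": [3, 5, 2, 4, 5, 2],  # 1-5 Likert label
})

X = df.drop(columns="purchase_intent")
y = df["purchase_intent"]

# LightGBM handles pandas categorical columns natively; min_child_samples
# is lowered only because this illustrative dataset is tiny.
clf = lgb.LGBMClassifier(min_child_samples=1)
clf.fit(X, y)
print(clf.predict_proba(X))  # per-class probabilities over the Likert scale
```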

Consumer research industry implications

Global market research costs corporations billions annually, according to ESOMAR data, yet it suffers from panel biases, satisficing behavior, acquiescence effects, and limited scale. Traditional panels provide noisy demand measurements despite considerable resource investment. Positivity bias creates particular challenges, with human respondents skewing responses upward in ways that distort concept evaluation.

The research suggests synthetic consumers demonstrate less positivity bias than human panels, producing wider spreads of mean purchase intent. When products receive less favorable ratings, LLMs tend to rate them lower than human counterparts on average. This broader dynamic range may provide companies with more discriminative signals when evaluating early-stage concepts.

Scalability advantages become apparent immediately. Generating synthetic consumer responses requires only API access to commercial language models and embedding services, costing dramatically less than recruiting, screening, and compensating representative human panels. Companies could screen dozens or hundreds of product concepts quickly before committing resources to traditional research on the most promising candidates.

The approach avoids recruitment delays and panel availability constraints. Synthetic surveys can run continuously, enabling rapid iteration on concept development. Marketing teams can test variations systematically, exploring how different feature combinations, price points, or positioning statements affect purchase intent across demographic segments.

Measurement accuracy approaches test-retest reliability

Correlation attainment quantifies synthetic consumer performance against the theoretical maximum accuracy. The research established this ceiling through test-retest reliability experiments, randomly splitting each of the 57 surveys into equal test and control cohorts 2,000 times. The maximum attainable correlation between the cohorts' mean purchase intents, averaged across these simulations, represents the upper bound set by human response noise and the narrow spread of purchase intent values.
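A sketch of that ceiling computation under an assumed data layout (a list of per-concept rating arrays for one survey):

```python
import numpy as np

rng = np.random.default_rng(0)

def test_retest_ceiling(ratings_by_concept, n_splits=2000):
    """Estimate the maximum attainable correlation for one survey.

    ratings_by_concept: list of 1-D arrays holding individual purchase
    intent ratings for each product concept (assumed structure).
    """
    corrs = []
    for _ in range(n_splits):
        means_a, means_b = [], []
        for r in ratings_by_concept:
            perm = rng.permutation(len(r))
            half = len(r) // 2
            means_a.append(r[perm[:half]].mean())   # "test" cohort
            means_b.append(r[perm[half:]].mean())   # "control" cohort
        corrs.append(np.corrcoef(means_a, means_b)[0, 1])
    return float(np.mean(corrs))
```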

Semantic similarity rating achieved 90% of this theoretical maximum, meaning synthetic consumers replicated concept rankings as accurately as could be expected given inherent variability in human responses. This performance level suggests the method captures genuine consumer preferences rather than merely fitting superficial statistical patterns.

Distributional similarity scores exceeding 0.85 indicate synthetic response distributions match human distributions closely across the five-point Likert scale. Real consumer surveys showed skewed distributions favoring high ratings, typically concentrated around values of 4 and 5 with mean purchase intent of 4.0 and standard deviation of 0.1. Semantic similarity rating replicated these patterns while direct Likert elicitation produced unrealistic central clustering.
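The distributional similarity figures appear to come from a Kolmogorov-Smirnov comparison; one plausible reading, sketched here with illustrative data, is 1 minus the two-sample KS statistic, so that identical distributions score 1.0:

```python
import numpy as np
from scipy.stats import ks_2samp

human = np.array([5, 4, 4, 5, 3, 4, 5, 4, 2, 5])       # illustrative ratings
synthetic = np.array([4, 4, 5, 3, 4, 5, 4, 4, 5, 3])

# Assumed definition: similarity = 1 - KS statistic; the paper's exact
# formula may differ.
similarity = 1.0 - ks_2samp(human, synthetic).statistic
print(round(similarity, 2))
```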

Privacy and implementation considerations

Large language models raise important privacy concerns as organizations increasingly deploy them for marketing purposes. The European Data Protection Board provides structured risk management methodology for LLM implementations, addressing data protection requirements, transparency obligations, and consumer rights.

Synthetic consumer research sidesteps many traditional survey privacy issues by eliminating collection of personal data from real respondents. However, training data for commercial language models potentially includes scraped consumer discussions, product reviews, and survey responses. These privacy implications require careful consideration alongside GDPR compliance requirements.

Consumer trust research shows 59% of consumers express discomfort with their data being used to train AI systems, creating challenges for marketing implementations. Transparency about synthetic consumer usage and data handling practices remains essential for maintaining consumer confidence.

The semantic similarity rating method requires manual creation of reference statement sets tailored to survey contexts. The research used six statement sets optimized across 57 personal care product surveys, but performance for other categories remains uncertain. Companies implementing the approach must develop reference statements for their specific industries, potentially requiring iterative refinement.

Cross-category applicability requires validation

Testing focused exclusively on personal care products from a single leading corporation. Whether semantic similarity rating performs comparably across other consumer categories, price ranges, or cultural contexts requires additional validation. The research acknowledged that LLM training data likely contains abundant consumer discussions about personal care products from forums and reviews, providing relevant background knowledge.

Categories with sparse training data representation may see degraded performance. Technical products requiring specialized knowledge, niche hobbies, or emerging categories might not benefit equally from the approach. Companies should validate synthetic consumer accuracy against human panels before relying exclusively on LLM-based research for high-stakes decisions.

One experiment tested semantic similarity rating for concept relevance rather than purchase intent, using three new reference statement sets constructed specifically for this question. Gemini-2.0-flash achieved 82% correlation attainment and 0.81 distributional similarity, demonstrating potential generalization beyond purchase intent measurement. Follow-up Likert ratings reached 91% correlation attainment but only 0.62 distributional similarity.

Industry adoption trajectory and limitations

The research represents a proof of concept rather than a production-ready system. Reference statement optimization, embedding model selection, similarity measurement approaches, and temperature parameters all require systematic investigation. Future work could tune these elements automatically to maximize alignment with held-out human data.

Marketing measurement challenges continue evolving alongside privacy regulations and browser tracking restrictions. Synthetic consumer research offers measurement capabilities independent of browser tracking limitations, operating through structured survey methodologies rather than behavioral observation.

Integration with existing research infrastructure requires workflow adaptation. Companies must determine optimal hybrid approaches balancing synthetic screening with human validation. Risk assessment frameworks should guide decisions about when synthetic research provides sufficient confidence for product launch decisions versus situations requiring traditional panel confirmation.

Cost considerations favor synthetic consumer adoption despite methodological uncertainties. Eliminating panel recruitment, compensation, and management overhead while enabling unlimited concept testing volume creates compelling economic value. Early adopters willing to validate carefully against human benchmarks may gain competitive advantages in product development velocity.

Summary

Who: Researchers from PyMC Labs and Colgate-Palmolive Company developed and tested the semantic similarity rating method for synthetic consumer research. The team included Benjamin F. Maier, Ulf Aslak, Luca Fiaschi, Nina Rismal, Kemble Fletcher, Christian C. Luhmann, Robbie Dow, Kli Pappas, and Thomas V. Wiecki.

What: Large language models achieved 90% of human test-retest reliability in replicating consumer purchase intent across 57 personal care product surveys involving 9,300 human responses. The semantic similarity rating method maps textual LLM responses to Likert distributions using embedding similarity to reference statements, addressing fundamental limitations of direct numerical elicitation.

When: The research was published October 27, 2025, following testing conducted throughout 2025. The findings arrive as organizations face billions in annual market research costs while seeking scalable alternatives for product concept evaluation.

Where: Testing occurred across 57 consumer research surveys conducted by a leading personal care corporation using a digital consumer research platform. Each survey involved 150-400 unique U.S. participants with demographic markers including age, gender, location, income, and ethnicity. Synthetic consumers operated through GPT-4o and Gemini-2.0-flash APIs with OpenAI embedding models.

Why: Traditional consumer research suffers from panel biases, limited scale, and considerable costs while providing noisy demand measurements. The semantic similarity rating method enables scalable consumer research simulations while preserving traditional survey metrics and interpretability. Companies can screen numerous product concepts rapidly before committing resources to expensive production and launch activities, potentially transforming early-stage product development processes.