Harvard finds better surveys still cannot cut 23% flip rate

by Luis Rijo
Luís Rijo is a seasoned marketing professional with over 10 years of experience in Digital Marketing, Search, Social, Display, Video, and DOOH. Based in Europe. Also writing in the spend. Reach out via luis@ppc.land
July 5, 2026

Same question, asked twice: 23% of people change their answer.

A Harvard study published June 17, 2026 found that roughly 23 percent of survey respondents give a different answer when asked the identical question twice within the same survey, and that neither improved question wording nor stricter respondent filtering meaningfully reduces that rate.

The paper, titled "Who's to Blame for Survey Instability: Respondents with Nonexistent Preferences or Researchers with Flawed Measures?", was written by Libby Jenke, an assistant professor of political science at the University of Houston, and Gary King, the Albert J. Weatherhead III University Professor at Harvard's Institute for Quantitative Social Science. The authors collected evidence from 59 unique surveys and argue that the source of this instability has gone largely unrecognized for three-quarters of a century of survey research.

For an industry that leans heavily on attitudinal surveys to measure brand lift, ad recall, and purchase intent, the finding lands at an uncomfortable moment. Marketing organizations have spent much of the past year publicly grappling with data reliability. PPC Land reported in September 2025 that 45 percent of marketing data used for business decisions was found to be incomplete, inaccurate, or outdated, based on Adverity research surveying 200 chief marketing officers. The Harvard paper adds a different, more fundamental layer to that conversation: even a perfectly designed, perfectly administered survey question will still draw inconsistent answers from roughly a quarter of the people asked it, according to the authors' own experiments.

What the researchers measured

Jenke and King define survey instability precisely. For a binary question asked of the same respondent at two separate points in time, instability equals one if the two answers differ and zero if they match. Averaged across a sample, that produces a single number describing how often people flip.
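
As a minimal sketch of that definition, assuming answers are coded 0 or 1 and stored as (first, second) pairs, the calculation might look like this in Python. The function name and the sample data are illustrative and do not come from the paper.

```python
def instability(pairs):
    """Share of respondents whose two answers to the same binary question differ."""
    if not pairs:
        raise ValueError("need at least one respondent")
    # Each respondent scores 1 if the two answers differ, 0 if they match.
    flips = [1 if first != second else 0 for first, second in pairs]
    return sum(flips) / len(flips)


# Hypothetical data: four respondents, one of whom changes their answer.
sample = [(1, 1), (0, 0), (1, 0), (0, 0)]
print(instability(sample))  # 0.25
```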

To isolate this figure from ordinary sources of survey error, the authors built an experimental design that asks the same question twice within a single survey, separated by several unrelated distractor questions. Because the repeat question arrives only moments after the first, respondents have almost no opportunity to forget having seen it, and nothing in the world has had time to change that would legitimately alter their answer. Across 9,472 respondents in a separate dataset the authors drew on for validation, not one mentioned noticing a repeated question.

The headline figure, drawn from 53 of the 59 surveys the researchers fielded over three years on different online survey platforms, put average instability at 23.0 percent. That number is not unprecedented. The paper traces a nearly identical estimate back to the 1948 Elmira Study, the earliest documented panel survey, where instability in reported vote choice a month apart fell in a range with a midpoint of 22 percent. Three-quarters of a century separates the two estimates, and they are, in the authors' words, indistinguishable.

Two old explanations, tested and found wanting

The survey research literature has offered two competing explanations for this pattern. The first, associated with political scientist Philip Converse's 1964 work, argues that respondents simply lack genuine preferences on many issues and answer somewhat randomly, a phenomenon Converse termed "nonattitudes." The second, more common explanation attributes instability to measurement error in the survey instrument itself, meaning poorly worded questions, confusing response scales, or other correctable design flaws.

Jenke and King ran nine direct tests of the measurement-error explanation, focused on the 2008 American National Election Studies questionnaire. They compared original question wordings against versions redesigned to follow current best practices, fixing problems such as missing response labels and double-barreled phrasing. Four separate randomized experiments comparing old and new wordings showed a small increase in instability for two of the four fixes and a slightly larger decrease for the other two, with none of the four differences statistically distinguishable from zero. The paper concludes that these improved wordings do not account for most of the observed instability.

The researchers also tested priming, the theory that a preceding question can activate a particular consideration that then colors how a respondent answers the target question. They randomly assigned some respondents to see identical priming content before both the first and second asking of a question, reasoning that if priming drove much of the instability, a consistent prime should produce more consistent answers. Across five different primes, covering a conjoint question and manipulations suggesting voters should favor candidates of particular ages, personal qualities, races, or genders, the difference in instability between primed and unprimed groups came out near zero.

A third explanation: intrinsic human stochasticity

Having tested both standard explanations and found neither sufficient, the authors propose what they call intrinsic human stochasticity as a third source of instability, distinct from both nonexistent preferences and instrument error. The term draws an analogy to fields outside survey research where measurement science already separates two sources of variability: the error introduced by an imperfect instrument, and the fundamental, irreducible variability of the thing being measured. Biologists call this distinction technical versus biological variation. Physicists studying particles under a microscope distinguish diffraction limits from Brownian motion. The authors argue survey research has conflated these two sources under a single umbrella term, measurement error, without adequately separating them.

To test whether decision-making itself carries an irreducible random component, the researchers designed experiments in which respondents were told, explicitly, the exact probability that one option was better for them. In one such experiment, respondents were informed there was a three-in-four chance that a hypothetical Candidate A was better for them and the country, then asked to make four separate choices between Candidate A and Candidate B across different elections. Under a fully rational, deterministic decision rule, every respondent should have chosen Candidate A every time. Instead, an average of 45 percent of respondents violated that prediction. Across eight separate variations of this experiment, the average share of respondents making a stochastic rather than deterministic choice reached 50.5 percent.
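
One way to read that result, sketched in Python: assume each respondent's four choices are recorded as a list of "A" or "B" labels, and count a respondent as deterministic only if every choice is Candidate A. The function name, the per-respondent classification rule, and the sample data are all assumptions made for illustration; the paper may score responses differently.

```python
def share_stochastic(responses, better_option="A"):
    """Fraction of respondents who picked the worse option at least once."""
    if not responses:
        raise ValueError("need at least one respondent")
    violators = sum(
        1 for choices in responses
        if any(choice != better_option for choice in choices)
    )
    return violators / len(responses)


# Hypothetical data: four respondents, four elections each.
responses = [
    ["A", "A", "A", "A"],  # deterministic: always the better option
    ["A", "B", "A", "A"],  # picks the worse option once
    ["A", "A", "A", "A"],
    ["B", "A", "A", "B"],  # picks the worse option twice
]
print(share_stochastic(responses))  # 0.5
```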

The paper also traces cognitive and psychological contributors to this stochasticity. Cognitively complex questions, measured by how long respondents take to answer them, produced far more instability than simple ones. A question asking respondents to place themselves on a seven-point scale describing environmental regulation and business burden produced instability above 50 percent, compared to essentially zero for a simple gender question. Time spent on a question mattered independently: instability started near 50 percent when respondents were given only two seconds to answer and dropped to 26 percent once they were allotted six seconds, based on a randomized experiment assigning different time limits.

Four psychological states also predicted instability, according to the paper: preoccupation with matters outside the survey, mind-wandering, adopting a rapid-response persona rather than a conscientious one, and failing attention checks. Preoccupation stood out because it satisfied criteria the other three did not. Respondents who reported being very preoccupied before starting the survey were, on average, 13 percentage points more likely to give inconsistent answers than unpreoccupied respondents, and the effect held up as a stable, non-intrusive filter that did not itself bias other survey responses. By contrast, the paper found that traditional attention-check questions, the kind asking respondents to ignore an instruction or identify an obvious fact, showed pass rates ranging from roughly 30 percent to nearly 100 percent depending on which specific check was used, a level of inconsistency the authors argue makes them unreliable as filters.

A synthetic-respondent comparison

One section of the paper compares human survey responses against those generated by simulated respondents. The authors generated 1,500 synthetic respondents using Claude, the AI system built by Anthropic, designed to mirror the demographic and political profile of the human sample on measures including location, race, age, education, income, party identification, ideology, and 2024 vote choice. When the same battery of questions was administered to these simulated respondents, instability levels came in well below those recorded for humans, a result the authors attribute to known homogeneity problems in large language model outputs. The comparison was used primarily to argue that bot contamination was unlikely to explain the human-sample results, since a bot-driven sample would be expected to show artificially low variability, not the levels the researchers actually observed.

That finding intersects with a debate already playing out inside market research. PPC Land covered research in November 2025 showing that large language models achieved 90 percent accuracy in replicating human purchase intent in a study by PyMC Labs and Colgate-Palmolive, while also finding that synthetic consumers displayed less positivity bias than human panels. Separately, Meta and Toluna reported that AI-driven synthetic personas replicated the directional findings of human ad testing panels across four markets, though divergence between synthetic and human results was most pronounced in the middle tier of ad performance rankings. The Jenke and King paper does not evaluate synthetic respondents as a research tool in the way those studies do; its use of Claude-generated respondents was a methodological control, not an endorsement or critique of synthetic panels as a substitute for human samples. Still, the demonstrated homogeneity of the synthetic outputs, relative to the documented stochasticity of human ones, is a data point worth setting alongside that broader industry conversation.

What changes with age

The paper also examined how individual characteristics relate to instability, using age as an illustrative case. Older respondents took more time answering questions, consistent with well-documented declines in processing speed. Yet older respondents also reported substantially less preoccupation and less mind-wandering than younger ones. The combined effect was a 40 percent drop in instability, equivalent to 11 percentage points, moving from the youngest age bracket studied to the oldest. The authors are careful to note that this finding describes a pattern rather than license to exclude any age group from analysis, since doing so without appropriate statistical adjustment would introduce its own bias.

Practical recommendations

The paper closes with three recommendations for survey practice. First, researchers should measure the cognitive complexity of their own questions, using average response time in a small pretest as an inexpensive proxy, and simplify wherever the complexity is not intrinsic to what is being measured. Second, researchers should measure their own survey's instability directly, by asking the key outcome variable a second time within the same instrument, since doing so provides information about data quality that most surveys currently do not collect at all. Third, when filtering out inattentive respondents is necessary, the authors recommend combining a single preoccupation question with time-on-task thresholds, rather than relying on conventional attention-check questions, because preoccupation proved stable across the course of a survey and did not measurably influence answers to other questions, properties the paper found lacking in mind-wandering probes, persona self-reports, and attention checks alike.
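
The second and third recommendations can be sketched together in Python. This is a rough illustration, not the authors' procedure: the field names, the 1-to-5 preoccupation scale, and both cutoffs are assumptions, since the article does not report the thresholds the paper uses.

```python
from dataclasses import dataclass


@dataclass
class Response:
    first: int              # answer to the key outcome question (0 or 1)
    second: int             # answer to the same question, asked again later
    preoccupation: int      # self-report, 1 (not at all) to 5 (very); assumed scale
    seconds_on_task: float  # time spent on the key question


def keep(r, max_preoccupation=3, min_seconds=2.0):
    """Illustrative filter; both cutoffs are assumptions, not values from the paper."""
    return r.preoccupation <= max_preoccupation and r.seconds_on_task >= min_seconds


def instability(responses):
    """Share of respondents whose repeated answer differs from their first."""
    if not responses:
        raise ValueError("no responses left after filtering")
    return sum(r.first != r.second for r in responses) / len(responses)


# Hypothetical data for four respondents.
data = [
    Response(1, 1, preoccupation=1, seconds_on_task=6.0),
    Response(0, 1, preoccupation=5, seconds_on_task=1.5),
    Response(1, 0, preoccupation=2, seconds_on_task=4.0),
    Response(0, 0, preoccupation=2, seconds_on_task=5.0),
]
filtered = [r for r in data if keep(r)]
print(instability(data), instability(filtered))
```

Reporting instability both before and after filtering shows how much of the inconsistency the filter actually removes, and how much remains regardless.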

None of this, the authors stress, means people lack real preferences or that survey researchers have failed at their craft. The paper's own review of decades of replicated survey findings across religiosity, immigration attitudes, consumer choices, and dozens of other domains argues against the nonattitudes explanation just as firmly as it argues against attributing all instability to flawed instruments. The remaining, and largest, component is something the authors describe as intrinsic to being human: a fundamental stochasticity in decision-making that surveys can measure and partially manage, but not eliminate.

Why this matters for marketing measurement

Marketers' doubts about the reliability of their own measurement infrastructure are by now well documented. PPC Land reported in October 2025 that marketing measurement confidence had stalled despite growing data volumes, with 26.5 percent of surveyed marketers dissatisfied with their measurement technology stack and platform-provided attribution still the dominant methodology at 65.8 percent adoption. A separate study covered by PPC Land in March 2026, produced jointly by the Coalition for Innovative Media Measurement and the American Association of Advertising Agencies, found that advertisers were struggling to trust the measurement data available to them despite having more of it than ever before, a pattern the study's authors called a paradox of plenty.

Brand lift measurement, in particular, sits close to the mechanics this paper examines. Brand lift studies typically compare survey responses from an exposed group against a control group to estimate whether an ad shifted awareness, favorability, or purchase intent. PPC Land's coverage of Cint's June 2026 platform update noted that brand lift has traditionally relied on attitudinal survey research fielded to panels, a methodology distinct from the behavioral transaction data increasingly paired alongside it. If roughly a quarter of survey respondents would answer a repeated question differently even under ideal conditions, that baseline noise sets a practical floor on how precisely any single-wave brand lift study can detect a true shift in attitude, independent of sample size or panel quality.

The paper's finding on attention checks carries a related implication for panel-based research generally. Market research panels commonly use instructed-response items, tasks asking a respondent to select a specific answer to demonstrate they are reading carefully, as a quality-control gate before including a response in reported results. Jenke and King's own data found pass rates on such checks varying from around 30 percent to nearly 100 percent depending on which specific check was used, a spread wide enough that the authors argue it is not clear what threshold, if any, a research buyer should trust. For agencies and brands purchasing panel-based research, that raises a practical question about which quality-control claims from a given panel provider reflect a meaningful bar and which reflect an arbitrarily chosen one.

Timeline

1948: The Elmira Study, the earliest well-documented panel survey, records survey instability in reported vote choice with a midpoint estimate of 22 percent.
1964: Philip Converse publishes his "nonattitudes" theory, arguing some respondents lack genuine preferences on major political issues.
2025: Jenke and King begin fielding the 59 surveys underlying the paper, using online survey platforms over a three-year data collection period.
2025: Katherine Clayton and coauthors, cited throughout the paper as a methodological foundation, publish the conjoint-based repeated-question protocol the authors build on, administered to 9,472 respondents.
May 21, 2026: The paper's public listing page on Gary King's website is last updated, according to the site's own timestamp.
June 17, 2026: Libby Jenke and Gary King publish "Who's to Blame for Survey Instability: Respondents with Nonexistent Preferences or Researchers with Flawed Measures?"

Marketing data quality crisis reveals 45% of business decisions based on unreliable information: Adverity research surveying 200 chief marketing officers found nearly half of marketing data used for decisions was inaccurate, incomplete, or outdated.
Marketing measurement confidence stalls despite data growth: A survey of US marketing professionals found measurement confidence had not improved year over year despite expanding data access.
Drowning in data: why U.S. advertisers can't trust their own measurement: A joint CIMM and 4As study of 197 senior marketers found advertiser trust in measurement lagging behind the volume of data available to them.
Cint merges brand and sales lift into one live dashboard: Cint combined attitudinal brand lift survey data with behavioral transaction data in a single reporting dashboard for US advertisers.
LLMs achieve 90% accuracy in replicating consumer purchase intent: Research from PyMC Labs and Colgate-Palmolive found large language models replicated human purchase intent ratings with high accuracy while showing less positivity bias than human panels.
Meta and Toluna study finds emojis lift Reels purchase intent 2.5x: A validation exercise comparing synthetic AI personas against 100 human respondents per ad found directional alignment between the two groups, with divergence concentrated in mid-tier ad rankings.

Summary

Who: Libby Jenke, assistant professor of political science at the University of Houston, and Gary King, University Professor at Harvard's Institute for Quantitative Social Science, authored the study. It draws on 59 surveys and cites earlier work by Katherine Clayton and coauthors involving 9,472 respondents.

What: A study finding that approximately 23 percent of survey respondents give inconsistent answers when asked an identical question twice, a rate that neither improved question wording nor priming manipulation meaningfully reduced. The authors attribute much of this instability to what they term intrinsic human stochasticity, a fundamental randomness in decision-making distinct from either nonexistent preferences or flawed survey instruments.

When: The paper was published June 17, 2026, drawing on data collected across 59 surveys fielded over approximately three years on online survey platforms.

Where: The research was conducted through Harvard University's Institute for Quantitative Social Science and the University of Houston's Department of Political Science, using online survey platforms including Lucid, Mechanical Turk, and Prolific.

Why: The finding matters for any research practice built on single-wave attitudinal surveys, including brand lift studies, ad recall measurement, and purchase intent tracking, because it establishes a baseline level of respondent inconsistency that exists independent of survey design quality or panel selection, while also identifying preoccupation, rather than conventional attention checks, as a more reliable filter for excluding inattentive respondents from analysis.
