Your A-B tests may not be telling you what you think they are! Read about the dangers of divergent delivery in my new paper, soon to be published in the Journal of Marketing.
Eric Schwartz (Michigan) and I have written “Where A-B Testing Goes Wrong: How Divergent Delivery Affects What Online Experiments Cannot (and Can) Tell You About How Customers Respond to Advertising,” which was accepted last week for publication in the Journal of Marketing.
This article will be of interest to anyone who is considering using ad platforms’ freely available experimentation tools to compare the effectiveness of different creative elements (images, copy, messaging) in online advertising. Divergent delivery occurs when a platform targets different users to different ads, based on the content of those ads. This makes it impossible for an advertiser to separate the effect of the ad from the effect of how the platform’s targeting algorithm decides which users see each ad. We take the perspective of the practicing marketer who uses A-B test results to make strategic decisions based on which creative elements of ads are most effective.
There is a lot to say about how targeting policies, user heterogeneity, and data aggregation conspire to bias the magnitude, and even the sign, of A-B test results. We provide evidence that platforms engage in divergent delivery even during the course of a seemingly randomized experiment, and we explain why platforms have no incentive to fix the problem.
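To make the sign-flipping concrete, here is a minimal simulation sketch (not from the paper; the segment names, click rates, and delivery shares are made-up numbers chosen only to illustrate the mechanism). In it, ad A beats ad B within every user segment, yet the aggregated comparison favors B because the platform delivers each ad to a different, unobserved mix of users:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical user segments with different baseline response rates.
# Within EVERY segment, ad A outperforms ad B. (Made-up numbers.)
p_click = {
    "high_intent": {"A": 0.12, "B": 0.10},
    "low_intent":  {"A": 0.03, "B": 0.02},
}

# Hypothetical divergent delivery: based on ad content, the platform routes
# ad B mostly to high-intent users and ad A mostly to low-intent users.
share_high_intent = {"A": 0.20, "B": 0.80}

n_per_ad = 100_000
observed_rate = {}
for ad in ("A", "B"):
    n_high = int(n_per_ad * share_high_intent[ad])
    n_low = n_per_ad - n_high
    clicks = (
        rng.binomial(n_high, p_click["high_intent"][ad])
        + rng.binomial(n_low, p_click["low_intent"][ad])
    )
    observed_rate[ad] = clicks / n_per_ad

print(observed_rate)
# Typically ~0.048 for A vs ~0.084 for B: the aggregated "A-B test" says B
# wins, even though A is the better ad for every user segment. The advertiser
# only sees the aggregated rates, so the reversal is undetectable.
```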
Here’s the full abstract of the paper:
Abstract
Marketers use online advertising platforms to compare user responses to different ad content. But platforms’ experimentation tools deliver different ads to distinct and undetectably optimized mixes of users that vary across ads, even during the test. Because exposure to ads in the test is non-random, the estimated comparisons confound the effect of the ad content with the effect of algorithmic targeting. This means experimenters may not be learning what they think they are learning from ad A-B tests. The authors document these “divergent delivery” patterns during an online experiment for the first time. They explain how algorithmic targeting, user heterogeneity, and data aggregation conspire to confound the magnitude, and even the sign, of ad A-B test results. Analytically, the paper extends the potential outcomes model of causal inference to treat random assignment of ads and user exposure to ads as separate experimental design elements. Managerially, the authors explain why platforms lack incentives to allow experimenters to untangle the effects of ad content from proprietary algorithmic selection of users when running A-B tests. Given that experimenters have diverse reasons for comparing user responses to ads, the authors offer tailored prescriptive guidance to experimenters based on their specific goals.
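For a rough sense of the confound the abstract describes, here is a sketch in generic potential-outcomes notation (my own symbols for illustration, not necessarily the paper's):

```latex
% Sketch in generic potential-outcomes notation (not the paper's own symbols).
% Y_i(a): user i's potential response to ad a \in \{A, B\}.
% E_i(a): indicator that the platform exposes user i when ad a is assigned.
% A naive comparison of observed response rates estimates
\[
  \hat{\Delta}
  = \mathbb{E}\!\left[ Y_i(A) \mid E_i(A) = 1 \right]
  - \mathbb{E}\!\left[ Y_i(B) \mid E_i(B) = 1 \right],
\]
% which coincides with the ad-content effect $\mathbb{E}[Y_i(A) - Y_i(B)]$
% only if exposure is independent of ad content. Under divergent delivery,
% the exposed user mixes differ across ads, so the two conditional
% expectations average over different populations and the comparison
% confounds ad content with algorithmic targeting.
```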