[This article was first published on Florian Teschner, and kindly contributed to R-bloggers].

Short practical advice on A/B testing:

- Stop sizing tests only for statistical significance. In finite campaigns, your goal is profit, not perfect inference.
- Treat testing as a trade-off. Every extra test exposure buys learning, but also burns revenue if that exposure gets the weaker treatment.
- Use smaller tests when outcomes are noisy. The paper shows that profit-maximizing test sizes rise much more slowly than classical power-based sizes.
- Scale test size with the reachable audience. If your population is limited, test size should reflect that constraint directly.
- Allow unequal splits when priors differ. If one treatment is likely better a priori (e.g., treatment vs. holdout), asymmetric test cells can be optimal.

Shiny app to test the implications: Test and Roll Shiny App

Long version

I just read Test & Roll: Profit-Maximizing A/B Tests by Elea McDonnell Feit and Ron Berman (2019), and it challenges one of the default habits in marketing experimentation: planning tests as if the main objective were statistical significance.

Their point is simple: in most real marketing experiments, you have a finite population (an email list, a campaign budget, a limited traffic window). In that setting, the right objective is total expected profit across test and rollout, not p-values.

The core idea

A classic A/B setup has two stages:

1. Test stage: expose n1 users to treatment A and n2 users to treatment B.
2. Roll stage: deploy the winner to the remaining N - n1 - n2 users.

Bigger tests improve certainty, but they also create opportunity cost: more users in the test means more users potentially seeing the weaker treatment before rollout.

The paper formalizes this as a decision problem and derives profit-maximizing sample sizes.
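To get a feel for what those sizes look like, here is a short sketch of the symmetric closed-form solution as I read it from the paper (treat the exact formula as my transcription, not an authoritative statement), next to a textbook power calculation. All numeric inputs are made-up assumptions for illustration:

```python
# Sketch: profit-maximizing test-cell size for the symmetric Normal-Normal
# case (my transcription of the paper's closed form), compared with a
# classical power-based sample size. Parameter values are illustrative.
import math

def n_test_and_roll(N, s, sigma):
    """Size of EACH test cell (n1 = n2) that maximizes expected profit.

    N     : total reachable population (test + rollout)
    s     : std. dev. of individual responses (noise)
    sigma : prior std. dev. of each treatment's mean response
    """
    r = (s / sigma) ** 2
    return math.sqrt(N / 4 * r + (3 / 4 * r) ** 2) - 3 / 4 * r

def n_classical(s, d):
    """Classical per-arm size to detect a mean difference d with a
    two-sided z-test (alpha = 0.05, power = 0.80; z-values hard-coded)."""
    z = 1.96 + 0.84
    return 2 * (z * s / d) ** 2

N, s, sigma = 100_000, 0.30, 0.02
print("test & roll, per cell:", round(n_test_and_roll(N, s, sigma)))
print("classical,   per arm: ", round(n_classical(s, d=0.01)))
# Quadrupling N roughly doubles the optimal test: sqrt(N) scaling.
print("n*(4N) / n*(N):", round(n_test_and_roll(4 * N, s, sigma)
                               / n_test_and_roll(N, s, sigma), 2))
```

With these (invented) inputs the profit-maximizing cell is a small fraction of the classical recommendation, and doubling the test when the population quadruples is exactly the square-root-of-N behavior discussed below.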
Under Normal priors and Normal outcomes, they get closed-form solutions.

Why this matters in practice

If you use classical hypothesis-test sizing, the recommended n can be huge, especially when effect sizes are small and response is noisy (which is exactly what we see in advertising). Their framework produces much smaller test sizes because it optimizes business outcomes, not Type I/II error control.

Two important takeaways:

- Optimal test sizes grow sub-linearly with response noise, while classical sample-size rules grow much faster.
- Optimal test sizes scale with the square root of the population size N, which makes them workable for smaller markets and finite campaigns.

Comparison with bandits

The authors benchmark against Thompson sampling (a multi-armed bandit). Bandits usually win on pure optimization, but the gap is often modest in their examples.

That is useful operationally: a two-stage "test then roll" process is far easier to implement, explain, and govern than a continuously adapting bandit, especially in organizations with approval and reporting constraints.

The applications are the best part

They test the approach in three contexts:

- Website design experiments
- Display advertising decisions
- Catalog holdout tests

Across cases, profit-maximizing designs use substantially smaller test cells than classical power calculations and produce higher expected profit.

A particularly practical result: small holdout groups (common in catalog and CRM practice) can be fully rational when priors are asymmetric. In other words, "unequal splits" are not always bad design; they can be the optimal design.

What I changed in my own thinking

Before this, I treated "underpowered" mostly as a red flag.
After this paper, I think a better question is: underpowered for what objective?

If the objective is publication-grade inference, classical power logic is right. If the objective is campaign profit over a finite horizon, a smaller test can be the better business decision.

Practical implementation checklist

If you run tactical tests (email, paid media, landing pages), this paper suggests a better workflow:

1. Define the total reachable population N for the decision horizon.
2. Set priors for the treatment means from past similar experiments.
3. Estimate the response variance from historical data.
4. Compute the profit-maximizing n1 and n2.
5. Pre-commit the rollout decision rule (the posterior expected-profit winner).
6. Report expected regret alongside expected upside.

That last point is underrated: decision-makers usually understand "expected dollars at risk" better than p-values.

Bottom line

For many real marketing tests, "smaller than textbook" is not bad science. It is better decision design. If your experiment exists to drive a business action on a finite audience, Test & Roll gives a rigorous way to choose sample sizes that maximize profit instead of statistical purity.

Paper: Feit, E. M., & Berman, R. (2019). Test & Roll: Profit-Maximizing A/B Tests. SSRN: https://ssrn.com/abstract=3274875
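As an appendix of sorts, the "posterior expected-profit winner" rule from the checklist can be sketched with a standard Normal-Normal conjugate update; the prior and the test readouts below are invented for illustration, not taken from the paper:

```python
# Sketch of the rollout decision rule: after the test, deploy the treatment
# with the higher posterior mean response. Standard Normal-Normal conjugate
# update; all numbers below are illustrative assumptions.

def posterior_mean(y_bar, n, s, mu, sigma):
    """Posterior mean of a treatment's true mean response, given the
    test-cell average y_bar over n users, response noise s, and a
    Normal(mu, sigma) prior on the mean (precision-weighted average)."""
    precision_prior = 1 / sigma ** 2
    precision_data = n / s ** 2
    return (precision_prior * mu + precision_data * y_bar) / (precision_prior + precision_data)

# Hypothetical test readout: cell B looks better, but by how much after
# shrinkage toward the shared prior?
mu, sigma, s, n = 0.10, 0.02, 0.30, 2000
post_a = posterior_mean(0.095, n, s, mu, sigma)
post_b = posterior_mean(0.112, n, s, mu, sigma)
winner = "B" if post_b > post_a else "A"
print(winner, round(post_a, 4), round(post_b, 4))
```

The same posterior quantities feed the last checklist item: the gap between the winner's and the loser's posterior means, scaled by the rollout population, is the natural "expected dollars at risk" number to report next to the upside.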