
The ad test significance calculator

Most teams call winners on noise. Plug in baseline conversion, the lift you'd care about, and your traffic - get sample size and time-to-significance.


Why ad tests are mostly fake

Imagine flipping two coins. After 10 flips, one has 6 heads and the other has 4. Did you discover that the first coin is "better"? No - you flipped them too few times. The difference is just luck.

Most ad tests work the same way. Teams declare a winner after 3 days because the numbers look different, but they haven't run long enough to know if the difference is real or just luck.

Real winners come from real tests. The calculator below tells you how long you actually need to run.

In one line: if your test runs less than a week with normal traffic, you're probably rewarding noise.
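To feel how easy it is to fool yourself, here's a minimal simulation - hypothetical numbers, two identical ads both converting at a true 2%, about 500 visitors each:

```python
import random

# Two IDENTICAL ads, each with a true 2% conversion rate (hypothetical numbers).
# After 500 visitors apiece - a few days of modest traffic - how often does
# one "beat" the other by 50% or more, purely by luck?
random.seed(42)
trials, visitors, rate = 10_000, 500, 0.02
lucky_winners = 0
for _ in range(trials):
    a = sum(random.random() < rate for _ in range(visitors))
    b = sum(random.random() < rate for _ in range(visitors))
    if min(a, b) > 0 and max(a, b) >= 1.5 * min(a, b):
        lucky_winners += 1
print(f"{lucky_winners / trials:.0%} of identical pairs show a '50% lift'")
```

Run it: a surprisingly large fraction of identical pairs clears the bar most teams would call a winner.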

The numbers this piece argues for:

- 80% power - the right default for ad tests
- 90% - a realistic confidence threshold for creative
- The Andromeda fatigue cycle - the hard cap on test runtime
- 15% - the MDE most ad tests can realistically detect

Significance calculator

Plug in your numbers. See if the test reads.

Two-proportion z-test, 80% power, two-tailed. Adjust baseline conversion rate, the lift you'd care about, your confidence threshold, and daily traffic - get sample size and time-to-significance.
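Under the hood this is the standard pooled sample-size formula for comparing two proportions - an assumption about the widget's internals, but any two-proportion calculator lands within rounding of it:

n per variant = (z_conf + z_power)² × [p₁(1−p₁) + p₂(1−p₂)] / (p₂ − p₁)²

where p₁ is the baseline rate, p₂ = p₁ × (1 + MDE), z_conf = 1.96 at 95% confidence (two-tailed), and z_power = 0.84 at 80% power.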

Inputs

- Baseline conversion rate: 2.5%
- Minimum detectable effect (MDE): 15% relative lift. This is the smallest lift you'd want to detect; smaller MDE = bigger sample needed.
- Confidence: 95%. (95% is the academic standard; 90% is the realistic creative-testing standard; 99% is overkill for ad creative.)
- Daily traffic: 2,000 visitors across both variants

Outputs

- Sample per variant: 29,160 visitors
- Total sample: 58,320 across both variants
- Time to significance: roughly 30 days. At 2,000 visitors/day split evenly across two variants, this test hits 95% confidence on a 15% MDE in about a month.

Verdict: Too slow. The test takes longer than the fatigue cycle. Either widen your MDE (look for bigger lifts), drop confidence to 90%, or accept that you can't read this reliably with current traffic.
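A minimal sketch that reproduces these numbers, assuming the standard pooled two-proportion formula (z = 1.96 for 95% confidence, 0.84 for 80% power; the widget's exact rounding may differ slightly):

```python
from math import ceil

def sample_size_per_variant(baseline, rel_mde, z_conf=1.96, z_power=0.84):
    """Two-proportion z-test, two-tailed, 80% power by default."""
    p1 = baseline
    p2 = baseline * (1 + rel_mde)              # conversion rate if the lift is real
    variance = p1 * (1 - p1) + p2 * (1 - p2)   # sum of the two variant variances
    return ceil((z_conf + z_power) ** 2 * variance / (p2 - p1) ** 2)

n = sample_size_per_variant(0.025, 0.15)   # ≈ 29,157 per variant (widget: 29,160)
print(n, 2 * n, round(2 * n / 2000))       # sample, total, ≈ 29 days at 2,000/day

# The verdict's escape hatch: drop confidence to 90% (z = 1.645)
n90 = sample_size_per_variant(0.025, 0.15, z_conf=1.645)
print(round(2 * n90 / 2000))               # ≈ 23 days
```

Note the scaling: required sample grows with the inverse square of the effect, so halving the MDE roughly quadruples the traffic you need.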

Anti-patterns

Five testing mistakes

The common ways ad teams produce confident-looking but meaningless test results.

Most teams declare a winner the moment they see a 'big enough' difference. That's variance, not signal. Set your sample size before you start; don't peek before you hit it. Otherwise you're rewarding noise.
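Here's a sketch of why peeking burns you - hypothetical traffic numbers, two identical variants, a naive z-test checked once per day, stopping the moment it crosses 1.96:

```python
import random

def peeking_trial(days=30, daily=1000, p=0.02):
    """Two IDENTICAL variants (true rate p for both). Peek daily; stop the
    first day the naive z-score clears 1.96. Any 'winner' is a false one."""
    conv_a = conv_b = n_a = n_b = 0
    for _ in range(days):
        conv_a += sum(random.random() < p for _ in range(daily))
        conv_b += sum(random.random() < p for _ in range(daily))
        n_a += daily
        n_b += daily
        pooled = (conv_a + conv_b) / (n_a + n_b)
        se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
        if se > 0 and abs(conv_a / n_a - conv_b / n_b) / se > 1.96:
            return True
    return False

random.seed(7)
false_rate = sum(peeking_trial() for _ in range(500)) / 500
print(f"{false_rate:.0%} false winners, vs the 5% the test nominally allows")
```

Thirty daily peeks push the false-winner rate well above the nominal 5% - the "big enough" difference spotted on day three is bought with exactly this inflation.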
How Shuttergen handles testing

Test more concepts, test them properly.

Shuttergen's variation engine produces 25 ship-ready variants per concept, which means you can test concepts head-to-head at the level of structural difference (where realistic MDEs sit at 30%+) rather than headline tweaks (where MDEs are around 3% and effectively untestable at normal traffic).
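Reusing the sample_size_per_variant sketch from the calculator section (same hypothetical 2.5% baseline), the gap between those two regimes is stark:

```python
# Structural difference between concepts: 30% MDE - readable with modest traffic
print(sample_size_per_variant(0.025, 0.30))   # ≈ 7,781 per variant
# Headline tweak: 3% MDE - roughly 90x the sample, effectively untestable
print(sample_size_per_variant(0.025, 0.03))   # ≈ 689,390 per variant
```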

Lineage tracking from inspiration → variation → result also means wins compound: when a concept beats baseline, the engine generates more variations along its winning structural axes, refining the library with each cycle instead of starting over.

The playbook

Eight rules for honest creative testing



Stop calling tests on three days of variance.

Shuttergen produces enough structural variation that tests have meaningful MDEs - and lineage tracking compounds wins instead of resetting them.

Get started free