A/B testing has a credibility problem in startups, and it is self-inflicted. Tests get called after three days because the variant "is clearly winning." Metrics get swapped mid-test. Losing variants get rerun until they win. The result is a growth roadmap built on noise, defended with the word "significant."

This guide is about running honest experiments at startup scale — including knowing when not to run one.

The core idea, in plain language

An A/B test asks: if I show version A to one random half of users and version B to the other, is the difference in outcomes bigger than what random chance would produce? Because users vary wildly, small samples produce large random differences. Statistics is the discipline of not being fooled by that.

Two numbers frame every test:

Significance level (the p-value threshold). Convention sets it at 0.05: you accept a 5% risk of declaring a winner when there is no real difference (a false positive). This is a minimum bar, not proof.
Statistical power. The probability of detecting a real effect of a given size. The convention is 80%. Underpowered tests mostly return "no significant difference" even when the variant genuinely helps — the most common and least diagnosed failure in startup testing.

The sample size reality check

Before launching any test, compute the required sample. The inputs: your baseline conversion rate, the minimum lift you care about (MDE — minimum detectable effect), significance (5%) and power (80%).

Illustrative orders of magnitude for a conversion baseline of 5%:

To detect a relative lift of 50% (5% → 7.5%): roughly 1,000–1,500 users per variant.
To detect a 20% lift (5% → 6%): roughly 8,000 per variant.
To detect a 10% lift (5% → 5.5%): roughly 30,000+ per variant.
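
As a rough check on these orders of magnitude, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions (two-sided test). The function names are illustrative, not from any particular tool, and real calculators may differ by a few percent depending on the exact formula they use.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def months_to_run(n_per_variant, monthly_visitors, variants=2):
    """Months of traffic needed to fill every variant."""
    return n_per_variant * variants / monthly_visitors


for lift in (0.50, 0.20, 0.10):
    n = sample_size_per_variant(0.05, lift)
    months = months_to_run(n, monthly_visitors=2000)
    print(f"{lift:.0%} relative lift: {n:,} per variant, {months:.1f} months at 2,000 visitors/month")
```

The required sample grows with the inverse square of the difference you want to detect, which is why halving the MDE roughly quadruples the run time.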

Now the uncomfortable arithmetic: a 10%-lift test needs 30,000+ users in each of two variants, so 60,000+ in total. If your signup page gets 2,000 visitors a month, that is more than two and a half years of traffic. The honest conclusions:

At startup traffic, only test big swings — different value proposition, different flow, different pricing page — not button colors.
Test high-traffic, high-baseline surfaces. For the same 20% relative lift, an onboarding step with a 40% baseline needs roughly 600 users per variant, while a 1% landing-page conversion needs over 40,000.
Sometimes the right answer is don't A/B test: ship, watch the metric with before/after judgment, and reserve formal testing for decisions that are expensive to get wrong.

The seven ways teams lie to themselves

1. Peeking (optional stopping). Checking daily and stopping the moment significance flashes. Repeated checking at a 5% threshold produces false positives at several times the nominal rate: a null test checked every day for a month has far more than a 5% chance of crossing the line at least once. Fix: set the sample size in advance and evaluate once, or use sequential testing methods explicitly designed for repeated looks.

2. HARKing the metric. The primary metric (signup rate) did not move, but "time on page" did, so victory is declared on that. Fix: one pre-registered primary metric per test; everything else is hypothesis-generating.

3. The multiple comparisons trap. Testing 5 variants against control on 4 metrics = 20 comparisons; at a 5% false-positive rate, one spurious "win" is expected on average, and if the comparisons were independent the chance of at least one would be about 64% (1 - 0.95^20). Fix: fewer variants, one primary metric, and corrections (or at least suspicion) when you slice.

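4. Ignoring sample ratio mismatch (SRM). You configured 50/50 but got 55/45. At any realistic sample size, an imbalance that large is far beyond what chance produces; it usually means the assignment or tracking is broken, which invalidates the test. Fix: check the split (for example with a chi-squared test on the assignment counts) before reading the result.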

5. Segment fishing. "It didn't win overall, but look at mobile users from organic in Europe!" With enough segments, noise always produces a winning one. Fix: segments are for generating the next test's hypothesis, not for rescuing this one.

6. Stopping losers early, letting winners run. Asymmetric stopping rules bias your recorded history toward false wins. Fix: the same stopping rule for every outcome.

7. Novelty and seasonality blindness. A redesign wins in week one because it is new; a test run across a holiday measures the holiday. Fix: run tests over full weekly cycles (2+ weeks), and be suspicious of effects that decay.
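
The first failure on this list, peeking, is easy to demonstrate by simulation. The sketch below is an illustration under stated assumptions, not a description of any particular tool: it runs A/A tests (both arms share the same true rate, so every "win" is a false positive) and compares a team that stops at the first significant daily look with one that evaluates once at the end. The traffic figures (100 users per arm per day, a 5% rate, 30 days) and all function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_stat(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic; 0.0 when it is undefined."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se


def simulate_aa_test(rng, rate, users_per_day, days, z_crit):
    """Run one A/A test. Returns (significant at any daily look, significant at the end)."""
    conv_a = conv_b = n = 0
    peeking_hit = False
    z = 0.0
    for _ in range(days):
        conv_a += sum(rng.random() < rate for _ in range(users_per_day))
        conv_b += sum(rng.random() < rate for _ in range(users_per_day))
        n += users_per_day
        z = z_stat(conv_a, n, conv_b, n)
        if abs(z) > z_crit:
            peeking_hit = True  # a daily peeker would have stopped and declared a winner
    return peeking_hit, abs(z) > z_crit


def false_positive_rates(runs=400, rate=0.05, users_per_day=100, days=30, seed=1):
    """Share of A/A tests declared 'significant' under each stopping rule."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)  # two-sided 5% threshold
    peek = final = 0
    for _ in range(runs):
        peeking_hit, final_hit = simulate_aa_test(rng, rate, users_per_day, days, z_crit)
        peek += peeking_hit
        final += final_hit
    return peek / runs, final / runs


peek_rate, final_rate = false_positive_rates()
print(f"stop at first significant daily look: {peek_rate:.0%} false positives")
print(f"evaluate once at the end:             {final_rate:.0%} false positives")
```

The single end-of-test evaluation should land near the nominal 5%, while the daily-peeking rule lands well above it. Exact figures depend on the seed and the assumed traffic.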

A test protocol that fits on an index card

Hypothesis: "Because [evidence], we believe [change] will improve [primary metric] by at least [MDE]."
Sample size: computed in advance from baseline, MDE, 5% significance, 80% power. Derive the run time; if it is more than ~4 weeks, redesign the test (bigger swing or bigger surface) or don't test.
Run: full weeks, no mid-test changes, no peeking-based stops. Check SRM.
Read: primary metric first. Report the effect size with its confidence interval, not just "significant" — "signup rate 4.1% → 4.9%, 95% CI [+0.2, +1.4 points]" says what "p < 0.05" hides.
Decide and log: ship, kill, or iterate — and record the result either way. A searchable log of past experiments (including losers) is the most underrated growth asset a team builds.
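
The "check SRM" and "report the effect size with its confidence interval" steps can be sketched in a few lines of standard-library Python. This is a minimal illustration, assuming a two-variant test, a chi-squared test with one degree of freedom for the split, and a simple Wald interval for the difference. The counts in the usage lines are hypothetical, chosen so the interval comes out close to the example in the card.

```python
from math import erfc, sqrt
from statistics import NormalDist


def srm_p_value(n_a, n_b, expected_share_a=0.5):
    """Chi-squared (1 df) p-value for the observed split versus the configured one."""
    total = n_a + n_b
    exp_a = total * expected_share_a
    exp_b = total - exp_a
    chi2 = (n_a - exp_a) ** 2 / exp_a + (n_b - exp_b) ** 2 / exp_b
    return erfc(sqrt(chi2 / 2))  # survival function of chi-squared with 1 df


def lift_with_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Absolute difference in conversion rate (B minus A) with a Wald interval."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff, diff - z * se, diff + z * se


# Hypothetical counts: 10,000 users per arm, 410 versus 490 signups.
print(f"SRM p-value: {srm_p_value(10_000, 10_000):.3f}")
diff, low, high = lift_with_ci(410, 10_000, 490, 10_000)
print(f"lift: {diff * 100:+.1f} points, 95% CI [{low * 100:+.1f}, {high * 100:+.1f}]")
```

A very small SRM p-value means the split itself is suspect and the result should not be read. Many teams use a strict cutoff such as 0.001 for this check; that threshold is a convention, not something this guide prescribes.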

Low-traffic alternatives that are still honest

When the sample-size math says "no," you still have rigorous options:

Painted-door tests: a button for the unbuilt feature, measuring click intent. High baselines (10–30%) mean small samples suffice — a real decision from a week of traffic.
Before/after with guardrails: ship the change, compare 2–3 full weeks before versus after on the same metric, and explicitly list confounders (campaigns, seasonality, press). Weaker than a randomized test; far stronger than opinion — if you write down the expected effect before shipping.
User testing at n=5–10: watching five users fail to find the signup button is more informative than an underpowered test involving five thousand. Qualitative methods find the reasons quantitative tests can only confirm.
Bandit-style rollouts for ephemeral decisions (promo copy, campaign variants) where learning fast matters more than certainty.
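
To make "bandit-style" concrete: one common approach is Thompson sampling, shown below as a minimal simulation. The guide does not prescribe a specific algorithm, so treat this as one possible instance. The conversion rates passed in are hypothetical and exist only to simulate users; in a real rollout the outcomes would come from live traffic.

```python
import random


def thompson_rollout(true_rates, visitors, seed=0):
    """Simulate a Beta-Bernoulli Thompson sampling rollout.

    true_rates are hypothetical conversion rates used only to simulate users.
    Returns the number of visitors each variant received.
    """
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    shown = [0] * len(true_rates)
    for _ in range(visitors):
        # Draw a plausible rate for each variant from its Beta posterior
        # (uniform Beta(1, 1) prior) and show the variant with the highest draw.
        draws = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
        arm = draws.index(max(draws))
        shown[arm] += 1
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1
    return shown


shown = thompson_rollout([0.05, 0.10], visitors=5000)
print(f"visitors per variant: {shown}")
```

Traffic drifts toward the better-performing variant as evidence accumulates, which is the point: less traffic is spent on the loser, at the cost of a less clean estimate of the difference than a fixed 50/50 test would give.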

The honest hierarchy: randomized test when powered, painted door or before/after when not, qualitative always in parallel.

What "significant" does and does not mean

Statistical significance means: unlikely under pure chance. It does not mean: big, durable, or causal-for-the-reason-you-think. A significant 2% relative lift on a low-value metric can be worth less than a large but non-significant result that warrants a rerun with proper power. Pair every significance statement with the effect size and the business value, and let that drive the ship/kill decision.

A note on what to test at all: the expected value of a test is (probability it wins) × (value of the win) − (cost of running it, including the traffic it consumes and the decisions it delays). At startup scale, traffic is your scarcest experimental resource — spending four weeks of it to decide between two nearly identical headlines is a losing trade even if the test is statistically impeccable. Reserve formal tests for decisions that are genuinely reversible-but-expensive: pricing pages, onboarding flows, core value propositions. Ship the small stuff on judgment.
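
The expected-value formula above is simple enough to run as back-of-the-envelope arithmetic. In the sketch below every number is hypothetical (win probabilities, dollar values, and the cost of four weeks of traffic are made up for illustration); only the formula comes from the text.

```python
def expected_test_value(p_win, value_of_win, cost_of_running):
    """(probability it wins) x (value of the win) - (cost of running it)."""
    return p_win * value_of_win - cost_of_running


# Hypothetical inputs, in dollars; the cost covers traffic consumed and decisions delayed.
headline_tweak = expected_test_value(p_win=0.3, value_of_win=2_000, cost_of_running=3_000)
pricing_page = expected_test_value(p_win=0.3, value_of_win=50_000, cost_of_running=3_000)

print(f"headline tweak: {headline_tweak:+,.0f}")
print(f"pricing page:   {pricing_page:+,.0f}")
```

With the same win probability and the same cost, the sign of the result is set entirely by the size of the prize, which is the argument for spending scarce traffic on big decisions.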

Culture beats mathematics

The statistics are learnable in an afternoon. The hard part is the culture: pre-registering metrics when it would be convenient not to, publishing losing tests, letting the sample size veto impatience. Teams that keep that discipline compound learning; teams that don't compound noise.

Tooling helps the discipline stick — when test setup forces a primary metric, the dashboard shows required versus collected sample, and the verdict is computed rather than argued. That is exactly how the A/B testing module in Growth Pilot works: declare the hypothesis, let the math referee the outcome, and keep every verdict in the same cockpit as the metrics it moved.