The sweet spot is four.
Teams running fewer than 4 experiments per month learn too slowly to compete. Teams running more than 8 burn out and ship sloppy tests. After managing 2,000+ experiments across our portfolio, we've found the sweet spot: 4 experiments monthly, in 2-week cycles. Here's the exact playbook.
What 4-per-month produces.
of experiments produced a stat-sig winner among teams running the 4/month cadence — vs. 12% for ad-hoc teams.
documented learnings per year, including losses. Compounds across cycles.
burnout flags on engineer satisfaction surveys at the 4/month cadence. Eight per month produced a 14% drop in satisfaction.
A repeatable cycle.
Mon W1 · Hypothesis review
Score the candidate hypotheses from the backlog. Pick the top two for this cycle. Each gets a one-page brief — problem, hypothesis, primary metric, MDE, sample size.
Tue–Wed W1 · Build & instrument
Two days max. If it takes longer, scope was wrong — go back to the brief and trim.
Thu W1 → Wed W2 · Run
7-day run window minimum. Pre-commit to the analysis plan, and don't peek at results before the pre-committed sample size is reached.
Thu W2 · Read the result
Stat-sig win, stat-sig loss, or no signal. Document all three outcomes equally.
Fri W2 · Decision + cleanup
Roll out winners, kill losers, archive the learning. Cleanup includes removing the experiment flag.
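The three outcomes read on Thursday of week 2 can be sketched as a small classifier. This is a minimal illustration, not the analysis the teams above necessarily used: it assumes a conversion-style primary metric where higher is better, a pooled two-proportion z-test, and a two-sided alpha of 0.05. The function name `classify_result` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def classify_result(control_conv, control_n, variant_conv, variant_n, alpha=0.05):
    """Return ("win" | "loss" | "no signal", p_value) for a conversion metric.

    Assumption: higher conversion is better, and a pooled two-proportion
    z-test is an acceptable analysis for the primary metric.
    """
    if control_n <= 0 or variant_n <= 0:
        raise ValueError("both arms need at least one observation")

    p_control = control_conv / control_n
    p_variant = variant_conv / variant_n

    # Pooled rate under the null hypothesis of no difference.
    pooled = (control_conv + variant_conv) / (control_n + variant_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    if se == 0:
        # No variation at all (all conversions or none): nothing to detect.
        return "no signal", 1.0

    z = (p_variant - p_control) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided

    if p_value >= alpha:
        return "no signal", p_value
    return ("win" if z > 0 else "loss"), p_value


outcome, p_value = classify_result(400, 4000, 480, 4000)
print(outcome, round(p_value, 4))
```

Returning the p-value alongside the label keeps the archived learning useful for losses and no-signal results as well as wins.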
Most experiments fail at the brief.
| Marker | Win-Rate Lift | Operational Tell |
|---|---|---|
| Has a primary metric | +22 pp | Pick one. Tie-breaks come second. |
| Has a minimum detectable effect (MDE) | +18 pp | Forces sample-size honesty |
| Has a stop rule | +14 pp | When do we kill it before stat-sig? |
| Has a roll-back plan | +11 pp | Mostly mechanical with feature flags |
| Pre-committed analysis plan | +9 pp | Prevents post-hoc storytelling |
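To make the "sample-size honesty" point concrete, here is a rough sketch of how an MDE turns into a required sample size. It assumes a conversion-rate metric, an absolute MDE, a two-sided alpha of 0.05, 80% power, and the standard normal-approximation formula for comparing two proportions. The names `sample_size_per_arm` and `days_to_reach` are illustrative, not part of any particular tool.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm to detect an absolute lift of mde_abs."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    if not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("baseline and baseline + MDE must lie strictly between 0 and 1")
    if mde_abs == 0:
        raise ValueError("MDE must be non-zero")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


def days_to_reach(n_per_arm, daily_visitors, arms=2):
    """Days of traffic needed if all daily visitors are split evenly across arms."""
    return math.ceil(arms * n_per_arm / daily_visitors)


n = sample_size_per_arm(baseline_rate=0.10, mde_abs=0.02)
print(n, days_to_reach(n, daily_visitors=2000))
```

With a 10% baseline and a 2 pp absolute MDE, this approximation asks for roughly 3,800 visitors per arm. If the resulting number of days does not fit the run window, the brief needs a larger MDE or a different metric rather than a longer cycle.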
What separates healthy programs.
✓ DO
- Run 4 experiments per month, in 2-week cycles
- Pre-commit to MDE, sample size, and analysis plan
- Document all three outcomes (win / loss / no signal) equally
- Tie experiments to existing flag infrastructure
- Quarterly retrospective on the experiment program itself
✗ DON'T
- Run more than 4 in parallel — interaction effects swamp results
- Peek at results before the pre-committed sample size is reached
- Tell post-hoc stories about marginal effects
- Roll out a winner without removing the experiment flag
- Ship 'wait and see' experiments without a stop rule
Before you start the cycle.
- Hypothesis backlog scored and ranked
- Top 2 hypotheses have one-page briefs
- Primary metric + MDE + sample size pre-committed
- Stop rule documented
- Flag and instrumentation in place before launch
- Analysis plan committed to repo
- Win / loss / no-signal templates ready
- Quarterly retrospective scheduled
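The per-experiment items on this checklist can be enforced mechanically before launch. The sketch below is one possible shape, assuming the brief is stored as structured data. The `ExperimentBrief` class, its field names, and `missing_items` are hypothetical and only mirror the brief contents and markers described above.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ExperimentBrief:
    # Field names are illustrative; they mirror the one-page brief and checklist.
    problem: str = ""
    hypothesis: str = ""
    primary_metric: str = ""
    mde: Optional[float] = None
    sample_size: Optional[int] = None
    stop_rule: str = ""
    rollback_plan: str = ""
    analysis_plan_path: str = ""


def missing_items(brief):
    """Return the names of brief fields that are empty, unset, or non-positive."""
    missing = []
    for f in fields(brief):
        value = getattr(brief, f.name)
        if value is None or value == "":
            missing.append(f.name)
        elif isinstance(value, (int, float)) and value <= 0:
            missing.append(f.name)
    return missing


draft = ExperimentBrief(hypothesis="Shorter signup form lifts completion",
                        primary_metric="signup_completion_rate")
print(missing_items(draft))
```

An empty list means the brief is ready for hypothesis review. Anything else goes back to the author before the build days start.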