Experimentation (A/B)
Upstream: Martech Stack & Automation — instrumentation, tooling, and data plumbing. This page covers measurement execution.
The decision this page enables: how to run experiments that produce trustworthy learnings — not dashboards full of false positives and opinions dressed up as data.
What A/B experimentation is (and why it matters)
A/B experimentation is the practice of comparing two (or more) variants of a customer experience (a headline, an email subject line, a pricing page layout, a lifecycle sequence) to see which performs better on a pre-defined metric, under controlled conditions.
Without experimentation, marketing optimization is storytelling. Someone on the team has a strong opinion about the headline; someone else has a different opinion; the loudest voice wins. With experimentation, the customer decides — and you document what you learned so the next person doesn’t re-litigate the same debate.
Experimentation matters because:
- Small lifts compound. A 5% improvement on signup rate, repeated across homepage, onboarding, and lifecycle email, compounds to roughly a 16% gain (1.05³ ≈ 1.16); around fourteen such wins in a year would double effective acquisition.
- It kills bad ideas cheaply. A pricing-page redesign that feels better but converts 12% worse is a six-figure mistake at scale. A two-week test catches it before you ship.
- It builds institutional memory. A documented experiment backlog is a library of what your customers actually respond to — more valuable than any brand guidelines doc.
- It pairs with everything else in Analytics & Measurement. Baselines from KPIs & Metrics, funnel stage context from Funnel, and budget decisions from ROI / ROAS all feed into what you test next.
Use experimentation when you have enough traffic or send volume to reach statistical significance in a reasonable window (usually 2–4 weeks), a clear hypothesis, and a single primary metric. Skip it when you’re pre-PMF with 200 visitors/week, or when the change is so small that even a win won’t move the business.
Core concepts
Before you run your first test, lock these definitions. Ambiguity here produces arguments after the test, not before.
| Concept | Definition | Common mistake |
|---|---|---|
| Hypothesis | A falsifiable prediction: “If we change X, metric Y will improve by Z% because [mechanism].” | “Let’s test a new headline” is not a hypothesis |
| Control | The current experience (variant A). Always include one. | Running two new variants with no baseline |
| Treatment | The changed experience (variant B, C…). | Changing 5 things at once — you won’t know what worked |
| Primary metric | The one number the test decides on. Pick exactly one. | Reporting 12 metrics and calling whichever improved “the winner” |
| Guardrail metric | A metric that must not degrade (e.g., revenue, unsubscribe rate). | Ignoring guardrails until a “winning” test tanks LTV |
| Sample size | How many subjects (visitors, emails sent, users) each variant needs. | Peeking at Day 3 and calling it early |
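| Statistical significance | How unlikely the observed difference would be if the variants truly performed the same. Convention: p < 0.05, meaning a difference this large would show up by chance less than 5% of the time (often reported as “95% confidence”). | Treating 94% confidence as “close enough” when the decision is high-stakes |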
| MDE (minimum detectable effect) | The smallest lift you care about detecting. Smaller MDE = longer test. | Testing for a 0.5% lift when you need 10% to justify the engineering cost |
| Holdout | A group that receives neither variant — used to measure true incremental lift of a program. | Measuring lifecycle email “lift” without a holdout (you’re measuring correlation, not causation) |
The experiment loop
Every test follows the same loop. Skipping a step is how teams ship losers or kill winners.
flowchart LR
H["Hypothesis<br/>falsifiable prediction"] --> D["Design<br/>metric, sample size, variants"]
D --> R["Run<br/>no peeking, guardrails live"]
R --> A["Analyze<br/>significance + segments"]
A --> S{"Decision"}
S -->|"Winner + guardrails OK"| Ship["Ship<br/>document learning"]
S -->|"No winner or guardrail hit"| Kill["Kill<br/>document learning"]
Ship --> H
Kill --> H
How to run an experiment — step by step
- Start from a baseline. Before writing a hypothesis, know the current conversion rate, open rate, or activation rate for the surface you’re testing. If you don’t have a baseline, instrument first (see Martech Stack & Automation).
- Write the hypothesis. Use the format: “We believe that [change] will [improve/decrease] [primary metric] by [MDE]% because [customer insight or mechanism]. We’ll know we’re wrong if [guardrail metric] moves adversely.”
- Pick one primary metric and 1–2 guardrails. Primary: signup rate, activation rate, click-through rate, trial-to-paid rate. Guardrails: bounce rate, unsubscribe rate, average order value, support tickets.
- Calculate sample size before launch. Use an online calculator (Evan Miller, Optimizely, or your experimentation platform). Input: baseline conversion rate, MDE, significance level (95%), power (80%). Output: required visitors per variant.
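- Design the variants. Change one meaningful thing per test. Each added variant needs its own full sample, so an A/B/C test needs about 1.5× the total traffic of an A/B test (more if you correct for multiple comparisons). Document exactly what differs between control and treatment: screenshots, copy diffs, config flags.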
- Set the runtime and traffic split. Default: 50/50 split, run until sample size is reached. Don’t peek and stop early. If you must peek, use sequential testing methods (your platform may support this) — don’t apply standard significance math to peeked data.
- Launch with QA. Verify tracking fires for both variants. Check that the experiment tool, analytics, and warehouse all see the same assignment. Broken tracking = wasted test.
- Analyze at full sample. Check primary metric significance, guardrail metrics, and key segments (mobile vs desktop, new vs returning, channel source). A winner overall that loses on mobile is a segment insight, not necessarily a ship.
- Ship, kill, or iterate — and write it up. Every test gets a one-page results doc in the experiment log. Winners ship; losers get archived with the learning. Neither outcome is failure — undocumented outcomes are.
- Feed learnings into the backlog. Update your ICE-scored backlog (see Templates). A winning headline test informs the next ad creative test; a losing pricing layout informs the next Pricing Model review.
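The launch QA step (experiment tool, analytics, and warehouse agreeing on assignment) can be sketched as a simple reconciliation. This is a minimal illustration, assuming you can export a user-to-variant mapping from each system; the function name and the sample data are hypothetical.

```python
def assignment_mismatches(tool: dict[str, str], warehouse: dict[str, str]) -> dict[str, list[str]]:
    """Compare user -> variant assignments exported from two systems (hypothetical helper)."""
    # Users the experiment tool assigned but the warehouse never recorded.
    missing = sorted(u for u in tool if u not in warehouse)
    # Users the warehouse recorded but the experiment tool does not know.
    unknown = sorted(u for u in warehouse if u not in tool)
    # Users both systems know but place in different variants.
    conflicting = sorted(u for u in tool if u in warehouse and tool[u] != warehouse[u])
    return {
        "missing_in_warehouse": missing,
        "unknown_to_tool": unknown,
        "conflicting": conflicting,
    }


# Made-up sample exports, for illustration only.
tool_export = {"u1": "control", "u2": "treatment", "u3": "control"}
warehouse_export = {"u1": "control", "u2": "control", "u4": "treatment"}

report = assignment_mismatches(tool_export, warehouse_export)
for problem, users in report.items():
    print(f"{problem}: {users}")
```

Any non-empty list is a reason to pause the launch and fix tracking before collecting data.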
The six rules of trustworthy experimentation
These rules exist because most “A/B test results” in marketing are wrong. Follow all six, or don’t claim you ran an experiment.
Rule 1: One primary metric, decided before launch
The primary metric is the contract between you and your stakeholders. If signup rate is primary, the test is decided on signup rate, not on “well, time-on-page also went up.” Changing the primary metric after the test is p-hacking.
In practice: Write the primary metric in the experiment brief (see Templates) and get one stakeholder to sign off before launch. If they want a different metric, change the brief — don’t change it after results arrive.
Rule 2: Calculate sample size before you start
Running until “it looks significant” is the most common source of false positives. Pre-calculate the sample size based on your baseline rate and the minimum lift you care about.
Rule of thumb: At a 5% baseline conversion rate, detecting a 10% relative lift (5.0% → 5.5%) needs roughly 30,000 visitors per variant. At 20% baseline, the same relative lift needs roughly 7,000 per variant. Low-traffic surfaces need longer runtimes or higher MDE targets.
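The rule of thumb can be reproduced with the standard normal-approximation formula for comparing two proportions. This is a sketch rather than the exact method any particular calculator uses, so expect small differences from tool to tool; the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variant for a two-sided test of two proportions."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)           # rate if the hypothesized lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


n_low_baseline = sample_size_per_variant(0.05, 0.10)   # 5.0% -> 5.5%
n_high_baseline = sample_size_per_variant(0.20, 0.10)  # 20% -> 22%
print(n_low_baseline, n_high_baseline)
```

Under these assumptions the two calls land near 31,000 and 6,500 per variant, in line with the rough figures above. Halving the MDE roughly quadruples the required sample, which is why small lifts are expensive to detect.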
Rule 3: Don’t peek (or use proper sequential methods)
Peeking at results daily and stopping when p < 0.05 inflates your false-positive rate from 5% to 20–30%. If your organization can’t resist peeking, use a platform with sequential testing or agree on a fixed end date before launch.
Acceptable peeking: Checking guardrail metrics for catastrophic harm (unsubscribe rate 3× control) and stopping for safety — not for declaring victory.
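To see why peeking inflates false positives, you can simulate A/A tests (both arms share the same true rate) and compare "stop at the first significant daily look" against "check once at the end". The sketch below uses made-up traffic numbers and a pooled two-proportion z-test; the exact rates depend on the seed and on how many looks you take.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions at all yet, so nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(runs=400, days=10, daily_per_arm=200, rate=0.05, seed=7):
    """Return (false-positive rate with daily peeking, false-positive rate at fixed horizon)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        would_have_stopped = False
        for _ in range(days):
            conv_a += sum(rng.random() < rate for _ in range(daily_per_arm))
            conv_b += sum(rng.random() < rate for _ in range(daily_per_arm))
            n += daily_per_arm
            # A peeker declares a winner the first day p dips below 0.05.
            if p_value(conv_a, n, conv_b, n) < 0.05:
                would_have_stopped = True
        peeking_hits += would_have_stopped
        fixed_hits += p_value(conv_a, n, conv_b, n) < 0.05
    return peeking_hits / runs, fixed_hits / runs


peeking_fpr, fixed_fpr = simulate_aa_tests()
print(f"false positives with daily peeking: {peeking_fpr:.1%}")
print(f"false positives at fixed horizon:   {fixed_fpr:.1%}")
```

Because there is no real difference between the arms, every "win" here is a false positive. The fixed-horizon rate should sit near 5%, while the peeking rate is always at least as high and typically several times higher.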
Rule 4: Change one thing at a time
A test that changes headline + hero image + CTA + social proof tells you something improved, not what. Single-variable tests are slower but produce actionable learnings. Multi-variable tests belong in later-stage optimization with factorial design and much larger sample sizes.
Exception: “Radical redesign” tests (completely new page vs current page) are valid when you’re willing to learn “old vs new overall” and iterate inside the winner in subsequent tests.
Rule 5: Segment after, not during
Deciding “mobile users are the real audience” after seeing that mobile won and desktop lost is cherry-picking. Run the test on all traffic; then analyze segments. If a segment-specific winner emerges, follow up with a segment-targeted confirmatory test.
Rule 6: Document every test — wins and losses
An experiment program that only publishes wins is a program that repeats mistakes. The results write-up (see Templates) is the product of experimentation, not the variant that shipped. Losses that teach you “customers don’t care about feature X in the headline” save the next quarter’s roadmap debate.
Statistical significance cheat sheet
You don’t need a statistics degree. You need to know when to trust a number and when to wait.
Quick reference table
| Baseline rate | MDE (relative) | Approx. sample per variant (95% conf, 80% power) | Runtime at 1,000 daily visitors per variant |
|---|---|---|---|
| 2% | 20% (2.0% → 2.4%) | ~21,000 | ~21 days |
| 5% | 10% (5.0% → 5.5%) | ~30,000 | ~30 days |
| 5% | 20% (5.0% → 6.0%) | ~8,000 | ~8 days |
| 10% | 10% (10% → 11%) | ~14,000 | ~14 days |
| 20% | 10% (20% → 22%) | ~7,000 | ~7 days |
| 40% (email open) | 5% (40% → 42%) | ~9,500 sends | depends on list size |
| 60% (activation) | 5% (60% → 63%) | ~4,100 users | depends on signup volume |

Approximations for two-sided tests using the normal approximation. Use a calculator for exact numbers. If 1,000 is your total daily traffic split 50/50, double the runtimes.
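The runtime figures in the table line up with 1,000 visitors per day reaching each variant. When your daily figure is total traffic shared across variants, the runtime stretches accordingly. A small sketch of that arithmetic (the function name is hypothetical):

```python
from math import ceil


def runtime_days(sample_per_variant: int, daily_visitors: int, variants: int = 2) -> int:
    """Days needed to fill every arm when total traffic is split evenly across variants."""
    per_variant_per_day = daily_visitors / variants
    return ceil(sample_per_variant / per_variant_per_day)


# 2,000 total daily visitors split 50/50 means 1,000 per variant per day.
days_at_2000_total = runtime_days(30_000, daily_visitors=2_000)
# 1,000 total daily visitors split 50/50 means only 500 per variant per day.
days_at_1000_total = runtime_days(30_000, daily_visitors=1_000)
print(days_at_2000_total, days_at_1000_total)  # 30 60
```

Adding a third variant at the same total traffic pushes the runtime out further, since each arm now receives a third of the visitors.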
Decision rules
| Situation | What to do |
|---|---|
| p-value < 0.05, full sample reached, guardrails clean | Ship the winner (or schedule ship if engineering needed) |
| p-value 0.05–0.10, full sample reached | Inconclusive. Extend the test, increase sample, or accept you can’t detect this MDE |
| p-value < 0.05 but guardrail degraded >5% | Kill. A signup win that increases churn is not a win |
| Sample not reached, deadline hit | Don’t call it. Report “underpowered” and either extend or increase MDE target |
| Winner flips between days | Keep running. Early variance is normal; you’re peeking |
| One segment wins, overall flat | Follow-up test targeted at that segment; don’t ship globally yet |
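The core rows of the table can be encoded as a small function so every readout is judged the same way. This is a sketch, not a prescribed implementation: the class and field names are hypothetical, and the final branch (p above 0.10) is an assumption, since the table does not cover it explicitly.

```python
from dataclasses import dataclass


@dataclass
class Readout:
    p_value: float                # p-value on the primary metric
    sample_reached: bool          # did each variant hit the pre-calculated sample?
    guardrail_degradation: float  # relative worsening of the worst guardrail, 0.07 = 7% worse


def decide(readout: Readout) -> str:
    """Apply the decision rules in order of precedence."""
    if not readout.sample_reached:
        return "underpowered: extend the test or raise the MDE target"
    if readout.p_value < 0.05 and readout.guardrail_degradation > 0.05:
        return "kill: the guardrail degraded by more than 5%"
    if readout.p_value < 0.05:
        return "ship the winner"
    if readout.p_value <= 0.10:
        return "inconclusive: extend, add sample, or accept this MDE is out of reach"
    # Assumption: anything weaker is treated as no winner.
    return "no winner: keep control and document the learning"


print(decide(Readout(p_value=0.03, sample_reached=True, guardrail_degradation=0.0)))
print(decide(Readout(p_value=0.03, sample_reached=True, guardrail_degradation=0.08)))
print(decide(Readout(p_value=0.07, sample_reached=True, guardrail_degradation=0.0)))
```

Checking sample size first is deliberate: an underpowered test should never reach the ship-or-kill branches, whatever its p-value says.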
Confidence vs business significance
Statistical significance answers: “Is this difference real?” Business significance answers: “Is this difference worth the cost of shipping and maintaining?”
A 0.3% absolute lift on signup rate may be p < 0.001 at 500k visitors — but if the variant adds 200ms page load and requires a permanent engineering flag, the business case may still be “no.” Always pair statistical results with an impact estimate: expected incremental signups/revenue per month.
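The impact estimate is simple arithmetic, and writing it down keeps the business case explicit. In this sketch the traffic figure and the value per signup are hypothetical placeholders; substitute your own.

```python
def monthly_impact(monthly_visitors: int, absolute_lift: float,
                   value_per_conversion: float) -> tuple[float, float]:
    """Return (extra conversions per month, extra revenue per month)."""
    extra_conversions = monthly_visitors * absolute_lift
    return extra_conversions, extra_conversions * value_per_conversion


# Hypothetical inputs: 500k visitors/month, +0.3 points on signup rate, $40 per signup.
extra_signups, extra_revenue = monthly_impact(500_000, 0.003, 40.0)
print(f"{extra_signups:.0f} extra signups, about ${extra_revenue:,.0f} per month")
```

Set that figure against the cost of the extra page weight and the permanent flag before deciding to ship.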
Lifecycle program lift and holdouts
Standard A/B tests compare variant A vs variant B on a page or email. Lifecycle programs (multi-touch, multi-channel journeys) need a different measurement design because there’s no single “conversion point” and because users who receive more touches almost always convert more (correlation ≠ causation).
See Lifecycle Programs for program design. This section covers how to measure them honestly.
Why holdouts exist
If you send a 5-email welcome series to 100% of new signups, you’ll probably see higher activation than you did before the series existed. That doesn’t prove the series caused the lift: those users might have activated anyway. A holdout group (typically 5–10% of eligible users who receive no program touches) gives you a counterfactual.
Incremental lift = (treatment group metric) − (holdout group metric)
Not: (treatment group metric) − (historical baseline from before the program existed)
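Because the holdout is usually small, the incremental lift is worth reporting with a confidence interval rather than as a single number. A minimal sketch using the normal approximation for a difference of two proportions; the counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def incremental_lift(conv_treat: int, n_treat: int, conv_hold: int, n_hold: int,
                     confidence: float = 0.95) -> tuple[float, float, float]:
    """Return (lift, lower bound, upper bound) for treatment rate minus holdout rate."""
    p_t = conv_treat / n_treat
    p_h = conv_hold / n_hold
    lift = p_t - p_h
    # Unpooled standard error of the difference between the two rates.
    se = sqrt(p_t * (1 - p_t) / n_treat + p_h * (1 - p_h) / n_hold)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return lift, lift - z * se, lift + z * se


# Hypothetical counts: 9,000 treated users (38% activate), 1,000 holdout users (29% activate).
lift, low, high = incremental_lift(3_420, 9_000, 290, 1_000)
print(f"lift: {lift * 100:+.1f}pp (95% CI {low * 100:+.1f}pp to {high * 100:+.1f}pp)")
```

With these counts the interval spans roughly 6 to 12 points. Most of that width comes from the 1,000-user holdout, which is the trade-off behind keeping holdouts small.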
Holdout design for lifecycle programs
- Randomize at enrollment. When a user enters the program trigger (e.g., signup), randomly assign to treatment (full program) or holdout (no program touches). Assignment must be sticky: a holdout user who later gets emails because of a bug invalidates the test.
- Holdout size: 5% minimum for high-volume B2C; 10% for lower-volume B2B. Smaller holdouts work but widen confidence intervals.
- Duration: Lifecycle lift manifests over weeks, not days. Run holdouts for at least one full program cycle (e.g., 14 days for a welcome series, 90 days for an expansion program).
- Primary metric: Match the program stage — activation rate for welcome series, 4-week retention for engagement programs, expansion rate for upsell programs.
- Guardrails: Unsubscribe rate (should be zero in holdout by definition), support tickets, NPS.
When holdouts aren’t worth it
- Pre-PMF, low volume: You can’t afford to withhold touches from 10% of 50 signups/week. Use before/after with strong caveats, or qualitative cohort review.
- Compliance / transactional messages: Password resets and billing notices aren’t experiments — no holdout.
- Tiny expected lift: If the program is cheap to run and the downside of not sending is low, some teams accept “directional” measurement. Document the assumption.
Combining A/B and holdouts
The most rigorous lifecycle measurement uses both:
- Holdout → measures incremental lift of the program vs nothing
- A/B within treatment → measures which variant of the program performs best among those who receive it
Example: 90% of signups enter the program; 10% are holdout. Of the 90%, 50% get email sequence v1 and 50% get v2. You learn (a) whether the program beats silence, and (b) which sequence is better.
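One way to implement the 10% holdout plus 50/50 split is deterministic hashing, which also keeps assignment sticky without storing state. This is a sketch under that assumption; the function names, arm labels, and program ID are hypothetical, and your platform may handle assignment for you.

```python
import hashlib


def _bucket(key: str, buckets: int) -> int:
    """Map a string to a stable bucket number in [0, buckets)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % buckets


def assign_lifecycle_arm(user_id: str, program_id: str, holdout_pct: int = 10) -> str:
    """Assign a user to holdout, sequence_v1, or sequence_v2 (same answer on every call)."""
    # Stage 1: carve out the holdout from all eligible users.
    if _bucket(f"{program_id}:holdout:{user_id}", 100) < holdout_pct:
        return "holdout"
    # Stage 2: split the remaining users 50/50 with an independent hash.
    if _bucket(f"{program_id}:variant:{user_id}", 2) == 0:
        return "sequence_v1"
    return "sequence_v2"


counts = {"holdout": 0, "sequence_v1": 0, "sequence_v2": 0}
for i in range(10_000):
    counts[assign_lifecycle_arm(f"user-{i}", "welcome-series")] += 1
print(counts)
```

Using separate hash keys for the two stages keeps the holdout decision independent of the variant decision, so changing the variant split later does not move anyone in or out of the holdout.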
Templates
Experiment brief
Copy before every test. One page max.

Experiment ID: EXP-2026-___
Owner: [name]
Status: [draft / running / complete]
Surface: [homepage hero / pricing page / welcome email #2 / etc.]
Linked initiative: [campaign name, OKR, or backlog item]

── HYPOTHESIS ──────────────────────────────────────────────
We believe that [specific change] will [increase/decrease]
[primary metric] by [MDE]% because [customer insight or mechanism].

We will consider the hypothesis wrong if [guardrail metric]
moves adversely by more than [threshold].

── DESIGN ──────────────────────────────────────────────────
Control (A): [describe current experience; link screenshot]
Treatment (B): [describe change; link screenshot]
Variants: [A/B or A/B/C]
Traffic split: [50/50]
Primary metric: [exact definition + data source]
Guardrail metrics: [1–2 metrics + thresholds]
Baseline rate: [current %, date measured]
MDE target: [relative % lift you need to detect]
Sample per variant: [calculated N]
Expected runtime: [days/weeks at current traffic]
Start date: [YYYY-MM-DD]
Planned end: [YYYY-MM-DD; do not stop early]
Platform: [Optimizely / VWO / LaunchDarkly / ESP native / etc.]
Tracking QA: [ ] control fires [ ] treatment fires [ ] warehouse syncs

── STAKEHOLDERS ────────────────────────────────────────────
Decision maker: [who ships or kills]
Reviewers: [who sees results before broader share]

Results write-up
Complete within 48 hours of test end.

Experiment ID: EXP-2026-___
Runtime: [start] → [end] ([N] days)
Sample reached: [yes / no; if no, note underpowered]

── RESULTS ─────────────────────────────────────────────────
                  Control   Treatment   Lift    p-value
Primary metric:   [x%]      [y%]        [+z%]   [0.0xx]
Guardrail 1:      [...]     [...]       [...]   [...]
Guardrail 2:      [...]     [...]       [...]   [...]

Segment notes:
  Mobile: [...]
  Desktop: [...]
  Channel [X]: [...]

── DECISION ────────────────────────────────────────────────
[ ] Ship treatment [ ] Keep control [ ] Iterate (new test)
Rationale: [2–3 sentences]

Estimated monthly impact: [incremental signups / revenue / etc.]

── LEARNING ────────────────────────────────────────────────
What we learned (one sentence a stranger would understand): "..."

Follow-up tests suggested:
  1. [...]
  2. [...]

Link to dashboard: [URL]

Experiment backlog (ICE scoring)
Prioritize what to test next. Review it in your weekly Reporting Cadence ops meeting.
| ID | Idea (one line) | Surface | Impact (1–10) | Confidence (1–10) | Ease (1–10) | ICE score | Status |
|----|-----------------|---------|---------------|-------------------|-------------|-----------|--------|
| 1 | Outcome-first headline vs feature-first | Homepage hero | 8 | 6 | 9 | 7.7 | backlog |
| 2 | Annual pricing default vs monthly default | Pricing page | 9 | 5 | 6 | 6.7 | backlog |
| 3 | 3-email vs 5-email welcome series | Lifecycle | 7 | 7 | 4 | 6.0 | in design |
| 4 | Social proof above fold vs below fold | Signup page | 5 | 4 | 9 | 6.0 | complete ✓ |

ICE score = (Impact + Confidence + Ease) / 3
Sort by ICE; adjust for strategic priority (a 6.0 test tied to a launch beats a 7.0 nice-to-have).
Target: 8–15 tests per quarter for a growth-stage team.

Metrics to track
Track the experiment program, not just individual tests.
| Metric | Definition | Healthy range |
|---|---|---|
| Tests launched / quarter | Count of experiments reaching full sample | 8–15 (growth stage); 3–5 (early stage) |
| Test velocity (days) | Median days from brief → decision | 14–28 days for page tests; 7–14 for email |
| Win rate | % of tests where treatment beats control on primary metric | 25–40% — lower means you’re testing bold ideas; higher means you’re testing safe tweaks or p-hacking |
| Ship rate | % of winners actually deployed to 100% traffic | >80% — if winners don’t ship, experimentation is theater |
| Incremental lift (holdout programs) | Treatment minus holdout on primary metric | Varies; welcome series: +5–15pp activation is strong |
| Sample-size adherence | % of tests run to pre-calculated sample | >90% — peeking kills trust |
| Documented learnings / quarter | Results write-ups completed | 100% of completed tests |
| Impact shipped / quarter | Sum of estimated monthly impact from shipped winners | Track trend; absolute target depends on funnel size |
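Several of these program metrics fall straight out of the experiment log. The sketch below assumes a minimal log record with hypothetical field names and made-up entries; map them to whatever your log actually stores.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass
class Experiment:
    brief_date: date      # when the brief was signed off
    decision_date: date   # when ship/kill was decided
    reached_sample: bool  # ran to the pre-calculated sample size
    won: bool             # treatment beat control on the primary metric
    shipped: bool         # winner deployed to 100% of traffic


def program_metrics(log: list[Experiment]) -> dict[str, float]:
    """Summarize program health from a non-empty experiment log."""
    if not log:
        raise ValueError("experiment log is empty")
    completed = [e for e in log if e.reached_sample]
    winners = [e for e in completed if e.won]
    return {
        "tests_completed": len(completed),
        "velocity_days": median((e.decision_date - e.brief_date).days for e in log),
        "win_rate": len(winners) / len(completed) if completed else 0.0,
        "ship_rate": sum(e.shipped for e in winners) / len(winners) if winners else 0.0,
        "sample_size_adherence": len(completed) / len(log),
    }


# Made-up quarter with four tests.
log = [
    Experiment(date(2026, 1, 5), date(2026, 1, 26), True, True, True),
    Experiment(date(2026, 1, 12), date(2026, 2, 2), True, False, False),
    Experiment(date(2026, 2, 2), date(2026, 2, 16), True, True, False),
    Experiment(date(2026, 2, 9), date(2026, 2, 19), False, False, False),
]
metrics = program_metrics(log)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Here win rate only counts tests that reached full sample, so an early-stopped "winner" cannot inflate it.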
Worked examples
SaaS workspace (B2B)
Context: 25-person product teams, PLG motion with sales assist above $5k ACV. ~2,400 homepage visitors/week, 4.2% signup rate, 38% activation (first doc + invite within 7 days).
Experiment: Homepage hero headline test
| | Control (A) | Treatment (B) |
|---|---|---|
| Headline | “The unified team workspace” | “Stop losing context.” |
| Sub-head | Feature-led (docs, tasks, chat) | Outcome-led (replaces 4 tools, 1 hour setup) |
| Primary metric | Signup rate (visitor → account created) | |
| Guardrails | Bounce rate, demo-request rate | |
| Sample target | ~8,400 per variant (7 weeks at current traffic, split 50/50) | |
| MDE | ~21% relative (4.2% → 5.1%), the smallest lift this sample can detect at 80% power | |
Results (after 7 weeks):
- Signup rate: 4.2% control → 4.9% treatment (+16.7% relative, p = 0.03)
- Bounce rate: unchanged
- Demo-request rate: +8% (not significant alone, but directionally positive for sales-assist segment)
- Mobile: flat. Desktop: drove the lift.
Decision: Ship treatment to desktop traffic; follow-up mobile-specific test (shorter headline, different hero crop).
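The reported p-value can be sanity-checked with a pooled two-proportion z-test. This sketch assumes traffic was split evenly, so 7 weeks at ~2,400 visitors/week puts about 8,400 visitors in each variant; that count is inferred from the context line, not a reported figure.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(rate_a: float, n_a: int, rate_b: float, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    pooled = (rate_a * n_a + rate_b * n_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Assumed: ~8,400 visitors per variant; rates taken from the results above.
p = two_proportion_p_value(0.042, 8_400, 0.049, 8_400)
print(f"p = {p:.3f}")
```

Under that assumption the result comes out close to the reported p = 0.03.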
Lifecycle holdout (parallel track): 5-email welcome series vs 10% holdout. After 14 days: activation 38% treatment vs 29% holdout → +9pp incremental lift. Program validated; A/B within treatment tested 3-email vs 5-email (5-email won +4pp activation among treated users).
Consumer fitness app (B2C)
Context: Bodyweight workout app aimed at urban professionals aged 28–44. ~18,000 app-store landing page visitors/week, 8.5% install → trial, 22% trial-to-paid. High email volume (40% Day-1 open rate).
Experiment: App Store screenshot order test (cannot A/B the store itself — tested via paid landing page mirror that feeds identical store link)
| | Control (A) | Treatment (B) |
|---|---|---|
| First screenshot | Workout in gym setting | Living-room session, “14 min complete” overlay |
| Primary metric | Click-through to App Store | |
| Guardrails | Install rate (post-click), CPI from paid traffic | |
| Sample target | 8,000 per variant (~3 days via paid) | |
| MDE | 8% relative on CTR | |
Results (3 days, paid traffic only):
- CTR: 11.2% → 13.4% (+19.6% relative, p = 0.01)
- Install rate post-click: unchanged (13.1% both) — creative attracts same-quality users
- Decision: Reorder App Store screenshots to match treatment; apply same visual to Paid Advertising Meta creative
Lifecycle holdout: Day-3 “Hard day? 12 minutes will reset your mood” push + email vs holdout. Primary metric: Week-1 retention (4+ sessions). After 21 days: 34% treatment vs 31% holdout — +3pp, p = 0.12 (underpowered). Extended holdout to 90 days; final lift +4pp, p = 0.04. Smaller absolute lift than B2B welcome series, but at 50k monthly signups = ~2,000 retained users/month.
Common pitfalls
- Testing without a baseline. “Let’s try something new” with no control conversion rate wastes weeks. Instrument first.
- Peeking and early stopping. The #1 source of false wins. Pre-commit to sample size and end date.
- Changing multiple variables. You learn nothing actionable when headline + image + CTA all change.
- Ignoring guardrails. A signup win that increases Day-7 churn is a net loss. Always watch 1–2 guardrails.
- No holdout on lifecycle programs. “Our welcome series has 45% activation” means nothing if 40% would have activated anyway.
- Underpowered tests on low traffic. Running a 5-day test on 500 visitors and calling it is worse than not testing — it produces confident-sounding noise.
- Winners that never ship. Experimentation debt (15 completed tests, 3 shipped) erodes team trust. If eng bandwidth is the blocker, prioritize fewer, higher-ICE tests.
- Testing the wrong surface. Optimizing the homepage headline when activation is 15% targets the wrong stage: low activation is a Place: Logistics problem, not an acquisition one. Fix the bottleneck stage first.
Tools / further reading
| Tool | Best for |
|---|---|
| Optimizely / VWO / AB Tasty | Web page A/B, multivariate, personalization |
| LaunchDarkly / Statsig | Product-led experiments, feature flags, server-side tests |
| HubSpot / Iterable / Braze / Customer.io | Email and push A/B within lifecycle programs |
| Evan Miller’s Sample Size Calculator | Pre-launch sample-size math (free) |
| Google Optimize (sunset September 2023) | No longer available; move to GA4 plus a dedicated experimentation platform |
Reading:
- Trustworthy Online Controlled Experiments (Kohavi, Tang, Xu) — the canonical reference for experiment design at scale
- Lean Analytics (Croll & Yoskovitz) — stage-appropriate metrics that tell you what to test
- Martech Stack → Experimentation theme — how experimentation fits the 6-layer stack
Cross-links
- GTM Measurement Plan: where experiments sit in the annual measurement scorecard
- Lifecycle Programs: program design that holdouts measure
- Pricing Model: pricing-page experiments and tier tests
- Paid Advertising: ad creative testing and landing-page alignment
- KPIs & Metrics: baselines and guardrail definitions
- Attribution: triangulating experiment wins with channel attribution
- Reporting Cadence: weekly review of experiment backlog and results
- Martech Stack & Automation: upstream instrumentation and experimentation platform selection