A/B testing inside Amplitude pairs experiments with the analytics that measure them. But running an experiment correctly is harder than launching one. Here's the framework that keeps you out of false-positive trouble.
Who this is for
Product, growth, and engineering teams who want to ship feature changes with confidence. Especially relevant for SaaS teams with 5K+ monthly active users where statistical significance is achievable in reasonable timeframes.
What you'll need
Step 1
Amplitude Experiments is on Growth plan and above. Install `@amplitude/experiment-js-client` (web) or the platform-specific SDK alongside your analytics SDK.
Confirm Growth plan at Organization Settings → Billing. Experiments is NOT on Starter or Plus. Plan packaging changes over time, so verify against Amplitude's current pricing.
Install the Experiment SDK: `npm install @amplitude/experiment-js-client` for web. For mobile, use the iOS/Android Experiment SDKs.
Initialize alongside your analytics SDK: `const experiment = Experiment.initialize(DEPLOYMENT_KEY)`. The deployment key is different from your analytics API key — find it at Experiment → Deployments.
Call `experiment.fetch(user)` on app load to fetch the user's variant assignments. Pass the same `user_id` + user properties you use for analytics — Experiment uses the same identity model.
In your feature code: `const variant = experiment.variant("my-experiment-flag")` returns a variant object whose `value` is the assigned variant key ("control" / "treatment_a" / "treatment_b"). Branch your UI on `variant.value`.
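The calls in this step fit together roughly as follows. This is a minimal sketch, not the official integration: the SDK entry point is passed in as the `sdk` parameter so the snippet runs without the package installed, and `fakeSdk`, the key string, and the user fields are placeholders invented for illustration. In an app you would pass the `Experiment` export of `@amplitude/experiment-js-client` instead.

```javascript
// Sketch only: `sdk` stands in for the `Experiment` export of
// @amplitude/experiment-js-client. It is injected so this runs standalone.
async function getAssignedVariant(sdk, deploymentKey, user, flagKey) {
  const client = sdk.initialize(deploymentKey); // deployment key, not the analytics API key
  await client.fetch(user);                     // load this user's assignments
  const variant = client.variant(flagKey);      // object carrying a `value` field
  return variant && variant.value ? variant.value : "control";
}

// Hypothetical stand-in for the SDK, used only to make the sketch runnable.
const fakeSdk = {
  initialize(_deploymentKey) {
    const assignments = {};
    return {
      async fetch(user) {
        // Pretend the service assigns by user_id.
        assignments["my-experiment-flag"] = {
          value: user.user_id.endsWith("1") ? "treatment_a" : "control",
        };
        return this;
      },
      variant(flagKey) {
        return assignments[flagKey] || {};
      },
    };
  },
};

// Use the same user_id and user properties you send to analytics.
const exampleUser = { user_id: "user-1", user_properties: { plan: "free" } };

getAssignedVariant(fakeSdk, "client-PLACEHOLDER", exampleUser, "my-experiment-flag")
  .then((value) => console.log("assigned variant:", value));
```

Returning `"control"` when no value comes back keeps the calling code to a single branch on a string.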
Step 2
In Amplitude → Experiment → "Create New," define the hypothesis, variants, primary + secondary metrics, and audience.
Open Amplitude → Experiment → "Create New Experiment."
Hypothesis: write one sentence. "If we remove the credit-card requirement from free trial, then trial signups will increase by 15-30%, because the friction is the dominant cause of drop-off." Make the size of effect explicit.
Variants: control + 1-3 treatments. More variants = more sample needed per arm. Default to A/B (one treatment) unless you have a clear reason to test multiple.
Primary metric: ONE event. "Subscription Started." If you can't pick one primary, your hypothesis isn't sharp enough.
Secondary/guardrail metrics: 2-4 events that should NOT regress. "Total Revenue," "Page Load Time," "Activation Rate." If a guardrail regresses, ship only when the primary-metric improvement is statistically significant and large enough to justify the trade-off.
Audience: which users see the test? Filter by cohort, property, or event-based eligibility. Avoid rolling out to 100% of eligible users initially. Start by exposing about 50% and keep the remaining 50% as a holdout for comparison.
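One way to enforce this checklist is to write the design down as data and lint it before anyone opens the Experiment UI. The sketch below is an illustration only: the `exampleDesign` shape, its field names, and `validateDesign` are hypothetical and are not part of any Amplitude API.

```javascript
// Hypothetical pre-launch design record; this is not an Amplitude API object.
const exampleDesign = {
  hypothesis:
    "If we remove the credit-card requirement from free trial, then trial " +
    "signups will increase by 15-30%, because the friction is the dominant cause of drop-off.",
  variants: ["control", "treatment"],
  primaryMetric: "Subscription Started",
  guardrails: ["Total Revenue", "Page Load Time", "Activation Rate"],
  rolloutPercent: 50,
};

// Returns a list of checklist violations; an empty list means the design passes.
function validateDesign(design) {
  const problems = [];
  if (!/\bif\b.+\bthen\b.+\bbecause\b/i.test(design.hypothesis)) {
    problems.push("hypothesis should follow: If ..., then ..., because ...");
  }
  if (!design.variants.includes("control")) problems.push("missing control variant");
  const treatments = design.variants.filter((v) => v !== "control");
  if (treatments.length < 1 || treatments.length > 3) problems.push("use 1-3 treatments");
  if (typeof design.primaryMetric !== "string" || design.primaryMetric === "") {
    problems.push("pick exactly one primary metric");
  }
  if (design.guardrails.length < 2 || design.guardrails.length > 4) {
    problems.push("define 2-4 guardrail metrics");
  }
  if (design.guardrails.includes(design.primaryMetric)) {
    problems.push("primary metric should not double as a guardrail");
  }
  if (design.rolloutPercent >= 100) problems.push("do not start at 100% of users");
  return problems;
}

console.log(validateDesign(exampleDesign));
```

The regular expression is a crude proxy for "one sentence with an explicit effect and reason". It catches missing structure, not weak reasoning.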
Step 3
Use Amplitude's built-in sample-size calculator or an external tool. Most teams under-power their experiments by 3-10x and read false signals.
In the Experiment setup, Amplitude shows estimated detectable effect at a given sample size and duration. Set: baseline conversion (your current rate), minimum detectable effect (MDE — usually 5-15% relative), significance level (0.05), statistical power (0.80).
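Example: baseline 10% conversion, MDE 15% relative (so detecting an improvement to 11.5%), 0.05 significance, 0.80 power → roughly 6,700 users per variant under a standard fixed-horizon two-proportion test. Plan for ~10,000 per variant if you want headroom or intend to use sequential testing.
If your monthly user volume is 3,000, a two-variant version of this experiment needs roughly 4.5 to 7 months to read clearly. That's the math; respect it.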
Smaller MDE (e.g., detecting 5% relative improvement) requires 4-10x more sample. Don't try to detect small effects without massive scale.
If sample size is impossible at your traffic, the experiment isn't worth running. Either ship the change and measure pre/post (less rigorous but actionable), or wait until you have scale.
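The arithmetic behind this step can be sketched with the textbook two-proportion normal approximation. This is an assumption about method, not a description of Amplitude's calculator, which may use a different approach (for example sequential testing) and report different numbers. The z-scores are hard-coded for a two-sided 0.05 significance level and 0.80 power, and the function names are illustrative.

```javascript
// z-scores for the settings in the text: two-sided alpha = 0.05, power = 0.80.
const Z_ALPHA = 1.959964;
const Z_POWER = 0.841621;

// Users needed per variant to detect a relative lift of `relativeMde`
// over `baseline` (normal approximation, unpooled variance).
function usersPerVariant(baseline, relativeMde) {
  const p1 = baseline;
  const p2 = baseline * (1 + relativeMde);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const delta = p2 - p1;
  return Math.ceil(((Z_ALPHA + Z_POWER) ** 2 * variance) / delta ** 2);
}

// Months of traffic needed if every monthly user enters the test.
function monthsToRun(perVariant, variantCount, monthlyUsers) {
  return (perVariant * variantCount) / monthlyUsers;
}

const perArm = usersPerVariant(0.10, 0.15);       // the example from this step
const perArmSmallMde = usersPerVariant(0.10, 0.05);

console.log("users per variant:", perArm);
console.log("months at 3,000 users/month:", monthsToRun(perArm, 2, 3000).toFixed(1));
console.log("5% MDE needs this many times more sample:", (perArmSmallMde / perArm).toFixed(1));
```

Because the required sample scales with one over the squared effect size, shrinking the MDE is far more expensive than it looks: a third of the effect needs roughly nine times the users.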
Step 4
Engineering wraps the variant logic around the feature change. Variant assignment must happen BEFORE any UI rendering — otherwise users see flicker.
In the code path that handles the feature change, branch on `experiment.variant("flag-name")`. Example: `if (variant.value === "treatment") { showNewOnboarding() } else { showOldOnboarding() }`.
Critical: call `experiment.fetch(user)` and AWAIT it before rendering the experimental UI. Without await, users see the control then flicker to treatment — invalidates the test.
For server-side rendering (Next.js, etc.), fetch variants server-side and pass to client to avoid hydration mismatch. The Experiment Node SDK has a server-side method.
Wrap the variant fetch in a try/catch with a control fallback. If the Experiment service is slow/down, users get the control experience — never an error.
Test in staging: force-assign yourself to each variant via Experiment → Deployments → "Assign User to Variant." Verify the UI behaves correctly in each.
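The await-then-fallback pattern described in this step might look like the sketch below. It assumes a client object exposing `fetch(user)` and `variant(flagKey)` as in Step 1. The helper names, the 500 ms budget, and the `downClient` stub are illustrative choices, not SDK features.

```javascript
// Resolve to `fallback` if `promise` takes longer than `ms`.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// `client` is assumed to expose fetch(user) and variant(flagKey).
// The 500 ms default is an arbitrary example budget.
async function resolveVariant(client, user, flagKey, timeoutMs = 500) {
  try {
    const fetched = await withTimeout(
      client.fetch(user).then(() => true),
      timeoutMs,
      false
    );
    if (!fetched) return "control"; // too slow: do not block rendering
    const variant = client.variant(flagKey);
    return (variant && variant.value) || "control";
  } catch (err) {
    return "control"; // service error: users get control, never an error
  }
}

// Decide what to render only after the variant has been resolved.
async function renderOnboarding(client, user) {
  const value = await resolveVariant(client, user, "flag-name");
  return value === "treatment" ? "new-onboarding" : "old-onboarding";
}

// Hypothetical stub simulating an outage of the Experiment service.
const downClient = {
  fetch: () => Promise.reject(new Error("service unavailable")),
  variant: () => ({}),
};

renderOnboarding(downClient, { user_id: "user-1" }).then((screen) => console.log(screen));
```

On a timeout the sketch returns before calling `variant()`, so a user who never saw the treatment is not recorded as exposed to it.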
Step 5
Amplitude's Experiment SDK fires an "$exposure" event automatically when a user is assigned a variant. This is what powers the analysis — without it, the experiment has no data.
Every time `experiment.variant("flag")` is called for a user, Amplitude fires `[Experiment] $exposure` automatically.
Verify this in User Lookup for a test user: open User Lookup → search test user → look for `[Experiment] $exposure` event with `experiment_id` + `variant` properties.
For analysis to work, the exposure MUST fire before the primary-metric event for that user. If a user is exposed to the variant AFTER converting, they don't count.
Audit by checking: of 100 users assigned to treatment, do at least 90 have an `$exposure` event? If not, your code is branching without calling `variant()`. Fix the integration.
Common bug: the variant is fetched server-side but exposure isn't fired client-side. Always confirm exposure fires from the client where the UI change actually shows.
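If you export raw events for a handful of assigned users, the two checks in this step (exposure coverage, and exposure preceding conversion) can be scripted. The sketch below assumes a hypothetical export shape of `{ userId, type, time }` records and uses the exposure event name as written in this tutorial. Adjust both to match your data.

```javascript
const EXPOSURE_EVENT = "[Experiment] $exposure"; // event name as written in this tutorial

// `events` is a hypothetical export: [{ userId, type, time }], time in any sortable unit.
function auditExposure(assignedUserIds, events, conversionEvent) {
  // Earliest exposure per user.
  const firstExposure = new Map();
  for (const e of events) {
    if (e.type !== EXPOSURE_EVENT) continue;
    const seen = firstExposure.get(e.userId);
    if (seen === undefined || e.time < seen) firstExposure.set(e.userId, e.time);
  }
  // Users with at least one conversion after their first exposure.
  const counted = new Set();
  for (const e of events) {
    const exposedAt = firstExposure.get(e.userId);
    if (e.type === conversionEvent && exposedAt !== undefined && e.time > exposedAt) {
      counted.add(e.userId);
    }
  }
  const exposed = assignedUserIds.filter((id) => firstExposure.has(id));
  return {
    coverage: assignedUserIds.length ? exposed.length / assignedUserIds.length : 0,
    missingExposure: assignedUserIds.filter((id) => !firstExposure.has(id)),
    countedConversions: counted.size,
  };
}

// Made-up sample data for illustration.
const sampleEvents = [
  { userId: "u1", type: EXPOSURE_EVENT, time: 1 },
  { userId: "u1", type: "Subscription Started", time: 5 },
  { userId: "u2", type: "Subscription Started", time: 2 }, // converted before exposure
  { userId: "u2", type: EXPOSURE_EVENT, time: 3 },
  { userId: "u3", type: EXPOSURE_EVENT, time: 4 },
];
const audit = auditExposure(["u1", "u2", "u3", "u4"], sampleEvents, "Subscription Started");
console.log(audit);
```

A coverage figure below about 0.9 suggests the code is branching somewhere without calling `variant()`.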
Step 6
Open Amplitude → Experiment → [your test] → Results. Look at the primary metric, guardrails, and statistical significance. Wait for the planned duration.
Primary metric: Amplitude shows lift (or loss) per variant vs control, with confidence interval and p-value.
Significance: typically Amplitude flags "Statistically Significant" when p < 0.05. But ONLY trust this at the end of the planned duration — interim p-values are noise.
Guardrails: review every secondary metric. If any guardrail regresses significantly, the change probably shouldn't ship even if primary improved.
Heterogeneity: Amplitude shows treatment effect by segment (plan tier, source, device). If treatment works for paid users but hurts free users, you may want to ship to only one segment.
Decision rules to set BEFORE launch: "Ship if primary lifts ≥10% with p < 0.05 AND no guardrail regresses by >2%." Pre-committing avoids motivated reasoning when results are mixed.
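A pre-committed rule is easiest to honor when it is written as code before launch. The sketch below encodes the example rule from this step. The `results` shape is hypothetical (numbers you would copy from the Results tab by hand), and guardrail `change` is assumed to be signed so that a negative value means the metric got worse.

```javascript
// Thresholds from the example rule; agree on them before launch.
const shipRule = { minPrimaryLift: 0.10, maxPValue: 0.05, maxGuardrailDrop: 0.02 };

// `results` is a hypothetical summary, not an Amplitude export format.
// Guardrail `change` is signed so that negative means "got worse".
function decide(results, rule) {
  const reasons = [];
  if (results.primary.lift < rule.minPrimaryLift) reasons.push("primary lift below threshold");
  if (results.primary.pValue >= rule.maxPValue) reasons.push("primary not significant");
  for (const g of results.guardrails) {
    if (g.change < -rule.maxGuardrailDrop) reasons.push(`guardrail regressed: ${g.name}`);
  }
  return { ship: reasons.length === 0, reasons };
}

// Made-up results: the primary metric wins, but one guardrail drops by 8%.
const mixedResults = {
  primary: { lift: 0.12, pValue: 0.03 },
  guardrails: [
    { name: "Paid Conversion", change: -0.08 },
    { name: "Activation Rate", change: 0.004 },
  ],
};

console.log(decide(mixedResults, shipRule));
```

Returning the reasons alongside the verdict gives you the text for the experiment log without a second interpretation pass.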
Step 7
After the experiment, do one of three things: roll out to 100% (if win), retire the flag (if no effect), or extend the test (if borderline). Document the result.
If win + significance: in Experiment → Deployments, set the variant to 100% rollout. Then over the next 30 days, remove the experiment code and ship the variant as the new default.
If no effect: retire the flag. Keep the code path as control (the old behavior). Document why the hypothesis didn't work — these "negative results" prevent the same idea being re-tested in 6 months.
If borderline (e.g., 8% lift but p = 0.12): extend the test if you have traffic, or accept that the change is "neutral" and ship based on qualitative reasons.
In Amplitude → Notebooks, create an "Experiments Log" notebook. For each experiment, log: hypothesis, design, results, decision. This becomes your experiment library — analysts in 12 months will thank you.
Be ruthless about retiring flags. Stale flags accumulate in code and become tech debt. Schedule a quarterly cleanup of unused flags.
Common mistakes
Running underpowered experiments
What goes wrong: You run a test on 500 users per arm with 10% baseline conversion. The result "wins" with p = 0.06. You ship it. Three months later, conversion hasn't moved. The "win" was noise. Estimated wasted eng time: 100-300 hours per false positive.
How to avoid: Use Amplitude's sample-size calculator BEFORE launching. Hit the calculated minimum before reading. If your traffic doesn't support it, the experiment isn't worth running.
Peeking at results and stopping early
What goes wrong: Test runs 7 days, looks positive, you stop and declare victory. False positive rate jumps from 5% to 25-30%. Half of your "wins" are random walks. Cumulative bad ship rate over a year is 20-40% — significant product velocity lost.
How to avoid: Pre-commit to a duration based on sample-size math. Don't look at p-values during the run. Use Amplitude's sequential testing (Growth plan feature) if you genuinely need to peek.
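To see why peeking inflates false positives, you can simulate A/A tests (both arms share the same true conversion rate) and compare reading the result once at the end against stopping at the first day that looks significant. The sketch below is a toy simulation under assumed parameters (14 days, 100 users per arm per day, 10% conversion, a pooled two-proportion z-test). The exact rates depend on those choices and on the seed.

```javascript
// Small seeded PRNG (mulberry32) so the simulation is repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pooled two-proportion z-score with n users in each arm.
function zScore(convA, convB, n) {
  const pooled = (convA + convB) / (2 * n);
  const se = Math.sqrt((2 * pooled * (1 - pooled)) / n);
  return se === 0 ? 0 : (convB / n - convA / n) / se;
}

function simulate({ runs, days, usersPerArmPerDay, rate, seed }) {
  const rand = mulberry32(seed);
  let fixedHits = 0;
  let peekHits = 0;
  for (let r = 0; r < runs; r++) {
    let convA = 0, convB = 0, n = 0, z = 0;
    let sawSignificance = false;
    for (let d = 0; d < days; d++) {
      for (let u = 0; u < usersPerArmPerDay; u++) {
        if (rand() < rate) convA++;
        if (rand() < rate) convB++;
      }
      n += usersPerArmPerDay;
      z = zScore(convA, convB, n);
      if (Math.abs(z) > 1.96) sawSignificance = true; // a daily peek would stop here
    }
    if (Math.abs(z) > 1.96) fixedHits++;  // read once, at the planned end
    if (sawSignificance) peekHits++;      // "win" declared at any daily look
  }
  return { fixed: fixedHits / runs, peeking: peekHits / runs };
}

const peekResult = simulate({ runs: 1000, days: 14, usersPerArmPerDay: 100, rate: 0.1, seed: 42 });
console.log(peekResult);
```

Since neither arm is actually better, every "win" here is a false positive. The peeking rate can never be lower than the fixed-horizon rate because the final look is one of the peeks.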
No guardrail metrics
What goes wrong: You ship a change that lifts trial signups by 12% — but it also drops paid conversion by 8% (free users who can't convert later). Net revenue is negative. Discovered 90 days later when MRR shifts. Lost revenue: $5K-50K depending on scale.
How to avoid: Define 2-4 guardrail metrics for every experiment. "Activation Rate," "Paid Conversion," "Page Load Time," "Total Revenue." Ship only when primary lifts AND no guardrail regresses materially.
Multiple comparisons without correction
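What goes wrong: You run 20 experiments simultaneously without alpha correction. At p < 0.05 each, you should expect about 1 "winner" by pure chance even if none of the changes does anything, and there is roughly a 64% chance of at least one. You ship 5 winners and cannot tell which of them are random noise.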
How to avoid: For 10+ concurrent experiments, apply Bonferroni or Benjamini-Hochberg correction. Amplitude's Growth plan supports false-discovery-rate adjustments — enable it.
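Both corrections are short enough to apply by hand if you export the p-values. The sketch below implements the standard Bonferroni and Benjamini-Hochberg procedures. It is independent of whatever adjustment Amplitude applies internally, and the p-values are made up for illustration.

```javascript
// Bonferroni: compare every p-value against alpha / m.
function bonferroni(pValues, alpha) {
  const cutoff = alpha / pValues.length;
  return pValues.map((p) => p <= cutoff);
}

// Benjamini-Hochberg: find the largest rank k with p(k) <= (k / m) * q,
// then declare the k smallest p-values significant.
function benjaminiHochberg(pValues, q) {
  const m = pValues.length;
  const order = pValues.map((p, i) => ({ p, i })).sort((a, b) => a.p - b.p);
  let k = 0;
  order.forEach(({ p }, idx) => {
    if (p <= ((idx + 1) / m) * q) k = idx + 1;
  });
  const significant = new Array(m).fill(false);
  for (let j = 0; j < k; j++) significant[order[j].i] = true;
  return significant;
}

const countTrue = (flags) => flags.filter(Boolean).length;

// Made-up p-values from ten concurrent experiments.
const pValues = [0.001, 0.008, 0.012, 0.041, 0.049, 0.06, 0.2, 0.35, 0.5, 0.8];

console.log("uncorrected winners:", pValues.filter((p) => p < 0.05).length);
console.log("Bonferroni winners:", countTrue(bonferroni(pValues, 0.05)));
console.log("Benjamini-Hochberg winners:", countTrue(benjaminiHochberg(pValues, 0.05)));
```

Bonferroni controls the chance of any false winner and is the stricter of the two. Benjamini-Hochberg controls the expected share of false winners among those you ship, which usually suits a portfolio of product experiments better.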
Flag not removed after experiment ends
What goes wrong: After a successful experiment, the if/else stays in code. Two years later, 40+ stale experiment flags clutter the codebase. New devs can't tell what's live vs experimental. Tech debt + onboarding friction. Eng cost: $20K-80K to clean up.
How to avoid: Set a calendar reminder 60 days post-experiment-end to remove the flag. Make it part of your eng team's definition of done.
Different exposure rules between code and Amplitude
What goes wrong: Code branches on `variant()` but exposure tracking is off, or vice versa. Amplitude thinks 100K users saw treatment but actually only 30K did (or 300K did). All analysis is wrong by 3x. Recommendations based on this data send the team in the wrong direction for months.
How to avoid: Always call `experiment.variant()` from the SAME place that branches UI. Verify exposure fires for a test user via User Lookup before ramping the experiment.
Hand it off
Running experiments rigorously is a craft. Most product teams who don't have a dedicated experimentation specialist ship false positives at 20-40% — meaning a significant fraction of their "winning" changes don't actually win. A specialist designs the test, monitors execution, and reads results without motivated reasoning. Typically $500-1,500/mo at $14-16/hr.
Yes, with extra wiring. You can fire `[Experiment] $exposure` events manually from any flag system, then read results in Amplitude. The integration is cleaner with Amplitude Experiment SDK, but if LaunchDarkly is already core to your eng workflow, integration is supported.
Long enough to hit calculated sample size, AND at least 1-2 full business cycles (so you capture weekday/weekend variation). Minimum 14 days regardless of sample. Typical experiment runs 21-42 days.
For 10% MDE (detecting a 10% relative lift), you need ~7K-15K users per arm depending on baseline. For 5% MDE, multiply by 4x. For 20% MDE (detecting a 20% relative lift), divide by 4. Lower baseline = more sample needed.
Amplitude has iOS + Android Experiment SDKs. The challenge is app-store update cycles — variant code ships in the next app version. Use server-driven variants where possible. Plan experiments around release windows.
Technically yes, legally + ethically gray. Price-discrimination experiments raise fairness concerns; some jurisdictions have laws. Most teams test pricing PRESENTATION (page copy, order of tiers) rather than actual price points. Get legal review before testing price values.