How to set up Amplitude Experiments for A/B testing

A/B testing inside Amplitude pairs experiments with the analytics that measure them. But running an experiment correctly is harder than launching one. Here's the framework that keeps you out of false-positive trouble.

~4-6 hr · Advanced · Updated May 26, 2026

Who this is for

Product, growth, and engineering teams who want to ship feature changes with confidence. Especially relevant for SaaS teams with 5K+ monthly active users where statistical significance is achievable in reasonable timeframes.

What you'll need

Amplitude Growth plan or higher (Experiments is not on Starter or Plus)
Active event tracking with clean taxonomy
A primary success metric defined as a specific event (e.g., "Subscription Started")
At least 5,000 monthly active users on the surface where you'll run the test
Engineering capacity to wire up flag-based code branches
About 4-6 hours from experiment design through first launch

Step 1

Confirm your plan + install the Experiment SDK

Amplitude Experiments is on Growth plan and above. Install `@amplitude/experiment-js-client` (web) or the platform-specific SDK alongside your analytics SDK.

Confirm you are on the Growth plan at Organization Settings → Billing. Experiments is NOT on Starter or Plus. Plans and pricing change, so verify the current tiers before you start.

Install the Experiment SDK: `npm install @amplitude/experiment-js-client` for web. For mobile, use the iOS/Android Experiment SDKs.

Initialize alongside your analytics SDK: `const experiment = Experiment.initialize(DEPLOYMENT_KEY)`. The deployment key is different from your analytics API key — find it at Experiment → Deployments.

Call `experiment.fetch(user)` on app load to fetch the user's variant assignments. Pass the same `user_id` + user properties you use for analytics — Experiment uses the same identity model.

In your feature code: `const variant = experiment.variant("my-experiment-flag")` returns a variant object whose `value` holds the assigned variant ("control" / "treatment_a" / "treatment_b"). Branch your UI on `variant.value`.
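The fetch-then-read order above can be sketched as a small helper. This is a minimal sketch, not the SDK's own code: `ExperimentClientLike` and `loadVariantValue` are hypothetical names, and the interface only mirrors the `fetch` and `variant` calls this step describes, on the assumption that the real client satisfies that shape.

```typescript
// Structural stand-ins for the SDK calls named in this step. The real client
// from @amplitude/experiment-js-client is assumed to satisfy this shape.
interface ExperimentUser {
  user_id?: string;
  device_id?: string;
  user_properties?: Record<string, unknown>;
}

interface Variant {
  value?: string;
}

interface ExperimentClientLike {
  fetch(user: ExperimentUser): Promise<unknown>;
  variant(flagKey: string): Variant;
}

// Hypothetical helper: load assignments once, then read one flag's value.
async function loadVariantValue(
  client: ExperimentClientLike,
  user: ExperimentUser,
  flagKey: string,
): Promise<string> {
  // Assignments must be loaded before variant() is read.
  await client.fetch(user);
  // variant() returns an object; the assigned arm is in .value.
  return client.variant(flagKey).value ?? "control";
}

// Wiring with the real SDK would look roughly like this (not executed here):
//   import { Experiment } from "@amplitude/experiment-js-client";
//   const experiment = Experiment.initialize(DEPLOYMENT_KEY);
//   const value = await loadVariantValue(
//     experiment,
//     { user_id: "u-123", user_properties: { plan: "free" } },
//     "my-experiment-flag",
//   );
```

Pass the same `user_id` and user properties you send to analytics, so the assignment and the metrics refer to the same person.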

Step 2

Design the experiment in Amplitude → Experiment

In Amplitude → Experiment → "Create New," define the hypothesis, variants, primary + secondary metrics, and audience.

Open Amplitude → Experiment → "Create New Experiment."

Hypothesis: write one sentence. "If we remove the credit-card requirement from free trial, then trial signups will increase by 15-30%, because the friction is the dominant cause of drop-off." Make the size of effect explicit.

Variants: control + 1-3 treatments. More variants = more sample needed per arm. Default to A/B (one treatment) unless you have a clear reason to test multiple.

Primary metric: ONE event. "Subscription Started." If you can't pick one primary, your hypothesis isn't sharp enough.

Secondary/guardrail metrics: 2-4 events that should NOT regress. "Total Revenue," "Page Load Time," "Activation Rate." If a guardrail regresses, ship the change only when the primary-metric improvement is large enough to clearly outweigh it.

Audience: which users see the test? Filter by cohort, property, or event-based eligibility. Avoid exposing 100% of users initially. Start with about 50% of eligible traffic and keep the rest out of the test as a holdout for comparison.
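One way to make the design rules above hard to skip is to write the plan down as data and check it before anyone opens the Experiment UI. The sketch below is illustrative only: `ExperimentPlan` and `validatePlan` are hypothetical names, not Amplitude objects, and "Trial Started" is an assumed event name for the trial-signup hypothesis.

```typescript
// Hypothetical planning record; this is not an Amplitude API object.
interface ExperimentPlan {
  hypothesis: string;
  expectedRelativeLift: { min: number; max: number }; // 0.15 means +15%
  variants: string[];         // first entry is the control
  primaryMetric: string;      // exactly one event name
  guardrailMetrics: string[]; // events that must not regress
  trafficAllocation: number;  // share of eligible users in the test, 0..1
}

// Returns a list of problems; an empty list means the plan follows the rules.
function validatePlan(plan: ExperimentPlan): string[] {
  const problems: string[] = [];
  const treatments = plan.variants.length - 1;
  if (plan.variants[0] !== "control") problems.push("first variant should be the control");
  if (treatments < 1 || treatments > 3) problems.push("use 1-3 treatments");
  if (plan.primaryMetric.trim() === "") problems.push("pick exactly one primary metric");
  if (plan.guardrailMetrics.length < 2 || plan.guardrailMetrics.length > 4) {
    problems.push("define 2-4 guardrail metrics");
  }
  if (plan.guardrailMetrics.indexOf(plan.primaryMetric) !== -1) {
    problems.push("primary metric cannot also be a guardrail");
  }
  if (!(plan.expectedRelativeLift.min > 0)) problems.push("state the expected effect size");
  if (plan.trafficAllocation <= 0 || plan.trafficAllocation >= 1) {
    problems.push("start below 100% of users");
  }
  return problems;
}

const trialPlan: ExperimentPlan = {
  hypothesis:
    "If we remove the credit-card requirement from free trial, trial signups will increase by 15-30%.",
  expectedRelativeLift: { min: 0.15, max: 0.3 },
  variants: ["control", "treatment"],
  primaryMetric: "Trial Started", // assumed event name
  guardrailMetrics: ["Total Revenue", "Page Load Time", "Activation Rate"],
  trafficAllocation: 0.5,
};
```

Calling `validatePlan(trialPlan)` returns an empty list. A plan with five variants or no guardrails returns the rules it breaks, which is a cheap review gate before launch.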

Step 3

Calculate sample size before launching

Use Amplitude's built-in sample-size calculator or an external tool. Most teams under-power their experiments by 3-10x and read false signals.

In the Experiment setup, Amplitude shows estimated detectable effect at a given sample size and duration. Set: baseline conversion (your current rate), minimum detectable effect (MDE — usually 5-15% relative), significance level (0.05), statistical power (0.80).

Example: baseline 10% conversion, MDE 15% relative (so detecting an improvement to 11.5%), 0.05 significance, 0.80 power → roughly 6,700 users per variant under the standard two-proportion approximation. Plan for more if you want a safety margin.

If your monthly user volume is 3,000, a two-arm test needs about 13,400 users, which is 4-5 months of traffic. That's the math; respect it.

Smaller MDE (e.g., detecting 5% relative improvement) requires 4-10x more sample. Don't try to detect small effects without massive scale.

If sample size is impossible at your traffic, the experiment isn't worth running. Either ship the change and measure pre/post (less rigorous but actionable), or wait until you have scale.
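To sanity-check whatever a calculator tells you, the textbook two-proportion approximation is short enough to run yourself. The sketch below is that generic formula, not Amplitude's calculator; `sampleSizePerArm` and `monthsToReachSample` are hypothetical helper names. For the worked example (10% baseline, 15% relative MDE) it gives roughly 6,700 users per arm; calculators that add margin or use a different test can report larger numbers.

```typescript
// z-scores for two-sided significance 0.05 and power 0.80.
const Z_ALPHA = 1.959964;
const Z_BETA = 0.841621;

// Users needed per arm to detect a relative lift on a conversion rate,
// using the normal approximation for comparing two proportions.
function sampleSizePerArm(
  baseline: number,
  relativeMde: number,
  zAlpha: number = Z_ALPHA,
  zBeta: number = Z_BETA,
): number {
  const p1 = baseline;
  const p2 = baseline * (1 + relativeMde);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const delta = p2 - p1;
  const zSum = zAlpha + zBeta;
  return Math.ceil((zSum * zSum * variance) / (delta * delta));
}

// Whole months of traffic needed if every eligible user enters the test.
function monthsToReachSample(perArm: number, arms: number, monthlyUsers: number): number {
  return Math.ceil((perArm * arms) / monthlyUsers);
}

// Worked example: 10% baseline, 15% relative MDE, two arms, 3,000 users a month.
const perArm = sampleSizePerArm(0.1, 0.15); // roughly 6,700 per arm
const months = monthsToReachSample(perArm, 2, 3000);
```

Shrinking the MDE is what gets expensive: required sample grows with the inverse square of the absolute difference, so going from a 15% to a 5% relative MDE multiplies the sample by roughly nine.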

Step 4

Ship the code changes behind the flag

Engineering wraps the variant logic around the feature change. Variant assignment must happen BEFORE any UI rendering — otherwise users see flicker.

In the code path that handles the feature change, branch on `experiment.variant("flag-name")`. Example: `if (variant.value === "treatment") { showNewOnboarding() } else { showOldOnboarding() }`.

Critical: call `experiment.fetch(user)` and AWAIT it before rendering the experimental UI. Without await, users see the control then flicker to treatment — invalidates the test.

For server-side rendering (Next.js, etc.), fetch variants server-side and pass to client to avoid hydration mismatch. The Experiment Node SDK has a server-side method.

Wrap the variant fetch in a try/catch with a control fallback. If the Experiment service is slow/down, users get the control experience — never an error.

Test in staging: force-assign yourself to each variant via Experiment → Deployments → "Assign User to Variant." Verify the UI behaves correctly in each.
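The await-plus-fallback pattern above can be sketched as one function that never throws and never blocks rendering indefinitely. This is a minimal sketch under assumptions: `withTimeout` and `variantOrControl` are hypothetical helpers, the 1,500 ms default is an arbitrary illustration, and `ExperimentClientLike` only mirrors the `fetch` and `variant` calls used in this tutorial.

```typescript
interface ExperimentUser {
  user_id?: string;
  device_id?: string;
}

interface Variant {
  value?: string;
}

// Assumed shape of the client calls used in this tutorial.
interface ExperimentClientLike {
  fetch(user: ExperimentUser): Promise<unknown>;
  variant(flagKey: string): Variant;
}

// Rejects if `work` has not settled within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("experiment fetch timed out")), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (error) => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}

// Await assignments before rendering; on any failure, serve the control.
async function variantOrControl(
  client: ExperimentClientLike,
  user: ExperimentUser,
  flagKey: string,
  timeoutMs: number = 1500, // illustrative budget, tune for your app
): Promise<string> {
  try {
    await withTimeout(client.fetch(user), timeoutMs);
    return client.variant(flagKey).value ?? "control";
  } catch {
    return "control"; // slow or failing service: control experience, never an error
  }
}
```

Render the experimental UI only after `variantOrControl` resolves, for example `if ((await variantOrControl(experiment, user, "flag-name")) === "treatment") { showNewOnboarding() } else { showOldOnboarding() }`.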

Step 5

Track Exposure events automatically

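Amplitude's Experiment SDK fires an "$exposure" event automatically when your code reads a user's variant, provided the Experiment SDK is connected to your analytics SDK (check the SDK docs for how your version wires this up). This is what powers the analysis; without it, the experiment has no data.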

Every time `experiment.variant("flag")` is called for a user, Amplitude fires `[Experiment] $exposure` automatically.

Verify this in User Lookup for a test user: open User Lookup → search test user → look for `[Experiment] $exposure` event with `experiment_id` + `variant` properties.

For analysis to work, the exposure MUST fire before the primary-metric event for that user. If a user is exposed to the variant AFTER converting, they don't count.

Audit by checking: of 100 users assigned to treatment, do at least 90 have an `$exposure` event? If not, your code is branching without calling `variant()`, so fix the integration.

Common bug: the variant is fetched server-side but exposure isn't fired client-side. Always confirm exposure fires from the client where the UI change actually shows.
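The two audits above (exposure coverage, and exposure-before-conversion) are easy to run over an event export. The sketch below assumes a hypothetical flat export shape, `TrackedEvent`, rather than any Amplitude response format, and uses the `$exposure` event type as this tutorial names it.

```typescript
// Hypothetical flat export of events; not an Amplitude API response shape.
interface TrackedEvent {
  userId: string;
  eventType: string;
  timestampMs: number;
}

const EXPOSURE = "$exposure"; // event type as named in this tutorial

// Share of assigned users who have at least one exposure event.
function exposureCoverage(assignedUserIds: string[], events: TrackedEvent[]): number {
  if (assignedUserIds.length === 0) return 0;
  const exposed = new Set<string>();
  for (const e of events) {
    if (e.eventType === EXPOSURE) exposed.add(e.userId);
  }
  const covered = assignedUserIds.filter((id) => exposed.has(id)).length;
  return covered / assignedUserIds.length;
}

// True when the user has a primary-metric event at or after their first exposure.
function convertedAfterExposure(
  userId: string,
  events: TrackedEvent[],
  primaryMetric: string,
): boolean {
  const mine = events.filter((e) => e.userId === userId);
  const exposures = mine.filter((e) => e.eventType === EXPOSURE).map((e) => e.timestampMs);
  if (exposures.length === 0) return false;
  const firstExposure = Math.min(...exposures);
  return mine.some((e) => e.eventType === primaryMetric && e.timestampMs >= firstExposure);
}
```

A coverage value below 0.9 for the treatment arm is the signal described above: some code path is branching without calling `variant()`.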

Step 6

Read the results — without lying to yourself

Open Amplitude → Experiment → [your test] → Results. Look at the primary metric, guardrails, and statistical significance. Wait for the planned duration.

Primary metric: Amplitude shows lift (or loss) per variant vs control, with confidence interval and p-value.

Significance: typically Amplitude flags "Statistically Significant" when p < 0.05. But ONLY trust this at the end of the planned duration — interim p-values are noise.

Guardrails: review every secondary metric. If any guardrail regresses significantly, the change probably shouldn't ship even if primary improved.

Heterogeneity: Amplitude shows treatment effect by segment (plan tier, source, device). If treatment works for paid users but hurts free users, you may want to ship to only one segment.

Decision rules to set BEFORE launch: "Ship if primary lifts ≥10% with p < 0.05 AND no guardrail regresses by >2%." Pre-committing avoids motivated reasoning when results are mixed.
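A pre-committed rule is easiest to honor when it is written as code before launch. The sketch below encodes the example rule above (primary lifts ≥10% with p < 0.05, no guardrail regresses by more than 2%) plus the wait-for-planned-duration rule. `ShipRule`, `Readout`, and `decide` are hypothetical names; the numbers you feed in are read manually from the Results page.

```typescript
interface ShipRule {
  minPrimaryLift: number;         // 0.10 means +10% relative
  alpha: number;                  // significance threshold
  maxGuardrailRegression: number; // 0.02 means a 2% relative drop
}

interface GuardrailChange {
  name: string;
  change: number; // relative change, oriented so that negative means worse
}

interface Readout {
  reachedPlannedDuration: boolean;
  primaryLift: number;
  pValue: number;
  guardrails: GuardrailChange[];
}

type Decision = "ship" | "do-not-ship" | "keep-running";

function decide(readout: Readout, rule: ShipRule): Decision {
  // Interim p-values are not trusted, so no call is made before the planned end.
  if (!readout.reachedPlannedDuration) return "keep-running";
  const broken = readout.guardrails.filter((g) => g.change < -rule.maxGuardrailRegression);
  const primaryWins =
    readout.primaryLift >= rule.minPrimaryLift && readout.pValue < rule.alpha;
  return primaryWins && broken.length === 0 ? "ship" : "do-not-ship";
}

// The example rule from this step, fixed before launch.
const rule: ShipRule = { minPrimaryLift: 0.1, alpha: 0.05, maxGuardrailRegression: 0.02 };
```

For metrics where an increase is bad, such as "Page Load Time," flip the sign before passing the change in, so that negative always means worse.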

Step 7

Roll out, document, or retire

After the experiment, do one of three things: roll out to 100% (if win), retire the flag (if no effect), or extend the test (if borderline). Document the result.

If win + significance: in Experiment → Deployments, set the variant to 100% rollout. Then over the next 30 days, remove the experiment code and ship the variant as the new default.

If no effect: retire the flag. Keep the code path as control (the old behavior). Document why the hypothesis didn't work — these "negative results" prevent the same idea being re-tested in 6 months.

If borderline (e.g., 8% lift but p = 0.12): extend the test if you have traffic, or accept that the change is "neutral" and ship based on qualitative reasons.

In Amplitude → Notebooks, create an "Experiments Log" notebook. For each experiment, log: hypothesis, design, results, decision. This becomes your experiment library — analysts in 12 months will thank you.

Be ruthless about retiring flags. Stale flags accumulate in code and become tech debt. Schedule a quarterly cleanup of unused flags.
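If you keep even a simple list of flags and the date each experiment concluded, finding overdue ones is a few lines. The sketch below is illustrative: `FlagRecord` and `overdueFlags` are hypothetical, the flag keys are made up, and the 30-day window matches the removal window suggested in this step.

```typescript
// Hypothetical record kept alongside your experiments log.
interface FlagRecord {
  key: string;
  concludedOn: string | null; // ISO date the experiment ended; null while running
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Keys of flags whose experiment ended more than maxAgeDays before `today`.
function overdueFlags(flags: FlagRecord[], today: Date, maxAgeDays: number): string[] {
  return flags
    .filter(
      (f) =>
        f.concludedOn !== null &&
        (today.getTime() - new Date(f.concludedOn).getTime()) / MS_PER_DAY > maxAgeDays,
    )
    .map((f) => f.key);
}

// Made-up flag keys for illustration.
const flagInventory: FlagRecord[] = [
  { key: "no-credit-card-trial", concludedOn: "2026-01-10" },
  { key: "new-onboarding", concludedOn: "2026-05-10" },
  { key: "pricing-page-copy", concludedOn: null },
];

const overdue = overdueFlags(flagInventory, new Date("2026-05-26"), 30);
```

Here only `no-credit-card-trial` is overdue: it concluded well over 30 days before the check date, while the other two are recent or still running.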

Common mistakes

What goes wrong (and how to avoid it)

Running underpowered experiments
What goes wrong: You run a test on 500 users per arm with 10% baseline conversion. The result "wins" with p = 0.06. You ship it. Three months later, conversion hasn't moved. The "win" was noise. Estimated wasted eng time: 100-300 hours per false positive.
How to avoid: Use Amplitude's sample-size calculator BEFORE launching. Hit the calculated minimum before reading. If your traffic doesn't support it, the experiment isn't worth running.
Peeking at results and stopping early
What goes wrong: Test runs 7 days, looks positive, you stop and declare victory. False positive rate jumps from 5% to 25-30%. Half of your "wins" are random walks. Cumulative bad ship rate over a year is 20-40% — significant product velocity lost.
How to avoid: Pre-commit to a duration based on sample-size math. Don't look at p-values during the run. Use Amplitude's sequential testing (Growth plan feature) if you genuinely need to peek.
No guardrail metrics
What goes wrong: You ship a change that lifts trial signups by 12% — but it also drops paid conversion by 8% (free users who can't convert later). Net revenue is negative. Discovered 90 days later when MRR shifts. Lost revenue: $5K-50K depending on scale.
How to avoid: Define 2-4 guardrail metrics for every experiment. "Activation Rate," "Paid Conversion," "Page Load Time," "Total Revenue." Ship only when primary lifts AND no guardrail regresses materially.
Multiple comparisons without correction
What goes wrong: You run 20 experiments simultaneously without alpha correction. At p < 0.05 each, you should expect about 1 "winner" by pure chance even if none of the changes does anything. If you ship 5 winners, at least one of them is probably noise, and you can't tell which.
How to avoid: For 10+ concurrent experiments, apply Bonferroni or Benjamini-Hochberg correction. Amplitude's Growth plan supports false-discovery-rate adjustments — enable it.
Flag not removed after experiment ends
What goes wrong: After a successful experiment, the if/else stays in code. Two years later, 40+ stale experiment flags clutter the codebase. New devs can't tell what's live vs experimental. Tech debt + onboarding friction. Eng cost: $20K-80K to clean up.
How to avoid: Set a calendar reminder 60 days post-experiment-end to remove the flag. Make it part of your eng team's definition of done.
Different exposure rules between code and Amplitude
What goes wrong: Code branches on `variant()` but exposure tracking is off, or vice versa. Amplitude thinks 100K users saw treatment but actually only 30K did (or 300K did). All analysis is wrong by 3x. Recommendations based on this data send the team in the wrong direction for months.
How to avoid: Always call `experiment.variant()` from the SAME place that branches UI. Verify exposure fires for a test user via User Lookup before ramping the experiment.
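The multiple-comparisons arithmetic above is worth seeing once. The sketch below shows the generic Bonferroni and Benjamini-Hochberg procedures, not Amplitude's implementation; the function names and the ten p-values are hypothetical illustrations.

```typescript
// Expected number of false "winners" when every tested change is truly null.
function expectedFalsePositives(tests: number, alpha: number): number {
  return tests * alpha;
}

// Bonferroni: each test must clear alpha divided by the number of tests.
function bonferroniAlpha(alpha: number, tests: number): number {
  return alpha / tests;
}

// Benjamini-Hochberg: returns, per input p-value, whether it survives
// false-discovery-rate control at level `fdr`.
function benjaminiHochberg(pValues: number[], fdr: number = 0.05): boolean[] {
  const m = pValues.length;
  const ranked = pValues.map((p, index) => ({ p, index })).sort((a, b) => a.p - b.p);
  // Find the largest rank k with p(k) <= (k / m) * fdr.
  let cutoffRank = 0;
  ranked.forEach((entry, i) => {
    if (entry.p <= ((i + 1) / m) * fdr) cutoffRank = i + 1;
  });
  // Everything at or below that rank is declared significant.
  const significant = pValues.map(() => false);
  ranked.slice(0, cutoffRank).forEach((entry) => {
    significant[entry.index] = true;
  });
  return significant;
}

// Illustrative p-values from ten concurrent experiments (made up).
const pValues = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216];
const naiveWinners = pValues.filter((p) => p < 0.05).length;
const bhWinners = benjaminiHochberg(pValues).filter(Boolean).length;
```

With these made-up numbers, the naive p < 0.05 rule declares five winners while Benjamini-Hochberg keeps two. Bonferroni is stricter still: with 20 tests, each one has to clear p < 0.0025.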

Recap

What to take away

Amplitude Experiments is Growth-plan-and-above. Confirm plan + install Experiment SDK separate from analytics.
Calculate sample size BEFORE launching. Most teams under-power by 3-10x.
Define primary + 2-4 guardrail metrics. Pre-commit to ship/no-ship rules.
Don't peek. Wait for planned duration. Peeking inflates false-positive rate to 25-30%.
After the experiment: roll out, retire, or extend. Document every outcome.

Done — what's next

How to set up Amplitude event tracking the right way


Hand it off

Running experiments rigorously is a craft. Most product teams without a dedicated experimentation specialist ship false positives at a rate of 20-40%, meaning a significant fraction of their "winning" changes don't actually win. A specialist designs the test, monitors execution, and reads results without motivated reasoning. Typically $500-1,500/mo at $14-16/hr.


Frequently Asked Questions

Can I use Amplitude Experiments with my existing feature-flag tool (LaunchDarkly, etc.)?

Yes, with extra wiring. You can fire `[Experiment] $exposure` events manually from any flag system, then read results in Amplitude. The integration is cleaner with Amplitude Experiment SDK, but if LaunchDarkly is already core to your eng workflow, integration is supported.
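The extra wiring can be as small as a wrapper around your analytics `track` call. The sketch below is an assumption-heavy illustration: `TrackFn` and `makeExposureTracker` are hypothetical, and the `$exposure` event type with `flag_key` and `variant` properties should be verified against Amplitude's current exposure-tracking documentation before you rely on it.

```typescript
// Any analytics track function with this shape will do (hypothetical type).
type TrackFn = (eventType: string, eventProperties: Record<string, string>) => void;

// Returns a function that reports an exposure once per flag/variant pair,
// so repeated renders do not flood analytics with duplicate events.
function makeExposureTracker(track: TrackFn): (flagKey: string, variant: string) => void {
  const seen = new Set<string>();
  return (flagKey: string, variant: string): void => {
    const dedupeKey = flagKey + ":" + variant;
    if (seen.has(dedupeKey)) return;
    seen.add(dedupeKey);
    // Assumed event type and property names; confirm against Amplitude's docs.
    track("$exposure", { flag_key: flagKey, variant: variant });
  };
}
```

Call the returned function at the exact point where your existing flag tool's value changes what the user sees, passing the flag key and the variant that tool served.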

How long should an experiment run?

Long enough to hit calculated sample size, AND at least 1-2 full business cycles (so you capture weekday/weekend variation). Minimum 14 days regardless of sample. Typical experiment runs 21-42 days.
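That rule of thumb combines three constraints: enough days to reach the sample, a 14-day floor, and whole weeks so weekday/weekend variation is covered evenly. A minimal sketch, where `plannedDurationDays` is a hypothetical helper and rounding up to full weeks is one reasonable reading of "full business cycles":

```typescript
// Days to run: enough to hit the sample, never under minDays, rounded up to whole weeks.
function plannedDurationDays(
  requiredUsersTotal: number,
  dailyEligibleUsers: number,
  minDays: number = 14,
): number {
  const daysForSample = Math.ceil(requiredUsersTotal / dailyEligibleUsers);
  const days = Math.max(daysForSample, minDays);
  return Math.ceil(days / 7) * 7;
}

// 13,400 users needed at 500 eligible users a day: 27 days, rounded up to 28.
const slowSite = plannedDurationDays(13400, 500);
// 20,000 users needed at 2,000 a day: 10 days of traffic, but the 14-day floor applies.
const fastSite = plannedDurationDays(20000, 2000);
```

Commit to the resulting number before launch and read results only once it has elapsed.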

What's a good baseline conversion rate to detect changes?

For 10% MDE (detecting a 10% relative lift), you need ~7K-15K users per arm depending on baseline. For 5% MDE, multiply by 4x. For 20% MDE (detecting a 20% relative lift), divide by 4. Lower baseline = more sample needed.

How do I run experiments on mobile apps?

Amplitude has iOS + Android Experiment SDKs. The challenge is app-store update cycles — variant code ships in the next app version. Use server-driven variants where possible. Plan experiments around release windows.

Can I A/B test pricing?

Technically yes, legally + ethically gray. Price-discrimination experiments raise fairness concerns; some jurisdictions have laws. Most teams test pricing PRESENTATION (page copy, order of tiers) rather than actual price points. Get legal review before testing price values.


Related tutorials

How to set up Amplitude event tracking the right way

How to build Amplitude funnels that actually answer business questions

How to use Amplitude Pathfinder for user journey discovery

When to hire an Amplitude specialist — an honest checklist
