Mixpanel Experiments is the A/B test layer over your existing event tracking. Done right it's the fastest way to ship product changes with conviction. Done wrong it's the fastest way to ship the wrong change at scale.
Who this is for: PMs and growth engineers running A/B tests on web or mobile product changes. If you've been emailing engineering to 'set up an experiment' and the rollout takes weeks, this is the in-house path.
Step 1
Hypothesis, primary metric, secondary metrics, success threshold, expected effect size, required sample size. All written down before code is touched.
Open a doc. Write: 'Hypothesis: changing [X] from [A] to [B] will improve [primary metric] by at least [Z%].'
Pick ONE primary metric. The one number that decides win/lose. Examples: signup_completed conversion, trial_started rate, purchase_completed rate. Tracking 5 'primary' metrics is tracking zero primary metrics.
List 2-3 secondary metrics for safety (no regression). Examples: retention Day-7, average session duration, bounce rate. If these tank while primary improves, you need to talk.
Decide the minimum detectable effect (MDE) — the smallest meaningful improvement. Usually 5-10% relative for primary metrics. 1-2% is hard to detect without huge sample sizes.
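Calculate required sample size. The number depends heavily on your baseline conversion rate: at 80% power and 95% confidence, a 5% relative MDE needs roughly 10,000 users per variant at a ~40% baseline, but closer to 58,000 per variant at a 10% baseline. Use an online calculator (Evan Miller's, Statsig's) or Mixpanel's built-in.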
Define the decision rule: 'If primary metric improves >5% with p<0.05 and no secondary metric drops >3%, we ship. Otherwise we don't.' Get sign-off from PM + Eng + Design before launch.
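The sample-size step above can be sanity-checked with the standard two-proportion z-test approximation. The sketch below is a rough planning aid, not Mixpanel's calculator: the function name `sampleSizePerVariant` and its parameters are illustrative, and the default z-values assume a two-sided 95% confidence level and 80% power.

```javascript
// Approximate users needed per variant to detect a relative lift in a
// conversion rate (two-sided two-proportion z-test, normal approximation).
//   zAlpha = 1.96   -> 95% confidence, two-sided
//   zBeta  = 0.8416 -> 80% power
function sampleSizePerVariant(baselineRate, relativeMde, zAlpha = 1.96, zBeta = 0.8416) {
  if (baselineRate <= 0 || baselineRate >= 1) {
    throw new RangeError('baselineRate must be strictly between 0 and 1');
  }
  if (relativeMde <= 0) {
    throw new RangeError('relativeMde must be positive');
  }
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeMde); // rate we hope the variant reaches
  if (p2 >= 1) {
    throw new RangeError('baselineRate * (1 + relativeMde) must stay below 1');
  }
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const delta = p2 - p1;
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (delta * delta));
}

// Hypothetical planning inputs: a 10% baseline conversion rate.
console.log(sampleSizePerVariant(0.10, 0.05)); // 5% relative MDE: roughly 58,000
console.log(sampleSizePerVariant(0.10, 0.10)); // 10% relative MDE: roughly 15,000
```

Halving the MDE roughly quadruples the required sample, and a lower baseline rate pushes it up further, which is why the brief should fix both numbers before anyone estimates a timeline.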
Step 2
Mixpanel Experiments integrates with LaunchDarkly, Statsig, Optimizely, and others. The feature flag assigns variants; Mixpanel measures the impact.
In the Mixpanel left sidebar: Experiments → +New Experiment. In some UI versions it sits under the Reports menu instead.
Connect a feature-flag source: Data Management → Integrations → pick LaunchDarkly (or Statsig, or whichever you use). One-time auth.
Pick the feature flag that will drive this experiment. Mixpanel auto-detects variants (control, variant_a, variant_b).
Map each variant to its definition (the actual product change). Add a description of what the variant does, so future PMs know what they're looking at.
If you don't have an external feature-flag tool, Mixpanel can do basic flag assignment via the JS SDK — but a dedicated flag tool is recommended for any production-grade testing.
Step 3
For Mixpanel to attribute outcomes correctly, every user must be EXPOSED to a variant — and that exposure must fire as a Mixpanel event.
When a user encounters the experiment surface (loads the page, opens the feature), fire an exposure event: `mixpanel.track('$experiment_started', { 'Experiment name': 'pricing_redesign_v2', 'Variant name': 'variant_a' });`
Mixpanel's $experiment_started is the canonical exposure event. Conversions are then attributed to users who fired this event in the time window.
Critical: exposure must fire on every user who SEES the variant, not just users who CONVERT. Otherwise you're measuring 'of users who converted, how many were in variant_a' instead of 'of users in variant_a, what was the conversion rate'.
Fire exposure as early as possible in the user journey — usually on page load or feature-surface render. Late exposure (right before conversion) skews results toward users who would have converted anyway.
Server-side rendering: if the page is SSR'd with the variant decision, fire exposure server-side via the Server SDK with the user's distinct_id. Don't wait for client hydration.
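A minimal sketch of the exposure step, assuming a Mixpanel-style `track(eventName, properties)` function is passed in. The helper name `createExposureTracker` and the in-memory de-duplication are illustrative choices, not part of any SDK. The property names `Experiment name` and `Variant name` are the ones Mixpanel's Experiments documentation describes for `$experiment_started`; confirm them against your project before relying on this.

```javascript
// Wraps any track(eventName, props) function so exposure fires once per
// user per experiment within this page session or server process.
function createExposureTracker(track) {
  const alreadyExposed = new Set();
  return function trackExposure(distinctId, experimentName, variantName) {
    const key = `${distinctId}::${experimentName}`;
    if (alreadyExposed.has(key)) {
      return false; // already recorded; skip the duplicate
    }
    alreadyExposed.add(key);
    track('$experiment_started', {
      distinct_id: distinctId, // server-side SDKs need this; browser SDKs set it themselves
      'Experiment name': experimentName,
      'Variant name': variantName,
    });
    return true;
  };
}

// Stand-in for the real SDK so the sketch runs anywhere.
const sentEvents = [];
const fakeTrack = (eventName, props) => sentEvents.push({ eventName, props });

const trackExposure = createExposureTracker(fakeTrack);
trackExposure('user_123', 'pricing_redesign_v2', 'variant_a'); // first render: fires
trackExposure('user_123', 'pricing_redesign_v2', 'variant_a'); // re-render: ignored
console.log(sentEvents.length); // 1
```

In a real app, `fakeTrack` would be replaced by a thin wrapper around your SDK's own `track` call. Call `trackExposure` where the variant is rendered, not where the conversion happens, so every user who sees the variant is counted.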
Step 4
In the experiment config, pick the primary metric (conversion event), secondary metrics, confidence level, and the variants to compare.
Primary Metric: pick the event you wrote in the brief (e.g., purchase_completed). Mixpanel will measure conversion rate per variant.
Secondary Metrics: add 2-3 from your brief. Mixpanel calculates these alongside and flags any regression.
Confidence Level: default 95% (p<0.05). Don't lower this to 'see results faster' — you'll ship false-positive wins.
Statistical Test: Mixpanel uses sequential testing (Bayesian or frequentist depending on plan). Sequential tests handle peeking better than fixed-horizon tests. Read which one your project is configured for.
Minimum sample size warning: Mixpanel shows a 'Power' indicator. Until you have enough users, the test is underpowered and any apparent win or loss is unreliable. Don't ship based on it.
Step 5
Launch the experiment. Set a calendar reminder for the planned end date. Don't ship based on results before that date unless there's a critical safety regression.
Click Launch (or in your feature-flag tool, enable rollout). Variants are now being assigned.
Set a calendar reminder for the planned end date (usually 2-4 weeks after launch, based on sample-size math).
Monitor SAFETY only during the test: secondary metrics, error rates, support ticket volume. If a variant breaks the product, kill the test.
Do NOT peek at primary metric results and ship early. Sequential testing helps but isn't immune. If you check results every day and ship on the first 'significant' day, your false-positive rate climbs well above the nominal 1-in-20.
On the planned end date, evaluate against the decision rule. If primary metric improved beyond threshold with sufficient confidence AND no secondary regression: ship. Otherwise: don't.
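Writing the decision rule as code before launch makes it harder to renegotiate after the results arrive. The sketch below encodes the example rule from Step 1 (lift above 5%, p below 0.05, no secondary metric down more than 3%). The function name, field names, and metric names are hypothetical, and the inputs are numbers you would copy from the experiment report by hand.

```javascript
// Thresholds taken from the example decision rule in the brief.
const DEFAULT_RULE = { minRelativeLift: 0.05, alpha: 0.05, maxSecondaryDrop: 0.03 };

// result.primaryLift:      relative lift of the primary metric (0.08 = +8%)
// result.pValue:           p-value reported for the primary metric
// result.secondaryChanges: relative change per secondary metric, signed so
//                          that a negative number means the metric got worse
function evaluateDecision(result, rule = DEFAULT_RULE) {
  const reasons = [];
  if (!(result.primaryLift > rule.minRelativeLift)) {
    reasons.push('primary lift is not above the threshold');
  }
  if (!(result.pValue < rule.alpha)) {
    reasons.push('primary result is not statistically significant');
  }
  for (const [metric, change] of Object.entries(result.secondaryChanges || {})) {
    if (change < -rule.maxSecondaryDrop) {
      reasons.push(`${metric} regressed beyond the allowed drop`);
    }
  }
  return { ship: reasons.length === 0, reasons };
}

// Hypothetical end-of-test numbers.
const decision = evaluateDecision({
  primaryLift: 0.08,
  pValue: 0.01,
  secondaryChanges: { day7_retention: -0.01, avg_session_duration: 0.02 },
});
console.log(decision.ship); // true
```

Returning the list of reasons alongside the verdict gives the post-mortem a ready-made explanation when a test does not ship.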
Step 6
After the test ends, open the experiment report. Document the result, the lessons, and the next test in a public doc.
Open the experiment in Mixpanel. Note the primary metric result (lift % + confidence interval), secondary metric results, and sample size per variant.
Check for sample ratio mismatch (SRM): variants should have ~50/50 traffic if you split evenly. If one variant has significantly fewer users, the assignment is broken and the test is invalid.
Check segment-level lift. Sometimes a test shows 0% overall lift but +15% on new users and -15% on existing users. Segment-level insights drive future iterations.
Write the post-mortem in a shared doc: hypothesis, result, decision, lessons learned, next test idea. Even losing tests have value — document them.
Archive the experiment in Mixpanel after writing the post-mortem. Tag tests with consistent naming so you can find them later (e.g., 'pricing_redesign_v2' archived → searchable).
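The SRM check in this step is usually done with a chi-square goodness-of-fit test rather than by eyeballing the split. The sketch below is an independent check you can run on exported counts, not a description of Mixpanel's own detector. The function name is illustrative, and the p < 0.001 alert threshold is a common convention assumed here.

```javascript
// Chi-square critical values at p = 0.001, keyed by degrees of freedom
// (number of variants minus one).
const CHI_SQUARE_CRITICAL_P001 = { 1: 10.828, 2: 13.816, 3: 16.266 };

// observedCounts: users actually exposed per variant
// expectedShares: intended traffic split, e.g. [0.5, 0.5]
function checkSrm(observedCounts, expectedShares) {
  if (observedCounts.length !== expectedShares.length) {
    throw new RangeError('counts and shares must have the same length');
  }
  const critical = CHI_SQUARE_CRITICAL_P001[observedCounts.length - 1];
  if (critical === undefined) {
    throw new RangeError('this sketch supports 2 to 4 variants');
  }
  const total = observedCounts.reduce((sum, n) => sum + n, 0);
  const statistic = observedCounts.reduce((sum, observed, i) => {
    const expected = total * expectedShares[i];
    return sum + (observed - expected) ** 2 / expected;
  }, 0);
  return { statistic, mismatch: statistic > critical };
}

// Illustrative counts for an intended 50/50 split.
console.log(checkSrm([5050, 4950], [0.5, 0.5]).mismatch); // false
console.log(checkSrm([8200, 5400], [0.5, 0.5]).mismatch); // true
```

A flagged mismatch says the assignment or exposure pipeline is broken; it does not say which variant is at fault, so the next step is debugging, not analysis.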
Step 7
One-off tests waste setup overhead. Build a cadence: 1-2 tests per week, sprint-aligned, with a shared backlog.
Maintain a public experiment backlog (Notion, Linear, GitHub Project). Each entry: hypothesis, brief link, status (proposed → ready → running → done).
Triage backlog weekly. Each test should have a written brief before it gets prioritized.
Run 1-2 tests in parallel on different surfaces. Running 5+ overlapping tests creates interaction effects that invalidate results.
Hold a weekly experiment-review meeting. Look at running tests, completed tests, and propose new ones.
Track the experiment program's overall win rate. Industry benchmark: 10-25% of tests 'win' (ship). If yours is 80%, your tests aren't ambitious enough; if 0%, your hypotheses are off.
Common mistakes
Peeking at results and shipping early
What goes wrong: You launch a test on Monday. By Thursday it's 'significant'. You ship. Two weeks later you realize the win was noise — the metric reverts to baseline. You've shipped a change with no real impact and you've poisoned future test results by introducing a confound. Repeat this 10 times and your roadmap is full of changes that did nothing.
How to avoid: Plan the end date based on sample-size math. Use sequential testing if available, and follow its stopping rule: shipping the first day a number looks 'significant' without that rule pushes the false-positive rate well above 5%. Discipline matters more than the statistical method.
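To see why peek-and-ship is dangerous, it helps to simulate A/A tests, where both variants are identical and every 'win' is a false positive. The sketch below uses a plain fixed-horizon z-test checked daily, which is an assumption for illustration and not how Mixpanel's sequential engine works. All names and parameters are hypothetical.

```javascript
// Small seeded PRNG (mulberry32) so the simulation is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two-sided pooled z-test on two conversion counts with n users each.
function isSignificant(convA, convB, n) {
  const pooled = (convA + convB) / (2 * n);
  const se = Math.sqrt((2 * pooled * (1 - pooled)) / n);
  if (se === 0) return false;
  return Math.abs(convA / n - convB / n) / se > 1.96;
}

// Runs A/A tests and reports how often each stopping policy finds a "winner".
function simulatePeeking({ simulations, days, usersPerDay, rate, seed }) {
  const rand = mulberry32(seed);
  let peekWins = 0;
  let fixedWins = 0;
  for (let s = 0; s < simulations; s++) {
    let convA = 0, convB = 0, n = 0;
    let everSignificant = false;
    let significantToday = false;
    for (let d = 0; d < days; d++) {
      for (let u = 0; u < usersPerDay; u++) {
        if (rand() < rate) convA++;
        if (rand() < rate) convB++;
      }
      n += usersPerDay;
      significantToday = isSignificant(convA, convB, n);
      if (significantToday) everSignificant = true; // a daily peeker would ship here
    }
    if (everSignificant) peekWins++;
    if (significantToday) fixedWins++; // only the planned end date counts
  }
  return { peekAndShip: peekWins / simulations, fixedHorizon: fixedWins / simulations };
}

console.log(simulatePeeking({ simulations: 1000, days: 14, usersPerDay: 200, rate: 0.1, seed: 42 }));
```

The fixed-horizon rate should land near the nominal 5%, while the peek-and-ship rate comes out several times higher even though nothing about the product changed.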
Wrong primary metric (or no primary metric)
What goes wrong: Test shows +12% on 'time on page' but -8% on 'signup conversion'. Without a defined primary metric, the team debates which to optimize for. Eventually ships based on whoever argues loudest. A month later signup conversion is genuinely down and revenue is hit.
How to avoid: Lock the primary metric in the experiment brief BEFORE launch. Make it the metric that maps to revenue or core product value. Secondary metrics are safety checks, not decision drivers.
Insufficient sample size — calling significance on noise
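What goes wrong: You run a test with 500 users per variant. P-value hits 0.04. You ship 'the winner'. With 500 users per variant, the confidence interval on the lift is enormous: at a 10% baseline, a result that just reaches p=0.04 means an observed lift of about 40%, with an interval running from barely above zero to roughly +80%. The measured lift is likely an overestimate, and results like this often fail to replicate. You ship something that might be doing nothing.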
How to avoid: Calculate required sample size BEFORE launching. Use 80% power, 95% confidence, and your minimum detectable effect. Don't launch tests that can't reach this in 4 weeks — restructure (bigger surface, higher-traffic page).
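A quick way to see what small samples do is to compute the confidence interval on the lift directly. The sketch below uses a normal approximation on the difference in conversion rates and expresses it relative to the control rate, treating the control rate as fixed. That is a simplification for illustration, and the counts are hypothetical.

```javascript
// Approximate 95% confidence interval on relative lift, from raw counts.
// Uses the unpooled standard error of the difference in proportions.
function liftConfidenceInterval(convControl, nControl, convVariant, nVariant, z = 1.96) {
  const pControl = convControl / nControl;
  const pVariant = convVariant / nVariant;
  const diff = pVariant - pControl;
  const se = Math.sqrt(
    (pControl * (1 - pControl)) / nControl + (pVariant * (1 - pVariant)) / nVariant
  );
  return {
    observed: diff / pControl,
    low: (diff - z * se) / pControl,
    high: (diff + z * se) / pControl,
  };
}

// Same conversion rates (10% vs 14.2%), very different certainty.
const smallTest = liftConfidenceInterval(50, 500, 71, 500);
const largeTest = liftConfidenceInterval(500, 5000, 710, 5000);
console.log(smallTest); // interval spans roughly +2% to +82%
console.log(largeTest); // interval spans roughly +29% to +55%
```

Both tests report the same +42% observed lift, but only the larger one narrows the plausible range enough to plan around.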
Sample Ratio Mismatch (SRM) — assignment is broken
What goes wrong: You expected 50/50 traffic split but Mixpanel shows variant A has 8,200 users and variant B has 5,400. That's not random — something is wrong with the flag assignment (maybe variant B has a JS error causing some users to bounce before exposure fires). Your test is invalid but the report still shows a 'winner'. You ship a change based on broken data.
How to avoid: Check SRM on every test. Variants should be within 1-2% of expected ratio. If not, kill the test, fix the assignment, restart. Mixpanel Experiments has SRM detection built in — heed the warnings.
Running too many concurrent tests
What goes wrong: You launch 5 tests simultaneously on different parts of the signup flow. Users are in multiple variant combinations. Interaction effects skew each individual test. Three tests 'win', two 'lose' — you can't tell whether any individual change actually helped because they all influenced each other.
How to avoid: Run 1-2 non-overlapping tests at a time on the same user journey. If you must run more, ensure variants are on completely separate surfaces (signup test + pricing test + email test = OK; three tests on signup steps = bad).
Not firing $experiment_started exposure
What goes wrong: You assign variants via your feature flag tool but never fire an exposure event in Mixpanel. Mixpanel doesn't know which users were in which variant. The 'experiment report' shows zero data or random noise. You declare the test inconclusive when really you just didn't instrument it.
How to avoid: Fire mixpanel.track("$experiment_started", { "Experiment name": experimentName, "Variant name": variant }) on every user who sees the variant surface. Validate in Live View before launching the test. Treat exposure firing as the most important event in the test.
Hand it off
A proper experimentation program is a force multiplier — teams running 50 tests/year with rigor ship 2-3x more impactful changes than teams running 10 tests/year on vibes. A vetted product analytics specialist can stand up your framework, train your team, and run the first 5 tests with you for $1,500-3,500 over 4-6 weeks at $14-16/hr.
You can run tests with any feature-flag tool + manually building Mixpanel funnels filtered by variant property. Mixpanel Experiments adds the statistical rigor (significance, sequential testing, SRM detection, sample size guidance) that DIY analysis often skips. For >5 tests/year, the Experiments product is worth the cost.
For a 5% relative lift on a 10% conversion rate, you need roughly 58,000 users per variant at 80% power and 95% confidence. For a 10% relative lift on the same baseline, roughly 15,000 per variant. Use Mixpanel's built-in sample-size calculator or external tools like Evan Miller's calculator. The exact number depends on baseline conversion rate and desired MDE.
Minimum 1 full week (to cover weekday/weekend cycles). Usually 2-4 weeks. Long enough to hit sample size AND cover at least one full user cycle (e.g., for B2B SaaS with a weekly usage rhythm, 2 weeks covers two full cycles).
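The duration guidance above reduces to simple arithmetic: divide the total users you need by the users exposed per day, then round up to whole weeks so weekday and weekend traffic are equally represented. A small sketch, with a hypothetical function name and made-up traffic numbers:

```javascript
// Planned test length in days, rounded up to whole weeks (minimum one week).
// requiredPerVariant: output of your sample-size calculation
// variantCount:       number of arms, including control
// dailyExposedUsers:  users who reach the experiment surface per day
function plannedTestDays(requiredPerVariant, variantCount, dailyExposedUsers) {
  if (dailyExposedUsers <= 0) {
    throw new RangeError('dailyExposedUsers must be positive');
  }
  const totalNeeded = requiredPerVariant * variantCount;
  const rawDays = Math.ceil(totalNeeded / dailyExposedUsers);
  const wholeWeeks = Math.max(1, Math.ceil(rawDays / 7));
  return wholeWeeks * 7;
}

console.log(plannedTestDays(15000, 2, 2000)); // 21
console.log(plannedTestDays(4000, 2, 5000));  // 7
```

If the result comes out far beyond four weeks, that is the signal to restructure the test (bigger change, higher-traffic surface) rather than to launch and hope.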
Traditional A/B tests use fixed-horizon statistics — you decide upfront when to stop and only check at the end. Sequential testing (Bayesian or always-valid frequentist) lets you check results during the test without inflating false-positive rate. Mixpanel Experiments uses sequential testing, which is why it's safer to monitor in-flight than DIY analysis.
Optimizely / VWO are dedicated A/B testing platforms — they include the variant-delivery layer (visual editor, traffic allocation). Mixpanel Experiments is the analysis layer — it relies on your existing feature-flag tool for variant delivery. If you have engineering capacity, Mixpanel + LaunchDarkly is more flexible. If you don't, Optimizely is more self-contained.