How to set up PostHog experiments and A/B tests that actually decide things

Running tests is easy. Running tests that produce real decisions is hard. This walks through hypothesis design, sample-size calculation, the PostHog experiment UI, and the 6 common mistakes that invalidate 80% of DIY A/B tests.

~3-5 hr · Advanced · Updated May 26, 2026

Who this is for

Product managers, growth leads, and engineers running A/B tests inside PostHog. Especially relevant if you have run tests before but suspect the conclusions were not statistically valid. Most are not.

What you'll need

PostHog with event tracking and feature flags set up (tutorials #3, #4)
A specific hypothesis to test (NOT "let's try a thing and see")
A primary metric and a baseline conversion rate
Enough traffic — see step 2 for the math
About 3-4 hours for the first experiment, less for subsequent ones

Step 1

Write the hypothesis BEFORE looking at PostHog

A real hypothesis: "If we change X, then Y will increase by Z%, because of mechanism W." Write it on paper. Pre-register it. Stick to it.

Template: "If we [change], then [primary metric] will [direction] by [effect size]% within [duration], because [mechanism]."

Example: "If we add social proof above the pricing CTA, then trial-signup rate will increase by 15% within 14 days, because users are anchoring on whether other users have trusted the product."

Bad hypothesis: "Let's try a green button instead of blue." (No effect size, no mechanism, no primary metric.)

Document the hypothesis in a shared doc with a date. This pre-registration is the difference between a real experiment and post-hoc storytelling.

List the secondary metrics (signup-to-paid conversion, average order value) — but commit upfront that the decision will be made on the primary metric. Secondary metrics give context, not decisions.
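One way to make the pre-registration concrete is to store the hypothesis as a typed record next to the experiment doc. The sketch below is only an illustration: the `Hypothesis` shape, its field names, and the `registeredOn` date are assumptions, not anything PostHog defines. It fills in the template from this step with the social-proof example.

```typescript
// Hypothetical record for a pre-registered hypothesis. Field names are illustrative.
interface Hypothesis {
  change: string;
  primaryMetric: string;
  direction: "increase" | "decrease";
  effectSizePct: number;      // relative change, in percent
  durationDays: number;
  mechanism: string;
  registeredOn: string;       // ISO date the hypothesis was written down
  secondaryMetrics: string[]; // context only, never the decision
}

// Renders the template: "If we [change], then [primary metric] will [direction]
// by [effect size]% within [duration], because [mechanism]."
function renderHypothesis(h: Hypothesis): string {
  return (
    `If we ${h.change}, then ${h.primaryMetric} will ${h.direction} by ` +
    `${h.effectSizePct}% within ${h.durationDays} days, because ${h.mechanism}.`
  );
}

const pricingSocialProof: Hypothesis = {
  change: "add social proof above the pricing CTA",
  primaryMetric: "trial-signup rate",
  direction: "increase",
  effectSizePct: 15,
  durationDays: 14,
  mechanism: "users are anchoring on whether other users have trusted the product",
  registeredOn: "2026-05-01", // assumed date, for illustration only
  secondaryMetrics: ["signup-to-paid conversion", "average order value"],
};

const statement = renderHypothesis(pricingSocialProof);
```

If any field is hard to fill in, the hypothesis is not ready. The green-button example fails on three of them.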

Step 2

Calculate required sample size

Use the PostHog sample-size calculator (built into experiment setup). At a 5% baseline, detecting a 10% relative lift takes roughly 31,000 users per variant, which is about 1,600 conversions each. Smaller lifts need more.

PostHog → Experiments → New experiment → it shows a built-in calculator: enter baseline conversion rate + minimum detectable effect (MDE).

Rule of thumb: detecting a 10% relative lift on a 5% baseline conversion rate requires ~31,000 users per variant (62K total). Detecting a 5% lift requires ~125K per variant.

If you do not have that traffic in a reasonable time (4-6 weeks), pick a larger MDE (test for a 25% lift, not 5%). You will detect only big effects, but at least the test will conclude.

For sub-1,000-DAU products, A/B testing is usually not the right tool. Run qualitative user research instead, or use Bayesian methods (PostHog supports both Frequentist and Bayesian).

NEVER start a test without knowing what sample size you need to detect what effect. Tests that "look directional" are noise.
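To sanity-check the calculator's output, the rule of thumb above can be reproduced with the textbook two-proportion formula. This is a minimal sketch, assuming a two-sided significance level of 0.05 and 80% power. Those defaults and the function name are assumptions, and PostHog's own calculator may use a different method and give somewhat different numbers.

```typescript
// Standard normal quantiles for the assumed defaults.
const Z_ALPHA_TWO_SIDED_05 = 1.959964; // two-sided alpha = 0.05
const Z_POWER_80 = 0.841621;           // power = 0.80

// Users needed per variant to detect a relative lift on a baseline conversion rate:
//   n = (zAlpha + zBeta)^2 * (p1(1 - p1) + p2(1 - p2)) / (p2 - p1)^2
function sampleSizePerVariant(
  baselineRate: number,
  relativeMde: number,
  zAlpha: number = Z_ALPHA_TWO_SIDED_05,
  zBeta: number = Z_POWER_80
): number {
  if (!(baselineRate > 0 && baselineRate < 1)) {
    throw new RangeError("baselineRate must be between 0 and 1");
  }
  if (!(relativeMde > 0)) {
    throw new RangeError("relativeMde must be positive");
  }
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeMde);
  if (p2 >= 1) {
    throw new RangeError("baseline plus lift must stay below 100%");
  }
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const delta = p2 - p1;
  return Math.ceil((Math.pow(zAlpha + zBeta, 2) * variance) / Math.pow(delta, 2));
}

const usersFor10PctLift = sampleSizePerVariant(0.05, 0.10);
const usersFor5PctLift = sampleSizePerVariant(0.05, 0.05);
```

With these assumptions, a 10% lift on a 5% baseline comes out at a little over 31,000 users per variant and a 5% lift at roughly 122,000, in line with the rule of thumb. Halving the MDE roughly quadruples the required sample.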

Step 3

Create the experiment in PostHog

PostHog → Experiments → New experiment → set feature flag, primary metric, secondary metrics, exposure rule. Run for the calculated duration.

PostHog → Experiments → New experiment. Pick a name like `pricing_social_proof_test_2026_05`.

Create a multi-variant feature flag with variants: `control` (50%), `treatment` (50%). Keep the allocation equal.

Set the primary metric: `trial_signup_completed` (the event you defined in your event taxonomy). Choose "Funnel" if you care about a multi-step conversion, "Trends" for a single event.

Set secondary metrics (up to 3): `signup_to_paid_conversion`, `pricing_page_bounce_rate`. These give context, do not drive decisions.

Set exposure: who is in the experiment? Often "all users who visit the pricing page" (use a `$pageview` exposure event).

Set duration: minimum 1 week (to cover full weekly cycle), ideally 2-4 weeks. Set the planned end date in PostHog.

Start the experiment. PostHog freezes the feature flag config for the test duration.
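The planned duration follows from the sample size in step 2 and your daily traffic on the exposure event. A rough sketch, assuming traffic is steady and every exposed user is split evenly across variants. The function names and the example traffic figure are hypothetical.

```typescript
// Days needed to reach the required sample, rounded up to whole weeks
// so every day of the week is covered equally. Minimum one week.
function plannedDurationDays(
  requiredPerVariant: number,
  variantCount: number,
  dailyExposedUsers: number
): number {
  if (requiredPerVariant <= 0 || variantCount <= 0 || dailyExposedUsers <= 0) {
    throw new RangeError("all inputs must be positive");
  }
  const totalNeeded = requiredPerVariant * variantCount;
  const rawDays = Math.ceil(totalNeeded / dailyExposedUsers);
  const fullWeeks = Math.max(1, Math.ceil(rawDays / 7));
  return fullWeeks * 7;
}

// End date as YYYY-MM-DD, computed in UTC to avoid daylight-saving surprises.
function plannedEndDate(startDateIso: string, durationDays: number): string {
  const end = new Date(`${startDateIso}T00:00:00Z`);
  end.setUTCDate(end.getUTCDate() + durationDays);
  return end.toISOString().slice(0, 10);
}

// Example: 31,231 users per variant, 2 variants, 3,000 pricing-page visitors a day (assumed).
const durationDays = plannedDurationDays(31231, 2, 3000); // 21
const endDate = plannedEndDate("2026-05-04", durationDays); // "2026-05-25"
```

If the result lands beyond 4-6 weeks, go back to step 2 and pick a larger MDE rather than starting a test that cannot finish.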

Step 4

Implement the variant in code

Use the same multi-variant flag pattern from tutorial #4. The flag check decides which variant to show; PostHog tracks which user saw which variant.

In your code: `const variant = useFeatureFlagVariantKey("pricing_social_proof_test_2026_05"); return variant === "treatment" ? <PricingWithSocialProof/> : <PricingControl/>;`

PostHog automatically captures a `$feature_flag_called` event when the variant is read. This is how the experiment knows which user saw which variant.

CRITICAL: only call the flag check on the page where the variation matters. If you check the flag on every page, you "expose" users to the experiment without showing them the variant, which dilutes the measured effect and pollutes the results.

Test in dev with overrides: PostHog Toolbar (bookmarklet) lets you force a specific variant in your browser. Walk through both variants before launching.
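The inline snippet above treats anything other than `"treatment"` as control, including the moment before flags have loaded. A small sketch that pulls that mapping into a pure function so it can be unit-tested without React. The `"pending"` state and the function name are assumptions for illustration. In a real component the input would come from `useFeatureFlagVariantKey` in `posthog-js/react`, called only inside the pricing component.

```typescript
type PricingVariant = "pending" | "control" | "treatment";

// `flagValue` mirrors what a flag lookup can hand back: a variant key,
// a boolean, or undefined while flags are still loading.
function resolvePricingVariant(
  flagValue: string | boolean | undefined
): PricingVariant {
  if (flagValue === undefined) {
    return "pending"; // flags not loaded yet: render a placeholder, not a variant
  }
  // Any unexpected value falls back to control, the safe default.
  return flagValue === "treatment" ? "treatment" : "control";
}

const whileLoading = resolvePricingVariant(undefined);   // "pending"
const inTreatment = resolvePricingVariant("treatment");  // "treatment"
const unknownKey = resolvePricingVariant("something");   // "control"
```

Rendering a placeholder while the value is pending is one option for avoiding a flash of control before treatment appears. Whether that matters depends on how quickly flags load in your app.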

Step 5

Monitor the experiment without peeking

Do NOT check results daily and decide based on early trends. Set a calendar reminder for the end date. Look once, decide once.

Multiple-comparisons problem: if you check results daily and act on any "significant" reading, your effective false-positive rate goes from 5% to 30%+. You will ship "winning" variants that are actually random noise.

PostHog's experiment dashboard shows a 'don't peek' indicator. Trust it. Set a calendar reminder for the planned end date. Until then, do not look.

If you must monitor (for sanity), monitor exposure balance only — verify ~50/50 split. If the split is wrong, something is broken in the variant assignment. Pause and fix.

If the experiment shows the treatment is causing CATASTROPHIC harm (revenue down 40%, error rate up 10x), kill it. That's a legitimate safety stop, not statistical peeking.

Otherwise, wait for the planned duration. Then decide.
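If the team needs convincing, a small simulation shows the peeking problem directly. The sketch below runs A/A tests (both arms have the same true conversion rate, so every "significant" result is a false positive) and compares two policies: stopping at the first daily reading with |z| > 1.96, versus looking once at the end. Everything here is an assumption for illustration: the traffic numbers, the simple two-proportion z-test, and the seeded generator used to keep runs repeatable.

```typescript
// Small seeded linear congruential generator so the simulation is repeatable.
function seededRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Two-proportion z statistic with a pooled standard error.
function zStatistic(convA: number, nA: number, convB: number, nB: number): number {
  const pooled = (convA + convB) / (nA + nB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  return se === 0 ? 0 : (convB / nB - convA / nA) / se;
}

function simulateAATests(
  runs: number,
  days: number,
  usersPerDayPerArm: number,
  conversionRate: number,
  seed: number
): { peekingRate: number; singleLookRate: number } {
  const rand = seededRandom(seed);
  let peekingFalsePositives = 0;
  let singleLookFalsePositives = 0;
  for (let run = 0; run < runs; run++) {
    let convA = 0;
    let convB = 0;
    let usersPerArm = 0;
    let everSignificant = false;
    let z = 0;
    for (let day = 0; day < days; day++) {
      for (let u = 0; u < usersPerDayPerArm; u++) {
        if (rand() < conversionRate) convA++;
        if (rand() < conversionRate) convB++;
      }
      usersPerArm += usersPerDayPerArm;
      z = zStatistic(convA, usersPerArm, convB, usersPerArm); // the daily "peek"
      if (Math.abs(z) > 1.96) everSignificant = true;
    }
    if (everSignificant) peekingFalsePositives++;         // would have stopped early
    if (Math.abs(z) > 1.96) singleLookFalsePositives++;   // looked once, on the end date
  }
  return {
    peekingRate: peekingFalsePositives / runs,
    singleLookRate: singleLookFalsePositives / runs,
  };
}

const peekingDemo = simulateAATests(400, 14, 200, 0.1, 42);
```

The single-look rate should sit near the nominal 5%, while the peeking rate comes out several times higher. Exact figures vary with the seed and the inputs.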

Step 6

Interpret results and ship the decision

At end-date, check primary metric. If statistical significance + practical effect size both pass, ship the winner. If only one passes, do not ship.

Open PostHog → Experiments → your experiment. Look at the primary metric.

PostHog shows: probability that treatment is better, conversion rate per variant, credible interval (Bayesian) or p-value (Frequentist).

Decision rule: ship treatment if (a) probability of being better > 95% AND (b) the point estimate of lift is ≥ your minimum practical effect size from step 1.

If (a) passes but (b) doesn't (statistically significant but tiny effect): do not ship. The lift is real but not worth the engineering / cognitive overhead.

If (b) passes but (a) doesn't (big point estimate but high uncertainty): do not ship. You do not have enough data to commit.

Document the decision in your experiment doc. Archive the experiment. Schedule a flag-removal PR for 14 days out.
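The decision rule above is simple enough to write down as code, which also makes it harder to bend after the fact. A minimal sketch, assuming you copy the numbers out of the PostHog results page by hand. The `ExperimentReadout` shape and the function names are hypothetical, not a PostHog API.

```typescript
type Decision = "ship_treatment" | "keep_control";

// Values read off the experiment results page (hypothetical shape).
interface ExperimentReadout {
  probabilityTreatmentBetter: number; // e.g. 0.97
  controlRate: number;                // e.g. 0.05
  treatmentRate: number;              // e.g. 0.058
}

function relativeLift(readout: ExperimentReadout): number {
  return (readout.treatmentRate - readout.controlRate) / readout.controlRate;
}

// Ship only if (a) the probability clears the threshold AND
// (b) the point estimate clears the practical effect size from step 1.
function decide(
  readout: ExperimentReadout,
  minPracticalLift: number,
  probabilityThreshold: number = 0.95
): { decision: Decision; reason: string } {
  const confident = readout.probabilityTreatmentBetter > probabilityThreshold;
  const bigEnough = relativeLift(readout) >= minPracticalLift;
  if (confident && bigEnough) {
    return { decision: "ship_treatment", reason: "significant and practically meaningful" };
  }
  if (confident) {
    return { decision: "keep_control", reason: "real but too small to be worth shipping" };
  }
  if (bigEnough) {
    return { decision: "keep_control", reason: "large point estimate but too uncertain" };
  }
  return { decision: "keep_control", reason: "neither criterion met" };
}

const clearWin = decide(
  { probabilityTreatmentBetter: 0.97, controlRate: 0.05, treatmentRate: 0.058 },
  0.15
);
const tinyButReal = decide(
  { probabilityTreatmentBetter: 0.99, controlRate: 0.05, treatmentRate: 0.051 },
  0.15
);
const bigButNoisy = decide(
  { probabilityTreatmentBetter: 0.8, controlRate: 0.05, treatmentRate: 0.058 },
  0.15
);
```

Paste the returned reason into the experiment doc along with the raw numbers, so the decision can be audited later.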

Step 7

Audit for novelty and segment effects

Check: does the lift hold up for new users vs returning users? Mobile vs desktop? Power users vs casual? Sometimes a winning experiment is winning only for one segment.

After deciding the overall result, split by segment. PostHog → Experiments → result → Breakdown by `device`, `user_tier`, `signup_age`.

Look for inversions: treatment wins overall but loses for power users. Treatment wins on desktop but harms mobile.

For inverted segments, consider a segmented rollout: ship treatment only to the winning segments and keep control for the others.

Novelty effects: if you re-ran the test in 6 months, would treatment still win? Novelty drives initial wins that fade. Re-test major bets quarterly.
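The inversion check can be sketched as a comparison between each segment's lift and the overall lift. The example below assumes you have exported per-segment counts from the breakdown view. The data shapes and numbers are made up for illustration.

```typescript
interface ArmCounts {
  users: number;
  conversions: number;
}

interface SegmentResult {
  segment: string;
  control: ArmCounts;
  treatment: ArmCounts;
}

// Relative lift of treatment over control.
function segmentLift(control: ArmCounts, treatment: ArmCounts): number {
  const controlRate = control.conversions / control.users;
  const treatmentRate = treatment.conversions / treatment.users;
  return (treatmentRate - controlRate) / controlRate;
}

function sumArm(segments: SegmentResult[], arm: "control" | "treatment"): ArmCounts {
  return segments.reduce(
    (total, s) => ({
      users: total.users + s[arm].users,
      conversions: total.conversions + s[arm].conversions,
    }),
    { users: 0, conversions: 0 }
  );
}

// Names of segments whose lift points the opposite way from the overall lift.
function findInversions(segments: SegmentResult[]): string[] {
  const overall = segmentLift(sumArm(segments, "control"), sumArm(segments, "treatment"));
  return segments
    .filter((s) => segmentLift(s.control, s.treatment) * overall < 0)
    .map((s) => s.segment);
}

// Made-up counts: treatment wins on desktop and overall, but loses on mobile.
const byDevice: SegmentResult[] = [
  {
    segment: "desktop",
    control: { users: 6000, conversions: 300 },
    treatment: { users: 6000, conversions: 360 },
  },
  {
    segment: "mobile",
    control: { users: 4000, conversions: 200 },
    treatment: { users: 4000, conversions: 180 },
  },
];

const invertedSegments = findInversions(byDevice); // ["mobile"]
```

Segments are smaller than the full sample, so their lifts are noisier. Treat an inversion as a hypothesis for a follow-up test rather than a conclusion.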

Common mistakes

What goes wrong (and how to avoid it)

Peeking at results and stopping early
What goes wrong: Daily check at day 4 shows treatment +12%, p=0.04. You ship it. Three months later you re-run the test and treatment is actually 3% WORSE. The 'win' was noise. Estimated cost of the wrong ship: $15K-50K in lost conversions or wasted engineering on a doomed feature.
How to avoid: Pre-register sample size. Set a calendar reminder. Look at results ONCE, on the planned end-date. Use Bayesian methods (PostHog supports them) if you absolutely need early-stopping rules.
Running underpowered tests
What goes wrong: You test a 5% lift on a 5K-DAU product. The test runs 2 weeks, shows treatment +8% (p=0.12, "not significant"). You conclude "no effect." But the test had only 30% power to detect a real 5% lift. You wrote off a real winner as a loser. Six months of conversion loss before someone retests.
How to avoid: Calculate required sample size BEFORE starting. If you cannot reach it in a reasonable time, do not run an A/B test — use qualitative research or a larger MDE.
Multiple-comparisons problem
What goes wrong: You set 6 secondary metrics. Run the test. The primary is null but secondary #4 shows p=0.04. You ship anyway, claiming 'we found something.' Reality: with 6 metrics, the chance of at least one false-positive is ~26%. You shipped noise.
How to avoid: Decide BEFORE the test which metric is primary. Make the decision on the primary only. Use secondary metrics for context, not decisions. If you really need multiple primary metrics, apply a Bonferroni correction (p_threshold / n_metrics).
Pre-exposure pollution
What goes wrong: Flag check happens in app shell. 80% of "exposed" users never see the test surface. Statistical power is destroyed. Experiment runs 6 weeks instead of 2 and conclusions are weaker.
How to avoid: Move the `useFeatureFlagVariantKey` call to the specific component that renders the variation. PostHog only "exposes" the user when the check actually runs.
Not accounting for weekly cycles
What goes wrong: Test runs Wednesday through Friday. Weekend users (different behavior) are not represented. Result extrapolates a weekday-only sample to all users. Decision is wrong for ~30% of the user base.
How to avoid: Run experiments for at least 1 full week, ideally 2. Cover all day-of-week and time-of-day cycles. For B2B with strong weekday patterns, 4 weeks minimum.
Ignoring sample ratio mismatch (SRM)
What goes wrong: Test allocates 50/50 but PostHog reports 53/47 actual exposure. Something is broken in variant assignment — maybe the flag check is failing for some users. The experiment data is corrupted. You make a decision on unbalanced data. The "winner" was just the over-represented variant.
How to avoid: Run a chi-square test on exposure counts at end of experiment (or check PostHog's built-in SRM warning). If imbalance exceeds 1-2 percentage points, debug the flag-assignment logic before trusting any result.
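Two of the checks above reduce to a few lines of arithmetic: the false-positive rate across several metrics (with the Bonferroni threshold), and the chi-square test for sample ratio mismatch. The sketch below assumes independent metrics and a two-variant experiment. The strict critical value of 10.828 (chi-square with 1 degree of freedom at p < 0.001) is an assumed threshold often used for SRM checks, not a PostHog setting.

```typescript
// Chance of at least one false positive across independent metrics.
function familyWiseErrorRate(alpha: number, metricCount: number): number {
  return 1 - Math.pow(1 - alpha, metricCount);
}

// Per-metric threshold after a Bonferroni correction.
function bonferroniThreshold(alpha: number, metricCount: number): number {
  return alpha / metricCount;
}

// Chi-square critical value for 1 degree of freedom at p < 0.001 (assumed threshold).
const CHI_SQUARE_CRITICAL_DF1_P001 = 10.828;

// Chi-square statistic comparing observed exposure counts to the planned split.
function srmChiSquare(
  controlUsers: number,
  treatmentUsers: number,
  expectedControlShare: number = 0.5
): number {
  const total = controlUsers + treatmentUsers;
  if (total <= 0) {
    throw new RangeError("need at least one exposed user");
  }
  const expectedControl = total * expectedControlShare;
  const expectedTreatment = total - expectedControl;
  return (
    Math.pow(controlUsers - expectedControl, 2) / expectedControl +
    Math.pow(treatmentUsers - expectedTreatment, 2) / expectedTreatment
  );
}

function hasSampleRatioMismatch(
  controlUsers: number,
  treatmentUsers: number,
  criticalValue: number = CHI_SQUARE_CRITICAL_DF1_P001
): boolean {
  return srmChiSquare(controlUsers, treatmentUsers) > criticalValue;
}

const sixMetricRisk = familyWiseErrorRate(0.05, 6);               // about 0.26
const perMetricThreshold = bonferroniThreshold(0.05, 6);          // about 0.0083
const smallTestMismatch = hasSampleRatioMismatch(530, 470);       // false
const largeTestMismatch = hasSampleRatioMismatch(53000, 47000);   // true
```

The same 53/47 split is unremarkable at 1,000 users and a clear red flag at 100,000, which is why a chi-square test is a safer check than eyeballing the percentages.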

Recap

What to take away

Write the hypothesis with effect size + mechanism BEFORE looking at data.
Calculate sample size. If you cannot reach it, do not run the test.
One primary metric. Decide on that. Everything else is context.
Do not peek. Set the end-date and respect it.
Verify exposure balance (SRM) before trusting any result.

Done — what's next

How to set up PostHog feature flags for safe rollouts

Hand it off

A/B testing is a tool that looks free and is actually expensive — most DIY tests produce decisions made on noise, then those decisions ship to production for months. The fix is process and statistical discipline. EverestX matches you with a vetted PostHog specialist who designs the first 3 experiments and trains the team on the framework — typically $1,200-3,000 once + $400-600/mo ongoing.

Frequently Asked Questions

How long should I run an A/B test?

Minimum 7 days (to cover one weekly cycle). Ideally 14-28 days. Long enough for the calculated sample size, but not so long that external factors (seasonality, marketing campaigns) distort the result. Most B2B tests run 2-4 weeks.

Frequentist vs Bayesian — which should I use?

PostHog supports both. Bayesian (default) gives "probability treatment is better than control" — easier to communicate to non-statisticians, supports early stopping with proper priors. Frequentist gives p-values — more familiar to teams from a traditional stats background. For most product teams, Bayesian is the better default.

My experiment is "inconclusive" — what does that mean?

PostHog couldn't reach 95% probability of one variant being better. The most common cause is an undersized sample: you needed 20K conversions per variant and only got 8K. Don't ship treatment based on a directional reading. Either extend the experiment, accept the null result, or run a follow-up with a bigger MDE.

Can I test multiple things in one experiment?

A multivariate test (MVT) tests combinations: button-color × headline-variant × CTA-text. Powerful but requires huge traffic — each combination needs full statistical power. For most teams, run sequential single-factor A/B tests instead. Faster decisions, cleaner attribution.

When should I use experiments vs feature flags vs surveys?

Experiments: when you need to make a causal decision (ship A or B?) and you have traffic. Feature flags: when you know what you want and are de-risking the rollout. Surveys: when you want to understand *why* users behave a certain way, not whether the change worked.

Related tutorials

How to set up PostHog feature flags for safe rollouts

How to set up PostHog event tracking with a taxonomy that scales

How to set up PostHog surveys that get real responses

When to hire a PostHog specialist — an honest checklist