Running tests is easy. Running tests that produce real decisions is hard. This walks through hypothesis design, sample-size calculation, the PostHog experiment UI, and the six statistical mistakes that invalidate 80% of DIY A/B tests.
Who this is for: Product managers, growth leads, and engineers running A/B tests inside PostHog. Especially relevant if you have run tests before but suspect the conclusions were not statistically valid; most are not.
What you'll need
Step 1
A real hypothesis: "If we change X, then Y will increase by Z%, because of mechanism W." Write it on paper. Pre-register it. Stick to it.
Template: "If we [change], then [primary metric] will [direction] by [effect size]% within [duration], because [mechanism]."
Example: "If we add social proof above the pricing CTA, then trial-signup rate will increase by 15% within 14 days, because users are anchoring on whether other users have trusted the product."
Bad hypothesis: "Let's try a green button instead of blue." (No effect size, no mechanism, no primary metric.)
Document the hypothesis in a shared doc with a date. This pre-registration is the difference between a real experiment and post-hoc storytelling.
List the secondary metrics (signup-to-paid conversion, average order value) — but commit upfront that the decision will be made on the primary metric. Secondary metrics give context, not decisions.
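One way to keep the pre-registration honest is to store it as a structured record next to the shared doc. The sketch below is only an illustration: the `Hypothesis` type, its field names, `renderHypothesis`, and the `registeredOn` date are all hypothetical and not part of PostHog.

```typescript
// Hypothetical shape for a pre-registered hypothesis. Field names are illustrative.
interface Hypothesis {
  change: string;
  primaryMetric: string;
  direction: "increase" | "decrease";
  effectSizePct: number;
  durationDays: number;
  mechanism: string;
  registeredOn: string; // ISO date of pre-registration (placeholder value below)
  secondaryMetrics: string[]; // context only, never the decision metric
}

// Fills in the template from this step so every hypothesis reads the same way.
function renderHypothesis(h: Hypothesis): string {
  return (
    "If we " + h.change + ", then " + h.primaryMetric + " will " + h.direction +
    " by " + h.effectSizePct + "% within " + h.durationDays + " days, because " +
    h.mechanism + "."
  );
}

const socialProofHypothesis: Hypothesis = {
  change: "add social proof above the pricing CTA",
  primaryMetric: "trial-signup rate",
  direction: "increase",
  effectSizePct: 15,
  durationDays: 14,
  mechanism: "users are anchoring on whether other users have trusted the product",
  registeredOn: "2026-05-01", // placeholder date, not taken from the tutorial
  secondaryMetrics: ["signup-to-paid conversion", "average order value"],
};
```

Calling `renderHypothesis(socialProofHypothesis)` reproduces the example sentence above, which makes it easy to spot a hypothesis that is missing an effect size or a mechanism.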
Step 2
Use the PostHog sample-size calculator (built into experiment setup). As a reference point, detecting a 10% relative lift on a 5% baseline takes roughly 31,000 users (about 1,500 conversions) per variant. Smaller lifts need more.
PostHog → Experiments → New experiment → it shows a built-in calculator: enter baseline conversion rate + minimum detectable effect (MDE).
Rule of thumb: detecting a 10% relative lift on a 5% baseline conversion rate requires ~31,000 users per variant (62K total). Detecting a 5% lift requires ~125K per variant.
If you do not have that traffic in a reasonable time (4-6 weeks), pick a larger MDE (test for a 25% lift, not 5%). You will detect only big effects, but at least the test will conclude.
For sub-1,000-DAU products, A/B testing is usually not the right tool. Run qualitative user research instead, or use Bayesian methods (PostHog supports both Frequentist and Bayesian).
NEVER start a test without knowing what sample size you need to detect what effect. Tests that "look directional" are noise.
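The rule-of-thumb numbers in this step can be sanity-checked with the standard two-proportion sample-size formula. This is a minimal sketch, assuming a two-sided alpha of 0.05 and 80% power (the z-scores are hard-coded for those values). PostHog's built-in calculator may use a different method, so treat the output as a cross-check rather than a replacement. The function name `usersPerVariant` is made up for this example.

```typescript
// z-scores for a two-sided alpha of 0.05 and 80% power (assumed defaults).
const Z_ALPHA = 1.959964;
const Z_POWER = 0.841621;

// Approximate users needed per variant to detect a relative lift
// on a baseline conversion rate, using the two-proportion formula.
function usersPerVariant(baselineRate: number, relativeLift: number): number {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const delta = p2 - p1;
  const zSum = Z_ALPHA + Z_POWER;
  return Math.ceil((zSum * zSum * variance) / (delta * delta));
}

// 5% baseline, 10% relative lift: roughly 31,000 users per variant.
const usersFor10PctLift = usersPerVariant(0.05, 0.10);
// 5% baseline, 5% relative lift: roughly 122,000 users per variant.
const usersFor5PctLift = usersPerVariant(0.05, 0.05);
```

Halving the MDE roughly quadruples the required sample, which is why a larger MDE is the practical escape hatch when traffic is limited.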
Step 3
PostHog → Experiments → New experiment → set feature flag, primary metric, secondary metrics, exposure rule. Run for the calculated duration.
PostHog → Experiments → New experiment. Pick a name like `pricing_social_proof_test_2026_05`.
Create a multi-variant feature flag with variants: `control` (50%), `treatment` (50%). PostHog splits traffic equally between variants by default.
Set the primary metric: `trial_signup_completed` (the event you defined in your event taxonomy). Choose "Funnel" if you care about a multi-step conversion, "Trends" for a single event.
Set secondary metrics (up to 3): `signup_to_paid_conversion`, `pricing_page_bounce_rate`. These give context; they do not drive decisions.
Set exposure: who is in the experiment? Often "all users who visit the pricing page" (use a `$pageview` exposure event).
Set duration: minimum 1 week (to cover full weekly cycle), ideally 2-4 weeks. Set the planned end date in PostHog.
Start the experiment. PostHog freezes the feature flag config for the test duration.
Step 4
Use the same multi-variant flag pattern from tutorial #4. The flag check decides which variant to show; PostHog tracks which user saw which variant.
In your code: `const variant = useFeatureFlagVariantKey("pricing_social_proof_test_2026_05"); return variant === "treatment" ? <PricingWithSocialProof/> : <PricingControl/>;`
PostHog automatically captures a `$feature_flag_called` event when the variant is read. This is how the experiment knows which user saw which variant.
CRITICAL: only call the flag check on the page where the variation matters. If you check the flag on every page, you "expose" users to the experiment without showing them the variant, which dilutes the measured effect and pollutes the results.
Test in dev with overrides: PostHog Toolbar (bookmarklet) lets you force a specific variant in your browser. Walk through both variants before launching.
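One detail the inline snippet glosses over is what to render when the flag value is not the string you expect. The sketch below assumes the hook can hand back a variant key string, a boolean, or `undefined` while flags are still loading. Check that assumption against the SDK version you use. `resolvePricingVariant` is a hypothetical helper, not a PostHog function.

```typescript
type PricingVariant = "control" | "treatment";

// Maps whatever the flag hook returns to a variant to render.
// Anything other than the exact string "treatment" falls back to control,
// so loading states and unexpected values never show the new design.
function resolvePricingVariant(
  flagValue: string | boolean | undefined
): PricingVariant {
  return flagValue === "treatment" ? "treatment" : "control";
}
```

In the pricing component, the result of `useFeatureFlagVariantKey("pricing_social_proof_test_2026_05")` would be passed through this helper before choosing between `PricingWithSocialProof` and `PricingControl`. Falling back to control is a design choice: it keeps the default experience stable if the flag fails to load.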
Step 5
Do NOT check results daily and decide based on early trends. Set a calendar reminder for the end date. Look once, decide once.
Multiple-comparisons problem: if you check results daily and act on any "significant" reading, your effective false-positive rate goes from 5% to 30%+. You will ship "winning" variants that are actually random noise.
PostHog's experiment dashboard shows a 'don't peek' indicator. Trust it. Set a calendar reminder for the planned end date. Until then, do not look.
If you must monitor (for sanity), monitor exposure balance only — verify ~50/50 split. If the split is wrong, something is broken in the variant assignment. Pause and fix.
If the experiment shows the treatment is causing CATASTROPHIC harm (revenue down 40%, error rate up 10x), kill it. That's a legitimate safety stop, not statistical peeking.
Otherwise, wait for the planned duration. Then decide.
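The cost of peeking can be seen in a small A/A simulation: both variants share the same true conversion rate, so every "significant" result is a false positive. This is a rough sketch under stated assumptions (a two-proportion z-test at |z| > 1.96, one look per day, a simple seeded random generator). All names and parameters are illustrative, and the exact rates depend on the parameters you choose.

```typescript
// Simple seeded Park-Miller generator so runs are reproducible.
function makeRng(seed: number): () => number {
  let state = seed % 2147483647;
  if (state <= 0) state += 2147483646;
  return () => {
    state = (state * 48271) % 2147483647;
    return state / 2147483647;
  };
}

// Two-proportion z-test with equal group sizes, two-sided at alpha = 0.05.
function isSignificant(convA: number, convB: number, n: number): boolean {
  const pooled = (convA + convB) / (2 * n);
  const se = Math.sqrt(pooled * (1 - pooled) * (2 / n));
  if (se === 0) return false;
  return Math.abs((convB - convA) / n / se) > 1.96;
}

// Runs many A/A tests and compares "stop at the first significant daily look"
// against "look once on the final day".
function simulateAATests(
  runs: number, days: number, usersPerDay: number, rate: number, seed: number
): { peekingRate: number; singleLookRate: number } {
  const rng = makeRng(seed);
  let flaggedOnAnyDay = 0;
  let flaggedOnFinalDay = 0;
  for (let r = 0; r < runs; r++) {
    let convA = 0;
    let convB = 0;
    let sawSignificant = false;
    let finalSignificant = false;
    for (let d = 1; d <= days; d++) {
      for (let u = 0; u < usersPerDay; u++) {
        if (rng() < rate) convA++;
        if (rng() < rate) convB++;
      }
      finalSignificant = isSignificant(convA, convB, d * usersPerDay);
      if (finalSignificant) sawSignificant = true;
    }
    if (sawSignificant) flaggedOnAnyDay++;
    if (finalSignificant) flaggedOnFinalDay++;
  }
  return {
    peekingRate: flaggedOnAnyDay / runs,
    singleLookRate: flaggedOnFinalDay / runs,
  };
}

const aaResult = simulateAATests(1000, 14, 200, 0.05, 42);
```

Because the final day is one of the daily looks, `peekingRate` can never be lower than `singleLookRate`. The gap between the two is the extra false-positive risk you take on by acting on early readings.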
Step 6
At end-date, check primary metric. If statistical significance + practical effect size both pass, ship the winner. If only one passes, do not ship.
Open PostHog → Experiments → your experiment. Look at the primary metric.
PostHog shows: probability that treatment is better, conversion rate per variant, credible interval (Bayesian) or p-value (Frequentist).
Decision rule: ship treatment if (a) probability of being better > 95% AND (b) the point estimate of lift is ≥ your minimum practical effect size from step 1.
If (a) passes but (b) doesn't (statistically significant but tiny effect): do not ship. The lift is real but not worth the engineering / cognitive overhead.
If (b) passes but (a) doesn't (big point estimate but high uncertainty): do not ship. You do not have enough data to commit.
Document the decision in your experiment doc. Archive the experiment. Schedule a flag-removal PR for 14 days out.
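The two-part decision rule in this step is simple enough to write down as code, which helps keep the decision mechanical once results are in. A minimal sketch, assuming the Bayesian readout ("probability treatment is better") and lifts expressed as fractions. `decideShip` and its parameter names are hypothetical.

```typescript
type Decision = { ship: boolean; reason: string };

// Applies the two-part rule: ship only when the result is both
// statistically convincing (a) and practically large enough (b).
function decideShip(
  probTreatmentBetter: number, // e.g. 0.97 for 97%
  observedLift: number,        // point estimate, e.g. 0.16 for +16%
  minPracticalLift: number     // pre-registered in step 1, e.g. 0.15
): Decision {
  const convincing = probTreatmentBetter > 0.95;
  const meaningful = observedLift >= minPracticalLift;
  if (convincing && meaningful) {
    return { ship: true, reason: "significant and practically meaningful" };
  }
  if (convincing) {
    return { ship: false, reason: "significant but below the practical effect size" };
  }
  if (meaningful) {
    return { ship: false, reason: "large point estimate but too uncertain" };
  }
  return { ship: false, reason: "neither criterion met" };
}
```

For example, `decideShip(0.97, 0.03, 0.15)` returns a no-ship decision because the lift is below the pre-registered 15%, even though the probability clears 95%.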
Step 7
Check: does the lift hold up for new users vs returning users? Mobile vs desktop? Power users vs casual? Sometimes a winning experiment is winning only for one segment.
After deciding the overall result, split by segment. PostHog → Experiments → result → Breakdown by `device`, `user_tier`, `signup_age`.
Look for inversions: treatment wins overall but loses for power users. Treatment wins on desktop but harms mobile.
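For inverted segments, consider a segment-targeted rollout: ship treatment only to the winning segments, keep control for the others. Treat segment findings as exploratory, because slicing the result several ways is itself a multiple-comparisons exercise.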
Novelty effects: if you re-ran the test in 6 months, would treatment still win? Novelty drives initial wins that fade. Re-test major bets quarterly.
Common mistakes
Peeking at results and stopping early
What goes wrong: Daily check at day 4 shows treatment +12%, p=0.04. You ship it. Three months later you re-run the test and treatment is actually 3% WORSE. The 'win' was noise. Estimated cost of the wrong ship: $15K-50K in lost conversions or wasted engineering on a doomed feature.
How to avoid: Pre-register sample size. Set a calendar reminder. Look at results ONCE, on the planned end-date. Use Bayesian methods (PostHog supports them) if you absolutely need early-stopping rules.
Running underpowered tests
What goes wrong: You test a 5% lift on a 5K-DAU product. The test runs 2 weeks, shows treatment +8% (p=0.12, "not significant"). You conclude "no effect." But the test had only 30% power to detect a real 5% lift. You discarded a real winner as a loser. Six months of conversion loss before someone retests.
How to avoid: Calculate required sample size BEFORE starting. If you cannot reach it in a reasonable time, do not run an A/B test — use qualitative research or a larger MDE.
Multiple-comparisons problem
What goes wrong: You set 6 secondary metrics. Run the test. The primary is null but secondary #4 shows p=0.04. You ship anyway, claiming 'we found something.' Reality: with 6 metrics, the chance of at least one false-positive is ~26%. You shipped noise.
How to avoid: Decide BEFORE the test which metric is primary. Make the decision on the primary only. Use secondary metrics for context, not decisions. If you really need multiple primary metrics, apply a Bonferroni correction (p_threshold / n_metrics).
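The ~26% figure and the Bonferroni threshold both come from one-line formulas. The sketch below assumes the metrics are independent, which real product metrics rarely are, so read the family-wise rate as an approximation. The function names are illustrative.

```typescript
// Chance of at least one false positive across `metrics` independent tests,
// each run at significance level `alpha`.
function familyWiseErrorRate(alpha: number, metrics: number): number {
  return 1 - Math.pow(1 - alpha, metrics);
}

// Bonferroni correction: the per-metric threshold that keeps the
// family-wise rate at or below `alpha`.
function bonferroniThreshold(alpha: number, metrics: number): number {
  return alpha / metrics;
}

// Six metrics at alpha = 0.05: about a 26% chance of at least one false positive.
const fwerSixMetrics = familyWiseErrorRate(0.05, 6);
// Corrected per-metric threshold: about 0.0083.
const correctedThreshold = bonferroniThreshold(0.05, 6);
```

Under the corrected threshold, the p=0.04 reading on secondary metric #4 in the example above would not count as significant.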
Pre-exposure pollution
What goes wrong: Flag check happens in app shell. 80% of "exposed" users never see the test surface. Statistical power is destroyed. Experiment runs 6 weeks instead of 2 and conclusions are weaker.
How to avoid: Move the `useFeatureFlagVariantKey` call to the specific component that renders the variation. PostHog only "exposes" the user when the check actually runs.
Not accounting for weekly cycles
What goes wrong: Test runs Wednesday through Friday. Weekend users (different behavior) are not represented. Result extrapolates a weekday-only sample to all users. Decision is wrong for ~30% of the user base.
How to avoid: Run experiments for at least 1 full week, ideally 2. Cover all day-of-week and time-of-day cycles. For B2B with strong weekday patterns, 4 weeks minimum.
Ignoring sample ratio mismatch (SRM)
What goes wrong: Test allocates 50/50 but PostHog reports 53/47 actual exposure. Something is broken in variant assignment — maybe the flag check is failing for some users. The experiment data is corrupted. You make a decision on unbalanced data. The "winner" was just the over-represented variant.
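How to avoid: Run a chi-square test on exposure counts at the end of the experiment (or check PostHog's built-in SRM warning). If the test rejects the planned split, debug the flag-assignment logic before trusting any result. A fixed 1-2 percentage point rule is only a rough guide, because the same gap is noise on a small sample and a clear mismatch on a large one.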
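The chi-square check on exposure counts can be sketched in a few lines. This example assumes a two-variant test and flags a mismatch when the statistic exceeds 10.828, the critical value for one degree of freedom at p = 0.001. That strict threshold is an assumed convention for SRM checks, not a PostHog setting, and the function names are hypothetical.

```typescript
// Chi-square critical value for 1 degree of freedom at p = 0.001 (assumed threshold).
const CHI2_CRITICAL_DF1_P001 = 10.828;

// Goodness-of-fit statistic comparing observed exposure counts
// with the planned allocation (50/50 unless stated otherwise).
function srmChiSquare(
  controlCount: number,
  treatmentCount: number,
  expectedControlShare: number = 0.5
): number {
  const total = controlCount + treatmentCount;
  const expectedControl = total * expectedControlShare;
  const expectedTreatment = total - expectedControl;
  const dControl = controlCount - expectedControl;
  const dTreatment = treatmentCount - expectedTreatment;
  return (
    (dControl * dControl) / expectedControl +
    (dTreatment * dTreatment) / expectedTreatment
  );
}

function hasSampleRatioMismatch(controlCount: number, treatmentCount: number): boolean {
  return srmChiSquare(controlCount, treatmentCount) > CHI2_CRITICAL_DF1_P001;
}
```

A 53/47 split across 10,000 exposures (5,300 vs 4,700) gives a statistic of 36 and is flagged, while the same 53/47 split across 100 exposures gives 0.36 and is not. The percentage gap alone does not tell you whether assignment is broken.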
Recap
Done — what's next
How to set up PostHog feature flags for safe rollouts
Hand it off
A/B testing is a tool that looks free and is actually expensive — most DIY tests produce decisions made on noise, then those decisions ship to production for months. The fix is process and statistical discipline. EverestX matches you with a vetted PostHog specialist who designs the first 3 experiments and trains the team on the framework — typically $1,200-3,000 once + $400-600/mo ongoing.
Minimum 7 days (to cover one weekly cycle). Ideally 14-28 days. Long enough for the calculated sample size, but not so long that external factors (seasonality, marketing campaigns) distort the result. Most B2B tests run 2-4 weeks.
PostHog supports both. Bayesian (default) gives "probability treatment is better than control" — easier to communicate to non-statisticians, supports early stopping with proper priors. Frequentist gives p-values — more familiar to teams from a traditional stats background. For most product teams, Bayesian is the better default.
PostHog couldn't reach 95% probability of one variant being better. The most common cause is an undersized sample: you needed 20K conversions per variant and only got 8K. Don't ship treatment based on a directional reading. Either extend the experiment, accept the null result, or run a follow-up with a bigger MDE.
A multivariate test (MVT) tests combinations: button-color × headline-variant × CTA-text. Powerful but requires huge traffic — each combination needs full statistical power. For most teams, run sequential single-factor A/B tests instead. Faster decisions, cleaner attribution.
Experiments: when you need to make a causal decision (ship A or B?) and you have traffic. Feature flags: when you know what you want and are de-risking the rollout. Surveys: when you want to understand *why* users behave a certain way, not whether the change worked.