60-70% of A/B tests don't reach 95% significance. That doesn't mean VWO is broken — it usually means the test was designed wrong. This is the diagnostic that separates 'no winner' from 'broken setup.'
Who this is for
Teams running VWO tests that keep coming back inconclusive or take 60+ days without resolution. If your last 3 tests all ended at 75-90% probability and you're not sure whether to ship or wait, this troubleshooting flow is for you.
What you'll need
Step 1
Most tests "stuck at 90% significance" haven't reached calculated sample yet. Pull the math first — half of "stuck tests" aren't actually stuck.
Pull your calculated required sample from the VWO Sample Size Calculator (run it now if you skipped it before launch). Compare it to current visitors per variant.
Common pattern: required sample was 14,000 per variant; current is 9,000 per variant. The test isn't stuck; it's about 64% done. Keep waiting.
If you didn't calculate sample upfront: do it now. Use current baseline conversion rate, 10% MDE, 95% confidence, 80% power. The calculator returns required sample.
If current sample is <80% of required: not stuck. Wait. Don't peek daily.
If current sample is >120% of required AND probability is <90%: there's no clear winner. Move to step 2.
If current sample is between 80-120%: borderline. Wait 2-3 more days, then evaluate.
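The Step 1 check can be sketched in a few lines of Python. This is a minimal illustration, not VWO's calculator: it assumes a standard two-sided two-proportion power formula, and the function names `required_sample_per_variant` and `progress_status` are made up for this example. VWO's own calculator may return different numbers.

```python
import math
from statistics import NormalDist


def required_sample_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    """Approximate visitors needed per variant (two-sided two-proportion test).

    Assumption: textbook normal-approximation formula, not VWO's internal method.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def progress_status(current_per_variant, required_per_variant):
    """Apply the 80% / 120% thresholds from Step 1."""
    ratio = current_per_variant / required_per_variant
    if ratio < 0.80:
        return "not stuck: keep waiting"
    if ratio <= 1.20:
        return "borderline: wait 2-3 more days"
    return "sample exceeded: move to step 2 if probability is still below 90%"


# The example from the text: 9,000 of 14,000 visitors per variant collected.
print(progress_status(9_000, 14_000))
```

Plugging in your own baseline rate and MDE shows how quickly the requirement grows as the MDE shrinks, which is worth knowing before deciding a test is stuck.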
Step 2
A variant with zero conversions isn't 'not winning' — it's broken. Compare conversion COUNTS, not just rates. A clean variant should have non-zero conversions within the first few hundred visitors.
Open the test report. Look at absolute conversion counts per variant, not just conversion rates.
If one variant has zero or near-zero conversions despite hundreds of visitors, the variant is broken. Common causes:
— Variant CSS hides the conversion CTA (a layout change broke the button position).
— Variant JS errors prevent the conversion action (form submit broken, link broken).
— Variant analytics tracking is broken (variant fires but the goal event doesn't fire).
Reproduce: open the test page in incognito and force the variant via VWO Helper Chrome extension. Try to complete the conversion. If you can't (or hit a console error), the variant is broken.
Fix the variant, then RESTART the test from scratch. Don't try to salvage data from a broken test — the data is contaminated.
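To see why zero conversions after a few hundred visitors points to breakage rather than bad luck, a quick back-of-the-envelope calculation helps. The sketch below assumes each visitor converts independently at the baseline rate; the 2.3% baseline is only an illustrative value, and `prob_zero_conversions` is a hypothetical helper name.

```python
def prob_zero_conversions(baseline_rate, visitors):
    """Chance that a healthy variant converting at baseline_rate
    shows exactly zero conversions after this many visitors."""
    return (1 - baseline_rate) ** visitors


# Illustrative baseline of 2.3%; swap in your own control conversion rate.
for visitors in (100, 300, 500):
    p = prob_zero_conversions(0.023, visitors)
    print(f"{visitors} visitors: P(zero conversions) = {p:.4f}")
```

If that probability is tiny for your traffic level and the variant still shows zero conversions, treat it as broken and reproduce the issue manually.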
Step 3
If your 50/50 split is actually showing 55/45 traffic distribution, there's a contamination issue — usually caching, redirects, or bot traffic affecting one variant differently.
Open the test report. Look at total visitors per variant. For a 50/50 test, the split should be within ~5% of even. (5,000 vs 5,200 = fine. 5,000 vs 6,500 = SRM.)
Sample Ratio Mismatch (SRM) is a red flag. It means something is selectively excluding traffic from one variant, contaminating the comparison.
Common SRM causes:
— Variant uses a slow-loading custom font/CSS that times out for some users (those users see Control because Variant fails to render in time).
— Variant is cached differently by CDN — one variant gets cached HTML while the other gets fresh HTML.
— Bot traffic disproportionately hits one variant (bots don't reliably re-roll variants on each visit).
If SRM is present, the test result is unreliable. Pause the test, debug the cause, restart fresh. Don't try to interpret SRM'd data.
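The ~5% rule above is a quick heuristic. A common statistical companion to it, not part of this tutorial's method, is a chi-square goodness-of-fit test on the visitor counts. The sketch below shows both side by side using only the standard library; the function names are hypothetical, and any p-value cutoff you pick (analysts often use a strict one such as 0.001 to avoid false alarms) is an assumption rather than a VWO setting.

```python
import math


def srm_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # With 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def relative_gap(visitors_a, visitors_b):
    """Gap between the arms relative to the smaller arm (the ~5% heuristic)."""
    return abs(visitors_a - visitors_b) / min(visitors_a, visitors_b)


# The two splits mentioned in the text.
for a, b in ((5_000, 5_200), (5_000, 6_500)):
    print(a, b, f"gap={relative_gap(a, b):.1%}", f"p={srm_p_value(a, b):.3g}")
```

The two checks can disagree on borderline splits, especially at high traffic where even a small percentage gap is statistically unusual, so it is worth running both.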
Step 4
Most variants don't lift conversion by 10%+. If your test is designed to detect 10% lift but the variant only moves the needle 3%, the test will run forever without significance.
Open the report. Check current relative lift: e.g., Variant at 2.4% vs Control at 2.3% = ~4% relative lift.
Your declared MDE was likely 10% (or higher). The actual effect appears to be much smaller.
Two possible interpretations:
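— The variant has a real but small effect (3-5% lift). To detect this with confidence, you need roughly 4-11x more sample than you planned for a 10% MDE, because required sample grows with the inverse square of the effect size. On a 1K-visitor/day page, that's typically 4 or more months of test runtime. Usually not worth it.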
— The variant has NO real effect. The observed 'lift' is random noise. With enough sample, the rate would converge back to no difference.
Decision: if the observed lift is below your MDE after substantial sample, declare inconclusive and ship neither. Don't extend the test hoping for significance — at small effect sizes, you'll get false positives more often than real wins.
For your next test, either design bigger changes (test things expected to move conversion 10%+) or accept smaller MDE with the runtime cost.
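The arithmetic in this step can be sketched as follows. The inverse-square rule used in `sample_multiplier` is an approximation (it ignores the small change in variance between rates), and both function names are hypothetical.

```python
def relative_lift(control_rate, variant_rate):
    """Relative lift of the variant over control, e.g. 0.04 for 4%."""
    return (variant_rate - control_rate) / control_rate


def sample_multiplier(planned_mde, observed_lift):
    """Approximate factor by which required sample grows when the effect
    to detect shrinks from planned_mde to observed_lift (inverse-square rule)."""
    return (planned_mde / observed_lift) ** 2


# The example from the text: Variant at 2.4% vs Control at 2.3%.
lift = relative_lift(0.023, 0.024)
print(f"relative lift: {lift:.1%}")
print(f"sample needed vs a 10% MDE plan: about {sample_multiplier(0.10, lift):.1f}x")
```

Running this with your own rates gives a rough sense of whether extending the test is realistic or whether declaring it inconclusive is the cheaper call.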
Step 5
If you've been opening the report daily, you've already been biased. Future decisions on this test are likely wrong. Acknowledge it and reset.
Honestly: how many times have you opened the test report since launch? If the answer is more than 5 times in a 14-day window, peeking has probably biased your judgment.
Peeking creates two problems: (a) you're more likely to call a winner on a random spike, (b) you're emotionally invested in a direction, which biases future test interpretation.
If you've been peeking, the cleanest reset is: write down your current expectation ("I think Variant is winning by 8%"), close the report, set a calendar reminder for the calculated sample date, and don't reopen until then.
When you do reopen, evaluate against the calculated sample threshold ONLY. Probability >95% AND lift > MDE = ship. Otherwise = inconclusive.
For future tests, implement a stopping rule: write down required sample BEFORE launch, set calendar reminder, refuse to open the report until that date.
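A small simulation makes the cost of peeking concrete. The sketch below runs A/A tests (both arms identical, so any "winner" is a false positive) and compares calling a winner on any day against checking once at the end. It uses a plain frequentist z-test, which is an assumption and not necessarily how VWO computes its probability figure; traffic, conversion rate, and run count are arbitrary illustrative values.

```python
import math
import random
from statistics import NormalDist


def is_significant(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z))) < alpha


def simulate_aa_tests(runs=300, days=14, daily_visitors=200, rate=0.05, seed=7):
    """Return (false-positive rate with daily peeks, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _run in range(runs):
        conv_a = conv_b = n = 0
        flagged_on_any_day = False
        significant_today = False
        for _day in range(days):
            n += daily_visitors
            conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
            conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
            significant_today = is_significant(conv_a, n, conv_b, n)
            flagged_on_any_day = flagged_on_any_day or significant_today
        peeking_hits += flagged_on_any_day
        final_hits += significant_today  # only the last day's check counts here
    return peeking_hits / runs, final_hits / runs


peeking_rate, final_rate = simulate_aa_tests()
print(f"false positives with daily peeks: {peeking_rate:.1%}")
print(f"false positives with one final look: {final_rate:.1%}")
```

Because the daily-peek rule includes the final day, its false-positive rate can never be lower than the single-look rate; the gap between the two is what a pre-declared stopping rule removes.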
Step 6
Variants that win in week 1 often lose by week 4 because users habituate to the new design. Tests must run long enough to average out novelty + seasonality.
If your test is showing strong lift in week 1 but the lift fades by week 3-4: classic novelty effect. Users initially respond to "different" but normalize over time.
Mitigation: run tests for at least 2 full weekly cycles (14 days minimum) to wash out weekday/weekend differences AND give novelty time to fade. Longer is better for paid-retention products where the effect plays out over a sales cycle.
For seasonality contamination: did your test run through a major event (sale, holiday, marketing push)? Traffic mix shifts during these events, often dramatically. Variants that won during Black Friday may not win in normal traffic.
Fix: if the test ran through an event, segment the analysis to pre-event and post-event windows. If both windows agree on the winner, ship with confidence. If they disagree, the event was the source of the signal — don't trust the overall result.
Best practice: avoid launching new tests through known major events. Pause active tests during Black Friday, end-of-year sales, etc. — restart fresh in normal traffic.
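The pre-event/post-event comparison described above can be sketched as a direction check. This is deliberately simple: it only asks whether both windows point the same way and ignores statistical significance within each window. The function names and the counts in the usage lines are hypothetical.

```python
def lift(control_conv, control_visitors, variant_conv, variant_visitors):
    """Relative lift of variant over control within one time window."""
    control_rate = control_conv / control_visitors
    variant_rate = variant_conv / variant_visitors
    return (variant_rate - control_rate) / control_rate


def windows_agree(pre_event, post_event):
    """Each window is (control_conv, control_visitors, variant_conv, variant_visitors).

    Returns True only when both windows show a non-zero lift in the same direction.
    """
    pre, post = lift(*pre_event), lift(*post_event)
    return pre != 0 and post != 0 and (pre > 0) == (post > 0)


# Hypothetical counts exported per window from the test report.
print(windows_agree((100, 4_000, 120, 4_000), (90, 4_000, 105, 4_000)))
print(windows_agree((100, 4_000, 120, 4_000), (90, 4_000, 80, 4_000)))
```

When the windows disagree, the event is the more likely source of the signal, which matches the guidance to distrust the overall result.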
Step 7
After diagnosis, you have three honest options. Each is valid; the wrong move is hoping for significance that won't come.
Option 1 — Extend the test: if calculated sample isn't yet reached AND no breakage/SRM/peeking issues, keep running. Set a hard end date based on calculated sample. Refuse to peek.
Option 2 — Ship the variant despite an inconclusive result: if calculated sample IS reached, probability is 80-95% (not 95+), and the change is low-risk and cheap to reverse, you can ship it as a directional decision. Monitor post-launch for 30 days; revert if conversion drops.
Option 3 — Kill the test as inconclusive: if calculated sample is reached and probability is <80%, ship NEITHER and move on. Most tests should end here. Inconclusive results are honest results.
Avoid: 'let me extend the test indefinitely and hope.' Tests over 60 days accumulate seasonality and traffic-mix contamination. The longer you run, the less reliable the result.
Document the decision in your test log: hypothesis, result, decision, reasoning. Even inconclusive tests teach the team what doesn't move the needle — that's valuable learning.
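The three options above can be written down as a single decision function so the team applies the same rule every time. This is a sketch under assumptions: `next_action` is a hypothetical name, the 95% ship branch comes from the rule in Step 5 (which also requires lift above MDE, not modeled here), and treating an 80-95% result on a high-risk change as inconclusive is this example's choice, not something the tutorial specifies.

```python
def next_action(sample_reached, has_setup_issue, probability, low_risk_change):
    """Map the diagnosis onto one of the Step 7 options.

    probability: reported chance the variant beats control, from 0 to 1.
    has_setup_issue: True for a broken variant or Sample Ratio Mismatch.
    """
    if has_setup_issue:
        return "fix the issue and restart the test"
    if not sample_reached:
        return "extend: keep running until the calculated sample, no peeking"
    if probability >= 0.95:
        return "ship the variant"
    if probability >= 0.80 and low_risk_change:
        return "ship directionally and monitor for 30 days"
    return "declare inconclusive and ship neither"


print(next_action(sample_reached=True, has_setup_issue=False,
                  probability=0.85, low_risk_change=True))
```

Encoding the rule before launch also doubles as the stopping rule: the inputs are fixed in advance, so there is less room to rationalize after seeing the numbers.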
Common mistakes
Calling tests "stuck" before calculated sample is reached
What goes wrong: Team peeks at day 7, sees Variant at 88% probability, declares 'stuck' and starts troubleshooting. Test wasn't stuck — just half-baked. Time and attention diverted to fake problem; test gets killed or extended unnecessarily. ~$1,500-4,000 in misdirected team time + opportunity cost from skipping the actual problem (defining MDE upfront).
How to avoid: Calculate required sample BEFORE launch. A test isn't "stuck" until current sample exceeds required and probability is still <95%.
Interpreting Sample Ratio Mismatch (SRM) as random noise
What goes wrong: A test with 5,000 vs 5,800 visitors looks 'close to 50/50'. Team interprets as normal variance. In reality, the 16% imbalance signals something systematic excluding traffic from one variant — and that something correlates with conversion behavior. The 'winning' variant ships, conversion drops 5% on full traffic. ~$5,000-15,000/quarter in losses from SRM-contaminated wins.
How to avoid: Any split outside 5% of expected is an SRM red flag. Investigate (page load issues, CDN caching, bot exclusions) before trusting the result.
Designing tests with unrealistic MDE
What goes wrong: Team sets MDE at 5% relative lift on a 2% conversion page. Required sample is roughly 300K per variant under a standard two-proportion power calculation (95% confidence, 80% power). Tests run 60-90 days and 70% end inconclusive. Team loses faith in CRO. On a $40K/mo ad-spend account, ~$15,000-30,000/year in CRO infrastructure spent producing no wins.
How to avoid: For most pages, design tests targeting 10-15% MDE. If your effects are smaller, test bigger changes — radical redesigns, full-page experiments, not button-color tweaks.
Peeking daily and calling winners early
What goes wrong: Daily peeks inflate false-positive rate from 5% to 30-40%. Teams ship 4-5 'winners' per quarter that don't hold post-launch. Each false win costs $2,000-8,000 in design + dev + opportunity cost. ~$10,000-30,000/quarter on most accounts.
How to avoid: Pre-declared sample size + pre-declared stopping rule. Refuse to peek before sample lands. Use a calendar reminder.
Letting tests run >60 days hoping for significance
What goes wrong: Test runs for 90+ days, accumulating seasonality contamination. Result becomes uninterpretable: did the variant win because it's better, or because the traffic mix shifted seasonally? Decisions on long-running tests are unreliable. ~$3,000-10,000 in opportunity cost per test from extended run + uncertain result.
How to avoid: Hard cap test runtime at 60 days. If you haven't reached significance, declare inconclusive and move on. Better to ship 5 inconclusive-but-decisive tests than 1 contaminated long-running test.
Not documenting inconclusive tests
What goes wrong: Team logs only winners. Re-tests old hypotheses 6 months later because no one remembers what was tried. ~30% of test slots get spent re-running prior inconclusive tests. On a $30K/mo ad-spend account, ~$5,000-10,000/year in repeated CRO work.
How to avoid: Document every test (winner, loser, inconclusive) in a test log: hypothesis, variant, sample, result, decision. Reference before launching new tests.
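One possible shape for such a test log entry, using the fields listed above, is sketched below. The class name `ExperimentLogEntry` and the sample values are hypothetical; a spreadsheet with the same columns works just as well.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExperimentLogEntry:
    hypothesis: str
    variant: str
    sample_per_variant: int
    result: str       # "winner", "loser", or "inconclusive"
    decision: str
    reasoning: str = ""


# Hypothetical entry for illustration only.
entry = ExperimentLogEntry(
    hypothesis="Shorter checkout form increases completed orders",
    variant="Form reduced from 9 fields to 5",
    sample_per_variant=14_000,
    result="inconclusive",
    decision="ship neither",
    reasoning="Calculated sample reached; probability stayed below 80%",
)

print(json.dumps(asdict(entry), indent=2))
```

Keeping entries in a structured format makes it easy to search past hypotheses before launching a new test.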
Recap
Done — what's next
How to set up a VWO A/B test the right way
Read the next tutorial
Hand it off
Test methodology is a discipline, not a feature. A specialist diagnoses stuck tests in 15 minutes (vs hours of self-debugging) and rebuilds your testing program so wins compound and inconclusive tests don't multiply. Engagements run $500-1,200/mo at $14-16/hr.
See specialist rates
95% is the academic standard and the right default for most decisions. For low-risk, easily-reversed changes (button text, microcopy), 90% is acceptable. For high-risk changes (pricing, checkout flow), wait for 97%+ and confirm with a 30-day post-launch monitor. Never ship below 80% — the false-positive rate is too high.
Probably not yet. Early-test "high probability" is often peeking bias and novelty effect. Wait until calculated sample is reached AND the test has run at least 14 days (2 weekly cycles). If 99% probability holds at that point, ship.
Yes, but at significant sample-size cost. Detecting a 3% lift typically requires roughly 11x the sample of detecting a 10% lift, because required sample grows with the inverse square of the effect size. On most pages, this means 6+ months of test runtime, which exceeds the seasonality-safe runtime window. For sites that genuinely need small-effect detection (large ecom, high-traffic SaaS), VWO's Bayesian SmartStats helps. For most: test bigger changes instead.
Yes. Even if your traffic-volume math says sample is reached in 5 days, run for at least 14 days to wash out weekday/weekend cycles and novelty effects. If both calculated sample AND 14 days are met, you can read the result.
Compare visitor counts per variant. For 50/50 split, the visitors should be within 5% of each other (e.g., 5,000 vs 4,800 = fine; 5,000 vs 4,200 = SRM). VWO doesn't flag SRM automatically — manual check. If detected, debug page-load timing, CDN caching, and bot exclusions before trusting the result.
Once. If calculated sample wasn't actually reached, extend until it is. If sample IS reached and probability is still <95%, the test is genuinely inconclusive. Extending further accumulates seasonality contamination without improving signal. Hard-cap at 60 days total runtime.