How to troubleshoot VWO A/B tests that won't reach significance

60-70% of A/B tests don't reach 95% significance. That doesn't mean VWO is broken — it usually means the test was designed wrong. This is the diagnostic that separates 'no winner' from 'broken setup.'

~2 hr · Intermediate · Updated May 26, 2026

Who this is for

Teams running VWO tests that keep coming back inconclusive or take 60+ days without resolution. If your last 3 tests all ended at 75-90% probability and you're not sure whether to ship or wait, this troubleshooting flow is for you.

What you'll need

A current VWO test stuck below significance for 14+ days
Access to the test report (Testing → click the test)
Your 30-day baseline conversion rate for the test page
About 1-2 hours including diagnosis + fix application
Patience to NOT call the test early while diagnosing

Step 1

Check whether sample size has actually been reached

Most tests "stuck at 90% significance" haven't reached calculated sample yet. Pull the math first — half of "stuck tests" aren't actually stuck.

Pull your required sample size from the VWO Sample Size Calculator (run it now if you skipped this before launch), then compare it to current visitors per variant.

Common pattern: required sample was 14,000 per variant; current is 9,000 per variant. The test isn't stuck; it's about 64% done. Keep waiting.

If you didn't calculate sample upfront: do it now. Use current baseline conversion rate, 10% MDE, 95% confidence, 80% power. The calculator returns required sample.

If current sample is <80% of required: not stuck. Wait. Don't peek daily.

If current sample is >120% of required AND probability is <90%: there's no clear winner. Move to step 2.

If current sample is between 80-120%: borderline. Wait 2-3 more days, then evaluate.
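
The thresholds above can be sketched in a few lines of Python. This is a minimal sketch that assumes a classical two-sided two-proportion z-test; VWO's own calculator may use a different statistical model and return different numbers. The function names `required_sample_per_variant` and `sample_status` are made up for illustration.

```python
import math
from statistics import NormalDist


def required_sample_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    """Visitors per variant for a two-sided two-proportion z-test (an assumption,
    not necessarily the model behind VWO's calculator)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def sample_status(current, required, probability):
    """Apply the Step 1 thresholds (80% and 120% of required sample)."""
    progress = current / required
    if progress < 0.80:
        return "not stuck: keep waiting"
    if progress <= 1.20:
        return "borderline: wait 2-3 more days"
    if probability < 0.90:
        return "no clear winner: continue to step 2"
    return "sample reached: evaluate against your ship threshold"


# A 10% baseline with a 10% relative MDE lands near the 14,000 used in the example.
print(required_sample_per_variant(baseline=0.10, relative_mde=0.10))
print(sample_status(current=9_000, required=14_000, probability=0.88))
```

Run it with your own baseline rate before deciding whether a test is stuck.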

Step 2

Check for variant breakage (the silent killer)

A variant with zero conversions isn't 'not winning' — it's broken. Compare conversion COUNTS, not just rates. A clean variant should have non-zero conversions within the first few hundred visitors.

Open the test report. Look at absolute conversion counts per variant, not just conversion rates.

If one variant has zero or near-zero conversions despite hundreds of visitors, the variant is broken. Common causes:

— Variant CSS hides the conversion CTA (a layout change broke the button position).

— Variant JS errors prevent the conversion action (form submit broken, link broken).

— Variant analytics tracking is broken (variant fires but the goal event doesn't fire).

Reproduce: open the test page in incognito and force the variant via VWO Helper Chrome extension. Try to complete the conversion. If you can't (or hit a console error), the variant is broken.

Fix the variant, then RESTART the test from scratch. Don't try to salvage data from a broken test — the data is contaminated.
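
To judge whether a low conversion count is bad luck or breakage, you can ask how likely that count would be for a healthy variant converting at your baseline rate. The sketch below uses a plain binomial model; the 0.001 threshold and the function names are assumptions for illustration, not a VWO feature.

```python
import math


def prob_at_most(k, visitors, baseline_rate):
    """P(conversions <= k) for a healthy variant converting at baseline_rate."""
    return sum(
        math.comb(visitors, i) * baseline_rate**i * (1 - baseline_rate) ** (visitors - i)
        for i in range(k + 1)
    )


def looks_broken(conversions, visitors, baseline_rate, threshold=0.001):
    """Flag a variant whose conversion count is implausibly low for its traffic."""
    return prob_at_most(conversions, visitors, baseline_rate) < threshold


# Hypothetical numbers: 500 visitors on a page with a 2.3% baseline.
print(looks_broken(conversions=0, visitors=500, baseline_rate=0.023))
print(looks_broken(conversions=9, visitors=500, baseline_rate=0.023))
```

A flag here is a prompt to reproduce the conversion by hand, not proof of a bug.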

Step 3

Check for sample ratio mismatch (SRM)

If your 50/50 split is actually showing 55/45 traffic distribution, there's a contamination issue — usually caching, redirects, or bot traffic affecting one variant differently.

Open the test report. Look at total visitors per variant. For a 50/50 test, the split should be within ~5% of even. (5,000 vs 5,200 = fine. 5,000 vs 6,500 = SRM.)

Sample Ratio Mismatch (SRM) is a red flag. It means something is selectively excluding traffic from one variant, contaminating the comparison.

Common SRM causes:

— Variant uses a slow-loading custom font/CSS that times out for some users (those users see Control because Variant fails to render in time).

— Variant is cached differently by CDN — one variant gets cached HTML while the other gets fresh HTML.

— Bot traffic disproportionately hits one variant (bots don't reliably re-roll variants on each visit).

If SRM is present, the test result is unreliable. Pause the test, debug the cause, restart fresh. Don't try to interpret SRM'd data.
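
A common way to put a number on SRM is a chi-square goodness-of-fit test on the visitor counts. The sketch below uses only the standard library; the 0.01 cutoff is an assumption (teams pick different thresholds), and `srm_p_value` is a hypothetical helper name.

```python
import math


def srm_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for a two-arm split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


for a, b in [(5_000, 5_200), (5_000, 6_500)]:
    p = srm_p_value(a, b)
    verdict = "SRM suspected" if p < 0.01 else "split looks plausible"
    print(a, b, f"p={p:.4g}", verdict)
```

The p-value depends on total traffic, so on very large tests a split inside the 5% rule of thumb can still fail this check. Treat the two as complementary signals.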

Step 4

Check whether your MDE is realistic

Most variants don't lift conversion by 10%+. If your test is designed to detect 10% lift but the variant only moves the needle 3%, the test will run forever without significance.

Open the report. Check current relative lift: e.g., Variant at 2.4% vs Control at 2.3% = ~4% relative lift.

Your declared MDE was likely 10% (or higher). The actual effect appears to be much smaller.

Two possible interpretations:

— The variant has a real but small effect (3-5% lift). Required sample scales with roughly 1/MDE², so detecting this with confidence takes about 4-11x the sample you planned for a 10% MDE. On a 1K-visitor/day page, that's many months of test runtime. Usually not worth it.

— The variant has NO real effect. The observed 'lift' is random noise. With enough sample, the rate would converge back to no difference.

Decision: if the observed lift is below your MDE after substantial sample, declare inconclusive and ship neither. Don't extend the test hoping for significance — at small effect sizes, you'll get false positives more often than real wins.

For your next test, either design bigger changes (test things expected to move conversion 10%+) or accept smaller MDE with the runtime cost.
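
The arithmetic behind this step is short enough to check yourself. This sketch assumes the usual approximation that required sample grows with 1/MDE²; the function names are illustrative.

```python
def relative_lift(control_rate, variant_rate):
    """Relative lift of the variant over control, e.g. 0.04 for +4%."""
    return (variant_rate - control_rate) / control_rate


def sample_multiplier(planned_mde, target_mde):
    """Approximate factor by which required sample grows when the MDE shrinks."""
    return (planned_mde / target_mde) ** 2


lift = relative_lift(control_rate=0.023, variant_rate=0.024)
print(f"observed lift: {lift:.1%}")
for target in (0.05, 0.03):
    factor = sample_multiplier(0.10, target)
    print(f"detecting {target:.0%} instead of 10%: ~{factor:.1f}x sample")
```

Multiply your planned runtime by that factor to see whether chasing the smaller effect is realistic on your traffic.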

Step 5

Check for peeking bias and stopped-early tests

If you've been opening the report daily, you've already been biased. Future decisions on this test are likely wrong. Acknowledge it and reset.

Honestly: how many times have you opened the test report since launch? If it's more than 5 times in a 14-day window, peeking has likely biased your judgment.

Peeking creates two problems: (a) you're more likely to call a winner on a random spike, (b) you're emotionally invested in a direction, which biases future test interpretation.
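
Problem (a) is easy to see in a simulation. The sketch below runs A/A tests (both arms identical, so every "winner" is a false positive) and compares calling a winner at any daily peek with reading the result once at the end. It assumes a fixed-horizon two-proportion z-test, which is not how VWO's Bayesian SmartStats computes probability, and the traffic numbers are hypothetical.

```python
import math
import random


def z_significant(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Two-proportion z-test at roughly the 95% level (two-sided)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return False
    return abs(conv_a / n_a - conv_b / n_b) / se > z_crit


def simulate_aa_test(rng, rate=0.10, daily_visitors=200, days=14):
    """Run one A/A test; return (significant at any daily peek, significant at the end)."""
    conv_a = conv_b = n = 0
    any_peek = False
    final = False
    for _ in range(days):
        n += daily_visitors
        conv_a += sum(rng.random() < rate for _ in range(daily_visitors))
        conv_b += sum(rng.random() < rate for _ in range(daily_visitors))
        final = z_significant(conv_a, n, conv_b, n)
        any_peek = any_peek or final
    return any_peek, final


def false_positive_rates(runs=500, seed=1):
    """Share of A/A tests that produce a 'winner' with and without daily peeking."""
    rng = random.Random(seed)
    results = [simulate_aa_test(rng) for _ in range(runs)]
    peeking = sum(peek for peek, _ in results) / runs
    single_look = sum(final for _, final in results) / runs
    return peeking, single_look


peeking, single_look = false_positive_rates()
print(f"called a winner at some daily peek: {peeking:.1%}")
print(f"called a winner only at the planned end: {single_look:.1%}")
```

The exact rates depend on the seed and the number of peeks, but the peeking rate can never be lower than the single-look rate, because the final look is one of the peeks.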

If you've been peeking, the cleanest reset is: write down your current expectation ("I think Variant is winning by 8%"), close the report, set a calendar reminder for the calculated sample date, and don't reopen until then.

When you do reopen, evaluate against the calculated sample threshold ONLY. Probability >95% AND lift > MDE = ship. Otherwise = inconclusive.

For future tests, implement a stopping rule: write down required sample BEFORE launch, set calendar reminder, refuse to open the report until that date.
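
The calendar reminder can be computed at launch. This is a small sketch that assumes traffic is split evenly across variants and applies a 14-day minimum runtime; the dates and traffic figures are hypothetical.

```python
import math
from datetime import date, timedelta


def planned_read_date(launch, required_per_variant, daily_visitors, variants=2, min_days=14):
    """Earliest date to open the report: sample reached AND minimum runtime met."""
    per_variant_per_day = daily_visitors / variants
    days_for_sample = math.ceil(required_per_variant / per_variant_per_day)
    return launch + timedelta(days=max(days_for_sample, min_days))


read_on = planned_read_date(date(2026, 6, 1), required_per_variant=14_000, daily_visitors=1_000)
print(read_on.isoformat())  # 2026-06-29
```

Put that date in the calendar before launch and treat it as the first day the report exists.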

Step 6

Check for novelty effects and seasonality

Variants that win in week 1 often lose by week 4 because users habituate to the new design. Tests must run long enough to average out novelty + seasonality.

If your test is showing strong lift in week 1 but the lift fades by week 3-4: classic novelty effect. Users initially respond to "different" but normalize over time.

Mitigation: run tests for at least 2 full weekly cycles (14 days minimum) to wash out weekday/weekend differences AND give novelty time to fade. Longer is better for paid-retention products where the effect plays out over a sales cycle.

For seasonality contamination: did your test run through a major event (sale, holiday, marketing push)? Traffic mix shifts during these events, often dramatically. Variants that won during Black Friday may not win in normal traffic.

Fix: if the test ran through an event, segment the analysis to pre-event and post-event windows. If both windows agree on the winner, ship with confidence. If they disagree, the event was the source of the signal — don't trust the overall result.
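
The window comparison can be sketched as follows. It only checks whether both windows point in the same direction; it does not test significance, and each window has less sample than the full test. All counts are hypothetical.

```python
def lift(control_conv, control_visitors, variant_conv, variant_visitors):
    """Relative lift of variant over control within one window."""
    control_rate = control_conv / control_visitors
    variant_rate = variant_conv / variant_visitors
    return (variant_rate - control_rate) / control_rate


def windows_agree(pre_event, post_event):
    """True when both windows show a lift in the same direction."""
    return lift(**pre_event) * lift(**post_event) > 0


# Hypothetical counts for illustration only.
pre = dict(control_conv=120, control_visitors=5_000, variant_conv=138, variant_visitors=5_000)
post = dict(control_conv=131, control_visitors=5_000, variant_conv=125, variant_visitors=5_000)

print(f"pre-event lift {lift(**pre):+.1%}, post-event lift {lift(**post):+.1%}")
if windows_agree(pre, post):
    print("windows agree")
else:
    print("windows disagree: do not trust the overall result")
```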

Best practice: avoid launching new tests through known major events. Pause active tests during Black Friday, end-of-year sales, etc. — restart fresh in normal traffic.

Step 7

Decide: extend, ship, or kill

After diagnosis, you have three honest options. Each is valid; the wrong move is hoping for significance that won't come.

Option 1 — Extend the test: if calculated sample isn't yet reached AND no breakage/SRM/peeking issues, keep running. Set a hard end date based on calculated sample. Refuse to peek.

Option 2 — Ship the variant despite inconclusive: if calculated sample IS reached, probability is 80-95% (not 95+), and the change is low-risk + low-reversal-cost, you can make a 'directional' shipping decision. Monitor post-launch for 30 days; revert if conversion drops.

Option 3 — Kill the test as inconclusive: if calculated sample is reached and probability is <80%, ship NEITHER and move on. Most tests should end here. Inconclusive results are honest results.

Avoid: 'let me extend the test indefinitely and hope.' Tests over 60 days accumulate seasonality and traffic-mix contamination. The longer you run, the less reliable the result.

Document the decision in your test log: hypothesis, result, decision, reasoning. Even inconclusive tests teach the team what doesn't move the needle — that's valuable learning.
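
The options above can be written down as a small decision function so the rule is fixed before anyone looks at the numbers. This is one possible encoding, not a VWO feature. Two branches are assumptions: a setup issue (breakage or SRM) always means restart, and an 80-95% result on a change that is not low-risk is treated as inconclusive.

```python
def decide(current_sample, required_sample, probability, days_running,
           setup_issue_found=False, low_risk_change=False, max_days=60):
    """Map diagnosis results to the options in Step 7."""
    if setup_issue_found:
        return "fix and restart"  # breakage or SRM: the data is contaminated
    if current_sample < required_sample:
        if days_running >= max_days:
            return "kill as inconclusive"  # runtime cap reached before sample
        return "extend"
    if probability >= 0.95:
        return "ship"
    if probability >= 0.80 and low_risk_change:
        return "ship directionally, monitor 30 days"
    return "kill as inconclusive"


print(decide(9_000, 14_000, probability=0.88, days_running=10))
print(decide(15_000, 14_000, probability=0.85, days_running=30, low_risk_change=True))
print(decide(15_000, 14_000, probability=0.70, days_running=30))
```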

Common mistakes

What goes wrong (and how to avoid it)

Calling tests "stuck" before calculated sample is reached
What goes wrong: Team peeks at day 7, sees Variant at 88% probability, declares 'stuck' and starts troubleshooting. Test wasn't stuck, just half-baked. Time and attention diverted to fake problem; test gets killed or extended unnecessarily. ~$1,500-4,000 in misdirected team time + opportunity cost from skipping the actual problem (defining MDE upfront).
How to avoid: Calculate required sample BEFORE launch. A test isn't "stuck" until current sample exceeds required and probability is still <95%.

Interpreting Sample Ratio Mismatch (SRM) as random noise
What goes wrong: A test with 5,000 vs 5,800 visitors looks 'close to 50/50'. Team interprets as normal variance. In reality, the 16% imbalance signals something systematic excluding traffic from one variant, and that something correlates with conversion behavior. The 'winning' variant ships, conversion drops 5% on full traffic. ~$5,000-15,000/quarter in losses from SRM-contaminated wins.
How to avoid: Any split outside 5% of expected is an SRM red flag. Investigate (page load issues, CDN caching, bot exclusions) before trusting the result.

Designing tests with unrealistic MDE
What goes wrong: Team sets MDE at 5% relative lift on a 2% conversion page. Under a standard two-proportion power calculation (95% confidence, 80% power), required sample is roughly 300K per variant. Tests run 60-90 days and 70% end inconclusive. Team loses faith in CRO. On a $40K/mo ad-spend account, ~$15,000-30,000/year in CRO infrastructure spent producing no wins.
How to avoid: For most pages, design tests targeting 10-15% MDE. If your effects are smaller, test bigger changes: radical redesigns and full-page experiments, not button-color tweaks.

Peeking daily and calling winners early
What goes wrong: Daily peeks can inflate the false-positive rate from 5% to 30-40%. Teams ship 4-5 'winners' per quarter that don't hold post-launch. Each false win costs $2,000-8,000 in design + dev + opportunity cost. ~$10,000-30,000/quarter on most accounts.
How to avoid: Pre-declared sample size + pre-declared stopping rule. Refuse to peek before sample lands. Use a calendar reminder.

Letting tests run >60 days hoping for significance
What goes wrong: Test runs for 90+ days, accumulating seasonality contamination. Result becomes uninterpretable: did the variant win because it's better, or because the traffic mix shifted seasonally? Decisions on long-running tests are unreliable. ~$3,000-10,000 in opportunity cost per test from extended run + uncertain result.
How to avoid: Hard cap test runtime at 60 days. If you haven't reached significance, declare inconclusive and move on. Better to ship 5 inconclusive-but-decisive tests than 1 contaminated long-running test.

Not documenting inconclusive tests
What goes wrong: Team logs only winners. Re-tests old hypotheses 6 months later because no one remembers what was tried. ~30% of test slots get spent re-running prior inconclusive tests. On a $30K/mo ad-spend account, ~$5,000-10,000/year in repeated CRO work.
How to avoid: Document every test (winner, loser, inconclusive) in a test log: hypothesis, variant, sample, result, decision. Reference before launching new tests.
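
A test log does not need special tooling. As a minimal sketch, the fields named above fit in a small record type. The class name, the lookup helper, and the sample entry are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExperimentLogEntry:
    hypothesis: str
    variant: str
    sample_per_variant: int
    result: str    # "winner", "loser", or "inconclusive"
    decision: str  # what was shipped, if anything
    reasoning: str


def previously_tested(log, keyword):
    """Return earlier entries whose hypothesis mentions the keyword."""
    return [e for e in log if keyword.lower() in e.hypothesis.lower()]


# Hypothetical entry for illustration only.
log = [
    ExperimentLogEntry(
        hypothesis="Shorter signup form increases completed signups",
        variant="3-field form",
        sample_per_variant=14_200,
        result="inconclusive",
        decision="ship neither",
        reasoning="Sample reached; probability stayed below 80%",
    )
]

print(json.dumps([asdict(e) for e in log], indent=2))
print(len(previously_tested(log, "signup form")))
```

Searching the log before launch is what keeps an old inconclusive hypothesis from taking up a new test slot.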

Recap

What to take away

First check: is calculated sample actually reached? Most "stuck" tests are still cooking.
Check variant breakage — zero conversions on a variant means broken, not losing.
Check Sample Ratio Mismatch (SRM) — any imbalance >5% is a contamination red flag.
Check if your MDE is realistic. Real lifts are usually <10%; tests designed for 10% MDE often go inconclusive.
Stop peeking. Pre-declared stopping rule fixes most diagnostic problems.
Hard-cap runtime at 60 days. Beyond that, contamination > signal.
Three honest endings: extend (if sample not reached), ship directionally (if reasonable), or kill as inconclusive. All valid.

Done — what's next

How to set up a VWO A/B test the right way

Hand it off

Test methodology is a discipline, not a feature. A specialist diagnoses stuck tests in 15 minutes (vs hours of self-debugging) and rebuilds your testing program so wins compound and inconclusive tests don't multiply. Engagements run $500-1,200/mo at $14-16/hr.

Frequently Asked Questions

What probability threshold should I use to ship?

95% is the academic standard and the right default for most decisions. For low-risk, easily-reversed changes (button text, microcopy), 90% is acceptable. For high-risk changes (pricing, checkout flow), wait for 97%+ and confirm with a 30-day post-launch monitor. Never ship below 80% — the false-positive rate is too high.

My test is at 99% probability after 3 days — can I ship?

Probably not yet. Early-test "high probability" is often peeking bias and novelty effect. Wait until calculated sample is reached AND the test has run at least 14 days (2 weekly cycles). If 99% probability holds at that point, ship.

What if my MDE is below 5%? Can I detect small lifts?

Yes, but at significant sample-size cost. Because required sample scales with roughly 1/MDE², detecting a 5% lift takes about 4x the sample of detecting a 10% lift, and a 3% lift takes about 11x. On most pages, this means 6+ month test runtime, which exceeds the seasonality-safe runtime window. For sites that genuinely need small-effect detection (large ecom, high-traffic SaaS), VWO's Bayesian SmartStats helps. For most: test bigger changes instead.

Should I always run tests for at least 2 weeks regardless of sample?

Yes. Even if your traffic-volume math says sample is reached in 5 days, run for at least 14 days to wash out weekday/weekend cycles and novelty effects. If both calculated sample AND 14 days are met, you can read the result.

How do I tell if my variant has Sample Ratio Mismatch (SRM)?

Compare visitor counts per variant. For 50/50 split, the visitors should be within 5% of each other (e.g., 5,000 vs 4,800 = fine; 5,000 vs 4,200 = SRM). VWO doesn't flag SRM automatically — manual check. If detected, debug page-load timing, CDN caching, and bot exclusions before trusting the result.

Can I extend a test that's been running 30 days without significance?

Once. If calculated sample wasn't actually reached, extend until it is. If sample IS reached and probability is still <95%, the test is genuinely inconclusive. Extending further accumulates seasonality contamination without improving signal. Hard-cap at 60 days total runtime.

Related tutorials

How to set up a VWO A/B test the right way

How to set up a VWO multivariate test the right way

How to set up VWO Funnel Analysis and find drop-off

How to set up Hotjar Session Recordings the right way

How to set up a Microsoft Clarity account
