A/B testing in VWO is 20% setup and 80% statistical discipline. Most teams skip the sample-size math, call winners early, and ship 'wins' that don't hold. This is the workflow that produces tests you can actually trust.
Who this is for
Teams with VWO installed who want to test specific changes on landing pages, pricing, checkout, or product pages. If you've already shipped 5-10 'gut feel' changes this year and conversion is flat, A/B testing is how you replace guesswork with evidence.
What you'll need
Step 1
Don't open the visual editor first. Write the hypothesis: 'If we [change X], then [metric Y] will improve by [Z%] because [user reasoning].' Specific hypotheses produce specific tests.
Bad hypothesis: 'Let's try a new hero on the homepage.' This produces a vague test, vague results, and no learning.
Good hypothesis: 'If we replace the hero text "All-in-one CRO platform" with "Increase conversions by 30% in 90 days," the click-through to /pricing will improve by 15%, because specificity outperforms abstraction for B2B SaaS visitors.'
The hypothesis must specify: (a) the change, (b) the metric to move, (c) the expected lift size, (d) the user-psychology reason. The reason matters because a variant that wins even though its stated reason cannot explain the lift is a red flag: the 'win' is usually noise.
Score hypotheses by ICE (Impact × Confidence × Ease) or similar. Run the top-scored hypothesis first. Document the others in a backlog — they're tests 2-10 once you finish this one.
Pro tip: hypotheses come from data, not opinions. Source from heatmaps (Hotjar/Clarity/VWO), recordings, exit-intent surveys, GA4 drop-off data, and support tickets. Hypotheses sourced from "the CEO thought of this" win at ~15%; hypotheses sourced from real user friction win at 40-50%.
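The ICE scoring above is simple enough to keep in a spreadsheet, but a short script makes the ranking explicit. This is a minimal sketch: the 1-10 scale, the `Hypothesis` class, and the three backlog entries are illustrative assumptions, not anything VWO provides.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    impact: int      # assumed 1-10 scale: how much the metric could move
    confidence: int  # assumed 1-10 scale: how strong the supporting evidence is
    ease: int        # assumed 1-10 scale: how cheap the test is to build

    @property
    def ice(self) -> int:
        # ICE as a product, matching "Impact x Confidence x Ease"
        return self.impact * self.confidence * self.ease


def rank_backlog(hypotheses):
    """Return hypotheses sorted by ICE score, highest first."""
    return sorted(hypotheses, key=lambda h: h.ice, reverse=True)


# Hypothetical backlog entries for illustration only.
backlog = [
    Hypothesis("New testimonial layout", impact=3, confidence=4, ease=8),
    Hypothesis("Specific hero headline on pricing page", impact=7, confidence=6, ease=9),
    Hypothesis("Shorter checkout form", impact=8, confidence=5, ease=4),
]

ranked = rank_backlog(backlog)
for h in ranked:
    print(f"{h.ice:4d}  {h.name}")
```

The first row is the test to run now; the rest is your backlog.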
Step 2
Use VWO's built-in calculator (Tools → Sample Size Calculator) or any external tool. Without this, you don't know when to stop the test — and you'll call winners early.
Open VWO → Tools → Sample Size Calculator. Enter: current baseline conversion rate (from your 30-day baseline), minimum detectable effect (e.g., 10% relative lift), statistical confidence (95% standard), and statistical power (80% standard).
Example: baseline = 2.3%, MDE = 10% relative, confidence = 95%, power = 80% → required sample = ~70,000 visitors per variant = ~140,000 total for a 2-variant test (standard two-sided two-proportion formula; calculators differ slightly in their assumptions).
Divide that by your daily traffic to that page. If the page does 5,000 visitors/day, the test needs about 28 days to collect the required sample. Plan accordingly, and don't stop early.
If the required sample is impossibly large (e.g., 6 months of traffic to detect 5% lift on a low-traffic page), the test is the wrong test. Either pick a higher-traffic page, target a larger expected effect (test bigger changes), or use Bayesian SmartStats, which handles small samples more efficiently.
Rule: if your test cannot collect the required sample in <60 days, change the test scope. Tests running longer than 60 days accumulate seasonality noise, traffic-mix shifts, and SmartCode changes that contaminate the read.
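If you want to sanity-check a calculator's output, the sample-size math can be sketched with the classic two-sided two-proportion normal approximation. This is an assumption about method, not VWO's implementation, so results can differ from what VWO's calculator reports. The function names and the 5,000 visitors/day figure are hypothetical; the duration is rounded up to whole weeks to cover full weekday/weekend cycles.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    """Visitors needed per variant (two-sided two-proportion z-test approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def duration_days(per_variant, variants, daily_visitors):
    """Days needed to collect the full sample, rounded up to whole weeks."""
    days = math.ceil(per_variant * variants / daily_visitors)
    return math.ceil(days / 7) * 7


n = sample_size_per_variant(baseline=0.023, relative_mde=0.10)
days = duration_days(n, variants=2, daily_visitors=5000)  # assumed traffic level
verdict = "change the test scope" if days > 60 else "ok to run"
print(f"{n} visitors per variant, {days} days: {verdict}")
```

Halving the MDE roughly quadruples the required sample, which is why small expected lifts on low-traffic pages blow past the 60-day limit so quickly.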
Step 3
Testing → Create New → A/B Test. Use the visual editor to design the variant. Keep it simple — one change per test if possible.
In the left sidebar, Testing → click Create New → A/B Test.
Name the test specifically: 'Pricing page hero — specificity hypothesis — May 2026'. Default names ('Untitled Test 14') become orphan tests in 90 days.
Enter the page URL. VWO loads the page in its visual editor. Click Add Variation. You can edit text, swap images, hide elements, restyle CSS — all visually, no code required.
Single-variable tests (one change at a time) are easier to read. Multi-change variants ('new hero + new CTA + new pricing card') win or lose without telling you WHICH change drove it. Use multivariate tests (next tutorial) when you genuinely need to test combinations.
For changes requiring custom code (animations, interactive components), click the </> Code Editor and write JavaScript/CSS. VWO injects this into the variant only — the original page is untouched.
Test in preview mode before saving: top-right Preview → opens a new tab with the variant rendered. QA carefully: a variant that silently breaks the page converts none of the traffic it receives and invalidates the test.
Step 4
Goals are how VWO measures the test. Pick the primary goal that maps directly to the hypothesis. Add 1-2 secondary goals for context.
Below the variant editor, click Goals → Add Goal.
Primary goal: maps directly to the hypothesis. For a 'pricing page → click-through to checkout' hypothesis, the primary goal is a click on .checkout-cta or a URL visit to /checkout.
Secondary goals: revenue per visitor, bounce rate, time on page. These give context — a variant that wins on click-through but loses on revenue-per-visitor is shipping a leakier funnel.
Goal types: URL Visit, Click on Element, Form Submit, Custom Conversion (event-based), and Revenue (for ecommerce — passes order value via JavaScript).
For ecommerce, set up a Revenue goal: paste the JS snippet VWO provides on your /thank-you page with the order_total variable. VWO will compute revenue per visitor per variant — the only metric that actually matters for product/pricing tests.
Cap at 1 primary + 2-3 secondary goals. More goals = more chances of random significance on something irrelevant + more decision paralysis at test-end.
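To see why secondary goals matter, it helps to compare the same test on more than one metric. The sketch below uses made-up numbers (every visitor, click, order, and revenue figure is hypothetical) to show a variant that wins on click-through while losing on revenue per visitor.

```python
def summarize(visitors, clicks, orders, revenue):
    """Per-visitor metrics for one arm of the test."""
    return {
        "click_rate": clicks / visitors,
        "order_rate": orders / visitors,
        "revenue_per_visitor": revenue / visitors,
    }


def relative_lift(variant_value, control_value):
    """Relative change of the variant versus control."""
    return (variant_value - control_value) / control_value


# Hypothetical results, for illustration only.
control = summarize(visitors=10_000, clicks=800, orders=230, revenue=18_400.0)
variant = summarize(visitors=10_000, clicks=960, orders=215, revenue=17_200.0)

for metric in control:
    lift = relative_lift(variant[metric], control[metric])
    print(f"{metric:20s} {lift:+.1%}")
```

Judged on clicks alone, this variant ships. Judged on revenue per visitor, it does not.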
Step 5
Default = 50/50 split. Start there. Use segmentation to limit the test to specific traffic sources if your hypothesis is segment-specific.
Below Goals, click Settings → Traffic Allocation. Default is 50% Control / 50% Variant. Keep this for most tests — uneven splits (e.g., 90/10) require more total traffic to reach significance.
Test scope: default = 100% of page visitors. Reduce only if you have a reason (e.g., test only on paid traffic landing pages where the hypothesis applies).
Segmentation: click + Add Segment to limit the test to specific cohorts. Common segments: new visitors only, mobile only, US traffic only, paid (UTM = google/cpc) only.
Segment-specific tests are more sensitive (you exclude noise from irrelevant traffic) but slower to reach sample size. Use sparingly.
For tests with high-stakes downside risk (testing pricing changes, checkout flow), consider starting at 90/10 split for 24-48 hours to confirm nothing catastrophic breaks, then move to 50/50 for the real test.
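The cost of an uneven split can be estimated with a rough rule. Assuming both arms have similar variance (an approximation), the total traffic needed for the same precision scales with 1 / (4 × share × (1 − share)), where share is the fraction of traffic sent to the variant. The function name below is hypothetical.

```python
def traffic_inflation(variant_share):
    """Total-sample multiplier versus a 50/50 split (equal-variance approximation)."""
    if not 0 < variant_share < 1:
        raise ValueError("variant_share must be between 0 and 1")
    return 0.25 / (variant_share * (1 - variant_share))


for share in (0.5, 0.3, 0.1):
    print(f"{share:.0%} to variant -> {traffic_inflation(share):.2f}x total traffic")
```

Under this approximation a 90/10 split needs close to three times the traffic of a 50/50 split, which is why it suits a short safety check better than the real test.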
Step 6
Click Start Test. Then leave it alone for the calculated sample period. Checking results daily produces 'peeking bias' and bad decisions.
Top-right click Start Test. VWO begins serving the variant to 50% of incoming traffic.
Verify 30 minutes after launch: open the test page in incognito multiple times. You should see Control about half the time and Variant the other half. The VWO Helper Chrome extension confirms which variant you're seeing.
Don't check the report daily. Daily checks invite the temptation to call a winner early. Set a calendar reminder for the date your required sample size will be reached — typically 14-30 days out.
If you must check sooner, look only at: (a) total visitors entering test (is it on track?), (b) any catastrophic variant breakage (zero conversions on one variant = bug, not a result). Ignore the conversion-rate-by-variant chart until sample is reached.
The 'Peeking Problem' is the #1 reason A/B tests fail: when you peek daily and stop the moment you see a 'win,' you've inflated your false positive rate from 5% to 30-40%. At that point a 'win' is no longer trustworthy.
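You can watch the peeking problem happen in a small simulation. The sketch below runs A/A tests (both arms have the same true conversion rate, so every 'win' is a false positive) and compares stopping at the first significant daily look against a single planned read. It uses a frequentist z-test for simplicity, not SmartStats, and the traffic, rate, and duration values are assumptions; the exact inflation depends on how often and how long you peek.

```python
import math
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 95% threshold


def is_significant(conv_a, conv_b, n):
    """Pooled two-proportion z-test with n visitors in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    return abs(conv_a - conv_b) / n / se > Z_CRIT


def simulate(n_tests=500, days=14, daily_per_arm=200, rate=0.10, seed=1):
    """Return (false-positive rate with daily peeking, rate with one final read)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        ever_significant = False
        significant = False
        for day in range(1, days + 1):
            conv_a += sum(rng.random() < rate for _ in range(daily_per_arm))
            conv_b += sum(rng.random() < rate for _ in range(daily_per_arm))
            significant = is_significant(conv_a, conv_b, day * daily_per_arm)
            ever_significant = ever_significant or significant
        peeking_hits += ever_significant  # would have stopped at some daily look
        fixed_hits += significant         # only the final, planned look counts
    return peeking_hits / n_tests, fixed_hits / n_tests


peeking_rate, fixed_rate = simulate()
print(f"stop at first 'win': {peeking_rate:.1%} false positives")
print(f"single planned read: {fixed_rate:.1%} false positives")
```

More looks and longer tests push the peeking rate higher; the planned read stays near the nominal 5%.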
Step 7
Once the calculated sample is reached, read the result. VWO SmartStats reports a Bayesian probability that each variant is the best. Only ship variants with 95%+ probability AND a practical lift > MDE.
Open the test report. Top-right shows current visitors per variant, conversion rate, and statistical significance.
VWO SmartStats uses Bayesian analysis: it shows "Probability to be Best" per variant. Ship a variant only when this exceeds 95% AND the relative lift exceeds your declared MDE (typically 10%).
If the 95% threshold is reached but the lift is below MDE (e.g., variant wins with 2% lift when MDE was 10%): don't ship. The lift is too small to matter practically and is likely within long-term noise.
If sample is reached but probability is 60-80%: declare 'no winner.' Don't ship. Don't extend the test. Move to the next hypothesis. Inconclusive results are honest results — most tests don't win.
When you do ship a winner, monitor the metric for 30 days post-launch on the full traffic. If the lift doesn't hold, the win was likely noise (~10% of "winners" don't hold). Document the discrepancy in your test log.
Archive every test (winning, losing, inconclusive) with: hypothesis, variant description, sample size, result, decision. After 30+ tests, this archive becomes the most valuable internal document you own.
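The ship / don't ship / no winner rule above can be written down as a small decision function. The probability estimate here is a generic Beta-Binomial Monte Carlo sketch with a uniform prior; it is an assumption for illustration and not VWO's SmartStats implementation, so its numbers will not match the report exactly. Function names and the example counts are hypothetical.

```python
import random


def probability_to_beat(control, variant, draws=20_000, seed=7):
    """Monte Carlo estimate of P(variant rate > control rate).

    control and variant are (conversions, visitors) tuples. Each rate gets a
    Beta(1 + conversions, 1 + non-conversions) posterior (uniform prior).
    """
    rng = random.Random(seed)
    (c_conv, c_n), (v_conv, v_n) = control, variant
    wins = 0
    for _ in range(draws):
        c_rate = rng.betavariate(1 + c_conv, 1 + c_n - c_conv)
        v_rate = rng.betavariate(1 + v_conv, 1 + v_n - v_conv)
        wins += v_rate > c_rate
    return wins / draws


def decide(control, variant, mde=0.10, threshold=0.95):
    """Ship only if probability exceeds the threshold AND lift exceeds the MDE."""
    c_rate = control[0] / control[1]
    v_rate = variant[0] / variant[1]
    lift = (v_rate - c_rate) / c_rate
    prob = probability_to_beat(control, variant)
    if prob > threshold and lift > mde:
        return "ship", prob, lift
    if prob > threshold:
        return "don't ship: lift below MDE", prob, lift
    return "no winner", prob, lift


# Hypothetical counts: (conversions, visitors) per arm.
decision, prob, lift = decide(control=(1610, 70_000), variant=(1850, 70_000))
print(decision, round(prob, 3), f"{lift:+.1%}")
```

Run it only after the planned sample is reached; calling `decide` on partial data brings the peeking problem straight back.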
Common mistakes
Launching tests with no calculated sample size
What goes wrong: Tests called early on whatever direction looks 'winning' on day 5. False positive rate inflates from 5% to 30-40%. Teams ship 4-5 'wins' per quarter that don't hold post-launch. On a $50K/mo ad-spend account, ~$8,000-20,000/quarter is spent optimizing toward random noise.
How to avoid: Calculate required sample BEFORE launch using VWO Sample Size Calculator. Hardcode the stopping rule. Refuse to read results until sample is met.
Testing too many things in one variant
What goes wrong: Variant changes hero + CTA + pricing card + testimonial layout. Variant wins. But you don't know WHICH change drove the lift, so you can't replicate or scale the learning. Worse, the variant might have won despite 2 of the 4 changes being negative — and you ship the negatives. Typical cost: 3-6 months before the team realizes the win didn't compound, ~$5,000-12,000 in design and dev work that produced no scalable learnings.
How to avoid: One variable per test where possible. Multi-change variants belong in multivariate tests (separate tutorial) where VWO can isolate each variable's contribution.
Reading test results before the calculated sample is reached
What goes wrong: The 'Peeking Problem' — checking daily and stopping at the first 'significant' moment inflates false positives from 5% to 30-40%. Teams ship more losers than winners. On a $30K/mo ad-spend account, this is $4,000-10,000/quarter in misdirected optimization.
How to avoid: Calendar the stopping date based on calculated sample. Don't open the report until that date. If you must check sooner, look only at sample progress, not conversion rates.
Picking the wrong primary goal
What goes wrong: Optimizing for clicks instead of revenue: a variant lifts clicks 20% but those clicks don't convert. Net revenue drops 5-10%. On a $200K/mo ecom store, that's $10K-20K/mo in lost revenue from a 'winning' test that wasn't.
How to avoid: Always pick the deepest goal you can reliably measure — revenue per visitor for ecom, paid-trial-converted for SaaS, qualified-lead-submitted for B2B. Vanity metrics (clicks, time on page) belong as secondary, not primary.
Not excluding internal traffic from tests
What goes wrong: Team QA, designer reviews, and stakeholder demos all generate sessions on the variant. On a low-traffic page (5K visitors/test), 200 internal sessions can swing a result by 5-15%. Teams call wins or losses based on contaminated data. Typical cost: 1-2 false reads per quarter = $3,000-8,000 in wasted CRO cycles.
How to avoid: Settings → Account Settings → IP Exclusion. Add office, home, VPN IPs. Audit quarterly.
Not documenting losing tests
What goes wrong: Team logs only winners. Six months later, someone proposes the same hypothesis you already tested and lost. You re-test, lose again, and waste another 30 days of traffic. On a $40K/mo ad-spend account, re-running 2-3 old losing tests per year = $5,000-10,000 in repeated CRO work.
How to avoid: Document every test (winners, losers, inconclusive) in a single test log: hypothesis, variant, sample, result, decision, date. Reference before launching any new test on the same page.
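A test log does not need special tooling. As a minimal sketch, a CSV file with the fields listed above is enough; the `ExperimentRecord` class, the helper names, the file location, and the example entries are all hypothetical.

```python
import csv
import os
import tempfile
from dataclasses import asdict, dataclass, fields


@dataclass
class ExperimentRecord:
    date: str
    page: str
    hypothesis: str
    variant: str
    sample_per_variant: int
    result: str    # e.g. "win", "loss", "inconclusive"
    decision: str  # e.g. "shipped", "not shipped"


FIELDNAMES = [f.name for f in fields(ExperimentRecord)]


def append_record(path, record):
    """Append one test to the CSV log, writing the header for a new file."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        if is_new:
            writer.writeheader()
        writer.writerow(asdict(record))


def past_tests_for(path, page):
    """Return every logged test on a page, to review before a new launch."""
    if not os.path.exists(path):
        return []
    with open(path, newline="", encoding="utf-8") as f:
        return [row for row in csv.DictReader(f) if row["page"] == page]


# Example usage with made-up entries, written to a temporary directory.
log_path = os.path.join(tempfile.mkdtemp(), "test_log.csv")
append_record(log_path, ExperimentRecord(
    "2026-05-30", "/pricing", "Specific hero headline lifts click-through",
    "Hero text swap", 70_000, "inconclusive", "not shipped"))
append_record(log_path, ExperimentRecord(
    "2026-07-12", "/checkout", "Shorter form lifts completed orders",
    "Removed two optional fields", 40_000, "win", "shipped"))

pricing_history = past_tests_for(log_path, "/pricing")
print(len(pricing_history), pricing_history[0]["result"])
```

Checking `past_tests_for` before every launch is the habit that stops the team from re-running a test it already lost.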
Recap
Done — what's next
How to set up a VWO multivariate test the right way
Hand it off
Running A/B tests well is a quarterly cadence, not a one-off. A vetted CRO specialist on EverestX typically ships 4-8 tests per quarter, reaches significance on 60-70%, and compounds winners into 15-30% conversion lift over 12 months. Engagements run $500-1,200/mo at $14-16/hr.
Until the calculated sample size is reached. Typically 14-30 days for mid-traffic pages, 7-14 days for high-traffic checkout flows, 30-60 days for low-traffic B2B pages. Tests below 7 days are usually noise; tests above 60 days accumulate seasonality contamination.
SmartStats is VWO's Bayesian engine — it computes "Probability to be Best" per variant, which is more intuitive than p-values. Frequentist tests give you 95% confidence intervals + p<0.05 threshold. Both produce similar conclusions on properly-sized tests. SmartStats handles small samples better; frequentist is the academic standard.
Technically yes, but the tests will contaminate each other if they target overlapping elements. Use VWO's Mutually Exclusive groups (Pro+) to ensure a user only sees one test at a time. For most teams, sequential testing (one at a time) is cleaner and easier to read.
Run tests for full weekly cycles (multiple of 7 days) to wash out weekday/weekend differences. Avoid running tests through major sales (Black Friday, end-of-year) — traffic mix shifts dramatically and contaminates the read. If a test runs through a known event, segment the analysis to pre/during/post for context.
Probably not. The 95% threshold exists because tests at 90% have a 1-in-10 chance of being wrong. Over 20 tests, you'd ship 2 losers. Either extend the test if budget allows (more sample), or declare inconclusive and move to the next hypothesis. Inconclusive is a valid outcome.
Settings → Integrations → Google Analytics 4. VWO pushes test variant data as a GA4 custom dimension. In GA4, you can then segment any report by experiment variant. Useful for analyzing test impact on downstream metrics (LTV, retention) that VWO doesn't track natively.