Paste in your visitor and conversion counts for two variants and find out in seconds whether the difference is real — or just noise dressed up as a win.
Variant B is up 35%. You can feel the dopamine. You're already drafting the announcement. Then a quieter voice asks: is that real, or did 200 extra clicks just fall the right way by luck? That voice is correct to ask, and most of us ship the change anyway. This analyzer settles it before you do something you'll quietly reverse in three weeks. Drop in the visitor and conversion counts for each version, and it returns statistical confidence, lift percentage, and a p-value — plus a plain verdict so you don't need a stats degree to read it.
The sample size calculator inside the tool is just as useful. Before you start a test, enter your baseline conversion rate and the minimum improvement worth detecting, and it shows you exactly how many visitors each variant needs before the result means anything. This stops you from calling tests early because the numbers looked good on day three.
The difference between a real win and random variation
Say Control A sends 5,000 visitors through and records 155 conversions, a 3.1% rate. Variant B gets 5,000 visitors and converts 210, a 4.2% rate. That looks like a 35% lift, and it feels exciting. But with sample sizes in the low thousands, a difference that size can appear by chance. The tool runs a two-proportion z-test and returns a confidence level: in this case, a p-value of 0.003, which is well past the 95% confidence threshold. That is a real win. But with only 500 visitors per side and a similar gap, confidence might only reach 60%, which means you should keep running.
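For readers who want to check the arithmetic, here is a minimal sketch of a pooled two-proportion z-test applied to the numbers above. It assumes a two-sided test and uses only the Python standard library. The function name is illustrative, and the tool's internal implementation may differ.

```python
import math

def two_proportion_z_test(visitors_a, conv_a, visitors_b, conv_b):
    """Pooled two-proportion z-test. Returns (relative lift, z, two-sided p-value)."""
    rate_a = conv_a / visitors_a
    rate_b = conv_b / visitors_b
    # Under the "no real difference" assumption both variants share one rate.
    pooled = (conv_a + conv_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal beyond |z|.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    lift = (rate_b - rate_a) / rate_a
    return lift, z, p_value

lift, z, p_value = two_proportion_z_test(5000, 155, 5000, 210)
print(f"lift={lift:.1%}  z={z:.2f}  p={p_value:.4f}")
```

With 155 and 210 conversions out of 5,000 each, this gives a lift of about 35.5% and a p-value of roughly 0.003. The sketch does not guard against zero conversions on both sides, which would make the standard error zero.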
The p-value is the probability that you would observe a gap at least this large if there were actually no real difference. A p-value below 0.05 is the conventional threshold for declaring a result significant. Below 0.01 is strong. Above 0.1, keep collecting data. The tool displays confidence, lift, and the p-value together, and adds a plain-language verdict: Win, Keep Running, or No Significant Difference.
How to use the sample size calculator before you start
The single most common A/B testing mistake is calling the result after a week because one version was ahead. The sample size calculator prevents this. Enter your current baseline conversion rate (pull it from your analytics — be honest) and the minimum detectable effect you actually care about. A 5% relative improvement on a 3% baseline means you need to detect 3.15% versus 3%, which requires a much larger sample than detecting a 20% lift would.
The tool returns three sample size targets at 80%, 90%, and 95% confidence levels. Pick the level that matches the stakes. Testing a button color on a low-traffic blog? 80% might be fine. Testing a checkout flow change on a business that depends on that revenue? 95% is the minimum worth trusting.
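To see why a small minimum detectable effect inflates the required sample, here is a rough sketch using the standard normal-approximation formula for comparing two proportions. It assumes a two-sided test and 80% statistical power. Both are assumptions, since the tool's exact formula is not documented here, and `visitors_per_variant` is a hypothetical helper name.

```python
import math
from statistics import NormalDist

def visitors_per_variant(baseline, relative_mde, confidence=0.95, power=0.80):
    """Approximate visitors needed per variant (power=0.80 is an assumption)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # e.g. 3% baseline, 5% lift -> 3.15%
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

for confidence in (0.80, 0.90, 0.95):
    small = visitors_per_variant(0.03, 0.05, confidence)
    large = visitors_per_variant(0.03, 0.20, confidence)
    print(f"{confidence:.0%}: 5% lift -> {small:,}   20% lift -> {large:,}")
```

Under these assumptions, detecting 3.15% versus 3% at 95% confidence takes on the order of 200,000 visitors per variant, while a 20% lift needs closer to 14,000. The required sample grows roughly with the inverse square of the gap you want to detect.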
It also shows you how long the test will take based on your current traffic volume. A test that needs 10,000 visitors per variant on a site getting 200 visits a day, split evenly between two variants, will take 100 days. Know that before you start, not after you have been waiting three weeks already.
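The duration estimate is simple division, but the answer depends on whether the daily figure is total traffic shared across variants or traffic reaching each variant. A small sketch makes the difference explicit. The function and its parameters are hypothetical, not the tool's own.

```python
import math

def days_to_finish(visitors_per_variant, daily_visits, variants=2, split_traffic=True):
    """Days until every variant reaches its target.

    split_traffic=True: daily_visits is total site traffic divided evenly.
    split_traffic=False: daily_visits is what each variant receives.
    """
    per_variant_daily = daily_visits / variants if split_traffic else daily_visits
    return math.ceil(visitors_per_variant / per_variant_daily)

print(days_to_finish(10_000, 200))                       # 100
print(days_to_finish(10_000, 200, split_traffic=False))  # 50
```

If 200 daily visits are shared between two variants, each variant sees only 100 a day, so a 10,000-visitor target takes 100 days rather than 50.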
The benchmarks that put your conversion rates in context
The Guide tab inside the tool shows typical lift ranges and minimum sample sizes for six common test types. A headline change typically moves conversions 5 to 20% with a minimum of 2,000 visitors. A pricing page layout test can shift things 10 to 40% with as few as 1,000 visitors because pricing decisions tend to be binary. A landing page redesign needs 3,000 or more visitors per variant because there are so many variables in play.
These are ranges, not guarantees. Your audience, your offer, and your baseline all affect what kind of lift is detectable. But the benchmarks help you calibrate expectations. If you ran a CTA button color test and got a 35% lift with 95% confidence, that is above the typical range for that kind of test — worth double-checking whether the test was set up cleanly before celebrating.
Logging tests over time and spotting patterns
The History tab lets you save each test result — date, Control rate, Variant rate, lift percentage, and whether it was significant. Over a dozen tests, patterns emerge. Maybe your headline changes consistently produce lifts around 8%, but pricing layout tests rarely reach significance. Maybe email subject line tests hit significance faster than any other type. That is real institutional knowledge about your audience.
Most creators and freelancers run tests in isolation and forget the results. The log changes that. Six months of documented tests tells you which levers actually move your specific funnel and which ones waste time. That is worth more than any single winning variant.
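If you also keep a copy of the log outside the tool, a structure like the following is enough to surface those patterns. This is a sketch: the `test_type` label and every value in the sample records are made up for illustration, and the tool's own export format may look different.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from statistics import mean

@dataclass
class LoggedTest:
    run_date: date
    test_type: str        # hypothetical label, e.g. "headline" or "pricing"
    control_rate: float
    variant_rate: float
    significant: bool

    @property
    def lift(self) -> float:
        # Relative lift from Control to Variant.
        return (self.variant_rate - self.control_rate) / self.control_rate

def summarize(records):
    """Group logged tests by type: count, mean lift, share that were significant."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.test_type].append(record)
    return {
        test_type: {
            "tests": len(group),
            "mean_lift": mean(r.lift for r in group),
            "significant_share": sum(r.significant for r in group) / len(group),
        }
        for test_type, group in grouped.items()
    }

# Illustrative records only, not real results.
history = [
    LoggedTest(date(2024, 1, 10), "headline", 0.030, 0.033, True),
    LoggedTest(date(2024, 2, 2), "headline", 0.031, 0.033, False),
    LoggedTest(date(2024, 3, 5), "pricing", 0.020, 0.021, False),
]
for test_type, stats in summarize(history).items():
    print(test_type, stats)
```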
What to do when a test fails to reach significance
A non-significant result is not a failed test. It is information. If Variant B showed a 4% lift but only reached 72% confidence, the change might be real but you did not have enough traffic to confirm it. Three options: keep running until you hit your required sample size, combine the learnings with a bolder change in the next iteration, or accept the null and move on to a test with more potential impact.
What you should not do is declare a winner at 72% confidence and ship the change. Do that consistently and a sizable share of the "wins" you ship will turn out to be noise, changes that do nothing or quietly hurt. The point of the analyzer is to stop that pattern, not to give you permission to call it early.
Paste your two variants in right now and get the honest verdict. Then start a free trial to keep the history — because six months of your own logged tests will teach you more about your audience than any single green-bar win ever could.
How to use it
- Enter Visitors and Conversions for Control A in the left panel — use real numbers from your analytics, not a sample or estimate.
- Enter Visitors and Conversions for Variant B in the right panel — the Conversion Rate display updates live as you type.
- Read the four KPIs at the top: Confidence (how strong the evidence is), Lift (how much better B is), P-Value (how likely a gap this large would be by chance alone), and Verdict (Win, Keep Running, or No Significant Difference).
- If the verdict is Keep Running, use the Sample Size tab to find how many more visitors you need before calling the test.
- Click Save to History to log the result with the date and lift — the History tab builds a record you can review across all your tests.
Who it's for
- Etsy seller testing listing title variations — Ran Title A to 2,200 visitors at 1.8% conversion, Title B to 2,200 at 2.4% — the tool returns 89% confidence, which means keep running for at least another week before switching.
- Course creator testing two sales page headlines — After 1,000 visitors each, Version B is at 3.9% versus 3.1% — only 78% confidence, so she does not ship it yet and sets a calendar reminder to check again after another 800 visitors.
- Freelancer testing two email subject lines — Sends to 500-person sub-segments of a 1,400-person list — the tool shows 96% confidence for the winning subject line and a 28% lift, enough to use for future sends.
- Creator checking whether a checkout button color mattered: green versus orange, 4,000 visitors per variant, 2.1% versus 2.3%. At 61% confidence there is no evidence the color made a difference, so he moves on to testing the actual copy.
Key terms
- Statistical confidence
- A measure of how unlikely the observed difference would be if only random chance were at work. 95% confidence means a gap this large would show up only about 5% of the time if the two variants were truly identical.
- P-value
- The probability that a difference this large would occur if both variants were actually identical. Values below 0.05 are conventionally treated as statistically significant.
- Minimum detectable effect (MDE)
- The smallest lift you care about detecting. A smaller MDE requires a larger sample size. Setting MDE too small means waiting an impractically long time to call a test.
- Lift
- The percentage improvement in conversion rate from Control to Variant, expressed as a relative change. A jump from 3% to 4% is a 33% lift, not a 1-point lift.
Frequently asked questions
How do I know when a test has run long enough?
Use the Sample Size tab before you start. Enter your baseline conversion rate and minimum detectable improvement to get the visitor count each variant needs. Reaching that number, not a set number of days, is what makes a test ready to call.
What confidence level should I require before calling a winner?
95% is the standard for decisions with meaningful financial impact. For lower-stakes tests like social copy or email subjects, 85% to 90% is often acceptable. Below 80%, you are essentially guessing.
What does a p-value of 0.05 actually mean?
It means there is a 5% probability you would see a gap this large by random chance if there were actually no real difference between the variants. Lower p-values mean stronger evidence that the difference is real.
Can I test more than two variants at once?
The tool compares exactly two variants at a time. For multi-variant tests, compare each variant against the original control in a separate analysis — but be aware that running many comparisons simultaneously increases your chance of a false positive.
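To get a feel for how fast that risk grows, here is a back-of-the-envelope sketch. It assumes every comparison is independent and that no variant truly differs from the control. Comparisons that share one control are correlated, so treat the output as an approximation.

```python
def chance_of_false_positive(comparisons, alpha=0.05):
    """Probability that at least one of several independent comparisons
    crosses the significance threshold by chance alone."""
    return 1 - (1 - alpha) ** comparisons

for k in (1, 3, 5):
    print(f"{k} comparison(s): {chance_of_false_positive(k):.1%}")
```

With a 0.05 threshold, one comparison carries a 5% false-positive risk, three carry about 14%, and five carry about 23%.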
My test shows a negative lift — did something go wrong?
Not necessarily. A negative lift means Variant B performed worse than Control A. That is useful information. Check that traffic was split evenly and that both variants were live during the same time period with no other changes that could have affected conversions.