Analyzing Results

Reading Results

Frequentist and Bayesian analysis, confidence thresholds, and key fields.

Every test in Otter A/B is scored using one of two analysis methods, configured per-test alongside a confidence level (80%, 90%, 95%, or 99%). The results page shows a single decision score whose label and meaning depend on the method you chose.

A results page in Otter A/B answers three questions: is there a difference between the variants, how big is it, and how sure are we? Each of those questions has its own column on the table — the score, the lift (improvement), and the confidence interval — and the three are connected. A big lift with tight intervals and a high score is a clear win. A small lift with wide intervals and a borderline score is noise.

The most important thing to understand is which method is scoring your test. Frequentist and Bayesian both work, but they answer slightly different questions. Frequentist asks “if there were no real difference, how surprised would I be to see data this extreme?” Bayesian asks “what's the probability the variant is genuinely better?” Most teams use frequentist; teams that run many low-traffic tests or value an intuitive interpretation often prefer Bayesian.

Frequentist

Significance (1-p)

The classic A/B testing approach. Otter A/B picks the right test for your data: Fisher's exact test when conversion counts are sparse, a two-proportion z-test for typical conversion-rate comparisons, and Welch's t-test (with Satterthwaite degrees of freedom) for revenue metrics. The score is one minus the p-value, expressed as a percentage.

  • Score label on the results page: "Significance (1-p)".
  • You hit "significance" when the score reaches your effective confidence threshold (see below).
  • Multivariate tests automatically apply a Bonferroni correction — the bar to clear rises with the number of challenger variants.
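
If you want to sanity-check a score by hand, the arithmetic for the common case is short. The sketch below is our own minimal illustration of a pooled two-proportion z-test with a two-sided p-value, scored as one minus the p-value; it is not Otter A/B's implementation, it covers only the z-test path (not Fisher's exact or Welch's t-test), and the function name and sample numbers are made up.

  # Minimal sketch of a two-proportion z-test scored as "Significance (1-p)".
  # Illustrative only — not Otter A/B's internal code.
  from math import sqrt
  from statistics import NormalDist

  def significance_score(control_visitors, control_conversions,
                         variant_visitors, variant_conversions):
      p1 = control_conversions / control_visitors
      p2 = variant_conversions / variant_visitors
      pooled = (control_conversions + variant_conversions) / (control_visitors + variant_visitors)
      se = sqrt(pooled * (1 - pooled) * (1 / control_visitors + 1 / variant_visitors))
      z = (p2 - p1) / se
      p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided p-value
      return (1 - p_value) * 100                     # score shown as a percentage

  # e.g. 10,000 visitors per arm, 500 vs 560 conversions
  print(round(significance_score(10_000, 500, 10_000, 560), 1))   # 94.2 — short of a 95% bar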

Bayesian

Chance to Beat Original

A probability-based approach that answers "what is the chance this variant is better than the control?" Otter A/B runs a Bayesian bootstrap and reports the probability that each variant outperforms the original. Revenue metrics use the same bootstrap on the underlying sample distributions.

  • Score label on the results page: "Chance to Beat Original".
  • If a variant is currently behind, the label flips to "Chance Original Beats Variant" and the percentage shown is the chance the control wins.
  • The configured confidence level is used directly as the threshold — there is no Bonferroni adjustment for Bayesian analysis.
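
The bootstrap itself is simple enough to sketch. Below is a minimal, illustrative Bayesian bootstrap (Dirichlet-weighted resampling) over per-visitor values — 0/1 outcomes for conversions, amounts for revenue. The function name, draw count, and example data are ours, not Otter A/B's, and the real implementation may differ.

  # Minimal sketch of "Chance to Beat Original" via a Bayesian bootstrap.
  # Illustrative only — not Otter A/B's internal code.
  import numpy as np

  def chance_to_beat(control_values, variant_values, draws=2000, seed=0):
      rng = np.random.default_rng(seed)
      control = np.asarray(control_values, dtype=float)
      variant = np.asarray(variant_values, dtype=float)
      wins = 0
      for _ in range(draws):
          # Dirichlet(1, ..., 1) weights over the observations give one
          # posterior draw of each arm's mean (Rubin's Bayesian bootstrap).
          w_control = rng.dirichlet(np.ones(len(control)))
          w_variant = rng.dirichlet(np.ones(len(variant)))
          if w_variant @ variant > w_control @ control:
              wins += 1
      return 100 * wins / draws   # "Chance to Beat Original" as a percentage

  # 0/1 conversion outcomes per visitor: 500/10,000 vs 560/10,000 conversions
  control = np.r_[np.ones(500), np.zeros(9_500)]
  variant = np.r_[np.ones(560), np.zeros(9_440)]
  print(round(chance_to_beat(control, variant), 1))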

Effective confidence threshold (Frequentist only)

When you run a frequentist test with more than one challenger variant, Otter A/B raises the confidence threshold using a Bonferroni correction so the chance of any single variant looking like a winner by accident stays bounded. The formula is:

effective_threshold = (1 - α / challenger_count) × 100
  where α = 1 - (confidence_level / 100)

For example, a test with 95% confidence and 2 challenger variants (3 variants total) needs the score to clear 97.5%, not 95%, before declaring a winner. Bayesian tests use the configured confidence level as-is — no adjustment is applied. The results page always shows the effective threshold you actually need to beat, not just what you configured.
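
If you want to reproduce the adjustment yourself, a few lines cover it. This is a direct transcription of the formula above; the function name is ours, not part of any Otter A/B API.

  # Bonferroni-adjusted threshold, from the formula above.
  def effective_threshold(confidence_level, challenger_count):
      alpha = 1 - confidence_level / 100
      return (1 - alpha / challenger_count) * 100

  print(round(effective_threshold(95, 1), 2))   # 95.0  — one challenger, no adjustment
  print(round(effective_threshold(95, 2), 2))   # 97.5  — the example above
  print(round(effective_threshold(99, 3), 2))   # 99.67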

Common fields on the results page

  • Visitors — unique humans assigned to each variant (bot traffic and impersonation sessions are excluded).
  • Conversions — primary-goal conversions attributed to each variant. Secondary goals are reported separately and do not affect the score.
  • Conversion rate / Revenue per visitor — the metric family determined by the primary goal. Revenue goals report revenue per visitor; every other goal type uses conversion rate.
  • Improvement — relative lift over the control (see the sketch after this list). Can go negative when the variant underperforms.
  • Score — the resolved decision score described above. The progress bar fills relative to the effective threshold, capping at 100%.
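
For reference, the Improvement and Score-bar arithmetic comes down to two ratios. The sketch below is our own illustration of that arithmetic, not Otter A/B's code, and the sample numbers are made up (they reuse the figures from the examples above).

  # Illustrative arithmetic for the Improvement and Score fields.
  def improvement(control_rate, variant_rate):
      # Relative lift over the control, as a percentage; negative if the variant loses.
      return (variant_rate - control_rate) / control_rate * 100

  def score_bar_fill(score, effective_threshold):
      # How full the progress bar is, relative to the effective threshold, capped at 100%.
      return min(score / effective_threshold, 1.0) * 100

  print(round(improvement(0.050, 0.056), 1))    # 12.0 — a 12% relative lift
  print(round(score_bar_fill(94.2, 97.5), 1))   # 96.6 — nearly full, but short of the threshold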

Reading results well

Don't call it early. Watching the score creep up on day two and shipping the “winner” is the most common way to ship a false positive. Combine the score with a visitor floor (the sample-size target you set in the wizard) so you stop on power, not on a peek.

Pay attention to the lift, not just the score. A test can hit a high score with a 0.3% lift if the sample is huge — statistically real, practically irrelevant. Ask whether the lift is large enough to matter to the business before you ship.

If results look surprising, check the activity log first. Pauses, edits, and stop-condition trips can all explain unexpected patterns. The activity log on the test page captures what happened and when.

Use Bayesian when you want to peek safely. Bayesian's “chance to beat original” reading is much safer to monitor mid-test than a frequentist p-value, whose false-positive rate inflates every time you check.

Frequently asked questions

Quick answers to the questions teams ask most about this part of Otter A/B.