In today’s fast-paced market environment, making decisions based on data is no longer optional; it’s essential for success. Data-driven organizations routinely leverage experimentation to validate ideas, assess risk, and implement changes whose benefits justify the investment of time and money. One of the most common experimentation techniques is the A/B test (also known as a split test).
Have you ever wondered if adding a new feature to your business is a worthwhile effort? An A/B test could help you answer this question by comparing two versions of a product feature, marketing asset, or workflow to determine which performs better based on a predefined metric. These experiments are not simply exploratory; they are grounded in sound statistical principles, ensuring that decisions are not only data-driven but also statistically valid.
Practical Example:
Let’s say you work for an e-commerce company and you’re exploring ways to increase the average order value (AOV). One idea is to offer product bundles, and you want to know whether customers spend more when bundles are offered.
- Objectives and Competing Hypotheses:
- Null hypothesis (H₀): Offering bundles does not increase AOV.
- Alternative hypothesis (H₁): Offering bundles increases AOV by at least $10.
- Primary Metric: Average Order Value (AOV)
- Randomization: Ensures that users are assigned randomly and without bias.
- Control Group: Users who do not receive any changes.
- Treatment Group: Users who receive the bundle recommendations.
- Sample Size Calculation: Let’s say we want to detect an increase in AOV of at least $10. We also want to keep a low (5%) chance of a false positive and an 80% chance of detecting the effect in our experiment. To calculate the sample size, we use the standard formula for a two-sample test:
$$ n = \frac{2\,(Z_{1-\alpha/2} + Z_{1-\beta})^2\,\sigma^2}{\delta^2} $$

Where:
- Minimum Detectable Effect (MDE), $\delta$: $10
- Standard Deviation, $\sigma$: $30
- $Z_{1-\alpha/2}$ for a 5% significance level: 1.96
- $Z_{1-\beta}$ for 80% power $(1-\beta)$: 0.84, giving us an 80% chance of detecting the effect if it’s real.
Plugging the numbers into the formula gives $n = 2 \times (1.96 + 0.84)^2 \times 30^2 / 10^2 \approx 141.1$, so we’d need at least 142 users in each group (control and treatment) to run this experiment with the desired sensitivity and confidence.
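To sanity-check this number, the formula can be scripted in a few lines. Below is a minimal Python sketch; the function name sample_size_per_group and the use of scipy for the normal quantiles are our own illustration, not part of the original analysis.

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_group(mde, sigma, alpha=0.05, power=0.80):
    """Per-group n for a two-sample test: n = 2 * ((z_a + z_b) * sigma / mde)^2."""
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 at a 5% significance level
    z_beta = norm.ppf(power)           # 0.84 for 80% power
    return ceil(2 * ((z_alpha + z_beta) * sigma / mde) ** 2)

print(sample_size_per_group(mde=10, sigma=30))  # -> 142
```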
Interpreting and Acting on Results
| Group     | Orders | Avg AOV | Std Dev |
|-----------|--------|---------|---------|
| Control   | 142    | $76.20  | $29.80  |
| Treatment | 142    | $84.90  | $31.50  |
- Statistical Analysis
After running the experiment, we observed that:
- Control Group AOV: $76.20
- Treatment Group AOV: $84.90
- That’s an uplift of $8.70, which seems promising at first glance.
However, to determine if this difference is statistically significant, we need to estimate some additional values.
- Step 1: Calculate the Standard Error (SE): Measures the uncertainty around the observed difference in means.

$$ SE = \sqrt{\frac{s_c^2}{n_c} + \frac{s_t^2}{n_t}} = \sqrt{\frac{29.8^2}{142} + \frac{31.5^2}{142}} \approx 3.64 $$
- Step 2: Calculate the t-statistic: Measures how extreme our observed difference is compared to what we’d expect under the null hypothesis. For that, we subtract our target effect size ($10) from the observed difference ($8.70) and divide by the standard error:

$$ t = \frac{(\bar{x}_t - \bar{x}_c) - \delta}{SE} = \frac{(84.90 - 76.20) - 10}{3.64} \approx -0.36 $$
- Step 3: Make a Decision: Using a significance level of 5%:
- With $n_c + n_t - 2 = 142 + 142 - 2 = 282$ degrees of freedom, the one-sided critical t-value is approximately 1.65.
- Since $-0.36 < 1.65$, we fail to reject the null hypothesis: the result is not statistically significant.
- Therefore, even though bundles showed an $8.70 increase in AOV (close to our $10 target), the variation in the data was too high to be confident that the true uplift reaches that target.
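The three steps above are easy to reproduce in code. Here is a minimal Python sketch that recomputes the standard error, the t-statistic against the $10 target, and the one-sided critical value from the summary statistics in the table; the variable names are illustrative, not from the original analysis.

```python
from math import sqrt
from scipy.stats import t as t_dist

# Summary statistics from the experiment table
n_c, mean_c, sd_c = 142, 76.20, 29.80  # control
n_t, mean_t, sd_t = 142, 84.90, 31.50  # treatment
delta = 10.0                           # target uplift (MDE) under H1
alpha = 0.05

# Step 1: standard error of the difference in means
se = sqrt(sd_c**2 / n_c + sd_t**2 / n_t)   # ~3.64

# Step 2: t-statistic for the observed difference vs. the $10 target
t_stat = ((mean_t - mean_c) - delta) / se  # ~-0.36

# Step 3: one-sided critical value with n_c + n_t - 2 degrees of freedom
df = n_c + n_t - 2
t_crit = t_dist.ppf(1 - alpha, df)         # ~1.65

print(f"SE={se:.2f}, t={t_stat:.2f}, critical={t_crit:.2f}")
print("significant" if t_stat > t_crit else "not significant")
```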
Conclusions and Recommendations
Failing to achieve statistical significance is not the end of the story. There may still be valuable insights in the experiment that the experimenter can uncover through careful interpretation and business expertise. For example, in this case, lowering the Minimum Detectable Effect (MDE) changes the outcome: testing the observed difference against zero instead of the $10 target gives $t = 8.70 / 3.64 \approx 2.39$, which exceeds the 1.65 critical value and provides statistically significant evidence that bundles increase AOV.
However, statistical significance does not automatically imply business value. A thorough cost-benefit analysis is essential to evaluate the true benefit of implementing any changes, taking into account additional costs, potential risks, and long-term impact.
Additionally, some user segments may respond more positively to specific changes. Running experiments on segmented groups and focusing on those with the highest impact is a common strategy to tailor the user experience (UX) to the specific needs and preferences of different audiences.
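As a rough sketch of what a segment-level readout could look like, assuming hypothetical per-order data with a segment label (the columns and values below are invented for illustration and do not come from the experiment above):

```python
import pandas as pd

# Hypothetical per-order data; 'segment', 'group', and the values are
# invented for illustration only.
orders = pd.DataFrame({
    "segment":     ["new", "new", "returning", "returning"],
    "group":       ["control", "treatment", "control", "treatment"],
    "order_value": [70.0, 85.0, 80.0, 86.0],
})

# Per-segment AOV by group, then the treatment-minus-control uplift,
# to spot segments that respond more strongly to the change.
aov = orders.groupby(["segment", "group"])["order_value"].mean().unstack("group")
aov["uplift"] = aov["treatment"] - aov["control"]
print(aov)
```

Keep in mind that each segment still needs its own significance test and adequate sample size, since slicing the data multiplies the chances of a false positive.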
Remember: measuring differences is only part of the story. Truly data-driven decisions require caution, context, expertise, and statistical rigor!
Written by:
Fabián Sánchez
Data Analyst
Country: Colombia