UDC 519.24
Chumachenko A.A.
Product Analytics Team Lead, Simpals Chisinau, Moldova
IMPROVING AND SPEEDING UP A/B TESTING
Abstract
The research focuses on improving and accelerating A/B testing processes, which are key to optimizing digital products. The paper discusses advanced statistical techniques such as Bootstrap, stratification, CUPED, the delta method, and linearization for ratio metrics, which help to improve the effectiveness of testing. Their advantages, such as accuracy and speed of obtaining results, are analyzed, which allows companies to make data-driven decisions faster. The work demonstrates how the integration of these methods into corporate practice helps to optimize the development and rollout of changes, increasing conversion and improving the user experience.
Keywords:
Bootstrap, CUPED, ratio metrics, A/B testing, programming, digitalization.
Introduction
Product analytics is a specialized form of business intelligence that focuses on understanding and interpreting customer behavior and interactions with digital products such as web and mobile applications. It involves collecting and analyzing user engagement data, tracking key metrics, and visualizing this data to optimize the customer journey map and improve product performance. This process allows companies to make data-driven decisions, improve their user adoption strategy and ultimately drive business growth [1].
Product analysts, in turn, take into account the company's budget and goals to create a product that can increase the firm's revenue [2].
Within product management, A/B testing allows companies to compare two or more versions of a web page or app to find out which one performs better on metrics such as user interaction and conversions. By randomly dividing the audience into groups and showing each group a different option, A/B testing provides valuable data on what drives or suppresses user engagement.
When it comes to the value of testing, in the age of data, decisions based on assumptions can be unnecessarily risky. A/B testing offers a scientific approach to optimizing elements of a website or app, such as design, text, and call-to-action buttons. By testing different variations at the same time, companies can collect accurate data on which changes increase conversion or user engagement, allowing them to make informed management decisions and prioritize improvements [2,3].
1. Overview of A/B testing
A/B testing, also known as split testing, is a scientific method that involves comparing two or more versions of an element to determine which one is more effective. These elements can include various aspects of a product, from design features to component layout or an entire marketing campaign. The main goal is to collect data to reasonably decide which version of the product provides the greatest user response and achieves the desired results.
A/B testing provides invaluable information to support data-driven decision-making:
• A/B testing helps optimize marketing campaigns, website layouts, and calls to action, resulting in improvements across various metrics such as conversions, click-through rates, and user engagement.
• Testing helps minimize the risks associated with making changes that may adversely affect user behavior.
• A/B testing ensures that a culture of experimentation and iteration is created, allowing for continuous improvement of the product to achieve optimal results.
A/B testing can be the foundation of various stages of the product lifecycle, including:
• Idea generation and testing: Marketing campaigns, product features, and hypotheses about user behavior can all be tested using A/B testing.
• Product Launch: A/B testing can be used to optimize features before launch, ensuring that user interactions are tailored for success and engagement.
• Product Development: To find the best design elements, content, or layouts, A/B testing allows for multiple iterations on elements.
• Continuous Optimization: Continuous improvement in functionality, user visibility, and business metrics can be achieved through continuous A/B testing [4].
Figure 1 below shows an example of A/B testing.
Figure 1 - An example of A/B testing evaluation (visitors are randomly split between Option A and Option B)
To implement A/B testing in marketing campaigns, specialized tools and software solutions are usually used that automate data collection and analysis and allow for the necessary statistical analyses to confirm the reliability of the results [5, 6].
Next, let us consider existing Python libraries for A/B testing. They are of interest because Python provides a wide range of libraries for data analysis, statistical analysis, and machine learning [7].
SciPy is a large Python package that provides an extensive set of tools for scientific computing. Built on top of NumPy, it offers basic statistical functions such as t-tests as well as more complex mathematical routines that go beyond NumPy's capabilities [8]. Among the most significant functions of SciPy for statistics are:
1. Optimal quantization
2. Fourier transform and interpolation
3. Linear algebra
4. Signal and image processing
5. Numerical integration and optimization
StatsModels is another important Python library for statistical analysis, designed for building statistical models, performing hypothesis tests, and estimating model parameters. Based on SciPy and NumPy, it also integrates well with Pandas, making it flexible and powerful. Key features provided by StatsModels include:
1. Statistical tests and hypothesis testing
2. Generalized linear models
3. Ordinary least squares linear regression models
These libraries play a key role in statistical analysis in Python, providing a wide range of options for researchers and analysts [9].
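As a brief, hedged illustration of how these libraries are used in A/B testing practice, the sketch below applies Welch's t-test from scipy.stats to two simulated groups; the data and variable names are assumptions for illustration only.

import numpy as np
from scipy import stats

# Simulated per-user revenue for control (A) and treatment (B); illustrative data only
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=3.0, size=1000)
group_b = rng.normal(loc=10.3, scale=3.0, size=1000)

# Welch's t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t_stat:.3f}, p-value = {p_value:.4f}")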
CUPED method
CUPED (Controlled-experiment Using Pre-Experiment Data) is a technique used to enhance the efficiency and accuracy of A/B testing. The method improves the power of statistical tests by incorporating covariates, specifically pre-experiment data, to adjust post-experiment outcomes. CUPED is particularly beneficial in reducing variance and increasing the ability to detect smaller effects with greater precision.
A covariate is a metric correlated with the experiment metric that is independent of the experiment. The covariate is often taken as the values of the experiment metric calculated over the period before the experiment. Let us denote the values of the experiment metric as Y and the values of the covariate as X.
The essence of the CUPED method is the transition from the metric Y to the metric Y_cuped, which is calculated by the formula:

$$Y_{\text{cuped}} = Y - \theta X$$

where θ is a real-valued coefficient.
For each subject in both the control and experimental groups, the values of the target metric and the covariate are obtained. These values are then used to calculate the CUPED-adjusted metric for each subject. The adjusted metrics for both groups are subsequently used in statistical tests to evaluate the hypotheses.
The point estimate of the treatment effect is calculated as the difference in the mean values of the metric between the control and experimental groups. Assuming the random assignment of subjects to groups and the independence of the covariate from the experiment, replacing the raw metric with the CUPED metric ensures an unbiased estimate of the effect.
The sensitivity of the test increases as the variance decreases. The variance of the mean estimate when transitioning to the CUPED metric is given by:
$$V(\bar{Y}_{\text{cuped}}) = \frac{V(Y - \theta X)}{n} = \frac{V(Y) - 2\theta\,\mathrm{cov}(Y, X) + \theta^{2} V(X)}{n}$$
The variance depends quadratically on θ, forming a simple parabolic shape with a minimum at:

$$\theta_{0} = \frac{\mathrm{cov}(Y, X)}{V(X)}$$
The minimum variance is:

$$\min V(\bar{Y}_{\text{cuped}}) = V(\bar{Y})\,(1 - \rho^{2})$$

where

$$\rho = \frac{\mathrm{cov}(Y, X)}{\sqrt{V(Y)\,V(X)}}$$

is the correlation coefficient between the target metric Y and the covariate X.
The stronger the correlation between the covariate and the target metric, the greater the reduction in variance achieved through CUPED; for example, a correlation of ρ = 0.7 reduces the variance of the mean estimate by 49%.
CUPED Implementation Algorithm
1. Calculate the raw target metric Y for each subject in both control and experimental groups.
2. Calculate the covariate X for each subject in both groups.
3. Compute the adjustment coefficient θ using the formula:

$$\theta = \frac{\mathrm{cov}(Y, X)}{V(X)}$$
4. Calculate the CUPED metric for each subject using:

$$Y_{\text{cuped}} = Y - \theta X$$
5. Apply the statistical test to the CUPED metrics of the control and experimental groups.
Table 1
Calculation example
group covariate metric theta CUPED metric
0 4 5 0.5 3
1 5 7 0.5 4.5
1 6 8 0.5 5
0 7 7 0.5 3.5
0 8 7 0.5 3
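A minimal Python sketch of the algorithm above is shown below; the simulated data, the assumed treatment effect, and the use of a t-test on the adjusted metric are illustrative assumptions.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 10_000

# Covariate X: pre-experiment values of the metric; target metric Y is correlated with X
x_control = rng.normal(100, 20, n)
y_control = x_control + rng.normal(0, 10, n)
x_treatment = rng.normal(100, 20, n)
y_treatment = x_treatment + rng.normal(0, 10, n) + 1.0  # assumed true effect of +1

# theta = cov(Y, X) / V(X), estimated on the pooled data
x = np.concatenate([x_control, x_treatment])
y = np.concatenate([y_control, y_treatment])
theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)

# CUPED-adjusted metrics and the statistical test on them
y_cuped_control = y_control - theta * x_control
y_cuped_treatment = y_treatment - theta * x_treatment
print(stats.ttest_ind(y_cuped_treatment, y_cuped_control, equal_var=False))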
Bootstrap method
The Bootstrap method involves resampling with replacement, which is a type of Monte Carlo method. The process uses the raw data to randomly extract observations, calculate a statistic such as the mean, and return each sample to the original dataset. This is repeated many times to generate an estimate of the statistic with its variance. The quality of Bootstrap's performance is judged by the unimodality of the distribution of the resulting statistic and the minimal bias from the true value.
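A minimal sketch of this procedure for the mean of a skewed metric, using NumPy; the data and the number of resamples are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(4)
data = rng.exponential(scale=10.0, size=2_000)  # illustrative skewed metric

# Resample with replacement many times, recomputing the mean each time
boot_means = np.array([rng.choice(data, size=data.size, replace=True).mean()
                       for _ in range(5_000)])

print(boot_means.mean(), boot_means.std())        # point estimate and its standard error
print(np.quantile(boot_means, [0.025, 0.975]))    # 95% percentile confidence interval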
There are different approaches to data sampling: random and deterministic. Random sampling involves selecting values without a specific system, while deterministic sampling involves selecting values based on a predefined rule or interval, such as selecting every nth value. Although Bootstrap primarily uses random sampling, the bucket method is also applicable when processing big data, allowing you to compare the distribution of mean values across different segments.
Bootstrap confidence intervals are calculated without assuming normality of the data, making the method flexible and applicable even for skewed distributions. Taking quantiles of the bootstrap distribution at the α/2 and 1 − α/2 levels then provides grounds for rejecting or retaining the null hypothesis within the established confidence intervals.
Figure 2 - Bootstrap method [10].
When there is uncertainty in the methods for assessing statistical significance for a particular metric, Bootstrap becomes a useful tool. The problem with assessing the statistical significance of differences in ratio metrics using the t-test is the dependence on the observed data, which makes it impossible to directly use standard methods to calculate the sample variance and standard error of the mean. Bootstrap allows you to bypass this limitation.
To analyze ratio metrics using the Bootstrap method, data resampling must be performed, including randomly selecting users in each group and their respective signals, which form the numerator and denominator of the metric. The difference between the bootstrapped values of the ratio metrics is then computed repeatedly, allowing an empirical distribution of this difference to be generated. The mean of this distribution reflects the observed effect in the experiment, and its variability can be estimated through the empirical standard error. From this distribution, it is already possible to derive the desired p-value, which allows us to assess the statistical significance of the observed changes (Fig.3).
Figure 3 - Empirical distributions of ratio-metric differences (AOV_diff, CTR_diff, SessionLen_diff) obtained by Bootstrap [8].
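The procedure described above can be sketched in Python as follows; the click-through-rate-style ratio metric, the user-level data, and the number of resamples are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
n_users, n_boot = 5_000, 10_000

# Per-user numerator (clicks) and denominator (views) in each group; illustrative data
views_a = rng.poisson(20, n_users)
clicks_a = rng.binomial(views_a, 0.100)
views_b = rng.poisson(20, n_users)
clicks_b = rng.binomial(views_b, 0.103)

def ratio(clicks, views, idx):
    # Ratio metric recomputed on a resampled set of users
    return clicks[idx].sum() / views[idx].sum()

diffs = np.empty(n_boot)
for i in range(n_boot):
    idx_a = rng.integers(0, n_users, n_users)  # resample users with replacement
    idx_b = rng.integers(0, n_users, n_users)
    diffs[i] = ratio(clicks_b, views_b, idx_b) - ratio(clicks_a, views_a, idx_a)

# Empirical distribution of the difference: effect estimate, confidence interval, p-value
print(diffs.mean(), np.quantile(diffs, [0.025, 0.975]))
p_value = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
print(p_value)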
Another variation is the delta method, which in essence does the same thing as Bootstrap, only analytically, through a formula, rather than empirically [11]. Linearization is often considered even more practical than the delta method because it replaces the ratio metric with an equivalent per-user (linear) metric, to which standard significance tests can be applied directly, providing reliable results in many cases.
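As a hedged sketch of linearization for a ratio metric, the code below replaces each user's contribution with the linearized metric L = clicks − k·views, where k is taken as the control group's global ratio (a common formulation, assumed here since the paper does not spell it out), and then applies an ordinary t-test.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_users = 5_000
views_a = rng.poisson(20, n_users)
clicks_a = rng.binomial(views_a, 0.100)
views_b = rng.poisson(20, n_users)
clicks_b = rng.binomial(views_b, 0.103)

# Coefficient k: the control group's global ratio (assumed formulation)
k = clicks_a.sum() / views_a.sum()

# Linearized per-user metric: L = clicks - k * views
lin_a = clicks_a - k * views_a
lin_b = clicks_b - k * views_b

# Standard user-level test applied to the linearized metric
print(stats.ttest_ind(lin_b, lin_a, equal_var=False))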
2. Methods for accelerating A/B testing
2.1 Parallel Testing
Parallel testing, also known as concurrent testing, is a methodological approach in A/B testing that allows multiple experiments to run simultaneously. This strategy can significantly accelerate the testing process by enabling the evaluation of various hypotheses at the same time, rather than sequentially. Implementing parallel testing requires careful consideration of statistical dependencies and potential interactions between experiments to avoid confounding effects.
Parallel testing operates on the principle of running multiple experiments concurrently, which can drastically reduce the total time required to reach conclusions. By leveraging the ability to test several variants at once, organizations can optimize their decision-making processes and expedite the deployment of effective changes. The primary benefits of parallel testing include:
1. Efficiency: Reduces the overall duration needed to complete multiple tests, allowing for faster iteration and implementation of improvements.
2. Resource Optimization: Utilizes existing traffic more effectively, thereby maximizing the use of available data.
3. Comprehensive Insights: Provides a broader understanding of user behavior and interaction patterns by examining multiple variables simultaneously.
Implementation of Parallel Testing
To implement parallel testing effectively, several critical steps must be followed:
1. Experimental Design: Carefully design each experiment to ensure they are independent of each other. This involves selecting non-overlapping user segments for each test to avoid interaction effects.
2. Statistical Considerations: Employ advanced statistical techniques to manage the potential complexities of parallel testing. Techniques such as multi-level modeling or the use of interaction terms in regression analysis can help account for potential dependencies between tests.
3. Data Collection and Analysis: Utilize robust data collection frameworks to capture and analyze results from multiple experiments simultaneously. Tools like R or Python, along with specialized libraries, can facilitate the handling of complex datasets.
Practical Example
Consider a scenario where an e-commerce platform wants to test the impact of different homepage layouts (Layout A, B, and C) and various discount strategies (10%, 20%, and 30%). Instead of running these tests sequentially, the platform can implement parallel testing by dividing the user base into distinct segments for each combination of layout and discount strategy. This approach is particularly viable for companies with a large customer base, as testing multiple variations simultaneously requires a substantial amount of traffic to ensure statistically significant results.
Example Code: Implementing Parallel Testing in Python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Reconstructed from the garbled listing: each user is observed under every layout/discount
# combination (AnovaRM requires a balanced within-subject design); effect sizes are assumptions
users = np.arange(300)
layouts = np.array(['A', 'B', 'C'])
discounts = np.array([10, 20, 30])
df = pd.MultiIndex.from_product([users, layouts, discounts],
                                names=['user_id', 'layout', 'discount']).to_frame(index=False)

# Baseline conversion 0.1 shifted by layout; the truncated '+ 0.03 *' term is assumed to refer to layout 'C'
p = 0.1 - 0.02 * (df['layout'] == 'B') + 0.03 * (df['layout'] == 'C')
df['conversion'] = np.random.binomial(1, p)

anova = AnovaRM(df, 'conversion', 'user_id', within=['layout', 'discount']).fit()
print(anova)
Visual Representation: Interaction Plot
import seaborn as sns
import matplotlib.pyplot as plt

# Interaction plot: mean conversion by discount, one line per layout
sns.pointplot(data=df, x='discount', y='conversion', hue='layout', ci=None)
plt.show()
The example provided demonstrates the practical application of parallel testing using Python, highlighting its potential to drive efficiency and deeper insights in A/B testing endeavors.
2.2 Preliminary Analysis and Forecasting
Building on the efficiency gains achieved through parallel testing, another crucial strategy for improving and accelerating A/B testing is preliminary analysis and forecasting. This approach involves leveraging historical data and advanced analytical techniques to predict outcomes, thereby streamlining the testing process. By implementing these methods, organizations can make more informed decisions, reduce testing durations, and allocate resources more effectively.
Preliminary analysis entails a thorough examination of existing data before launching new experiments. This step helps identify patterns, trends, and potential issues that could impact the outcomes of A/B tests. Forecasting uses statistical and machine learning models to predict the results of experiments based on historical data and early results. The primary benefits include:
1. Time Reduction: By predicting outcomes, testing durations can be shortened as fewer samples may be needed to reach significant conclusions.
2. Resource Efficiency: Resources can be better allocated to the most promising variants, reducing waste and focusing efforts on impactful changes.
3. Enhanced Decision-Making: Preliminary insights allow for more strategic planning and prioritization of experiments, improving overall decision-making processes.
Implementation of Preliminary Analysis and Forecasting
Implementing preliminary analysis and forecasting involves several key steps:
1. Data Collection and Cleaning: Gather comprehensive historical data and ensure it is clean and well-organized. This data will serve as the foundation for analysis and model training.
2. Exploratory Data Analysis (EDA): Conduct EDA to uncover patterns, correlations, and anomalies. Techniques such as visualizations, summary statistics, and correlation matrices are useful here.
3. Model Selection and Training: Choose appropriate statistical or machine learning models for forecasting. Common models include ARIMA, exponential smoothing, and machine learning algorithms like random forests and gradient boosting.
4. Validation and Testing: Validate models using a portion of historical data to ensure accuracy. This involves splitting the data into training and testing sets and evaluating model performance using metrics like mean squared error (MSE) or R-squared.
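As a hedged sketch of step 4, the code below holds out the last 30 days of an illustrative historical series, fits exponential smoothing on the remainder, and scores the forecast with mean squared error; the series, model choice, and horizon are assumptions.

import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_squared_error

# Illustrative daily series; in practice this is the cleaned historical data from step 1
rng = np.random.default_rng(5)
series = pd.Series(200 + np.linspace(0, 50, 365) + rng.normal(0, 10, 365),
                   index=pd.date_range('2022-01-01', periods=365, freq='D'))

train, test = series[:-30], series[-30:]          # hold out the last 30 days
model = ExponentialSmoothing(train, trend='add').fit()
forecast = model.forecast(30)
print("MSE:", mean_squared_error(test, forecast))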
Practical Example
Consider an e-commerce platform aiming to forecast the impact of various promotional strategies on sales. The platform can use historical sales data and early results from ongoing tests to predict outcomes and adjust strategies accordingly.
Example Code: Implementing Forecasting in Python

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Simulated daily sales (reconstructed from the garbled listing): Poisson noise around 200 with a linear trend
date_range = pd.date_range(start='2022-01-01', periods=365, freq='D')
sales = np.random.poisson(lam=200, size=365) - np.linspace(10, 50, 365)  # trend
df_sales = pd.DataFrame({'date': date_range, 'sales': sales}).set_index('date')

# Fit a seasonal ARIMA model and forecast the next 30 days
model = SARIMAX(df_sales['sales'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 12))
results = model.fit(disp=False)
forecast_steps = 30
forecast = results.get_forecast(steps=forecast_steps)
forecast_ci = forecast.conf_int()
forecast_index = pd.date_range(start=df_sales.index[-1], periods=forecast_steps + 1, freq='D')[1:]

# Plot historical sales, the point forecast, and its confidence band
plt.plot(df_sales.index, df_sales['sales'], label='Historical Sales')
plt.plot(forecast_index, forecast.predicted_mean, label='Forecasted Sales')
plt.fill_between(forecast_index, forecast_ci.iloc[:, 0], forecast_ci.iloc[:, 1], color='pink', alpha=0.3)
plt.legend()
plt.show()
The practical example demonstrates the application of forecasting techniques using Python, highlighting their potential to streamline A/B testing and drive more efficient outcomes. Through the integration of these advanced methodologies, businesses can stay ahead in the competitive landscape by rapidly iterating and implementing effective changes based on reliable forecasts.
Conclusion
The study confirms the significant impact of applying advanced statistical methods on the effectiveness of A/B testing. Methods such as Bootstrap and CUPED not only speed up the process of obtaining results but also significantly improve their reliability by reducing variability. While Bootstrap achieves this through resampling
techniques, CUPED leverages pre-experiment data to adjust post-experiment outcomes. The introduction of stratification, the delta method, and linearization for ratio metrics contributes to more precise data analysis, which is indispensable for optimizing key business processes and improving metrics and performance indicators. Thus, the refinement of A/B testing techniques opens up new perspectives for making product development more adaptable to the rapidly changing requirements of the digital economy.
References
1. Rongrong Zhang. Product market competition, competitive strategy, and analyst coverage // Review of Quantitative Finance and Accounting. 2018. №50. pp. 239-260.
2. E. Claeys, P. Gançarski, M. Maumy-Bertrand and H. Wassner. Dynamic Allocation Optimization in A/B-Tests Using Classification-Based Preprocessing // IEEE Transactions on Knowledge and Data Engineering. №35. pp. 335-349.
3. Steven Ritter, Neil Heffernan, Joseph Jay Williams, Derek Lomas, Ben Motz, Debshila Basu Mallick, Klinton Bicknell, Danielle McNamara, Rene F. Kizilcec, Jeremy Roschelle, Richard Baraniuk, and Ryan Baker. Third Annual Workshop on A/B Testing and Platform-Enabled Learning Research // Proceedings of the Ninth ACM Conference on Learning @ Scale. 2022. pp. 252-254.
4. Ostrow K.S., Heffernan N.T., Williams J.J. Tomorrow's EdTech Today: Establishing a Learning Platform as a Collaborative Research Tool for Sound Science // Teachers College Record. 2017. №3. pp. 1-36.
5. Federico Quin, Danny Weyns , Matthias Galster, Camila Costa Silva A/B testing: A systematic literature review // Journal of Systems and Software. 2024. №211. pp. 1-28.
6. Aharon, Michal, Somekh, Oren, Shahar, Avi, Singer, Assaf, Trayvas, Baruch, Vogel, Hadas, Dobrev, Dobri, 2019b. Carousel ads optimization in yahoo gemini native. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD '19, Association for Computing Machinery, New York, NY, USA. pp. 1993-2001.
7. A/B testing. [Electronic resource] Access mode: https://tracker.my.com/blog/204/5-lajfhakov-dlya-uskoreniya-a-b-testirovaniya-ot-analitikov-mytracker?lang=ru (accessed 8.05.2024).
8. Borisyuk, Fedor, Malreddy, Siddarth, Mei, Jun, Liu, Yiqun, Liu, Xiaoyi, Maheshwari, Piyush, Bell, Anthony, Rangadurai, Kaushik, 2021. VisRel: Media search at scale. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. KDD '21, Association for Computing Machinery, New York, NY, USA, pp. 2584-2592.
9. Statistics with SciPy, Statsmodels, and Pingouin. [Electronic resource] Access mode: https://pythontutorials.eu/numerical/statistics/ (accessed 8.05.2024).
10. Chen, Guangde, Chen, Bee-Chung, Agarwal, Deepak, 2017a. Social incentive optimization in online social networks. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. WSDM '17, Association for Computing Machinery, New York, NY, USA, pp. 547-556.
11. Chen, Russell, Chen, Miao, Jadav, Mahendrasinh Ramsinh, Bae, Joonsuk, Matheson, Don, 2017b. Faster online experimentation by eliminating traditional A/A validation. In: 2017 IEEE International Conference on Big Data (Big Data). pp. 1635-1641.
12. Deng, Alex, Li, Yicheng, Lu, Jiannan, Ramamurthy, Vivek, 2021. On post-selection inference in A/B testing. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. KDD '21, Association for Computing Machinery, New York, NY, USA, pp. 2743-2752.
© Chumachenko A.A., 2024