A little over a year has passed since the previous CODE conference (Conference on Digital Experimentation @ MIT, for the uninitiated). Boston is a bit colder, the air crisper, and the conference slightly bigger than last year. Here are some of my personal take-aways and highlights, with the caveat that there were many parallel sessions, so I was only able to sample some of the content.
At a macro level, there are some clear trends compared to last year. The obvious one, although I’ll leave it alone in this post in favor of core experimentation topics, is the interest in LLMs, which garnered a fireside chat, parallel talks, and posters. The continued growth of CODE points to rising interest in experimentation, and I also got the feeling that a more diverse group of industry folks showed up. The majority still comes from the usual suspects: Meta, Microsoft, Netflix, Amazon, etc. However, there are also more participants from less well-known and up-and-coming companies.
Regarding topics, interference of various kinds remains a core staple of the conference. Surrogate modeling received more attention than last year, with a stronger focus on practical issues over theory. Always-valid inference and bandits were also prominently featured. On the other hand, I got the sense that heterogeneous treatment effect estimation has taken more of a back seat this year; perhaps it is simply regressing towards the mean.
Let’s dive into some of the talks from the parallel sessions.
Often, we are interested in estimating the long-term outcomes of experiments, but are not patient enough to wait for those to materialize. In theory, surrogate outcomes provide a useful way to estimate long-term outcomes based on short-term results. However, the assumptions that underlie this technique can often be questioned in practice. The surrogates session focused on practical applications, featuring talks from Meta and Netflix.
Kenneth Hung and Michael Gill propose using surrogates as filters on experiments rather than as estimators of treatment effects. We run experiments in two stages. In the first stage, we focus on quickly finding promising candidates based on surrogate metrics. Only the promising experiments move to a second, longer stage, where we can then focus on estimating the long-term goal metrics. This way, we can still speed up the experiment duration for the many filtered (presumably null) experiments while explicitly validating the long-term effects of the “winners” without relying on surrogacy assumptions.
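To make the gating step concrete, here is a minimal sketch of what such a stage-one filter could look like. The structure, names, and threshold are my own illustration under simple assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a stage-one surrogate filter; names, threshold, and
# structure are illustrative, not the authors' implementation.
from dataclasses import dataclass

@dataclass
class StageOneResult:
    experiment_id: str
    surrogate_effect: float  # estimated effect on the short-term surrogate metric
    surrogate_se: float      # standard error of that estimate

def passes_stage_one(result: StageOneResult, z_threshold: float = 1.64) -> bool:
    """Stage 1: keep an experiment only if the surrogate shows a promising signal."""
    return result.surrogate_effect / result.surrogate_se > z_threshold

stage_one = [
    StageOneResult("exp_a", surrogate_effect=0.8, surrogate_se=0.3),
    StageOneResult("exp_b", surrogate_effect=0.1, surrogate_se=0.3),
]

# Only the "winners" move on to a longer second stage, where the long-term goal
# metric is measured directly and no surrogacy assumption is needed.
promising = [r.experiment_id for r in stage_one if passes_stage_one(r)]
print(promising)  # ['exp_a']
```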
Michael Zhao presented work evaluating “auto-surrogates” on 200 experiments at Netflix. An “auto-surrogate” (think auto-regressive models) is simply the long-term metric (e.g. 90-day watch hours) truncated to the observed period (e.g. after 14 days of the experiment, we can observe 14-day watch hours). While auto-surrogates likely do not satisfy the assumptions required for validity, empirical evaluation on past experiments indicates that they work well in practice. This is great because they are easy to formulate and may pave a path towards automated surrogate analysis. (Link to the paper)
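As a rough illustration of the idea (my own sketch with made-up numbers, not Netflix's analysis), one could fit the mapping from the truncated metric to the long-term metric on historical data and use it to translate a new experiment's short-term observations into a predicted long-term effect:

```python
# Rough sketch of an auto-surrogate style analysis; all numbers are invented.
import numpy as np

rng = np.random.default_rng(0)

# Historical data where both horizons are observed.
hist_14d = rng.gamma(2.0, 5.0, size=5_000)
hist_90d = 5.5 * hist_14d + rng.normal(0, 10, size=5_000)

# Fit the surrogate model (here just a linear fit).
slope, intercept = np.polyfit(hist_14d, hist_90d, deg=1)

# New experiment: only 14 days of data are available so far.
treat_14d = rng.gamma(2.1, 5.0, size=2_000)
ctrl_14d = rng.gamma(2.0, 5.0, size=2_000)

# Predicted 90-day outcomes per arm, and the implied long-term treatment effect.
pred_effect = (slope * treat_14d + intercept).mean() - (slope * ctrl_14d + intercept).mean()
print(f"predicted 90-day effect: {pred_effect:.2f} watch hours")
```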
The goal of always-valid inference is to provide sound statistical analysis when we continuously monitor results as data trickles in. This avoids the peeking problem and allows for adaptive decision making. It has seen a lot of research interest in the last couple of years. Arguably, there was less content this year than last, with a noticeable shift in focus towards more practical aspects.
The session on always-valid inference covered a variety of topics. Biyonka Liang tells us how to adapt anytime-valid inference to the multi-armed bandit setting. The key insight is that traditional always-valid inference approaches require assignment probabilities to stay away from zero and one. However, this is violated by many bandit algorithms, such as UCB and Thompson sampling. We can solve the conundrum by adding an appropriate amount of additional exploration. This approach offers the best of both worlds: it allows for early stopping (as if actions are being randomized) while minimizing regret (as if a bandit algorithm is being run).
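A minimal sketch of that key idea, assuming Bernoulli rewards and a two-armed Thompson sampler; the exploration floor `eps` and all other numbers are illustrative rather than taken from the paper.

```python
# Thompson sampling with assignment probabilities clipped away from 0 and 1,
# so that anytime-valid inference remains applicable. Illustrative sketch only.
import numpy as np

rng = np.random.default_rng(1)
true_rates = [0.05, 0.07]   # unknown conversion rates of the two arms
alpha = np.ones(2)          # Beta posterior: 1 + successes per arm
beta = np.ones(2)           # Beta posterior: 1 + failures per arm
eps = 0.1                   # exploration floor: P(arm 1) stays inside [eps, 1 - eps]

for t in range(10_000):
    # Thompson sampling probability of choosing arm 1, approximated by simulation.
    draws = rng.beta(alpha, beta, size=(500, 2))
    p_arm1 = np.clip((draws[:, 1] > draws[:, 0]).mean(), eps, 1 - eps)

    arm = int(rng.random() < p_arm1)
    reward = int(rng.random() < true_rates[arm])
    alpha[arm] += reward
    beta[arm] += 1 - reward

# The logged (clipped) assignment probabilities can then be fed into an
# anytime-valid confidence sequence for the difference between the arms.
```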
From a practical perspective, Daniel Beasley discussed how Vinted uses e-values, finding that they generally allow tests to be stopped sooner than with classical statistical methods. Note that in this scenario, we stop the sequential test when it reaches the sample size corresponding to 80% power for detecting the relevant MDE. The intuition is that, while in the worst case the sequential test stops later than the fixed-horizon test, it stops much sooner sufficiently often. This latter effect turns out to outweigh the former.
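To make this concrete, here is a hedged sketch of an e-value-based sequential test with such a sample-size cap. It uses a simple likelihood-ratio martingale for a single proportion (H0: p = p0 against the MDE alternative p1 = p0 + mde), which is not Vinted's actual methodology, and all parameters are invented.

```python
# Sequential test: stop when the e-value exceeds 1/alpha, or at the fixed-horizon
# sample size for ~80% power at the MDE, whichever comes first.
import numpy as np
from scipy.stats import norm

p0, mde, alpha_level = 0.10, 0.01, 0.05
p1 = p0 + mde

# Fixed-horizon sample size for ~80% power (one-sided, normal approximation).
z_a, z_b = norm.ppf(1 - alpha_level), norm.ppf(0.80)
n_max = int(((z_a * np.sqrt(p0 * (1 - p0)) + z_b * np.sqrt(p1 * (1 - p1))) / mde) ** 2)

rng = np.random.default_rng(2)
e_value, n = 1.0, 0
while n < n_max:
    x = rng.binomial(1, 0.115)  # data actually drawn from p = 0.115
    e_value *= (p1 ** x * (1 - p1) ** (1 - x)) / (p0 ** x * (1 - p0) ** (1 - x))
    n += 1
    if e_value >= 1 / alpha_level:  # anytime-valid rejection of H0: stop early
        break

print(f"stopped after n={n} (cap {n_max}), e-value={e_value:.1f}")
```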
The bandit session, unsurprisingly, leaned heavily on theory and covered a diverse range of topics.
While not directly related to bandits, Jinglong Zhao considers the experiment setting where the variance of a metric differs between treatment and control. In that case, a 50/50 traffic split is not statistically optimal: we want to assign more users to the group with higher variance. However, we do not know the variances in advance. Instead, we can use a multi-stage process to first estimate the variances and then adjust the allocation in later stages. In practice, we might also be concerned with the point estimates themselves, particularly with the aim of avoiding an increased allocation to a less effective variant. Nonetheless, I enjoyed two take-aways: we only have to run the experiment in two stages, and the first stage (to estimate the variance) should be of length $\sqrt{N}$.
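A small sketch of those two take-aways combined, assuming normally distributed outcomes; this is my illustration of the two-stage idea with Neyman allocation, not the paper's algorithm.

```python
# Stage 1 (length ~sqrt(N), 50/50 split) estimates per-arm variances; stage 2
# allocates the remaining traffic proportionally to the estimated std devs.
import numpy as np

rng = np.random.default_rng(3)
N = 100_000
n_stage1 = int(np.sqrt(N))  # first stage of length sqrt(N)

# Stage 1: 50/50 split, used only to estimate the variances.
y_treat1 = rng.normal(1.0, 3.0, size=n_stage1 // 2)  # treatment has higher variance
y_ctrl1 = rng.normal(1.0, 1.0, size=n_stage1 // 2)
sd_t, sd_c = y_treat1.std(ddof=1), y_ctrl1.std(ddof=1)

# Stage 2: Neyman allocation, i.e. traffic shares proportional to the std devs.
share_treat = sd_t / (sd_t + sd_c)
n_stage2 = N - n_stage1
print(f"stage-2 treatment share: {share_treat:.2f} of {n_stage2} remaining users")
```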
Lalit Jain and others from Amazon presented work on combining encouragement designs with bandits. Sometimes, it is not feasible to run an experiment directly. Consider the example of membership tiers: what if we are interested in determining the most effective overall membership tier? We cannot randomize users into membership tiers, and we also cannot directly compare outcomes for users in each tier due to selection bias. In an encouragement design, treatment is not randomly assigned; instead, users are randomly encouraged to take it. This can then be used to estimate the impact of the treatment (under stringent IV assumptions). This paper shows that we can combine an encouragement design with a multi-armed bandit approach to get adaptive encouragement.
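Leaving the adaptive layer aside, the estimate behind a basic encouragement design reduces to the familiar IV (Wald) ratio: the intention-to-treat effect on the outcome divided by the effect of the encouragement on uptake. The sketch below is a generic illustration with made-up variable names, not the paper's estimator.

```python
# Generic Wald/IV estimate for a single binary encouragement; illustrative only.
import numpy as np

def wald_estimate(encouraged: np.ndarray, took_tier: np.ndarray, outcome: np.ndarray) -> float:
    """LATE = (E[Y | encouraged] - E[Y | not]) / (E[D | encouraged] - E[D | not])."""
    enc = encouraged == 1
    itt_outcome = outcome[enc].mean() - outcome[~enc].mean()
    itt_uptake = took_tier[enc].mean() - took_tier[~enc].mean()
    return itt_outcome / itt_uptake
```

The paper's contribution, as I understood it, is layering adaptive (bandit-driven) encouragement on top of this kind of estimate.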
One of my personal favorite sessions was on experimentation in markets. Marketplaces pose unique challenges that make experimentation harder. Because the market connects units on both sides, indirect effects are often at least as important as the direct effects. This session did a great job highlighting the diversity of problems induced by markets.
To begin, Thu Le presented work on lead-day bias in pricing experiments at Airbnb, a topic we also recently discussed in our statistics reading group. They observed that pricing experiments frequently yielded puzzling outcomes, with outcome metrics changing course over extended periods of time. Two effects make the early results of such an experiment biased and unreliable:
When we start a pricing experiment, suddenly change prices for some listings, and compare outcomes with control, we ignore the fact that those listings were previously priced under the control policy, which induces a selection bias in the available inventory.
Early on in an experiment, metrics favor variations that sell through inventory faster, even though in the long term that might hurt revenue (e.g. because the prices are too low).
In addition to outlining the causes of the bias, they also discussed strategies to mitigate this bias and maintain the pace of experimentation.
Another presentation that stood out to me was Jessica Fong’s work on estimating the effectiveness of suggested prices on Mercari. Mercari is an online marketplace focused on novice sellers. First, they note that new sellers on average set prices 15% higher than experienced sellers. To help sellers with pricing, Mercari shows a “suggested price”, and an experiment randomizes that suggestion to be either higher or lower. The findings show that suggested prices indeed affect both new and existing sellers, as well as their revenue. However, potential interference effects could distort these results, and trying to correct for that is an interesting challenge.
Spillover effects can severely affect experiment results. However, adjusting for them comes at a steep cost in statistical power, putting experimenters in a conundrum. It can also be tricky to convince stakeholders to alter the experimental design without proof that spillover effects are indeed an important factor at play.
Stefan Hut from Amazon presented how they go about quantifying spillover effects in historical (non-clustered) experiments. First, find an appropriate clustering, and then compute the within-cluster treatment intensity: for an item, how many of the other items in its cluster are part of the treatment group? We can then estimate whether the intensity of the treatment has an impact on the empirical outcome of an item. A downside of this approach is that only small clusters will see much diversity in treatment intensities; for larger clusters the treatment intensity converges to 50% (or whatever the allocation split is). Nonetheless, the authors find that this method can detect spillover effects, which can then be used as an argument to rerun the experiment with a cluster-randomized design.
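In code, the intensity check might look roughly like the sketch below (my own notation and a plain OLS, not Amazon's implementation): compute each item's leave-one-out share of treated items in its cluster, then test whether that share predicts the outcome on top of the item's own treatment status.

```python
# Within-cluster treatment intensity and a simple spillover regression.
import numpy as np

def treatment_intensity(cluster_ids: np.ndarray, treated: np.ndarray) -> np.ndarray:
    """Fraction of *other* items in the same cluster assigned to treatment."""
    intensity = np.empty(len(treated), dtype=float)
    for c in np.unique(cluster_ids):
        idx = np.where(cluster_ids == c)[0]
        n_treated = treated[idx].sum()
        # leave-one-out share: exclude the item itself
        intensity[idx] = (n_treated - treated[idx]) / max(len(idx) - 1, 1)
    return intensity

def spillover_regression(outcome: np.ndarray, treated: np.ndarray, intensity: np.ndarray) -> np.ndarray:
    """OLS of outcome on [intercept, own treatment, within-cluster intensity]."""
    X = np.column_stack([np.ones(len(outcome)), treated, intensity])
    coefs, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return coefs  # coefs[2] is the spillover (intensity) coefficient
```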
The final session I attended at this year’s CODE focused on interference. A main theme that stood out across multiple talks is a move away from insisting on unbiased point estimates (often at a severe cost in variance) towards carefully weighing the bias-variance trade-off. Sometimes we should choose to incur some bias when it lowers variance significantly.
Hannah Li emphasized that interference can affect not only point estimates but also variance, which is often underestimated due to negative correlation between units. A study based on historical Airbnb data suggests that the impact on variance is not as significant as the bias in point estimates. (A longer paper is forthcoming, but earlier work on the same topic can be found here.)
Building on this theme, Liang Shi from Meta discussed the trade-off between bias and variance in the context of network interference, specifically when the interference graph is misspecified. The interference graph indicates whether two nodes (think two users) affect each other’s outcome (interference). For example, you might posit that friends on Facebook cause interference effects, but users who are not friends do not. In this case the interference graph would be the friendship graph.
Naturally, we do not have access to the true underlying interference graph, which leads to two main issues:
Missing edges in the graph can lead to a bias in estimates (toward the direct effect) and an underestimation of the variance.
Inclusion of irrelevant edges in the graph can lead to an increase in variance, which is equivalent to a loss in statistical power.
From a practical standpoint, this can help steer experimenters towards selecting interference graphs that properly trade off bias and variance rather than solely focusing on unbiased estimates.
Progress is being made on multiple fronts, and in particular it’s been exciting to hear practical advancements coming out of industry. While this article has focused on talks from parallel sessions, it's worth noting that there were some excellent plenary talks and panels that are now available on YouTube. In particular, the practitioner's panel and Martin Tingley's talk, which focused on the lessons learned from a century of experimentation, were highlights and well worth the watch.
If you also attended CODE, I would be curious to hear your highlights. If you did not attend, hopefully this article inspires you to join us in Boston next year!