Randomized controlled experiments, or A/B tests, are considered the “gold standard” for assessing the probable effects of customer service or marketing interventions. One issue, though, is that interventions that look optimal in isolated one-shot experiments may be sub-optimal once their interdependence across the multiple touchpoints of the customer journey is taken into account. In this study, Yicheng Song and Tianshu Sun developed and tested a Reinforcement Learning (RL) model that integrates multiple historical experiments in order to optimize interventions holistically and to guide future intervention trials for further learning.
The researchers worked with a US e-commerce platform that pioneered the use of randomized controlled experiments to develop a unique Bayesian Deep Recurrent Q-Network (BDRQN) model. The model leverages interventions from multiple experiments to learn their effectiveness at different stages of the customer journey. It not only identifies the long-term reward of each intervention but also estimates the distribution of those rewards, which can guide participant allocation in future intervention trials so as to balance the exploitation of current profit against the exploration of new learning.
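The paper specifies the exact architecture; purely as an illustration, the sketch below shows one common way to build a recurrent Q-network with an approximate-Bayesian head, assuming PyTorch, a GRU encoder over the clickstream sequence, and Monte Carlo dropout to obtain a distribution over Q-values. All module choices and dimensions here are assumptions, not the authors' specification.

```python
# Minimal sketch of a recurrent Q-network with a Bayesian (MC-dropout) head.
# Assumptions not in the source: PyTorch, a GRU encoder, MC dropout as the
# approximate-Bayesian mechanism, and illustrative dimensions.
import torch
import torch.nn as nn

class BayesianRecurrentQNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden_dim: int = 64, p_drop: float = 0.2):
        super().__init__()
        self.encoder = nn.GRU(state_dim, hidden_dim, batch_first=True)  # summarizes the journey so far
        self.dropout = nn.Dropout(p_drop)                               # kept active at inference for MC sampling
        self.q_head = nn.Linear(hidden_dim, n_actions)                  # one Q-value per candidate intervention

    def forward(self, journey: torch.Tensor) -> torch.Tensor:
        # journey: (batch, n_touchpoints, state_dim) sequence of clickstream features
        _, h = self.encoder(journey)
        return self.q_head(self.dropout(h.squeeze(0)))

    @torch.no_grad()
    def q_distribution(self, journey: torch.Tensor, n_samples: int = 50) -> torch.Tensor:
        # Repeated stochastic forward passes approximate a distribution over Q-values.
        self.train()  # keep dropout active
        return torch.stack([self.forward(journey) for _ in range(n_samples)])
```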
This study integrates multiple experiments within the Reinforcement Learning (RL) framework in order to tackle questions that cannot be answered by standalone one-shot experiments: How can we learn an optimal policy for a sequence of interventions along the customer journey by ensembling exogenous interventions from multiple historical experiments? And how can we use multiple historical experiments to guide future intervention trials that further improve the learned policy?
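As a hedged sketch of the first question, the function below pools logged (state, action, reward, next state) transitions from historical experiments and applies a standard off-policy temporal-difference update. The batch field names, discount factor, and loss are illustrative assumptions, not the paper's exact training procedure.

```python
# Off-policy TD update over transitions pooled from multiple historical experiments.
# Field names and hyperparameters are hypothetical.
import torch
import torch.nn.functional as F

def td_update(q_net, target_net, optimizer, batch, gamma: float = 0.99) -> float:
    # batch: transitions pooled across all historical experiments, e.g.
    #   states/next_states: (B, T, state_dim) journey features,
    #   actions: (B,) int64 intervention ids, rewards/dones: (B,) floats.
    q_pred = q_net(batch["states"]).gather(1, batch["actions"].unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(batch["next_states"]).max(dim=1).values
        target = batch["rewards"] + gamma * (1.0 - batch["dones"]) * q_next
    loss = F.smooth_l1_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```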
The researchers took a clickstream dataset of nearly 150,000 users (across ten historical experiments) and divided it into a training set, used to develop a “reinforcement learning agent,” and a holdout set. The interventions chosen as optimal in each one-off experiment provide benchmarks against which to assess the effectiveness of those optimized holistically by the BDRQN model.
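For journey-level data, such a split is most naturally done at the user level, so that each user's full journey falls entirely into either the training or the holdout partition. The helper below is a hypothetical sketch of that idea; the paper's actual partitioning scheme is not described here.

```python
# Hypothetical user-level train/holdout split; fraction and seed are illustrative.
import random

def split_users(user_ids, holdout_frac: float = 0.2, seed: int = 0):
    ids = sorted(set(user_ids))
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * (1 - holdout_frac))
    return set(ids[:cut]), set(ids[cut:])  # (training users, holdout users)
```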
While holdout data is typically used to evaluate the predictive accuracy of a model, in this case the researchers used it to simulate how the reinforcement learning agent would evolve when following future interventions recommended by different strategies. This allowed them to demonstrate that their model can guide sample allocation in future intervention trials so as to balance exploiting known, promising actions against exploring actions with the potential to further improve the model.
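One simple way to turn an estimated reward distribution into an allocation rule is Thompson-style sampling: estimate, for each candidate intervention, the posterior probability that it is best, and allocate future trial participants in proportion to those probabilities. The sketch below assumes the BayesianRecurrentQNet from the earlier example and illustrates the general idea, not the paper's exact allocation mechanism.

```python
# Thompson-style allocation weights from posterior Q-value samples (illustrative).
import torch

def allocation_probabilities(q_net, journey: torch.Tensor, n_samples: int = 200) -> torch.Tensor:
    # journey: (n_touchpoints, state_dim) clickstream features for a single user.
    # Draw posterior Q-value samples and count how often each intervention wins;
    # the resulting frequencies can serve as traffic-allocation weights.
    q_samples = q_net.q_distribution(journey.unsqueeze(0), n_samples=n_samples)  # (S, 1, A)
    best = q_samples.squeeze(1).argmax(dim=1)                                    # (S,)
    counts = torch.bincount(best, minlength=q_samples.shape[-1]).float()
    return counts / counts.sum()
```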
Results show that adopting the proposed model to learn policy from historical experiments leads to a 7.3% to 43% improvement in reward (i.e., profit) per episode for the platform.
Read the full white paper.