Online Alternatives Testing, with Contexts

Through several earlier posts on this blog, I outlined an online learning algorithm for maximizing click-through rates on landing pages. The strategy turns out to be an old one, from the 1930s in fact, called Thompson sampling. Despite its age and simplicity, it works quite well, and the concurrent data structures / multi-process approach has shown good results in real-world testing!
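As a quick refresher on that strategy, here's a minimal sketch of Beta-Bernoulli Thompson sampling for two page variants (the variable names are mine, not from the earlier posts):

```python
import random

# One Beta(successes + 1, failures + 1) posterior per landing-page variant.
counts = {"A": {"success": 0, "failure": 0},
          "B": {"success": 0, "failure": 0}}

def choose_variant():
    # Thompson sampling: draw a plausible click-through rate from each
    # variant's posterior and serve the variant with the highest draw.
    draws = {v: random.betavariate(c["success"] + 1, c["failure"] + 1)
             for v, c in counts.items()}
    return max(draws, key=draws.get)

def record(variant, clicked):
    # Fold the observed outcome back into that variant's posterior.
    counts[variant]["success" if clicked else "failure"] += 1
```

Early on, the wide posteriors keep both variants in rotation; as evidence accumulates, traffic concentrates on the winner without any explicit explore/exploit schedule.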

Some gaps have emerged, though, that really expand the problem and make things a bit more complicated. How do you pick a different winner for mobile and desktop? If you're running independent tests for different traffic sources, how can you share some information about which page content is 'generally good' across the tests without forcing them to have the same result? In offline analysis, this is a great setup for hierarchical models / partial pooling, as described in the Gelman & Hill classic ARM (Data Analysis Using Regression and Multilevel/Hierarchical Models), making use of MCMC methods and probabilistic programming languages like Stan. But are there ways to get the nice features of that approach without abandoning simple online data assimilation for those heavier solving methods?
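For intuition, the partial-pooling behavior we'd like can be sketched with a simple shrinkage estimator; this is a stand-in for the full hierarchical MCMC treatment, and the weighting scheme here is illustrative, not taken from ARM:

```python
def partially_pooled_rates(clicks, views, prior_strength=50.0):
    # Shrink each segment's observed click rate toward the overall rate.
    # Small segments get pulled hard toward the pooled estimate; large
    # segments mostly keep their own -- the qualitative behavior of a
    # hierarchical model, without any MCMC.
    overall = sum(clicks.values()) / sum(views.values())
    return {seg: (clicks[seg] + prior_strength * overall)
                 / (views[seg] + prior_strength)
            for seg in clicks}
```

A segment with 20 views barely moves the pooled estimate, while one with 10,000 views essentially keeps its raw rate; that's exactly the 'share information without forcing the same result' behavior we want across tests.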

It turns out this expanded problem of 'bandits with round-specific hints' is called 'contextual bandits', so I went to the arXiv and tried to find some nice ideas.

Some notes on solutions in the literature

I'll pick a few papers/families of work that I spent more time with and give quick notes. If this is of interest to you, check out this great survey of the topic from 2015 -- if you skip over the analysis of regret bounds, it's almost readable ;)

ILOVETOCONBANDITS (yes, that's the actual algorithm name)

Thompson Sampling Contextual Bandits

Contextual Bandits with Similarity Information

Thompson Sampling from Online Mixture Model

Drawing from good ideas in this work, and aiming for only an incremental change to the online alternative testing strategy described in earlier posts, I've devised an approach:



Initialize a 'population weights' matrix W (d x M), mapping d-dimensional contexts to weights over M experts

For each expert m in M and arm k in K, initialize success_mk and failure_mk counters to represent some prior on reward rate

During each step of online operation, observe the context, sample an expert for it from the population weights W, Thompson-sample an arm from that expert's success/failure counters, and record the observed reward.

Then, we update experts' knowledge of arms and the probabilities of each expert for each context:

Update each expert's success/failure counters in proportion to the current modeled relevance of this context to that expert.

Update the population weights matrix to make experts with higher-likelihood reward estimates more likely as assignments for this context.



Another post soon will show some experimental results and try to assess those concerns.