Online Alternatives Testing, with Contexts
Through several earlier posts on this blog, I outlined an online learning algorithm for maximizing click-through rates on landing pages. This strategy turns out to be an old one, from the 1930s in fact, called Thompson sampling. But despite its age and simplicity, it works quite well, and the concurrent data structures / multi-process approach has shown good results in real-world testing!
Some gaps have emerged, though, that really expand the problem and make things a bit more complicated. How do you pick a different winner for mobile and desktop? If you're running independent tests for different traffic sources, how can you share some information about which page content is 'generally good' across the tests without forcing them to have the same result? In offline analysis, this is a great setup for hierarchical models / partial pooling, as described in Gelman & Hill's classic ARM (Data Analysis Using Regression and Multilevel/Hierarchical Models), making use of MCMC methods and probabilistic programming languages like Stan. But are there ways to get the nice features of that approach without turning to those solution methods, sticking instead to simple online data assimilation?
It turns out this expanded problem of 'bandits with round-specific hints' is called 'contextual bandits', so I went to the arXiv and tried to find some nice ideas.
Some notes on solutions in the literature
I'll pick three papers/families of work that I spent more time with and give quick notes. If this is of interest to you, check out this great survey of the topic from 2015 -- if you skip over the analysis of regret bounds, it's almost readable ;)
ILOVETOCONBANDITS (yes, that's the actual algorithm name)
- Generally pretty attractive
- Sample from a distribution over policies, each having a context -> arm mapping
- Run coordinate descent over history to get new distribution over policies each epoch
- Would like something more fully online and interpretable
Thompson Sampling Contextual Bandits
- Sample a weight vector, select an arm based on argmax over dot products
- Covariance of weight vector is based on inverse of covariance of context vectors
- Mean of weight vector is based on high-reward contexts
- Contexts similar in high-variance dimensions should have similar predictions
- Exponential gradient over expert weights (also in EXP3 and EXP4)
- Assumes you have a context vector per arm
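To make those notes concrete, here is a minimal sketch of the sample-a-weight-vector idea (in the style of linear-Gaussian Thompson sampling), assuming one context vector per arm and numeric rewards. The class and variable names are mine, not the paper's, and the exploration scale v is just an illustrative knob.

```python
# A minimal sketch: sample a weight vector, pick the arm with the highest dot product.
import numpy as np

class LinearThompson:
    def __init__(self, dim, v=1.0):
        self.B = np.eye(dim)       # accumulates context outer products (a precision matrix)
        self.f = np.zeros(dim)     # accumulates reward-weighted contexts
        self.v = v                 # exploration scale

    def choose(self, arm_contexts):
        """arm_contexts: array of shape (n_arms, dim), one context vector per arm."""
        B_inv = np.linalg.inv(self.B)
        mu = B_inv @ self.f                                              # mean driven by high-reward contexts
        theta = np.random.multivariate_normal(mu, self.v ** 2 * B_inv)   # sampled weight vector
        return int(np.argmax(arm_contexts @ theta))                      # argmax over dot products

    def update(self, context, reward):
        self.B += np.outer(context, context)
        self.f += reward * context
```

Note how the covariance of the sampled weight vector shrinks in directions where many contexts have been observed, which is what drives the "similar contexts, similar predictions" behavior mentioned above.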
Contextual Bandits with Similarity Information
- Generates policy balls that cover the (context, arm) space with similarity-measure based radius
- As contexts arrive, generates more, finer-grained policy regions
- Similarity measure includes arms, so never-before-played arms can be exposed to (similar) contexts that have had success on similar arms
- Requires similarity measures over both arms and contexts
Thompson Sampling from Online Mixture Model
Drawing on good ideas from this work, and aiming for only an incremental change to the online alternatives testing strategy described in earlier posts, I've devised the following approach.
Assumptions
- Contexts are drawn from M populations — we'll provide soft cluster assignments for each input context vector
- Each population has a reward distribution over K arms — thus, each population has an expert
Model
Initialize a 'population weights' matrix W (d x M)
- Potentially via K-means on historic context data in d dimensions, selecting M centroids
For each expert m in M and arm k in K, initialize success_mk and failure_mk counters to represent some prior on reward rate
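Here is a sketch of this initialization, assuming numpy and scikit-learn. Seeding W's columns from K-means centroids and row-normalizing (so that xW behaves like a distribution over the M populations) is my reading of the bullets above, not a recipe the post spells out.

```python
import numpy as np
from sklearn.cluster import KMeans

def initialize(historic_contexts, M, K, prior_success=1.0, prior_failure=1.0):
    """historic_contexts: array of shape (n, d); M populations/experts; K arms."""
    # Population weights matrix W (d x M), columns seeded from M centroids.
    centroids = KMeans(n_clusters=M, n_init=10).fit(historic_contexts).cluster_centers_
    W = np.clip(centroids.T, 1e-6, None)      # (d, M), kept strictly positive
    W = W / W.sum(axis=1, keepdims=True)      # each feature's row sums to 1

    # Beta prior counters per (expert, arm): success_mk and failure_mk.
    success = np.full((M, K), prior_success)
    failure = np.full((M, K), prior_failure)
    return W, success, failure
```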
During each step of online operation:
- Receive context x and arm set a
- xW gives a probability distribution across populations, w
- For each expert where w > 0, for each arm in a:
  - Sample from Beta(success_mk, failure_mk) as the probability of reward
  - Return the arm with the highest sampled reward probability
- Select an expert according to the probabilities in w and play the arm it returned
- Observe either a success or a failure
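A sketch of one decision step follows, continuing the initialization sketch above. Clipping and renormalizing w is my own guard; the post assumes xW already yields a probability distribution over experts.

```python
import numpy as np

def choose(W, success, failure, x, arm_set, rng=None):
    """x: context vector of shape (d,); arm_set: list of playable arm indices."""
    rng = rng or np.random.default_rng()
    w = np.clip(x @ W, 0.0, None)
    w = w / w.sum()                           # probability across the M experts

    # Each relevant expert nominates its best arm via Thompson sampling.
    nominations = {}
    for m in np.flatnonzero(w > 0):
        samples = rng.beta(success[m, arm_set], failure[m, arm_set])
        nominations[m] = arm_set[int(np.argmax(samples))]

    # Select an expert in proportion to w and play the arm it nominated.
    experts = list(nominations.keys())
    probs = w[experts] / w[experts].sum()
    chosen_expert = rng.choice(experts, p=probs)
    return nominations[chosen_expert], w
```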
Then, we update experts' knowledge of arms and the probabilities of each expert for each context:
Update experts in proportion to the current modeled relevance of this context to them:
- Record a fractional success or failure for each expert according to its probability in w
Update the population weights matrix to make experts with higher-likelihood reward estimates more likely as assignments for this context:
- Calculate a gradient update to w; for each element of w:
  - w_t+1 := w_t * p(r_t | a_t)
  - That is, new expert probability = old expert probability * likelihood of the observed outcome according to that expert
- Calculate delta_w as w_t+1 - w_t; we need to change the corresponding element of xW by this much
- We can do this by adding a scaled x_t to the corresponding column of W
  - When x_t is an indicator vector, W_i += a delta_w-scaled, normalized x_t
  - That is, a -0.2 change w.r.t. context [1, 0, 1] would be accomplished by adding [-0.1, 0, -0.1] to that column
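And a sketch of that update step, again with my own variable names. The likelihood p(r_t | a_t) is taken to be each expert's Beta posterior mean for the played arm (or one minus it on a failure), and the final clip that keeps W nonnegative is an assumption the post doesn't spell out.

```python
import numpy as np

def update(W, success, failure, x, arm, reward, w):
    """arm: index of the arm played; reward: 1 on success, 0 on failure;
    w: the expert probabilities computed when the arm was chosen."""
    # Fractional success/failure for each expert, in proportion to its probability in w.
    success[:, arm] += w * reward
    failure[:, arm] += w * (1 - reward)

    # Each expert's likelihood of the observed outcome.
    p_success = success[:, arm] / (success[:, arm] + failure[:, arm])
    likelihood = p_success if reward == 1 else 1.0 - p_success

    # New expert probability = old expert probability * likelihood; push the
    # resulting delta_w into W by adding a scaled x to each expert's column,
    # so that x @ W moves by exactly delta_w.
    delta_w = w * likelihood - w
    W += np.outer(x / (x @ x), delta_w)
    np.clip(W, 0.0, None, out=W)
```

In use, choose would run when a request arrives and update would run once the success/failure signal for that request comes back.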
Features
- Fully online, with room for 'warm starts' from offline training: no batched steps are required, and exploration and exploitation are balanced through sampling
- Interpretable probabilities: for each context, you can assess the modeled odds of being a member of each group, and each group's modeled odds of success with each arm
- Shares information across settings where the set of available arms changes: group membership predictions can be reused in settings where new arms are available
Concerns
- How do you get the number of populations/experts right? How sensitive is the whole algorithm to getting it right? It seems possible that this matters a lot -- if data from one population is split evenly across two experts, both are worse off, but the model will not readily cut either one off.
- As population weights change, data assimilated into an expert may have come from now-irrelevant contexts, but there's no way to get it out.
Another post soon will show some experimental results and try to assess those concerns.