Through several earlier posts on this blog, I outlined an online learning algorithm for maximizing click through rates on landing pages. This strategy turns out to be an old one, from the 1930s in fact, called Thompson sampling. But despite its age and simplicity, it works quite well, and the concurrent data structures / multi-process approach has shown good results in real-world testing!

Some gaps have emerged, though, that really expand the problem and make things a bit more complicated. How do you pick a different winner for mobile and desktop? If you're running independent tests for different traffic sources, how can you share some information about which page content is 'generally good' across the tests without forcing them to have the same result? In offline analysis, this is a great setup for hierarchical models / partial pooling, as described in the Gelman & Hill classic ARM (Analysis with Regression Models), making use of MCMC methods and probabilistic programming languages like Stan. But are there ways we can get at nice features of that approach without turning to their solving methods (rather than simple online data assimilation)?

It turns out this expanded problem of 'bandits with round-specific hints' is called 'contextual bandits', so I went to the arXiv and tried to find some nice ideas.

I'll pick three papers/families of work that I spent more time with and give quick notes. If this is of interest to you, check out this great survey of the topic from 2015 -- if you skip over the analysis of regret bounds, it's almost readable ;)

- Generally pretty attractive
- Sample from a distribution over policies, each having a context -> arm mapping
- Run coordinate descent over history to get new distribution over policies each epoch
- Would like something more fully online and interpretable

- Sample a weight vector, select an arm based on argmax over dot products
- Covariance of weight vector is based on inverse of covariance of context vectors
- Mean of weight vector is based on high-reward contexts
- Contexts similar in high-variance dimensions should have similar predictions
- Exponential gradient over expert weights (also in EXP3 and EXP4)
- Assumes you have a context vector per arm

- Generates policy balls that cover the (context, arm) space with similarity-measure based radius
- As contexts arrive, generates more, finer-grained policy regions
- Similarity measure includes arms, so never-before-played arms can be exposed to (similar) contexts that have had success on similar arms
- Requires similarity measures over both arms and contexts

Drawing from good ideas in this work and aiming for an only incremental change to the online alternative testing strategy described in earlier posts, I've devised an approach

- Contexts are drawn from M populations — we'll provide soft cluster assignments for each input context vector
- Each population has a reward distribution over K arms — thus, each population has an expert

Initialize a 'population weights' matrix W (`d`

x `M`

)

- Potentially via K-means on historic context data in
`d`

, selecting`M`

centroids

For each expert in `M`

and arm in `K`

, initialize successMK and failureMK counters to represent some prior on reward rate

During each step of online operation:

- Receive context
`x`

and arm set`a`

`xW`

gives probability distribution across populations,`w`

- For each expert where
`w > 0`

, for each arm in`a`

:- Sample from Beta(successMK, failureMK) as probability of reward
- Return arm with highest sampled reward probability

- Select an expert according to probabilities in
`w`

and play the arm it returned - Observe either success or failure

Then, we update experts' knowledge of arms and the probabilities of each expert for each context:

Update experts in proportion to the current modeled relevance of this context to them:

- Record a fractional success or failure for each expert according to its probability in
`w`

Update the population weights matrix to make experts with higher likelihood reward estimates more likely as assignments for this context:

- Calculate a gradient update to
`w`

; for each element in`w`

:`wt+1 := w p(rt|at)`

- That is, new expert probability = old expert probability * likelihood of observed outcome according to that expert

- Calculate delta_w as
`wt+1 - wt`

; we need to change the corresponding element of`xW`

by this much- We can do this by adding a scaled
`xt`

to the corresponding column of`W`

- When
`xt`

is an indicator vector,`Wi +=`

delta_w-scaled normalized`xt`

- That is, a
`-.2`

change w.r.t context`[1, 0, 1]`

would be accomplished by adding`[-.1, 0, -.1]`

- We can do this by adding a scaled

- Fully online, with room for 'warm starts' from offline training: no batched steps are required; balances exploit and explore through sampling
- Interpretable probabilities: for each context, you can assess the modeled odds of being a member of each group, and each group's modeled odds of success with each arm
- Shares information across settings where set of available arms change: group membership predictions can be used in settings where new arms are available

- How do you get the number of populations/experts right? How sensitive to having it right is the whole algorithm? It seems possible that this is very important -- if data from a population is split evenly across two experts, both are worse off, but the model will not readily cut off either.
- As population weights change, data assimilated into an expert may have come from now-irrelevant contexts, but there's no way to get it out.

Another post soon will show some experimental results and try to assess those concerns.