<?xml version='1.0' encoding='utf-8' ?>
<rss version='2.0'>
<channel>
<title>l0stinsp4ce</title>
<link>http://l0stinsp4ce.com</link>

<language>en-us</language>

<generator>Indexhibit</generator>

<item>
<title>ML Security</title>

<description>
<![CDATA[

ML Security




<h1 id="ml-security">ML Security</h1>
<h3 id="a-secure-system-will-do-the-right-thing-when-faced-with-malicious-inputs">A secure system will do the right thing when faced with malicious inputs</h3>
<p>James Mickens is a computer scientist well-known for his hilarious conference presentations. I'll assume you've seen the classics but maybe not a certain 2018 speech of his titled:</p>
<code>Q: Why Do Keynote Speakers Keep Suggesting That Improving Security Is Possible?
A: Because Keynote Speakers Make Bad Life Decisions and Are Poor Role Models</code>
<p>which, despite the title, is actually a crash course in AI with a focus on ethics/real world consequences. Plus security. The thesis is something like: it is a terrible idea to take a system whose inputs and logic you at best vaguely understand and put it in charge of important decisions. Further, if you're feeding it data from humans, both when you train it and when you use it for inference, then the inputs and logic can be compromised (for various definitions of compromised). Presented with the relentless malice of the internet, or, worse, the real physical world, your best case scenario of the machine's learned logic may well not come to pass.</p>
<p>This really hit home for me as something industrial ML efforts are doing a terrible job with. It's quite concerning when you consider things like autonomous vehicles, but it also describes a baseline foolishness &quot;in the rush to deploy AI&quot;, in Mickens' terms.</p>
<h3 id="data-magnetism-and-security-are-not-friends">Data magnetism and security are not friends</h3>
<p>The rush to deploy AI is related to the pressure to collect data. As you collect more data and apply increasingly automatic modeling methods and computing power, you can start vacuuming up lots of signal and producing some sort of predictive output. But the less you know about your data -- and knowing is not keeping pace with collection or modeling -- the less sure you can be that the industrial probability sensor you're wiring to your important decision process is something that actually will end up agreeing with your actual values. We're typically using these things for social, human, applications, not modeling unchanging laws of thermodynamics. The legitimacy of generalization from your training to your application is not nearly as clear.</p>
<p>Let me give some examples:</p>
<h3 id="kitchen-sink-modeling">&quot;Kitchen sink&quot; modeling</h3>
<p>Suppose you run a credit card fraud detection model. You have lots of data about transactions, where and when and for what, as well as lots of data about your account holder. You have also pulled in data from your customer service systems, and are tracking the contacts customers are making with you, about what topics, how recently, etc.</p>
<p>Your decision trees find a little pattern: when suspicious transactions occur after a customer has contacted you about traveling, they are much less likely to actually be fraud. Perhaps your analysis of your model turns this up: you find the importance of the <code>date_diff_contact_txn</code> value under the <code>contact=travel notice</code> constraint. You're actually proud of this, since it proves the extra data you brought in turned out to be useful!</p>
<p>But where did those important features really come from? You took the phone number from the customer contact records and looked it up against your customer account records. Perhaps the categorization is input by the user in your support phone automation system.</p>
<p>If that were true, you'd have a sadly insecure fraud detection system! The villain could easily spoof the customer number and provide this fake travel notice between obtaining and using the stolen card.</p>
<p>What has gone wrong is that you enlisted systems that normally have low-security applications as inputs into a high-security operation. The question to ask when integrating that extra peripheral data is: how could this be used against my model at inference time? Sure, the attack vector is small, but making a &quot;kitchen sink&quot; model with features from dozens of not-to-be-trusted systems and user-supplied inputs multiplies that vulnerability. Worse, it's typically treated as a 'modeling' problem and accommodated in often-permissive error-rate goals, rather than identified as a security concern.</p>
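<p>As a toy illustration of the attack (every field name and threshold here is hypothetical, not any real scoring system), consider a scoring rule where a customer-supplied travel notice directly buys down risk:</p>

```python
# Toy scoring rule: a spoofable, user-supplied travel notice is allowed to
# override an otherwise-suspicious signal.

def fraud_score(txn):
    score = 0.0
    if txn["country"] != txn["home_country"]:
        score += 0.6   # cross-border transactions look risky...
    if txn.get("contact") == "travel notice" and txn.get("date_diff_contact_txn", 999) < 14:
        score -= 0.5   # ...unless a recent travel notice is on file
    return score

stolen_card = {"country": "FR", "home_country": "US"}
blocked = fraud_score(stolen_card) >= 0.5   # flagged as fraud

# the attacker spoofs the customer's phone number and files a travel
# notice before using the stolen card
spoofed = dict(stolen_card, contact="travel notice", date_diff_contact_txn=2)
passed = fraud_score(spoofed) < 0.5         # now sails through
```

The low-security phone system's output moves the high-security decision directly, which is exactly the vulnerability described above.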
<h3 id="panopticon-problems">Panopticon Problems</h3>
<p>We're now seeing, across more and more applications, the power of adversarial networks: opposing learners trying to competitively fool each other. However, the framework for much commercial machine learning assumes a relatively consistent and obedient adversary. Rather than competing against an equal, we assume we are simply trying to outdo a static baseline.</p>
<p>In digital advertising, it's routine to trust your client to identify itself accurately and to answer truthfully about behavioral data that you trust it to store itself. We imagine that we, the data collector, sit at the center of a sort of panopticon, with clients unaware of each other's outcomes.</p>
<p>Suppose you run a digital 'retargeting' ad campaign, in which near-miss visitors who didn't purchase on your site are offered a deal that you wouldn't present to the public. Rather than provide a coupon code, you attach some information on your links in these ads. Maybe you never even planned the special deal, because you'd wired your pricing system up to an AI price discrimination system that maximized profit given each visitor's information.</p>
<p>But suppose that your visitors are all running the same browser extension, which inspects the inputs their browsers are being asked to provide you and watches the price outcomes. This system could learn the lowest valleys in your price-discrimination space, and looking only at your model's own metrics, you might never notice an issue.</p>
<p>A common thread of thinking about your ML inputs in security terms is to assume that your adversaries have a copy of your model. If they can collect all the same data that you can -- since you're asking them to provide the inputs and then sending them the output -- they could very well catch up with you. Obscurity is not a sufficient security strategy for the real world processes we are now thinking of turning over to ML.</p>
<h3 id="a-structured-thing-you-might-do-about-it">A structured thing you might do about it</h3>
<p>An important but often neglected part of modeling is 'error analysis', where you actually look at the examples that your ML is getting wrong and try to understand what it's missing.</p>
<p>A counterpart for ML security might be 'induced error analysis', where you take a correct decision and then fuzz the input until the decision becomes wrong. Give a hearty fuzz to any input that might be user-supplied or easily faked. Give a fuzz to inputs that you're likely to get wrong in measurement. Look at the altered example when the model starts to get it wrong.</p>
<p>If you have an error budget, or a performance benchmark required to justify deploying the model, you might apply this fuzzing to your test set and count against your budget any observation that can be fuzzed into a wrong decision. You're only getting it right if the entire user-controlled space around the input is right.</p>
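<p>As a sketch of what that fuzz-and-count loop could look like (the classifier, field names, and value ranges here are all hypothetical stand-ins):</p>

```python
# 'Induced error analysis': sweep the user-controllable input and ask whether
# any value flips a correct decision.

def model(x):
    """Stand-in fraud classifier: 1 = fraud, 0 = fine."""
    return 0 if x.get("travel_notice_days", 99) < 14 else 1

def fuzz_user_fields(x):
    """Yield copies of x with the user-suppliable field swept over its range."""
    for days in range(0, 30):
        fuzzed = dict(x)
        fuzzed["travel_notice_days"] = days  # attacker-controlled input
        yield fuzzed

def robustly_correct(x, label):
    """Right only if the whole user-controlled neighborhood is right."""
    return model(x) == label and all(model(f) == label for f in fuzz_user_fields(x))

# two genuinely fraudulent transactions with no travel notice on file
test_set = [({"amount": 950}, 1), ({"amount": 120}, 1)]
plain_acc = sum(model(x) == y for x, y in test_set) / len(test_set)
robust_acc = sum(robustly_correct(x, y) for x, y in test_set) / len(test_set)
# plain_acc is 1.0, but robust_acc is 0.0: a spoofed travel notice
# always flips the decision
```

The gap between the two numbers is exactly what the error budget should be charged for.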
<p>If you have only a few user-trusted fields, then perhaps your metrics won't suffer. But if you have many, you'll find the possible alterations too damaging, and you can't really go out there with that model.</p>
<p>Which goes back to Mickens' talk. One of the axioms he observed underlying the current AI craze is: &quot;History has nothing interesting to teach us.&quot; Wouldn't something as simple and well known in computer security as fuzzing potentially have something to teach us after all?</p>


]]>
</description>

<link>http://l0stinsp4ce.com/blogggo/ml-security/</link>

<pubDate>2019-06-02 21:58:12</pubDate>

</item>

<item>
<title>Contextual Bandits</title>

<description>
<![CDATA[

Online Alternatives Testing, with Contexts



<h1>Online Alternatives Testing, with Contexts</h1>

<p>Through several earlier posts on this blog, I outlined an online learning algorithm for maximizing click through rates on landing pages. This strategy turns out to be an old one, from the 1930s in fact, called Thompson sampling. But despite its age and simplicity, it works quite well, and the concurrent data structures / multi-process approach has shown good results in real-world testing!</p>
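<p>Concretely, the basic (context-free) strategy can be sketched in a few lines, assuming made-up click-through rates for the demo:</p>

```python
import random

# Minimal Thompson sampling over two alternatives: one Beta posterior per
# alternative; serve whichever draws the highest sampled rate.
random.seed(0)

true_ctr = [0.01, 0.5]   # hypothetical ground-truth click-through rates
wins = [1, 1]            # Beta(1, 1) uniform priors
losses = [1, 1]

for _ in range(1000):
    # sample a plausible CTR for each alternative, serve the argmax
    samples = [random.betavariate(wins[k], losses[k]) for k in range(2)]
    arm = samples.index(max(samples))
    if random.random() < true_ctr[arm]:
        wins[arm] += 1
    else:
        losses[arm] += 1

# impressions served to each alternative (priors subtracted out)
traffic = [wins[k] + losses[k] - 2 for k in range(2)]
```

Sampling from the posteriors balances exploration and exploitation automatically: nearly all traffic ends up on the better alternative without any explicit epsilon or schedule.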

<p>Some gaps have emerged, though, that really expand the problem and make things a bit more complicated. How do you pick a different winner for mobile and desktop? If you're running independent tests for different traffic sources, how can you share some information about which page content is 'generally good' across the tests without forcing them to have the same result? In offline analysis, this is a great setup for hierarchical models / partial pooling, as described in the Gelman &amp; Hill classic ARM (Data Analysis Using Regression and Multilevel/Hierarchical Models), making use of MCMC methods and probabilistic programming languages like Stan. But are there ways to get the nice features of that approach while sticking to simple online data assimilation, rather than turning to those heavyweight solvers?</p>

<p>It turns out this expanded problem of 'bandits with round-specific hints' is called 'contextual bandits', so I went to the arXiv and tried to find some nice ideas.</p>

<h2>Some notes on solutions in the literature</h2>

<p>I'll pick three papers/families of work that I spent more time with and give quick notes. If this is of interest to you, check out this <a href="https://arxiv.org/pdf/1508.03326v1.pdf">great survey of the topic from 2015</a> -- if you skip over the analysis of regret bounds, it's almost readable ;)</p>

<h3>ILOVETOCONBANDITS (yes, that's the actual algorithm name)</h3>


<ul>
<li>Generally pretty attractive</li>
<li>Sample from a distribution over policies, each having a context -> arm mapping</li>
<li>Run coordinate descent over history to get a new distribution over policies each epoch</li>
<li>Would like something more fully online and interpretable</li>
</ul>



<h3>Thompson Sampling Contextual Bandits</h3>


<ul>
<li>Sample a weight vector, select an arm based on argmax over dot products</li>
<li>Covariance of weight vector is based on inverse of covariance of context vectors</li>
<li>Mean of weight vector is based on high-reward contexts</li>
<li>Contexts similar in high-variance dimensions should have similar predictions</li>
<li>Exponential gradient over expert weights (also in EXP3 and EXP4)</li>
<li>Assumes you have a context vector per arm</li>
</ul>



<h3>Contextual Bandits with Similarity Information</h3>


<ul>
<li>Generates policy balls that cover the (context, arm) space with similarity-measure-based radius</li>
<li>As contexts arrive, generates more, finer-grained policy regions</li>
<li>Similarity measure includes arms, so never-before-played arms can be exposed to (similar) contexts that have had success on similar arms</li>
<li>Requires similarity measures over both arms and contexts</li>
</ul>



<h2>Thompson Sampling from Online Mixture Model</h2>

<p>Drawing from good ideas in this work, and aiming for only an incremental change to the online alternative-testing strategy described in earlier posts, I've devised an approach.</p>

<h3>Assumptions</h3>


<ul>
<li>Contexts are drawn from M populations -- we'll provide soft cluster assignments for each input context vector</li>
<li>Each population has a reward distribution over K arms -- thus, each population has an expert</li>
</ul>



<h3>Model</h3>

<p>Initialize a 'population weights' matrix W (<code>d</code> x <code>M</code>)</p>


<ul>
<li>Potentially via K-means on historic context data in <code>d</code>, selecting <code>M</code> centroids</li>
</ul>



<p>For each expert in <code>M</code> and arm in <code>K</code>, initialize <code>success[m][k]</code> and <code>failure[m][k]</code> counters to represent some prior on reward rate</p>

<p>During each step of online operation:</p>


<ul>
<li>Receive context <code>x</code> and arm set <code>a</code></li>
<li><code>xW</code> gives probability distribution across populations, <code>w</code></li>
<li>For each expert where <code>w &gt; 0</code>, for each arm in <code>a</code>:
<ul>
<li>Sample from Beta(<code>success[m][k]</code>, <code>failure[m][k]</code>) as probability of reward</li>
<li>Return arm with highest sampled reward probability</li>
</ul>
</li>
<li>Select an expert according to probabilities in <code>w</code> and play the arm it returned</li>
<li>Observe either success or failure</li>
</ul>



<p>Then, we update experts' knowledge of arms and the probabilities of each expert for each context:</p>

<p>Update experts in proportion to the current modeled relevance of this context to them:</p>


<ul>
<li>Record a fractional success or failure for each expert according to its probability in <code>w</code></li>
</ul>



<p>Update the population weights matrix to make experts with higher likelihood reward estimates more likely as assignments for this context:</p>


<ul>
<li>Calculate a gradient update to <code>w</code>; for each element in <code>w</code>:
<ul>
<li><code>w[t+1] := w[t] * p(r[t] | a[t])</code></li>
<li>That is, new expert probability = old expert probability * likelihood of observed outcome according to that expert</li>
</ul>
</li>
<li>Calculate <code>delta_w</code> as <code>w[t+1] - w[t]</code>; we need to change the corresponding element of <code>xW</code> by this much
<ul>
<li>We can do this by adding a scaled <code>x[t]</code> to the corresponding column of <code>W</code></li>
<li>When <code>x[t]</code> is an indicator vector, the corresponding column of <code>W</code> gets a <code>delta_w</code>-scaled, normalized <code>x[t]</code> added to it</li>
<li>That is, a <code>-.2</code> change w.r.t. context <code>[1, 0, 1]</code> would be accomplished by adding <code>[-.1, 0, -.1]</code></li>
</ul>
</li>
</ul>
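<p>Pulling the steps above together, here's a minimal pure-Python sketch of the whole loop. The uniform initialization of <code>W</code> (rather than K-means), the Beta(1, 1) priors, the clamping and normalization of <code>xW</code> into a proper distribution, and the toy reward simulator are all my assumptions:</p>

```python
import random

# Sketch of the mixture-model Thompson sampler:
# d = context dims, M = populations/experts, K = arms.
random.seed(1)
d, M, K = 3, 2, 2

W = [[0.5] * M for _ in range(d)]        # population-weights matrix (d x M)
success = [[1.0] * K for _ in range(M)]  # Beta(1, 1) priors per expert/arm
failure = [[1.0] * K for _ in range(M)]

def population_probs(x):
    """xW, clamped and normalized into a distribution over M populations."""
    raw = [max(sum(x[i] * W[i][m] for i in range(d)), 1e-9) for m in range(M)]
    z = sum(raw)
    return [r / z for r in raw]

def step(x, reward_fn):
    w = population_probs(x)
    # each expert Thompson-samples its best arm
    picks = []
    for m in range(M):
        samples = [random.betavariate(success[m][k], failure[m][k]) for k in range(K)]
        picks.append(samples.index(max(samples)))
    # choose an expert in proportion to w, play the arm it returned
    m_star = random.choices(range(M), weights=w)[0]
    arm = picks[m_star]
    r = reward_fn(arm)
    # fractional success/failure for every expert, weighted by w
    for m in range(M):
        if r:
            success[m][arm] += w[m]
        else:
            failure[m][arm] += w[m]
    # gradient-style update to W: scale each expert's weight by the
    # likelihood it assigned the observed outcome, then push x toward
    # the experts that explained the outcome better
    for m in range(M):
        rate = success[m][arm] / (success[m][arm] + failure[m][arm])
        lik = rate if r else 1.0 - rate
        delta = w[m] * lik - w[m]
        norm = sum(xi * xi for xi in x) or 1.0
        for i in range(d):
            W[i][m] += delta * x[i] / norm
    return arm, r

# toy simulation: two context types, each with a different best arm
def make_reward(good_arm):
    return lambda arm: random.random() < (0.8 if arm == good_arm else 0.1)

for t in range(2000):
    if t % 2 == 0:
        step([1, 0, 1], make_reward(0))
    else:
        step([0, 1, 0], make_reward(1))
```

Each step distributes exactly one unit of (fractional) evidence across the experts, and the soft assignments stay interpretable as probabilities, which is the point of the design.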





<h3>Features</h3>


<ul>
<li>Fully online, with room for 'warm starts' from offline training: no batched steps are required; balances exploit and explore through sampling</li>
<li>Interpretable probabilities: for each context, you can assess the modeled odds of being a member of each group, and each group's modeled odds of success with each arm</li>
<li>Shares information across settings where the set of available arms changes: group-membership predictions can be used in settings where new arms are available</li>
</ul>



<h3>Concerns</h3>


<ul>
<li>How do you get the number of populations/experts right? How sensitive is the whole algorithm to having it right? It seems possible that this is very important -- if data from a population is split evenly across two experts, both are worse off, but the model will not readily cut off either.</li>
<li>As population weights change, data assimilated into an expert may have come from now-irrelevant contexts, but there's no way to get it out.</li>
</ul>



<p>Another post soon will show some experimental results and try to assess those concerns.</p>


]]>
</description>

<link>http://l0stinsp4ce.com/blogggo/contextual-bandits/</link>

<pubDate>2018-01-15 18:18:49</pubDate>

</item>

<item>
<title>Python Packaging</title>

<description>
<![CDATA[
Sane Python Packaging

<h2>Sane Python Packaging</h2>

<p>Received a couple iMessages:</p>

<p>Write a blog post about sane package/module management development in a python app</p>

<p>I can't find one</p>

<p>So let's give it a try!</p>

<p>As an example let's take something I made for an interview once, a little API->Postgres data engine app.</p>

<p>Here's a little thing that pulls some data requested by the interviewer. We pull information about congressional bills containing some query string, as well as profiles of their sponsors: <a href="https://github.com/bhtucker/sunroof">https://github.com/bhtucker/sunroof</a></p>

<h2>Developing</h2>

<p>Let's pretend you want to collaborate on this and you're on OSX.</p>

<p>First, get <a href="http://virtualenvwrapper.readthedocs.io/en/latest/install.html">virtualenvwrapper</a>. If you're going to do some Python 2 and some Python 3 work, it's the tool to have. Install it globally, then invoke <code>mkvirtualenv sunroof</code>.</p>

<p>Now, clone this repo and run <code>pip install -e .</code> This registers your package with your virtualenv's <code>site-packages</code>. You can verify this with something like:</p>

<code>ls ~/.virtualenvs/sunroof/lib/python2.7/site-packages/sunroof*
/Users/bhtucker/.virtualenvs/sunroof/lib/python2.7/site-packages/sunroof.egg-link
</code>

<p>As long as you're in your virtualenv, you'll be able to import your library and its modules.</p>
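<p>For <code>pip install -e .</code> to do anything, the repo needs a setup script at its root. A minimal sketch (the real sunroof metadata and dependency list are my guesses, not what the repo actually contains):</p>

```python
# setup.py -- minimal editable-install configuration
from setuptools import setup, find_packages

setup(
    name='sunroof',
    version='0.1.0',
    packages=find_packages(),  # picks up sunroof/ and nested packages like sunroof/services/
    install_requires=[         # hypothetical dependencies
        'sqlalchemy',
        'requests',
    ],
)
```

The egg-link file you saw above is what `-e` (editable mode) leaves in `site-packages` instead of a copied package: imports resolve back to your working checkout.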

<h2>Modules</h2>

<p>You're going to implement a feature, and you're going to do it in a nice module, yay!</p>

<code>mkdir sunroof/services
touch sunroof/services/__init__.py
</code>

<p>That's a package (a directory Python can import like a module).</p>

<p>I usually would make a file for my thing then. Such as <code>cool.py</code>. And write something cool there.</p>

<p>That package could have imports like:</p>

<code>import math

import sqlalchemy

from sunroof.models import bills, legislators
</code>

<p>(FYI, stdlib -> third-party libraries -> your package is a canonical import ordering)</p>

<p>Then you write something cool.</p>

<code>def _some_private_cool():
    assert (bills, legislators)


def something_cool():
    """Stuff"""
    math.ceil(2)
    _some_private_cool()
    assert sqlalchemy
</code>

<p>This is now usable from wherever your outer package is installed.</p>

<p>Simply:
<code>
from sunroof.services.cool import something_cool
</code></p>

<p>You can even let an object be importable from the <code>services</code> layer by adding to your <code>sunroof/services/__init__.py</code>:</p>

<code>from .cool import something_cool
</code>

<p>Enabling this type of jazz:</p>

<code>(sunroof) Bensons-MacBook-Air:sunroof bhtucker$ cd 
(sunroof) Bensons-MacBook-Air:~ bhtucker$ python
Python 2.7.10 (default, Oct 23 2015, 19:19:21) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
&gt;&gt;&gt; from sunroof import services
&gt;&gt;&gt; services.cool.something_cool
&lt;function something_cool at 0x102e6bb18&gt;
&gt;&gt;&gt; services.something_cool
&lt;function something_cool at 0x102e6bb18&gt;
&gt;&gt;&gt; from sunroof import services
&gt;&gt;&gt; services.cool
&lt;module 'sunroof.services.cool' from '/Users/bhtucker/code/github/sunroof/sunroof/services/cool.pyc'&gt;
&gt;&gt;&gt; import sunroof.services
&gt;&gt;&gt; sunroof.services.cool
&lt;module 'sunroof.services.cool' from '/Users/bhtucker/code/github/sunroof/sunroof/services/cool.pyc'&gt;
</code>

<p>So now we believe we can at least write modules and objects and make sure Python is finding them the way we want.</p>

<h2>Avoiding Circular Dependencies</h2>

<p>But, you can't turn around and write into <code>models.py</code>:</p>

<code>from sunroof.services import cool
</code>

<p>There needs to be some possible ordering for Python to read the modules with all their imports satisfied.</p>
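<p>To see the failure mode concretely, here's a self-contained demonstration with two throwaway modules (hypothetical names) that import each other at load time:</p>

```python
import os
import sys
import tempfile

# mod_b uses an attribute of mod_a at import time, before mod_a has
# finished executing -- there is no ordering that satisfies both.
d = tempfile.mkdtemp()
with open(os.path.join(d, "mod_a.py"), "w") as f:
    f.write("import mod_b\nvalue = 1\n")
with open(os.path.join(d, "mod_b.py"), "w") as f:
    f.write("import mod_a\nneeds = mod_a.value\n")

sys.path.insert(0, d)
try:
    import mod_a
    failed = False
except Exception:
    # AttributeError: partially initialized module 'mod_a' has no attribute 'value'
    failed = True
```

If `mod_b` only used `mod_a.value` inside a function called later, the cycle would resolve; it's the import-time use that breaks.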

<p>So if I'm making my database connection in <code>main</code> but I want to query in <code>services</code> without passing the connection in to every interface, I need a singleton.</p>

<p>These live inside some module. Often you can find them in an <code>__init__.py</code> like I outlined above. (Here's an example where I do that to make a Twitter API client <a href="https://github.com/bhtucker/streamkov/blob/master/streamkov/__init__.py">available throughout a package</a>).</p>

<p>We can modify <code>sunroof</code> to make its database session available this way.</p>

<p>In our outer <code>__init__.py</code>, we add:</p>

<code>from sqlalchemy.orm.session import Session

session = Session()
</code>

<p>then in our <code>main</code> function:</p>

<code>    congress_engine = engine.create_engine(SQLALCHEMY_DATABASE_URI)
...
    session.bind = congress_engine
</code>

<p>Happily, we can now import this into some module that wants to actually do database stuff. So, we could go back to <code>cool.py</code> and add:</p>

<code>from sunroof import session

def get_sponsor_ids():
    return [
        x[0] for x in session.query(bills.c.sponsor_id.distinct()).all()
    ]
</code>

<p>which we can import into <code>main</code> and use there without any circular dependency problems.</p>

<h2>Other issues?</h2>

<p>I'm curious what real world Python circular dependency problems you can get into that don't lend themselves to either of these approaches. Once you get an application going you rarely have these issues, but obviously that can be a big blocker!</p>

<p>Get at me on twitter @lavabenson</p>

]]>
</description>

<link>http://l0stinsp4ce.com/blogggo/python-packaging/</link>

<pubDate>2017-04-26 23:46:26</pubDate>

</item>

<item>
<title>Disk IO + Python Threads</title>

<description>
<![CDATA[

Python Concurrency



<h2>Python Concurrency</h2>

<p>I did some exploration a couple weeks ago re: concurrency in Python, with totally blogworthy results!</p>

<h3>The question:</h3>

<p>We often use the phrase 'waiting on IO' to describe 'cases when threads should improve efficiency of a single core'.</p>

<p>The most common scenario in modern web programming is probably waiting on HTTP responses or database queries, both of which play well with Python threading to improve scheduling / waste less time.</p>

<p>Here's some light test code we can use to check this out:</p>

<code>import threading
from functools import reduce  # a builtin in Python 2; needs the import on Python 3

import requests

URL = 'http://example.com'  # stand-in: any web resource with noticeable latency
FAC = 100000                # stand-in: large enough that the factorial takes real CPU time


def get_content(url):
    print('starting read')
    r = requests.get(url)
    print('done reading')
    return r.content


def factorial(n, ix):
    print('factorial {ix} starting'.format(ix=ix))
    val = reduce(lambda a, b: a * b, range(1, n, 1))
    print('factorial {ix} done'.format(ix=ix))
    return val


def threadtest():
    io_thread = threading.Thread(target=get_content, args=(URL, ))
    cpu_thread = threading.Thread(target=factorial, args=(FAC, 1))

    io_thread.start()
    cpu_thread.start()

    io_thread.join()
    cpu_thread.join()


def synctest():
    get_content(URL)
    factorial(FAC, 1)


if __name__ == '__main__':
    from timeit import Timer
    synctimer = Timer("synctest()", "import gc; from __main__ import synctest; gc.enable()")
    synctimer_times = synctimer.repeat(10, 1)
    threadtimer = Timer("threadtest()", "import gc; from __main__ import threadtest; gc.enable()")
    threadtimer_times = threadtimer.repeat(10, 1)
</code>

<p>And here're the results:
<img src="http://l0stinsp4ce.com/files/gimgs/21_thread_vs_sync_url.png" alt="URL comparison" /></p>

<p>As you can see, the threaded version successfully does the factorial while it is waiting for a web response.</p>

<p>But does this work for IO from disk?</p>

<p>To test, we can just swap in a new function and point the script at a relatively big file on disk:</p>

<code>def count_lines(fn):
   print('starting read')
   open(fn, 'r').read()
   print('done reading')
   return
</code>

<p>And as before, the results via seaborn density plots:
<img src="http://l0stinsp4ce.com/files/gimgs/21_thread_vs_sync_diskio.png" alt="disk IO comparison" /></p>

<p>Not really much of a gain!</p>

<p>So: can you actually parallelize the disk IO and compute tasks via threads in Python? Without spawning more processes?</p>

<h3>Under the hood</h3>

<p>This smells like a potential issue with Python's GIL, or 'global interpreter lock'.</p>

<p>The GIL basically says there's only one entry point to the CPython interpreter, and if you (a Python thread) are passing instructions in, you will retain that status at the front of the line until you release it (or, if you're not holding it explicitly, 100 instructions go by).</p>

<p>So while waiting on a socket to have new content (the way we do for web and database requests) lets other threads proceed, perhaps disk IO doesn't work so nicely!</p>

<p>At this point we should maybe slow down and say: what is disk IO, anyway? Fundamentally, I understand it to be a call to your operating system's kernel that says 'read these bytes from the file at this location to the memory buffer at this other location'. In our 'read speed rules of thumb' scale, we think of this as being 'spinning magnet' speed, which is much slower than memory or CPU speed. Thus, we should be able to tell the kernel to copy that information and do some computing while it deals with the spinning disk. (An added thorn is that the OS may cache a file in memory after reading it, so we may be ultimately just copying memory from location to location, and ultimately waiting on the disk only for the first call.)</p>

<p>Now, the complexity for how we do this in Python is another issue. We need the system call to be made in a way that would allow some other Python code to run while the reading is happening. That is, we need whatever's going on when you call <code>read</code> on a file object to 'release the GIL'.</p>

<p>It's not clear that really happens. Here's the bottom of the rabbit hole, <code>_Py_read</code> in CPython's <code>fileutils.c</code>:</p>

<code>/* Read count bytes from fd into buf.
   On success, return the number of read bytes, it can be lower than count.
   If the current file offset is at or past the end of file, no bytes are read,
   and read() returns zero.
   On error, raise an exception, set errno and return -1.
   When interrupted by a signal (read() fails with EINTR), retry the syscall.
   If the Python signal handler raises an exception, the function returns -1
   (the syscall is not retried).
   Release the GIL to call read(). The caller must hold the GIL. */
Py_ssize_t
_Py_read(int fd, void *buf, size_t count)
{
    Py_ssize_t n;
    int err;
    int async_err = 0;

#ifdef WITH_THREAD
    assert(PyGILState_Check());
#endif
</code>

<p>Typically, this read method is called in a <code>while</code> loop until a file is exhausted. The caller, then, is a while loop (by the name of <code>readall</code>) that calls the code above again and again. Entering the function, there's an assertion that the GIL is being held. So it looks like there's no way to do other Python work while the system call is being handled!</p>

<h3>An alternative</h3>

<p>Fortunately, there are other ways of reading a file than this iomodule. In particular, various C libraries for event loop programming are accessible to us, which may offer their own explicitly non-blocking interface to file system calls.</p>

<p>One such library is libuv, the micro event library that backs nodejs. Among many other things, it allows you to register file system work (such as opening and reading) on an event loop, and to pass callbacks that should receive the return value of that work.</p>

<p>For the final tests, we'll look at the performance comparison among threaded/synchronous/callback approaches. To address the potential 'cached file' problem, we'll delete and create a bunch of files as part of each trial run. And to make the motivation a bit more real, we'll now consider the task of reading a directory of files and calculating the md5 checksum of each file.</p>
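<p>For reference, the synchronous baseline for that checksum task might look like this (the directory and files here are throwaway stand-ins for the benchmark data, not the repo's actual harness):</p>

```python
import hashlib
import os
import tempfile

# md5 every file in a directory, one after another
def checksum_dir(path):
    sums = {}
    for name in sorted(os.listdir(path)):
        with open(os.path.join(path, name), "rb") as f:
            sums[name] = hashlib.md5(f.read()).hexdigest()
    return sums

# throwaway files standing in for the benchmark directory
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "known.txt"), "wb") as f:
    f.write(b"abc")
with open(os.path.join(workdir, "random.bin"), "wb") as f:
    f.write(os.urandom(1024))

sums = checksum_dir(workdir)
```

The threaded and callback variants race this loop: each `read` blocks here, which is exactly the time the event-loop version spends hashing.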

<p>More code and results are on the repo <a href="https://github.com/bhtucker/concurrentfun">here</a>, but to cut to the chase, the callback-based strategy dominated the synchronous one, while the threaded performance remained in that vague "this might be better" territory!</p>

<p><img src="http://l0stinsp4ce.com/files/gimgs/21_pyuv_vs_seq.png" alt="CB tests" /></p>


]]>
</description>

<link>http://l0stinsp4ce.com/blogggo/disk-io--python-threads/</link>

<pubDate>2016-06-06 16:53:22</pubDate>

</item>

<item>
<title>Online Alternatives Testing Pt 3</title>

<description>
<![CDATA[

Online Alternatives Testing Pt III



<h1>Online Alternatives Testing Pt III</h1>

<h3>Separating Concerns</h3>

<p>In the <a href="http://l0stinsp4ce.com/index.php/blogggo/online-alternatives-testing-pt-2/">last post</a>, we saw that our strategy of online updates and performance-based alternative rendering efficiently put our best alternative in front of "users".</p>

<p>Now we will begin approaching the practical problems in using a system like this.</p>

<p>In the last implementation, our current estimates of alternative performance were used to synchronously determine the alternative the user should see next. In a web-scale setting, this is a problem: we want to have a variable number of workers serving content, and we want them to respond very fast (no SciPy in the web server!). We also may want to avoid re-initializing a SciPy Beta distribution object with every impression, as this is the most expensive operation we have so far. (Yes, this could be solved by simply using the Beta distribution's probability density function directly.)</p>

<p>To that end, I've sketched an implementation with two separate Actors for our two separate concerns -- impression serving and distribution updating.</p>

<h3>Simple Data Structures</h3>

<p>A primary question is: what is the minimum communication required between these services, and what data structures can represent that?</p>


<ul>
<li>Impression serving wants quick access to the asset it should serve next.</li>
<li>Distribution updating wants quick access to the latest click-through outcomes.</li>
</ul>



<p>We can accomplish this with a stack and some labeled counters:</p>

<p>First, the counters:</p>


<ul>
<li>Each CTA needs an impression counter and a click counter</li>
<li>Impression workers should update the impression count for a CTA when they render it (or, to make this more real, should emit this fact to a durable messaging system like Kafka so arbitrary systems can do what they want)</li>
<li>Click-receiving workers should update the click counter for a CTA</li>
<li>The update worker should reference the counters for each CTA and initialize beta distributions to match.</li>
</ul>



<p>Then, the stack:</p>


<ul>
<li>The update worker should draw a batch of CTAs to render (enough to last until the next update) and push these all onto the stack.</li>
<li>Impression workers should pop the CTA stack and display what they get.</li>
</ul>
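<p>The contract between the two roles can be sketched with in-memory stand-ins (in production, a Redis list and Redis counters via <code>INCR</code> would replace these):</p>

```python
from collections import Counter, deque

# stand-ins for the shared structures
counters = Counter()   # (cta_idx, 'impression' | 'click') -> count
stack = deque()

def update_worker_push(batch):
    """Update worker: push a batch of pre-drawn CTA choices."""
    stack.extend(batch)

def impression_worker_serve():
    """Impression worker: pop the next CTA and count the impression."""
    cta = stack.pop()
    counters[(cta, 'impression')] += 1
    return cta

def click_receiver(cta):
    """Click-receiving worker: count the click."""
    counters[(cta, 'click')] += 1

# hypothetical traffic: the updater pre-draws five renders, three get served
update_worker_push([0, 1, 1, 0, 1])
served = [impression_worker_serve() for _ in range(3)]
click_receiver(served[0])
```

Because every interaction is an atomic increment, push, or pop, any number of impression workers can run against the same structures without coordination, which is the separation of concerns the post is after.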



<h3>Sketch implementation</h3>

<p>Both of these data structures are available in Redis, which can be clustered for high availability and is easily available as a managed service in AWS.</p>

<p>For the moment, we'll just show that this scheme works by using globals in a single Python process. Relevant snippets to be pulled out below, but <a href="https://github.com/bhtucker/betafish/blob/master/sketches/gevent_ping_pong/run.py">here's the full sketch</a>.</p>

<p>Rather than simulate the situation on a timeline, we can just alternate between impression and update tasks at a certain rate. I give the impression worker a bit of state so it can keep track of when it should switch back over to the updater:</p>

<code>        self.mean_requests_per_batch = 300
        self.batch_index = 0
        self.batch_size_dist = poisson(self.mean_requests_per_batch)
        self.batch_target = self.batch_size_dist.rvs(1)[0]
</code>

<p>Alternating between tasks in a single Python process is accomplished with <code>gevent</code>. Our Actors will simply yield control at the end of each task and will take an action if they have something in their <code>inbox</code>.</p>

<p>The impression worker handles tasks with a function like:</p>

<code>    def receive(self, message):
        cta_idx = self.stack.pop()
        # "serve" the ad
        outcome = is_clicked(cta_idx)
        outcome_event = (cta_idx, True if outcome else False)
        # theoretically done by another service
        OUTCOMES[outcome_event] += 1
        self.batch_index += 1
        if self.batch_index == self.batch_target:
            update.inbox.put('stop')
            self.batch_target = self.batch_size_dist.rvs(1)[0]
            self.batch_index = 0
            gevent.sleep(0)
        else:
            view.inbox.put('go')
            gevent.sleep(0)
</code>

<p>Interactions with the shared data structures - counter increments and stack pops - are conflict free, so this role can be played by many simultaneous services.</p>

<p>The update worker is a bit more complicated.</p>

<p>First, here it is without any optimizations:</p>

<code>    def receive(self, message):
        for idx in range(len(self.dists)):
            self.dists[idx] = beta(
                OUTCOMES[(idx, True)],
                OUTCOMES[(idx, False)])

        vals = np.vstack([dist.rvs(self.batch_size) for dist in self.dists])
        self.stack += [vals[:, i].argmax() for i in range(len(vals[0, :]))]
        view.inbox.put('render')
        gevent.sleep(0)
</code>

<p>We re-initialize all distributions and add a predetermined number of draws to the stack.</p>

<p>My sketch adds two quick optimizations:</p>


<ul>
<li>don't re-initialize unchanged betas</li>
<li>set the batch size based on how much of the stack was popped since last update</li>
</ul>



<p>The first is dead simple, but the second requires some decisions. It's a bit like the concept of 'capital requirements': The bank (ie, stack) should have enough elements to prevent a 'run on the bank'.</p>

<p>Rather than use the number of popped elements in recent batches to estimate the distribution of traffic volume, I'll take a simpler / more paranoid approach.</p>

<code>
        # how much was popped since last update?
        just_consumed = self.target_size - len(self.stack)

        # we want batch size to be twice the largest consumption we've seen
        # obviously, over time we'll see more and more outliers and raise this higher and higher, but that's okay
        self.batch_size = max(self.batch_size, (just_consumed * 2))

        # here's the "capital requirement": determine the size of an additional reservoir beyond what's pushed each batch
        # as a guard against steeply increasing traffic volume
        self.target_size = max(self.target_size, (self.batch_size * 2))

        # since we intend to push far more entries than we're likely to pop
        # we should remove old entries to keep the stack length stable
        # (if we're pushing to reach a new, larger target size, we may not need this)
        to_trim = max((len(self.stack) + self.batch_size) - self.target_size, 0)

        # draw new entries and push them
        vals = np.vstack([dist.rvs(self.batch_size) for dist in self.dists])
        for i in range(self.batch_size):
            self.stack.append(vals[:, i].argmax())

        # trim entries from the old side
        for i in range(to_trim):
            self.stack.popleft()

        # this is only exactly true in the non-concurrent example,
        # but it is always true net of impressions served during update process
        assert len(self.stack) == self.target_size
</code>

<h3>Results</h3>

<p>As usual, a GIF of the beta distributions during training:
<img src="http://i.imgur.com/cy02X9f.gif" alt="Training" /></p>

<p>Each of the 100 frames here corresponds to an 'update' batch running. After about 30 batches, the distributions are pretty well stabilized.</p>

<p>And how about batch size? Suppose we're expecting ~300 requests per batch -- how long of a stack are we maintaining, and how many are being pushed each update?</p>

<p><img src="http://l0stinsp4ce.com/files/gimgs/20_batch_size_growth.png" alt="Batch and Stack Size" /></p>

<p>Our initial settings were too low, so we see this quickly get pushed to a pretty stable size. Eventually, we see a high outlier and raise the ceiling one more time.</p>

<p>This sketch shows a scalable way to implement online A/B testing with relatively little volume / performance knowledge up front.</p>


]]>
</description>

<link>http://l0stinsp4ce.com/blogggo/online-alternatives-testing-pt-3/</link>

<pubDate>2016-04-26 21:02:39</pubDate>

</item>

<item>
<title>Understanding Poisson</title>

<description>
<![CDATA[

Understanding Poisson



<h2>Understanding Poisson</h2>

or: The Central Tendency is Not Enough

<p>A fun thing about switching between programming &amp; statisticizing is that when you're in one mode you can allow yourself to totally cut corners on the other mode, and promise yourself you'll fix it later.</p>

<p>For instance, if I'm programming, I might bracket a probability problem ("let's just take the average") and then solve it properly later. Coincidentally, this also makes for nice potential interview questions: "an engineer concludes X, how do you explain their error to them?"</p>

<p>Sound fun? Okay, here's one:</p>

<h2>Programmer hat:</h2>

<p>An engineer, Phil, is simulating an incoming request volume to test out a real time process. Phil knows there are on average 50 requests a second. Phil has a process that should fire every 6 seconds, and its performance may depend on how many requests have happened since it last ran. He wants to be probabilistic about it to get a better idea of real world performance.</p>

<p>Phil's logic goes like this:</p>


<ul>
<li>if there are 50 requests per second on average, there'll be 300 on average between batches</li>
<li>if there are 300 on average, you could just draw a random 0 - 1 number every request, and if it's below (1 / 300), it's the last request of the batch</li>
</ul>



<p>How will Phil's simulation differ from what you would actually expect? How would you have modeled it?</p>

<h2>Statistician hat:</h2>

<p>What Phil has modeled is a process that shares only one thing with the real process: the mean.</p>

<p>Phil's model will produce results like these:
<img src="http://l0stinsp4ce.com/files/gimgs/19_phil_model.png" alt="Phil's model" /></p>

<p>This thing averages to 300, but obviously the odds of being in the 200-300 range are way higher than the 300-400 range, which is probably not accurate as a model of request volume. Batch sizes as calculated from "Phil's model" are overlaid with draws from the geometric distribution, the "real thing" Phil's model reflects.</p>
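<p>Here's a quick sketch (mine, not from Phil) of simulating Phil's per-request coin flip and drawing from scipy's geometric distribution for comparison:</p>

```python
import numpy as np
from scipy.stats import geom

rng = np.random.default_rng(0)
P_LAST = 1 / 300.  # Phil's per-request chance of ending the batch


def phil_batch_size():
    # draw a 0-1 number for each request; the batch ends on a draw below 1/300
    size = 1
    while rng.random() >= P_LAST:
        size += 1
    return size


phil_draws = np.array([phil_batch_size() for _ in range(10000)])
geom_draws = geom(P_LAST).rvs(size=10000, random_state=1)
# both average ~300, but the most likely batch size is 1, not something near 300
```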

<p>So what's the problem? Suppose a request arrives in the first microsecond of a second: is it really true that there's a 1/300 chance that no requests arrive during the next 9.9x10^5 microseconds? And that this 1/300 likelihood is also the odds of a request in the last microsecond being the last one that second? Clearly this is wrong.</p>

<p>What Phil missed is that the thing that actually behaves like an even random draw is the arrival of a request in each microsecond, not the likelihood of a request being the last member of a batch. It's more like (and even this is not quite right since there can be simultaneous arrivals) each microsecond has a 50/10^6 chance of containing a request (req/sec divided by microseconds/sec).</p>

<p>If each microsecond gets this independent shot at a uniform draw, the arrival times take the shape that we saw above. That is, the probability of a request arriving in the current microsecond is p, and in general the probability that the next arrival is k microseconds away is (1 - p)^k * p, where k is the number of microseconds that pass without a request. This arrival times model fixes our issues with the first/last microsecond thought experiment above. The odds of a first-microsecond request being the last in a batch are the same as the odds of the next arrival time being a full batch-duration away: you look way out down the right tail where the probability is very low. The odds of a last-microsecond request being the final request that batch, however, come from the fat far left end of the distribution, as it's just the odds of not seeing another request that microsecond.</p>

<p>With such a large number of random draws in each batch, the number of successes is going to be much more strongly "about the mean" than what you would see if you drew once per success. Indeed, you can simulate a bunch of batches according to this 'random draw each microsecond' logic:</p>

<code>import numpy as np
from scipy.stats import uniform

UNIFORM = uniform(0, 1)

mu_unif_draws = []
mu_p = 50. / 10**6

for i in range(1000):
    ct = 0
    for s in range(6):
        draws = UNIFORM.rvs(10**6)
        ct += np.sum(draws &lt; mu_p)
    mu_unif_draws.append(ct)
</code>

<p>For comparison, we can also draw from the Poisson distribution, which maybe you've heard of as a model of 'expected arrivals within a time period':</p>

<code>from scipy.stats import poisson

p = poisson(50)
poisson_draws = np.sum(p.rvs(6000).reshape(1000, 6), axis=1)
</code>

<p>Here are those two sets of batch size draws:
<img src="http://l0stinsp4ce.com/files/gimgs/19_poisson_comparison.png" alt="Right answer!" /></p>

<p>Right on! And pretty symmetric about the mean!</p>

<p>So now Phil knows about the Poisson distribution and its relation to arrival times. And hopefully Phil knows a little better about when you can't just use the average and move along.</p>


]]>
</description>

<link>http://l0stinsp4ce.com/blogggo/understanding-poisson/</link>

<pubDate>2016-04-25 21:14:20</pubDate>

</item>

<item>
<title>Scipy Matrix Tuning</title>

<description>
<![CDATA[

SciPy Matrix Tuning



<h1>SciPy Matrix Tuning</h1>

<p>Earlier this week, I helped code review a Markov-chain-generating Python app.</p>

<p>One issue I called out was a high space-complexity issue in the text pipeline. Though the code used a generator to supply adjacent words for matrix construction, it first read the entire corpus into memory. A missed opportunity for an all-generator pipeline!</p>

<h3>Yielding Bigrams</h3>

<p>A potential issue raised in response was that bigrams don't all come from the same line (the last token of line n needs to be paired with the first token of line n+1). I wondered if this was really an issue and coded up a solution.</p>

<p>My full initial pass is <a href="https://gist.github.com/bhtucker/5b29318af6cdb0064b7094270e43fad2">here</a>, but the relevant section is:</p>

<code>from collections import deque


def clean(token):
    return ''.join(filter(lambda v: ord(v) &lt; 180, token)).lower()


def tokenize(line):
    return deque(map(clean, line.split()))


def generate_tokens(line_generator):
    last_token = ''
    for line in line_generator:
        tokens = tokenize(line)
        while tokens:
            current_token = tokens.popleft()
            yield (last_token, current_token)
            last_token = current_token
</code>

<p>Our friend <code>deque</code> comes in very handy here, as its efficient <code>popleft</code> method makes it easy to read left-to-right. Having a little bit of state in this generator (<code>last_token</code>) solves the line-splitting issue pretty simply.</p>

<p>Getting bigrams is then as simple as:</p>

<code>with open('file.txt', 'r') as f:
    do_stuff_with_bigrams(generate_tokens(f))
</code>

<h3>Performance Fail</h3>

<p>I was digging these generators and functional text processing (<code>tokenize</code> and <code>clean</code> being simple ways of isolating this logic) in part because the code "felt like" the things it was trying to accomplish. This sort of expressiveness is one reason I love Python.</p>

<p>I went for a similar feel in actually constructing the Markov chain transition matrix:</p>

<code>        self.transition_matrix = np.vstack(
            [np.array(
                [self.bigram_counts[(w1, w2)] / float(self.word_counts[w1])
                 for w2 in self.word_list]
            )
                for w1 in self.word_list
            ]
        )
</code>

<p>This felt expressive re: the content of a transition matrix:</p>


<ul>
<li>We're going to vertically stack some things</li>
<li>Those things are np arrays, so rows</li>
<li>Each of those rows corresponds to a word in <code>word_list</code></li>
<li>The row is <code>word_list</code> long, so the matrix will be square</li>
<li>Each of the cells in the row is the frequency of the second word following the first</li>
</ul>



<p>It's a bit more nested than I'd like, but it felt fine for a quick implementation.</p>

<h3>Not so "quick" after all...</h3>

<p>But when I ran the pipeline for the first time, I discovered it took several times longer to run than the code it was supposedly an improvement on! What gives?</p>

<p>A little performance analysis (read: print statements) showed that nearly the entire runtime was spent constructing this matrix. This is really not surprising: sparsity is a big deal, and this matrix is highly sparse. Even the most dense rows are about 97% zeros.</p>

<p>When you move to sparse datastructures, you have to make some decisions. Do you value row slicing or column slicing? Will you be modifying the sparsity structure after initial construction?</p>

<p>Fortunately, the wonderful <a href="http://www.scipy-lectures.org/advanced/scipy_sparse">docs on SciPy's sparse module</a> lay out the tradeoffs very clearly. Because this matrix is really just a collection of row-level categorical probability distributions, I went with the <code>csr_matrix</code> (compressed sparse row) datastructure.</p>
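<p>As an aside, row orientation suits the access pattern here: generating text means repeatedly slicing out one word's row and sampling the next word from it. A toy sketch (hypothetical words and probabilities, not the app's code):</p>

```python
import numpy as np
from scipy import sparse

WORD_LIST = ['the', 'cat', 'sat']

# toy row-stochastic transition matrix; the real one comes from bigram counts
TRANSITIONS = sparse.csr_matrix(np.array([
    [0.0, 0.5, 0.5],   # after 'the': 'cat' or 'sat', 50/50
    [0.0, 0.0, 1.0],   # after 'cat': always 'sat'
    [1.0, 0.0, 0.0],   # after 'sat': always 'the'
]))

RNG = np.random.default_rng(0)


def next_word(current):
    ix = WORD_LIST.index(current)
    probs = TRANSITIONS.getrow(ix).toarray().ravel()  # cheap row slice with CSR
    return WORD_LIST[RNG.choice(len(WORD_LIST), p=probs)]
```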

<h3>Performance results</h3>

<p>One of the recurring actions this pipeline introduces is converting between a word's string representation and its index in the matrix. I wondered how much time is saved by creating a dictionary of words to indices vs. just calling <code>.index</code> repeatedly. Here's the code I used for performance testing:</p>

<code>def create_dict_getter(mk):
    word_dict = {w: ix for ix, w in enumerate(mk.word_list)}
    return lambda w: word_dict[w]

def create_idx_getter(mk):
    return lambda w: mk.word_list.index(w)

def create_csr(mk, create_getter):
    ix_getter = create_getter(mk)
    rows = []
    cols = []
    vals = []
    for (w1, w2), value in mk.bigram_counts.iteritems():
        rows.append(ix_getter(w1))
        cols.append(ix_getter(w2))
        vals.append(value / float(mk.word_counts[w1]))
    return sparse.csr_matrix(
        (np.array(vals), (np.array(rows), np.array(cols))),
        shape=tuple([len(mk.word_list)] * 2)
    ), (rows, cols, vals)
</code>

<p>One reason for setting it up this way was to make sure the cost of constructing the dictionary would be included in the speed testing.</p>

<p>And here're the outcomes:</p>

<p>First, using index lookups:</p>

<code> %timeit create_csr(mk, create_idx_getter)
1 loops, best of 3: 1.69 s per loop
</code>

<p>Then, with a lookup table:</p>

<code>%timeit create_csr(mk, create_dict_getter)
10 loops, best of 3: 32.3 ms per loop
</code>

<p>Wayyyy faster! And this was with under 6000 words!</p>

<p>So, great, the dictionary is worth it (though probably more worth is to simply make a clean switch to using indices earlier in the process).</p>

<p>Now, how bad was the initial <code>vstack</code> comprehension approach?</p>

<code>%timeit create_dense_stack(mk)

1 loops, best of 3: 42.1 s per loop
</code>

<p>Over 1000 times worse! And that's just considering time complexity -- the dense matrix is also a memory fiend!</p>

<p>So, remember folks, even when your dataset has well under 10k elements, sparsity is a big honkin' deal.</p>


]]>
</description>

<link>http://l0stinsp4ce.com/blogggo/scipy-matrix-tuning/</link>

<pubDate>2016-04-09 17:36:54</pubDate>

</item>

<item>
<title>Online Alternatives Testing Pt 2</title>

<description>
<![CDATA[

Online Alternative Testing (part II)



<h1>Online Alternative Testing (part II)</h1>

<h3>An Implementation</h3>

Building off the <a href="http://l0stinsp4ce.com/index.php/blogggo/online-alternatives-testing-pt-1/">last post</a>, here's an IPython-notebook-ready implementation:

<code>import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import beta, uniform
TRUE_CTR = [.02, .012, .03]
# initialize priors as if we'd seen 3 clicks and 100 impressions
CTR_DISTS = [beta(3, 97)] * len(TRUE_CTR)  
UNIFORM = uniform(0, 1)
LOGS = []

def is_clicked(cta_idx):
    return UNIFORM.rvs(1) < TRUE_CTR[cta_idx]

def draw_cta_index():
    draws = np.hstack([dist.rvs(1) for dist in CTR_DISTS])
    return draws.argmax()

def update_dist(outcome, dist):
    alpha_param, beta_param = dist.args
    if outcome:
        alpha_param += 1
    else:
        beta_param += 1
    return beta(alpha_param, beta_param)

def step():
    cta_idx = draw_cta_index()
    outcome = is_clicked(cta_idx)
    CTR_DISTS[cta_idx] = update_dist(outcome, CTR_DISTS[cta_idx])
    LOGS.append((cta_idx, outcome))
</code>

<h3>Some results:</h3>

To run this, we just need to call <code>step</code> however many times we'd like.

Here's a gif of training (each frame is a plot after 1000 events, and the vertical lines are the 'true CTRs'):

<img src="http://i.imgur.com/xuUOPBs.gif" alt="Training with Argmax Draws" />

The height of each curve is a fair proxy for how much data populates it.

Notice that as it becomes clear that the green distribution is the lowest performing, it basically doesn't get any higher. We stop showing this button.

Initially, the blue curve looks like the winner, but again, as the red curve passes it, the height gains increasingly go to the winner.

<h3>Improvements over performance-agnostic bucketing</h3>

Another interesting comparison is: how different does this process go if you just randomly select a CTA rather than taking the CTA whose CTR distribution produced a winning draw?

We can swap our <code>draw_cta_index</code> function for something like this:

<code>def draw_cta_evenly():
    draws = UNIFORM.rvs(3)
    return draws.argmax()
</code>

Here's a comparison of our distributions after 6000 impressions between these models.

<br>

First, the 'winning beta' method:

<img src="/files/gimgs/17_argmax_draws_6000.png" alt="Winning Beta" />

We have good certainty about the winner, and the losers are only estimated (at this point) well enough to know they lost.

<br>

Then, the 'even draw' method:

<img src="/files/gimgs/17_even_draws_6000.png" alt="Even Draw" />

This is just one particular run, but clearly there's no favoring of higher-performance alternatives.
<br>
The winning beta method has a significant impact on engagement: even in the first 1500 impressions, it produced, in this example, 20% more clicks. That number should only go up as the learning progresses and the impression-share of the winner increases.

<h3>Validating estimator properties</h3>

One nice thing about the "in the lab" example is we can see if the online model has the desired relationship to the ground truth.

In particular, are our beta distributions higher variance when they're further from the truth?

We can add some logging to measure this.

<code>
ERROR_VAR_LOGS = []

def ctr_dist_means():
    # posterior mean CTR for each alternative
    return np.array([dist.mean() for dist in CTR_DISTS])

def log_err_and_var():
    beta_mean_errs = np.abs(np.array(TRUE_CTR) - ctr_dist_means())
    beta_vars = np.array([x.var() for x in CTR_DISTS])
    for err, var in zip(beta_mean_errs, beta_vars):
        ERROR_VAR_LOGS.append((err, var))
</code>

It makes sense to call this periodically, since only one beta is updated each step; below are some results from collecting this data every 50 impressions:

<code>error_df = pd.DataFrame(ERROR_VAR_LOGS, columns=('mean_error', 'dist_variance'))
error_df['dist_std'] = error_df['dist_variance'].apply(np.sqrt)
ax = error_df.plot(kind='scatter', x='mean_error', y='dist_std')
</code>

<br>

<img src="/files/gimgs/17_error_variance_relationship.png" alt="Error STD Relationship" />

Happily, we see a positive relationship between error and distribution variance, so our tightening distributions are properly increasing confidence as they increase accuracy.

The next part will outline how we can separate serving and selection processes and maintain render speed performance.


]]>
</description>

<link>http://l0stinsp4ce.com/blogggo/online-alternatives-testing-pt-2/</link>

<pubDate>2016-04-06 20:34:37</pubDate>

</item>

<item>
<title>Online Alternatives Testing Pt 1</title>

<description>
<![CDATA[

Online Alternative Testing (part I)



<h1>Online Alternative Testing (part I)</h1>

<h3>The old model</h3>

<p>Working as a Data Scientist for a widget in e-commerce, one of my responsibilities was 'overseeing A/B testing.'</p>

<p>There were a lot of assumptions about how testing should work:</p>


<ul>
<li>tests should be planned out in advance with specific 'bucket sizes' (control / test ratios)</li>
<li>tests should give good estimates of the performance of each alternative</li>
<li>tests should be turned on, measured according to plan, then turned off in favor of using the 'winner' 100% of the time</li>
</ul>



<p>At the same time, account managers were somewhat afraid of A/B tests:</p>


<ul>
<li>they risked losing engagement during the test due to underperforming buckets</li>
<li>they would need to have dense quantitative conversations with clients due to the reporting cadence implied by the 'end of a test'</li>
</ul>



<p>There must be a better way!</p>

<h3>A new approach</h3>

<p>For simplicity's sake, let's just talk about a button we want people to click. We want more of them to click it. At the moment, we'll ignore the 'quality' of the click (if more click but fewer buy, for instance, that could be bad).</p>

<p>The "human" part of alternatives testing should really be limited: picking alternatives. Once a new alternative has been selected — meaning we all agree that if it gets more clicks, we want to use it — it should just enter the mix to crash or burn.</p>

<p>A system that gets this right and removes the earlier obstacles should probably have these features:</p>


<ul>
<li>real time probabilistic selection of which alternative to use (no more on / off states)</li>
<li>performance-based selection that balances uncertainty and engagement-maximization, or specifically:
<ul>
<li>use the 'winner' more and more as we become sure it's the winner</li>
<li>use 'losers' only insofar as we're uncertain that they're losers</li>
</ul>
</li>
<li>easy to introduce new alternatives to the system</li>
</ul>



<h3>Get Bayesian</h3>

<p>We can construct such an online system leveraging Bayesian statistics. Rather than the old model's separation of "data collection" and "certainty analysis" phases, Bayesian methods allow us to do both simultaneously.</p>

<p>Not writing a textbook here, so I'll just put this informally: Bayesian methods focus on conditional probability and giving distributions rather than specific estimates. This is intuitive: the ballpark performance you'd estimate for a button design given no evidence is probably wider than the ballpark you'd give after 5000 people have seen it and some percentage have clicked it.</p>

<p>It turns out there's a nice probability distribution, the Beta distribution, that corresponds to estimating the rate at which some binary thing is True. It (roughly) has two parameters: number of successes and number of failures.</p>

<p>For instance, here's a Beta(10, 990) and a Beta(100, 9900) distribution overlaid:
<img src="/files/gimgs/13_beta_1_pct_overlay.png" alt="1pct CTR Overlays" /></p>

<p>Both lines represent data from a 1% CTR button. The green line is the distribution after 10000 such views, so it is tightening in on the true rate, while the blue is after only 1000, and is much wider.</p>

<p>Importantly, while we can be sure the green line didn't come from a button with a long run CTR of 1.5%, the blue line is still substantially above zero at the 1.5% mark.</p>
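<p>We can make that claim concrete with scipy, using the same parameters as the plot (the 1.5% cutoff is just an illustrative threshold):</p>

```python
from scipy.stats import beta

early = beta(10, 990)    # ~1% CTR after 1,000 views
late = beta(100, 9900)   # ~1% CTR after 10,000 views

# probability mass each distribution puts above a long-run CTR of 1.5%
p_early = early.sf(0.015)   # still substantial: we can't rule 1.5% out yet
p_late = late.sf(0.015)     # essentially zero: we can rule it out
```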

<h3>Some (pseudo)code</h3>

<p>With the basics done, here's the pseudocode for my approach:</p>

<code>initialize beta distribution for each alternative
set the parameters of these to reflect 100 impressions with your 'goal' CTR

for each page load:
    draw one value from each beta distribution
    get the index of the largest draw
    show the button corresponding to that index
    update that beta distribution with the outcome of the render
</code>

<p>At any time, you can add or remove alternatives. You just need to initialize new entries with parameters that will beat your current winner so that they make it into the mix. Removal is trivial but unnecessary, since losing alternatives will be irrelevant.</p>
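<p>For instance, introducing a third alternative might look like this (a sketch; the parameter choices are illustrative, not from the original system):</p>

```python
from scipy.stats import beta

# two established alternatives, tightly estimated around 3% and 1.2% CTR
ctr_dists = [beta(300, 9700), beta(120, 9880)]

# a newcomer initialized optimistically (5 'clicks' in 100 'views'):
# its wide, high-mean distribution wins draws until real data pulls it down
ctr_dists.append(beta(5, 95))
```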


]]>
</description>

<link>http://l0stinsp4ce.com/blogggo/online-alternatives-testing-pt-1/</link>

<pubDate>2016-04-05 21:22:59</pubDate>

</item>

<item>
<title>Shared State &amp; UWSGI</title>

<description>
<![CDATA[

Flask WSGI



<h1>Flask WSGI</h1>

No, it ain't WYSIWYG...

<h2>App Lifecycles</h2>

<p>I heard many times the phrase that 'a Flask app instance's lifecycle is typically only one request' and received it as a simple truth. For serving web pages that's great, but for some other uses (and designs), it may be a problem.</p>

<p>In particular, I was worried about the implications of this for an app with a model that's performing online learning: if every request is simply an update, and the model is supposed to live across these updates, how will that work? Where should it be stored?</p>

<p>So what's the deal with the 'one request' lifecycle? Is that always true? What makes that true?</p>

<h2>Long-running Flask App</h2>

<p>In working on my <a href="https://github.com/bhtucker/perceptron_viz">voweler</a> project, I expected this to be a problem. The app runs a perceptron that gets an update with each request.</p>

<p>Ripping off mattupstate's overholt project, I was running the app in a single Python process via a module like this:</p>

<code>from werkzeug.serving import run_simple
from werkzeug.wsgi import DispatcherMiddleware

import voweler

application = DispatcherMiddleware(voweler.create_app())

if __name__ == "__main__":
    run_simple(
        '0.0.0.0', 5000, application,
        use_reloader=True, use_debugger=True)
</code>

<p>Lo and behold, there was no issue with cross-request persistence! Re-reading the code here, that actually makes perfect sense: I'm invoking this program once, it calls <code>create_app</code> exactly once, and thus only one app instance is ever handling requests.</p>

<p>So what was I missing?</p>

<h2>Serving at the will of the (UWSGI) Emperor</h2>

<p>The fundamental problem: the statement about Flask app lifecycles isn't actually about Flask, it's about whatever's dispatching requests to Flask.</p>

<p>My first experience dealing with "dispatching" was in making a little PHP API at my first job after college. We had an Ubuntu server in the office and set up nginx and PHP to provide access to our "street address disambiguation service". I recall the phrase "PHP-FPM", which at the time I thought was about "forwarding" but upon googling is in fact "FastCGI Process Manager"! Wow! This CGI (Common Gateway Interface) thing is back, and so are "fast processes", ie, processes with a lifecycle of one request!</p>

<p>So there's a pattern where web servers send interesting requests to something that can create short-lived processes written in a language like PHP or Python. In Flask land, a common answer to this appears to be UWSGI.</p>

<p>With UWSGI, we could again (like with the PHP-FPM example) use one nginx webserver, pass requests for non-static resources along to UWSGI, and have UWSGI spawn short-lived Flask apps.</p>

<h2>Okay, now it's broken</h2>

<p>Now that I know what's responsible for the short lifecycles, I'll run that and see if I can break my app.</p>

<p>Following the UWSGI quickstart guide, I was able to do a simple:</p>

<code>uwsgi --http :9090  --wsgi-file wsgi.py  --processes 4 --threads 2 --stats 127.0.0.1:9191
</code>

<p>Using <code>uwsgitop</code>, I was quickly able to see that requests were being spread across the workers (though there was a clear winner handling the majority of requests, which is a mystery for another day).</p>

<p>My favorite application-specific way of detecting a problem involved exposing the perceptron to a data point that it already had right. Working properly, the app should do nothing: no error, no update. However, because the app instance that properly classified 'p' as a consonant wasn't the app instance that received the request to "consider the consonant 'p'", we saw something happen. In fact, even if both had classified 'p' properly, we'd likely see a visual change because the weight vectors of the two perceptrons would be different.</p>

<h2>Can we fix it?</h2>

<p>I discussed my problem with <a href="https://github.com/mjec">Michael Cordover</a> and we talked about a few solutions:</p>


<ul>
<li>Have a single instance of the perceptron (ie, "just don't have the problem in the first place")</li>
<li>Fetch the current perceptron state from a shared datastore each time an app is created (redis might be appropriate if the weight adjustments were properly designed, though postgres would better handle simple read/write of the entire weight vector). By adjusting the life cycle of each app to more than one request, we could amortize this startup cost to a probably workable degree.</li>
<li>Be a real crazy and try to share memory between Python processes. Given that I'm at RC f.k.a. Hacker School, this seems like the right approach.</li>
</ul>



<h2>Abusing UWSGI SharedArea</h2>

<p>Despite the abundant options within proper datastores and shared caches (which is really what we're dealing with) I wanted to have the "fun" of getting closer to managing the memory myself.</p>

<p>UWSGI offers a shared memory feature called SharedArea which offers a Python interface to read/write values from the same memory address within the many short-lived app instances it creates to handle requests.</p>

<p>I definitely appreciated the safeguards UWSGI offers for doing something like this. Via proper locking/unlocking in the read/write methods, I wouldn't have to worry about building my own system of semaphores or something.</p>

<p>But while UWSGI sharedarea itself is thread safe, that doesn't mean your (my) code was.</p>

<p>Initially I had something like this:</p>

<code>def set_state(value):
    """
    Serializes object containing numpy arrays
    writes null bytes across sharedarea before writing real values
    """
    clear_sharedarea()  # function to write null bytes across memory
    json_string = json.dumps(_prepare_numpy_object(value))
    memory_length = len(uwsgi.sharedarea_memoryview(0))
    uwsgi.sharedarea_write(0, 0, json_string)
</code>

<p>While this worked fine if I kept the requests to a calm manual clicking pace, when I upped the concurrency and used my 'send a bunch of requests' buttons, I got into bad states. Sometimes an app would read the memory and find nothing, sometimes it would find invalid JSON, typically with an error at the end. Seeing a string terminating with <code>29095]}5839]}</code> seemed like good evidence that the memory clearing was not quite on target.</p>

<p>The epiphany came when I compared it to database client code: I'm wiping and writing in two different transactions, and there's no reason someone couldn't be reading (or writing) between the two! Thread safety was guaranteed within one <code>uwsgi.sharedarea_*</code> function call, but of course not between them!</p>

<p>So I needed my clearing and writing to be the same function. Here's what I ended up with:</p>

<code>def set_state(value):
    """
    Serializes object containing numpy arrays
    writes null bytes for remaining space in sharedarea
    """
    json_string = json.dumps(_prepare_numpy_object(value))
    memory_length = len(uwsgi.sharedarea_memoryview(0))
    uwsgi.sharedarea_write(0, 0, (
        json_string + (memory_length - len(json_string)) * b'\x00')
    ) 
</code>
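<p>The <code>_prepare_numpy_object</code> helper isn't shown above; a minimal version (an assumption about its behavior, not the project's actual code) just converts numpy arrays to lists so they're JSON-serializable:</p>

```python
import json

import numpy as np


def _prepare_numpy_object(value):
    """Recursively convert numpy arrays to lists so json.dumps can handle them."""
    if isinstance(value, np.ndarray):
        return value.tolist()
    if isinstance(value, dict):
        return {k: _prepare_numpy_object(v) for k, v in value.items()}
    return value


state = {'weights': np.array([0.1, 0.2])}
json_string = json.dumps(_prepare_numpy_object(state))
```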

<p>This ends up working fine, though there's really no good reason to do this instead of using a "real" cache. I'll also note the <code>memory_length</code> should probably be measured once or read out of the application config (since the area is only allocated once on container startup) rather than each time we write a value. Oh well, this was a mere exercise :)</p>

<p>The result is on github on a <code>multiapp</code> <a href="https://github.com/bhtucker/perceptron_viz/tree/multiapp">branch</a>!</p>







]]>
</description>

<link>http://l0stinsp4ce.com/blogggo/shared-state--uwsgi/</link>

<pubDate>2016-04-04 17:13:16</pubDate>

</item>

<item>
<title>Fun with Perceptron Learning</title>

<description>
<![CDATA[

Perceptron Learning



<h2>Perceptron Learning</h2>

A perceptron is the simplest form of 'neural network' learning.

The 'neuron' is simply a weight vector: multiply it against a data point and threshold the result to say 1 or 0.

<img src="/files/gimgs/4_weights.png" alt="Weights" />

Learning consists of exposing the perceptron to data, and if it's wrong, adjusting the vector in the other direction.

A simple case I wanted to try out was learning which letters are vowels. As I thought about how to represent letters, I realized I'd want a dense, somewhat high-dimensional vector. Initially I accomplished this with hashing:

<code>In [7]: hash('h')
Out[7]: 13312040041
In [8]: vp.to_vec('h')
Out[8]: array([1, 3, 3, 1, 2, 0, 4, 0, 0, 4, 1])
</code>

This was a fine start, but one of my goals was showing the power of dimensionality for this sort of arbitrary class separation. Viewing the hash as simply a random draw in some space, I refactored the 'embedding' of letters to vectors using <code>np.random.random</code> to get an array of the specified size.
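That refactor might look something like this. It's a hypothetical reconstruction, not the repo's actual <code>to_vec</code>; seeding a generator per letter (rather than calling <code>np.random.random</code> directly) keeps the draw stable across calls:

```python
import numpy as np


def to_vec(letter, dim=11):
    # Treat each letter's embedding as a fixed random draw in [0, 1)^dim,
    # seeded by the letter's code point so repeated calls agree
    rng = np.random.default_rng(ord(letter))
    return rng.random(dim)
```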

<h2>Training Visualization App</h2>

The real thing I wanted to get to was seeing the online learning aspect of the perceptron in operation. Putting the user in control of the sequence of learning data would illustrate that you can control the 'picture of the world' the learner has. Feed it 'consonants', it will make pro-consonant predictions; feed it 'vowels', it will be overly vowel-ey.

Here's the UI I ended up with (or really a slice of it):
<img src="/files/gimgs/10_perceptron_learned.png" alt="UI" />

<h2>To run</h2>

The project is on GitHub <a href="https://github.com/bhtucker/perceptron_viz">here</a>.

More packaging (on both Python and JS sides) will be coming soon!


]]>
</description>

<link>http://l0stinsp4ce.com/blogggo/fun-with-perceptron-learning/</link>

<pubDate>2016-04-04 16:57:35</pubDate>

</item>

<item>
<title>Welcome</title>

<description>
<![CDATA[

Wow!



<h2>Wow!</h2>

<p>You have found my RC batch blog! Congratulations!</p>

<p><img src="/files/gimgs/sys-7_little_rc_headshot.jpg" alt="My face" /></p>

<p>The posts here will be very topical, mostly recaps of things I've learned since the prior post. More about me can be found <a href="http://bhtucker.github.io/">here</a>.</p>

<h2>What is this site?</h2>

<p>So one day, as inevitably occurs, my introductory promotional rate on my home internet expired. Suddenly I was paying nearly twice as much!</p>

<p>Like any responsible customer, I called and threatened to switch to a competitor. While on hold, I looked for the first time at what exactly my package contained. Lo and behold! Free hosting!</p>

<h2>Free hosting is "as in beer", unfortunately</h2>

<p>I like free stuff, but honestly, I probably get more out of the free as in libre than the free as in gratis. (I recently looked into the "as in beer" reference that comes up in open source discussions all the time and found the source, so if you feel the same way, <a href="http://www.gnu.org/philosophy/free-sw.en.html">check it out!</a>)</p>

<p>Free hosting is a great example of gratis without libre: you can have this for free, but you can only run the LAMP stack.</p>

<p>For instance, I'd never seen this before:</p>

<code>l0stinsp4ce@lsh2101:~$ sudo echo 'fish'
-bash: sudo: command not found
</code>

<p>So the path that led me to this outdated PHP CMS is more one of nostalgia and freegan impulses (don't let that free hosting spoil!) than GNU-style freedom.</p>


]]>
</description>

<link>http://l0stinsp4ce.com/blogggo/numba-one/</link>

<pubDate>2016-03-30 16:03:17</pubDate>

</item>
</channel>
</rss>