SciPy Matrix Tuning

Earlier this week, I helped code review a Markov chain generating Python app.

One issue I called out was a high space-complexity issue in the text pipeline. Though the code used a generator to supply adjacent words for matrix construction, it first read the entire corpus into memory. A missed opportunity for an all-generator pipeline!
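Roughly, the difference I had in mind looks like this (the file name and the downstream helper are hypothetical placeholders, not the reviewed code):

# Eager: the whole corpus sits in memory before the matrix code ever runs
corpus = open('corpus.txt').readlines()
build_matrix(corpus)

# All-generator: the open file handle is itself a lazy line generator,
# so each line is read only when the downstream generators ask for it
with open('corpus.txt') as f:
    build_matrix(f)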

Yielding Bigrams

A potential issue raised in response was that bigrams don't all come from the same line (the last token of line n needs to be paired with the first token of line n+1). I wondered if this was really an issue and coded up a solution.

My full initial pass is here, but the relevant section is:

from collections import deque


def clean(token):
    # Drop exotic characters (code points >= 180) and lowercase what's left
    return ''.join(filter(lambda v: ord(v) < 180, token)).lower()


def tokenize(line):
    # A deque of cleaned tokens, ready to be consumed left to right
    return deque(map(clean, line.split()))


def generate_tokens(line_generator):
    # Yield (previous, current) token pairs, carrying the last token of one
    # line over to the first token of the next
    last_token = ''
    for line in line_generator:
        tokens = tokenize(line)
        while tokens:
            current_token = tokens.popleft()
            yield (last_token, current_token)
            last_token = current_token

Our friend deque comes in very handy here, as its efficient popleft method makes it easy to consume tokens left to right. Having a little bit of state in this generator (last_token) solves the line-splitting issue pretty simply.

Getting bigrams is then as simple as:

with open('file.txt', 'r') as f:
    do_stuff_with_bigrams(generate_tokens(f))
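As a quick sanity check (with made-up input, not the real corpus), the cross-line pairing works as intended:

list(generate_tokens(['the quick brown', 'fox jumps']))
# [('', 'the'), ('the', 'quick'), ('quick', 'brown'),
#  ('brown', 'fox'), ('fox', 'jumps')]

The only wrinkle is the leading ('', 'the') pair produced by the initial empty last_token, which is easy to skip or ignore.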

Performance Fail

I was digging these generators and functional text processing (tokenize and clean being simple ways of isolating this logic) in part because the code "felt like" the things it was trying to accomplish. This sort of expressiveness is one reason I love Python.

I went for a similar feel in actually constructing the Markov chain transition matrix:

        self.transition_matrix = np.vstack(
            [np.array(
                [self.bigram_counts[(w1, w2)] / float(self.word_counts[w1])
                 for w2 in self.word_list])
             for w1 in self.word_list]
        )

This felt expressive re: the contents of a transition matrix, since each entry is just the count of the bigram (w1, w2) divided by the total count of w1.

It's a bit more nested than I'd like, but it felt fine for a quick implementation.

Not so "quick" after all...

But when I ran the pipeline for the first time, I discovered it took several times longer to run than the code it was supposedly an improvement on! What gives?

A little performance analysis (read: print statements) showed that nearly the entire runtime was spent constructing this matrix. This is really not surprising: sparsity is a big deal, and this matrix is highly sparse. Even the most dense rows are about 97% zeros.

When you move to sparse datastructures, you have to make some decisions. Do you value row slicing or column slicing? Will you be modifying the sparsity structure after initial construction?

Fortunately, the wonderful docs on SciPy's sparse module lay out the tradeoffs very clearly. Because this matrix is really just a collection of row-level categorical probability distributions, I went with the csr_matrix (compressed sparse row) datastructure.
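That row orientation also matches how the chain gets used at generation time. Here's a minimal sketch (assuming m is the CSR transition matrix built below, word_list is the vocabulary, and ix is the current word's row index):

row = m.getrow(ix)  # a 1 x N CSR slice: cheap, since CSR stores rows contiguously
# row.indices are the columns of the nonzero entries, row.data their probabilities
next_word = word_list[np.random.choice(row.indices, p=row.data)]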

Performance Results

One of the recurring actions this pipeline introduces is converting between a word's string representation and its index in the matrix. I wondered how much time would be saved by creating a dictionary of words to indices vs. just calling .index repeatedly. Here's the code I used for performance testing:

import numpy as np
from scipy import sparse


def create_dict_getter(mk):
    # Build the word -> index lookup table once, then close over it
    word_dict = {w: ix for ix, w in enumerate(mk.word_list)}
    return lambda w: word_dict[w]

def create_idx_getter(mk):
    # Linear scan of the word list on every lookup
    return lambda w: mk.word_list.index(w)

def create_csr(mk, create_getter):
    ix_getter = create_getter(mk)
    rows = []
    cols = []
    vals = []
    # .iteritems() is Python 2; use .items() on Python 3
    for (w1, w2), value in mk.bigram_counts.iteritems():
        rows.append(ix_getter(w1))
        cols.append(ix_getter(w2))
        vals.append(value / float(mk.word_counts[w1]))
    return sparse.csr_matrix(
        (np.array(vals), (np.array(rows), np.array(cols))),
        shape=tuple([len(mk.word_list)] * 2)
    ), (rows, cols, vals)

One reason for setting it up this way was to make sure the cost of constructing the dictionary would be included in the speed testing.

And here're the outcomes:

First, using index lookups:

 %timeit create_csr(mk, create_idx_getter)
1 loops, best of 3: 1.69 s per loop

Then, with a lookup table:

%timeit create_csr(mk, create_dict_getter)
10 loops, best of 3: 32.3 ms per loop

Wayyyy faster! And this was with under 6000 words!

So, great, the dictionary is worth it (though it would probably be even better to make a clean switch to using indices earlier in the process).
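Something along these lines would do it (a sketch only; the counting structures here are stand-ins for the ones in the actual class):

from collections import Counter

word_to_ix = {}
bigram_counts = Counter()  # keyed by (row, col) index pairs instead of word pairs
word_counts = Counter()

def ix_of(word):
    # Hand each new word the next free row/column index
    if word not in word_to_ix:
        word_to_ix[word] = len(word_to_ix)
    return word_to_ix[word]

for w1, w2 in generate_tokens(f):  # f: the open corpus file from earlier
    i, j = ix_of(w1), ix_of(w2)
    bigram_counts[(i, j)] += 1
    word_counts[i] += 1

Everything downstream then deals in integers, and the CSR construction needs no string lookups at all.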

Now, how bad was the initial vstack comprehension approach?

%timeit create_dense_stack(mk)
1 loops, best of 3: 42.1 s per loop

Over 1000 times slower than the dictionary-backed sparse construction! And that's just runtime -- the dense matrix is also a memory fiend!
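A rough back-of-envelope for the memory side (the vocabulary size is the only number taken from above; the rest is just arithmetic):

n = 6000                        # a bit over the actual vocabulary size
dense_bytes = n * n * 8         # float64 dense matrix: ~288 MB
nnz = len(mk.bigram_counts)     # one nonzero per observed bigram
# CSR keeps the nonzero values, their column indices, and n + 1 row pointers
sparse_bytes = nnz * (8 + 4) + (n + 1) * 4

Even pessimistically applying the ~3% density of the densest rows to the whole matrix puts the CSR version around 13 MB, versus ~288 MB dense.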

So, remember folks, even when your dataset has well under 10k elements, sparsity is a big honkin' deal.