08 March, 2014

Profiling python with cProfile

A couple of days ago, I had an assignment for an Information Retrieval class that basically involved:

Indexing a large corpus (Reuters from nltk)
Searching it using boolean queries (eg. bill AND gates)

For the second part of the of the assignment, performance was pretty important. Since we would want to return results for the user's queries quickly.

This blogpost is basically about how I used python's cProfile to identify and fix bottlenecks/slow parts in my code. Most of these bottlenecks would have been hard to identify without the profiler.

During my internship at Quora, one of the things I worked on was POST speed improvements for core actions across the product. It was my first brush with speed work and the main lesson I took away was the importance of measuring and profiling before attempting to optimise.

"Bottlenecks occur ins surprising places, so don't try to second guess and put in a speed hack until you have proven that's where the bottleneck is." - Rob Pike

Measure

After getting a working submission out. I proceeded to measure/benchmark my code. Since this program was meant to run as a script from the command line, I used the simple time command-line tool to roughly benchmark how fast my code was.

$ time python search.py
# real    0m1.521s
# user    0m1.250s
# sys     0m0.142s

Profile

Once I was happy with the measurement, I proceeded to profile my code using cProfile.

$ python -m cProfile -o profile_output search.py

The -o flag basically specifies an output filename for cProfile to write its output to. (Without which, it'll simply spit out the stats to the stdout, which is undesirable since we'd be unable to sort/drill down into specific functions.)

Making sense of the cProfile output

The cProfile output is a basically binary file that contains the following information:

(For each function called in the python program)

How long each call took (percall, inclusive and exclusive)
How many times it was called (ncalls)
How long it took (cumtime: includes the times of other functions it calls)
How long it actually took (tottime: excludes the times of other functions)
What functions it called (callers)
What functions called it (callees)

If you didn't specify the output, you'll basically only get a dump of the information without the caller/callees part. (Which is quite helpful in making sense of everything). You'd also lose the ability to dynamically re-sort the information based on your preferred metric (unless you re-profile it with a -s flag (I think)).

Reading the cProfile binary output file

In order to read this binary file, python provides a pstats.Stats class that happily spits out the various infomation for you in a python shell (when called with the relevant methods).

I found this rather tedious and googling around for an easier way to read this binary file yield nothing. I just wanted a simple way to:

See all the information
Sort them with a single click
Drill down to functions (and their callers and callees) with a single click

(These as oppose to manually calling methods on the Stats object each time.)

So I wrote this: CProfileV. Which is bacially a thin wrapper for viewing python cProfile output in your browser. Yay!

[UPDATE]: cProfileV has been updated. See https://github.com/ymichael/cprofilev for the latest usage instructions.

# Install cprofilev
$ sudo pip install cprofilev

# Call it with your cprofile output
$ cprofilev -f /path/to/cprofile/output

# Navigate to http://localhost:4000

cprofilev1

Finding the bottlenecks

The pstats.Stats object allows you to sort by the various keys and their combinations. Most often, I find that the most useful sort keys are:

cumulative time
total time
calls (to find unneccessary function calls (objects being created in a loop for instance))

Sorting by total time for me yielded this:

cprofilev2-tottime

The top line stood out to me. The function not_operation was taking a suspiciously long time.

Clicking into the not_operation showed that the functions it was calling were not taking a lot of time. (Implying that the slowness was due to some code within the function itself.)

cprofilev3-not-operation

# line 76 of search_index.py
def not_operation(operand, dictionary, pfile):
    """Performs the operation `NOT operand`."""

    # A list of all the documents (sorted)
    all_docs = dictionary.all_docs()

    # A list of the documents matching `operand` (sorted)
    results = get_results(operand, dictionary, pfile, force_list=True)

    return [doc for doc in all_docs if doc not in results]

So it turns out that the list comprehension in the function was basically really inefficient. It became super obvious once I narrowed down that not_operation was slow.

Optimise/Fix ineffient code

Excited to have found a possible bottleneck, I quickly implemented a fix.

# the fix.
def not_operation(operand, dictionary, pfile):
    """Performs the operation `NOT operand`."""

    # A list of all the documents (sorted)
    all_docs = dictionary.all_docs()

    # A list of the documents matching `operand` (sorted)
    results = get_results(operand, dictionary, pfile, force_list=True)

    return list_a_and_not_list_b(all_docs, results)


def list_a_and_not_list_b(a, b):
    """Returns `a AND NOT b`.

    Both a and b are expected to be sorted lists.

    """
    results = []
    idx_a = 0
    idx_b = 0
    while idx_a < len(a) and idx_b < len(b):
        if a[idx_a] < b[idx_b]:
            results.append(a[idx_a])
            idx_a += 1
        elif b[idx_b] < a[idx_a]:
            idx_b += 1
        else:
            idx_a += 1
            idx_b += 1

    while idx_a < len(a):
        results.append(a[idx_a])
        idx_a += 1

    return results

Measure again

Measuring the time taken showed promising results. With this fix, the time taken dropped from roughly 1.5s to 1.15s.

time python search.py
# real    0m1.160s
# user    0m1.018s
# sys     0m0.133s

Profiling the code showed that the slowness was no longer coming from not_operation.

cprofilev4-not-operation

Rinse and repeat

This time though, the most of the time seemed to be spent in the list_a_and_not_list_b operation.

cprofilev5-bool-operation

In particular, I seemed to be doing ~200k len operations and ~115 append operations on some list objects.

This seemed like a red-flag so I took a closer look at the list_a_and_not_list_b function.

It turns out that I really didn't need the while loop at the end of the function.

# From this.
while idx_a < len(a):
    results.append(a[idx_a])
    idx_a += 1

# To this.
results.extend(a[idx_a:])

Measuring with this new change:

time python search.py
# real    0m0.895s
# user    0m0.771s
# sys     0m0.122s

Woot! Got it to run under a second!

Summary

Overall, I'm pretty happy with the performance gain I got. I did a couple more that I didn't cover here (some other boolean operations were performing poorly under other testcases). Its a pretty cool feeling to methodically find bottlenecks and fix them.