Parallelising Python with Threading and Multiprocessing

One aspect of coding in Python that we have yet to discuss in any great detail is how to optimise the execution performance of our simulations. While NumPy, SciPy and pandas are extremely useful in this regard when considering vectorised code, we aren't able to use these tools effectively when building event-driven systems. Are there any other means available to us to speed up our code? The answer is yes - but with caveats!

In this article we are going to look at the different models of parallelism that can be introduced into our Python programs. These models work particularly well for simulations that do not need to share state. Monte Carlo simulations used for options pricing and backtesting simulations of various parameters for algorithmic trading fall into this category.

In particular we are going to consider the Threading library and the Multiprocessing library.

Concurrency in Python

One of the most frequently asked questions from beginning Python programmers exploring multithreaded code for the optimisation of CPU-bound programs is "Why does my program run slower when I use multiple threads?".

The expectation is that on a multi-core machine multithreaded code should make use of these extra cores and thus increase overall performance. Unfortunately the internals of the main Python interpreter, CPython, negate the possibility of true multi-threading due to a mechanism known as the Global Interpreter Lock (GIL).

The GIL is necessary because the CPython interpreter is not thread-safe: there is a globally enforced lock that a thread must hold before it can safely access Python objects or the C API, and at any one time only a single thread may hold it. The interpreter periodically releases and reacquires this lock, every 100 bytecode instructions in CPython 2.x, or after a time-based switch interval in CPython 3.2 and later, as well as around potentially blocking I/O operations. Because of this lock, CPU-bound code will see no gain in performance when using the Threading library, but it is likely to gain performance if the Multiprocessing library is used.
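As a quick aside, in CPython 3.2 and later the switch interval can be inspected and adjusted at runtime via the sys module. The following minimal sketch simply demonstrates these two standard-library calls:

# gil_interval.py

import sys

# The default switch interval is 0.005 seconds, i.e. the interpreter
# asks the running thread to release the GIL every 5 milliseconds
print(sys.getswitchinterval())

# The interval can be tuned, although this rarely helps CPU-bound code
sys.setswitchinterval(0.01)
print(sys.getswitchinterval())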

Parallelisation Libraries Implementation

We are now going to utilise the above two separate libraries to attempt a parallel optimisation of a "toy" problem.

Threading Library

Above we alluded to the fact that Python on the CPython interpreter does not support true multi-core execution via multithreading. However, Python DOES have a Threading library. So what is the benefit of using the library if we (supposedly) cannot make use of multiple cores?

Many programs, particularly those relating to network programming or data input/output (I/O), are often network-bound or I/O-bound. This means that the Python interpreter is awaiting the result of a function call that is manipulating data from a "remote" source such as a network address or hard disk. Such access is far slower than reading from local memory or a CPU cache.

Hence, one means of speeding up such code when many data sources are being accessed is to spawn a separate thread for each data item.

For example, consider a Python code that is scraping many web URLs. Given that each URL will have an associated download time well in excess of the CPU processing time required to handle it, a single-threaded implementation will be significantly I/O bound.

By adding a new thread for each download resource, the code can download multiple data sources in parallel and combine the results at the end of every download. This means that each subsequent download is not waiting on the download of earlier web pages. In this case the program is now bound by the bandwidth limitations of the client/server(s) instead.
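As a brief sketch of this I/O-bound case, the following code spawns one thread per URL using the Threading library and Python 3's urllib. The URLs here are hypothetical placeholders, so substitute real addresses before running:

# threaded_download.py

import threading
import urllib.request

# Hypothetical placeholder URLs
urls = [
    "http://www.example.com/page1",
    "http://www.example.com/page2",
    "http://www.example.com/page3",
]

results = {}

def download(url):
    # The GIL is released while the thread blocks on network I/O,
    # so the other downloads can proceed in the meantime
    with urllib.request.urlopen(url) as response:
        results[url] = response.read()

threads = [threading.Thread(target=download, args=(u,)) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("Downloaded %s pages" % len(results))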

However, many financial applications ARE CPU-bound since they are highly numerically intensive. They often involve large-scale numerical linear algebra solutions or random statistical draws, such as in Monte Carlo simulations. Thus as far as Python and the GIL are concerned, there is no benefit to using the Python Threading library for such tasks.

Python Implementation

The following code illustrates a multithreaded implementation of a "toy" problem in which several threads append random numbers to lists. Each thread is handed its own list and fills it with random numbers. This has been chosen as a toy example since it is CPU-heavy.

The following code will outline the interface for the Threading library but it will not grant us any additional speedup beyond that obtainable in a single-threaded implementation. When we come to use the Multiprocessing library below, we will see that it will significantly decrease the overall runtime.

Let's examine how the code works. Firstly we import the random and threading libraries. Then we create a function list_append that takes three parameters. The first, count, determines the size of the list to create. The second, id, is the ID of the "job" (which can be useful if we are writing debug info to the console). The third parameter, out_list, is the list to append the random numbers to.

The __main__ entry point sets a list size of $10^7$ and uses two threads to carry out the work. It then creates a jobs list, which is used to store the separate threads. Each threading.Thread object takes list_append as its target, along with the function's arguments in the args tuple, and is then appended to the jobs list.

Finally, the jobs are sequentially started and then sequentially "joined". The join() method blocks the calling thread (i.e. the main Python interpreter thread) until the thread has terminated. This ensures that all of the threads are complete before printing the completion message to the console:

# thread_test.py

import random
import threading


def list_append(count, id, out_list):
    """
    Creates an empty list and then appends a 
    random number to the list 'count' number
    of times. A CPU-heavy operation!
    """
    for i in range(count):
        out_list.append(random.random())

if __name__ == "__main__":
    size = 10000000   # Number of random numbers to add
    threads = 2   # Number of threads to create

    # Create a list of jobs and then iterate through
    # the number of threads appending each thread to
    # the job list 
    jobs = []
    for i in range(threads):
        out_list = list()
        thread = threading.Thread(target=list_append,
                                  args=(size, i, out_list))
        jobs.append(thread)

    # Start the threads (i.e. calculate the random number lists)
    for j in jobs:
        j.start()

    # Ensure all of the threads have finished
    for j in jobs:
        j.join()

    print("List processing complete.")

We can time this code using the following console call:

time python thread_test.py

It produces the following output:

List processing complete.

real    0m2.003s
user    0m1.838s
sys     0m0.161s

Notice that the user and sys times together sum to approximately the real time. This indicates that we gained no benefit from using the Threading library; if we had, we would expect the real time to be significantly lower. In concurrent programming these quantities are usually known as CPU time (user plus sys) and wall-clock time (real) respectively.
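If you would rather measure this distinction from within Python itself, rather than via the shell, a minimal sketch (assuming Python 3.3 or later for the two clock functions) is:

# timing_sketch.py

import random
import time

wall_start = time.perf_counter()   # Wall-clock timer
cpu_start = time.process_time()    # CPU timer (user plus sys)

# A CPU-heavy loop, similar in spirit to list_append above
_ = [random.random() for i in range(10000000)]

print("Wall-clock time: %0.3fs" % (time.perf_counter() - wall_start))
print("CPU time:        %0.3fs" % (time.process_time() - cpu_start))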

Multiprocessing Library

In order to actually make use of the extra cores present in nearly all modern consumer processors we can instead use the Multiprocessing library. This works in a fundamentally different way to the Threading library, even though the syntax of the two is extremely similar.

The Multiprocessing library actually spawns a separate operating system process for each parallel task. This neatly side-steps the GIL by giving each process its own Python interpreter, and thus its own GIL. Hence each process can be fed to a separate processor core and the results combined once all processes have finished.

There are some drawbacks, however. Spawning extra processes introduces overhead, as data must be serialised and shuttled between processes. This can add to the overall run-time. However, assuming the data each process needs can be kept local to it, significant speed-ups are possible. Of course, one must always be aware of Amdahl's Law!
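Briefly, Amdahl's Law states that if a fraction $p$ of a program's run-time can be parallelised across $n$ cores, the best achievable speed-up is $S(n) = 1 / \left( (1 - p) + p/n \right)$. For instance, with $p = 0.9$ and $n = 4$ we obtain $S(4) = 1 / (0.1 + 0.225) \approx 3.1$, so the serial fraction rapidly dominates no matter how many cores are added.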

Python Implementation

The only modifications needed for the Multiprocessing implementation are changing the import line and swapping threading.Thread for multiprocessing.Process; the target function and its arguments are passed in exactly the same way. Beyond that the code is almost identical to the Threading implementation above:

# multiproc_test.py

import random
import multiprocessing


def list_append(count, id, out_list):
    """
    Creates an empty list and then appends a 
    random number to the list 'count' number
    of times. A CPU-heavy operation!
    """
    for i in range(count):
        out_list.append(random.random())

if __name__ == "__main__":
    size = 10000000   # Number of random numbers to add
    procs = 2   # Number of processes to create

    # Create a list of jobs and then iterate through
    # the number of processes appending each process to
    # the job list 
    jobs = []
    for i in range(procs):
        out_list = list()
        process = multiprocessing.Process(target=list_append, 
                                          args=(size, i, out_list))
        jobs.append(process)

    # Start the processes (i.e. calculate the random number lists)      
    for j in jobs:
        j.start()

    # Ensure all of the processes have finished
    for j in jobs:
        j.join()

    print("List processing complete.")

We can once again time this code using a similar console call:

time python multiproc_test.py

We receive the following output:

List processing complete.

real    0m1.045s
user    0m1.824s
sys     0m0.231s

In this case you can see that while the user and sys times have remained approximately the same, the real time has dropped by a factor of almost two. This makes sense since we're using two processes. Scaling to four processes while halving the list size for comparison gives the following output (under the assumption that you have at least four cores!):

List processing complete.

real    0m0.540s
user    0m1.792s
sys     0m0.269s

This is an approximate 3.8x speed-up with four processes. However, we must be careful of generalising this to larger, more complex programs. Data transfer, hardware cache-levels and other issues will almost certainly reduce this sort of performance gain in "real" codes.
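One caveat about the code above: unlike threads, each child process receives its own copy of out_list, so the lists filled in the workers are not visible to the parent after join(). When the results themselves are needed back in the parent, one option is multiprocessing.Pool, which pickles each worker's return value and transfers it back automatically. A minimal sketch of this approach, using a hypothetical random_list helper in place of list_append:

# pool_test.py

import random
import multiprocessing


def random_list(count):
    """
    Returns a list of 'count' random numbers.
    """
    return [random.random() for i in range(count)]

if __name__ == "__main__":
    size = 5000000   # Number of random numbers per process
    procs = 4   # Number of worker processes

    # map() sends one 'size' argument to each worker and
    # transfers the resulting lists back to the parent
    with multiprocessing.Pool(processes=procs) as pool:
        results = pool.map(random_list, [size] * procs)

    print("Collected %s lists." % len(results))

Bear in mind that shuttling large result lists back to the parent requires pickling, which can erode the speed-up for data-heavy workloads.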

In later articles we will be modifying the Event-Driven Backtester to use parallel techniques in order to improve the ability to carry out multi-dimensional parameter optimisation studies.