Py: Python for Performance#

Gotta go faster

By Henry Ma and Jacky Poon


Introduction#

For data analysis, efficiency matters when working with datasets of a sufficient size. Efficiency reduces running time and expenses when operating on the cloud, and for those limited to underpowered work laptops, it can mean the difference between completing the analysis and running into frustrating delays or out-of-memory errors.

We share some tips and techniques for optimising the performance of Python data analysis code.

Avoiding explicit loops#

Certain computations require applying the same routine many times over; for example, a Monte Carlo simulation that generates a large number of paths. A “for loop” might be the most straightforward choice here, but it materially slows down the runtime. In fact, any explicit loop has this issue because Python, being an “interpreted” language, needs time to interpret the code on every iteration of the loop. Vectorisation substantially speeds up the code. Let’s see an example below.

Suppose one wants to price a simple European call option under the Black-Scholes model (recalling Part 1 of the actuarial exams) using Monte Carlo simulation.
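For reference, each simulation draws a terminal stock price under the risk-neutral Black-Scholes dynamics, and the option price is the discounted average payoff; this is exactly what the code below computes:

$$S_T = S \exp\!\left(\left(r - \tfrac{1}{2}\sigma^2\right)T + \sigma\sqrt{T}\,W\right), \qquad W \sim N(0,1)$$

$$\text{price} \approx e^{-rT}\,\frac{1}{n}\sum_{i=1}^{n} \max\!\left(S_T^{(i)} - K,\ 0\right)$$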

import numpy as np
def mcBScall_loop(S,K,sgm,r,T,n):
    payoff_sim = 0.0
    
    for i in range(n):
        # simulate the terminal stock price under the risk-neutral measure
        w = np.random.standard_normal()
        ST = S*np.exp((r-0.5*sgm**2)*T+sgm*np.sqrt(T)*w)
        # call payoff: max(ST - K, 0)
        payoff = ST-K
        payoff = payoff*(payoff>0)
        payoff_sim += payoff
    
    # discounted average payoff across all simulations
    price = np.exp(-r*T) * payoff_sim / n
    return price
%%time
spot = 100
strike = 110
vol = 0.45
r = 0.02
T = 1.5
n = 500000
callprice = mcBScall_loop(spot, strike, vol, r, T, n)
print(f'The call price is {callprice} based on {n} simulations')
The call price is 19.248627288124922 based on 500000 simulations
CPU times: user 1.3 s, sys: 114 ms, total: 1.41 s
Wall time: 666 ms

A wall time of around two-thirds of a second (roughly 1.4 s of CPU time) for a single option might not be satisfactory, especially if pricing many options. Let’s compare this with a vectorised version.

def mcBScall_vect(S,K,sgm,r,T,n):
    # draw all n normal variates at once and simulate every path in a single vectorised step
    w = np.random.standard_normal(n)
    ST = S*np.exp((r-0.5*sgm**2)*T+sgm*np.sqrt(T)*w)
    payoff = ST-K
    payoff = payoff*(payoff>0)
    # discounted average payoff
    price = np.exp(-r*T)*np.average(payoff)
    
    return price
%%time
callprice2 = mcBScall_vect(spot, strike, vol, r, T, n)
print(f'The call price is {callprice2} based on {n} simulations')
The call price is 19.248627288124922 based on 500000 simulations
CPU times: user 12.5 ms, sys: 3.39 ms, total: 15.9 ms
Wall time: 16.5 ms

Performance improved by roughly 40x on wall time (and close to 90x on CPU time) once the explicit loop is removed.

Numba JIT (Just-in-time compilation)#

Python is not inherently fast – the high-performance computations in data science packages are typically done by taking the data out of Python and handing it to fast, compiled C code. However, detailed numerical calculations written in pure Python can still be accelerated through compilation with Numba (https://numba.pydata.org). Numba takes the code and compiles a more optimised version of it – in one example at https://numba.readthedocs.io/en/stable/user/5minguide.html, this reduced running time by 20x, from 6.6 s to 0.33 s.

A particular use case for JIT is to speed up explicit loops, since these sometimes cannot easily be avoided (e.g. in a recursion). Below we see how JIT reduces the runtime of the “for loop” Monte Carlo calculation above. Note that the JIT-compiled function is called once before timing, so that the one-off compilation cost incurred on the first call is excluded.

from numba import jit

@jit(nopython=True)
def mcBScall_loop(S,K,sgm,r,T,n):
    payoff_sim = 0.0
    
    for i in range(n):
        w = np.random.standard_normal()
        ST = S*np.exp((r-0.5*sgm**2)*T+sgm*np.sqrt(T)*w)
        payoff = ST-K
        payoff = payoff*(payoff>0)
        payoff_sim += payoff
    
    price = np.exp(-r*T) * payoff_sim / n
    return price
callprice = mcBScall_loop(spot, strike, vol, r, T, n)
%%time
callprice = mcBScall_loop(spot, strike, vol, r, T, n)
print(f'The call price is {callprice} based on {n} simulations')
The call price is 19.28691426375314 based on 500000 simulations
CPU times: user 18.4 ms, sys: 399 µs, total: 18.8 ms
Wall time: 18.7 ms

A very noticeable speed-up, achieved simply by adding the @jit decorator without changing the function body at all! Let’s see what the runtime looks like when using both JIT and vectorisation (again calling the function once beforehand to compile it):

@jit(nopython=True)
def mcBScall_vect(S,K,sgm,r,T,n):
    w = np.random.standard_normal(n)
    ST = S*np.exp((r-0.5*sgm**2)*T+sgm*np.sqrt(T)*w)
    payoff = ST-K
    payoff=payoff*(payoff>0)
    price=np.exp(-r*T)*np.mean(payoff)
    
    return price
callprice = mcBScall_vect(spot, strike, vol, r, T, n)
%%time
callprice = mcBScall_vect(spot, strike, vol, r, T, n)
print(f'The call price is {callprice} based on {n} simulations')
The call price is 19.193399229389573 based on 500000 simulations
CPU times: user 12.2 ms, sys: 1.65 ms, total: 13.8 ms
Wall time: 13.8 ms

Applying JIT to the already-vectorised function improves the runtime only slightly, since the heavy lifting there is already performed by NumPy’s compiled routines.

Data transformations with faster packages#

The most popular Python package for working with data is pandas. However, when working with sizeable datasets, consider the results of the h2o db-benchmark (https://h2oai.github.io/db-benchmark/): the Polars package is 7-10x faster on 5 GB datasets and can process larger datasets than pandas, thanks to its efficient, performance-orientated design.
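As a minimal sketch of the difference in approach (not a benchmark), the same aggregation in pandas and Polars might look like this. The file claims.csv and its columns claim_year and paid are hypothetical, and the group_by syntax assumes a recent Polars version (older releases use groupby):

import pandas as pd
import polars as pl

# pandas: eager evaluation, reads the full file into memory
paid_pd = pd.read_csv('claims.csv').groupby('claim_year')['paid'].sum()

# Polars: lazy query that is optimised, multi-threaded and only reads the columns it needs
paid_pl = (
    pl.scan_csv('claims.csv')
      .group_by('claim_year')
      .agg(pl.col('paid').sum())
      .collect()
)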

Working with insufficient memory#

Memory is another challenge when analysing larger datasets: for some demanding joins or modelling steps, the memory required may exceed the machine’s capacity. Some tips to reduce memory requirements:

  1. Use the psutil package to measure memory usage at each step and identify the problem steps to optimise (a minimal sketch follows this list).

  2. Consider dropping datasets, rows or columns that are not needed for the current step from memory. If the data is needed for later steps, save it to disk and reload it after the memory-consuming step instead (see the second sketch below). Feather V2 is designed for temporary data with little computational overhead, whilst Parquet is a great format for compressed data storage, especially on machines with slow hard disks.

  3. Some packages have functionality specifically to enable larger datasets. Often this is called “out-of-core” but it has many labels – for example, Polars has “hybrid-streaming”.

  4. For gradient boosting, LightGBM has the “two_round” parameter, which reduces peak memory usage at the cost of a longer runtime. It works well in practice.

  5. Chunking may also reduce memory usage: it may be possible to split the data into, say, one dataset per year, process each one individually and save the results back to disk (see the last sketch below). The TensorFlow and PyTorch neural network packages, which can also be used to fit GLMs, are designed to fit models on batches of data, so the full dataset never needs to be loaded into memory.
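A minimal sketch of tip 1, measuring the resident memory of the current process with psutil before and after a demanding step (the helper name memory_usage_mb is just illustrative):

import os
import psutil

def memory_usage_mb():
    """Resident memory of the current Python process, in megabytes."""
    return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

print(f'Memory before the join: {memory_usage_mb():.0f} MB')
# ... run the demanding join or modelling step here ...
print(f'Memory after the join: {memory_usage_mb():.0f} MB')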
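A sketch of tip 2, assuming a hypothetical large DataFrame called policies that is not needed during the next step (Feather and Parquet both require the pyarrow package):

import pandas as pd

# hypothetical large DataFrame; in practice this would come from earlier steps
policies = pd.DataFrame({'policy_id': range(1_000_000), 'premium': 100.0})

# Feather: fast, low-overhead temporary format
# (Feather requires a default index; reset_index(drop=True) first if needed)
policies.to_feather('policies_tmp.feather')
# policies.to_parquet('policies.parquet')  # Parquet: compressed, better for longer-term storage
del policies  # free the memory before the demanding step

# ... run the memory-hungry join or modelling step here ...

policies = pd.read_feather('policies_tmp.feather')  # reload once the step is done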
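And a sketch of tip 5, processing a large CSV in chunks with pandas so that only one chunk is in memory at a time (the file claims.csv and the columns claim_year and paid are again hypothetical):

import pandas as pd

results = []
# read and summarise one million rows at a time instead of loading the whole file
for chunk in pd.read_csv('claims.csv', chunksize=1_000_000):
    results.append(chunk.groupby('claim_year')['paid'].sum())

# combine the per-chunk summaries and aggregate again
paid_by_year = pd.concat(results).groupby(level=0).sum()
paid_by_year.to_csv('paid_by_year.csv')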