Useful Python packages for Data Science#

By Zeming Yu

This article was originally published on Actuaries Digital, the magazine of the Actuaries Institute Australia, on 11 April 2019 as “My top 10 Python packages for data science”.

There are many benefits to adopting these open source packages, including:

  • Everything is free

  • Most likely they are constantly being updated and improved

  • There’s a large community offering support to each other on websites like Stack Overflow

It does take some time to get familiar with these packages. However, if you are the kind of person who gets excited about learning new things, you’ll actually enjoy the process. Hopefully you find this list useful!

Data processing#

pandas#

Developed by Wes McKinney more than a decade ago, this package offers powerful data table processing capabilities. For people with a SAS background, it offers something like the functionality of the SAS DATA step: you can do sorting, merging, filtering and so on. The key difference is that in pandas you call functions on a DataFrame to perform these tasks.

By the way, I was really amazed to learn that Wes McKinney was able to develop pandas after only a few years of Python experience. Some people are just really gifted!

His book Python for Data Analysis is highly recommended if you are just starting out on your Python data science journey.
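To make the sorting, merging and filtering mentioned above a little more concrete, here is a minimal sketch. The two tables and their column names (customer_id, age, premium) are made up for illustration.

```python
import pandas as pd

# Hypothetical customer and policy tables, invented for illustration
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age": [70, 34, 52],
})
policies = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "premium": [500.0, 320.0, 150.0, 410.0],
})

# Merging: attach each customer's age to their policies (a left join)
merged = policies.merge(customers, on="customer_id", how="left")

# Filtering: keep only policies held by customers aged 50 or over
over_50 = merged[merged["age"] >= 50]

# Sorting: order the remaining rows by premium, largest first
result = over_50.sort_values("premium", ascending=False).reset_index(drop=True)
print(result)
```

Each step is a method call that returns a new DataFrame, so the steps can also be chained into a single expression if you prefer.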

numpy#

pandas builds on top of another important package, numpy, so when you work with data you will often rely on numpy for basic data manipulations. For example, when you need to create a new column based on the age of the customer, you can do something like:

df['isRetired'] = np.where(df['age'] >= 65, 'yes', 'no')
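For a version of that line that runs on its own, here is a small sketch. The DataFrame and its ages are made up; only the np.where call comes from the line above.

```python
import numpy as np
import pandas as pd

# Made-up customer ages, for illustration only
df = pd.DataFrame({"age": [23, 65, 71, 40]})

# np.where(condition, value_if_true, value_if_false) works element by element
df["isRetired"] = np.where(df["age"] >= 65, "yes", "no")
print(df)
```

The condition is evaluated for the whole column at once, which is usually much faster than looping over the rows.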

qgrid#

An amazing package which allows you to sort, filter, and edit DataFrames in Jupyter Notebooks.

Graphing#

The next three packages are all to do with graphing, which is a key step in exploratory data analysis.

matplotlib#

This package lets you draw all sorts of graphs. If you are using it in a Jupyter Notebook, remember to run this line of code to enable the display of the graphs:

%matplotlib inline
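Once display is enabled, a first chart can be as short as the sketch below. The yearly claim counts are made-up numbers, and the labels are just one way to annotate the axes.

```python
import matplotlib.pyplot as plt

# In a Jupyter Notebook, run the %matplotlib inline magic in a cell first

# Made-up data for illustration
years = [2015, 2016, 2017, 2018]
claims = [120, 135, 128, 150]

# Create a figure with a single set of axes and draw a line chart on it
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(years, claims, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Number of claims")
ax.set_title("Claims by year (made-up data)")
```

Outside a notebook you would typically finish with plt.show() or fig.savefig(...) to see or save the chart.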

seaborn#

With the help of this package, you can make matplotlib graphs look much more attractive.

plotly#

Nowadays we come across interactive graphs everywhere. They offer a much better user experience. For example:

  • when we hover the mouse over a line plot we expect some text to pop up.

  • when we select a line, we expect it to stand out from the other lines.

  • sometimes we would like to zoom into parts of the graph.

plotly allows you to build these interactive graphs easily within a Jupyter Notebook. A great way to share work with your colleagues and stakeholders is sending a webpage (a Jupyter Notebook) with beautiful, interactive plotly graphs embedded.

The best part is there is no need for the recipient to install any special software other than a modern internet browser.

Modelling#

statsmodels#

This package allows you to build Generalized Linear Models (GLMs) which are still widely used by actuaries today.

It also offers time series analysis and other statistical modelling capabilities.

scikit-learn#

This is the main machine learning package, allowing you to complete most machine learning tasks, including classification, regression, clustering, and dimensionality reduction.

I also use the model selection and pre-processing functions. From k-fold cross validation to scaling data and encoding categorical features, it has so much to offer.
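The sketch below puts those pieces together: scaling a numeric feature, one-hot encoding a categorical feature, and scoring a model with 5-fold cross validation. The data is simulated and the column names (age, state) are made up; logistic regression is just a convenient stand-in model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Simulated data: the target depends on age, while state is pure noise
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "state": rng.choice(["NSW", "VIC", "QLD"], size=n),
})
y = (X["age"] + rng.normal(0, 10, size=n) > 50).astype(int)

# Pre-processing: scale the numeric column, one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["age"]),
    ("encode", OneHotEncoder(handle_unknown="ignore"), ["state"]),
])

# Putting pre-processing inside the pipeline means it is refit on each fold
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])

# Model selection: 5-fold cross validation, reporting accuracy per fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv)
print(scores.mean())
```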

lightgbm#

This is one of my favourite machine learning packages for Gradient Boosting Machines (GBMs). I gave a talk about this package at the 2018 Data Analytics Seminar.

# Embed the YouTube video in the notebook
from IPython.display import YouTubeVideo
YouTubeVideo('pzwE1WBOAnU', width=800, height=300)

For a fraction of the time and effort needed to build GLMs, you could run a GBM, look at the feature importances to find out which features matter most to your model, and gain a good initial understanding of the problem. This can be a standalone step, or a quick first step before building a full GLM that’s more readily accepted by the stakeholders.

lime#

Model interpretation is still a challenge for machine learning models like GBMs. When stakeholders don’t understand a model, they can’t trust it, and as a result the model doesn’t get adopted.

However, I feel model interpretation packages like lime are starting to change this. They allow you to examine each model prediction and work out what’s driving the prediction.

Conclusion#

I’ve listed my top 10 packages. Have you come across any other useful packages? Please share them in the comments below.

“Exploration is really the essence of the human spirit.” – Frank Borman

This article was originally published on Medium.com