Py: K-means clustering of COVID dataset#

This notebook was originally created by Amanda Aitken for the Data Analytics Applications subject, as Exercise 6.4 - K-means clustering of COVID dataset, in the DAA M06 Unsupervised learning module.

Data Analytics Applications is a Fellowship Applications (Module 3) subject with the Actuaries Institute that aims to teach students how to apply a range of data analytics skills, such as neural networks, natural language processing, unsupervised learning and optimisation techniques, together with their professional judgement, to solve a variety of complex and challenging business problems. The business problems used as examples in this subject are drawn from a wide range of industries.

Find out more about the course here.


The following code performs K-means clustering on COVID data. Once you have read through the code, run it and inspected the output, you should try using different values of K and observe the differences in the clustering outcomes.


The dataset that is used in this exercise was sourced from Our World in Data at

This dataset was downloaded from the above link on 31 March 2021. It contains country-by-country data on confirmed coronavirus disease (COVID-19) cases and at the time of writing is updated on a daily basis.

The data contains COVID-19 and population related features for over 100 countries. These features include:

  • total cases per million people;

  • new cases per million people;

  • total deaths per million people;

  • new deaths per million people;

  • reproduction rate of the disease;

  • positive testing rate;

  • total tests per thousand people;

  • ICU patients per million people; and

  • hospital patients per million people.


This section imports the packages that will be required for this exercise/case study.

import pandas as pd # Used for data management.

import matplotlib.pyplot as plt
%matplotlib inline 

# The following scikit-learn libraries will be used
# to standardise the features and run K-means clustering.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans


This section:

  • imports the data that will be used in the modelling;

  • explores the data; and

  • prepares the data for modelling.

Import data#

# Note that the following code could be used to read the most
# recent data in directly from the Our World in Data website:

# covid = pd.read_csv('')
# However, we will use a snapshot so that the notebook keeps working even if the dataset format changes.
# Create a dataset called 'covid'.
covid = pd.read_csv('', header = 0)

Prepare data#

# Restrict the data to only look at one point in time (31-Dec-2020)
covid2 = covid[covid['date']=='2020-12-31']

# This analysis will use nine features in the clustering.
# The column 'location' is also retained to give us the country names.
# Countries that have missing values at the extract date are dropped from
# the data table using the .dropna() method.
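covid3 = covid2[['location','total_cases_per_million','new_cases_per_million',
                 'total_deaths_per_million','new_deaths_per_million',
                 'reproduction_rate','positive_rate','total_tests_per_thousand',
                 'icu_patients_per_million','hosp_patients_per_million']].dropna()

# Separate the nine features from the country names.
covid_data = covid3.drop(columns='location')
countries = covid3['location'].tolist()

# Summarise the features and list the countries that remain.
covid_data.info()
print(countries)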
<class 'pandas.core.frame.DataFrame'>
Int64Index: 17 entries, 4823 to 74527
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   total_cases_per_million    17 non-null     float64
 1   new_cases_per_million      17 non-null     float64
 2   total_deaths_per_million   17 non-null     float64
 3   new_deaths_per_million     17 non-null     float64
 4   reproduction_rate          17 non-null     float64
 5   positive_rate              17 non-null     float64
 6   total_tests_per_thousand   17 non-null     float64
 7   icu_patients_per_million   17 non-null     float64
 8   hosp_patients_per_million  17 non-null     float64
dtypes: float64(9)
memory usage: 1.3 KB
['Austria', 'Belgium', 'Bulgaria', 'Canada', 'Cyprus', 'Denmark', 'Estonia', 'Finland', 'Ireland', 'Israel', 'Italy', 'Luxembourg', 'Portugal', 'Slovenia', 'Spain', 'United Kingdom', 'United States']


Fit model#

This section performs K-means clustering.
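
K-means assigns points to clusters using Euclidean distance, so a feature measured in large units (such as total cases per million) would otherwise dominate a feature measured in small units (such as the reproduction rate). The following is a minimal sketch of what standardisation does, using a small made-up array rather than the COVID data. The array `X` and the names `X_scaled` and `X_manual` are illustrative assumptions only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
X = np.array([[1000.0, 0.1],
              [2000.0, 0.3],
              [3000.0, 0.2],
              [4000.0, 0.4]])

# Standardise each column with scikit-learn.
X_scaled = StandardScaler().fit_transform(X)

# The same calculation by hand: subtract the column mean and divide by
# the column standard deviation (population form, ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))            # column means after scaling
print(X_scaled.std(axis=0))             # column standard deviations after scaling
print(np.allclose(X_scaled, X_manual))  # do the two approaches agree?
```

After scaling, each feature contributes on a comparable scale to the distance calculations inside K-means.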

# Perform K-means clustering on the COVID data.

# Create a scaler so that the features in the dataset can be
# scaled to have a mean of 0 and a standard deviation of 1.
scaler = StandardScaler()

# Create a KMeans model with k clusters.
# You can experiment with different values of k here.
k = 3
kmeans = KMeans(n_clusters=k)

# Create a pipeline to link together the scaler and kmeans instance.
pipeline = make_pipeline(scaler,kmeans)

# Build a K-means clustering model by fitting the pipeline to the COVID dataset.
pipeline.fit(covid_data)

# Predict the cluster labels for the COVID dataset.
labels = pipeline.predict(covid_data)

# Create a DataFrame, df, aligning labels and countries.
df = pd.DataFrame({'labels': labels, 'countries': countries})

# Display df sorted by cluster label.
print(df.sort_values('labels'))
    labels       countries
11       0      Luxembourg
0        1         Austria
14       1           Spain
13       1        Slovenia
12       1        Portugal
10       1           Italy
15       1  United Kingdom
16       1   United States
2        1        Bulgaria
1        1         Belgium
6        2         Estonia
9        2          Israel
5        2         Denmark
4        2          Cyprus
3        2          Canada
7        2         Finland
8        2         Ireland
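
The exercise suggests trying different values of K. One way to compare them side by side is to loop over several values and record how many observations land in each cluster. The sketch below assumes synthetic data from `make_blobs` as a stand-in for `covid_data` (17 rows and 9 features, to match its shape); the names `X_demo`, `k_demo` and `sizes` are hypothetical. It also sets `random_state`, which is an addition to the notebook's code: K-means starts from randomly chosen centroids, so without a fixed seed the cluster numbering, and sometimes the memberships, can differ between runs.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic stand-in for covid_data (assumption: 17 rows, 9 features).
X_demo, _ = make_blobs(n_samples=17, n_features=9, centers=3, random_state=0)

sizes = {}
for k_demo in (2, 3, 4):
    # Scale, then cluster, with a fixed seed so the run is repeatable.
    model = make_pipeline(
        StandardScaler(),
        KMeans(n_clusters=k_demo, n_init=10, random_state=0),
    )
    demo_labels = model.fit_predict(X_demo)

    # Count the observations assigned to each cluster label.
    sizes[k_demo] = pd.Series(demo_labels).value_counts().sort_index().tolist()
    print(f"K={k_demo}: cluster sizes {sizes[k_demo]}")
```

To apply the same idea to the exercise, replace `X_demo` with `covid_data` and compare which countries move between clusters as K changes.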

Plot elbow curve#

One method of selecting an appropriate value for K is to plot a graph of the within-cluster sum of squares (WCSS), also called inertia, for different values of K.

Elbow curves are described in Module 6.
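
To make the WCSS concrete: it is the sum, over all observations, of the squared Euclidean distance between the observation and the centroid of the cluster it is assigned to. The sketch below checks this definition against the `inertia_` attribute that scikit-learn reports. It uses synthetic data from `make_blobs` rather than the COVID data, and the names `X_demo`, `km` and `wcss_manual` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic data for illustration only.
X_demo, _ = make_blobs(n_samples=60, n_features=2, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

# Look up the centroid assigned to each observation, then sum the
# squared distances between each observation and its centroid.
assigned_centres = km.cluster_centers_[km.labels_]
wcss_manual = ((X_demo - assigned_centres) ** 2).sum()

print(wcss_manual, km.inertia_)
```

When K-means sits at the end of a pipeline that includes a scaler, as in this notebook, `inertia_` is measured on the standardised features rather than on the original units.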

# Calculate the WCSS or inertia for different values of K.
WCSS = []
K = range(1,10)
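for k in K:
    kmeans2 = KMeans(n_clusters=k)
    pipeline2 = make_pipeline(scaler,kmeans2)
    # Fit the pipeline and record the inertia (WCSS) for this value of K.
    pipeline2.fit(covid_data)
    WCSS.append(kmeans2.inertia_)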

# Plot the elbow curve (an alternative style is the format string 'bx-').
plt.plot(K, WCSS, color='dodgerblue')
plt.title('Elbow curve for the COVID data')
plt.xlabel('K')
plt.ylabel('WCSS')
plt.show()

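The elbow curve suggests that K = 2 might be appropriate for this data, as there is a kink in the curve at this point. However, you could also argue that K = 4 is more appropriate, because it results in a lower within-cluster sum of squares (WCSS) and there is also a slight kink in the plot at this point. Keep in mind that WCSS generally falls as K increases, so a lower WCSS on its own is weak evidence for a larger K; the kink, where further increases in K start to give much smaller reductions, is what matters.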
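
Reading a kink off a plot is a judgement call. One rough aid, offered here as a heuristic sketch rather than a rule, is to tabulate how much the WCSS falls each time K increases by one: a kink shows up as a large fall followed by much smaller ones. The example uses synthetic data from `make_blobs`, not the COVID data, and the names `X_demo`, `ks`, `wcss_demo` and `drops` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Synthetic data for illustration only.
X_demo, _ = make_blobs(n_samples=60, n_features=2, centers=3, random_state=0)

ks = list(range(1, 8))
wcss_demo = []
for k_demo in ks:
    km = KMeans(n_clusters=k_demo, n_init=10, random_state=0)
    # The pipeline fits the km instance in place, so inertia_ is
    # available on km after fitting.
    make_pipeline(StandardScaler(), km).fit(X_demo)
    wcss_demo.append(km.inertia_)

# Reduction in WCSS when moving from K-1 to K clusters.
drops = -np.diff(wcss_demo)
for k_demo, drop in zip(ks[1:], drops):
    print(f"K={k_demo}: WCSS falls by {drop:.2f}")
```

For the exercise, the same table can be built from the `WCSS` list calculated above in place of `wcss_demo`.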