Py: Socio-Economic Index Construction#
This notebook was originally created by Andres Villegas Ramirez for the Data Analytics Applications subject, as Case study 1 - The Socio-Economic Indexes for Areas in the DAA M06 Unsupervised learning module.
Data Analytics Applications is a Fellowship Applications (Module 3) subject with the Actuaries Institute that aims to teach students how to apply a range of data analytics skills, such as neural networks, natural language processing, unsupervised learning and optimisation techniques, together with their professional judgement, to solve a variety of complex and challenging business problems. The business problems used as examples in this subject are drawn from a wide range of industries.
Find out more about the course here.
Purpose:#
This case study demonstrates the use of principal component analysis (PCA) to construct an index that ranks the socio-economic status of different geographical areas within a country (in this instance, Australia). Such an index can be used to identify areas that require funding or services, or as an input to research into the relationship between socio-economic conditions and other outcomes across areas.
In Australia, the Australian Bureau of Statistics (ABS) produces four such indexes, referred to as the Socio-Economic Indexes for Areas (SEIFA). SEIFA ranks areas in Australia based on their relative socio-economic advantage and disadvantage.
This notebook replicates the methodology adopted by the ABS in constructing one of these indexes, the Index of Economic Resources (IER), by Statistical Area 1 (SA1).
This Jupyter notebook is organised into the following sections:#
Section 1: Load required packages
Section 2: Define a function for plotting
Section 3: Import and transform data
Section 4: Apply Principal Component Analysis and calibrate SEIFA score
Section 5: Check against ABS publication
1. Load Packages#
# Import required Python packages
# Dataframe operations
import pandas as pd
# PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Plotting
import matplotlib.pyplot as plt
2. Define Functions#
# Define a function to obtain the explained variances from the PCA features.
## Good practice to define functions upfront.
def plot_explained_variances(pca):
    exp_var_pca = pca.explained_variance_ratio_
    plt.bar(range(1, len(exp_var_pca) + 1), exp_var_pca, alpha=0.5, align='center',
            label='Individual explained variance', color='dodgerblue')
    plt.ylabel('Explained variance ratio')
    plt.xlabel('Principal component')
    plt.legend(loc='best')
    plt.tight_layout()
    plt.show()
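For deciding how many components to retain, a cumulative version of this plot is often useful. Below is a minimal sketch of such a helper; the function name plot_cumulative_variances is our own addition, not part of the original notebook.

import numpy as np

# Optional helper: cumulative share of variance explained by the first k components
def plot_cumulative_variances(pca):
    cum_var = np.cumsum(pca.explained_variance_ratio_)
    plt.step(range(1, len(cum_var) + 1), cum_var, where='mid',
             label='Cumulative explained variance', color='darkorange')
    plt.ylabel('Cumulative explained variance ratio')
    plt.xlabel('Principal component')
    plt.legend(loc='best')
    plt.tight_layout()
    plt.show()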
3. Import and Transform Data#
# Import data
## File 1: Standardised socio-economic variables published by the ABS (i.e. input to the PCA model)
## File 2: IER scores by Statistical Area 1 published by the ABS
infolder = 'https://actuariesinstitute.github.io/cookbook/_static/daa_datasets/'
file1 = 'standardised_variables_seifa_2021_cut.csv'
file2 = 'IER_2021.csv'
data1 = pd.read_csv(infolder + file1)
data2 = pd.read_csv(infolder + file2)
# Inspect data
## These are numerical features (percentages corresponding to each variable; see the dictionary below) that have been standardised
data1.head()
| | SA1_2021 | INC_LOW | INC_HIGH | ATSCHOOL | ATUNI | CERTIFICATE | DEGREE | DIPLOMA | NOYR12ORHIGHER | NOEDU | ... | HIGHCAR | NOCAR | DISABILITYU70 | ENGLISHPOOR | ONEPARENT | SEP_DIVORCED | UNINCORP | CHILDJOBLESS | MORTGAGEOVER30 | RENTOVER30 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10102100701 | 0.031160 | -0.451226 | -0.307998 | -0.611843 | 2.110423 | -0.792820 | 1.399224 | 0.039552 | -0.494326 | ... | 1.124942 | -0.361674 | 1.160679 | -0.628432 | -0.087881 | -0.046096 | 0.533110 | -0.797863 | 0.427678 | -1.208146 |
| 1 | 10102100702 | -0.745478 | 0.235597 | 0.309750 | -0.718416 | 0.795239 | -0.138695 | 1.122862 | -0.328068 | -0.494326 | ... | 2.165534 | -0.737816 | -1.371078 | -0.628432 | -1.513208 | -0.216841 | 1.846095 | -0.797863 | 0.322564 | -1.208146 |
| 2 | 10102100703 | 0.207462 | -0.333816 | 0.648522 | -0.373289 | 0.143017 | 0.089753 | -0.024178 | -0.148292 | -0.494326 | ... | -0.474119 | -0.425240 | -1.064640 | -0.628432 | 0.078716 | 0.963607 | 0.486319 | -0.335393 | -0.486090 | -0.029516 |
| 3 | 10102100704 | 0.383411 | -0.417133 | -1.082034 | -0.887458 | -0.308372 | 0.417513 | -0.160872 | 0.083635 | 0.108973 | ... | -0.407742 | -0.528653 | 0.038035 | -0.628432 | 0.949299 | 1.482164 | 0.808398 | 0.554274 | -0.156349 | -0.423796 |
| 4 | 10102100705 | 0.334499 | 0.256282 | -1.018268 | -0.396676 | 0.235498 | -0.072269 | -0.227208 | 0.356362 | -0.494326 | ... | -0.135464 | -0.477578 | 0.056221 | -0.628432 | -1.048310 | 1.253986 | 0.795448 | -0.116016 | -0.297469 | -0.292223 |
5 rows × 45 columns
Variable dictionary#
| Variable name | Variable description |
|---|---|
| INC_LOW | Per cent of people living in households with stated annual household equivalised income between $1 and $25,999 (approx. 1st and 2nd deciles) |
| LOWRENT | Per cent of occupied private dwellings paying rent less than $250 per week (excluding $0 per week) |
| NOCAR | Per cent of occupied private dwellings with no cars |
| LONE | Per cent of occupied private dwellings that are lone-person occupied private dwellings |
| ONEPARENT | Per cent of one parent families with dependent offspring only |
| OVERCROWD | Per cent of occupied private dwellings requiring one or more extra bedrooms (based on the Canadian National Occupancy Standard) |
| UNEMPLOYED_IER | Per cent of people aged 15 years and over who are unemployed |
| GROUP | Per cent of occupied private dwellings that are group occupied private dwellings |
| OWNING | Per cent of occupied private dwellings owning the dwelling without a mortgage |
| UNINCORP | Per cent of dwellings with at least one person who is an owner of an unincorporated enterprise |
| INC_HIGH | Per cent of people with stated annual household equivalised income greater than $91,000 |
| HIGHMORTGAGE | Per cent of occupied private dwellings paying mortgage greater than $2,800 per month |
| MORTGAGE | Per cent of occupied private dwellings owning the dwelling (with a mortgage) |
| HIGHBED | Per cent of occupied private dwellings with four or more bedrooms |
# Inspect data
## IER scores published by the ABS, based on the 2021 Census
data2.head()
| | SA1_2021 | IER_2021 |
|---|---|---|
| 0 | 10102100701 | 1023.037282 |
| 1 | 10102100702 | 1088.576036 |
| 2 | 10102100703 | 986.102203 |
| 3 | 10102100704 | 965.496470 |
| 4 | 10102100705 | 1013.432808 |
# Select variables ABS uses to calibrate the IER (per their methodology)
variable_IER = [
'SA1_2021',
'INC_LOW',
'LOWRENT',
'NOCAR',
'LONE',
'ONEPARENT',
'OVERCROWD',
'UNEMPLOYED_IER',
'GROUP',
'OWNING',
'UNINCORP',
'INC_HIGH',
'HIGHMORTGAGE',
'MORTGAGE',
'HIGHBED'
]
data1_IER = data1[variable_IER]
# Inspect data type for these variables
data1_IER.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59430 entries, 0 to 59429
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SA1_2021 59430 non-null int64
1 INC_LOW 59427 non-null float64
2 LOWRENT 59400 non-null float64
3 NOCAR 59422 non-null float64
4 LONE 59430 non-null float64
5 ONEPARENT 59319 non-null float64
6 OVERCROWD 59417 non-null float64
7 UNEMPLOYED_IER 59429 non-null float64
8 GROUP 59430 non-null float64
9 OWNING 59423 non-null float64
10 UNINCORP 59430 non-null float64
11 INC_HIGH 59427 non-null float64
12 HIGHMORTGAGE 59400 non-null float64
13 MORTGAGE 59423 non-null float64
14 HIGHBED 59422 non-null float64
dtypes: float64(14), int64(1)
memory usage: 6.8 MB
# Remove rows with missing values
print(data1_IER.isna().sum())
data1_IER_dropna = data1_IER.dropna()
SA1_2021 0
INC_LOW 3
LOWRENT 30
NOCAR 8
LONE 0
ONEPARENT 111
OVERCROWD 13
UNEMPLOYED_IER 1
GROUP 0
OWNING 7
UNINCORP 0
INC_HIGH 3
HIGHMORTGAGE 30
MORTGAGE 7
HIGHBED 8
dtype: int64
# Standardise data for PCA
## Select data from column 2 onwards (inclusive). First column is SA1 ID.
pca_data = data1_IER_dropna.iloc[:, 1:]
## Standardise these columns to mean 0 and variance 1
sc = StandardScaler()
sc.fit(pca_data)
pca_data = sc.transform(pca_data)
## Inspect outputs, each row represents a SA1
print(data1_IER_dropna[50:55])
print(pca_data[50:55])
SA1_2021 INC_LOW LOWRENT NOCAR LONE ONEPARENT OVERCROWD \
50 10102100921 -0.694978 -0.149470 0.026395 2.161618 0.016894 0.614584
51 10102100923 -0.399570 -0.230663 -0.392438 0.346039 0.479932 -0.327006
53 10102100925 -1.184584 -0.369694 -0.584525 1.995606 1.010923 0.163947
54 10102100926 -0.755434 0.636737 0.194704 1.770099 -0.418760 0.828480
55 10102100927 -0.534872 -0.113768 -0.443336 -0.248692 -0.929502 0.042116
UNEMPLOYED_IER GROUP OWNING UNINCORP INC_HIGH HIGHMORTGAGE \
50 -0.768433 -0.298797 -1.379936 -1.409706 0.565708 -0.711816
51 -0.502225 -0.435782 -0.068099 -1.078581 0.417556 -0.309550
53 0.393123 0.492720 -1.552475 -0.869567 0.401089 -0.914661
54 0.562322 1.018832 -1.283807 -0.853606 -0.501616 -0.914661
55 -0.591322 0.001803 -0.652794 -0.622710 0.248079 0.145029
MORTGAGE HIGHBED
50 -0.075949 -1.407022
51 0.780225 0.198026
53 -0.382142 -1.473097
54 -1.468772 -1.443632
55 0.547895 0.465557
[[-6.97384349e-01 -1.49102654e-01 2.73842498e-02 2.18456511e+00
1.68271418e-02 6.21443888e-01 -7.82538918e-01 -3.01193486e-01
-1.38687638e+00 -1.42934823e+00 5.65712045e-01 -7.14208804e-01
-8.01255787e-02 -1.41014150e+00]
[-4.00578528e-01 -2.30600290e-01 -3.93807726e-01 3.52373086e-01
4.80225191e-01 -3.29406608e-01 -5.11392783e-01 -4.39108873e-01
-7.01834243e-02 -1.09373813e+00 4.17385805e-01 -3.11130864e-01
7.79986710e-01 1.96337876e-01]
[-1.18930828e+00 -3.70152778e-01 -5.86976088e-01 2.01703428e+00
1.01162817e+00 1.66375413e-01 4.00564091e-01 4.95697733e-01
-1.56005434e+00 -8.81893199e-01 4.00899556e-01 -9.17462999e-01
-3.87727179e-01 -1.47627516e+00]
[-7.58127412e-01 6.40055426e-01 1.96640397e-01 1.78946385e+00
-4.19164718e-01 8.37443911e-01 5.72902051e-01 1.02538267e+00
-1.29039226e+00 -8.65715951e-01 -5.02866086e-01 -9.17462999e-01
-1.47935650e+00 -1.44678433e+00]
[-5.36520854e-01 -1.13266264e-01 -4.44992083e-01 -2.47800227e-01
-9.30303140e-01 4.33461879e-02 -6.02142493e-01 1.44766450e-03
-6.57042865e-01 -6.31693191e-01 2.47710087e-01 1.44364950e-01
5.46588117e-01 4.64107182e-01]]
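Before modelling, it is worth verifying the standardisation. The quick check below is an addition for illustration, not part of the ABS methodology; it confirms that each column of pca_data has mean approximately 0 and standard deviation approximately 1.

import numpy as np

# Sanity check: standardised columns should have mean ~0 and standard deviation ~1
print(np.allclose(pca_data.mean(axis=0), 0))  # expect True
print(np.allclose(pca_data.std(axis=0), 1))   # expect True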
4. Modelling#
This section performs the PCA to recreate the ABS’s IER Index.
# Initialise the PCA.
pca = PCA()
# Fit the PCA model and compute the transformed features.
## Note: the scores are recomputed below with their sign reversed, per ABS methodology.
pca.fit_transform(pca_data)
# Plot variance explained by principal components
plot_explained_variances(pca)
Principal component 1 (PC1) explains over 30% of the variance in the data. PC1 is used as the basis for calculating the IER.
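The exact figure can be confirmed directly from the fitted model (a quick check added for illustration):

# Proportion of variance explained by the first principal component
print(f'PC1 explains {pca.explained_variance_ratio_[0]:.1%} of the variance')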
# Inspect the standardised variables for the first SA1
pca_data[0]
array([ 3.21902216e-02, -1.95351071e-01, -3.62869762e-01, 3.59364682e-01,
-8.80285299e-02, 1.99924572e-01, -6.52078395e-01, -9.12914962e-01,
1.26794033e+00, 5.39782081e-01, -4.52416487e-01, -3.17209382e-01,
2.45495526e-01, 4.39423025e-04])
# Reverse sign (per ABS methodology)
pca_data_transformed = -1.0*pca.fit_transform(pca_data)
# Extract first principal component in a dataframe
pca1 = pd.DataFrame(pca_data_transformed[:, 0], columns = ['IER_2021'])
# Attach SA1 IDs to the principal component scores
IER_S1 = pd.concat([data1_IER_dropna['SA1_2021'].reset_index(drop = True), pca1]
, axis=1)
print(IER_S1.head(1))
SA1_2021 IER_2021
0 10102100701 0.517295
# Calculate the raw score from the first principal component
## The sign of each loading shows how the corresponding variable contributes to the score
## Signs are reversed (per ABS methodology) to make the score easier to interpret
component_df = pd.DataFrame({
'Variable': data1_IER.columns[1:],
'PC1 Loading': -1.0 * pca.components_[0],
'First SA1': pca_data[0]
})
print(component_df)
raw_score_first_SA1 = (component_df['PC1 Loading'] * component_df['First SA1']).sum()
print('\n')
print('\n')
print(raw_score_first_SA1)
Variable PC1 Loading First SA1
0 INC_LOW -0.327920 0.032190
1 LOWRENT -0.320014 -0.195351
2 NOCAR -0.316516 -0.362870
3 LONE -0.307799 0.359365
4 ONEPARENT -0.243785 -0.088029
5 OVERCROWD -0.231555 0.199925
6 UNEMPLOYED_IER -0.217694 -0.652078
7 GROUP -0.175738 -0.912915
8 OWNING 0.152087 1.267940
9 UNINCORP 0.211586 0.539782
10 INC_HIGH 0.233597 -0.452416
11 HIGHMORTGAGE 0.285636 -0.317209
12 MORTGAGE 0.295868 0.245496
13 HIGHBED 0.334995 0.000439
0.5172947202278048
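This manually computed raw score is simply the dot product of the sign-reversed PC1 loadings with the first SA1's standardised variables, so it should match the first entry of pca_data_transformed up to floating-point rounding (a verification added for illustration):

# The manually computed score should equal the first PC1 score in pca_data_transformed
print(pca_data_transformed[0, 0])
print(abs(raw_score_first_SA1 - pca_data_transformed[0, 0]) < 1e-10)  # expect True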
A recap of the dataframe operations for transforming raw data to principal components#
| Dataframe name | Operations |
|---|---|
| data1 | Raw data used to replicate the SEIFA IER score; numerical features (%) |
| data1_IER | Select the features used to determine the IER from data1 |
| data1_IER_dropna | Remove rows with missing values from data1_IER |
| pca_data | Standardise the features in data1_IER_dropna to mean 0 and s.d. 1; used to perform the PCA |
| pca | Fitted PCA model containing the principal components |
| pca_data_transformed | Sign-reversed principal component scores |
| pca1 | First principal component extracted from pca_data_transformed |
| IER_S1 | pca1 with the SA1 identifier attached; compared against dataframe ABS_IER_S1, which holds the published scores |
# Standardise calibrated IER scores to a mean of 1,000 and standard deviation of 100 (per ABS methodology)
## PC1 scores have mean 0 by construction, so only rescaling by the standard deviation is needed
IER_S1['IER_recreated'] = (IER_S1['IER_2021']/IER_S1['IER_2021'].std())*100+1000
print(IER_S1.tail())
SA1_2021 IER_2021 IER_recreated
59298 90104100401 0.022187 1000.995399
59299 90104100402 -0.055935 997.490510
59300 90104100403 -0.785740 964.748521
59301 90104100404 -1.702813 923.604956
59302 90104100407 -0.116936 994.753789
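The rescaling can be sanity-checked by confirming the recreated scores now have a mean of approximately 1,000 and a standard deviation of approximately 100 (a verification added for illustration):

# Confirm the recreated index has mean ~1,000 and standard deviation ~100
print(round(IER_S1['IER_recreated'].mean(), 2))  # expect ~1000.0
print(round(IER_S1['IER_recreated'].std(), 2))   # expect ~100.0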
5. Reconcile with ABS publication#
# Extract the first two columns with the desired data types
## Use .copy() to avoid pandas' SettingWithCopyWarning when modifying the slice
ABS_IER_S1 = data2.iloc[:, [0, 1]].copy()
ABS_IER_S1['IER_2021'] = pd.to_numeric(ABS_IER_S1['IER_2021'], errors='coerce')
ABS_IER_S1['SA1_2021'] = pd.to_numeric(ABS_IER_S1['SA1_2021'], errors='coerce', downcast='integer')
print(ABS_IER_S1.info())
print('\n')
print(ABS_IER_S1.head())
print('\n')
print(ABS_IER_S1.isna().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 59421 entries, 0 to 59420
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 SA1_2021 59421 non-null int64
1 IER_2021 59303 non-null float64
dtypes: float64(1), int64(1)
memory usage: 928.6 KB
None
SA1_2021 IER_2021
0 10102100701 1023.037282
1 10102100702 1088.576036
2 10102100703 986.102203
3 10102100704 965.496470
4 10102100705 1013.432808
SA1_2021 0
IER_2021 118
dtype: int64
## Remove rows with missing values
print(ABS_IER_S1.isna().sum())
ABS_IER_S1_dropna = ABS_IER_S1.dropna()
print(len(data1_IER_dropna))
print(len(ABS_IER_S1_dropna))
SA1_2021 0
IER_2021 118
dtype: int64
59303
59303
# Reconciliation 1 - histograms: merge the published scores with the PCA input data by SA1
join = pd.merge(ABS_IER_S1_dropna, data1_IER_dropna, how='left', on='SA1_2021')
# Plot histogram of calibrated IER scores
IER_S1.hist(column='IER_recreated', bins=100, color='dodgerblue')
plt.title('Distribution of recreated IER scores')
plt.xlabel('IER score')
plt.ylabel('Frequency')
plt.show()
# Plot histogram of published IER scores
ABS_IER_S1_dropna.hist(column='IER_2021', bins=100, color='dodgerblue')
plt.title('Distribution of ABS IER scores')
plt.xlabel('IER score')
plt.ylabel('Frequency')
plt.show()
# Export the published scores to CSV for manual checking
ABS_IER_S1_dropna.to_csv('check_ABS_IER_S1_dropna.csv')
The two histograms are very similar in shape!
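Beyond the visual comparison, the summary statistics of the two distributions can be placed side by side (a supplementary check using the dataframes defined above):

# Compare summary statistics of the recreated and published scores
print(IER_S1['IER_recreated'].describe())
print(ABS_IER_S1_dropna['IER_2021'].describe())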
# Extract columns we need
IER_S1_reduce = IER_S1[['SA1_2021', 'IER_recreated']]
# Merge the recreated with the published data for reconciliation 2 - scatter plot
IER_join = pd.merge(ABS_IER_S1_dropna, IER_S1_reduce, how = 'left', on = 'SA1_2021')
print(IER_join.tail())
SA1_2021 IER_2021 IER_recreated
59298 90104100401 1000.369129 1000.995399
59299 90104100402 996.979765 997.490510
59300 90104100403 964.139169 964.748521
59301 90104100404 923.050144 923.604956
59302 90104100407 994.377312 994.753789
# Plot calibrated score vs. published score
plt.scatter('IER_recreated', 'IER_2021', data = IER_join, color='dodgerblue')
plt.title('Comparison of recreated and ABS IER scores')
plt.xlabel('Recreated IER score')
plt.ylabel('ABS IER score')
plt.show()
The above scatter plot shows that the two sets of indexes are very closely aligned.
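The alignment can also be quantified; for example, the correlation between the two series in IER_join should be very close to 1 (a supplementary check added for illustration):

# Correlation between recreated and published IER scores (expect close to 1)
print(IER_join['IER_recreated'].corr(IER_join['IER_2021']))
# Largest absolute difference between the two series
print((IER_join['IER_recreated'] - IER_join['IER_2021']).abs().max())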