Py: Clustering Credit Card Fraud#

This notebook was originally created by Amanda Aitken for the Data Analytics Applications subject, as Case study 2 - Credit card fraud detection in the DAA M06 Unsupervised learning module.

Data Analytics Applications is a Fellowship Applications (Module 3) subject with the Actuaries Institute that aims to teach students how to apply a range of data analytics skills, such as neural networks, natural language processing, unsupervised learning and optimisation techniques, together with their professional judgement, to solve a variety of complex and challenging business problems. The business problems used as examples in this subject are drawn from a wide range of industries.

Find out more about the course here.

Define the problem#

Fraud involves deceptive actions that are taken with the intention of illegally or unethically gaining something at the expense of others. Some examples of fraud include:

  • stealing another person’s credit card details and using these to purchase goods;

  • providing false information to an insurance company to obtain an insurance benefit that you are not entitled to; and

  • using another person’s personal information, without their permission, to commit a crime or to deceive or defraud that other person.

In July 2019, the Centre for Counter Fraud Studies at the University of Portsmouth estimated that fraud costs the global economy £3.89 trillion, with losses from fraud rising by 56% in the past decade. Fraudulent behaviour can occur in any industry, including in many of the industries that actuaries typically work in, such as banking and finance.

Fraud detection is a set of activities that are undertaken to identify fraud. Fraud detection can help to rectify the outcomes of any fraudulent behaviour and/or to prevent fraud from occurring in the future.

Identifying Fraud:#

Some methods that can be used to detect fraud are:

  • reputation list;

  • rules engine;

  • supervised machine learning; and

  • unsupervised machine learning.

A reputation list, or ‘blacklist’, is a relatively static list of known fraudulent identities. For example, a bank’s ‘blacklist’ might include a list of individuals who have previously been convicted of credit card fraud. A drawback of this method is that it is difficult to keep the reputation list up-to-date and it often requires identities to have been found committing fraudulent behaviour in the past before they can be added to the list.

A rules engine approach to fraud detection involves running a set of rules over an activity. If the activity meets one of the rules, it might be further investigated manually. For example, a retailer might use a rules engine to flag any potentially fraudulent online purchases. They might, for example, flag for further investigation any transactions that are over a certain volume or that are being requested for delivery to a foreign country. While rules engines have the benefit of being easy to understand, they can also be hard to keep up-to-date with recent fraudulent activities.

A supervised learning approach to fraud detection involves training a computer to recognise future fraud by providing it with examples of past fraud. For example, a social media site might have a historical dataset that contains a list of new accounts and their attributes (‘features’) as well as a label to indicate whether each new account was opened by a legitimate customer (‘not fraud’) or by someone pretending to be a legitimate customer (‘fraud’). The classification techniques discussed in Module 5 could be used for this supervised learning approach to fraud detection.

Supervised learning, therefore, approximates a rules engine and might be used to create a rules engine. A drawback of this method is that it requires a large dataset of past examples of fraud which can be very ‘expensive’ to obtain, as the exercise of validating historical instances of fraud can be very time-consuming. In addition, past examples of fraud may not be a good indicator of the types of fraud to which an organisation could be subject in the future. In fact, there should be some anticipation that fraudsters will change their strategies over time when they see that their existing methods of fraud are thwarted or investigated.

Unsupervised learning is described in Video 6.6 as being the cutting-edge solution to fraud detection. As discussed in this module, one branch of unsupervised learning, clustering, involves finding subsets of a population that are like each other and different from other subsets. Clustering can be used in fraud detection in one of two ways:

  • by identifying outliers that are dissimilar to other observations and do not align closely with any of the clusters found in the dataset—these outliers are potential cases of fraud; or, conversely,

  • by identifying a cluster of observations, when all other observations appear to be more random and not tightly bunched together—this method is described in Video 6.6.

This notebook explores fraud detection using clustering techniques.

References:#

The datasets that are used in this case study were sourced from the following Kaggle competition: https://www.kaggle.com/c/ieee-fraud-detection.

The aim of the competition was to improve the efficacy of fraudulent transaction identification models.

The data originates from real-world e-commerce transactions and contains two different types of datasets. Note that the meaning of some of the features has been masked by the data provider, so it is not clear what all the features mean.

The first dataset contains the following features for each credit card transaction:

  • TransactionDT: a measurement of the time since a given reference time (this is referenced on the Kaggle site as being ‘timedelta’ rather than an actual timestamp);

  • TransactionAMT: the payment amount of the transaction in USD;

  • ProductCD: the product code for each transaction;

  • card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.;

  • addr1: customer’s billing region;

  • addr2: customer’s billing country;

  • dist1 - dist 2: distances between, for example,billing address, mailing address, zip code, IP address, phone area, etc.;

  • P_ and R_ emaildomain: purchaser and recipient email domain;

  • C1 - C14: counting, such as how many addresses are found to be associated with the payment card, etc. (the actual meaning of these fields was masked in the Kaggle dataset);

  • D1 - D15: timedeltas, such as days between previous transaction, etc;

  • M1 - M9: matches, such as whether the name on the card matches the name of the customer etc;

  • V1 - V339: these fields are described on the Kaggle competition site as ‘Vesta engineered rich features, including ranking, counting, and other entity relations’. For example, one of these features might represent the number of times the payment card associated with a IP and email or address appeared in the transaction list in a 24 hour time range.

The training version of the transactions dataset also contains a ‘TransactionID’ identifier for each transaction and a label ‘isFraud’ to indicate whether the transaction was fraudulent.

The second dataset contains ‘identity’ features for each transaction. These features are described on Kaggle as being ‘network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions’. These features were ‘collected by Vesta’s fraud protection system and digital security partners’. The actual meaning of each of these fields is masked. These features have the following names:

  • id_01 - id_38;

  • DeviceType; and

  • DeviceInfo.

Both the transactions and identity datasets also contain a ‘TransactionID’ column that allows the two datasets to be joined together.

The above information about the features contained in these datasets was sourced from the following Kaggle webpage: https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203.

Packages#

This section installs packages that will be required for this exercise/case study.

# Packages for data management and mathematical operations.
import pandas as pd
import numpy as np

# Package for plotting.
import matplotlib.pyplot as plt

# Scikit-learn libraries to perform clustering and evaluation.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, f1_score

# Package used in the confusion matrix function.
import itertools

Functions#

This section defines functions that will be used for this exercise/case study.

# Define a function that helps to format headings in the cell outputs.
def heading(string):

  '''
  This function is used to print a bold heading followed by a new line.
  '''
  print('\033[1m'+string+'\033[0m\n')
  # The '\033[1m' at the start of the string makes the font bold
  # and the '\033[0m' at the end of the string makes the font go back to normal.
  # The '\n' at the very end of the string prints a new line to give some space
  # before the next item is printed to the output box.
# Define a function to plot confusion matrices in an easy to visualise format.
def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    '''
    This function prints and plots a nicely formatted confusion matrix.
    '''
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment='center',
                 color='white' if cm[i, j] > thresh else 'black')

    plt.tight_layout()
    plt.ylabel('True response')
    plt.xlabel('Predicted response')

Data#

This section:

  • imports the data that will be used in the modelling;

  • explores the data; and

  • prepares the data for modelling.

Import data#

The code below uploads the following two csv files needed for this case study:

  • DAA_M06_CS2_data1_transaction.csv (the transaction data); and

  • DAA_M06_CS2_data2_identity.csv (the identity data).

Note that the transaction dataset is very large (667MB) hence the file is zipped - but pandas is able to read zipped csv files directly.

# Specify the folder that your datasets are saved in.
# or here, read it from the URL 
infolder = 'https://actuariesinstitute.github.io/cookbook/_static/daa_datasets/'
# Import the transaction data.
# Specify the filename.
file1 = 'DAA_M06_CS2_data1_transaction.csv.zip'

# Read in the data.
transaction = pd.read_csv (infolder+file1)
# Import the identity data.

# Specify the filename.
file2 = 'DAA_M06_CS2_data2_identity.csv.zip'

# Read in the data.
identity = pd.read_csv (infolder+file2)

Explore data (EDA)#

# Force all columns in the datasets to be shown.
pd.options.display.max_columns = 999

# Print the first 5 rows of each dataset.
heading('Transaction dataset')
print(transaction.head())
print('\n')

heading('Identity dataset')
print(identity.head())
Transaction dataset

   TransactionID  isFraud  TransactionDT  TransactionAmt ProductCD  card1  \
0        2987000        0          86400            68.5         W  13926   
1        2987001        0          86401            29.0         W   2755   
2        2987002        0          86469            59.0         W   4663   
3        2987003        0          86499            50.0         W  18132   
4        2987004        0          86506            50.0         H   4497   

   card2  card3       card4  card5   card6  addr1  addr2  dist1  dist2  \
0    NaN  150.0    discover  142.0  credit  315.0   87.0   19.0    NaN   
1  404.0  150.0  mastercard  102.0  credit  325.0   87.0    NaN    NaN   
2  490.0  150.0        visa  166.0   debit  330.0   87.0  287.0    NaN   
3  567.0  150.0  mastercard  117.0   debit  476.0   87.0    NaN    NaN   
4  514.0  150.0  mastercard  102.0  credit  420.0   87.0    NaN    NaN   

  P_emaildomain R_emaildomain   C1   C2   C3   C4   C5   C6   C7   C8   C9  \
0           NaN           NaN  1.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0   
1     gmail.com           NaN  1.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  0.0   
2   outlook.com           NaN  1.0  1.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0   
3     yahoo.com           NaN  2.0  5.0  0.0  0.0  0.0  4.0  0.0  0.0  1.0   
4     gmail.com           NaN  1.0  1.0  0.0  0.0  0.0  1.0  0.0  1.0  0.0   

   C10  C11  C12   C13  C14     D1     D2    D3    D4   D5  D6  D7  D8  D9  \
0  0.0  2.0  0.0   1.0  1.0   14.0    NaN  13.0   NaN  NaN NaN NaN NaN NaN   
1  0.0  1.0  0.0   1.0  1.0    0.0    NaN   NaN   0.0  NaN NaN NaN NaN NaN   
2  0.0  1.0  0.0   1.0  1.0    0.0    NaN   NaN   0.0  NaN NaN NaN NaN NaN   
3  0.0  1.0  0.0  25.0  1.0  112.0  112.0   0.0  94.0  0.0 NaN NaN NaN NaN   
4  1.0  1.0  0.0   1.0  1.0    0.0    NaN   NaN   NaN  NaN NaN NaN NaN NaN   

    D10    D11  D12  D13  D14    D15   M1   M2   M3   M4   M5   M6   M7   M8  \
0  13.0   13.0  NaN  NaN  NaN    0.0    T    T    T   M2    F    T  NaN  NaN   
1   0.0    NaN  NaN  NaN  NaN    0.0  NaN  NaN  NaN   M0    T    T  NaN  NaN   
2   0.0  315.0  NaN  NaN  NaN  315.0    T    T    T   M0    F    F    F    F   
3  84.0    NaN  NaN  NaN  NaN  111.0  NaN  NaN  NaN   M0    T    F  NaN  NaN   
4   NaN    NaN  NaN  NaN  NaN    NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   

    M9   V1   V2   V3   V4   V5   V6   V7   V8   V9  V10  V11  V12  V13  V14  \
0  NaN  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.0  0.0  1.0  1.0  1.0   
1  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  0.0  0.0  1.0   
2    F  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.0  0.0  1.0  1.0  1.0   
3  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0   
4  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   

   V15  V16  V17  V18  V19  V20  V21  V22  V23  V24  V25  V26  V27  V28  V29  \
0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0   
1  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0   
2  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0   
3  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0   
4  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   

   V30  V31  V32  V33  V34  V35  V36  V37  V38  V39  V40  V41  V42  V43  V44  \
0  0.0  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   
1  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0   
2  0.0  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0   
3  0.0  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0  1.0   
4  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   

   V45  V46  V47  V48  V49  V50  V51  V52  V53  V54  V55  V56  V57  V58  V59  \
0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0  0.0  0.0  0.0   
1  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0   
2  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0   
3  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0  0.0  0.0  0.0   
4  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   

   V60  V61  V62  V63  V64  V65  V66  V67  V68  V69  V70  V71  V72  V73  V74  \
0  0.0  1.0  1.0  0.0  0.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
1  0.0  1.0  1.0  0.0  0.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
2  0.0  1.0  1.0  0.0  0.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
3  0.0  1.0  1.0  0.0  0.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0   
4  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   

   V75  V76  V77  V78  V79  V80  V81  V82  V83  V84  V85  V86  V87  V88  V89  \
0  1.0  1.0  1.0  1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  1.0  1.0  0.0   
1  0.0  0.0  1.0  1.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  1.0  1.0  1.0  0.0   
2  1.0  1.0  1.0  1.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  1.0  1.0  1.0  0.0   
3  1.0  1.0  1.0  1.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  1.0  1.0  1.0  0.0   
4  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN   

   V90  V91  V92  V93  V94  V95   V96   V97  V98   V99  V100  V101  V102  \
0  0.0  0.0  0.0  0.0  0.0  0.0   1.0   0.0  0.0   0.0   0.0   0.0   1.0   
1  0.0  0.0  0.0  0.0  0.0  0.0   0.0   0.0  0.0   0.0   0.0   0.0   0.0   
2  0.0  0.0  0.0  0.0  0.0  0.0   0.0   0.0  0.0   0.0   0.0   0.0   0.0   
3  0.0  0.0  0.0  0.0  0.0  1.0  48.0  28.0  0.0  10.0   4.0   1.0  38.0   
4  NaN  NaN  NaN  NaN  NaN  0.0   0.0   0.0  0.0   0.0   0.0   0.0   0.0   

   V103  V104  V105  V106  V107  V108  V109  V110  V111  V112  V113  V114  \
0   0.0   0.0   0.0   0.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   
1   0.0   0.0   0.0   0.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   
2   0.0   0.0   0.0   0.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   
3  24.0   0.0   0.0   0.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   
4   0.0   0.0   0.0   0.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   

   V115  V116  V117  V118  V119  V120  V121  V122  V123  V124  V125  V126  \
0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   0.0   
1   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   0.0   
2   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   0.0   
3   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0  50.0   
4   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   0.0   

     V127   V128  V129   V130   V131  V132    V133   V134  V135  V136  V137  \
0   117.0    0.0   0.0    0.0    0.0   0.0   117.0    0.0   0.0   0.0   0.0   
1     0.0    0.0   0.0    0.0    0.0   0.0     0.0    0.0   0.0   0.0   0.0   
2     0.0    0.0   0.0    0.0    0.0   0.0     0.0    0.0   0.0   0.0   0.0   
3  1758.0  925.0   0.0  354.0  135.0  50.0  1404.0  790.0   0.0   0.0   0.0   
4     0.0    0.0   0.0    0.0    0.0   0.0     0.0    0.0   0.0   0.0   0.0   

   V138  V139  V140  V141  V142  V143  V144   V145  V146  V147  V148  V149  \
0   NaN   NaN   NaN   NaN   NaN   NaN   NaN    NaN   NaN   NaN   NaN   NaN   
1   NaN   NaN   NaN   NaN   NaN   NaN   NaN    NaN   NaN   NaN   NaN   NaN   
2   NaN   NaN   NaN   NaN   NaN   NaN   NaN    NaN   NaN   NaN   NaN   NaN   
3   NaN   NaN   NaN   NaN   NaN   NaN   NaN    NaN   NaN   NaN   NaN   NaN   
4   0.0   0.0   0.0   0.0   0.0   6.0  18.0  140.0   0.0   0.0   0.0   0.0   

     V150  V151  V152  V153  V154  V155  V156  V157  V158          V159  \
0     NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN           NaN   
1     NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN           NaN   
2     NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN           NaN   
3     NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN           NaN   
4  1803.0  49.0  64.0   0.0   0.0   0.0   0.0   0.0   0.0  15557.990234   

            V160  V161  V162  V163   V164    V165    V166  V167  V168  V169  \
0            NaN   NaN   NaN   NaN    NaN     NaN     NaN   NaN   NaN   NaN   
1            NaN   NaN   NaN   NaN    NaN     NaN     NaN   NaN   NaN   NaN   
2            NaN   NaN   NaN   NaN    NaN     NaN     NaN   NaN   NaN   NaN   
3            NaN   NaN   NaN   NaN    NaN     NaN     NaN   NaN   NaN   NaN   
4  169690.796875   0.0   0.0   0.0  515.0  5155.0  2840.0   0.0   0.0   0.0   

   V170  V171  V172  V173  V174  V175  V176  V177  V178  V179  V180  V181  \
0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4   1.0   1.0   0.0   0.0   0.0   0.0   1.0   0.0   0.0   0.0   0.0   0.0   

   V182  V183  V184  V185  V186  V187  V188  V189  V190  V191  V192  V193  \
0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4   0.0   0.0   0.0   0.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   

   V194  V195  V196  V197  V198  V199  V200  V201  V202  V203  V204  V205  \
0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   0.0   0.0   0.0   0.0   

   V206  V207  V208  V209  V210  V211  V212  V213  V214  V215  V216  V217  \
0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   

   V218  V219  V220  V221  V222  V223  V224  V225  V226  V227  V228  V229  \
0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4   0.0   0.0   0.0   1.0   1.0   0.0   0.0   0.0   0.0   0.0   1.0   1.0   

   V230  V231  V232  V233  V234  V235  V236  V237  V238  V239  V240  V241  \
0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4   1.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   1.0   1.0   

   V242  V243  V244  V245  V246  V247  V248  V249  V250  V251  V252  V253  \
0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   

   V254  V255  V256  V257  V258  V259  V260  V261  V262  V263  V264  V265  \
0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   1.0   0.0   0.0   0.0   

   V266  V267  V268  V269  V270  V271  V272  V273  V274  V275  V276  V277  \
0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   

   V278  V279  V280  V281  V282  V283  V284  V285  V286  V287  V288  V289  \
0   NaN   0.0   0.0   0.0   1.0   1.0   0.0   0.0   0.0   0.0   0.0   0.0   
1   NaN   0.0   0.0   0.0   1.0   1.0   0.0   0.0   0.0   0.0   0.0   0.0   
2   NaN   0.0   0.0   0.0   1.0   1.0   0.0   0.0   0.0   0.0   0.0   0.0   
3   NaN   1.0  28.0   0.0   0.0   0.0   0.0  10.0   0.0   4.0   0.0   0.0   
4   0.0   0.0   0.0   0.0   1.0   1.0   0.0   0.0   0.0   0.0   0.0   0.0   

   V290  V291  V292  V293  V294  V295  V296  V297  V298  V299  V300  V301  \
0   1.0   1.0   1.0   0.0   1.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
1   1.0   1.0   1.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
2   1.0   1.0   1.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   
3   1.0   1.0   1.0   1.0  38.0  24.0   0.0   0.0   0.0   0.0   0.0   0.0   
4   1.0   1.0   1.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   

   V302  V303  V304  V305  V306    V307   V308  V309   V310  V311   V312  \
0   0.0   0.0   0.0   1.0   0.0   117.0    0.0   0.0    0.0   0.0    0.0   
1   0.0   0.0   0.0   1.0   0.0     0.0    0.0   0.0    0.0   0.0    0.0   
2   0.0   0.0   0.0   1.0   0.0     0.0    0.0   0.0    0.0   0.0    0.0   
3   0.0   0.0   0.0   1.0  50.0  1758.0  925.0   0.0  354.0   0.0  135.0   
4   1.0   1.0   1.0   1.0   0.0     0.0    0.0   0.0    0.0   0.0    0.0   

   V313  V314  V315  V316    V317   V318  V319  V320  V321  V322  V323  V324  \
0   0.0   0.0   0.0   0.0   117.0    0.0   0.0   0.0   0.0   NaN   NaN   NaN   
1   0.0   0.0   0.0   0.0     0.0    0.0   0.0   0.0   0.0   NaN   NaN   NaN   
2   0.0   0.0   0.0   0.0     0.0    0.0   0.0   0.0   0.0   NaN   NaN   NaN   
3   0.0   0.0   0.0  50.0  1404.0  790.0   0.0   0.0   0.0   NaN   NaN   NaN   
4   0.0   0.0   0.0   0.0     0.0    0.0   0.0   0.0   0.0   0.0   0.0   0.0   

   V325  V326  V327  V328  V329  V330  V331  V332  V333  V334  V335  V336  \
0   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
1   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
2   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
3   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   
4   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   

   V337  V338  V339  
0   NaN   NaN   NaN  
1   NaN   NaN   NaN  
2   NaN   NaN   NaN  
3   NaN   NaN   NaN  
4   0.0   0.0   0.0  


Identity dataset

   TransactionID  id_01     id_02  id_03  id_04  id_05  id_06  id_07  id_08  \
0        2987004    0.0   70787.0    NaN    NaN    NaN    NaN    NaN    NaN   
1        2987008   -5.0   98945.0    NaN    NaN    0.0   -5.0    NaN    NaN   
2        2987010   -5.0  191631.0    0.0    0.0    0.0    0.0    NaN    NaN   
3        2987011   -5.0  221832.0    NaN    NaN    0.0   -6.0    NaN    NaN   
4        2987016    0.0    7460.0    0.0    0.0    1.0    0.0    NaN    NaN   

   id_09  id_10  id_11     id_12  id_13  id_14  id_15     id_16  id_17  id_18  \
0    NaN    NaN  100.0  NotFound    NaN -480.0    New  NotFound  166.0    NaN   
1    NaN    NaN  100.0  NotFound   49.0 -300.0    New  NotFound  166.0    NaN   
2    0.0    0.0  100.0  NotFound   52.0    NaN  Found     Found  121.0    NaN   
3    NaN    NaN  100.0  NotFound   52.0    NaN    New  NotFound  225.0    NaN   
4    0.0    0.0  100.0  NotFound    NaN -300.0  Found     Found  166.0   15.0   

   id_19  id_20  id_21  id_22 id_23  id_24  id_25  id_26 id_27  id_28  \
0  542.0  144.0    NaN    NaN   NaN    NaN    NaN    NaN   NaN    New   
1  621.0  500.0    NaN    NaN   NaN    NaN    NaN    NaN   NaN    New   
2  410.0  142.0    NaN    NaN   NaN    NaN    NaN    NaN   NaN  Found   
3  176.0  507.0    NaN    NaN   NaN    NaN    NaN    NaN   NaN    New   
4  529.0  575.0    NaN    NaN   NaN    NaN    NaN    NaN   NaN  Found   

      id_29             id_30                id_31  id_32      id_33  \
0  NotFound       Android 7.0  samsung browser 6.2   32.0  2220x1080   
1  NotFound        iOS 11.1.2   mobile safari 11.0   32.0   1334x750   
2     Found               NaN          chrome 62.0    NaN        NaN   
3  NotFound               NaN          chrome 62.0    NaN        NaN   
4     Found  Mac OS X 10_11_6          chrome 62.0   24.0   1280x800   

            id_34 id_35 id_36 id_37 id_38 DeviceType  \
0  match_status:2     T     F     T     T     mobile   
1  match_status:1     T     F     F     T     mobile   
2             NaN     F     F     T     T    desktop   
3             NaN     F     F     T     T    desktop   
4  match_status:2     T     F     T     T    desktop   

                      DeviceInfo  
0  SAMSUNG SM-G892A Build/NRD90M  
1                     iOS Device  
2                        Windows  
3                            NaN  
4                          MacOS  

The outputs above show that there are many features in these two datasets that have missing values (NaN).

# Check the number of observations and features of the data sets.
heading('Transaction dataset')
print(transaction.info())
print('\n')

heading('Identity dataset')
print(identity.info())
Transaction dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590540 entries, 0 to 590539
Columns: 394 entries, TransactionID to V339
dtypes: float64(376), int64(4), object(14)
memory usage: 1.7+ GB
None


Identity dataset

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 144233 entries, 0 to 144232
Data columns (total 41 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   TransactionID  144233 non-null  int64  
 1   id_01          144233 non-null  float64
 2   id_02          140872 non-null  float64
 3   id_03          66324 non-null   float64
 4   id_04          66324 non-null   float64
 5   id_05          136865 non-null  float64
 6   id_06          136865 non-null  float64
 7   id_07          5155 non-null    float64
 8   id_08          5155 non-null    float64
 9   id_09          74926 non-null   float64
 10  id_10          74926 non-null   float64
 11  id_11          140978 non-null  float64
 12  id_12          144233 non-null  object 
 13  id_13          127320 non-null  float64
 14  id_14          80044 non-null   float64
 15  id_15          140985 non-null  object 
 16  id_16          129340 non-null  object 
 17  id_17          139369 non-null  float64
 18  id_18          45113 non-null   float64
 19  id_19          139318 non-null  float64
 20  id_20          139261 non-null  float64
 21  id_21          5159 non-null    float64
 22  id_22          5169 non-null    float64
 23  id_23          5169 non-null    object 
 24  id_24          4747 non-null    float64
 25  id_25          5132 non-null    float64
 26  id_26          5163 non-null    float64
 27  id_27          5169 non-null    object 
 28  id_28          140978 non-null  object 
 29  id_29          140978 non-null  object 
 30  id_30          77565 non-null   object 
 31  id_31          140282 non-null  object 
 32  id_32          77586 non-null   float64
 33  id_33          73289 non-null   object 
 34  id_34          77805 non-null   object 
 35  id_35          140985 non-null  object 
 36  id_36          140985 non-null  object 
 37  id_37          140985 non-null  object 
 38  id_38          140985 non-null  object 
 39  DeviceType     140810 non-null  object 
 40  DeviceInfo     118666 non-null  object 
dtypes: float64(23), int64(1), object(17)
memory usage: 45.1+ MB
None
# Summarise key information about numeric features in each of the datasets.
heading('Transaction dataset')
print(transaction.describe())
print('\n')

heading('Identity dataset')
print(identity.describe())
Transaction dataset

       TransactionID        isFraud  TransactionDT  TransactionAmt  \
count   5.905400e+05  590540.000000   5.905400e+05   590540.000000   
mean    3.282270e+06       0.034990   7.372311e+06      135.027176   
std     1.704744e+05       0.183755   4.617224e+06      239.162522   
min     2.987000e+06       0.000000   8.640000e+04        0.251000   
25%     3.134635e+06       0.000000   3.027058e+06       43.321000   
50%     3.282270e+06       0.000000   7.306528e+06       68.769000   
75%     3.429904e+06       0.000000   1.124662e+07      125.000000   
max     3.577539e+06       1.000000   1.581113e+07    31937.391000   

               card1          card2          card3          card5  \
count  590540.000000  581607.000000  588975.000000  586281.000000   
mean     9898.734658     362.555488     153.194925     199.278897   
std      4901.170153     157.793246      11.336444      41.244453   
min      1000.000000     100.000000     100.000000     100.000000   
25%      6019.000000     214.000000     150.000000     166.000000   
50%      9678.000000     361.000000     150.000000     226.000000   
75%     14184.000000     512.000000     150.000000     226.000000   
max     18396.000000     600.000000     231.000000     237.000000   

               addr1          addr2          dist1         dist2  \
count  524834.000000  524834.000000  238269.000000  37627.000000   
mean      290.733794      86.800630     118.502180    231.855423   
std       101.741072       2.690623     371.872026    529.053494   
min       100.000000      10.000000       0.000000      0.000000   
25%       204.000000      87.000000       3.000000      7.000000   
50%       299.000000      87.000000       8.000000     37.000000   
75%       330.000000      87.000000      24.000000    206.000000   
max       540.000000     102.000000   10286.000000  11623.000000   

                  C1             C2             C3             C4  \
count  590540.000000  590540.000000  590540.000000  590540.000000   
mean       14.092458      15.269734       0.005644       4.092185   
std       133.569018     154.668899       0.150536      68.848459   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       0.000000       0.000000   
50%         1.000000       1.000000       0.000000       0.000000   
75%         3.000000       3.000000       0.000000       0.000000   
max      4685.000000    5691.000000      26.000000    2253.000000   

                  C5             C6             C7             C8  \
count  590540.000000  590540.000000  590540.000000  590540.000000   
mean        5.571526       9.071082       2.848478       5.144574   
std        25.786976      71.508467      61.727304      95.378574   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       1.000000       0.000000       0.000000   
50%         0.000000       1.000000       0.000000       0.000000   
75%         1.000000       2.000000       0.000000       0.000000   
max       349.000000    2253.000000    2255.000000    3331.000000   

                  C9            C10            C11            C12  \
count  590540.000000  590540.000000  590540.000000  590540.000000   
mean        4.480240       5.240343      10.241521       4.076227   
std        16.674897      95.581443      94.336292      86.666218   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       1.000000       0.000000   
50%         1.000000       0.000000       1.000000       0.000000   
75%         2.000000       0.000000       2.000000       0.000000   
max       210.000000    3257.000000    3188.000000    3188.000000   

                 C13            C14             D1             D2  \
count  590540.000000  590540.000000  589271.000000  309743.000000   
mean       32.539918       8.295215      94.347568     169.563231   
std       129.364844      49.544262     157.660387     177.315865   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       0.000000      26.000000   
50%         3.000000       1.000000       3.000000      97.000000   
75%        12.000000       2.000000     122.000000     276.000000   
max      2918.000000    1429.000000     640.000000     640.000000   

                  D3             D4             D5            D6  \
count  327662.000000  421618.000000  280699.000000  73187.000000   
mean       28.343348     140.002441      42.335965     69.805717   
std        62.384721     191.096774      89.000144    143.669253   
min         0.000000    -122.000000       0.000000    -83.000000   
25%         1.000000       0.000000       1.000000      0.000000   
50%         8.000000      26.000000      10.000000      0.000000   
75%        27.000000     253.000000      32.000000     40.000000   
max       819.000000     869.000000     819.000000    873.000000   

                 D7            D8            D9            D10            D11  \
count  38917.000000  74926.000000  74926.000000  514518.000000  311253.000000   
mean      41.638950    146.058108      0.561057     123.982137     146.621465   
std       99.743264    231.663840      0.316880     182.615225     186.042622   
min        0.000000      0.000000      0.000000       0.000000     -53.000000   
25%        0.000000      0.958333      0.208333       0.000000       0.000000   
50%        0.000000     37.875000      0.666666      15.000000      43.000000   
75%       17.000000    187.958328      0.833333     197.000000     274.000000   
max      843.000000   1707.791626      0.958333     876.000000     670.000000   

                D12           D13           D14            D15             V1  \
count  64717.000000  61952.000000  62187.000000  501427.000000  311253.000000   
mean      54.037533     17.901295     57.724444     163.744579       0.999945   
std      124.274558     67.614425    136.312450     202.726660       0.007390   
min      -83.000000      0.000000   -193.000000     -83.000000       0.000000   
25%        0.000000      0.000000      0.000000       0.000000       1.000000   
50%        0.000000      0.000000      0.000000      52.000000       1.000000   
75%       13.000000      0.000000      2.000000     314.000000       1.000000   
max      648.000000    847.000000    878.000000     879.000000       1.000000   

                  V2             V3             V4             V5  \
count  311253.000000  311253.000000  311253.000000  311253.000000   
mean        1.045204       1.078075       0.846456       0.876991   
std         0.240133       0.320890       0.440053       0.475902   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       1.000000   
50%         1.000000       1.000000       1.000000       1.000000   
75%         1.000000       1.000000       1.000000       1.000000   
max         8.000000       9.000000       6.000000       6.000000   

                  V6             V7             V8             V9  \
count  311253.000000  311253.000000  311253.000000  311253.000000   
mean        1.045686       1.072870       1.027704       1.041529   
std         0.239385       0.304779       0.186069       0.226339   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       1.000000   
50%         1.000000       1.000000       1.000000       1.000000   
75%         1.000000       1.000000       1.000000       1.000000   
max         9.000000       9.000000       8.000000       8.000000   

                 V10            V11            V12            V13  \
count  311253.000000  311253.000000  514467.000000  514467.000000   
mean        0.463915       0.478987       0.559711       0.599166   
std         0.521522       0.552431       0.510533       0.532185   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       1.000000       1.000000   
75%         1.000000       1.000000       1.000000       1.000000   
max         4.000000       5.000000       3.000000       6.000000   

                 V14            V15            V16            V17  \
count  514467.000000  514467.000000  514467.000000  514467.000000   
mean        0.999500       0.122342       0.123460       0.134040   
std         0.022345       0.332422       0.342841       0.364456   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       0.000000       0.000000       0.000000   
50%         1.000000       0.000000       0.000000       0.000000   
75%         1.000000       0.000000       0.000000       0.000000   
max         1.000000       7.000000      15.000000      15.000000   

                 V18            V19            V20            V21  \
count  514467.000000  514467.000000  514467.000000  514467.000000   
mean        0.135363       0.816371       0.847843       0.129684   
std         0.371763       0.425512       0.459451       0.339060   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       1.000000       1.000000       0.000000   
50%         0.000000       1.000000       1.000000       0.000000   
75%         0.000000       1.000000       1.000000       0.000000   
max        15.000000       7.000000      15.000000       5.000000   

                 V22            V23            V24            V25  \
count  514467.000000  514467.000000  514467.000000  514467.000000   
mean        0.132292       1.034791       1.058097       0.977660   
std         0.359415       0.247681       0.305485       0.185245   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       1.000000       1.000000       1.000000   
50%         0.000000       1.000000       1.000000       1.000000   
75%         0.000000       1.000000       1.000000       1.000000   
max         8.000000      13.000000      13.000000       7.000000   

                 V26            V27            V28            V29  \
count  514467.000000  514467.000000  514467.000000  514467.000000   
mean        0.988040       0.000776       0.000830       0.387840   
std         0.209302       0.028596       0.031133       0.510652   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       0.000000       0.000000       0.000000   
50%         1.000000       0.000000       0.000000       0.000000   
75%         1.000000       0.000000       0.000000       1.000000   
max        13.000000       4.000000       4.000000       5.000000   

                 V30            V31            V32            V33  \
count  514467.000000  514467.000000  514467.000000  514467.000000   
mean        0.406436       0.140761       0.142417       0.130693   
std         0.554311       0.355793       0.368278       0.340900   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         1.000000       0.000000       0.000000       0.000000   
max         9.000000       7.000000      15.000000       7.000000   

                 V34            V35            V36            V37  \
count  514467.000000  421571.000000  421571.000000  421571.000000   
mean        0.139154       0.542594       0.579198       1.108065   
std         0.357549       0.516010       0.539039       0.690571   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       1.000000   
50%         0.000000       1.000000       1.000000       1.000000   
75%         0.000000       1.000000       1.000000       1.000000   
max        13.000000       3.000000       5.000000      54.000000   

                V38            V39            V40            V41  \
count  421571.00000  421571.000000  421571.000000  421571.000000   
mean        1.16240       0.166076       0.177145       0.999269   
std         0.85892       0.451956       0.505786       0.027020   
min         0.00000       0.000000       0.000000       0.000000   
25%         1.00000       0.000000       0.000000       1.000000   
50%         1.00000       0.000000       0.000000       1.000000   
75%         1.00000       0.000000       0.000000       1.000000   
max        54.00000      15.000000      24.000000       1.000000   

                 V42            V43            V44            V45  \
count  421571.000000  421571.000000  421571.000000  421571.000000   
mean        0.156118       0.168942       1.083891       1.120779   
std         0.382896       0.433359       0.639143       0.729774   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       1.000000       1.000000   
50%         0.000000       0.000000       1.000000       1.000000   
75%         0.000000       0.000000       1.000000       1.000000   
max         8.000000       8.000000      48.000000      48.000000   

                 V46            V47            V48            V49  \
count  421571.000000  421571.000000  421571.000000  421571.000000   
mean        1.022286       1.038515       0.383174       0.397724   
std         0.166719       0.231862       0.508189       0.542654   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       0.000000       0.000000   
50%         1.000000       1.000000       0.000000       0.000000   
75%         1.000000       1.000000       1.000000       1.000000   
max         6.000000      12.000000       5.000000       5.000000   

                 V50            V51            V52            V53  \
count  421571.000000  421571.000000  421571.000000  513444.000000   
mean        0.164746       0.170579       0.182695       0.577586   
std         0.373995       0.403899       0.439002       0.511571   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       1.000000   
75%         0.000000       0.000000       0.000000       1.000000   
max         5.000000       6.000000      12.000000       5.000000   

                 V54            V55            V56            V57  \
count  513444.000000  513444.000000  513444.000000  513444.000000   
mean        0.619982       1.067670       1.120979       0.128312   
std         0.534635       0.391364       0.661129       0.349094   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       1.000000       1.000000       0.000000   
50%         1.000000       1.000000       1.000000       0.000000   
75%         1.000000       1.000000       1.000000       0.000000   
max         6.000000      17.000000      51.000000       6.000000   

                 V58            V59            V60            V61  \
count  513444.000000  513444.000000  513444.000000  513444.000000   
mean        0.132453       0.134433       0.142537       0.829785   
std         0.372907       0.379291       0.418230       0.436617   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       1.000000   
50%         0.000000       0.000000       0.000000       1.000000   
75%         0.000000       0.000000       0.000000       1.000000   
max        10.000000      16.000000      16.000000       6.000000   

                 V62            V63            V64            V65  \
count  513444.000000  513444.000000  513444.000000  513444.000000   
mean        0.867563       0.130743       0.141825       0.999663   
std         0.483803       0.355262       0.406157       0.018353   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       0.000000       0.000000       1.000000   
50%         1.000000       0.000000       0.000000       1.000000   
75%         1.000000       0.000000       0.000000       1.000000   
max        10.000000       7.000000       7.000000       1.000000   

                 V66            V67            V68            V69  \
count  513444.000000  513444.000000  513444.000000  513444.000000   
mean        0.981038       0.998121       0.000534       0.390200   
std         0.216235       0.245912       0.023760       0.513696   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       0.000000       0.000000   
50%         1.000000       1.000000       0.000000       0.000000   
75%         1.000000       1.000000       0.000000       1.000000   
max         7.000000       8.000000       2.000000       5.000000   

                 V70            V71            V72            V73  \
count  513444.000000  513444.000000  513444.000000  513444.000000   
mean        0.407924       0.140639       0.145124       0.139982   
std         0.554499       0.364303       0.389212       0.367098   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         1.000000       0.000000       0.000000       0.000000   
max         6.000000       6.000000      10.000000       7.000000   

                 V74            V75            V76            V77  \
count  513444.000000  501376.000000  501376.000000  501376.000000   
mean        0.152147       0.544278       0.587557       1.086893   
std         0.393274       0.514318       0.538230       0.532958   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       1.000000   
50%         0.000000       1.000000       1.000000       1.000000   
75%         0.000000       1.000000       1.000000       1.000000   
max         8.000000       4.000000       6.000000      30.000000   

                 V78            V79            V80            V81  \
count  501376.000000  501376.000000  501376.000000  501376.000000   
mean        1.144462       0.136867       0.143954       0.152696   
std         0.781474       0.379904       0.409932       0.452298   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       0.000000       0.000000       0.000000   
50%         1.000000       0.000000       0.000000       0.000000   
75%         1.000000       0.000000       0.000000       0.000000   
max        31.000000       7.000000      19.000000      19.000000   

                 V82            V83            V84            V85  \
count  501376.000000  501376.000000  501376.000000  501376.000000   
mean        0.844610       0.881965       0.137145       0.149788   
std         0.422529       0.470757       0.361875       0.419445   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       0.000000       0.000000   
50%         1.000000       1.000000       0.000000       0.000000   
75%         1.000000       1.000000       0.000000       0.000000   
max         7.000000       7.000000       7.000000       7.000000   

                 V86            V87            V88            V89  \
count  501376.000000  501376.000000  501376.000000  501376.000000   
mean        1.064885       1.099456       0.999246       0.000902   
std         0.419807       0.511281       0.027447       0.031943   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       0.000000   
50%         1.000000       1.000000       1.000000       0.000000   
75%         1.000000       1.000000       1.000000       0.000000   
max        30.000000      30.000000       1.000000       2.000000   

                 V90            V91            V92            V93  \
count  501376.000000  501376.000000  501376.000000  501376.000000   
mean        0.401862       0.420461       0.150322       0.154812   
std         0.516153       0.561865       0.375682       0.402314   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         1.000000       1.000000       0.000000       0.000000   
max         5.000000       6.000000       7.000000       7.000000   

                 V94            V95           V96            V97  \
count  501376.000000  590226.000000  590226.00000  590226.000000   
mean        0.137007       1.038019       3.00519       1.718933   
std         0.343878      21.034304      40.23949      27.700449   
min         0.000000       0.000000       0.00000       0.000000   
25%         0.000000       0.000000       0.00000       0.000000   
50%         0.000000       0.000000       0.00000       0.000000   
75%         0.000000       0.000000       1.00000       0.000000   
max         2.000000     880.000000    1410.00000     976.000000   

                 V98            V99           V100           V101  \
count  590226.000000  590226.000000  590226.000000  590226.000000   
mean        0.061985       0.894986       0.273504       0.889249   
std         0.284995       2.722552       0.947176      20.582571   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       1.000000       0.000000       0.000000   
max        12.000000      88.000000      28.000000     869.000000   

                V102           V103           V104           V105  \
count  590226.000000  590226.000000  590226.000000  590226.000000   
mean        1.827229       1.279288       0.085433       0.281145   
std        35.928456      25.685642       0.648718       3.373948   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       0.000000       0.000000   
max      1285.000000     928.000000      15.000000      99.000000   

                V106           V107           V108           V109  \
count  590226.000000  590226.000000  590226.000000  590226.000000   
mean        0.164584       0.999580       1.004613       1.014816   
std         1.825665       0.020494       0.081146       0.127769   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       1.000000       1.000000       1.000000   
50%         0.000000       1.000000       1.000000       1.000000   
75%         0.000000       1.000000       1.000000       1.000000   
max        55.000000       1.000000       7.000000       7.000000   

                V110           V111           V112           V113  \
count  590226.000000  590226.000000  590226.000000  590226.000000   
mean        1.007739       1.002563       1.005356       1.003383   
std         0.097290       0.070830       0.084617       0.074754   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       1.000000   
50%         1.000000       1.000000       1.000000       1.000000   
75%         1.000000       1.000000       1.000000       1.000000   
max         7.000000       9.000000       9.000000       9.000000   

                V114           V115           V116           V117  \
count  590226.000000  590226.000000  590226.000000  590226.000000   
mean        1.009298       1.032450       1.015738       1.000391   
std         0.110179       0.190385       0.136770       0.035238   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       1.000000   
50%         1.000000       1.000000       1.000000       1.000000   
75%         1.000000       1.000000       1.000000       1.000000   
max         6.000000       6.000000       6.000000       3.000000   

                V118           V119           V120           V121  \
count  590226.000000  590226.000000  590226.000000  590226.000000   
mean        1.001474       1.000729       1.000874       1.004276   
std         0.041011       0.036392       0.041684       0.067097   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       1.000000   
50%         1.000000       1.000000       1.000000       1.000000   
75%         1.000000       1.000000       1.000000       1.000000   
max         3.000000       3.000000       3.000000       3.000000   

                V122           V123           V124           V125  \
count  590226.000000  590226.000000  590226.000000  590226.000000   
mean        1.001759       1.031120       1.092975       1.050415   
std         0.048636       0.228134       0.374103       0.280037   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       1.000000   
50%         1.000000       1.000000       1.000000       1.000000   
75%         1.000000       1.000000       1.000000       1.000000   
max         3.000000      13.000000      13.000000      13.000000   

                V126           V127           V128           V129  \
count  590226.000000  590226.000000  590226.000000  590226.000000   
mean      129.979417     336.611559     204.094038       8.768944   
std      2346.951681    4238.666949    3010.258774     113.832828   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000     107.949997       0.000000       0.000000   
max    160000.000000  160000.000000  160000.000000   55125.000000   

                V130           V131           V132           V133  \
count  590226.000000  590226.000000  590226.000000  590226.000000   
mean       92.165849      31.133302     103.513188     204.889160   
std       315.960485     161.161258    2266.106140    3796.316755   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%        59.000000       0.000000       0.000000       0.000000   
max     55125.000000   55125.000000   93736.000000  133915.000000   

                V134           V135           V136           V137  \
count  590226.000000  590226.000000  590226.000000  590226.000000   
mean      145.972328      17.250132      38.821196      26.365090   
std      2772.986817     293.847563     451.808411     348.332714   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       0.000000       0.000000   
max     98476.000000   90750.000000   90750.000000   90750.000000   

               V138          V139          V140          V141          V142  \
count  81945.000000  81945.000000  81945.000000  81945.000000  81945.000000   
mean       0.036439      1.073915      1.125267      0.037696      0.048581   
std        0.428490      1.333924      1.467850      0.215133      0.313888   
min        0.000000      0.000000      0.000000      0.000000      0.000000   
25%        0.000000      0.000000      0.000000      0.000000      0.000000   
50%        0.000000      1.000000      1.000000      0.000000      0.000000   
75%        0.000000      1.000000      1.000000      0.000000      0.000000   
max       22.000000     33.000000     33.000000      5.000000      9.000000   

               V143          V144          V145          V146          V147  \
count  81951.000000  81951.000000  81951.000000  81945.000000  81945.000000   
mean       8.397006      3.708484     22.112946      0.156276      0.168601   
std       55.267545     10.485633     64.371860      0.682328      0.750497   
min        0.000000      0.000000      0.000000      0.000000      0.000000   
25%        0.000000      0.000000      0.000000      0.000000      0.000000   
50%        0.000000      0.000000      0.000000      0.000000      0.000000   
75%        0.000000      0.000000      1.000000      0.000000      0.000000   
max      869.000000     62.000000    297.000000     24.000000     26.000000   

               V148          V149          V150          V151          V152  \
count  81945.000000  81945.000000  81951.000000  81951.000000  81951.000000   
mean       0.765001      0.775313    277.598028      6.460190      9.432710   
std        0.581425      0.628191    829.576922     15.232324     21.554486   
min        0.000000      0.000000      1.000000      1.000000      1.000000   
25%        0.000000      0.000000      1.000000      1.000000      1.000000   
50%        1.000000      1.000000      1.000000      1.000000      1.000000   
75%        1.000000      1.000000      1.000000      1.000000      1.000000   
max       20.000000     20.000000   3389.000000     57.000000     69.000000   

               V153          V154          V155          V156          V157  \
count  81945.000000  81945.000000  81945.000000  81945.000000  81945.000000   
mean       0.753200      0.757423      0.767381      0.777485      0.817573   
std        0.532649      0.546907      0.602678      0.647209      0.682016   
min        0.000000      0.000000      0.000000      0.000000      0.000000   
25%        0.000000      0.000000      0.000000      0.000000      0.000000   
50%        1.000000      1.000000      1.000000      1.000000      1.000000   
75%        1.000000      1.000000      1.000000      1.000000      1.000000   
max       18.000000     18.000000     24.000000     24.000000     24.000000   

               V158          V159           V160          V161          V162  \
count  81945.000000  81951.000000   81951.000000  81945.000000  81945.000000   
mean       0.833461   2719.299775   47453.181173      4.843938      6.594661   
std        0.734336   8355.445049  142076.069162     58.929757     69.195180   
min        0.000000      0.000000       0.000000      0.000000      0.000000   
25%        0.000000      0.000000       0.000000      0.000000      0.000000   
50%        1.000000      0.000000       0.000000      0.000000      0.000000   
75%        1.000000      0.000000       0.000000      0.000000      0.000000   
max       24.000000  55125.000000  641511.437500   3300.000000   3300.000000   

               V163          V164          V165           V166           V167  \
count  81945.000000  81951.000000  81951.000000   81951.000000  139631.000000   
mean       5.505955    877.888928   2239.912219     359.469437       3.929514   
std       63.077887   6049.166505   8223.258928    1244.463270      42.200980   
min        0.000000      0.000000      0.000000       0.000000       0.000000   
25%        0.000000      0.000000      0.000000       0.000000       0.000000   
50%        0.000000      0.000000      0.000000       0.000000       0.000000   
75%        0.000000      0.000000      0.000000       0.000000       0.000000   
max     3300.000000  93736.000000  98476.000000  104060.000000     872.000000   

                V168           V169           V170           V171  \
count  139631.000000  139819.000000  139819.000000  139819.000000   
mean        5.859501       0.168053       1.436779       1.690908   
std        54.032468       0.904944       1.751143       2.444748   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       1.000000       1.000000   
50%         0.000000       0.000000       1.000000       1.000000   
75%         1.000000       0.000000       1.000000       1.000000   
max       964.000000      19.000000      48.000000      61.000000   

                V172           V173           V174           V175  \
count  139631.000000  139631.000000  139819.000000  139819.000000   
mean        0.132012       0.055231       0.127794       0.212639   
std         0.924894       0.264439       0.379763       0.857133   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       0.000000       0.000000   
max        31.000000       7.000000       8.000000      14.000000   

                V176           V177           V178           V179  \
count  139631.000000  139631.000000  139631.000000  139631.000000   
mean        1.376979       3.529245       6.649283       4.869012   
std         1.829420      41.310622      69.447364      50.489426   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       0.000000       0.000000       0.000000   
50%         1.000000       0.000000       0.000000       0.000000   
75%         1.000000       0.000000       0.000000       0.000000   
max        48.000000     861.000000    1235.000000     920.000000   

                V180           V181           V182           V183  \
count  139819.000000  139631.000000  139631.000000  139631.000000   
mean        0.924123       0.253332       0.856243       0.484692   
std         6.112630       1.248974       5.792934       2.946982   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       0.000000       0.000000   
max        83.000000      24.000000      83.000000      41.000000   

                V184           V185           V186           V187  \
count  139819.000000  139819.000000  139631.000000  139631.000000   
mean        0.132364       0.173903       1.148799       1.843752   
std         0.543699       0.699733       1.223529       9.884390   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       1.000000       1.000000   
50%         0.000000       0.000000       1.000000       1.000000   
75%         0.000000       0.000000       1.000000       1.000000   
max        16.000000      31.000000      38.000000     218.000000   

                V188           V189           V190           V191  \
count  139819.000000  139819.000000  139631.000000  139631.000000   
mean        1.014755       1.038314       1.213083       1.058984   
std         0.671207       0.792528       1.480318       0.677527   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       1.000000   
50%         1.000000       1.000000       1.000000       1.000000   
75%         1.000000       1.000000       1.000000       1.000000   
max        30.000000      30.000000      42.000000      21.000000   

                V192           V193           V194           V195  \
count  139631.000000  139631.000000  139819.000000  139819.000000   
mean        1.237361       1.149845       0.945773       0.954227   
std         2.635961       1.714792       0.304158       0.376420   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       1.000000   
50%         1.000000       1.000000       1.000000       1.000000   
75%         1.000000       1.000000       1.000000       1.000000   
max        44.000000      37.000000       7.000000      16.000000   

                V196           V197           V198           V199  \
count  139631.000000  139819.000000  139819.000000  139631.000000   
mean        1.084580       0.949385       0.961951       1.270749   
std         1.105834       0.336871       0.493065       1.671863   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       1.000000   
50%         1.000000       1.000000       1.000000       1.000000   
75%         1.000000       1.000000       1.000000       1.000000   
max        38.000000      14.000000      21.000000      45.000000   

                V200           V201           V202           V203  \
count  139819.000000  139819.000000  139631.000000  139631.000000   
mean        1.119977       1.159106     444.147142    1078.327538   
std         1.253853       1.418618    4683.828419    9105.607991   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       0.000000       0.000000   
50%         1.000000       1.000000       0.000000       0.000000   
75%         1.000000       1.000000       0.000000      30.924400   
max        45.000000      55.000000  104060.000000  139777.000000   

                V204           V205           V206           V207  \
count  139631.000000  139631.000000  139631.000000  139631.000000   
mean      686.956931      18.060417       6.189360      72.284098   
std      6048.980716     266.545451     191.474014     925.676660   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%        20.000000       0.000000       0.000000       0.000000   
max    104060.000000   55125.000000   55125.000000   55125.000000   

                V208           V209           V210           V211  \
count  139819.000000  139819.000000  139819.000000  139631.000000   
mean        8.888193      35.001872      14.391835     385.137037   
std        59.047125     250.800822      86.523724    4541.837915   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       0.000000       0.000000   
max      3300.000000    8050.000000    3300.000000   92888.000000   

                V212           V213           V214           V215  \
count  139631.000000  139631.000000  139631.000000  139631.000000   
mean      765.988339     536.302802      38.437547     133.208217   
std      7496.120737    5471.664736     571.834283    1040.453748   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       0.000000       0.000000   
max    129006.000000   97628.000000  104060.000000  104060.000000   

                V216           V217           V218           V219  \
count  139631.000000  130430.000000  130430.000000  130430.000000   
mean       71.107143       1.054619       1.725784       1.367132   
std       680.267625       9.547354      13.919876      12.044032   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       1.000000       1.000000   
max    104060.000000     303.000000     400.000000     378.000000   

                V220           V221           V222           V223  \
count  141416.000000  141416.000000  141416.000000  130430.000000   
mean        0.169472       1.274912       1.359005       0.093153   
std         1.002463       2.728792       2.913772       0.402232   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       1.000000       1.000000       0.000000   
50%         0.000000       1.000000       1.000000       0.000000   
75%         0.000000       1.000000       1.000000       0.000000   
max        25.000000     384.000000     384.000000      16.000000   

                V224           V225           V226           V227  \
count  130430.000000  130430.000000  130430.000000  141416.000000   
mean        0.394748       0.191758       0.247612       0.147975   
std         3.443310       1.290820       2.756449       2.049363   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       0.000000       0.000000   
max       144.000000      51.000000     242.000000     360.000000   

                V228           V229           V230           V231  \
count  130430.000000  130430.000000  130430.000000  130430.000000   
mean        1.351928       1.638710       1.465476       0.765506   
std         1.409722       4.316026       2.122783       8.877119   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       0.000000   
50%         1.000000       1.000000       1.000000       0.000000   
75%         1.000000       1.000000       1.000000       0.000000   
max        54.000000     176.000000      65.000000     293.000000   

                V232           V233           V234           V235  \
count  130430.000000  130430.000000  141416.000000  130430.000000   
mean        1.005597       0.908388       2.092847       0.184106   
std        10.460253      10.149884      11.323036       0.902639   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       0.000000       0.000000   
max       337.000000     332.000000     121.000000      23.000000   

                V236           V237           V238           V239  \
count  130430.000000  130430.000000  141416.000000  141416.000000   
mean        0.307667       0.253500       0.127956       0.136965   
std         2.152332       1.716634       0.574657       0.616260   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       0.000000       0.000000   
max        45.000000      39.000000      23.000000      23.000000   

                V240           V241           V242           V243  \
count  130430.000000  130430.000000  130430.000000  130430.000000   
mean        1.000997       1.000238       1.113463       1.178387   
std         0.049522       0.022663       0.660110       1.398953   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       1.000000   
50%         1.000000       1.000000       1.000000       1.000000   
75%         1.000000       1.000000       1.000000       1.000000   
max         7.000000       5.000000      20.000000      57.000000   

                V244           V245           V246           V247  \
count  130430.000000  141416.000000  130430.000000  130430.000000   
mean        1.118562       0.876619       1.183723       1.025355   
std         0.698807       1.821065       1.040363       0.281604   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       1.000000   
50%         1.000000       1.000000       1.000000       1.000000   
75%         1.000000       1.000000       1.000000       1.000000   
max        22.000000     262.000000      45.000000      18.000000   

                V248           V249           V250           V251  \
count  130430.000000  130430.000000  141416.000000  141416.000000   
mean        1.069493       1.042851       0.788588       0.794231   
std         0.888341       0.528812       0.488386       0.501240   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       1.000000   
50%         1.000000       1.000000       1.000000       1.000000   
75%         1.000000       1.000000       1.000000       1.000000   
max        36.000000      22.000000      18.000000      18.000000   

                V252           V253           V254           V255  \
count  130430.000000  130430.000000  130430.000000  141416.000000   
mean        1.032048       1.162164       1.075113       0.806656   
std         0.389546       3.612553       1.357550       0.939124   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       1.000000   
50%         1.000000       1.000000       1.000000       1.000000   
75%         1.000000       1.000000       1.000000       1.000000   
max        24.000000     163.000000      60.000000      87.000000   

                V256           V257           V258           V259  \
count  141416.000000  130430.000000  130430.000000  141416.000000   
mean        0.814950       1.250993       1.343510       0.967832   
std         0.969953       1.299956       2.015811       2.113447   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       1.000000   
50%         1.000000       1.000000       1.000000       1.000000   
75%         1.000000       1.000000       1.000000       1.000000   
max        87.000000      48.000000      66.000000     285.000000   

                V260           V261           V262           V263  \
count  130430.000000  130430.000000  130430.000000  130430.000000   
mean        0.964425       1.107161       1.013279     117.390676   
std         0.379828       1.323230       0.625455    1294.851543   
min         0.000000       0.000000       0.000000       0.000000   
25%         1.000000       1.000000       1.000000       0.000000   
50%         1.000000       1.000000       1.000000       0.000000   
75%         1.000000       1.000000       1.000000       0.000000   
max         8.000000      49.000000      20.000000  153600.000000   

                V264           V265           V266           V267  \
count  130430.000000  130430.000000  130430.000000  130430.000000   
mean      201.657617     153.520534       9.167839      36.525103   
std      2284.827492    1605.512276     208.038863     644.878586   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%        33.593498      20.897525       0.000000       0.000000   
max    153600.000000  153600.000000   55125.000000   55125.000000   

                V268           V269           V270           V271  \
count  130430.000000  130430.000000  141416.000000  141416.000000   
mean       18.813407       5.997508       7.712764       9.445133   
std       311.280181     207.317539      65.507232      74.147726   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       0.000000       0.000000   
max     55125.000000   55125.000000    4000.000000    4000.000000   

                V272           V273           V274           V275  \
count  141416.000000  130430.000000  130430.000000  130430.000000   
mean        8.464571      73.825549     107.151636      88.899939   
std        69.723735     935.187927    1258.734139    1072.883139   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       0.000000       0.000000   
max      4000.000000   51200.000000   66000.000000   51200.000000   

                V276           V277           V278           V279  \
count  130430.000000  130430.000000  130430.000000  590528.000000   
mean       31.797277      51.956645      42.328228       1.123061   
std       615.659714     732.145368     660.611820      21.021950   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       0.000000       0.000000   
max    104060.000000  104060.000000  104060.000000     880.000000   

                V280           V281           V282           V283  \
count  590528.000000  589271.000000  589271.000000  589271.000000   
mean        1.967082       0.087783       0.817171       0.991114   
std        27.851780       0.512748       0.921880       1.558731   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       1.000000       1.000000   
75%         1.000000       0.000000       1.000000       1.000000   
max       975.000000      22.000000      32.000000      68.000000   

                V284           V285           V286           V287  \
count  590528.000000  590528.000000  590528.000000  590528.000000   
mean        0.088543       1.167660       0.031492       0.358579   
std         0.338424       3.282454       0.190939       1.078995   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       1.000000       0.000000       0.000000   
max        12.000000      95.000000       8.000000      31.000000   

                V288           V289           V290           V291  \
count  589271.000000  589271.000000  590528.000000  590528.000000   
mean        0.184350       0.235975       1.103011       1.659811   
std         0.430989       0.599231       0.768897      16.252538   
min         0.000000       0.000000       1.000000       1.000000   
25%         0.000000       0.000000       1.000000       1.000000   
50%         0.000000       0.000000       1.000000       1.000000   
75%         0.000000       0.000000       1.000000       1.000000   
max        10.000000      12.000000      67.000000    1055.000000   

                V292           V293           V294           V295  \
count  590528.000000  590528.000000  590528.000000  590528.000000   
mean        1.239916       0.942599       2.313863       1.433424   
std         3.775050      20.588816      39.526468      25.962948   
min         1.000000       0.000000       0.000000       0.000000   
25%         1.000000       0.000000       0.000000       0.000000   
50%         1.000000       0.000000       0.000000       0.000000   
75%         1.000000       0.000000       0.000000       0.000000   
max       323.000000     869.000000    1286.000000     928.000000   

                V296           V297           V298           V299  \
count  589271.000000  590528.000000  590528.000000  590528.000000   
mean        0.328917       0.089034       0.298829       0.171655   
std         3.264745       0.628352       3.175062       1.724218   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       0.000000       0.000000   
max        93.000000      12.000000      93.000000      49.000000   

                V300           V301           V302           V303  \
count  589271.000000  589271.000000  590528.000000  590528.000000   
mean        0.045507       0.052002       0.251761       0.283140   
std         0.289573       0.318310       0.481889       0.623608   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       0.000000       0.000000   
max        11.000000      13.000000      16.000000      20.000000   

                V304           V305           V306           V307  \
count  590528.000000  590528.000000  590528.000000  590528.000000   
mean        0.264208       1.000007     139.748713     408.682375   
std         0.528238       0.002603    2348.849634    4391.992977   
min         0.000000       1.000000       0.000000       0.000000   
25%         0.000000       1.000000       0.000000       0.000000   
50%         0.000000       1.000000       0.000000       0.000000   
75%         0.000000       1.000000       0.000000     151.380680   
max        16.000000       2.000000  108800.000000  145765.000000   

                V308           V309           V310           V311  \
count  590528.000000  590528.000000  590528.000000  590528.000000   
mean      230.413180      10.995986     118.195658       4.202175   
std      3021.924247     116.254277     352.983093     102.374938   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%        35.970001       0.000000     107.949997       0.000000   
max    108800.000000   55125.000000   55125.000000   55125.000000   

                V312           V313           V314           V315  \
count  590528.000000  589271.000000  589271.000000  589271.000000   
mean       39.173910      21.351473      43.319174      26.806977   
std       172.128339      95.902970     173.619028     116.853222   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       0.000000       0.000000   
max     55125.000000    4817.470215    7519.870117    4817.470215   

                V316           V317           V318           V319  \
count  590528.000000  590528.000000  590528.000000  590528.000000   
mean      109.818544     247.606741     162.153398      18.372476   
std      2270.033202    3980.042828    2793.343636     332.304848   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       0.000000   
50%         0.000000       0.000000       0.000000       0.000000   
75%         0.000000       0.000000       0.000000       0.000000   
max     93736.000000  134021.000000   98476.000000  104060.000000   

                V320           V321          V322          V323          V324  \
count  590528.000000  590528.000000  82351.000000  82351.000000  82351.000000   
mean       42.073133      28.326584      6.220289     13.103775      9.184612   
std       473.499307     382.053171     56.022561    106.739813     73.627893   
min         0.000000       0.000000      0.000000      0.000000      0.000000   
25%         0.000000       0.000000      0.000000      0.000000      0.000000   
50%         0.000000       0.000000      0.000000      0.000000      0.000000   
75%         0.000000       0.000000      0.000000      1.000000      0.000000   
max    104060.000000  104060.000000    880.000000   1411.000000    976.000000   

               V325          V326          V327          V328          V329  \
count  82351.000000  82351.000000  82351.000000  82351.000000  82351.000000   
mean       0.058494      0.851040      0.296633      0.336790      1.312844   
std        0.304415      3.950295      1.364356      1.580144      8.769083   
min        0.000000      0.000000      0.000000      0.000000      0.000000   
25%        0.000000      0.000000      0.000000      0.000000      0.000000   
50%        0.000000      0.000000      0.000000      0.000000      0.000000   
75%        0.000000      0.000000      0.000000      0.000000      0.000000   
max       12.000000     44.000000     18.000000     15.000000     99.000000   

               V330           V331           V332           V333  \
count  82351.000000   82351.000000   82351.000000   82351.000000   
mean       0.775874     721.741883    1375.783644    1014.622782   
std        4.727971    6217.223583   11169.275702    7955.735482   
min        0.000000       0.000000       0.000000       0.000000   
25%        0.000000       0.000000       0.000000       0.000000   
50%        0.000000       0.000000       0.000000       0.000000   
75%        0.000000       0.000000      25.000000       0.000000   
max       55.000000  160000.000000  160000.000000  160000.000000   

               V334         V335          V336           V337           V338  \
count  82351.000000  82351.00000  82351.000000   82351.000000   82351.000000   
mean       9.807015     59.16455     28.530903      55.352422     151.160542   
std      243.861391    387.62948    274.576920     668.486833    1095.034387   
min        0.000000      0.00000      0.000000       0.000000       0.000000   
25%        0.000000      0.00000      0.000000       0.000000       0.000000   
50%        0.000000      0.00000      0.000000       0.000000       0.000000   
75%        0.000000      0.00000      0.000000       0.000000       0.000000   
max    55125.000000  55125.00000  55125.000000  104060.000000  104060.000000   

                V339  
count   82351.000000  
mean      100.700882  
std       814.946722  
min         0.000000  
25%         0.000000  
50%         0.000000  
75%         0.000000  
max    104060.000000  


Identity dataset
       TransactionID          id_01          id_02         id_03  \
count   1.442330e+05  144233.000000  140872.000000  66324.000000   
mean    3.236329e+06     -10.170502  174716.584708      0.060189   
std     1.788496e+05      14.347949  159651.816856      0.598231   
min     2.987004e+06    -100.000000       1.000000    -13.000000   
25%     3.077142e+06     -10.000000   67992.000000      0.000000   
50%     3.198818e+06      -5.000000  125800.500000      0.000000   
75%     3.392923e+06      -5.000000  228749.000000      0.000000   
max     3.577534e+06       0.000000  999595.000000     10.000000   

              id_04          id_05          id_06        id_07        id_08  \
count  66324.000000  136865.000000  136865.000000  5155.000000  5155.000000   
mean      -0.058938       1.615585      -6.698710    13.285354   -38.600388   
std        0.701015       5.249856      16.491104    11.384207    26.084899   
min      -28.000000     -72.000000    -100.000000   -46.000000  -100.000000   
25%        0.000000       0.000000      -6.000000     5.000000   -48.000000   
50%        0.000000       0.000000       0.000000    14.000000   -34.000000   
75%        0.000000       1.000000       0.000000    22.000000   -23.000000   
max        0.000000      52.000000       0.000000    61.000000     0.000000   

              id_09         id_10          id_11          id_13         id_14  \
count  74926.000000  74926.000000  140978.000000  127320.000000  80044.000000   
mean       0.091023     -0.301124      99.745325      48.053071   -344.507146   
std        0.983842      2.789446       1.127602      11.774858     93.695502   
min      -36.000000   -100.000000      90.000000      10.000000   -660.000000   
25%        0.000000      0.000000     100.000000      49.000000   -360.000000   
50%        0.000000      0.000000     100.000000      52.000000   -300.000000   
75%        0.000000      0.000000     100.000000      52.000000   -300.000000   
max       25.000000      0.000000     100.000000      64.000000    720.000000   

               id_17         id_18          id_19          id_20        id_21  \
count  139369.000000  45113.000000  139318.000000  139261.000000  5159.000000   
mean      189.451377     14.237337     353.128174     403.882666   368.269820   
std        30.375360      1.561302     141.095343     152.160327   198.847038   
min       100.000000     10.000000     100.000000     100.000000   100.000000   
25%       166.000000     13.000000     266.000000     256.000000   252.000000   
50%       166.000000     15.000000     341.000000     472.000000   252.000000   
75%       225.000000     15.000000     427.000000     533.000000   486.500000   
max       229.000000     29.000000     671.000000     661.000000   854.000000   

             id_22        id_24        id_25        id_26         id_32  
count  5169.000000  4747.000000  5132.000000  5163.000000  77586.000000  
mean     16.002708    12.800927   329.608924   149.070308     26.508597  
std       6.897665     2.372447    97.461089    32.101995      3.737502  
min      10.000000    11.000000   100.000000   100.000000      0.000000  
25%      14.000000    11.000000   321.000000   119.000000     24.000000  
50%      14.000000    11.000000   321.000000   149.000000     24.000000  
75%      14.000000    15.000000   371.000000   169.000000     32.000000  
max      44.000000    26.000000   548.000000   216.000000     32.000000  
# List the columns in each dataset.
print(list(transaction.columns))
print(list(identity.columns))
['TransactionID', 'isFraud', 'TransactionDT', 'TransactionAmt', 'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5', 'card6', 'addr1', 'addr2', 'dist1', 'dist2', 'P_emaildomain', 'R_emaildomain', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11', 'C12', 'C13', 'C14', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'D10', 'D11', 'D12', 'D13', 'D14', 'D15', 'M1', 'M2', 'M3', 'M4', 'M5', 'M6', 'M7', 'M8', 'M9', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40', 'V41', 'V42', 'V43', 'V44', 'V45', 'V46', 'V47', 'V48', 'V49', 'V50', 'V51', 'V52', 'V53', 'V54', 'V55', 'V56', 'V57', 'V58', 'V59', 'V60', 'V61', 'V62', 'V63', 'V64', 'V65', 'V66', 'V67', 'V68', 'V69', 'V70', 'V71', 'V72', 'V73', 'V74', 'V75', 'V76', 'V77', 'V78', 'V79', 'V80', 'V81', 'V82', 'V83', 'V84', 'V85', 'V86', 'V87', 'V88', 'V89', 'V90', 'V91', 'V92', 'V93', 'V94', 'V95', 'V96', 'V97', 'V98', 'V99', 'V100', 'V101', 'V102', 'V103', 'V104', 'V105', 'V106', 'V107', 'V108', 'V109', 'V110', 'V111', 'V112', 'V113', 'V114', 'V115', 'V116', 'V117', 'V118', 'V119', 'V120', 'V121', 'V122', 'V123', 'V124', 'V125', 'V126', 'V127', 'V128', 'V129', 'V130', 'V131', 'V132', 'V133', 'V134', 'V135', 'V136', 'V137', 'V138', 'V139', 'V140', 'V141', 'V142', 'V143', 'V144', 'V145', 'V146', 'V147', 'V148', 'V149', 'V150', 'V151', 'V152', 'V153', 'V154', 'V155', 'V156', 'V157', 'V158', 'V159', 'V160', 'V161', 'V162', 'V163', 'V164', 'V165', 'V166', 'V167', 'V168', 'V169', 'V170', 'V171', 'V172', 'V173', 'V174', 'V175', 'V176', 'V177', 'V178', 'V179', 'V180', 'V181', 'V182', 'V183', 'V184', 'V185', 'V186', 'V187', 'V188', 'V189', 'V190', 'V191', 'V192', 'V193', 'V194', 'V195', 'V196', 'V197', 'V198', 'V199', 'V200', 'V201', 'V202', 'V203', 'V204', 'V205', 'V206', 'V207', 'V208', 'V209', 'V210', 'V211', 'V212', 'V213', 'V214', 'V215', 'V216', 'V217', 'V218', 'V219', 'V220', 'V221', 'V222', 'V223', 'V224', 'V225', 'V226', 'V227', 'V228', 'V229', 'V230', 'V231', 'V232', 'V233', 'V234', 'V235', 'V236', 'V237', 'V238', 'V239', 'V240', 'V241', 'V242', 'V243', 'V244', 'V245', 'V246', 'V247', 'V248', 'V249', 'V250', 'V251', 'V252', 'V253', 'V254', 'V255', 'V256', 'V257', 'V258', 'V259', 'V260', 'V261', 'V262', 'V263', 'V264', 'V265', 'V266', 'V267', 'V268', 'V269', 'V270', 'V271', 'V272', 'V273', 'V274', 'V275', 'V276', 'V277', 'V278', 'V279', 'V280', 'V281', 'V282', 'V283', 'V284', 'V285', 'V286', 'V287', 'V288', 'V289', 'V290', 'V291', 'V292', 'V293', 'V294', 'V295', 'V296', 'V297', 'V298', 'V299', 'V300', 'V301', 'V302', 'V303', 'V304', 'V305', 'V306', 'V307', 'V308', 'V309', 'V310', 'V311', 'V312', 'V313', 'V314', 'V315', 'V316', 'V317', 'V318', 'V319', 'V320', 'V321', 'V322', 'V323', 'V324', 'V325', 'V326', 'V327', 'V328', 'V329', 'V330', 'V331', 'V332', 'V333', 'V334', 'V335', 'V336', 'V337', 'V338', 'V339']
['TransactionID', 'id_01', 'id_02', 'id_03', 'id_04', 'id_05', 'id_06', 'id_07', 'id_08', 'id_09', 'id_10', 'id_11', 'id_12', 'id_13', 'id_14', 'id_15', 'id_16', 'id_17', 'id_18', 'id_19', 'id_20', 'id_21', 'id_22', 'id_23', 'id_24', 'id_25', 'id_26', 'id_27', 'id_28', 'id_29', 'id_30', 'id_31', 'id_32', 'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38', 'DeviceType', 'DeviceInfo']
# Check the proportion of fraud to non-fraud observations in the
# transaction dataset.
# Fraudulent transactions are represented by 'isFraud' = 1
# and non-fraudulent transactions are represented by 'isFraud' = 0.
fraud = sum(transaction['isFraud'])
not_fraud = len(transaction.index)-fraud

percent_fraud = fraud/(fraud+not_fraud)
print('There are '+'{:,.0f}'.format(fraud)+' fraudulent transactions out of '+
      '{:,.0f}'.format(fraud+not_fraud)+' total transactions in the dataset.')
print('These fraudulent transactions represent approximately '+'{:,.2%}'.format(percent_fraud)+
      ' of all transactions.')
There are 20,663 fraudulent transactions out of 590,540 total transactions in the dataset.
These fraudulent transactions represent approximately 3.50% of all transactions.

The following information about the datasets is noted:

  • the transaction dataset has 590,540 observations and 394 columns;

  • the response in the transaction dataset is ‘isFraud’;

  • approximately 3.5% of all transactions are fraudulent transactions;

  • the identity dataset has 144,233 observations and 40 columns; and

  • the ‘Transaction_ID’ column can be used as a key to link the transaction and identity datasets together.

# Examine how fraud transactions compare to non fraud transactions
# across some of the features.

# Calculate the average transaction amounts for fraud and non fraud transactions.
heading('Average transaction amounts for fraud and non fraud')
print(transaction.groupby('isFraud').mean()['TransactionAmt'])
print('\n')

# Calculate percentage fraud for each Product Code.
heading('Percentage fraud and total transactions for each Product Code')
product_groups = transaction[['ProductCD','isFraud']].groupby('ProductCD').agg(['mean','count'])
print(product_groups)
Average transaction amounts for fraud and non fraud

isFraud
0    134.511665
1    149.244779
Name: TransactionAmt, dtype: float64


Percentage fraud and total transactions for each Product Code

            isFraud        
               mean   count
ProductCD                  
C          0.116873   68519
H          0.047662   33024
R          0.037826   37699
S          0.058996   11628
W          0.020399  439670

The average transaction amount is slightly higher for fraud cases (\(149 v \)135).

The percentage of transactions that are fraudulent is quite different by product code. The fraud percentage is greatest for product code ‘C’ (11.7%) and smallest for product code ‘W’ (2%). These different fraud percentages for different product codes suggest that it might be helpful to look at instances of fraud separately for each of the 5 product codes.

The rest of this case study will focus on fraud for product code ‘C’.

You could then repeat the analysis below for each of the other product codes.

Prepare data#

# Join the transaction and identity datasets using a left join so that all
# observations in the transaction dataset are retained
# (the transaction dataset contains the response 'isFraud'), and any matching
# observations in the identity dataset are joined into the merge.

product = 'C'
fraud_data = pd.merge(transaction[transaction['ProductCD']=='C'],identity,on='TransactionID',how='left')
print(fraud_data.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Columns: 434 entries, TransactionID to DeviceInfo
dtypes: float64(399), int64(4), object(31)
memory usage: 227.4+ MB
None

The fraud dataset created above contains 68,519 observations of fraud for the product being investigated (product code ‘C’) and 434 columns (‘TransactionID’, ‘isFraud’ and 432 features).

The number of features should be reduced to make the model fitting more tractable.

The features used in the clustering algorithm should be numeric and not contain missing values, so that Euclidean distances between the observations can be calculated.

# Explore the feature types and missingness of the fraud data.

# Explore the first 14 features.
print(fraud_data[['TransactionDT','TransactionAmt','card1','card2','card3','card4',
                  'card5','card6','addr1','addr2','dist1','dist2']].info())

# Explore the 'C#' features.
filter_col = [col for col in fraud_data if col.startswith('C')]
print(fraud_data[filter_col].info())

# Explore the 'D#' features.
filter_col = [col for col in fraud_data if col.startswith('D')]
print(fraud_data[filter_col].info())

# Explore the 'M#' features.
filter_col = [col for col in fraud_data if col.startswith('M')]
print(fraud_data[filter_col].info())

# Explore the first 100 'V#' features.
filter_col = [col for col in fraud_data if col.startswith('V')][0:100]
print(fraud_data[filter_col].info())

# Explore the next 100 'V#' features.
filter_col = [col for col in fraud_data if col.startswith('V')][100:200]
print(fraud_data[filter_col].info())

# Explore the next 100 'V#' features.
filter_col = [col for col in fraud_data if col.startswith('V')][200:300]
print(fraud_data[filter_col].info())

# Explore the final 39 'V#' features.
filter_col = [col for col in fraud_data if col.startswith('V')][300:]
print(fraud_data[filter_col].info())

# Explore the id_## features.
filter_col = [col for col in fraud_data if col.startswith('id')]
print(fraud_data[filter_col].info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   TransactionDT   68519 non-null  int64  
 1   TransactionAmt  68519 non-null  float64
 2   card1           68519 non-null  int64  
 3   card2           67988 non-null  float64
 4   card3           68326 non-null  float64
 5   card4           68324 non-null  object 
 6   card5           67896 non-null  float64
 7   card6           68326 non-null  object 
 8   addr1           3400 non-null   float64
 9   addr2           3400 non-null   float64
 10  dist1           0 non-null      float64
 11  dist2           26741 non-null  float64
dtypes: float64(8), int64(2), object(2)
memory usage: 6.8+ MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   C1      68519 non-null  float64
 1   C2      68519 non-null  float64
 2   C3      68519 non-null  float64
 3   C4      68519 non-null  float64
 4   C5      68519 non-null  float64
 5   C6      68519 non-null  float64
 6   C7      68519 non-null  float64
 7   C8      68519 non-null  float64
 8   C9      68519 non-null  float64
 9   C10     68519 non-null  float64
 10  C11     68519 non-null  float64
 11  C12     68519 non-null  float64
 12  C13     68519 non-null  float64
 13  C14     68519 non-null  float64
dtypes: float64(14)
memory usage: 7.8 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   D1          68324 non-null  float64
 1   D2          20184 non-null  float64
 2   D3          26800 non-null  float64
 3   D4          64356 non-null  float64
 4   D5          32584 non-null  float64
 5   D6          64352 non-null  float64
 6   D7          32503 non-null  float64
 7   D8          29436 non-null  float64
 8   D9          29436 non-null  float64
 9   D10         65021 non-null  float64
 10  D11         0 non-null      float64
 11  D12         64717 non-null  float64
 12  D13         57948 non-null  float64
 13  D14         51083 non-null  float64
 14  D15         64581 non-null  float64
 15  DeviceType  61015 non-null  object 
 16  DeviceInfo  41593 non-null  object 
dtypes: float64(15), object(2)
memory usage: 9.4+ MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 9 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   M1      0 non-null      object
 1   M2      0 non-null      object
 2   M3      0 non-null      object
 3   M4      66785 non-null  object
 4   M5      0 non-null      object
 5   M6      0 non-null      object
 6   M7      0 non-null      object
 7   M8      0 non-null      object
 8   M9      0 non-null      object
dtypes: object(9)
memory usage: 5.2+ MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 100 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      0 non-null      float64
 1   V2      0 non-null      float64
 2   V3      0 non-null      float64
 3   V4      0 non-null      float64
 4   V5      0 non-null      float64
 5   V6      0 non-null      float64
 6   V7      0 non-null      float64
 7   V8      0 non-null      float64
 8   V9      0 non-null      float64
 9   V10     0 non-null      float64
 10  V11     0 non-null      float64
 11  V12     65021 non-null  float64
 12  V13     65021 non-null  float64
 13  V14     65021 non-null  float64
 14  V15     65021 non-null  float64
 15  V16     65021 non-null  float64
 16  V17     65021 non-null  float64
 17  V18     65021 non-null  float64
 18  V19     65021 non-null  float64
 19  V20     65021 non-null  float64
 20  V21     65021 non-null  float64
 21  V22     65021 non-null  float64
 22  V23     65021 non-null  float64
 23  V24     65021 non-null  float64
 24  V25     65021 non-null  float64
 25  V26     65021 non-null  float64
 26  V27     65021 non-null  float64
 27  V28     65021 non-null  float64
 28  V29     65021 non-null  float64
 29  V30     65021 non-null  float64
 30  V31     65021 non-null  float64
 31  V32     65021 non-null  float64
 32  V33     65021 non-null  float64
 33  V34     65021 non-null  float64
 34  V35     64356 non-null  float64
 35  V36     64356 non-null  float64
 36  V37     64356 non-null  float64
 37  V38     64356 non-null  float64
 38  V39     64356 non-null  float64
 39  V40     64356 non-null  float64
 40  V41     64356 non-null  float64
 41  V42     64356 non-null  float64
 42  V43     64356 non-null  float64
 43  V44     64356 non-null  float64
 44  V45     64356 non-null  float64
 45  V46     64356 non-null  float64
 46  V47     64356 non-null  float64
 47  V48     64356 non-null  float64
 48  V49     64356 non-null  float64
 49  V50     64356 non-null  float64
 50  V51     64356 non-null  float64
 51  V52     64356 non-null  float64
 52  V53     67067 non-null  float64
 53  V54     67067 non-null  float64
 54  V55     67067 non-null  float64
 55  V56     67067 non-null  float64
 56  V57     67067 non-null  float64
 57  V58     67067 non-null  float64
 58  V59     67067 non-null  float64
 59  V60     67067 non-null  float64
 60  V61     67067 non-null  float64
 61  V62     67067 non-null  float64
 62  V63     67067 non-null  float64
 63  V64     67067 non-null  float64
 64  V65     67067 non-null  float64
 65  V66     67067 non-null  float64
 66  V67     67067 non-null  float64
 67  V68     67067 non-null  float64
 68  V69     67067 non-null  float64
 69  V70     67067 non-null  float64
 70  V71     67067 non-null  float64
 71  V72     67067 non-null  float64
 72  V73     67067 non-null  float64
 73  V74     67067 non-null  float64
 74  V75     64581 non-null  float64
 75  V76     64581 non-null  float64
 76  V77     64581 non-null  float64
 77  V78     64581 non-null  float64
 78  V79     64581 non-null  float64
 79  V80     64581 non-null  float64
 80  V81     64581 non-null  float64
 81  V82     64581 non-null  float64
 82  V83     64581 non-null  float64
 83  V84     64581 non-null  float64
 84  V85     64581 non-null  float64
 85  V86     64581 non-null  float64
 86  V87     64581 non-null  float64
 87  V88     64581 non-null  float64
 88  V89     64581 non-null  float64
 89  V90     64581 non-null  float64
 90  V91     64581 non-null  float64
 91  V92     64581 non-null  float64
 92  V93     64581 non-null  float64
 93  V94     64581 non-null  float64
 94  V95     68447 non-null  float64
 95  V96     68447 non-null  float64
 96  V97     68447 non-null  float64
 97  V98     68447 non-null  float64
 98  V99     68447 non-null  float64
 99  V100    68447 non-null  float64
dtypes: float64(100)
memory usage: 52.8 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 100 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V101    68447 non-null  float64
 1   V102    68447 non-null  float64
 2   V103    68447 non-null  float64
 3   V104    68447 non-null  float64
 4   V105    68447 non-null  float64
 5   V106    68447 non-null  float64
 6   V107    68447 non-null  float64
 7   V108    68447 non-null  float64
 8   V109    68447 non-null  float64
 9   V110    68447 non-null  float64
 10  V111    68447 non-null  float64
 11  V112    68447 non-null  float64
 12  V113    68447 non-null  float64
 13  V114    68447 non-null  float64
 14  V115    68447 non-null  float64
 15  V116    68447 non-null  float64
 16  V117    68447 non-null  float64
 17  V118    68447 non-null  float64
 18  V119    68447 non-null  float64
 19  V120    68447 non-null  float64
 20  V121    68447 non-null  float64
 21  V122    68447 non-null  float64
 22  V123    68447 non-null  float64
 23  V124    68447 non-null  float64
 24  V125    68447 non-null  float64
 25  V126    68447 non-null  float64
 26  V127    68447 non-null  float64
 27  V128    68447 non-null  float64
 28  V129    68447 non-null  float64
 29  V130    68447 non-null  float64
 30  V131    68447 non-null  float64
 31  V132    68447 non-null  float64
 32  V133    68447 non-null  float64
 33  V134    68447 non-null  float64
 34  V135    68447 non-null  float64
 35  V136    68447 non-null  float64
 36  V137    68447 non-null  float64
 37  V138    0 non-null      float64
 38  V139    0 non-null      float64
 39  V140    0 non-null      float64
 40  V141    0 non-null      float64
 41  V142    0 non-null      float64
 42  V143    0 non-null      float64
 43  V144    0 non-null      float64
 44  V145    0 non-null      float64
 45  V146    0 non-null      float64
 46  V147    0 non-null      float64
 47  V148    0 non-null      float64
 48  V149    0 non-null      float64
 49  V150    0 non-null      float64
 50  V151    0 non-null      float64
 51  V152    0 non-null      float64
 52  V153    0 non-null      float64
 53  V154    0 non-null      float64
 54  V155    0 non-null      float64
 55  V156    0 non-null      float64
 56  V157    0 non-null      float64
 57  V158    0 non-null      float64
 58  V159    0 non-null      float64
 59  V160    0 non-null      float64
 60  V161    0 non-null      float64
 61  V162    0 non-null      float64
 62  V163    0 non-null      float64
 63  V164    0 non-null      float64
 64  V165    0 non-null      float64
 65  V166    0 non-null      float64
 66  V167    59669 non-null  float64
 67  V168    59669 non-null  float64
 68  V169    59885 non-null  float64
 69  V170    59885 non-null  float64
 70  V171    59885 non-null  float64
 71  V172    59669 non-null  float64
 72  V173    59669 non-null  float64
 73  V174    59885 non-null  float64
 74  V175    59885 non-null  float64
 75  V176    59669 non-null  float64
 76  V177    59669 non-null  float64
 77  V178    59669 non-null  float64
 78  V179    59669 non-null  float64
 79  V180    59885 non-null  float64
 80  V181    59669 non-null  float64
 81  V182    59669 non-null  float64
 82  V183    59669 non-null  float64
 83  V184    59885 non-null  float64
 84  V185    59885 non-null  float64
 85  V186    59669 non-null  float64
 86  V187    59669 non-null  float64
 87  V188    59885 non-null  float64
 88  V189    59885 non-null  float64
 89  V190    59669 non-null  float64
 90  V191    59669 non-null  float64
 91  V192    59669 non-null  float64
 92  V193    59669 non-null  float64
 93  V194    59885 non-null  float64
 94  V195    59885 non-null  float64
 95  V196    59669 non-null  float64
 96  V197    59885 non-null  float64
 97  V198    59885 non-null  float64
 98  V199    59669 non-null  float64
 99  V200    59885 non-null  float64
dtypes: float64(100)
memory usage: 52.8 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 100 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V201    59885 non-null  float64
 1   V202    59669 non-null  float64
 2   V203    59669 non-null  float64
 3   V204    59669 non-null  float64
 4   V205    59669 non-null  float64
 5   V206    59669 non-null  float64
 6   V207    59669 non-null  float64
 7   V208    59885 non-null  float64
 8   V209    59885 non-null  float64
 9   V210    59885 non-null  float64
 10  V211    59669 non-null  float64
 11  V212    59669 non-null  float64
 12  V213    59669 non-null  float64
 13  V214    59669 non-null  float64
 14  V215    59669 non-null  float64
 15  V216    59669 non-null  float64
 16  V217    52036 non-null  float64
 17  V218    52036 non-null  float64
 18  V219    52036 non-null  float64
 19  V220    59160 non-null  float64
 20  V221    59160 non-null  float64
 21  V222    59160 non-null  float64
 22  V223    52036 non-null  float64
 23  V224    52036 non-null  float64
 24  V225    52036 non-null  float64
 25  V226    52036 non-null  float64
 26  V227    59160 non-null  float64
 27  V228    52036 non-null  float64
 28  V229    52036 non-null  float64
 29  V230    52036 non-null  float64
 30  V231    52036 non-null  float64
 31  V232    52036 non-null  float64
 32  V233    52036 non-null  float64
 33  V234    59160 non-null  float64
 34  V235    52036 non-null  float64
 35  V236    52036 non-null  float64
 36  V237    52036 non-null  float64
 37  V238    59160 non-null  float64
 38  V239    59160 non-null  float64
 39  V240    52036 non-null  float64
 40  V241    52036 non-null  float64
 41  V242    52036 non-null  float64
 42  V243    52036 non-null  float64
 43  V244    52036 non-null  float64
 44  V245    59160 non-null  float64
 45  V246    52036 non-null  float64
 46  V247    52036 non-null  float64
 47  V248    52036 non-null  float64
 48  V249    52036 non-null  float64
 49  V250    59160 non-null  float64
 50  V251    59160 non-null  float64
 51  V252    52036 non-null  float64
 52  V253    52036 non-null  float64
 53  V254    52036 non-null  float64
 54  V255    59160 non-null  float64
 55  V256    59160 non-null  float64
 56  V257    52036 non-null  float64
 57  V258    52036 non-null  float64
 58  V259    59160 non-null  float64
 59  V260    52036 non-null  float64
 60  V261    52036 non-null  float64
 61  V262    52036 non-null  float64
 62  V263    52036 non-null  float64
 63  V264    52036 non-null  float64
 64  V265    52036 non-null  float64
 65  V266    52036 non-null  float64
 66  V267    52036 non-null  float64
 67  V268    52036 non-null  float64
 68  V269    52036 non-null  float64
 69  V270    59160 non-null  float64
 70  V271    59160 non-null  float64
 71  V272    59160 non-null  float64
 72  V273    52036 non-null  float64
 73  V274    52036 non-null  float64
 74  V275    52036 non-null  float64
 75  V276    52036 non-null  float64
 76  V277    52036 non-null  float64
 77  V278    52036 non-null  float64
 78  V279    68512 non-null  float64
 79  V280    68512 non-null  float64
 80  V281    68324 non-null  float64
 81  V282    68324 non-null  float64
 82  V283    68324 non-null  float64
 83  V284    68512 non-null  float64
 84  V285    68512 non-null  float64
 85  V286    68512 non-null  float64
 86  V287    68512 non-null  float64
 87  V288    68324 non-null  float64
 88  V289    68324 non-null  float64
 89  V290    68512 non-null  float64
 90  V291    68512 non-null  float64
 91  V292    68512 non-null  float64
 92  V293    68512 non-null  float64
 93  V294    68512 non-null  float64
 94  V295    68512 non-null  float64
 95  V296    68324 non-null  float64
 96  V297    68512 non-null  float64
 97  V298    68512 non-null  float64
 98  V299    68512 non-null  float64
 99  V300    68324 non-null  float64
dtypes: float64(100)
memory usage: 52.8 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 39 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V301    68324 non-null  float64
 1   V302    68512 non-null  float64
 2   V303    68512 non-null  float64
 3   V304    68512 non-null  float64
 4   V305    68512 non-null  float64
 5   V306    68512 non-null  float64
 6   V307    68512 non-null  float64
 7   V308    68512 non-null  float64
 8   V309    68512 non-null  float64
 9   V310    68512 non-null  float64
 10  V311    68512 non-null  float64
 11  V312    68512 non-null  float64
 12  V313    68324 non-null  float64
 13  V314    68324 non-null  float64
 14  V315    68324 non-null  float64
 15  V316    68512 non-null  float64
 16  V317    68512 non-null  float64
 17  V318    68512 non-null  float64
 18  V319    68512 non-null  float64
 19  V320    68512 non-null  float64
 20  V321    68512 non-null  float64
 21  V322    0 non-null      float64
 22  V323    0 non-null      float64
 23  V324    0 non-null      float64
 24  V325    0 non-null      float64
 25  V326    0 non-null      float64
 26  V327    0 non-null      float64
 27  V328    0 non-null      float64
 28  V329    0 non-null      float64
 29  V330    0 non-null      float64
 30  V331    0 non-null      float64
 31  V332    0 non-null      float64
 32  V333    0 non-null      float64
 33  V334    0 non-null      float64
 34  V335    0 non-null      float64
 35  V336    0 non-null      float64
 36  V337    0 non-null      float64
 37  V338    0 non-null      float64
 38  V339    0 non-null      float64
dtypes: float64(39)
memory usage: 20.9 MB
None
<class 'pandas.core.frame.DataFrame'>
Int64Index: 68519 entries, 0 to 68518
Data columns (total 38 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   id_01   62192 non-null  float64
 1   id_02   61015 non-null  float64
 2   id_03   27141 non-null  float64
 3   id_04   27141 non-null  float64
 4   id_05   59809 non-null  float64
 5   id_06   59809 non-null  float64
 6   id_07   491 non-null    float64
 7   id_08   491 non-null    float64
 8   id_09   29436 non-null  float64
 9   id_10   29436 non-null  float64
 10  id_11   61016 non-null  float64
 11  id_12   62192 non-null  object 
 12  id_13   60121 non-null  float64
 13  id_14   2396 non-null   float64
 14  id_15   61016 non-null  object 
 15  id_16   51136 non-null  object 
 16  id_17   60724 non-null  float64
 17  id_18   15052 non-null  float64
 18  id_19   60701 non-null  float64
 19  id_20   60656 non-null  float64
 20  id_21   495 non-null    float64
 21  id_22   495 non-null    float64
 22  id_23   495 non-null    object 
 23  id_24   274 non-null    float64
 24  id_25   484 non-null    float64
 25  id_26   494 non-null    float64
 26  id_27   495 non-null    object 
 27  id_28   61016 non-null  object 
 28  id_29   61016 non-null  object 
 29  id_30   0 non-null      object 
 30  id_31   60898 non-null  object 
 31  id_32   0 non-null      float64
 32  id_33   0 non-null      object 
 33  id_34   0 non-null      object 
 34  id_35   61016 non-null  object 
 35  id_36   61016 non-null  object 
 36  id_37   61016 non-null  object 
 37  id_38   61016 non-null  object 
dtypes: float64(23), object(15)
memory usage: 20.4+ MB
None

Based on the information above, the number of features could be reduced by only selecting those that are numeric and have at least 60,000 observations.

# Select the features that are numeric and have at least 60,000 observations.
fraud_numeric = fraud_data.select_dtypes(['number']).drop(columns='TransactionID')

# Select only the features that have at least 60,000 non-missing values.
fraud_numeric2 = pd.DataFrame()
for (columnName, columnData) in fraud_numeric.iteritems():
  enough_data = columnData.notna().sum() > 60000
  if enough_data:
      fraud_numeric2[columnName] = columnData

# Then select only the rows that do not have any missing values.
fraud_numeric3 = fraud_numeric2.dropna(axis='rows')
print(fraud_numeric3.info())
print(list(fraud_numeric3.columns))

# Check that there are no missing values in this new fraud data
print('There are '+str(fraud_numeric3.isna().sum().sum())+' missing values')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 51660 entries, 0 to 68518
Columns: 203 entries, isFraud to id_20
dtypes: float64(200), int64(3)
memory usage: 80.4 MB
None
['isFraud', 'TransactionDT', 'TransactionAmt', 'card1', 'card2', 'card3', 'card5', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11', 'C12', 'C13', 'C14', 'D1', 'D4', 'D6', 'D10', 'D12', 'D15', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40', 'V41', 'V42', 'V43', 'V44', 'V45', 'V46', 'V47', 'V48', 'V49', 'V50', 'V51', 'V52', 'V53', 'V54', 'V55', 'V56', 'V57', 'V58', 'V59', 'V60', 'V61', 'V62', 'V63', 'V64', 'V65', 'V66', 'V67', 'V68', 'V69', 'V70', 'V71', 'V72', 'V73', 'V74', 'V75', 'V76', 'V77', 'V78', 'V79', 'V80', 'V81', 'V82', 'V83', 'V84', 'V85', 'V86', 'V87', 'V88', 'V89', 'V90', 'V91', 'V92', 'V93', 'V94', 'V95', 'V96', 'V97', 'V98', 'V99', 'V100', 'V101', 'V102', 'V103', 'V104', 'V105', 'V106', 'V107', 'V108', 'V109', 'V110', 'V111', 'V112', 'V113', 'V114', 'V115', 'V116', 'V117', 'V118', 'V119', 'V120', 'V121', 'V122', 'V123', 'V124', 'V125', 'V126', 'V127', 'V128', 'V129', 'V130', 'V131', 'V132', 'V133', 'V134', 'V135', 'V136', 'V137', 'V279', 'V280', 'V281', 'V282', 'V283', 'V284', 'V285', 'V286', 'V287', 'V288', 'V289', 'V290', 'V291', 'V292', 'V293', 'V294', 'V295', 'V296', 'V297', 'V298', 'V299', 'V300', 'V301', 'V302', 'V303', 'V304', 'V305', 'V306', 'V307', 'V308', 'V309', 'V310', 'V311', 'V312', 'V313', 'V314', 'V315', 'V316', 'V317', 'V318', 'V319', 'V320', 'V321', 'id_01', 'id_02', 'id_11', 'id_13', 'id_17', 'id_19', 'id_20']
There are 0 missing values
# Separate response ('isFraud') from features in the fraud data.
fraud_response = fraud_numeric3['isFraud']
fraud_features = fraud_numeric3.drop(columns=['isFraud'])
print(fraud_features.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 51660 entries, 0 to 68518
Columns: 202 entries, TransactionDT to id_20
dtypes: float64(200), int64(2)
memory usage: 80.0 MB
None

The fraud data now contains 51,660 observations on 202 features with no missing values.

If fraud assessors who were familiar with this dataset were available to speak to, they could provide qualitative information about the features that are likely to be good predictors of fraud. This information could then be used to refine the feature selection process.

Assume that you have access to these fraud assessors and they have suggested you the following features might be the most helpful in identifying fraud:

  • id_02;

  • id_19;

  • id_20;

  • V44;

  • V45;

  • V86;

  • V87;

  • C1;

  • C2;

  • C8;

  • C11;

  • C12;

  • C13;

  • C14;

  • card1;

  • card2;

  • TransactionAmt; and

  • TransactionDT.

You will use this reduced feature list to perform your initial modelling.

# Reduce the feature list to those suggested above.
fraud_features_selected = fraud_features[['id_02', 'id_19', 'id_20',
                                           'V44','V45', 'V86', 'V87',
                                           'C1','C2', 'C8', 'C11', 'C12',
                                           'C13', 'C14', 'card1', 'card2',
                                           'TransactionAmt', 'TransactionDT']]

print(fraud_features_selected.head())
      id_02  id_19  id_20  V44  V45  V86  V87   C1   C2   C8  C11  C12  C13  \
0  191631.0  410.0  142.0  1.0  1.0  1.0  1.0  1.0  4.0  1.0  2.0  2.0  2.0   
1  221832.0  176.0  507.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0   
2  116098.0  410.0  142.0  2.0  2.0  2.0  2.0  2.0  5.0  1.0  2.0  2.0  3.0   
3  257037.0  484.0  507.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0   
4  287959.0  254.0  507.0  1.0  1.0  1.0  1.0  1.0  2.0  2.0  1.0  1.0  2.0   

   C14  card1  card2  TransactionAmt  TransactionDT  
0  1.0  16496  352.0          75.887          86549  
1  1.0   4461  375.0          16.495          86555  
2  2.0  13329  569.0          75.887          87209  
3  1.0  15885  545.0          42.294          87317  
4  1.0  12730  266.0           3.595          87317  
# Create a scaler so that the features in the dataset can be
# scaled to have a mean of 0 and a standard deviation of 1. 
scaler = StandardScaler()
fraud_features_scaled = scaler.fit_transform(fraud_features_selected)

Modelling#

This section:

  • fits a model;

  • evaluates the fitted model;

  • improves the model; and

  • selects a final model.

Fit model#

# Perform K-means clustering on the fraud dataset.

# Create KMeans models with values of k in the range 1 to 9.
clusters = range(1, 10)
seed = 50
kmeans = [KMeans(n_clusters=k,random_state=seed) for k in clusters]
  # Setting random_state to a fixed value makes sure that the same results are
  # achieved each time the model is run for each value of K.
# Fit the models and create an elbow curve to help select a value of
# K that provides a good clustering outcome, wihtout overfitting the data.

# The cluster score for each value of k is calculated to determine
# the one that gives the best clustering outcome.
# The cluster score (WCSS) measures the (negative) sum of squared
# distances of observations to their closest cluster centroid,
# so a smaller score indicates a better clustering of the data.

# Note that this step may take a while to run as the kmeans model needs to be
# fitted 10 times.
score = [kmeans[k].fit(fraud_features_scaled).score(fraud_features_scaled) for
         k in range(len(kmeans))]

# Plot the elbow curve.
plt.plot(clusters,score)
plt.xlabel('Number of clusters (K)')
plt.ylabel('Cluster score')
plt.title('Elbow curve')
plt.show()
../_images/d9ad9d2b092c887916ec2cc86ad04cd0a6fdc71640828d78012f299d4d16368f.png

The elbow curve shows a slight ‘kink’ at K = 2 and another at K = 3, suggesting that these might be good values of K to use such that the clustering produces a small sum of squared distances but does not overfit the training data.

The model below will use a value of K = 3.

# Find outliers in the clusters to identify potential cases of fraud.

# The steps below will flag cases as fraud if they are far from their
# cluster centroid. In other words, they will be flagged as fraud if they
# look unusual compared to other observations in their cluster.

# Extract the selected kmeans model with K = 3 from the list of all
# kmeans models fitted.
# Note: because indexing starts at 0, the model with K = 3 has the list index 2.
kmeans_selected = kmeans[2]

# Get the cluster number for each observation in the fraud dataset.
labels = kmeans_selected.predict(fraud_features_scaled)

# Get the cluster centroids so that the distance from each observation to the
# centroid can be calculated to select cases to be flagged as fraud.
centroids = kmeans_selected.cluster_centers_

# Calculate the distance between each observation and its cluster centroid.
distance = [np.linalg.norm(X-Z) for X,Z in zip(fraud_features_scaled,centroids[labels])]
  # np.linalg.norm is the 'norm' function from Numpy's linear algebra package
  # and is used to calculate the Euclidean distance between two points,
  # in this case between each observation (X) and its cluster centroid (Z).

# From the exploration of data above, it was seen that approximately 12% of
# Product Code C transactions were fraudulent. The code below will therefore
# classify an observation as fraudulent if its distance from its
# cluster centroid is above the 88th percentile of all distances in the cluster.

fraud_boundary = 88
fraud_prediction = np.array(distance)
fraud_prediction[distance>=np.percentile(distance, fraud_boundary)] = 1
  # Assign an observation as being fraudulent (1) if it is outside the
  # 88th percentile of observations in its cluster.
fraud_prediction[distance<np.percentile(distance, fraud_boundary)] = 0
  # Otherwise assigne the observation as being non fraudulent (0).
print(fraud_prediction)
[0. 0. 1. ... 0. 1. 0.]

Evaluate model#

Unsupervised learning often involves working with datasets that do not contain response variable. In these cases, the output of a clustering algorithm can be evaluated using internal or manual evaluation techniques as described in Section 6.3.5 of Module 6.

An example of internal evaluation was shown above when the clustering scor (WCSS) was calculated based on the sum of squared distances between observations and their centroid clusters.

In fraud detection exercises, manual evaluation might be carried out by showing a fraud expert the observations in the training dataset that were classified as fraud and asking them to evaluate the likelihood that these observations were in fact fraud.

In this case study, response variables are available in the dataset, so these can be used to conduct external evaluation by comparing known cases of fraud to predicted cases of fraud. This external evaluation is performed below.

# Calculate and plot a confusion matrix to show True Positives, False Positives,
# True Negatives and False Negatives.

# Compute the confusion matrix.
cm = confusion_matrix(fraud_response, fraud_prediction) 
# Plot the confusion matrix.
plot_confusion_matrix(cm, classes = ['Not fraud','Fraud'])

# Calculate the F1 score for these predictions.
# F1 = 2 x (precision * recall) / (precision + recall).
print('The F1 score is '+str(f1_score(fraud_response, fraud_prediction)))
The F1 score is 0.44971996990721397
../_images/25cd5e87fedf9ac43fe91a5ca90f25d9d30bcaa8851e9d53b19e27823b45603d.png

The confusion matrix shows that 2,690 (47%) of fraudulent transactions have been correctly classified as ‘Fraud’. It also shows that 3,510 (8%) of non fraudulent transactions that predicted to be fraud.

This suggests that there is room to improve the model to make it a more helpful tool for the fraud detection team to use.

Improve model#

The model might be improved by trying different values for the fraud_boundary or K, or by experimenting with different features used in the clustering.

The code below tries some different values for the fraud_boundary and K. You should also experiment to see if you can obtain better fraud predictions from the model, either by trying other values for fraud_coundary or K, or by trying different features to use in the clustering.

# Try different values for K.

# Extract the selected kmeans model with K = 4 from the list of all kmeans
# models fitted.
# Note: because indexing starts at 0, the model with K = 4 has the list index 3.
kmeans2 = kmeans[3]

# Get the cluster number for each observation in the fraud dataset.
labels2 = kmeans2.predict(fraud_features_scaled)

# Get the cluster centroids so that the distance from each observation to the
# centroid can be calculated to select cases to be flagged as fraud.
centroids2 = kmeans2.cluster_centers_

# Calculate the distance between each observation and the cluster centroid.
distance2 = [np.linalg.norm(X-Z) for X,Z in zip(fraud_features_scaled,
                                                centroids2[labels2])]

# This time the fraud_boundary will be set to 80 so that more transactions will be
# identified as fraud.
fraud_boundary2 = 80
fraud_prediction2 = np.array(distance2)
fraud_prediction2[distance2>=np.percentile(distance2, fraud_boundary)] = 1
fraud_prediction2[distance2<np.percentile(distance2, fraud_boundary)] = 0
print(fraud_prediction2)

# Calculate and plot a confusion matrix to show True Positives, False Positives,
# True Negatives and False Negatives.

cm = confusion_matrix(fraud_response, fraud_prediction2) 
plot_confusion_matrix(cm, classes = ['Not fraud','Fraud'])

# Compare the predictions from Kmeans1 and Kmeans2
{'KMeans1 F1':f1_score(fraud_response, fraud_prediction),
 'KMeans2 F1':f1_score(fraud_response, fraud_prediction2),
 }
[0. 0. 0. ... 0. 0. 0.]
{'KMeans1 F1': 0.44971996990721397, 'KMeans2 F1': 0.46225863077823287}
../_images/4fcb36be838517df042e815ffc9697eff53cae3297ab709e9c9f97a50c351688.png

This model has a slightly higher F1 score, indicating it provides a slightly better fraud prediction than the original model.

Evaluation / observations#

This fraud detection model:

  • detects about 48% of true fraud cases in the training data (2,766 out of 5,763); and

  • classifies about 7.5% of true non-fraud cases in the training data as being fraudulent (3,434 out of 45,897).

Further, of the 6,200 transactions classified as fraud, only 45% of them are actual cases of fraud.

Therefore, this model probably requires further refinement to avoid missing too many true cases of fraud and to avoid classifying too many non-fraud cases as fraud.

It is likely that another classifier, such as a GBM or a neural network, would make better fraud predictions based on the training data. However, the power of the model presented above is that it does not need examples of fraud to be able to make a reasonable attempt at detecting them. This means that this ‘outlier’ classifier might be better at identifying new types of fraud that emerge over time than another classifier might be.

Further, while this case study has not provided a very good classification of fraudulent transactions based on the training data, it has demonstrated how clustering might be used to detect outliers in a dataset.

Afterword#

The output of fraud detection is a ‘short-list’ of possible outliers. Incorrectly accusing someone of fraudulent behaviour can have very detrimental financial and reputational impacts on an organisation, as well as likely causing the accused individual significant stress. For this reason, the results of a fraud detection activity usually need to be reviewed manually by a human to determine whether fraud has occurred or is likely to occur in each case. Further monitoring and/or investigation might be required before deciding whether fraud has occurred. Again, when conducting fraud investigations, it is important to consider the potential harm that can be caused to someone undergoing investigation.

At the same time, letting fraud go undetected and uncorrected can also be very costly for an organisation. Therefore, it is important to track how well any fraud detection algorithm performs over time. As the nature of fraud changes over time, it is important to assess whether the algorithm is keeping pace with the changing environment.