R: Baudry ML Reserving Pt 3#

This article was originally created by Nigel Carpenter and published in the General Insurance Machine Learning for Reserving Working Party (“MLR-WP”) blog. The MLR-WP is an international research group focused on applying machine learning techniques to reserving, with over 50 actuaries from around the globe. The goal of the group is to bring machine learning techniques into widespread adoption ‘on the ground’ by identifying the barriers, communicating the benefits, and helping develop the research techniques in pragmatic ways. Whilst some articles have been brought into this cookbook, consider exploring the blog further for additional content, including detailed walkthroughs of more advanced models.

Introduction#

This is the third notebook in a series of three that outlines and elaborates upon the code used to replicate the central scenario in Maximilien Baudry’s paper “NON-PARAMETRIC INDIVIDUAL CLAIM RESERVING IN INSURANCE” (Presentation Paper).

In this notebook we step through the process to apply machine learning techniques in order to create reserve estimates following the techniques set out in sections 3 and 4 of Baudry’s paper.

Preparations before modelling#

The reserving data structures built in this notebook are from a simulated phone insurance dataset. The creation of that simulated data and machine learning data structures for reserving have been set out in the first and second notebooks of this series.

Rather than repeat that code in this notebook, the key routines and functions have been put into the R script Baudry_functions_v2.R, enabling us to quickly load them for use in this notebook.

# Importing packages
library(knitr)
library(rmdformats)
library(data.table)
library(magrittr)
library(lubridate)
library(ggplot2)
library(cowplot)
library(repr)
library(kableExtra)
library(formattable)
library(IRdisplay) # displays tables when in ipynb format

library(xgboost)
library(tidymodels)
library(SHAPforxgboost)

source("./Baudry_functions_v2.R")

Simulate data and prepare database#

We can now quickly repeat the process from Notebooks 1 and 2 using the functions in the script.

Simulate policy and claim data#

We start with a simulated phone insurance policy and claim dataset. I am calling the function from Notebook 1 of this series to create the dataset. Using a fixed seed will ensure you get a reproducible simulated dataset.

dt_PhoneData <- simulate_central_scenario(1234)

Join policy and claim data#

We then join the simulated phone insurance policy and claim dataset. I have taken the code from Notebook 2 and wrapped it into a function to perform the join and subsequent data tidying.

Note that if you are working with your own company data you would likely need to amend this function to work with the specifics of your data.

dt_polclaim <- join_policy_claim(dt_PhoneData,
                                 date_pol_start = "date_UW",
                                 date_pol_end = "date_lapse",
                                 date_occur = "date_occur")

Timeslice policy claim#

We then time slice the joined policy and claim dataset. Again, I have taken the code from Notebook 2 and wrapped it into a function to perform the time slicing, given a set of dates that define the time slice periods.

Again, note that this function would need to be re-written to work with your own company data; more importantly, it would not currently work correctly with claims involving partial payments.

lst_Date_slice <- floor_date(seq(as.Date("2016/1/1"), as.Date("2019/06/30"), by = 30), unit= "second") 
dt_polclaim <- time_slice_polclaim(dt_polclaim, lst_Date_slice)

Select valuation date#

Before we create the RBNS and IBNR datasets we must pick a valuation point from the available time slices. I’m selecting the 10th point for illustration, i.e. the 10th 30-day period from 01/01/2016, giving a valuation date of 27/09/2016.

i <- valuation <- 10                              # index of the chosen time slice
t_i <- lst_Date_slice[i]                          # valuation date
delta <- min(i, length(lst_Date_slice) - i + 1)   # bounded by the number of slices before and after the valuation point
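
As a quick check of the valuation date implied by that choice, we can look up the 10th time slice directly (a one-line check that is not part of the original code):

# check the valuation date implied by the 10th time slice
format(as.Date(lst_Date_slice[i]), "%d/%m/%Y")   # expected "27/09/2016"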

Creating RBNS dataset#

We start with the RBNS dataset, first defining the list of features that will be used in the RBNS model dataset.

As you follow this example you may observe that we are not using the historic claim payment information as an explanatory feature in the RBNS reserve prediction model. Baudry’s paper and approach do allow for the inclusion of such information, so it may seem somewhat strange not to include it in this worked example.

I believe the reason it is not included is a result of the way the data has been simulated, specifically that simulated claims are settled with a single payment. That assumption, combined with the fact that this method is being applied to individual claim level data, prevents the previous time period’s payment information being used by a machine learning model. That is because the previous period’s payment amount can only ever take the value zero for a claim requiring an RBNS reserve. (Claims where the previous payment is non-zero have been paid and settled, i.e. they no longer require an RBNS reserve.) Consequently, in this central simulated example, the prior payment amount is uninformative in the machine learning process and therefore cannot be included as a feature in the RBNS dataset.

The final point to note is that the code shared here deals only with the simplified payment process from Baudry’s central scenario, i.e. claims are settled with a single payment. Real world data with partial payments, and prior payments as explanatory features, would require material changes to be made to the code we have shared. Such changes are left for interested readers to make.

#define modelVars
RBNS_model_vars <- c("clm_number",
                     "pol_number",
                     "j",
                     "k",
                     "date_pol_start",
                     "date_occur",
                     "date_report",
                     "date_pay",
                     "Cover",
                     "claim_type",
                     "Brand",
                     "Model",
                     "Price",
                     "target"
    )


# Create a combined TRAIN dataset for k = 1 and all valid j delay values
dt_RBNS_train <- RBNS_Train(dt_polclaim, t_i, i, k = 1, lst_Date_slice, RBNS_model_vars)

# Create a combined TEST dataset for k = 1 and all valid j delay values
dt_RBNS_test <- RBNS_Test(dt_polclaim, t_i, delta, k = 1, lst_Date_slice, RBNS_model_vars)

The train and test datasets are then joined into a single dataset and a small amount of tidying is done to make them ready for use.

# Add a flag to determine which rows are from the trainset and which from the test set
dt_RBNS_train[, flgTrain := 1]
dt_RBNS_test[, flgTrain := 0]

# combine into a single RBNS dataset   
dt_All_RBNS <- rbind(dt_RBNS_train, dt_RBNS_test)
#write.csv(dt_All_RBNS,"dt_All_RBNS.csv", row.names = F)

The important aspects of the tidying relate to creating usable delay metrics from the numerous dates and converting some character features, such as cover and claim type, into factors.

# order and create some delay fields
setkey(dt_All_RBNS, clm_number, k, j)
    
dt_All_RBNS[, Count := .N , by =clm_number]

#create delay measures and convert time measures from intervals of seconds to intervals of days
dt_All_RBNS[, ':='(
  delay_uw_occ = ifelse(year(date_occur) == 2199,
                        -1,
                        ceiling((as.numeric(date_occur) - as.numeric(date_pol_start)) / (24 * 60 * 60))
                        ),
  delay_occ_rep = ifelse(year(date_occur) == 2199,
                         -1,
                         ceiling((as.numeric(date_report) - as.numeric(date_occur)) / (24 * 60 * 60))
                         ),
  delay_uw_val = ceiling((as.numeric(t_i) - as.numeric(date_pol_start)) / (24 * 60 * 60)),
  delay_rep_pay = ceiling((as.numeric(date_pay) - as.numeric(date_report)) / (24 * 60 * 60)),
  
  date_uw = ceiling(as.numeric(date_pol_start) / (24 *  60 * 60)),
  Cover = as.factor(Cover),
  claim_type = as.factor(claim_type)
  )]
  

Creating IBNR dataset#

We can then create the IBNR dataset following a similar procedure, starting by defining the list of features that will be used in the IBNR model. Then we call the functions that convert the timesliced policy and claim data into the IBNR train and test datasets.

#define IBNR modelVars
IBNR_model_vars <- c("clm_number",
                     "pol_number",
                     "j",
                     "k",
                     "exposure",
                     "date_pol_start",
                     "date_occur",
                     "date_report",
                     "date_pay",
                     "Cover",
                     "Brand",
                     "Model",
                     "Price",
                     "target")
    
# Create a combined TRAIN dataset for k = 1 and all valid j delay values
lst_IBNR_train <- IBNR_Train(dt_polclaim, t_i, i, k = 1,lst_Date_slice, IBNR_model_vars)

# Create a combined TEST dataset for k = 1 and all valid j delay values
dt_IBNR_test <- IBNR_Test(dt_polclaim, t_i, delta, k = 1,lst_Date_slice, IBNR_model_vars)

The train and test datasets are then joined into a single dataset and a small amount of tidying is done to make them ready for use.

lst_IBNR_train$Freq[, flgTrain := 1]
lst_IBNR_train$Loss[, flgTrain := 2]
dt_IBNR_test[, flgTrain := 0]

dt_All_IBNR <- rbind(lst_IBNR_train$Freq,
                     lst_IBNR_train$Loss,
                     dt_IBNR_test)
#write.csv(dt_All_IBNR,"dt_All_IBNR.csv", row.names = F)
# tidy up
rm(lst_IBNR_train)
rm(dt_IBNR_test)
gc()
            used  (Mb) gc trigger   (Mb) limit (Mb)  max used  (Mb)
Ncells   2945462 157.4    5633064  300.9         NA   4008351 214.1
Vcells  75769710 578.1  149517738 1140.8      16384 123984472 946.0

The important aspects of the tidying relate to creating usable delay metrics from the numerous dates and converting some character features, such as cover, into factors.

# order and create some delay fields
setkey(dt_All_IBNR, clm_number, k, j)
    
dt_All_IBNR[, Count := .N , by =clm_number]
dt_All_IBNR[,':='( delay_uw_occ = ifelse(year(date_occur) == 2199,
                                        -1,
                                        ceiling((as.numeric(date_occur) - as.numeric(date_pol_start))
                                                  /(24*60*60))
                                          ),
                   delay_occ_rep = ifelse(year(date_occur) == 2199,
                                          -1,
                                          ceiling((as.numeric(date_report) - as.numeric(date_occur))
                                                  /(24*60*60))
                                          ),
                   delay_rep_pay = ifelse(year(date_occur) == 2199,
                                          -1,
                                          ceiling((as.numeric(date_pay) - as.numeric(date_report))
                                                  /(24*60*60))
                                          ),
                   delay_uw_val = ceiling((as.numeric(t_i) - as.numeric(date_pol_start))/(24*60*60)),
                   date_uw = ceiling(as.numeric(date_pol_start)/(24*60*60)),
                   Cover = as.factor(Cover))]

That finishes the data preparation. We can now move on to building RBNS and IBNR models and using them to predict the current reserve requirements.

Model build#

I will use the xgboost R machine learning package to build the models. In Baudry’s original paper he used the extraTrees package, a variant of Random Forests.

xgboost has been selected as it is one of the most popular implementations of the gradient boosting machine (GBM) algorithm. Since around 2015, GBMs have consistently been among the most accurate algorithmic techniques for regression and classification tasks on tabular data.

The code and modeling process for each reserve type is very similar, so they are shown in the tabbed sections below rather than being repeated in-line.

RBNS model#

Starting with the RBNS reserves, we will first build a model on the training data using a cross-validated approach to enable selection of key xgboost hyperparameters.

Then, using the optimal parameters, we retrain on the complete training dataset. This trained model is then scored against the test data in order to create the predictions from which the RBNS reserve can be calculated.

Creating xgboost dataset#

xgboost requires its data to be in a specific format: a special matrix form called a DMatrix, created with the xgb.DMatrix function. Not all variable types can be passed to xgb.DMatrix; in particular, categorical or nominal variables such as Brand have to be converted from text to numeric values.

For this example I’ve chosen to use the recipes package (loaded as part of tidymodels) and create a recipe that converts nominal predictor values into a one-hot encoded form.

RBNS_predictors <- c("j",
                     "k",
                     "Cover",
                     "claim_type",
                     "Brand",
                     "Model",
                     "Price",
                     #"date_uw",
                     #"delay_uw_occ",
                     "delay_occ_rep")

rowList_RBNS <- list(train=dt_All_RBNS[, which(flgTrain==1)],
                test=dt_All_RBNS[, which(flgTrain==0)],
                all = 1:nrow(dt_All_RBNS))


RBNS_rec <- recipe( ~ ., data = dt_All_RBNS[, RBNS_predictors, with = FALSE]) %>%
  step_dummy(all_nominal(), one_hot = TRUE) %>%
  prep()


df.RBNS_train <- bake(RBNS_rec, new_data = dt_All_RBNS[rowList_RBNS$train,] )
df.RBNS_test <- bake(RBNS_rec, new_data = dt_All_RBNS[rowList_RBNS$test,] )
df.RBNS_all <- bake(RBNS_rec, new_data = dt_All_RBNS )


xgb.RBNS_DMat.train <- xgb.DMatrix(data = as.matrix(df.RBNS_train),
                              label = dt_All_RBNS[rowList_RBNS$train, target])

xgb.RBNS_DMat.test <- xgb.DMatrix(data = as.matrix(df.RBNS_test),
                              label = dt_All_RBNS[rowList_RBNS$test, target])

xgb.RBNS_DMat.all <- xgb.DMatrix(data = as.matrix(df.RBNS_all),
                             label = dt_All_RBNS[, target])

Fit initial model using cross validation#

Having prepared the data for xgboost, I can now fit an initial model. I’ve used a simple set of hyper-parameters and selected the optimal number of boosted trees (nrounds) for these settings using cross validation, by calling the xgb.cv function with early stopping.

I have used the reg:tweedie objective function based upon inspection of the target variable.

summary(dt_All_RBNS[rowList_RBNS$train, target])
hist(dt_All_RBNS[rowList_RBNS$train, target])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    0.00   76.44    0.00  913.00 
../_images/e95dc9ed6aad307844a34d2e09a7adff2b4577afdbd70b415ae8f4ef9115f713.png

In a real-world example, a hyper-parameter tuning exercise would be undertaken in order to select better xgboost hyper-parameters.
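
For illustration, such a search might loop over a small grid of candidate settings, call xgb.cv for each and keep the settings with the best cross-validated score. The sketch below is indicative only; the grid values are arbitrary and were not used to produce the results in this notebook.

# Illustrative hyper-parameter grid search (sketch only; grid values are arbitrary)
base_param <- list(objective = "reg:tweedie",
                   subsample = 0.75,
                   colsample_bytree = 0.8,
                   min_child_weight = 10)

tune_grid <- expand.grid(max_depth = c(2L, 4L), eta = c(0.05, 0.1))

cv_results <- lapply(seq_len(nrow(tune_grid)), function(g) {
  set.seed(1984)
  cv <- xgb.cv(params = c(base_param, as.list(tune_grid[g, ])),
               data = xgb.RBNS_DMat.train,
               nrounds = 500,
               nfold = 5,
               early_stopping_rounds = 10L,
               verbose = FALSE)
  # keep the grid point, its best iteration and the CV metrics at that iteration
  cbind(tune_grid[g, ],
        best_iteration = cv$best_iteration,
        cv$evaluation_log[cv$best_iteration, ])
})

do.call(rbind, cv_results)

For this worked example we simply proceed with the fixed settings below.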

param <- list(
  objective = "reg:tweedie",
  max_depth = 2L,            # tree-depth
  subsample = 0.75,          # randomly sample rows before fitting each tree
  colsample_bytree = 0.8,    # randomly sample columns before fitting each tree
  min.child.weight = 10,     # minimum weight per leaf
  eta = 0.1                  # Learning rate
)


# Train model with cross validation
set.seed(1984) # for repeatability

xgb_RBNS_CV <- xgb.cv(
  params                 = param,
  data                   = xgb.RBNS_DMat.train,
  nrounds                = 500,        # Maximum number of trees to build
  nfold = 5,
  early_stopping_rounds  = 10L,        # Stops algorithm early if performance has not improved in n rounds
  print_every_n          = 10L,        # How often to print to console
  prediction             = TRUE        # Keeps the predictions
)
[1]	train-tweedie-nloglik@1.5:197.200147+1.066135	test-tweedie-nloglik@1.5:197.212884+4.843435 
Multiple eval metrics are present. Will use test_tweedie_nloglik@1.5 for early stopping.
Will train until test_tweedie_nloglik@1.5 hasn't improved in 10 rounds.

[11]	train-tweedie-nloglik@1.5:76.197815+0.211998	test-tweedie-nloglik@1.5:76.203892+2.163219 
[21]	train-tweedie-nloglik@1.5:34.536194+0.258204	test-tweedie-nloglik@1.5:34.527821+1.059959 
[31]	train-tweedie-nloglik@1.5:22.707742+0.180634	test-tweedie-nloglik@1.5:22.711464+0.644633 
[41]	train-tweedie-nloglik@1.5:20.142523+0.042449	test-tweedie-nloglik@1.5:20.169496+0.415564 
[51]	train-tweedie-nloglik@1.5:19.651481+0.052902	test-tweedie-nloglik@1.5:19.680118+0.344643 
[61]	train-tweedie-nloglik@1.5:19.508132+0.061030	test-tweedie-nloglik@1.5:19.550804+0.321765 
[71]	train-tweedie-nloglik@1.5:19.449044+0.065773	test-tweedie-nloglik@1.5:19.501728+0.312207 
[81]	train-tweedie-nloglik@1.5:19.419057+0.070073	test-tweedie-nloglik@1.5:19.484397+0.310898 
[91]	train-tweedie-nloglik@1.5:19.400698+0.071544	test-tweedie-nloglik@1.5:19.473326+0.308729 
[101]	train-tweedie-nloglik@1.5:19.388735+0.071780	test-tweedie-nloglik@1.5:19.470725+0.307247 
[111]	train-tweedie-nloglik@1.5:19.376003+0.072908	test-tweedie-nloglik@1.5:19.467906+0.299997 
[121]	train-tweedie-nloglik@1.5:19.360954+0.073267	test-tweedie-nloglik@1.5:19.473449+0.295966 
Stopping. Best iteration:
[112]	train-tweedie-nloglik@1.5:19.374443+0.072647	test-tweedie-nloglik@1.5:19.467368+0.300002

Having fit the model we store the out-of-fold predictions.

dt_All_RBNS[rowList_RBNS$train, preds_oof := xgb_RBNS_CV$pred]

Fit final model on all training data#

Having fit the model using 5-fold cross validation, we observe the optimum number of fitting rounds (xgb_RBNS_CV$best_iteration) to be 112.

We can then use this to train a final model on all of the training data.

xgb_RBNS_Fit <- xgb.train(
  params                 = param,
  data                   = xgb.RBNS_DMat.train,
  nrounds                = xgb_RBNS_CV$best_iteration,
# base_score             = 1,
  watchlist              = list(train=xgb.RBNS_DMat.train, test=xgb.RBNS_DMat.test) ,
  print_every_n          = 10
)

dt_All_RBNS[, preds_full := predict(xgb_RBNS_Fit,xgb.RBNS_DMat.all)]
[1]	train-tweedie-nloglik@1.5:197.147781	test-tweedie-nloglik@1.5:82.766273 
[11]	train-tweedie-nloglik@1.5:75.850830	test-tweedie-nloglik@1.5:32.106102 
[21]	train-tweedie-nloglik@1.5:34.154110	test-tweedie-nloglik@1.5:14.514384 
[31]	train-tweedie-nloglik@1.5:22.479488	test-tweedie-nloglik@1.5:9.615996 
[41]	train-tweedie-nloglik@1.5:20.106234	test-tweedie-nloglik@1.5:8.624282 
[51]	train-tweedie-nloglik@1.5:19.618027	test-tweedie-nloglik@1.5:8.393838 
[61]	train-tweedie-nloglik@1.5:19.490944	test-tweedie-nloglik@1.5:8.321967 
[71]	train-tweedie-nloglik@1.5:19.442141	test-tweedie-nloglik@1.5:8.283491 
[81]	train-tweedie-nloglik@1.5:19.419815	test-tweedie-nloglik@1.5:8.265779 
[91]	train-tweedie-nloglik@1.5:19.399960	test-tweedie-nloglik@1.5:8.258656 
[101]	train-tweedie-nloglik@1.5:19.387503	test-tweedie-nloglik@1.5:8.257319 
[111]	train-tweedie-nloglik@1.5:19.373995	test-tweedie-nloglik@1.5:8.257586 
[112]	train-tweedie-nloglik@1.5:19.373476	test-tweedie-nloglik@1.5:8.257521 

Inspect model fit#

Having fitted the full model we can then inspect the model fit. The traditional way of inspecting global model feature importance is to plot feature importance based on gain.

#default feature importance by gain
featImp_RBNS <- xgb.importance(xgb_RBNS_Fit, feature_names = colnames(xgb.RBNS_DMat.train))
xgb.plot.importance(featImp_RBNS, main="Feature Importance - RBNS")
../_images/68cb5aca06fa1989ac6e6f6d7ebe81a13cad416164784946bd17bd9453fe1dac.png

An increasingly popular and more robust approach is to use SHAP values (https://github.com/slundberg/shap). The SHAP equivalent of the feature importance chart is shown below.

# Return the SHAP values and ranked features by mean|SHAP|
shap_values <- shap.values(xgb_model = xgb_RBNS_Fit, X_train = as.matrix(df.RBNS_train))

# Prepare the long-format data:
shap_long <- shap.prep(shap_contrib = shap_values$shap_score, X_train =  as.matrix(df.RBNS_train))

# **SHAP summary plot**
shap.plot.summary(shap_long, dilute = nrow(df.RBNS_train)/10000)
../_images/64912111d87ace6bba3773b253904eada4cad17475f89e03ca4819e5a8c06e96.png

A second useful chart is the partial dependence plot. This shows how the values of a predictive (input) feature influence the predicted (output) value, while holding the values of all other predictive (input) features constant. It is also known as the marginal effect.

Here we show the partial dependence plots for the top 4 SHAP features.

fig_list <- lapply(names(shap_values$mean_shap_score)[1:4], 
                   shap.plot.dependence,
                   data_long = shap_long,
                   dilute = nrow(shap_long)/ 10000)

gridExtra::grid.arrange(grobs = fig_list, ncol = 2)
`geom_smooth()` using formula 'y ~ x'

`geom_smooth()` using formula 'y ~ x'

Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“pseudoinverse used at -0.005”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“neighborhood radius 1.005”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“reciprocal condition number  3.0861e-30”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“There are other near singularities as well. 1.01”
`geom_smooth()` using formula 'y ~ x'

`geom_smooth()` using formula 'y ~ x'

Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“pseudoinverse used at 0.985”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“neighborhood radius 2.015”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“reciprocal condition number  1.3565e-15”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“There are other near singularities as well. 4”
../_images/2e39e4c3f985d4bbaaad4ee81556a5c8257e4704409d2b6d8910f368c7010c30.png

The feature importance and partial dependency plots provide quick insight into the model.

  • We see that claim development period, j, is the most important feature and that the RBNS reserve is smaller for larger values of j.

  • The second most important feature is claim_type, where breakage claims are associated with lower RBNS reserves. This is as expected from the data generating process, where breakage claims come from a Beta distribution with a lower mean.

  • The third most important feature is phone price, where there is a linear increase in RBNS reserve with increasing phone price. This is also as expected from the data generating process.

  • The fourth feature is phone brand which again follows expectations from the data generating process.

SHAP values can also be used to show the components of a single prediction. In the following plot we show the top 4 components for each row of the data and zoom in at row 500.

# choose to show top 4 features by setting `top_n = 4`, set 6 clustering groups.  
plot_data <- shap.prep.stack.data(shap_contrib = shap_values$shap_score,
                                  data_percent = 10000/nrow(shap_long),
                                  top_n = 4,
                                  n_groups = 6)
  
# choose to zoom in at location 500, set y-axis limit using `y_parent_limit`  
# it is also possible to set y-axis limit for zoom-in part alone using `y_zoomin_limit`  
shap.plot.force_plot(plot_data, zoom_in_location = 500, y_parent_limit = c(-1,1))
  
The SHAP values of the Rest 8 features were summed into variable 'rest_variables'.


Data has N = 833 | zoom in length is 83 at location 500.
../_images/20f1b7c05520411ac480321bd305de8bec3d8d917aef4e1ad1d7568dc6f0e741.png

Summarising RBNS reserves#

By comparing the model predictions to the simulated claims run-off we can get a feel for the accuracy of the machine learning approach. Below I aggregate the RBNS predictions by claim occurrence month and compare them to the known simulated claim run-off.

dt_All_RBNS [, date_occur_YYYYMM := as.character(year(date_occur) + month(date_occur)/100 )]

dt_RBNS_summary <- dt_All_RBNS[rowList_RBNS$test,.(preds = sum(preds_full), target = sum(target)), keyby = date_occur_YYYYMM]
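
The table below also shows Diff and Diff_pcnt columns and a Total row, which are not produced by the snippet above. A minimal sketch of how they could be derived from dt_RBNS_summary is given here; the same approach applies to the IBNR summary tables later in the notebook.

# Sketch only: add the difference columns and a Total row for display
dt_RBNS_summary[, ':='(Diff = preds - target,
                       Diff_pcnt = (preds - target) / target)]

rbind(dt_RBNS_summary,
      dt_RBNS_summary[, .(date_occur_YYYYMM = "Total",
                          preds = sum(preds),
                          target = sum(target),
                          Diff = sum(Diff),
                          Diff_pcnt = sum(Diff) / sum(target))])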
date_occur_YYYYMM preds target Diff Diff_pcnt
2016.04 1,341 911 430 47.2%
2016.05 5,442 5,188 254 4.9%
2016.06 12,704 14,257 -1,553 -10.9%
2016.07 56,579 53,815 2,764 5.1%
2016.08 325,344 326,631 -1,287 -0.4%
2016.09 619,875 622,179 -2,304 -0.4%
Total 1,021,285 1,022,981 -1,696 -0.2%

You can now jump back up to the beginning of the modeling section and select the IBNR frequency modeling tab.

IBNR Frequency model#

IBNR reserves are built by multiplying the outputs of a claim frequency and a claim severity model. The general process of building the model follows that of the RBNS reserves.

Creating xgboost dataset#

Again we convert the data into a specific matrix form called a DMatrix. However, before doing so, we aggregate the data to reduce model fitting times. Aggregation could lead to some loss of accuracy, so in practice this is something you would want to experiment with.

The other point to note is the values of flgTrain used to identify the training and test rows in our dataset. Recall from Notebook 2 that, in the IBNR dataset, training rows for the frequency model have a flgTrain value of 1 whereas the severity training rows have a value of 2.

IBNR_predictors <- c("j",
                     "k",
                     "Cover",
                     "Brand",
                     "Model",
                     "Price",
                     "date_uw")

# aggregate the data ... does this lead to loss of variance and accuracy?
dt_All_IBNR [, date_pol_start_YYYYMM := as.character(year(date_pol_start) + month(date_pol_start)/100 )]

dt_All_IBNR_F <- dt_All_IBNR[, .(exposure = sum(exposure),
                                   target_cost = sum(target),
                                   target_count = sum(target>0)),
                               by= c(IBNR_predictors, "date_pol_start_YYYYMM", "flgTrain")]

dt_All_IBNR_F <- dt_All_IBNR_F[exposure>0]


# setup train and test rows
rowList_IBNR_F <- list(train=dt_All_IBNR_F[, which(flgTrain==1)],
                     test=dt_All_IBNR_F[, which(flgTrain==0)],
                     all = dt_All_IBNR_F[, which(flgTrain!=2)])

# setup data for xgboost
IBNR_rec <- recipe( ~ ., data = dt_All_IBNR_F[, IBNR_predictors, with = FALSE]) %>%
  step_dummy(all_nominal(), one_hot = TRUE) %>%
  prep()

df.IBNR_F_train <- bake(IBNR_rec, new_data = dt_All_IBNR_F[rowList_IBNR_F$train,] )
df.IBNR_F_test <- bake(IBNR_rec, new_data = dt_All_IBNR_F[rowList_IBNR_F$test,] )
df.IBNR_F_all <- bake(IBNR_rec, new_data = dt_All_IBNR_F[rowList_IBNR_F$all,] )

xgb.IBNR_F_DMat.train <- xgb.DMatrix(data = as.matrix(df.IBNR_F_train),
                              weight = dt_All_IBNR_F[rowList_IBNR_F$train, exposure],
                              label = dt_All_IBNR_F[rowList_IBNR_F$train, target_count])

xgb.IBNR_F_DMat.test <- xgb.DMatrix(data = as.matrix(df.IBNR_F_test),
                             weight = dt_All_IBNR_F[rowList_IBNR_F$test, exposure],
                             label = dt_All_IBNR_F[rowList_IBNR_F$test, target_count])

xgb.IBNR_F_DMat.all <- xgb.DMatrix(data = as.matrix(df.IBNR_F_all),
                            weight = dt_All_IBNR_F[rowList_IBNR_F$all, exposure],
                            label = dt_All_IBNR_F[rowList_IBNR_F$all, target_count])

Fit initial model using cross validation#

Having prepared the data for xgboost, I can now fit an initial model. I’ve used a simple set of hyper-parameters and selected the optimal number of boosted trees (nrounds) for these settings using cross validation, by calling the xgb.cv function with early stopping.

I have used the count:poisson objective function based upon inspection of the target variable.

hist(dt_All_IBNR_F[rowList_IBNR_F$train, target_count])
../_images/db061b2d416efdd76331e1e36e4fb6ed28d56f405d33cd833ec319d963566a27.png
param <- list(
  objective = "count:poisson",
  max_depth = 2L,           # tree-depth
  subsample = 0.7,          # randomly sample rows before fitting each tree
  colsample_bytree = 0.8,   # randomly sample columns before fitting each tree
  min.child.weight = 10,    # minimum weight per leaf
  eta = 0.1               # Learning rate
  #monotone_constraints = monotone_Vec # Monotonicity constraints
)


# Train model with cross validation
set.seed(1984) # for repeatability

xgb_IBNR_F_CV <- xgb.cv(
 params                 = param,
 data                   = xgb.IBNR_F_DMat.train,
 nrounds                = 2000,        # Maximum number of trees to build
 nfold = 5,

 early_stopping_rounds  = 50L,        # Stops algorithm early if performance has not improved in n rounds
 print_every_n          = 50L,        # How often to print to console
 #base_score             = 0.001,       # Model starting point
   prediction             = TRUE        # Keeps the predictions
)
[1]	train-poisson-nloglik:0.509639+0.000413	test-poisson-nloglik:0.509646+0.001705 
Multiple eval metrics are present. Will use test_poisson_nloglik for early stopping.
Will train until test_poisson_nloglik hasn't improved in 50 rounds.

[51]	train-poisson-nloglik:0.161843+0.002167	test-poisson-nloglik:0.162117+0.005715 
[101]	train-poisson-nloglik:0.136649+0.002226	test-poisson-nloglik:0.137008+0.008028 
[151]	train-poisson-nloglik:0.132533+0.002259	test-poisson-nloglik:0.133765+0.008356 
[201]	train-poisson-nloglik:0.130754+0.002232	test-poisson-nloglik:0.133048+0.008725 
[251]	train-poisson-nloglik:0.129527+0.002201	test-poisson-nloglik:0.132851+0.008859 
[301]	train-poisson-nloglik:0.128530+0.002113	test-poisson-nloglik:0.132896+0.009030 
Stopping. Best iteration:
[266]	train-poisson-nloglik:0.129220+0.002207	test-poisson-nloglik:0.132820+0.008904

Having fit the model we store the out-of-fold predictions for both the claim frequency and the claim counts.

 dt_All_IBNR_F[rowList_IBNR_F$train, preds_oof_IBNR_F := xgb_IBNR_F_CV$pred]
 dt_All_IBNR_F[rowList_IBNR_F$train, preds_oof_IBNR_Nos := exposure * preds_oof_IBNR_F]

Fit final model on all training data#

Having fit the model using 5-fold cross validation, we observe the optimum number of fitting rounds (xgb_IBNR_F_CV$best_iteration) to be 266.

We can then use this to train a final model on all of the training data.

xgb_IBNR_F_Fit <- xgb.train(
   params                 = param,
   data                   = xgb.IBNR_F_DMat.train,
   nrounds                = xgb_IBNR_F_CV$best_iteration,
# base_score             = 1,
   watchlist              = list(train=xgb.IBNR_F_DMat.train, test=xgb.IBNR_F_DMat.test) ,
   print_every_n          = 50
 )
 
 dt_All_IBNR_F[rowList_IBNR_F$all, preds_full_IBNR_Nos := predict(xgb_IBNR_F_Fit,xgb.IBNR_F_DMat.all)]
[1]	train-poisson-nloglik:0.509602	test-poisson-nloglik:0.497374 
[51]	train-poisson-nloglik:0.161682	test-poisson-nloglik:0.119256 
[101]	train-poisson-nloglik:0.136589	test-poisson-nloglik:0.087910 
[151]	train-poisson-nloglik:0.132685	test-poisson-nloglik:0.083935 
[201]	train-poisson-nloglik:0.131112	test-poisson-nloglik:0.083199 
[251]	train-poisson-nloglik:0.129959	test-poisson-nloglik:0.083135 
[266]	train-poisson-nloglik:0.129688	test-poisson-nloglik:0.083093 

Inspect model fit#

Having fitted the full model we can then inspect the model fit. The traditional way of inspecting global model feature importance is to plot feature importance based on gain.

#default feature importance by gain
featImp_IBNR_F <- xgb.importance(xgb_IBNR_F_Fit, feature_names = colnames(xgb.IBNR_F_DMat.train))
xgb.plot.importance(featImp_IBNR_F, main="Feature Importance - IBNR Frequency")
../_images/8d7fc6581e9500422bc9b9458211d1cbee52c3bd5f6395316a347ae85eecbdeb.png

An increasingly popular and more robust approach is to use SHAP values (https://github.com/slundberg/shap). The SHAP equivalent of the feature importance chart is shown below.

# Return the SHAP values and ranked features by mean|SHAP|
shap_values <- shap.values(xgb_model = xgb_IBNR_F_Fit, X_train = as.matrix(df.IBNR_F_train))

# Prepare the long-format data:
shap_long <- shap.prep(shap_contrib = shap_values$shap_score, X_train =  as.matrix(df.IBNR_F_train))

# **SHAP summary plot**
shap.plot.summary(shap_long, dilute = nrow(df.IBNR_F_train)/10000)


  
../_images/425a775318a2ae73a3c00a5d5f8c144d52f793652fbe6a2301cfdaecc966a95e.png

A second useful chart is the partial dependence plot. This shows how the values of a predictive (input) feature influence the predicted (output) value, while holding the values of all other predictive (input) features constant. It is also known as the marginal effect.

Here we show the partial dependence plots for the top 4 SHAP features.

fig_list <- lapply(names(shap_values$mean_shap_score)[1:4], 
                   shap.plot.dependence,
                   data_long = shap_long,
                   dilute = nrow(shap_long)/ 10000)

gridExtra::grid.arrange(grobs = fig_list, ncol = 2)
`geom_smooth()` using formula 'y ~ x'

`geom_smooth()` using formula 'y ~ x'

`geom_smooth()` using formula 'y ~ x'

Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“pseudoinverse used at -0.015”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“neighborhood radius 2.015”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“reciprocal condition number  2.3627e-15”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“There are other near singularities as well. 4.0602”
`geom_smooth()` using formula 'y ~ x'

Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“pseudoinverse used at -0.005”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“neighborhood radius 1.005”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“reciprocal condition number  2.6301e-29”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“There are other near singularities as well. 1.01”
../_images/c243f53a25f5e38f5b99f2e008d3a3ff269be1be8ccd3b7d859b6f0cd50923c1.png

The feature importance and partial dependency plots provide quick insight into the model.

  • We see that claim development period, j, is the most important feature and that the IBNR frequency is smaller for larger values of j.

  • The second most important feature is phone price, with the IBNR claim count increasing as price increases. Although price is not a direct driver of frequency in the data generating process, there is a link between higher phone prices and higher theft frequencies.

  • The third most important feature is phone model with IBNR frequency increasing with model type. This is expected from the data generating process as theft claim frequency increases with model type.

  • The fourth feature is phone cover, and it should be no surprise to see that Breakage only, the most basic cover level, is associated with lower IBNR claim counts.

SHAP values can also be used to show the components of a single prediction. In the following plot we show the top 4 components for each row of the data and zoom in at row 500.

# choose to show top 4 features by setting `top_n = 4`, set 6 clustering groups.  
plot_data <- shap.prep.stack.data(shap_contrib = shap_values$shap_score,
                                  data_percent = 10000/nrow(shap_long),
                                  top_n = 4,
                                  n_groups = 6)
  
# choose to zoom in at location 500, set y-axis limit using `y_parent_limit`  
# it is also possible to set y-axis limit for zoom-in part alone using `y_zoomin_limit`  
shap.plot.force_plot(plot_data, zoom_in_location = 500, y_parent_limit = c(-1,1))
  
The SHAP values of the Rest 5 features were summed into variable 'rest_variables'.


Data has N = 1111 | zoom in length is 111 at location 500.
../_images/2bb952cd72ef5afe8f5a0ed56fd58e76c0bac5dcf85441c0a5fa755c0a248034.png

Summarising IBNR claim counts#

By comparing the model predictions to the simulated claims run-off we can get a feel for the accuracy of the machine learning approach. Below I aggregate the IBNR claim count predictions by policy start month and compare them to the known simulated claim run-off.

dt_All_IBNR_F_summary <- dt_All_IBNR_F[rowList_IBNR_F$test,.(preds = sum(preds_full_IBNR_Nos), target = sum(target_count)), keyby = date_pol_start_YYYYMM]
date_pol_start_YYYYMM preds target Diff Diff_pcnt
2016.01 216 168 48 28.7%
2016.02 189 176 13 7.4%
2016.03 187 160 27 16.9%
2016.04 181 164 17 10.4%
2016.05 173 195 -22 -11.4%
2016.06 161 179 -18 -10.0%
2016.07 113 170 -57 -33.5%
2016.08 93 162 -69 -42.7%
2016.09 90 59 31 53.2%
Total 1,404 1,433 -29 -2.0%

You can now jump back up to the beginning of the modeling section and select the IBNR severity modeling tab.

IBNR Severity model#

IBNR reserves are built by multiplying the outputs of a claim frequency and claim severity model. The general process of building the model follows that of the RBNS reserves.

Creating xgboost dataset#

Again we convert the data into a specific matrix form called a DMatrix.

The other point to note is the values of flgTrain used to identify the training and test rows in our dataset. Recall from Notebook 2 that, in the IBNR dataset, training rows for the frequency model have a flgTrain value of 1 whereas the severity training rows have a value of 2.

IBNR_predictors <- c("j",
                     "k",
                     "Cover",
                     "Brand",
                     "Model",
                     "Price",
                     "date_uw")

dt_All_IBNR [, date_pol_start_YYYYMM := as.character(year(date_pol_start) + month(date_pol_start)/100 )]

dt_All_IBNR_S <- dt_All_IBNR[, .(exposure = sum(target>0),
                                 target_cost = sum(target)),
                            by= c(IBNR_predictors, "date_pol_start_YYYYMM", "flgTrain")]

dt_All_IBNR_S <- dt_All_IBNR_S[exposure>0]

# setup train and test rows
rowList_IBNR_S <- list(train=dt_All_IBNR_S[, which(flgTrain==2)],
                     test=dt_All_IBNR_S[, which(flgTrain==0)],
                     all = dt_All_IBNR_S[, which(flgTrain!=1)])

# setup data for xgboost

IBNR_rec <- recipe( ~ ., data = dt_All_IBNR_S[, IBNR_predictors, with = FALSE]) %>%
  step_dummy(all_nominal(), one_hot = TRUE) %>%
  prep()

df.IBNR_S_train <- bake(IBNR_rec, new_data = dt_All_IBNR_S[rowList_IBNR_S$train,] )
df.IBNR_S_test <- bake(IBNR_rec, new_data = dt_All_IBNR_S[rowList_IBNR_S$test,] )
df.IBNR_S_all <- bake(IBNR_rec, new_data = dt_All_IBNR_S[rowList_IBNR_S$all,] )



xgb.IBNR_S_DMat.train <- xgb.DMatrix(data = as.matrix(df.IBNR_S_train),
                              weight = dt_All_IBNR_S[rowList_IBNR_S$train, exposure],
                              label = dt_All_IBNR_S[rowList_IBNR_S$train, target_cost])

xgb.IBNR_S_DMat.test <- xgb.DMatrix(data = as.matrix(df.IBNR_S_test),
                             weight = dt_All_IBNR_S[rowList_IBNR_S$test, exposure],
                             label = dt_All_IBNR_S[rowList_IBNR_S$test, target_cost])

xgb.IBNR_S_DMat.all <- xgb.DMatrix(data = as.matrix(df.IBNR_S_all),
                            weight = dt_All_IBNR_S[rowList_IBNR_S$all, exposure],
                            label = dt_All_IBNR_S[rowList_IBNR_S$all, target_cost])

Fit initial model using cross validation#

Having prepared the data for xgboost, I can now fit an initial model. I’ve used a simple set of hyper-parameters and selected the optimal number of boosted trees (nrounds) for these settings using cross validation, by calling the xgb.cv function with early stopping.

I have used the reg:gamma objective function based upon inspection of the target variable.

summary(dt_All_IBNR_S[rowList_IBNR_S$train, target_cost])
hist(dt_All_IBNR_S[rowList_IBNR_S$train, target_cost])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    3.0   125.5   255.0   353.9   486.5  2011.0 
../_images/a1e97cfbed344bbe63b4a003fbb23de629f8719ad9ba516b83074143dea7626f.png
param <- list(
  objective = "reg:gamma",
  max_depth = 2L,           # tree-depth
  subsample = 0.7,          # randomly sample rows before fitting each tree
  colsample_bytree = 0.8,   # randomly sample columns before fitting each tree
  min.child.weight = 10,    # minimum weight per leaf
  eta = 0.1               # Learning rate
  #monotone_constraints = monotone_Vec # Monotonicity constraints
)

# Train model with cross validation
set.seed(1984) # for repeatability

xgb_IBNR_S_CV <- xgb.cv(
 params                 = param,
 data                   = xgb.IBNR_S_DMat.train,
 nrounds                = 2000,        # Maximum number of trees to build
 nfold = 5,

 early_stopping_rounds  = 50L,        # Stops algorithm early if performance has not improved in n rounds
 print_every_n          = 50L,        # How often to print to console
 #base_score             = 0.001,       # Model starting point
   prediction             = TRUE        # Keeps the predictions
) 
[1]	train-gamma-nloglik:727.366760+11.680331	test-gamma-nloglik:726.538232+48.084103 
Multiple eval metrics are present. Will use test_gamma_nloglik for early stopping.
Will train until test_gamma_nloglik hasn't improved in 50 rounds.

[51]	train-gamma-nloglik:6.846936+0.079307	test-gamma-nloglik:6.850900+0.394632 
[101]	train-gamma-nloglik:2.031303+0.001389	test-gamma-nloglik:2.039524+0.065692 
[151]	train-gamma-nloglik:2.001972+0.001599	test-gamma-nloglik:2.018300+0.062185 
Stopping. Best iteration:
[133]	train-gamma-nloglik:2.002924+0.001819	test-gamma-nloglik:2.016062+0.064368

Having fitted an initial model the out-of-fold predictions are stored.

dt_All_IBNR_S[rowList_IBNR_S$train, preds_oof_IBNR_S := xgb_IBNR_S_CV$pred]
dt_All_IBNR_S[rowList_IBNR_S$train, preds_oof_IBNR_Cost := exposure * preds_oof_IBNR_S]

Fit final model on all training data#

Having fit the model using 5-fold cross validation, we observe the optimum number of fitting rounds (xgb_IBNR_S_CV$best_iteration) to be 133.

We can then use this to train a final model on all of the training data.

xgb_IBNR_S_Fit <- xgb.train(
   params                 = param,
   data                   = xgb.IBNR_S_DMat.train,
   nrounds                = xgb_IBNR_S_CV$best_iteration,
# base_score             = 1,
   watchlist              = list(train=xgb.IBNR_S_DMat.train, test=xgb.IBNR_S_DMat.test) ,
   print_every_n          = 50
 )
[1]	train-gamma-nloglik:1.072412	test-gamma-nloglik:1.046510 
[51]	train-gamma-nloglik:1.000593	test-gamma-nloglik:1.000379 
[101]	train-gamma-nloglik:1.000120	test-gamma-nloglik:1.000075 
[133]	train-gamma-nloglik:1.000126	test-gamma-nloglik:1.000079 

Having trained the model the predictions are stored.

dt_All_IBNR_S[rowList_IBNR_S$all, preds_full_IBNR_Cost := predict(xgb_IBNR_S_Fit,xgb.IBNR_S_DMat.all)]

Inspect model fit#

Having fitted the full model we can then inspect the model fit. The traditional way of inspecting global model feature importance is to plot feature importance based on gain.

#default feature importance by gain
featImp_IBNR_S <- xgb.importance(xgb_IBNR_S_Fit, feature_names = colnames(xgb.IBNR_S_DMat.train))
xgb.plot.importance(featImp_IBNR_S, main="Feature Importance - IBNR Severity")
../_images/0897709340e19514ced77cbf3de3d24995bf2383cbeb8730257be0c4fe2710e2.png

An increasingly popular and more robust approach is to use SHAP values (https://github.com/slundberg/shap). The SHAP equivalent of the feature importance chart is shown below.

# Return the SHAP values and ranked features by mean|SHAP|
shap_values <- shap.values(xgb_model = xgb_IBNR_S_Fit, X_train = as.matrix(df.IBNR_S_train))

# Prepare the long-format data:
shap_long <- shap.prep(shap_contrib = shap_values$shap_score, X_train =  as.matrix(df.IBNR_S_train))

# **SHAP summary plot**
shap.plot.summary(shap_long, dilute = max(nrow(df.IBNR_S_train),10000)/10000)
../_images/a82c4b00acb3dbd257446b6db23f66c00b6254fa835dc5e8960bf08dbb0bd590.png

A second useful chart is the partial dependence plot. This shows how the values of a predictive (input) feature influence the predicted (output) value, while holding the values of all other predictive (input) features constant. It is also known as the marginal effect.

Here we show the partial dependence plots for the top 4 SHAP features.

fig_list <- lapply(names(shap_values$mean_shap_score)[1:4], 
                   shap.plot.dependence,
                   data_long = shap_long,
                   dilute = nrow(shap_long)/ 10000)

gridExtra::grid.arrange(grobs = fig_list, ncol = 2)
`geom_smooth()` using formula 'y ~ x'

`geom_smooth()` using formula 'y ~ x'

Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“pseudoinverse used at -0.005”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“neighborhood radius 1.005”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“reciprocal condition number  0”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“There are other near singularities as well. 1.01”
`geom_smooth()` using formula 'y ~ x'

Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“at  -0.005”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“radius  2.5e-05”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“all data on boundary of neighborhood. make span bigger”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“pseudoinverse used at -0.005”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“neighborhood radius 0.005”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“reciprocal condition number  1”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“There are other near singularities as well. 1.01”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“zero-width neighborhood. make span bigger”
Warning message:
“Computation failed in `stat_smooth()`:
NA/NaN/Inf in foreign function call (arg 5)”
`geom_smooth()` using formula 'y ~ x'

Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“pseudoinverse used at 0.985”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“neighborhood radius 2.015”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“reciprocal condition number  3.4701e-15”
Warning message in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
“There are other near singularities as well. 4”
../_images/bbb83609d4b5bec5cfd53d462f248efd69419dbc868871bfd7f02d93dccdeabf.png

The feature importance and partial dependency plots provide quick insight into the model.

  • We see that phone price is the most important feature and has a linear relationship with IBNR severity as we would expect from the data generating process.

  • The second and third most important features relate to phone cover. We see that cover BOT is associated with higher IBNR costs whereas cover B is associated with lower costs.

  • The fourth feature is phone brand and follows the pattern we would expect from the data simulation process.

SHAP values can also be used to show the components of a single prediction. In the following plot we show the top 4 components for each row of the data and zoom in at row 500.

# choose to show top 4 features by setting `top_n = 4`, set 6 clustering groups.  
plot_data <- shap.prep.stack.data(shap_contrib = shap_values$shap_score,
                                  data_percent = 10000/max(nrow(shap_long),10000),
                                  top_n = 4,
                                  n_groups = 6)
  
# choose to zoom in at location 500, set y-axis limit using `y_parent_limit`  
# it is also possible to set y-axis limit for zoom-in part alone using `y_zoomin_limit`  
shap.plot.force_plot(plot_data, zoom_in_location = 500, y_parent_limit = c(-1,1))
  
The SHAP values of the Rest 5 features were summed into variable 'rest_variables'.


Data has N = 1111 | zoom in length is 111 at location 500.
../_images/ac303e5545cb6eb2f6a6ea48b00c58e7a3b0955c736fd0c6537030c85f4a07eb.png

Summarising IBNR claim costs#

By comparing the model predictions to the simulated claims run-off we can get a feel for the accuracy of the machine learning approach. Below I aggregate the IBNR claim cost predictions by policy start month and compare them to the known simulated claim run-off.

dt_All_IBNR_S_summary <- dt_All_IBNR_S[rowList_IBNR_S$test,.(preds = sum(preds_full_IBNR_Cost), target = sum(target_cost)), keyby = date_pol_start_YYYYMM]
date_pol_start_YYYYMM preds target Diff Diff_pcnt
2016.01 57,417 52,782 4,635 8.8%
2016.02 55,351 55,615 -264 -0.5%
2016.03 52,700 47,200 5,500 11.7%
2016.04 55,108 56,975 -1,867 -3.3%
2016.05 64,834 64,833 1 0.0%
2016.06 53,147 52,989 158 0.3%
2016.07 51,445 50,276 1,169 2.3%
2016.08 51,825 53,502 -1,677 -3.1%
2016.09 18,725 18,989 -264 -1.4%
Total 460,551 453,161 7,390 1.6%

Summary#

In this third notebook we have stepped through the process to apply machine learning techniques in order to create reserve estimates following the techniques set out in sections 3 and 4 of Baudry’s paper.

Specifically, we have shown how to apply the xgboost machine learning algorithm and illustrated how, with little human input, it can create reasonable reserve estimates, as shown below.

Reserve Prediction Ground_Truth Diff Diff_pcnt
RBNS 1,021,285 1,022,981 -1,696 -0.2%
IBNR 460,551 453,161 7,390 1.6%
TOTAL 1,481,836 1,476,142 5,694 0.4%
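
A minimal sketch of how this overall summary could be assembled from the two monthly summary tables created earlier is given below. It assumes dt_RBNS_summary and dt_All_IBNR_S_summary as built above, with the IBNR figure taken from the severity model summary, as in the table above.

# Sketch only: assemble the overall reserve summary from the two monthly tables
dt_reserve <- data.table(
  Reserve      = c("RBNS", "IBNR"),
  Prediction   = c(dt_RBNS_summary[, sum(preds)],
                   dt_All_IBNR_S_summary[, sum(preds)]),
  Ground_Truth = c(dt_RBNS_summary[, sum(target)],
                   dt_All_IBNR_S_summary[, sum(target)]))

dt_reserve <- rbind(dt_reserve,
                    dt_reserve[, .(Reserve = "TOTAL",
                                   Prediction = sum(Prediction),
                                   Ground_Truth = sum(Ground_Truth))])

dt_reserve[, ':='(Diff = Prediction - Ground_Truth,
                  Diff_pcnt = (Prediction - Ground_Truth) / Ground_Truth)]
dt_reserve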

In addition we have seen how we can create explanations for the reserve predictions using feature importance and partial dependence plots. We have also used SHAP values, which can explain both global feature effects and individual predictions.

In so doing, it is hoped that this series of notebooks encourages further exploration of Baudry’s paper and the wider field of machine learning in reserving. The code supporting this notebook, along with the supporting files and folders, is available in a zip file here.

Download and extract the zip file to a local directory and then open the R project file Baudry_3.rproj in your local R software installation. In the root of the project folder you will see two files:

  • Notebook_3_ApplyMachineLearningReserving_v1.Rmd - which is the source code used to recreate this notebook

  • Notebook_3_ApplyMachineLearningReserving_v1.R - the equivalent code provided as an R script

Please note that, depending upon your R installation, you may have to install additional R packages before you can run the code provided. R will warn you of any missing dependencies, and you can then install them from CRAN.