R: Baudry ML Reserving Pt 1

R: Baudry ML Reserving Pt 1#

This article was originally created by Nigel Carpenter and published in the General Insurance Machine Learning for Reserving Working Party (“MLR-WP”) blog. The MLR-WP is an international research group on machine learning techniques to reserving, with over 50 actuaries from around the globe. The goal of the group is to bring machine learning techniques into widespread adoption ‘on the ground’ by identifying what the barriers are, communicating any benefits, and helping develop the research techniques in pragmatic ways. Whilst some articles have been brought into this cookbook, consider exploring the blog further for additional content including detailed walkthroughs of more advanced models.

Introduction#

This is the first notebook of a series of three that outlines and elaborates upon code used to replicate the central scenario in the paper of Maximilien Baudry “NON-PARAMETRIC INDIVIDUAL CLAIM RESERVING IN INSURANCE” (Presentation Paper)

Below we step through the process to create a single simulated dataset, as set out in Section 5 of Baudry’s paper. Having stepped through the data creation process, at the end, we present the code in the form of a function that returns a simulated policy and claim dataset.

The second notebook details the process for creating a reserving database and the third outlines the process for creating reserves using machine learning.

Before we start we import a few pre-requisite R packages.

# Importing packages
library(data.table)
library(magrittr)
library(lubridate)
library(ggplot2)
library(cowplot)
library(repr)
library(kableExtra)
library(IRdisplay)

Attaching package: ‘lubridate’

The following objects are masked from ‘package:data.table’:

    hour, isoweek, mday, minute, month, quarter, second, wday, week,
    yday, year

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union

Attaching package: ‘cowplot’

The following object is masked from ‘package:lubridate’:

    stamp

Create Policy Data set#

We start by simulating number of policies sold by day and by policy coverage type.

Policy count by date#

# polices sold between start 2016 to end 2017
dt_policydates <- data.table(date_UW = seq(as.Date("2016/1/1"), as.Date("2017/12/31"), "day"))

# number of policies per day follows Poisson process with mean 700 (approx 255,500 pols per annum)
dt_policydates[, ':='(policycount = rpois(.N,700),
                      date_lapse = date_UW %m+% years(1),
                      expodays = as.integer(date_UW %m+% years(1) - date_UW),
                      pol_prefix = year(date_UW)*10000 + month(date_UW)*100 + mday(date_UW))]

Policy covers by date#

We then add policy coverage columns in proportions 25% Breakage, 45% Breakage and Oxidation and 30% Breakage, Oxidation and Theft.

# Add columns defining Policy Covers   
dt_policydates[, Cover_B := round(policycount * 0.25)]
dt_policydates[, Cover_BO := round(policycount * 0.45)]
dt_policydates[, Cover_BOT := policycount - Cover_B - Cover_BO]

kable(head(dt_policydates), "html") %>% 
  kable_styling("striped") %>%
  scroll_box(width = "100%") %>%
  as.character() %>%
  display_html()

date_UW	policycount	date_lapse	expodays	pol_prefix	Cover_B	Cover_BO	Cover_BOT
2016-01-01	693	2017-01-01	366	20160101	173	312	208
2016-01-02	690	2017-01-02	366	20160102	172	310	208
2016-01-03	716	2017-01-03	366	20160103	179	322	215
2016-01-04	714	2017-01-04	366	20160104	178	321	215
2016-01-05	727	2017-01-05	366	20160105	182	327	218
2016-01-06	763	2017-01-06	366	20160106	191	343	229

Policy transaction file#

We then create a policy transaction file containing 1 row per policy, with columns to indicate policy coverage details.

Policy date & number#

The first step is to create a policy table with 1 row per policy and underwriting date.

# repeat rows for each policy by UW-Date
dt_policy <- dt_policydates[rep(1:.N, policycount),c("date_UW", "pol_prefix"), with = FALSE][,pol_seq:=1:.N, by=pol_prefix]

# Create a unique policy number 
dt_policy[, pol_number := as.character(pol_prefix * 10000 + pol_seq)]

kable(head(dt_policy), "html") %>%
  kable_styling("striped") %>%
  scroll_box(width = "100%") %>%
  as.character() %>%
  display_html()

date_UW	pol_prefix	pol_seq	pol_number
2016-01-01	20160101	1	201601010001
2016-01-01	20160101	2	201601010002
2016-01-01	20160101	3	201601010003
2016-01-01	20160101	4	201601010004
2016-01-01	20160101	5	201601010005
2016-01-01	20160101	6	201601010006

Policy coverage type#

Then we add the policy coverage details appropriate to each row.

# set join keys
setkey(dt_policy,'date_UW')
setkey(dt_policydates,'date_UW')  

# remove pol_prefix before join
dt_policydates[, pol_prefix := NULL]  

# join cover from summary file (dt_policydates)
dt_policy <- dt_policy[dt_policydates]  

# now create Cover field for each policy row
dt_policy[,Cover := 'BO']
dt_policy[pol_seq <= policycount- Cover_BO,Cover := 'BOT']
dt_policy[pol_seq <= Cover_B,Cover := 'B']  

dt_policy[, Cover_B := as.factor(Cover)]  

# remove interim calculation fields
dt_policy[, ':='(pol_prefix = NULL,
                 policycount = NULL,
                 pol_seq = NULL,
                 Cover_B = NULL,
                 Cover_BOT = NULL,
                 Cover_BO = NULL)]

# check output
kable(head(dt_policy), "html") %>%
  kable_styling("striped") %>%
  scroll_box(width = "100%") %>%
  as.character() %>%
  display_html()

date_UW	pol_number	date_lapse	expodays	Cover
2016-01-01	201601010001	2017-01-01	366	B
2016-01-01	201601010002	2017-01-01	366	B
2016-01-01	201601010003	2017-01-01	366	B
2016-01-01	201601010004	2017-01-01	366	B
2016-01-01	201601010005	2017-01-01	366	B
2016-01-01	201601010006	2017-01-01	366	B

Policy Brand, Price, Model features#

Now further details can be added to the policy dataset; such as policy brand, model and price details.

# Add remaining policy details
dt_policy[, Brand := rep(rep(c(1,2,3,4), c(9,6,3,2)), length.out = .N)]
dt_policy[, Base_Price := rep(rep(c(600,550,300,150), c(9,6,3,2)), length.out = .N)]

# models types and model cost multipliers
for (eachBrand in unique(dt_policy$Brand)) {
  dt_policy[Brand == eachBrand, Model := rep(rep(c(3,2,1,0), c(10, 7, 2, 1)), length.out = .N)]
  dt_policy[Brand == eachBrand, Model_mult := rep(rep(c(1.15^3, 1.15^2, 1.15^1, 1.15^0), c(10, 7, 2, 1)), length.out = .N)]
}

dt_policy[, Price := ceiling (Base_Price * Model_mult)]

# check output
kable(head(dt_policy), "html") %>%
  kable_styling("striped") %>% 
  scroll_box(width = "100%") %>%
  as.character() %>%
  display_html()

date_UW	pol_number	date_lapse	expodays	Cover	Brand	Base_Price	Model	Model_mult	Price
2016-01-01	201601010001	2017-01-01	366	B	1	600	3	1.520875	913
2016-01-01	201601010002	2017-01-01	366	B	1	600	3	1.520875	913
2016-01-01	201601010003	2017-01-01	366	B	1	600	3	1.520875	913
2016-01-01	201601010004	2017-01-01	366	B	1	600	3	1.520875	913
2016-01-01	201601010005	2017-01-01	366	B	1	600	3	1.520875	913
2016-01-01	201601010006	2017-01-01	366	B	1	600	3	1.520875	913

Tidy and save#

The final step is to keep only columns of interest and save the resulting policy file.

# colums to keep
cols_policy <- c("pol_number",
                 "date_UW",
                 "date_lapse",
                 "Cover",
                 "Brand",
                 "Model",
                 "Price")

dt_policy <- dt_policy[, cols_policy, with = FALSE]

# check output
kable(head(dt_policy), "html") %>%
  kable_styling("striped") %>%
  scroll_box(width = "100%") %>%
  as.character() %>%
  display_html()

save(dt_policy, file = "./dt_policy.rda")

pol_number	date_UW	date_lapse	Cover	Brand	Model	Price
201601010001	2016-01-01	2017-01-01	B	1	3	913
201601010002	2016-01-01	2017-01-01	B	1	3	913
201601010003	2016-01-01	2017-01-01	B	1	3	913
201601010004	2016-01-01	2017-01-01	B	1	3	913
201601010005	2016-01-01	2017-01-01	B	1	3	913
201601010006	2016-01-01	2017-01-01	B	1	3	913

Create claims file#

Sample Claims from Policies#

We now simulate claims arising from the policy coverages. Claim rates vary by policy coverage and type.

Breakage Claims#

We start with breakages claims and sample from the policies data set to create claims.

# All policies have breakage cover
# claims uniformly sampled from policies
claim <- sample(nrow(dt_policy), size = floor(nrow(dt_policy) * 0.15))

# Claim severity multiplier sampled from beta distribution
dt_claim <- data.table(pol_number = dt_policy[claim, pol_number],
                       claim_type = 'B',
                       claim_count = 1,
                       claim_sev = rbeta(length(claim), 2,5))

# check output
kable(head(dt_claim), "html") %>%
  kable_styling("striped") %>%
  scroll_box(width = "100%") %>%
  as.character() %>%
  display_html()

pol_number	claim_type	claim_count	claim_sev
201706250197	B	1	0.0066006
201702160566	B	1	0.1305747
201703080185	B	1	0.3325863
201709090113	B	1	0.2690556
201604260297	B	1	0.2531805
201707310143	B	1	0.1357897

Oxidation Claims#

Oxidation claims follow a similar process, just with different incidence rates and severities.

# identify all policies with Oxidation cover
cov <- which(dt_policy$Cover != 'B')

# sample claims from policies with cover
claim <- sample(cov, size = floor(length(cov) * 0.05))

# add claims to table 
dt_claim <- rbind(dt_claim,
                  data.table(pol_number = dt_policy[claim, pol_number],
                             claim_type = 'O',
                             claim_count = 1,
                             claim_sev = rbeta(length(claim), 5,3)))

# check output
kable(head(dt_claim), "html") %>%
  kable_styling("striped") %>%
  scroll_box(width = "100%") %>%
  as.character() %>%
  display_html()

pol_number	claim_type	claim_count	claim_sev
201706250197	B	1	0.0066006
201702160566	B	1	0.1305747
201703080185	B	1	0.3325863
201709090113	B	1	0.2690556
201604260297	B	1	0.2531805
201707310143	B	1	0.1357897

Theft Claims#

In the original paper the distribution for Theft severity claims is stated to be Beta(alpha = 5, beta = 0.5).

# identify all policies with Theft cover
# for Theft claim frequency varies by Brand
# So need to consider each in turn...

for(myModel in 0:3) {

    cov <- which(dt_policy$Cover == 'BOT' & dt_policy$Model == myModel)
    claim <- sample(cov, size = floor(length(cov) * 0.05*(1 + myModel)))
  
    dt_claim <- rbind(dt_claim,
                      data.table(pol_number = dt_policy[claim, pol_number],
                                 claim_type = 'T',
                                 claim_count = 1,
                                 claim_sev = rbeta(length(claim), 5,.5)))
}

# check output
kable(head(dt_claim), "html") %>% 
  kable_styling("striped") %>%
  scroll_box(width = "100%") %>%
  as.character() %>%
  display_html()

kable(tail(dt_claim), "html") %>%
  kable_styling("striped") %>%
  scroll_box(width = "100%") %>%
  as.character() %>%
  display_html()

pol_number	claim_type	claim_count	claim_sev
201706250197	B	1	0.0066006
201702160566	B	1	0.1305747
201703080185	B	1	0.3325863
201709090113	B	1	0.2690556
201604260297	B	1	0.2531805
201707310143	B	1	0.1357897

pol_number	claim_type	claim_count	claim_sev
201708230205	T	1	0.9648917
201701280297	T	1	0.8245351
201701020337	T	1	0.7794176
201605080388	T	1	0.8335502
201608140365	T	1	0.9262032
201710040235	T	1	0.6904011

Claim dates and costs#

Policy UW_date and value#

We now need to add details to claims, such as policy underwritng date and phone cost. These details come from the policy table.

# set join keys
setkey(dt_policy, pol_number)
setkey(dt_claim, pol_number)

#join Brand and Price from policy to claim
dt_claim[dt_policy,
         on = 'pol_number',
         ':='(date_UW = i.date_UW,
              Price = i.Price,
              Brand = i.Brand)]

# check output
kable(head(dt_claim), "html") %>%
  kable_styling("striped") %>%
  scroll_box(width = "100%") %>%
  as.character() %>%
  display_html()

pol_number	claim_type	claim_count	claim_sev	date_UW	Price	Brand
201601010003	B	1	0.2786004	2016-01-01	913	1
201601010019	B	1	0.3383702	2016-01-01	229	4
201601010021	B	1	0.2924503	2016-01-01	913	1
201601010028	B	1	0.2193067	2016-01-01	794	1
201601010032	B	1	0.2106113	2016-01-01	837	2
201601010047	B	1	0.1108387	2016-01-01	913	1

Add simulated Claim occrrence, reporting and payment delays#

The occurrence delay is assumed uniform over policy exposure period. Reporting and payment delays are assumed to follow Beta distributions.

# use lubridate %m+% date addition operator 
dt_claim[, date_lapse := date_UW %m+% years(1)]
dt_claim[, expodays := as.integer(date_lapse - date_UW)]
dt_claim[, occ_delay_days := floor(expodays * runif(.N, 0,1))]

dt_claim[ ,delay_report := floor(365 * rbeta(.N, .4, 10))]  
dt_claim[ ,delay_pay := floor(10 + 40* rbeta(.N, 7,7))]  

dt_claim[, date_occur := date_UW %m+% days(occ_delay_days)]
dt_claim[, date_report := date_occur %m+% days(delay_report)]
dt_claim[, date_pay := date_report %m+% days(delay_pay)]

dt_claim[, claim_cost := round(Price * claim_sev)]

# check output
head(dt_claim)

A data.table: 6 × 16
pol_number	claim_type	claim_count	claim_sev	date_UW	Price	Brand	date_lapse	expodays	occ_delay_days	delay_report	delay_pay	date_occur	date_report	date_pay	claim_cost
<chr>	<chr>	<dbl>	<dbl>	<date>	<dbl>	<dbl>	<date>	<int>	<dbl>	<dbl>	<dbl>	<date>	<date>	<date>	<dbl>
201601010003	B	1	0.2786004	2016-01-01	913	1	2017-01-01	366	173	71	28	2016-06-22	2016-09-01	2016-09-29	254
201601010019	B	1	0.3383702	2016-01-01	229	4	2017-01-01	366	139	6	33	2016-05-19	2016-05-25	2016-06-27	77
201601010021	B	1	0.2924503	2016-01-01	913	1	2017-01-01	366	203	2	30	2016-07-22	2016-07-24	2016-08-23	267
201601010028	B	1	0.2193067	2016-01-01	794	1	2017-01-01	366	186	1	29	2016-07-05	2016-07-06	2016-08-04	174
201601010032	B	1	0.2106113	2016-01-01	837	2	2017-01-01	366	199	33	29	2016-07-18	2016-08-20	2016-09-18	176
201601010047	B	1	0.1108387	2016-01-01	913	1	2017-01-01	366	141	0	29	2016-05-21	2016-05-21	2016-06-19	101

Claim key and tidy#

The final stage is to do some simple tidying and add a unique claim key. The file is then saved for future use.

Note that the original paper spoke of using a “competing hazards model” for simulating claims. I have taken this to mean that a policy can only give rise to one claim. Where the above process has simulated multiple claims against the same policy I keep only the first occurring claim.

# add a unique claimkey based upon occurence date

dt_claim[, clm_prefix := year(date_occur)*10000 + month(date_occur)*100 + mday(date_occur)]
dt_claim[, clm_seq := seq_len(.N), by = clm_prefix]
dt_claim[, clm_number := as.character(clm_prefix * 10000 + clm_seq)]


# keep only first claim against policy (competing hazards)
setkeyv(dt_claim, c("pol_number", "clm_prefix"))
dt_claim[, polclm_seq := seq_len(.N), by = .(pol_number)]
dt_claim <- dt_claim[polclm_seq == 1,]

# colums to keep
cols_claim <- c("clm_number",
                "pol_number",
                "claim_type",
                "claim_count",
                "claim_sev",
                "date_occur",
                "date_report",
                "date_pay",
                "claim_cost")

dt_claim <- dt_claim[, cols_claim, with = FALSE]

# check output
kable(head(dt_claim), "html") %>%
  kable_styling("striped") %>%
  scroll_box(width = "100%") %>%
  as.character() %>%
  display_html()

save(dt_claim, file = "./dt_claim.rda")

clm_number	pol_number	claim_type	claim_count	claim_sev	date_occur	date_report	date_pay	claim_cost
201606220001	201601010003	B	1	0.2786004	2016-06-22	2016-09-01	2016-09-29	254
201605190001	201601010019	B	1	0.3383702	2016-05-19	2016-05-25	2016-06-27	77
201607220001	201601010021	B	1	0.2924503	2016-07-22	2016-07-24	2016-08-23	267
201607050001	201601010028	B	1	0.2193067	2016-07-05	2016-07-06	2016-08-04	174
201607180001	201601010032	B	1	0.2106113	2016-07-18	2016-08-20	2016-09-18	176
201605210001	201601010047	B	1	0.1108387	2016-05-21	2016-05-21	2016-06-19	101

Checking exhibits#

Baudry’s paper produced a number of summary exhibits from the simulated data. Let’s recreate them to get some comfort that we have correctly recreated the data.

You can see that the severity exhibit, Chart B, is inconsistent with that presented in the original paper. The cause of that difference is unclear at the time of writing. It’s likely to be because something nearer to a Beta(alpha = 5, beta = 0.05) has been used. However using that creates other discrepancies likely to be due to issues with the competing hazards implementation. For now we note the differences and continue with the data as created here.

../_images/c72006526ce254833c589c89de75c96a6e92f181c238809b8cc228fe7d2e88a3.png

dt_claim[, Days_reportdelay := as.numeric(difftime(date_report, date_occur, units="days"))]

hist(dt_claim[, Days_reportdelay],breaks = 50)

../_images/04c875686d7aa1d82cad13d2c0146fed8446991a6f3ef5b48660bfd149cf3f22.png

dt_claim[, Days_paymentdelay := as.numeric(difftime(date_pay, date_report, units="days"))]
hist(dt_claim[, Days_paymentdelay],breaks = 60)

../_images/100162051cbd3830c589243684ae44a9013d9f45924f14d7545eed46e205a503.png

The final set of exhibits are those on slide 44. The only difference of note here is in Chart B, the Claim Rate by phone brand.

Baudry’s exhibits show Brand 1 to have a 5% higher claim frequency than other Brands. From reading the paper I can’t see why we should expect that to be the case. Claim rate varies by phone Model but Model incidence doesn’t vary by Brand. Therefore I can’t see how the Chart B equivalent could be correct given the details in the paper.

I leave the code as is noting the difference but recognising that it will not affect the wider aims of the paper.

../_images/d4f7b052cc5a5c1b5dad88fadc29e4b329e208f1295135a3a8c6b55b010ad58e.png

Function to create policy and claim data#

The above code can be wrapped into a function which returns a list containing the policy and claim datasets.

By calling the function with a seed you can simulate policy and claim datasets.

tmp <- simulate_central_scenario(1234)

kable(head(tmp$dt_claim), "html") %>%
  kable_styling("striped") %>%
  scroll_box(width = "100%") %>%
  as.character() %>%
  display_html()

clm_number	pol_number	claim_type	claim_count	claim_sev	date_occur	date_report	date_pay	claim_cost
201606080001	201601010001	B	1	0.3337923	2016-06-08	2016-06-08	2016-07-21	305
201609150001	201601010014	B	1	0.3692034	2016-09-15	2016-09-15	2016-10-17	309
201609090001	201601010025	B	1	0.4496012	2016-09-09	2016-09-09	2016-10-07	357
201602190001	201601010027	B	1	0.4019731	2016-01-25	2016-02-19	2016-03-21	319
201605140001	201601010043	B	1	0.2146653	2016-05-14	2016-05-14	2016-06-15	196
201612110001	201601010045	B	1	0.2783313	2016-12-11	2016-12-11	2017-01-06	254

Download the code and try it yourself#

To finish, if you wish to try out this code in your own local instance of R then we have made this code and the supporting files and folders available in a zip file here.

Download and extract the zip file to a local directory and then open the R project file Baudry_1.rproj in your local R software installation. In the root of the project folder you will see two files;

Notebook_1_SimulateData_v1.Rmd - which is the source code used to recreate this notebook
Notebook_1_SimulateData_v1.R - the equivalent code provided as an R script

Please note that, depending upon your R installation , you may have to install R libraries before you can run the code provided. R will warn you if you have missing dependencies and you can then install them from CRAN.