Py: Investigating Neural Network Biases with ChatGPT#

But not in the way you think…

An Analytics Snippet By Jacky Poon.

In this notebook we will, with the assistance of ChatGPT, fit neural network models with early stopping to predict claims data, and check whether the predictions are biased.

Here, bias refers to the model making predictions that are systematically higher or lower than the actual experience. This matters because in many practical actuarial applications, whilst it is helpful to be able to distinguish the high risk segments from the low risk segments, having the whole portfolio systematically underpriced or overpriced will lead to unfortunate outcomes.
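Concretely, the bias measure we track later in this notebook is the exposure-weighted gap between predicted and observed frequency. As a minimal sketch (the function and argument names here are illustrative, not from the code below):

import numpy as np

def portfolio_bias(pred_freq, claims, expo):
    # Exposure-weighted predicted frequency less observed frequency;
    # zero means the portfolio is priced at the right overall level.
    pred_freq, claims, expo = map(np.asarray, (pred_freq, claims, expo))
    return (pred_freq * expo).sum() / expo.sum() - claims.sum() / expo.sum()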

We will see if we can replicate the findings of Wuthrich 2019. In that paper, Wuthrich describes how using early stopping to limit over-fitting introduces bias into estimates of the mean, and considers regularisation techniques to prevent this.

Early stopping is a common method in machine learning to prevent overfitting: hold out a validation dataset and, during training, stop as soon as further training no longer improves performance on it.
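In pseudocode, the pattern looks roughly like this - a minimal sketch only, where train_one_epoch and validation_loss stand in for the real steps (the actual loop appears later in this notebook):

def early_stop_train(train_one_epoch, validation_loss, max_epochs=1000, patience=10):
    best_val_loss, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()                    # one pass over the training data
        val_loss = validation_loss()         # evaluate on the held-out set
        if val_loss < best_val_loss:         # new best so far - remember it
            best_val_loss, best_epoch = val_loss, epoch
        if epoch - best_epoch >= patience:   # no improvement for `patience` epochs
            break                            # stop training early
    return best_epoch, best_val_loss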

Our investigation follows my earlier notebook, in which we described how to make incremental changes to a chain ladder model to progress to a probabilistic neural network, and shared several techniques that help the model training process with faster convergence and better stability. In that notebook, early stopping was not used, so estimates aligned well to the mean within the training data; instead, regularisation was used to prevent overfitting. However, regularisation requires a hyperparameter search to find the right balance between under- and over-fitting, whereas early stopping would seem to find that balance automatically.

So it is interesting to test whether Wuthrich's findings on bias apply to our approach as well.

ChatGPT prompt:#

We used the following prompt to start our code.

Write a python script that fits a feedforward neural network with one hidden layer on a pandas tabular dataset. Use pytorch, with early stopping. The response column is claims. There is an exposure weight column expo. Numerical features are age, ac, power, dens Categorical features are gas, brand, area, and ct. Explain the code.

This was run on ChatGPT’s free version in mid May 2023. You may get different output as the large language model does not appear to be deterministic - different runs will produce different code, sometimes a different style, and often different bugs or functionality.

We will include ChatGPT’s code descriptions, which were originally at the end, in-line with the code in italics to make it easier to follow, and annotate any adjustments we made.

Response:#

Certainly! Below is a Python script that fits a feedforward neural network with embeddings on categorical features and one hidden layer using PyTorch on a Pandas tabular dataset:

Import the necessary libraries, including torch for deep learning operations, numpy for numerical computations, and pandas for data manipulation.

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler  # We added MinMaxScaler for later
from sklearn.metrics import mean_squared_error

from matplotlib import pyplot as plt  # Added this for extra diagnostics. 
num_models = 50

Define a class called FeedForwardNet, which represents the feedforward neural network model. It inherits from the nn.Module class, the base class for all neural network modules in PyTorch.

Inside the FeedForwardNet class, define the neural network architecture in the init method. It consists of an embedding layer for each categorical feature, followed by a hidden layer and an output layer.

Implement the forward pass of the neural network in the forward method. The forward pass involves embedding the categorical features, concatenating them with the numerical features, and passing the concatenated tensor through the hidden and output layers.

ChatGPT does a good job and produces what was asked for. There is an undocumented dropout layer with 20% dropout, which was not part of the ask; we will take it out to simplify the logic.

In Wuthrich's paper, a multi-layer network is used with (20, 15, 10) hidden neurons, whereas our prompt only asked for a single hidden layer. We can change the code to take a list of hidden sizes by asking:

Modify this code to take in a list of hidden layer sizes for a multi-layer model: followed by the code for the FeedForwardNet class.

We also make some amendments based on our earlier work.

In our prompt, we neglected to mention that we wanted an exponential transform at the end to give non-negative estimates, analogous to a log-link GLM, so we amend the code to apply torch.exp at the end.

Additionally, we initialise as follows to improve model training:

  • Final layer weights to zero

  • Final layer bias to an additional parameter init_bias.

# Define the neural network class
class FeedForwardNet(nn.Module):
    def __init__(self, num_numerical_feats, num_categorical_feats, embedding_sizes, hidden_sizes, init_bias):  # was hidden_size originally
        super(FeedForwardNet, self).__init__()
        self.embeddings = nn.ModuleList([
            nn.Embedding(num_classes, emb_size) for num_classes, emb_size in embedding_sizes
        ])
        self.num_numerical_feats = num_numerical_feats
        self.num_categorical_feats = num_categorical_feats
        self.total_embed_size = sum([emb_size for _, emb_size in embedding_sizes])
        self.input_size = self.num_numerical_feats + self.total_embed_size

        # self.fc1 = nn.Linear(self.input_size, hidden_size)
        self.hidden_layers = nn.ModuleList([
            nn.Linear(self.input_size if i == 0 else hidden_sizes[i - 1], hidden_size)
            for i, hidden_size in enumerate(hidden_sizes)
        ])

        # self.fc2 = nn.Linear(hidden_size, 1)
        self.fc2 = nn.Linear(hidden_sizes[-1], 1)

        nn.init.zeros_(self.fc2.weight)                          # Initialise to zero
        self.fc2.bias.data = torch.tensor(init_bias)
        # self.dropout = nn.Dropout(p=0.2)
        
    def forward(self, x_numerical, x_categorical):
        embedded_x = [embedding(x_categorical[:, i]) for i, embedding in enumerate(self.embeddings)]
        embedded_x = torch.cat(embedded_x, dim=1)
        x = torch.cat([embedded_x, x_numerical], dim=1)
        # x = self.dropout(x)
        # x = torch.relu(self.fc1(x))

        for hidden_layer in self.hidden_layers:
            x = torch.relu(hidden_layer(x))

        # x = self.dropout(x)
        x = self.fc2(x)
        return torch.exp(x)      #Exp output

Load the dataset using pd.read_csv() and split it into training and validation sets using train_test_split() from scikit-learn.

We did not give ChatGPT a dataset file, so it used the placeholder 'your_dataset.csv'. We are using the data from Wüthrich–Buser, the link is here.

# Load the dataset
data = pd.read_csv("MTPL_data.csv", sep=";").set_index("id")

# Split the dataset into training and validation sets
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)

The dataset looks like this by the way:

data
claims expo age ac power gas brand area dens ct truefreq
id
1 0 0.33 66 4 3 Regular B12 B 83 BE 0.059941
2 0 0.08 31 1 7 Diesel B1 A 34 BL 0.119216
3 0 0.92 60 6 5 Diesel B1 C 223 AG 0.074344
4 0 1.00 66 4 2 Diesel B1 C 283 FR 0.092829
5 0 0.63 63 3 5 Regular B12 B 74 VS 0.049981
... ... ... ... ... ... ... ... ... ... ... ...
499996 0 0.15 32 3 2 Diesel B1 C 204 FR 0.100843
499997 0 0.86 40 2 2 Regular B12 D 617 FR 0.167433
499998 0 0.74 46 7 4 Regular B1 C 238 VD 0.084457
499999 0 1.00 72 5 3 Diesel B2 B 94 VS 0.070113
500000 0 0.64 31 1 1 Diesel B12 C 323 BL 0.165031

500000 rows × 11 columns

Define the categorical features and numerical features based on the dataset columns.

Encode the categorical features using LabelEncoder() from scikit-learn. This is necessary to convert the categorical features into numerical representations suitable for training the neural network.

ChatGPT lists the features as instructed, but we discover after the fact that Wuthrich has area as continuous, so we’ll adjust the code accordingly.

We’ll create the label encoders based on the full dataset instead of the training dataset only, to avoid errors from missing levels.

# Define the categorical features and numerical features
categorical_feats = ['gas', 'brand', 'area', 'ct']
numerical_feats = ['age', 'ac', 'power', 'dens']

# Encode the categorical features using LabelEncoder
label_encoders = {}
for feat in categorical_feats:
    label_encoders[feat] = LabelEncoder()
    label_encoders[feat].fit(data[feat])
    train_data[feat] = label_encoders[feat].transform(train_data[feat])
    val_data[feat] = label_encoders[feat].transform(val_data[feat])

ChatGPT only label encodes the categorical variables, but we should standardise our numerical inputs:

# Scale the numerical features using MinMaxScaler
scaler = {}
for feat in numerical_feats:
    scaler[feat] = MinMaxScaler()
    train_data[feat] = scaler[feat].fit_transform(train_data[feat].values.reshape(-1, 1))
    val_data[feat] = scaler[feat].transform(val_data[feat].values.reshape(-1, 1))

Define the embedding sizes for each categorical feature. These sizes determine the dimensions of the embedding vectors for each categorical feature. Usually, the embedding size is chosen to be smaller than the number of unique values in the feature but larger than half of it.

# Define the embedding sizes for categorical features
embedding_sizes = [(len(train_data[feat].unique()), min(50, (len(train_data[feat].unique()) + 1) // 2)) for feat in categorical_feats]

Convert the dataset columns into PyTorch tensors, which are required for training the neural network model.

We neglected to mention this in the prompt, but we actually want to divide claims by exposure here to get y.

# Convert the dataset to PyTorch tensors
x_train_numerical = torch.tensor(train_data[numerical_feats].values, dtype=torch.float32)
x_train_categorical = torch.tensor(train_data[categorical_feats].values, dtype=torch.long)
# y_train = torch.tensor(train_data['claims'].values, dtype=torch.float32)
y_train = torch.tensor(train_data['claims'].values / train_data['expo'].values, dtype=torch.float32)
exposure_train = torch.tensor(train_data['expo'].values, dtype=torch.float32)

x_val_numerical = torch.tensor(val_data[numerical_feats].values, dtype=torch.float32)
x_val_categorical = torch.tensor(val_data[categorical_feats].values, dtype=torch.long)
# y_val = torch.tensor(val_data['claims'].values, dtype=torch.float32)
y_val = torch.tensor(val_data['claims'].values / val_data['expo'].values, dtype=torch.float32)
exposure_val = torch.tensor(val_data['expo'].values, dtype=torch.float32)

We did not mention it in the prompt, but let us calculate the average claim frequency so that we can later check that predictions align to that value.

avg_claim = train_data['claims'].values.sum() / train_data['expo'].values.sum()
avg_claim
0.10265948983820225
(y_train * exposure_train).sum() / train_data['expo'].values.sum()
tensor(0.1027)

Define the hyperparameters such as the hidden layer size, learning rate, batch size, number of epochs, and early stopping epochs.

A batch_size is generated, but it does not appear to be used in the later generated code. We prefer full-batch training on this dataset, so this is fine.

We make a few adjustments to hyperparams.

# Define the hyperparameters
# hidden_size = 64
hidden_size = [20, 15, 10]  # Replace with the multi-layer parameters.

# learning_rate = 0.001
# batch_size = 32
# num_epochs = 100
# early_stopping_epochs = 10

# Overwrite those hyperparameters with these
learning_rate = 0.01
num_epochs = 9999  # should not be a factor, we train until early stopping kicks in
early_stopping_epochs = 10

Create an instance of the FeedForwardNet model.

We pass on the init_bias parameter here.

# Create an instance of the FeedForwardNet model
model = FeedForwardNet(len(numerical_feats), len(categorical_feats), embedding_sizes, hidden_size, init_bias = np.log(avg_claim).astype(np.float32))

# Test that the init_bias works, what is the initial mean?
y_pred = model(x_train_numerical, x_train_categorical)

(y_pred.squeeze() * exposure_train).sum() / exposure_train.sum()
tensor(0.1027, grad_fn=<DivBackward0>)

Define the loss function (mean squared error) and the optimizer (Adam optimizer) to train the model.

We would like to handle the exposure weights a little differently. With reduction='none' the loss is kept as individual values per row instead of being summed, which allows us to multiply the loss per record by the exposure weight per record later.

We’ll also swap to Poisson Loss.

# Define the loss function
# criterion = nn.MSELoss(reduction='sum')

criterion = nn.PoissonNLLLoss(reduction='none', log_input=False)

# Define the optimizer
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

Train the model for the specified number of epochs. In each epoch, perform forward and backward passes, update the model’s parameters, and calculate the validation loss. If the validation loss is the lowest so far, save the model’s state.

We add some debug output for diagnostics.

# Train the model
best_val_loss = np.inf
best_epoch = 0
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()
    y_pred = model(x_train_numerical, x_train_categorical)
    # loss = criterion(y_pred.squeeze() * exposure_train, y_train * exposure_train)
    loss = (criterion(y_pred.squeeze(), y_train) * exposure_train).sum() / exposure_train.sum()
    loss.backward()
    optimizer.step()
    
    model.eval()
    with torch.no_grad():
        y_val_pred = model(x_val_numerical, x_val_categorical)
        # val_loss = criterion(y_val_pred.squeeze() * exposure_val, y_val * exposure_val)
        val_loss = (criterion(y_val_pred.squeeze(), y_val) * exposure_val).sum() / exposure_val.sum()
    
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_epoch = epoch
        torch.save(model.state_dict(), 'best_model.pt')
    
    # Added for diagnostics (in effect this prints every 10 epochs: the first
    # condition never fires, because best_val_loss was just updated above)
    if (val_loss < best_val_loss) | (epoch % 10 == 0):
        print(
            "Epoch:", epoch,
            "Train_Diff:", ((y_pred.squeeze() * exposure_train - y_train * exposure_train).sum() / exposure_train.sum()).item(),            
            "Train_Mean:", ((y_pred.squeeze() * exposure_train).sum() / exposure_train.sum()).item(), 
            "Train_Loss", loss.item(),
            "Val_mean", ((y_val_pred.squeeze() * exposure_val).sum() / exposure_val.sum()).item(), 
            "Val_loss", val_loss.item(),
        )

    if epoch - best_epoch >= early_stopping_epochs:
        print("Stopping at:", best_epoch)      
        break
Epoch: 0 Train_Diff: 3.9570383081333205e-10 Train_Mean: 0.10265947878360748 Train_Loss 0.33634716272354126 Val_mean 0.10276821255683899 Val_loss 0.336696594953537
Epoch: 10 Train_Diff: 0.0005192473763599992 Train_Mean: 0.10317873954772949 Train_Loss 0.3340966999530792 Val_mean 0.10280803591012955 Val_loss 0.3345138132572174
Epoch: 20 Train_Diff: 0.000555766629986465 Train_Mean: 0.10321525484323502 Train_Loss 0.3329228162765503 Val_mean 0.10316438227891922 Val_loss 0.3333604633808136
Epoch: 30 Train_Diff: 0.00022463459754362702 Train_Mean: 0.10288412868976593 Train_Loss 0.33167240023612976 Val_mean 0.10189294815063477 Val_loss 0.33199846744537354
Epoch: 40 Train_Diff: 0.00515005411580205 Train_Mean: 0.10780954360961914 Train_Loss 0.33076417446136475 Val_mean 0.09964833408594131 Val_loss 0.33111655712127686
Epoch: 50 Train_Diff: -0.0002511663769837469 Train_Mean: 0.10240831226110458 Train_Loss 0.3297358453273773 Val_mean 0.10654579848051071 Val_loss 0.33056363463401794
Epoch: 60 Train_Diff: 0.002754278015345335 Train_Mean: 0.1054137572646141 Train_Loss 0.32892730832099915 Val_mean 0.09867089241743088 Val_loss 0.32982489466667175
Epoch: 70 Train_Diff: 3.5053705005339e-07 Train_Mean: 0.10265983641147614 Train_Loss 0.32785457372665405 Val_mean 0.09791050851345062 Val_loss 0.32911568880081177
Epoch: 80 Train_Diff: 0.005469439085572958 Train_Mean: 0.10812892764806747 Train_Loss 0.32726752758026123 Val_mean 0.09609229862689972 Val_loss 0.3288998305797577
Epoch: 90 Train_Diff: 0.0024685768876224756 Train_Mean: 0.10512805730104446 Train_Loss 0.3264632225036621 Val_mean 0.09715571999549866 Val_loss 0.3284188508987427
Epoch: 100 Train_Diff: -0.002204097807407379 Train_Mean: 0.1004553884267807 Train_Loss 0.3260534703731537 Val_mean 0.10135945677757263 Val_loss 0.32796788215637207
Epoch: 110 Train_Diff: -0.004026252310723066 Train_Mean: 0.09863324463367462 Train_Loss 0.32590925693511963 Val_mean 0.10565565526485443 Val_loss 0.3279320001602173
Epoch: 120 Train_Diff: 0.003851202316582203 Train_Mean: 0.1065106987953186 Train_Loss 0.3255639374256134 Val_mean 0.0999462902545929 Val_loss 0.32771018147468567
Epoch: 130 Train_Diff: -0.0011717552551999688 Train_Mean: 0.10148773342370987 Train_Loss 0.32516470551490784 Val_mean 0.10182797163724899 Val_loss 0.3276842534542084
Stopping at: 121

Load the best model’s state from the saved file.

# Load the best model
model.load_state_dict(torch.load('best_model.pt'))
<All keys matched successfully>

Prepare the test set data and make predictions on it using the best model.

ChatGPT does not actually prepare the test set data, so we need to code this up ourselves. For this test we’ll just use our original full dataset.

test_data = data.copy()
for feat in categorical_feats:
    test_data[feat] = label_encoders[feat].transform(test_data[feat])

for feat in numerical_feats:
    test_data[feat] = scaler[feat].transform(test_data[feat].values.reshape(-1, 1))

Calculate the test loss (mean squared error) and the root mean squared error (RMSE) for the test set.

Print the test RMSE.

(Since we swapped the criterion to Poisson loss, the quantity printed below is not a true RMSE; we have left the generated label as-is.)

# Make predictions on the test set
x_test_numerical = torch.tensor(test_data[numerical_feats].values, dtype=torch.float32)
x_test_categorical = torch.tensor(test_data[categorical_feats].values, dtype=torch.long)
y_test = torch.tensor(test_data['claims'].values, dtype=torch.float32)
exposure_test = torch.tensor(test_data['expo'].values, dtype=torch.float32)

model.eval()
with torch.no_grad():
    y_test_pred = model(x_test_numerical, x_test_categorical)

# Calculate the test loss
# test_loss = criterion(y_test_pred.squeeze() * exposure_test, y_test * exposure_test)
test_loss = (criterion(y_test_pred.squeeze(), y_test) * exposure_test).sum() / exposure_test.sum()
test_rmse = torch.sqrt(test_loss.item() / torch.tensor(len(test_data)))

print(f'Test RMSE: {test_rmse:.4f}')
Test RMSE: 0.0007

This code fits a feedforward neural network with embeddings on categorical features and one hidden layer using PyTorch. It uses early stopping to prevent overfitting and achieves predictions for the response column “claims” in a tabular dataset.

Let us run some diagnostics on this model, on the validation data:

y_val_pred = model(x_val_numerical, x_val_categorical)

val_data["predicted"] = y_val_pred.cpu().detach().numpy().ravel() * val_data.expo
val_data["trueclaims"] = val_data.truefreq * val_data.expo

fig, axs = plt.subplots(len(numerical_feats) + 1, sharex=False, sharey=True, figsize=(7, 40))

for i, f in enumerate(["predicted"] + numerical_feats):
    dat_copy = val_data.copy()
    dat_copy["decile"] = pd.qcut(dat_copy[f], 10, labels=False, duplicates='drop')
    X_sum = dat_copy.groupby("decile").agg("sum").reset_index()
    
    axs[i].plot(X_sum.index, X_sum.trueclaims / X_sum.expo)
    axs[i].plot(X_sum.index, X_sum.predicted / X_sum.expo)    
    axs[i].set_title(f)


for i, f in enumerate(categorical_feats):
    dat_copy = val_data.copy()
    X_sum = dat_copy.groupby(f).agg("sum")[["trueclaims", "predicted", "expo"]]
    X_sum["trueclaims"] = X_sum["trueclaims"]/X_sum["expo"]
    X_sum["predicted"] = X_sum["predicted"]/X_sum["expo"]
    X_sum = X_sum.drop(columns="expo")
    axs[i] = X_sum.plot(kind='bar', rot=0, xlabel=f, ylabel='Value', title=f)
[Figures: actual vs predicted frequency by decile for the prediction and each numerical feature, and by level for each categorical feature]

So we have a working model with the assistance of ChatGPT. The code suggestions by the model were generally good, but sometimes subtly wrong. We still needed an understanding of the mechanics of fitting a neural network on tabular data in order to debug the code generated by the LLM, and to recognise and include any requirements missed in our original prompt. Overall, use of generative AI made the development of this notebook considerably faster.

On to the experiment on bias:

Experiment on Bias#

We run the training loop 50 times, fitting a model each time. We check whether each model predicts an average frequency that matches (1) the training set on which it was fitted and (2) the true underlying frequency. We resample the train/validation split each time so that the sampling does not skew the result.

results = []
weight_list = []

for i in range(0, num_models):
    # Resample - Split the dataset into training and validation sets
    train_data, val_data = train_test_split(data, test_size=0.2)
    
    for feat in categorical_feats:
        train_data[feat] = label_encoders[feat].transform(train_data[feat])
        val_data[feat] = label_encoders[feat].transform(val_data[feat])

    for feat in numerical_feats:
        scaler[feat] = MinMaxScaler()
        train_data[feat] = scaler[feat].fit_transform(train_data[feat].values.reshape(-1, 1))
        val_data[feat] = scaler[feat].transform(val_data[feat].values.reshape(-1, 1))

    # Convert the dataset to PyTorch tensors
    x_train_numerical = torch.tensor(train_data[numerical_feats].values, dtype=torch.float32)
    x_train_categorical = torch.tensor(train_data[categorical_feats].values, dtype=torch.long)
    # y_train = torch.tensor(train_data['claims'].values, dtype=torch.float32)
    y_train = torch.tensor(train_data['claims'].values / train_data['expo'].values, dtype=torch.float32)
    exposure_train = torch.tensor(train_data['expo'].values, dtype=torch.float32)

    x_val_numerical = torch.tensor(val_data[numerical_feats].values, dtype=torch.float32)
    x_val_categorical = torch.tensor(val_data[categorical_feats].values, dtype=torch.long)
    # y_val = torch.tensor(val_data['claims'].values, dtype=torch.float32)
    y_val = torch.tensor(val_data['claims'].values / val_data['expo'].values, dtype=torch.float32)
    exposure_val = torch.tensor(val_data['expo'].values, dtype=torch.float32)

    # Create an instance of the FeedForwardNet model
    model = FeedForwardNet(len(numerical_feats), len(categorical_feats), embedding_sizes, hidden_size, init_bias = np.log(avg_claim).astype(np.float32))

    criterion = nn.PoissonNLLLoss(reduction='none', log_input=False)

    # Define the optimizer
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # Train the model
    best_val_loss = np.inf
    best_epoch = 0
    for epoch in range(num_epochs):
        model.train()
        optimizer.zero_grad()
        y_pred = model(x_train_numerical, x_train_categorical)
        # loss = criterion(y_pred.squeeze() * exposure_train, y_train * exposure_train)
        loss = (criterion(y_pred.squeeze(), y_train) * exposure_train).sum() / exposure_train.sum()
        loss.backward()
        optimizer.step()
        
        model.eval()
        with torch.no_grad():
            y_val_pred = model(x_val_numerical, x_val_categorical)
            # val_loss = criterion(y_val_pred.squeeze() * exposure_val, y_val * exposure_val)
            val_loss = (criterion(y_val_pred.squeeze(), y_val) * exposure_val).sum() / exposure_val.sum()
        
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_epoch = epoch
            best_weights = {k: v.detach().clone() for k, v in model.state_dict().items()}  # deep copy so later epochs do not overwrite it
            torch.save(best_weights, 'best_model.pt')

            best_result = {
                "Epoch": epoch,
                "Train_Diff": ((y_pred.squeeze() * exposure_train - y_train * exposure_train).sum() / exposure_train.sum()).item(),            
                "Train_Mean": ((y_pred.squeeze() * exposure_train).sum() / exposure_train.sum()).item(), 
                "Train_Loss": loss.item(),
                "Val_mean": ((y_val_pred.squeeze() * exposure_val).sum() / exposure_val.sum()).item(), 
                "Val_loss": val_loss.item(),
                "Mean": 0.8*((y_pred.squeeze() * exposure_train).sum() / exposure_train.sum()).item() + 0.2*((y_val_pred.squeeze() * exposure_val).sum() / exposure_val.sum()).item()
            }
        
        if epoch - best_epoch >= early_stopping_epochs:
            results += [best_result]
            break

    weight_list.append(best_weights)
pd.DataFrame(results)
Epoch Train_Diff Train_Mean Train_Loss Val_mean Val_loss Mean
0 130 -0.000645 0.102177 0.325923 0.100916 0.326849 0.101925
1 112 0.001128 0.104126 0.326204 0.100970 0.325138 0.103495
2 100 -0.011181 0.091365 0.329277 0.104439 0.330272 0.093980
3 144 0.003659 0.106545 0.326212 0.101356 0.326794 0.105507
4 149 -0.001328 0.100622 0.323909 0.103777 0.334216 0.101253
5 143 -0.000898 0.101453 0.325552 0.102908 0.330316 0.101744
6 121 -0.002593 0.099565 0.324868 0.103678 0.332443 0.100387
7 99 0.004969 0.107395 0.325997 0.101939 0.330895 0.106304
8 175 -0.001557 0.101031 0.324581 0.102687 0.328301 0.101362
9 103 -0.001491 0.101537 0.326496 0.101486 0.325832 0.101527
10 129 -0.000352 0.101937 0.324772 0.103736 0.331179 0.102297
11 119 -0.000176 0.102137 0.325486 0.103809 0.330387 0.102471
12 100 -0.011435 0.091064 0.327402 0.103818 0.329969 0.093615
13 151 0.001976 0.105372 0.326752 0.100252 0.321554 0.104348
14 114 -0.012920 0.089249 0.327542 0.104257 0.332765 0.092251
15 119 -0.004242 0.098700 0.326383 0.101354 0.326429 0.099231
16 117 0.002508 0.104865 0.324898 0.101881 0.331432 0.104268
17 111 -0.001113 0.101655 0.326419 0.101905 0.327278 0.101705
18 86 0.019172 0.122249 0.330964 0.102479 0.325229 0.118295
19 170 0.000330 0.103443 0.326102 0.101849 0.324363 0.103124
20 136 -0.002997 0.099309 0.324692 0.103050 0.332011 0.100057
21 124 -0.002397 0.100309 0.326027 0.102895 0.327030 0.100826
22 127 0.001637 0.104221 0.325922 0.103401 0.328156 0.104057
23 109 0.003718 0.106884 0.327108 0.101984 0.324299 0.105904
24 104 0.006168 0.108760 0.326227 0.100372 0.330046 0.107083
25 124 -0.000329 0.102373 0.325716 0.102718 0.327641 0.102442
26 104 0.004465 0.107011 0.325303 0.101300 0.329909 0.105869
27 97 0.003505 0.106148 0.326087 0.102803 0.328730 0.105479
28 152 0.003044 0.106007 0.325944 0.101053 0.325613 0.105016
29 119 0.000471 0.103239 0.325658 0.101029 0.328178 0.102797
30 105 -0.010499 0.092104 0.327594 0.102481 0.329511 0.094180
31 101 0.000356 0.103143 0.326779 0.101786 0.328151 0.102872
32 97 -0.007540 0.094523 0.325637 0.106127 0.335037 0.096844
33 110 -0.007851 0.094789 0.326994 0.103555 0.327416 0.096542
34 130 0.007680 0.110865 0.327033 0.102342 0.324362 0.109160
35 106 -0.000806 0.101983 0.325870 0.102793 0.326325 0.102145
36 116 0.001485 0.104698 0.326662 0.101559 0.323815 0.104070
37 107 0.001997 0.104687 0.325665 0.104222 0.328122 0.104594
38 90 -0.002557 0.100565 0.327322 0.102357 0.325118 0.100923
39 141 0.003426 0.106498 0.325961 0.099360 0.324842 0.105071
40 121 -0.002862 0.099332 0.324686 0.103656 0.332309 0.100197
41 87 -0.008324 0.094025 0.327252 0.102607 0.331421 0.095741
42 119 -0.000314 0.101241 0.323135 0.107160 0.337909 0.102425
43 164 -0.000449 0.101789 0.324233 0.102688 0.331431 0.101968
44 136 -0.001592 0.100898 0.324840 0.102806 0.329460 0.101279
45 111 -0.003606 0.098748 0.325302 0.105494 0.330357 0.100097
46 203 -0.002093 0.100181 0.324759 0.105140 0.330792 0.101173
47 158 0.000963 0.103939 0.325994 0.101919 0.325458 0.103535
48 107 -0.008366 0.095289 0.330197 0.103185 0.320212 0.096868
49 120 -0.004596 0.097606 0.325099 0.105453 0.331423 0.099175

So, are our 50 neural networks making biased predictions vs their training sets?

pd.DataFrame(results).Train_Diff.plot.kde()
plt.axvline(x = 0.0, color = 'darkgreen')
<matplotlib.lines.Line2D at 0x2bf030070>
[Figure: kernel density estimate of Train_Diff across the 50 models, with a vertical line at zero]
bias_std = pd.DataFrame(results).Train_Diff.std()
pd.DataFrame(results).Train_Diff.mean(), pd.DataFrame(results).Train_Diff.std()
(-0.0008891218338976614, 0.005435967840328695)

Training runs do appear to match, on average, the mean frequency of the training dataset. However, early stopping leads to individual models that sometimes predict higher and sometimes lower.

How do the models go at matching the true and dataset means?

pd.DataFrame(results).Mean.plot.kde()
plt.axvline(x = avg_claim, color = 'darkgreen')
<matplotlib.lines.Line2D at 0x2bf0339d0>
[Figure: kernel density estimate of the Mean column across the 50 models, with a vertical line at the average claim frequency]

Models predict near the mean values, but there is some variability of the predictions around it.

Overall, with help from ChatGPT, we appear to have replicated the finding from the Wuthrich paper that early stopping can lead to biased models. On average, models predict near the correct average level, but individual models may vary around that level - they are biased. None of this is new or original, but ChatGPT was quite helpful in recreating this analysis in Python (Wuthrich appears to have used R).

Currently our average validation loss is:

avg_val_loss = pd.DataFrame(results).Val_loss.mean()
avg_val_loss
0.32853454232215884

Solving the bias issue#

Wuthrich suggests regularization approaches to reduce the bias. See Section 4 in his paper.

Wuthrich in his other work also discusses an ensembling approach in Section 5.1.6. Here the 50 models have some variability in their overall predicted levels, but on average they are right. So if we were to take the final model as the ensemble average of the 50 models, it would average out the bias issue - see the sketch after the calculation below.

# Average neural network prediction:
avg_nn_prediction = pd.DataFrame(results).Mean.mean()
# Average frequency in dataset
avg_frequency = data['claims'].values.sum() / data['expo'].values.sum()
# True underlying frequency 
true_frequency = (data['truefreq'] * data['expo']).values.sum() / data['expo'].values.sum()

print("Average neural network prediction:", avg_nn_prediction)
print("Average frequency in dataset:", avg_frequency)
print("True underlying frequency:", true_frequency)
print("Averaged prediction vs true frequency:", avg_nn_prediction / true_frequency)
Average neural network prediction: 0.10194956490397454
Average frequency in dataset: 0.10269062235781508
True underlying frequency: 0.10199107887858037
Averaged prediction vs true frequency: 0.9995929646488468
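As a minimal sketch of prediction-level ensembling, assuming the weight_list of best-model weights collected in the loop above and the test tensors built earlier (each run refitted its own scaler, so treat this as indicative rather than exact):

ensemble_preds = []
for weights in weight_list:
    model.load_state_dict(weights)  # reuse the same architecture
    model.eval()
    with torch.no_grad():
        ensemble_preds.append(model(x_test_numerical, x_test_categorical).squeeze())

# Averaging the 50 sets of predictions averages out the per-model bias
ensemble_pred = torch.stack(ensemble_preds).mean(dim=0)
(ensemble_pred * exposure_test).sum() / exposure_test.sum()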

Learn Rates#

Another idea from other neural network work is to apply a higher learn rate solely on the bias.

The idea is to allow the bias to converge on its true value faster, before the other weights begin to overfit, triggering the early stopping and halting training.

hidden_size = [20, 15, 10]  # Replace with the multi-layer parameters.

num_epochs = 99999  # should not be a factor, we train until early stopping kicks in
early_stopping_epochs = 10

learning_rate = 0.005      # Half base learning rate
bias_learning_rate = 0.05  # Increased bias learning rate

The training loop is updated with the optimizer applying this higher learning rate to the one bias value.

results = []
weight_list = []

for i in range(0, num_models):
    # Resample - Split the dataset into training and validation sets
    train_data, val_data = train_test_split(data, test_size=0.2)
    
    for feat in categorical_feats:
        train_data[feat] = label_encoders[feat].transform(train_data[feat])
        val_data[feat] = label_encoders[feat].transform(val_data[feat])

    for feat in numerical_feats:
        scaler[feat] = MinMaxScaler()
        train_data[feat] = scaler[feat].fit_transform(train_data[feat].values.reshape(-1, 1))
        val_data[feat] = scaler[feat].transform(val_data[feat].values.reshape(-1, 1))

    # Convert the dataset to PyTorch tensors
    x_train_numerical = torch.tensor(train_data[numerical_feats].values, dtype=torch.float32)
    x_train_categorical = torch.tensor(train_data[categorical_feats].values, dtype=torch.long)
    # y_train = torch.tensor(train_data['claims'].values, dtype=torch.float32)
    y_train = torch.tensor(train_data['claims'].values / train_data['expo'].values, dtype=torch.float32)
    exposure_train = torch.tensor(train_data['expo'].values, dtype=torch.float32)

    x_val_numerical = torch.tensor(val_data[numerical_feats].values, dtype=torch.float32)
    x_val_categorical = torch.tensor(val_data[categorical_feats].values, dtype=torch.long)
    # y_val = torch.tensor(val_data['claims'].values, dtype=torch.float32)
    y_val = torch.tensor(val_data['claims'].values / val_data['expo'].values, dtype=torch.float32)
    exposure_val = torch.tensor(val_data['expo'].values, dtype=torch.float32)

    # Create an instance of the FeedForwardNet model
    model = FeedForwardNet(len(numerical_feats), len(categorical_feats), embedding_sizes, hidden_size, init_bias = np.log(avg_claim).astype(np.float32))

    criterion = nn.PoissonNLLLoss(reduction='none', log_input=False)

    # Bias specific learn rates
    my_list = ['fc2.bias']
    bias_params = list(filter(lambda kv: kv[0] in my_list, model.named_parameters()))
    base_params = list(filter(lambda kv: kv[0] not in my_list, model.named_parameters()))

    # Define the optimizer
    # optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    optimizer = optim.Adam([
                {'params': [temp[1] for temp in base_params]},
                {'params': [temp[1] for temp in bias_params], 'lr': bias_learning_rate}
            ], lr=learning_rate)

    # Train the model
    best_val_loss = np.inf
    best_epoch = 0

    for epoch in range(num_epochs):
        model.train()
        optimizer.zero_grad()
        y_pred = model(x_train_numerical, x_train_categorical)
        # loss = criterion(y_pred.squeeze() * exposure_train, y_train * exposure_train)
        loss = (criterion(y_pred.squeeze(), y_train) * exposure_train).sum() / exposure_train.sum()
        loss.backward()
        optimizer.step()
        
        model.eval()
        with torch.no_grad():
            y_val_pred = model(x_val_numerical, x_val_categorical)
            # val_loss = criterion(y_val_pred.squeeze() * exposure_val, y_val * exposure_val)
            val_loss = (criterion(y_val_pred.squeeze(), y_val) * exposure_val).sum() / exposure_val.sum()
        
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_epoch = epoch
            best_weights = {k: v.detach().clone() for k, v in model.state_dict().items()}  # deep copy so later epochs do not overwrite it
            torch.save(best_weights, 'best_model.pt')

            best_result = {
                "Epoch": epoch,
                "Train_Diff": ((y_pred.squeeze() * exposure_train - y_train * exposure_train).sum() / exposure_train.sum()).item(),            
                "Train_Mean": ((y_pred.squeeze() * exposure_train).sum() / exposure_train.sum()).item(), 
                "Train_Loss": loss.item(),
                "Val_mean": ((y_val_pred.squeeze() * exposure_val).sum() / exposure_val.sum()).item(), 
                "Val_loss": val_loss.item(),
                "Mean": 0.8*((y_pred.squeeze() * exposure_train).sum() / exposure_train.sum()).item() + 0.2*((y_val_pred.squeeze() * exposure_val).sum() / exposure_val.sum()).item()
            }
        
        if epoch - best_epoch >= early_stopping_epochs:
            results += [best_result]            
            break

    weight_list.append(best_weights)        
# check we have the right parameter
bias_params
[('fc2.bias',
  Parameter containing:
  tensor(-2.1823, requires_grad=True))]

So how do results look with this?

pd.DataFrame(results)
Epoch Train_Diff Train_Mean Train_Loss Val_mean Val_loss Mean
0 205 0.002327 0.104974 0.325422 0.101117 0.328724 0.104203
1 168 0.000035 0.102497 0.325165 0.101048 0.330128 0.102208
2 208 0.001107 0.103627 0.324849 0.102306 0.330064 0.103363
3 194 -0.002439 0.100004 0.325384 0.104841 0.329106 0.100971
4 216 -0.002707 0.099631 0.324795 0.104469 0.329759 0.100599
5 155 -0.004103 0.098337 0.325339 0.103473 0.330643 0.099364
6 173 0.001059 0.104135 0.326621 0.102555 0.323889 0.103819
7 239 0.000795 0.103905 0.325969 0.101931 0.323872 0.103510
8 219 -0.000637 0.102157 0.325523 0.102171 0.325915 0.102160
9 168 0.002321 0.105407 0.326163 0.101906 0.324060 0.104707
10 143 0.000277 0.103440 0.327099 0.103935 0.323553 0.103539
11 138 -0.003321 0.099526 0.326536 0.104895 0.325924 0.100600
12 139 -0.002427 0.099488 0.324715 0.105711 0.334265 0.100732
13 155 -0.002495 0.100064 0.325782 0.104031 0.328607 0.100857
14 176 -0.000415 0.102324 0.326006 0.103674 0.326932 0.102594
15 208 0.002315 0.104998 0.325499 0.102390 0.327024 0.104476
16 214 0.001363 0.104500 0.326591 0.102777 0.324041 0.104155
17 150 0.001920 0.104772 0.326084 0.101824 0.325707 0.104182
18 160 -0.002173 0.100638 0.326172 0.102994 0.325890 0.101109
19 183 0.000192 0.103026 0.325667 0.101727 0.326649 0.102766
20 180 -0.000990 0.101203 0.325036 0.103650 0.331248 0.101692
21 197 -0.001338 0.100844 0.324968 0.104099 0.331851 0.101495
22 164 0.001723 0.104569 0.326951 0.100148 0.328108 0.103685
23 207 0.003274 0.105323 0.323811 0.103895 0.333068 0.105037
24 166 0.002680 0.105871 0.326872 0.101878 0.322789 0.105072
25 180 0.000940 0.103496 0.325236 0.103305 0.328780 0.103458
26 201 -0.000276 0.102151 0.324925 0.101921 0.329467 0.102105
27 177 -0.000592 0.102395 0.326053 0.101323 0.325711 0.102180
28 215 0.002475 0.105416 0.326046 0.100961 0.325818 0.104525
29 171 -0.001479 0.100942 0.325523 0.103197 0.330104 0.101393
30 230 -0.000927 0.101362 0.324933 0.103380 0.330297 0.101765
31 162 0.001179 0.103946 0.326290 0.102380 0.327654 0.103633
32 183 0.002927 0.105697 0.325705 0.101642 0.327327 0.104886
33 194 0.001614 0.103999 0.325319 0.102904 0.330006 0.103780
34 170 0.002754 0.105223 0.325185 0.101073 0.330482 0.104393
35 144 -0.002177 0.100043 0.325104 0.102772 0.331401 0.100589
36 212 0.000414 0.102930 0.325115 0.101467 0.328294 0.102637
37 173 -0.000467 0.102793 0.326599 0.103692 0.322864 0.102973
38 200 -0.001061 0.101328 0.325358 0.102929 0.329955 0.101648
39 196 -0.002276 0.100090 0.325243 0.103280 0.330101 0.100728
40 201 0.000618 0.103018 0.324686 0.101520 0.331242 0.102718
41 188 0.001549 0.104363 0.325910 0.102523 0.326799 0.103995
42 128 0.002517 0.105386 0.326826 0.101684 0.326658 0.104645
43 150 0.002720 0.106185 0.327632 0.101504 0.321514 0.105249
44 202 -0.000765 0.101599 0.325121 0.103453 0.330568 0.101969
45 195 -0.007928 0.094683 0.326373 0.103778 0.328316 0.096502
46 179 -0.002498 0.099990 0.325483 0.103390 0.329075 0.100670
47 177 -0.002789 0.099647 0.325664 0.103856 0.328823 0.100489
48 163 0.002125 0.104693 0.325735 0.102967 0.327648 0.104348
49 237 -0.000325 0.102494 0.325931 0.104136 0.325074 0.102823
pd.DataFrame(results).Train_Diff.plot.kde()
plt.axvline(x = 0.0, color = 'darkgreen')
<matplotlib.lines.Line2D at 0x2c01e7e50>
[Figure: kernel density estimate of Train_Diff across the 50 models with the bias-specific learn rate, with a vertical line at zero]
new_bias_std = pd.DataFrame(results).Train_Diff.std()
pd.DataFrame(results).Train_Diff.mean(), pd.DataFrame(results).Train_Diff.std()
(-6.771093656425364e-05, 0.0022470243173901318)

Compare the standard deviation of the bias under the bias-specific learning rate with that under a uniform learning rate:

new_bias_std / bias_std
0.41336232726024763

We can see from the above result that the increased bias-specific learn rate was successful in reducing the variance of the overall model bias.

pd.DataFrame(results).Mean.plot.kde()
plt.axvline(x = avg_claim, color = 'darkgreen')
<matplotlib.lines.Line2D at 0x2c4911a80>
[Figure: kernel density estimate of the Mean column under the bias-specific learn rate, with a vertical line at the average claim frequency]
(
    # Average neural network prediction:
    pd.DataFrame(results).Mean.mean(), 
    # Average frequency in dataset
    data['claims'].values.sum() / data['expo'].values.sum(),
    # True underlying frequency 
    (data['truefreq'] * data['expo']).values.sum() / data['expo'].values.sum(),
)
(0.10261997532844544, 0.10269062235781508, 0.10199107887858037)

Predicted mean frequencies for the models now track the training data mean more consistently.

Finally, the validation loss. Unlike regularisation, which is known to shape the segmentation predictions, conceptually a higher learn rate for the bias should just mean the training loop focuses on getting a more aligned bias estimate. Does this come through in our result?

fast_bias_val_loss = pd.DataFrame(results).Val_loss.mean()
fast_bias_val_loss
0.32791590332984927
fast_bias_val_loss - avg_val_loss
-0.0006186389923095725

The experiments are random, so results may vary on a re-run, but for this run the validation loss was similar.

Let us look at the diagnostics again. These will be for the last model of the batch that was trained.

# Load the best model in the last run
model.load_state_dict(torch.load('best_model.pt'))

y_val_pred = model(x_val_numerical, x_val_categorical)

val_data["predicted"] = y_val_pred.cpu().detach().numpy().ravel() * val_data.expo
val_data["trueclaims"] = val_data.truefreq * val_data.expo

fig, axs = plt.subplots(len(numerical_feats) + 1, sharex=False, sharey=True, figsize=(7, 40))

for i, f in enumerate(["predicted"] + numerical_feats):
    dat_copy = val_data.copy()
    dat_copy["decile"] = pd.qcut(dat_copy[f], 10, labels=False, duplicates='drop')
    X_sum = dat_copy.groupby("decile").agg("sum").reset_index()
        
    axs[i].plot(X_sum.index, X_sum.trueclaims / X_sum.expo)
    axs[i].plot(X_sum.index, X_sum.predicted / X_sum.expo)
    axs[i].set_title(f)


for i, f in enumerate(categorical_feats):
    dat_copy = val_data.copy()
    X_sum = dat_copy.groupby(f).agg("sum")[["trueclaims", "predicted", "expo"]]
    X_sum["trueclaims"] = X_sum["trueclaims"]/X_sum["expo"]
    X_sum["predicted"] = X_sum["predicted"]/X_sum["expo"]
    X_sum = X_sum.drop(columns="expo")
    axs[i] = X_sum.plot(kind='bar', rot=0, xlabel=f, ylabel='Value', title=f)
[Figures: actual vs predicted frequency by decile for the prediction and each numerical feature, and by level for each categorical feature]

Diagnostics still look good.

Just change the bias#

In Wuthrich 2023, he proposes (Section 5.1.5) solving the bias issue by taking the fitted network's final hidden layer and training a GLM on its activations to full convergence, in place of the network's output layer.
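A rough sketch of our reading of that idea - using scikit-learn's PoissonRegressor as the GLM, which is our assumption rather than Wuthrich's implementation - would look like this:

from sklearn.linear_model import PoissonRegressor

def last_hidden_activations(model, x_numerical, x_categorical):
    # Repeat the model's forward pass, stopping before the output layer
    embedded = torch.cat(
        [emb(x_categorical[:, i]) for i, emb in enumerate(model.embeddings)], dim=1
    )
    x = torch.cat([embedded, x_numerical], dim=1)
    for hidden_layer in model.hidden_layers:
        x = torch.relu(hidden_layer(x))
    return x

with torch.no_grad():
    z_train = last_hidden_activations(model, x_train_numerical, x_train_categorical).numpy()

# A log-link Poisson GLM trained to full convergence on these features
# satisfies the balance property on the training data
glm = PoissonRegressor(alpha=0.0, max_iter=1000)
glm.fit(z_train, y_train.numpy(), sample_weight=exposure_train.numpy())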

Finally, if an unbiased model with early stopping is a must-have, PyTorch gives us full control over the training loop.

We could simply reset the bias after every epoch to the value that keeps the predicted mean equal to the mean of the training data.

# Define the hyperparameters
hidden_size = [20, 15, 10]  # Replace with the multi-layer parameters.
learning_rate = 0.01
num_epochs = 9999  # should not be a factor, we train until early stopping kicks in
early_stopping_epochs = 10
results = []
weight_list = []

for i in range(0, num_models):
    # Resample - Split the dataset into training and validation sets
    train_data, val_data = train_test_split(data, test_size=0.2)
    
    for feat in categorical_feats:
        train_data[feat] = label_encoders[feat].transform(train_data[feat])
        val_data[feat] = label_encoders[feat].transform(val_data[feat])

    for feat in numerical_feats:
        scaler[feat] = MinMaxScaler()
        train_data[feat] = scaler[feat].fit_transform(train_data[feat].values.reshape(-1, 1))
        val_data[feat] = scaler[feat].transform(val_data[feat].values.reshape(-1, 1))

    # Convert the dataset to PyTorch tensors
    x_train_numerical = torch.tensor(train_data[numerical_feats].values, dtype=torch.float32)
    x_train_categorical = torch.tensor(train_data[categorical_feats].values, dtype=torch.long)

    y_train = torch.tensor(train_data['claims'].values / train_data['expo'].values, dtype=torch.float32)
    exposure_train = torch.tensor(train_data['expo'].values, dtype=torch.float32)

    x_val_numerical = torch.tensor(val_data[numerical_feats].values, dtype=torch.float32)
    x_val_categorical = torch.tensor(val_data[categorical_feats].values, dtype=torch.long)

    y_val = torch.tensor(val_data['claims'].values / val_data['expo'].values, dtype=torch.float32)
    exposure_val = torch.tensor(val_data['expo'].values, dtype=torch.float32)

    # Create an instance of the FeedForwardNet model
    model = FeedForwardNet(len(numerical_feats), len(categorical_feats), embedding_sizes, hidden_size, init_bias = np.log(avg_claim).astype(np.float32))

    criterion = nn.PoissonNLLLoss(reduction='none', log_input=False)

    # Define the optimizer
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # Train the model
    best_val_loss = np.inf
    best_epoch = 0
    for epoch in range(num_epochs):
        model.train()
        optimizer.zero_grad()
        y_pred = model(x_train_numerical, x_train_categorical)
        loss = (criterion(y_pred.squeeze(), y_train) * exposure_train).sum() / exposure_train.sum()
        loss.backward()
        optimizer.step()     
        model.eval()

        # Adjust the bias each epoch
        with torch.no_grad():
            # Get predictions
            y_pred = model(x_train_numerical, x_train_categorical)

            # Get adjustment: the log-space shift that rescales predictions so
            # the exposure-weighted predicted and actual totals match
            adjustment = torch.log((y_train * exposure_train).sum()) - torch.log((y_pred.squeeze() * exposure_train).sum())

        model.fc2.bias.data += adjustment 

        # Adjusted y_pred
        y_pred = model(x_train_numerical, x_train_categorical)

        with torch.no_grad():
            y_val_pred = model(x_val_numerical, x_val_categorical)
            val_loss = (criterion(y_val_pred.squeeze(), y_val) * exposure_val).sum() / exposure_val.sum()
        
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_epoch = epoch
            best_weights = {k: v.detach().clone() for k, v in model.state_dict().items()}  # deep copy so later epochs do not overwrite it
            torch.save(best_weights, 'best_model.pt')

            best_result = {
                "Epoch": epoch,
                "Train_Diff": ((y_pred.squeeze() * exposure_train - y_train * exposure_train).sum() / exposure_train.sum()).item(),            
                "Train_Mean": ((y_pred.squeeze() * exposure_train).sum() / exposure_train.sum()).item(), 
                "Train_Loss": loss.item(),
                "Val_mean": ((y_val_pred.squeeze() * exposure_val).sum() / exposure_val.sum()).item(), 
                "Val_loss": val_loss.item(),
                "Mean": 0.8*((y_pred.squeeze() * exposure_train).sum() / exposure_train.sum()).item() + 0.2*((y_val_pred.squeeze() * exposure_val).sum() / exposure_val.sum()).item()
            }
        
        if epoch - best_epoch >= early_stopping_epochs:
            results += [best_result]
            break

    weight_list.append(best_weights)
pd.DataFrame(results)
Epoch Train_Diff Train_Mean Train_Loss Val_mean Val_loss Mean
0 122 2.392140e-09 0.102556 0.324907 0.102498 0.328764 0.102544
1 97 -2.860957e-08 0.102573 0.325711 0.102370 0.329056 0.102533
2 103 -1.143877e-08 0.102153 0.324566 0.102151 0.332670 0.102152
3 100 2.330167e-08 0.103220 0.326741 0.103121 0.323224 0.103200
4 114 1.104944e-08 0.102778 0.326044 0.103099 0.326367 0.102842
5 137 -3.339519e-08 0.102181 0.324399 0.102149 0.333316 0.102175
6 105 5.588781e-08 0.102945 0.326395 0.103071 0.325628 0.102970
7 151 -4.614875e-08 0.102484 0.324409 0.102470 0.329613 0.102481
8 97 4.902244e-10 0.102397 0.325335 0.101848 0.329740 0.102287
9 109 -5.120140e-08 0.102658 0.325921 0.102671 0.328305 0.102660
10 116 -8.096723e-09 0.103427 0.326815 0.103234 0.321770 0.103388
11 101 -2.164153e-08 0.102033 0.324435 0.101915 0.333433 0.102010
12 117 -3.639235e-08 0.102620 0.325797 0.102445 0.328296 0.102585
13 112 -1.901544e-08 0.102983 0.326010 0.103124 0.324312 0.103011
14 103 -6.258782e-08 0.102355 0.324778 0.102871 0.330884 0.102458
15 110 3.401983e-08 0.102748 0.325920 0.102963 0.325666 0.102791
16 111 -1.676353e-08 0.102496 0.325475 0.102318 0.328534 0.102460
17 91 7.000872e-08 0.102985 0.326366 0.103263 0.324911 0.103040
18 103 -4.030646e-08 0.102205 0.324718 0.102024 0.331843 0.102169
19 99 4.145939e-08 0.102938 0.326770 0.103006 0.324792 0.102952
20 97 -1.510715e-08 0.102897 0.326182 0.102669 0.326991 0.102852
21 131 -5.966714e-08 0.101933 0.323772 0.101948 0.334276 0.101936
22 120 -2.709646e-08 0.102465 0.325010 0.102429 0.330330 0.102458
23 105 -2.102355e-08 0.102337 0.325117 0.102071 0.331128 0.102284
24 108 -4.034199e-08 0.102668 0.326064 0.102556 0.327703 0.102646
25 119 -2.227432e-08 0.103057 0.326187 0.102746 0.325391 0.102994
26 101 2.083051e-08 0.102695 0.326309 0.102215 0.327150 0.102599
27 121 1.541380e-08 0.102893 0.326008 0.103202 0.325928 0.102955
28 113 3.449376e-08 0.102807 0.325954 0.102774 0.325762 0.102800
29 126 2.174281e-08 0.103070 0.326702 0.103266 0.324199 0.103109
30 117 -1.656592e-08 0.102850 0.326272 0.102875 0.327013 0.102855
31 129 6.832808e-08 0.102650 0.325482 0.102521 0.327829 0.102624
32 77 -1.616476e-08 0.102999 0.326902 0.103152 0.326036 0.103029
33 88 5.655543e-09 0.102623 0.326193 0.102884 0.329003 0.102675
34 162 -2.356158e-09 0.102348 0.324156 0.102377 0.331616 0.102353
35 86 4.196311e-08 0.102868 0.327524 0.103084 0.326439 0.102911
36 89 -5.090045e-08 0.101892 0.324506 0.102245 0.334591 0.101962
37 108 2.277418e-08 0.102124 0.324321 0.102444 0.332908 0.102188
38 116 6.947347e-08 0.102534 0.325019 0.102408 0.329999 0.102509
39 80 7.592326e-08 0.102538 0.326241 0.102415 0.328567 0.102514
40 111 -1.581654e-08 0.102781 0.326168 0.102739 0.326581 0.102772
41 98 5.042758e-08 0.102842 0.326228 0.102668 0.326550 0.102807
42 87 8.698981e-08 0.102296 0.325354 0.101990 0.332758 0.102235
43 123 6.307235e-08 0.103129 0.326391 0.102828 0.323361 0.103069
44 102 2.011095e-08 0.102592 0.325649 0.102498 0.328519 0.102573
45 133 4.943590e-08 0.103334 0.326803 0.103473 0.321474 0.103362
46 96 5.013611e-08 0.102532 0.325522 0.102558 0.328560 0.102537
47 148 -5.638632e-08 0.102877 0.326021 0.102527 0.325705 0.102807
48 95 -4.618958e-08 0.102155 0.325058 0.102273 0.332594 0.102179
49 131 4.888754e-08 0.102431 0.325105 0.102663 0.329156 0.102477

So, as expected, Train_Diff is now zero (up to floating-point error) - the predicted mean for all our models matches the training data mean exactly, because we set it that way.

But do we lose any predictive power from this?

fixed_bias_val_loss = pd.DataFrame(results).Val_loss.mean()
print(fixed_bias_val_loss, avg_val_loss, fixed_bias_val_loss - avg_val_loss)
0.32818478763103487 0.32853454232215884 -0.0003497546911239713

The average validation loss is in fact slightly lower than before, so we do not appear to. Let us also look at some diagnostics:

# Load the best model in the last run
model.load_state_dict(torch.load('best_model.pt'))

y_val_pred = model(x_val_numerical, x_val_categorical)

val_data["predicted"] = y_val_pred.cpu().detach().numpy().ravel() * val_data.expo
val_data["trueclaims"] = val_data.truefreq * val_data.expo

fig, axs = plt.subplots(len(numerical_feats) + 1, sharex=False, sharey=True, figsize=(7, 40))

for i, f in enumerate(["predicted"] + numerical_feats):
    dat_copy = val_data.copy()
    dat_copy["decile"] = pd.qcut(dat_copy[f], 10, labels=False, duplicates='drop')
    X_sum = dat_copy.groupby("decile").agg("sum").reset_index()
        
    axs[i].plot(X_sum.index, X_sum.trueclaims / X_sum.expo)
    axs[i].plot(X_sum.index, X_sum.predicted / X_sum.expo)
    axs[i].set_title(f)


for i, f in enumerate(categorical_feats):
    dat_copy = val_data.copy()
    X_sum = dat_copy.groupby(f).agg("sum")[["trueclaims", "predicted", "expo"]]
    X_sum["trueclaims"] = X_sum["trueclaims"]/X_sum["expo"]
    X_sum["predicted"] = X_sum["predicted"]/X_sum["expo"]
    X_sum = X_sum.drop(columns="expo")
    axs[i] = X_sum.plot(kind='bar', rot=0, xlabel=f, ylabel='Value', title=f)
[Figures: actual vs predicted frequency by decile for the prediction and each numerical feature, and by level for each categorical feature]

Conclusions#

  • ChatGPT is an effective tool for accelerating actuarial data science programming. However, users need to have sufficient understanding to debug the resulting generated code, and identify any subtle logic errors, additional requirements, or best practices.

  • Early stopping does lead to models that are biased. The bias itself is not biased - it averages around the mean - but any given model may be biased.

We propose other ideas to address this, beyond the solutions proposed by Wuthrich:

  • Ensemble a group of models, which will average out biases.

  • Increase the learning rate on the bias to ensure that it remains close to the right level. Note that our initialisation strategy means the bias already starts at the right level.

  • Reset the bias after each epoch so the predicted mean reconciles to the training data mean.

Appendix: Unedited ChatGPT code output#

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error

# Define the neural network class
class FeedForwardNet(nn.Module):
    def __init__(self, num_numerical_feats, num_categorical_feats, embedding_sizes, hidden_size):
        super(FeedForwardNet, self).__init__()
        self.embeddings = nn.ModuleList([
            nn.Embedding(num_classes, emb_size) for num_classes, emb_size in embedding_sizes
        ])
        self.num_numerical_feats = num_numerical_feats
        self.num_categorical_feats = num_categorical_feats
        self.total_embed_size = sum([emb_size for _, emb_size in embedding_sizes])
        self.input_size = self.num_numerical_feats + self.total_embed_size
        self.fc1 = nn.Linear(self.input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 1)
        self.dropout = nn.Dropout(p=0.2)
        
    def forward(self, x_numerical, x_categorical):
        embedded_x = [embedding(x_categorical[:, i]) for i, embedding in enumerate(self.embeddings)]
        embedded_x = torch.cat(embedded_x, dim=1)
        x = torch.cat([embedded_x, x_numerical], dim=1)
        x = self.dropout(x)
        x = torch.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Load the dataset
data = pd.read_csv('your_dataset.csv')

# Split the dataset into training and validation sets
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)

# Define the categorical features and numerical features
categorical_feats = ['gas', 'brand', 'area', 'ct']
numerical_feats = ['age', 'ac', 'power', 'dens']

# Encode the categorical features using LabelEncoder
label_encoders = {}
for feat in categorical_feats:
    label_encoders[feat] = LabelEncoder()
    train_data[feat] = label_encoders[feat].fit_transform(train_data[feat])
    val_data[feat] = label_encoders[feat].transform(val_data[feat])

# Define the embedding sizes for categorical features
embedding_sizes = [(len(train_data[feat].unique()), min(50, (len(train_data[feat].unique()) + 1) // 2)) for feat in categorical_feats]

# Convert the dataset to PyTorch tensors
x_train_numerical = torch.tensor(train_data[numerical_feats].values, dtype=torch.float32)
x_train_categorical = torch.tensor(train_data[categorical_feats].values, dtype=torch.long)
y_train = torch.tensor(train_data['claims'].values, dtype=torch.float32)
exposure_train = torch.tensor(train_data['expo'].values, dtype=torch.float32)

x_val_numerical = torch.tensor(val_data[numerical_feats].values, dtype=torch.float32)
x_val_categorical = torch.tensor(val_data[categorical_feats].values, dtype=torch.long)
y_val = torch.tensor(val_data['claims'].values, dtype=torch.float32)
exposure_val = torch.tensor(val_data['expo'].values, dtype=torch.float32)

# Define the hyperparameters
hidden_size = 64
learning_rate = 0.001
batch_size = 32
num_epochs = 100
early_stopping_epochs = 10

# Create an instance of the FeedForwardNet model
model = FeedForwardNet(len(numerical_feats), len(categorical_feats), embedding_sizes, hidden_size)

# Define the loss function
criterion = nn.MSELoss(reduction='sum')

# Define the optimizer
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Train the model
best_val_loss = np.inf
best_epoch = 0
for epoch in range(num_epochs):
    model.train()
    optimizer.zero_grad()
    y_pred = model(x_train_numerical, x_train_categorical)
    loss = criterion(y_pred.squeeze() * exposure_train, y_train * exposure_train)
    loss.backward()
    optimizer.step()
    
    model.eval()
    with torch.no_grad():
        y_val_pred = model(x_val_numerical, x_val_categorical)
        val_loss = criterion(y_val_pred.squeeze() * exposure_val, y_val * exposure_val)
    
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_epoch = epoch
        torch.save(model.state_dict(), 'best_model.pt')
    
    if epoch - best_epoch >= early_stopping_epochs:
        break

# Load the best model
model.load_state_dict(torch.load('best_model.pt'))

# Make predictions on the test set
x_test_numerical = torch.tensor(test_data[numerical_feats].values, dtype=torch.float32)
x_test_categorical = torch.tensor(test_data[categorical_feats].values, dtype=torch.long)
y_test = torch.tensor(test_data['claims'].values, dtype=torch.float32)
exposure_test = torch.tensor(test_data['expo'].values, dtype=torch.float32)

model.eval()
with torch.no_grad():
    y_test_pred = model(x_test_numerical, x_test_categorical)

# Calculate the test loss
test_loss = criterion(y_test_pred.squeeze() * exposure_test, y_test * exposure_test)
test_rmse = torch.sqrt(test_loss.item() / len(test_data))

print(f'Test RMSE: {test_rmse:.4f}')