Fraud Detection for Automobile Claims: Data Exploration


This notebook is the first in a series that demonstrates how to prepare, train, and deploy a model that detects fraudulent auto claims. In this notebook, we focus on data exploration. You can run this notebook by itself or in sequence with the other notebooks listed below. Please see the accompanying documentation for more information about the use case implemented by this series of notebooks.

  1. `Fraud Detection for Automobile Claims: Data Exploration <./0-AutoClaimFraudDetection.ipynb>`__

  2. Fraud Detection for Automobile Claims: Data Preparation, Process, and Store Features

  3. Fraud Detection for Automobile Claims: Train, Check Bias, Tune, Record Lineage, and Register a Model

  4. Fraud Detection for Automobile Claims: Mitigate Bias, Train, Register, and Deploy Unbiased Model

Datasets and Exploratory Visualizations

The data is synthetically generated and consists of a customers dataset and a claims dataset. Here we load both and create some exploratory visualizations.

[ ]:
!pip install seaborn==0.11.1
[ ]:
# Importing required libraries.
import pandas as pd
import numpy as np
import seaborn as sns  # visualisation
import matplotlib.pyplot as plt  # visualisation

%matplotlib inline

df_claims = pd.read_csv("./data/claims_preprocessed.csv", index_col=0)
df_customers = pd.read_csv("./data/customers_preprocessed.csv", index_col=0)
[ ]:
# check for missing values in both datasets
df_claims.isnull().sum().sum(), df_customers.isnull().sum().sum()

This should return no null values for either dataset.

[ ]:
# plot the bar graph of customer gender
# (assumes a binary customer_gender_female column: 0 = male, 1 = female)
df_customers.customer_gender_female.value_counts().sort_index().plot(kind="bar")
plt.xticks([0, 1], ["Male", "Female"]);

The dataset is heavily weighted towards male customers.

[ ]:
# plot the bar graph of fraudulent claims
df_claims.fraud.value_counts().sort_index().plot(kind="bar")
plt.xticks([0, 1], ["Not Fraud", "Fraud"]);

The overwhelming majority of claims are legitimate (i.e., not fraudulent).
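The degree of class imbalance is easy to quantify with `value_counts(normalize=True)`. A minimal sketch on a synthetic stand-in for the `fraud` label column (the real values live in `df_claims.fraud`):

```python
import pandas as pd

# synthetic stand-in for the fraud label column (0 = legitimate, 1 = fraud)
fraud = pd.Series([0] * 97 + [1] * 3)

# normalize=True turns raw counts into proportions per class
proportions = fraud.value_counts(normalize=True)
print(proportions)
```

Knowing the exact imbalance matters later, since a naive model can score high accuracy by always predicting "not fraud".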

[ ]:
# plot the education categories
educ = df_customers.customer_education.value_counts(normalize=True, sort=False)
plt.bar(educ.index, educ.values)
plt.xlabel("Customer Education Level");
[ ]:
# plot the total claim amounts
plt.hist(df_claims.total_claim_amount, bins=30)
plt.xlabel("Total Claim Amount")

The majority of total claim amounts are under $25,000.
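A claim like "the majority are under $25,000" can be checked directly: the mean of a boolean Series is the fraction of rows satisfying the condition. A sketch on synthetic amounts (the real values live in `df_claims.total_claim_amount`):

```python
import pandas as pd

# synthetic stand-in for the total_claim_amount column
amounts = pd.Series([5_000, 12_000, 18_000, 24_000, 30_000, 60_000])

# averaging a boolean Series gives the fraction of True values
share_under_25k = (amounts < 25_000).mean()
print(share_under_25k)
```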

[ ]:
# plot the number of claims filed in the past year
# (assumes a num_claims_past_year column in the customers dataset)
plt.hist(df_customers.num_claims_past_year, bins=8)
plt.suptitle("Number of Claims in the Past Year")
plt.xlabel("Number of claims per year")

Most customers did not file any claims in the previous year, but some filed as many as 7 claims.

[ ]:
# pairwise scatter plots of tenure-related customer features
sns.pairplot(
    data=df_customers, vars=["num_insurers_past_5_years", "months_as_customer", "customer_age"]
);

Understandably, months_as_customer and customer_age are correlated with each other. A younger person has been driving for less time and therefore has less potential time to have been a customer.

We can also see that num_insurers_past_5_years is negatively correlated with months_as_customer: a customer who frequently switched insurers has probably spent less time with this particular insurer.
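The sign of that relationship can be read off numerically with `DataFrame.corr()` rather than eyeballed from the pair plot. A sketch on synthetic data shaped like the relationship described above:

```python
import pandas as pd

# synthetic illustration: more insurers in the past 5 years,
# fewer months as a customer of this insurer
df = pd.DataFrame(
    {
        "num_insurers_past_5_years": [1, 1, 2, 3, 4, 5],
        "months_as_customer": [120, 96, 60, 36, 24, 6],
    }
)

# Pearson correlation; a value near -1 confirms the negative relationship
r = df.corr().loc["num_insurers_past_5_years", "months_as_customer"]
print(r)
```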

[ ]:
df_combined = df_customers.join(df_claims)
sns.lineplot(x="num_insurers_past_5_years", y="fraud", data=df_combined);

Fraud is positively correlated with having a greater number of insurers over the past 5 years: customers who switched insurers more frequently also show a higher prevalence of fraud.
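Since `fraud` is a 0/1 label, the per-group mean is exactly the fraud rate, which is what the line plot above is showing. A sketch of the same computation with `groupby` on synthetic data:

```python
import pandas as pd

# synthetic example: fraud rate rises with the number of past insurers
df = pd.DataFrame(
    {
        "num_insurers_past_5_years": [1, 1, 1, 1, 2, 2, 3, 3, 4, 4],
        "fraud": [0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
    }
)

# mean of a 0/1 label within each group is that group's fraud rate
fraud_rate = df.groupby("num_insurers_past_5_years")["fraud"].mean()
print(fraud_rate)
```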

[ ]:
# plot the distribution of customer ages
plt.hist(df_customers.customer_age, bins=30)
plt.xlabel("Customer Age");

Our customers range from 18 to 75 years old.

[ ]:
# plot fraud rate by gender
# (assumes a binary customer_gender_female column: 0 = male, 1 = female)
sns.barplot(x="customer_gender_female", y="fraud", data=df_combined)
plt.xticks([0, 1], ["Male", "Female"])
plt.suptitle("Fraud by Gender");

Fraudulent claims come disproportionately from male customers.
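`pd.crosstab` with `normalize="index"` makes this kind of disproportion precise, giving the fraud rate within each gender group. A sketch on synthetic data, assuming the same 0/1 gender encoding used in the plots above:

```python
import pandas as pd

# synthetic example with a 0/1 gender indicator (0 = male, 1 = female)
df = pd.DataFrame(
    {
        "customer_gender_female": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
        "fraud": [1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
    }
)

# normalize="index" converts counts into within-group rates
rates = pd.crosstab(df["customer_gender_female"], df["fraud"], normalize="index")
print(rates)
```

This kind of group-wise rate is also the starting point for the bias checks in the later notebooks of this series.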

[ ]:
# Creating a correlation matrix of fraud, gender, months as customer, and number of different insurers
# (the gender column name is assumed to be the one-hot customer_gender_female)
cols = ["fraud", "customer_gender_female", "months_as_customer", "num_insurers_past_5_years"]
corr = df_combined[cols].corr()

# plot the correlation matrix
sns.heatmap(corr, annot=True, cmap="Reds");

Fraud is positively correlated with having more insurers in the past 5 years, and negatively correlated with being a customer for a longer period of time. Taken together, these suggest that long-time customers are less likely to commit fraud.
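Instead of reading these signs off the heatmap, you can extract the `fraud` row of the correlation matrix and sort it, ranking features by their relationship to the target. A sketch on a synthetic combined frame:

```python
import pandas as pd

# synthetic combined frame; the real column names come from the datasets above
df = pd.DataFrame(
    {
        "fraud": [0, 0, 0, 1, 1, 1],
        "num_insurers_past_5_years": [1, 1, 2, 3, 4, 5],
        "months_as_customer": [120, 96, 60, 36, 24, 6],
    }
)

# correlation of each feature with the target, most negative first
fraud_corr = df.corr()["fraud"].drop("fraud").sort_values()
print(fraud_corr)
```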

Combined Datasets

We have been looking at the individual datasets; now let’s look at their combined view (a join of the two).
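The combination is an index-aligned join, as in the `df_customers.join(df_claims)` call earlier. A minimal sketch of what that produces, using synthetic frames keyed by a shared customer index:

```python
import pandas as pd

# synthetic stand-ins for the customers and claims datasets,
# sharing a customer-id index
customers = pd.DataFrame({"customer_age": [34, 52]}, index=[101, 102])
claims = pd.DataFrame({"total_claim_amount": [4_200, 18_500]}, index=[101, 102])

# DataFrame.join aligns rows on the index (a left join by default)
combined = customers.join(claims)
print(combined)
```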

[ ]:
import pandas as pd

df_combined = pd.read_csv("./data/claims_customer.csv")
[ ]:
# drop the unwanted index column left over from the CSV export
df_combined = df_combined.loc[:, ~df_combined.columns.str.contains("^Unnamed: 0")]
[ ]:

Let’s explore the number of unique values, the percentage of missing values, and the size of the largest category for each feature in the combined dataset.

[ ]:
combined_stats = []

for col in df_combined.columns:
    combined_stats.append(
        (
            col,
            df_combined[col].nunique(),
            df_combined[col].isnull().sum() * 100 / df_combined.shape[0],
            df_combined[col].value_counts(normalize=True, dropna=False).values[0] * 100,
            df_combined[col].dtype,
        )
    )

stats_df = pd.DataFrame(
    combined_stats,
    columns=["feature", "unique_values", "percent_missing", "percent_largest_category", "datatype"],
)
stats_df.sort_values("percent_largest_category", ascending=False)
[ ]:
import matplotlib.pyplot as plt
import numpy as np


# correlate all numeric columns of the combined dataset
corr_list = df_combined.select_dtypes(include=np.number).columns.tolist()

corr_df = df_combined[corr_list]
corr = round(corr_df.corr(), 2)

fig, ax = plt.subplots(figsize=(15, 15))

# mask the upper triangle so each pairwise correlation is shown only once
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

ax = sns.heatmap(corr, mask=mask, ax=ax, annot=True, cmap="OrRd")

ax.set_xticklabels(ax.xaxis.get_ticklabels(), fontsize=10, ha="right", rotation=45)
ax.set_yticklabels(ax.yaxis.get_ticklabels(), fontsize=10, va="center", rotation=0)