Fraud Detection for Automobile Claims: Data Exploration
This notebook’s CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
Background
This notebook is the first in a series that demonstrates how to prepare, train, and deploy a model that detects fraudulent auto claims. In this notebook, we focus on data exploration. You can run this notebook by itself or in sequence with the other notebooks listed below. Please see the README.md for more information about the use case implemented by this series of notebooks.
Fraud Detection for Automobile Claims: Data Exploration (./0-AutoClaimFraudDetection.ipynb)
Fraud Detection for Automobile Claims: Data Preparation, Process, and Store Features
Fraud Detection for Automobile Claims: Train, Check Bias, Tune, Record Lineage, and Register a Model
Fraud Detection for Automobile Claims: Mitigate Bias, Train, Register, and Deploy Unbiased Model
Datasets and Exploratory Visualizations
The dataset is synthetically generated and consists of customers and claims datasets. Here we will load them and do some exploratory visualizations.
[ ]:
!pip install seaborn==0.11.1
[ ]:
# Importing required libraries.
import pandas as pd
import numpy as np
import seaborn as sns # visualisation
import matplotlib.pyplot as plt # visualisation
%matplotlib inline
sns.set(color_codes=True)
df_claims = pd.read_csv("./data/claims_preprocessed.csv", index_col=0)
df_customers = pd.read_csv("./data/customers_preprocessed.csv", index_col=0)
[ ]:
print(df_claims.isnull().sum().sum())
print(df_customers.isnull().sum().sum())
Both counts should be 0, meaning that neither dataset contains null values.
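If either total were nonzero, a per-column breakdown would show where the gaps are. The following is a minimal sketch on a small made-up frame; the `toy` DataFrame and its values are illustrative and are not taken from the claims data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, for illustration only; not the notebook's data.
toy = pd.DataFrame(
    {
        "total_claim_amount": [1200.0, np.nan, 8300.0],
        "fraud": [0, 0, 1],
    }
)

# Count nulls per column, then keep only the columns that have any.
null_counts = toy.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

The same two lines could be applied to df_claims or df_customers in place of `toy`.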
[ ]:
# Plot a bar graph of customer gender; sort_index() keeps bar 0 = male and bar 1 = female
df_customers.customer_gender_female.value_counts(normalize=True).sort_index().plot.bar()
plt.xticks([0, 1], ["Male", "Female"]);
The dataset is heavily weighted towards male customers.
[ ]:
# Plot a bar graph of fraudulent claims; sort_index() keeps bar 0 = not fraud and bar 1 = fraud
df_claims.fraud.value_counts(normalize=True).sort_index().plot.bar()
plt.xticks([0, 1], ["Not Fraud", "Fraud"]);
The overwhelming majority of claims are legitimate (i.e., not fraudulent).
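The size of that imbalance can also be read off as a number rather than from the bar heights. A minimal sketch, assuming a 0/1 label like the fraud column; the `toy_fraud` series below is made up for illustration and does not reflect the real fraud rate.

```python
import pandas as pd

# Hypothetical labels, for illustration only: 1 fraudulent claim out of 20.
toy_fraud = pd.Series([0] * 19 + [1], name="fraud")

# For a 0/1 label, the mean equals the share of positive (fraud) cases.
fraud_rate = toy_fraud.mean()

# The same information per class, ordered by label value (0, then 1).
class_shares = toy_fraud.value_counts(normalize=True).sort_index()

print(f"fraud rate: {fraud_rate:.2%}")
print(class_shares)
```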
[ ]:
# plot the education categories
educ = df_customers.customer_education.value_counts(normalize=True, sort=False)
plt.bar(educ.index, educ.values)
plt.xlabel("Customer Education Level");
[ ]:
# plot the total claim amounts
plt.hist(df_claims.total_claim_amount, bins=30)
plt.xlabel("Total Claim Amount")
The majority of total claim amounts are under $25,000.
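A statement like this can be checked directly by computing the share of values below the threshold. A minimal sketch with made-up amounts; `toy_amounts` is hypothetical, and only the 25,000 threshold comes from the text above.

```python
import pandas as pd

# Hypothetical claim amounts, for illustration only.
toy_amounts = pd.Series([4000, 12000, 18000, 26000, 41000], name="total_claim_amount")

threshold = 25_000

# A boolean comparison averages to the fraction of rows where it is True.
share_under = (toy_amounts < threshold).mean()
print(f"{share_under:.0%} of the toy claims are under ${threshold:,}")
```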
[ ]:
# plot the number of claims filed in the past year
df_customers.num_claims_past_year.hist(density=True)
plt.suptitle("Number of Claims in the Past Year")
plt.xlabel("Number of claims per year")
Most customers did not file any claims in the previous year, but some filed as many as 7 claims.
[ ]:
sns.pairplot(
    data=df_customers, vars=["num_insurers_past_5_years", "months_as_customer", "customer_age"]
);
Understandably, months_as_customer and customer_age are correlated with each other. A younger person has been driving for less time and therefore has a smaller upper bound on how long they could have been a customer.
We can also see that num_insurers_past_5_years is negatively correlated with months_as_customer. If someone frequently switched between insurers, then they probably spent less time as a customer of this insurer.
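The relationships seen in the pair plot can be quantified with Pearson correlation coefficients. A minimal sketch on a small made-up frame that reuses the notebook's column names; the values are hypothetical and chosen only so that the signs match the description above.

```python
import pandas as pd

# Hypothetical customers, for illustration only; not the notebook's data.
toy = pd.DataFrame(
    {
        "customer_age": [22, 30, 41, 55, 63],
        "months_as_customer": [10, 40, 95, 180, 260],
        "num_insurers_past_5_years": [4, 3, 2, 1, 1],
    }
)

# DataFrame.corr() computes pairwise Pearson correlations by default.
corr = toy.corr()
age_vs_tenure = corr.loc["customer_age", "months_as_customer"]
insurers_vs_tenure = corr.loc["num_insurers_past_5_years", "months_as_customer"]

print(f"age vs. tenure:      {age_vs_tenure:+.2f}")
print(f"insurers vs. tenure: {insurers_vs_tenure:+.2f}")
```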
[ ]:
df_combined = df_customers.join(df_claims)
sns.lineplot(x="num_insurers_past_5_years", y="fraud", data=df_combined);
Fraud is positively correlated with having a greater number of insurers over the past 5 years. Customers who switched insurers more frequently also had a higher prevalence of fraud.
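By default, sns.lineplot aggregates rows that share an x value by their mean, so the line traces the fraud rate per insurer count. The same numbers can be tabulated with a groupby. A minimal sketch on made-up rows; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical joined rows, for illustration only.
toy = pd.DataFrame(
    {
        "num_insurers_past_5_years": [1, 1, 1, 1, 2, 2, 3, 3],
        "fraud": [0, 0, 0, 0, 0, 1, 1, 1],
    }
)

# Mean of a 0/1 label per group is the fraud rate for that group;
# the count shows how many rows back each rate.
fraud_by_insurers = toy.groupby("num_insurers_past_5_years")["fraud"].agg(["mean", "count"])
print(fraud_by_insurers)
```

Looking at the count column alongside the mean helps judge how much weight to give the rate for sparsely populated groups.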
[ ]:
sns.boxplot(x=df_customers["months_as_customer"]);
[ ]:
sns.boxplot(x=df_customers["customer_age"]);
Our customers range from 18 to 75 years old.
[ ]:
# Select the fraud column before averaging so that non-numeric columns are never aggregated
df_combined.groupby("customer_gender_female")["fraud"].mean().plot.bar()
plt.xticks([0, 1], ["Male", "Female"])
plt.suptitle("Fraud by Gender");
Fraudulent claims come disproportionately from male customers.
[ ]:
# Creating a correlation matrix of fraud, gender, months as customer, and number of different insurers
cols = [
    "fraud",
    "customer_gender_male",
    "customer_gender_female",
    "months_as_customer",
    "num_insurers_past_5_years",
]
corr = df_combined[cols].corr()
# plot the correlation matrix
sns.heatmap(corr, annot=True, cmap="Reds");
Fraud is positively correlated with having more insurers in the past 5 years and negatively correlated with being a customer for a longer period of time. These go hand in hand and suggest that long-time customers are less likely to commit fraud.
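One thing to keep in mind when reading this heatmap: if customer_gender_male and customer_gender_female are complementary one-hot indicators (an assumption; the notebook does not confirm that every row sets exactly one of them), their correlation is exactly -1 and each column carries the same information as the other. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical indicator columns, for illustration only.
# Assumption: male is the exact complement of female.
female = pd.Series([0, 0, 1, 0, 1, 0], name="customer_gender_female")
male = (1 - female).rename("customer_gender_male")

# Pearson correlation between a binary column and its complement.
gender_corr = female.corr(male)
print(round(gender_corr, 6))
```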
Combined Datasets
We have been looking at the individual datasets. Now let’s look at their combined (joined) view.
[ ]:
import pandas as pd
df_combined = pd.read_csv("./data/claims_customer.csv")
[ ]:
# Get rid of the unwanted "Unnamed: 0" column
df_combined = df_combined.loc[:, ~df_combined.columns.str.contains("^Unnamed: 0")]
df_combined.head()
[ ]:
df_combined.describe()
Let’s examine the number of unique values, the percentage of missing values, and the share of the most frequent value for each column in the combined dataset.
[ ]:
combined_stats = []
for col in df_combined.columns:
    combined_stats.append(
        (
            col,
            df_combined[col].nunique(),
            df_combined[col].isnull().sum() * 100 / df_combined.shape[0],
            df_combined[col].value_counts(normalize=True, dropna=False).values[0] * 100,
            df_combined[col].dtype,
        )
    )
stats_df = pd.DataFrame(
    combined_stats,
    columns=["feature", "unique_values", "percent_missing", "percent_largest_category", "datatype"],
)
stats_df.sort_values("percent_largest_category", ascending=False)
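To see what each statistic in this table measures, it can help to compute them for a single small column. A minimal sketch on a made-up series; `toy_col` is hypothetical, and the three expressions mirror the ones used in the loop above.

```python
import pandas as pd

# Hypothetical column, for illustration only: three "a" values and one missing.
toy_col = pd.Series(["a", "a", "a", None], name="toy_feature")

# nunique() ignores missing values by default.
unique_values = toy_col.nunique()

# Share of rows that are missing, as a percentage.
percent_missing = toy_col.isnull().sum() * 100 / toy_col.shape[0]

# value_counts sorts by frequency, so the first entry is the most common value;
# dropna=False lets missing values compete for that top spot.
percent_largest_category = toy_col.value_counts(normalize=True, dropna=False).values[0] * 100

print(unique_values, percent_missing, percent_largest_category)
```

A column whose largest category is close to 100% carries little signal, which is why sorting by that statistic is a useful first pass.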
[ ]:
import matplotlib.pyplot as plt
import numpy as np
sns.set_style("white")
corr_list = [
    "customer_age",
    "months_as_customer",
    "total_claim_amount",
    "injury_claim",
    "vehicle_claim",
    "incident_severity",
    "fraud",
]
corr_df = df_combined[corr_list]
corr = round(corr_df.corr(), 2)
fig, ax = plt.subplots(figsize=(15, 15))
# Use the builtin bool; the np.bool alias is unavailable in some NumPy versions
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
ax = sns.heatmap(corr, mask=mask, ax=ax, annot=True, cmap="OrRd")
ax.set_xticklabels(ax.xaxis.get_ticklabels(), fontsize=10, ha="right", rotation=45)
ax.set_yticklabels(ax.yaxis.get_ticklabels(), fontsize=10, va="center", rotation=0)
plt.show()
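The mask in the cell above hides the redundant half of the symmetric correlation matrix: seaborn does not draw cells where the mask is True. A minimal sketch of the same masking step on a 3x3 array, using only NumPy; the array size is arbitrary.

```python
import numpy as np

# Start with nothing masked.
mask = np.zeros((3, 3), dtype=bool)

# triu_indices_from returns the indices of the upper triangle, diagonal included,
# so the diagonal (always 1.0 in a correlation matrix) is masked as well.
mask[np.triu_indices_from(mask)] = True

print(mask)
```

Only the cells strictly below the diagonal remain visible, so each pair of features appears once.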
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.