Fraud Detection for Automobile Claims: Data Exploration
This notebook is the first in a series that demonstrates how to prepare, train, and deploy a model that detects fraudulent auto claims. In this notebook, we focus on data exploration. You can run this notebook by itself or in sequence with the other notebooks listed below. Please see the README.md for more information about the use case implemented by this series of notebooks.
`Fraud Detection for Automobile Claims: Data Exploration <./0-AutoClaimFraudDetection.ipynb>`__
Datasets and Exploratory Visualizations
The dataset is synthetically generated and consists of customers and claims datasets. Here we will load them and do some exploratory visualizations.
!pip install seaborn==0.11.1
# Import the required libraries.
import pandas as pd
import numpy as np
import seaborn as sns  # visualisation
import matplotlib.pyplot as plt  # visualisation

%matplotlib inline
sns.set(color_codes=True)

df_claims = pd.read_csv("./data/claims_preprocessed.csv", index_col=0)
df_customers = pd.read_csv("./data/customers_preprocessed.csv", index_col=0)
Both datasets should load without any null values.
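This is easy to verify with pandas' `isnull`. A minimal sketch on a small synthetic frame (in the notebook you would call the same chained methods on `df_claims` and `df_customers` directly):

```python
import pandas as pd

# Small synthetic stand-in for one of the loaded datasets.
df = pd.DataFrame({"total_claim_amount": [1200.0, 540.0], "fraud": [0, 1]})

# Count nulls per column, then sum across columns for a single total.
null_total = df.isnull().sum().sum()
print(null_total)  # 0 for this frame
```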
# plot the bar graph of customer gender
df_customers.customer_gender_female.value_counts(normalize=True).plot.bar()
plt.xticks([0, 1], ["Male", "Female"]);
The dataset is heavily weighted towards male customers.
# plot the bar graph of fraudulent claims
df_claims.fraud.value_counts(normalize=True).plot.bar()
plt.xticks([0, 1], ["Not Fraud", "Fraud"]);
The overwhelming majority of claims are legitimate (i.e., not fraudulent).
# plot the education categories
educ = df_customers.customer_education.value_counts(normalize=True, sort=False)
plt.bar(educ.index, educ.values)
plt.xlabel("Customer Education Level");
# plot the total claim amounts
plt.hist(df_claims.total_claim_amount, bins=30)
plt.xlabel("Total Claim Amount")
The majority of total claim amounts are under $25,000.
# plot the number of claims filed in the past year
df_customers.num_claims_past_year.hist(density=True)
plt.suptitle("Number of Claims in the Past Year")
plt.xlabel("Number of claims per year")
Most customers did not file any claims in the previous year, but some filed as many as 7 claims.
sns.pairplot(
    data=df_customers,
    vars=["num_insurers_past_5_years", "months_as_customer", "customer_age"],
);
We can see that months_as_customer and customer_age are correlated with each other. A younger person has been driving for less time and therefore has had less time to be a customer. We can also see that num_insurers_past_5_years is negatively correlated with months_as_customer: someone who frequently jumped between insurers has probably spent less time as a customer of this insurer.
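To put numbers on these relationships, we can compute the pairwise Pearson correlations for the same three columns. A sketch on a small synthetic frame with the same column names (in the notebook you would call `.corr()` on `df_customers` directly):

```python
import pandas as pd

# Synthetic stand-in with the same column names as df_customers.
df = pd.DataFrame(
    {
        "num_insurers_past_5_years": [1, 1, 2, 3, 4],
        "months_as_customer": [120, 90, 60, 24, 6],
        "customer_age": [55, 48, 40, 27, 20],
    }
)

# Pairwise Pearson correlation matrix for the three columns.
corr = df[["num_insurers_past_5_years", "months_as_customer", "customer_age"]].corr()
print(corr.round(2))
```

Off-diagonal entries near -1 or +1 confirm the trends visible in the pairplot.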
df_combined = df_customers.join(df_claims)
sns.lineplot(x="num_insurers_past_5_years", y="fraud", data=df_combined);
Fraud is positively correlated with having a greater number of insurers over the past 5 years: customers who switched insurers more frequently also show a higher prevalence of fraud.
Our customers range from 18 to 75 years old.
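The age range can be confirmed with a simple min/max aggregation. A sketch on a synthetic series (in the notebook you would run this on `df_combined.customer_age`):

```python
import pandas as pd

# Synthetic stand-in for the customer_age column.
ages = pd.Series([18, 25, 40, 63, 75], name="customer_age")

# Minimum and maximum customer age.
print(ages.min(), ages.max())  # 18 75
```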
df_combined.groupby("customer_gender_female").mean()["fraud"].plot.bar()
plt.xticks([0, 1], ["Male", "Female"])
plt.suptitle("Fraud by Gender");
Fraudulent claims come disproportionately from male customers.
# Create a correlation matrix of fraud, gender, months as customer, and number of different insurers.
cols = [
    "fraud",
    "customer_gender_male",
    "customer_gender_female",
    "months_as_customer",
    "num_insurers_past_5_years",
]
corr = df_combined[cols].corr()

# plot the correlation matrix
sns.heatmap(corr, annot=True, cmap="Reds");
Fraud is positively correlated with having more insurers in the past 5 years, and negatively correlated with being a customer for a longer period of time. These go hand in hand: long-time customers are less likely to commit fraud.
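One convenient way to read these relationships off the matrix is to sort the `fraud` column of the correlation matrix. Sketched here on a toy frame (in the notebook you would use the `corr` computed in the cell above):

```python
import pandas as pd

# Toy stand-in mimicking the relevant columns of df_combined.
df = pd.DataFrame(
    {
        "fraud": [0, 0, 0, 1, 1],
        "months_as_customer": [120, 96, 60, 12, 6],
        "num_insurers_past_5_years": [1, 1, 2, 4, 4],
    }
)

# Rank the remaining features by their correlation with fraud, most negative first.
fraud_corr = df.corr()["fraud"].drop("fraud").sort_values()
print(fraud_corr)
```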
So far we have looked at the individual datasets; now let's look at their combined view (a join).
import pandas as pd

df_combined = pd.read_csv("./data/claims_customer.csv")
# get rid of an unwanted index column
df_combined = df_combined.loc[:, ~df_combined.columns.str.contains("^Unnamed: 0")]
df_combined.head()
Let's compute, for each column of the combined dataset, the number of unique values, the percentage of missing values, and the percentage taken by the largest category.
combined_stats = []
for col in df_combined.columns:
    combined_stats.append(
        (
            col,
            df_combined[col].nunique(),
            df_combined[col].isnull().sum() * 100 / df_combined.shape[0],
            df_combined[col].value_counts(normalize=True, dropna=False).values[0] * 100,
            df_combined[col].dtype,
        )
    )

stats_df = pd.DataFrame(
    combined_stats,
    columns=["feature", "unique_values", "percent_missing", "percent_largest_category", "datatype"],
)
stats_df.sort_values("percent_largest_category", ascending=False)
import matplotlib.pyplot as plt
import numpy as np

sns.set_style("white")

corr_list = [
    "customer_age",
    "months_as_customer",
    "total_claim_amount",
    "injury_claim",
    "vehicle_claim",
    "incident_severity",
    "fraud",
]
corr_df = df_combined[corr_list]
corr = round(corr_df.corr(), 2)

fig, ax = plt.subplots(figsize=(15, 15))

# Mask the upper triangle so each correlation appears only once.
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

ax = sns.heatmap(corr, mask=mask, ax=ax, annot=True, cmap="OrRd")
ax.set_xticklabels(ax.xaxis.get_ticklabels(), fontsize=10, ha="right", rotation=45)
ax.set_yticklabels(ax.yaxis.get_ticklabels(), fontsize=10, va="center", rotation=0)
plt.show()