Fleet Predictive Maintenance: Part 2. Feature Engineering and Exploratory Data Visualization

Using SageMaker Studio to Predict Fault Classification

Background

This notebook is part of a sequence of notebooks that demonstrate a Predictive Maintenance (PrM) solution for automobile fleet maintenance via Amazon SageMaker Studio, giving business users a quick path toward a PrM POC. This second notebook in the series focuses on feature engineering. You can choose to run this notebook by itself or in sequence with the other notebooks listed below. Please see the README.md for more information about the use case implemented by this sequence of notebooks.

  1. Data Prep: Processing Job from SageMaker Data Wrangler Output

  2. Data Prep: Featurization (current notebook)

  3. Train, Tune and Predict using Batch Transform

Important Notes:

  • Due to cost consideration, the goal of this example is to show you how to use some of SageMaker Studio’s features, not necessarily to achieve the best result.

  • We use the built-in classification algorithm in this example, and a Python 3 (Data Science) Kernel is required.

  • The nature of predictive maintenance solutions requires domain expertise in the system or machinery involved. With this in mind, we make assumptions here for certain elements of this solution, with the acknowledgement that these assumptions should be informed by a domain expert and a main business stakeholder.


Contents

  1. Setup

  2. Feature Engineering

  3. Visualization of the Data Distributions


Setup

Let’s start by:

  • Installing and importing any dependencies

  • Instantiating SageMaker session

  • Specifying the S3 bucket and prefix that you want to use for your training and model data. This should be within the same region as SageMaker training

  • Defining the IAM role used to give training access to your data

[ ]:
# Install any missing dependencies
!pip install -qU 'sagemaker-experiments==0.1.24' 'sagemaker>=2.16.1' 'boto3' 'awswrangler'
[ ]:
import os
import json
import sys
import collections
import glob
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# SageMaker dependencies
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.image_uris import retrieve
import awswrangler as wr

# This instantiates a SageMaker session that we will be operating in.
smclient = boto3.Session().client("sagemaker")
region = boto3.Session().region_name

# This object represents the IAM role that we are assigned.
role = sagemaker.get_execution_role()

sess = sagemaker.Session()
bucket = sess.default_bucket()

# prefix is the path within the bucket where SageMaker stores the output from training jobs.
prefix_prm = "predmaint"  # place to upload training files within the bucket

Feature Engineering

For PrM, feature selection, generation and engineering are extremely important and highly dependent on domain expertise and an understanding of the systems involved. For our solution, we will focus on some simple features such as:

  • lag features

  • rolling average

  • rolling standard deviation

  • age of the engines

  • categorical labels

These features serve as a small example of the potential features that could be created. Other features to consider are changes in the sensor values within a window, the change from the initial value, or the number of readings over a defined threshold. For additional guidance on feature engineering, see the SageMaker Tabular Feature Engineering guide.
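As a rough illustration of the additional window-based features mentioned above (using hypothetical toy sensor values, not the fleet data), they can be sketched with pandas:

```python
import pandas as pd

# toy sensor readings for a single hypothetical vehicle
s = pd.Series([12.0, 12.5, 11.8, 13.2, 14.1, 13.9])

# change in the sensor value within a 3-observation window
window_change = s.rolling(window=3).apply(lambda w: w[-1] - w[0], raw=True)

# change from the initial value
change_from_start = s - s.iloc[0]

# number of readings over a defined threshold within the window
over_threshold = s.gt(13.0).rolling(window=3).sum()
```

The threshold value and window length here are placeholders; as noted above, a domain expert should inform these choices.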

First, we load up our cleaned dataset, which can be produced by following the steps in the notebook Data Prep: Processing Job from SageMaker Data Wrangler Output (the first section in this notebook series). See the Background section at the beginning of the notebook for more information.

[ ]:
fleet = pd.read_csv("fleet_data.csv")
[ ]:
%matplotlib inline
fig, axs = plt.subplots(3, 1, figsize=(20, 15))
plot_fleet = fleet.loc[fleet["vehicle_id"] == 1]

sns.set_style("darkgrid")
axs[0].plot(plot_fleet["datetime"], plot_fleet["voltage"])
axs[1].plot(plot_fleet["datetime"], plot_fleet["current"])
axs[2].plot(plot_fleet["datetime"], plot_fleet["resistance"])

axs[0].set_ylabel("voltage")
axs[1].set_ylabel("current")
axs[2].set_ylabel("resistance");
[ ]:
fig, axs = plt.subplots(3, 1, figsize=(20, 15))
plot_fleet = fleet.loc[fleet["vehicle_id"] == 2]

sns.set_style("darkgrid")
axs[0].plot(plot_fleet["datetime"], plot_fleet["voltage"])
axs[1].plot(plot_fleet["datetime"], plot_fleet["current"])
axs[2].plot(plot_fleet["datetime"], plot_fleet["resistance"])

axs[0].set_ylabel("voltage")
axs[1].set_ylabel("current")
axs[2].set_ylabel("resistance");
[ ]:
# let's look at the proportion of failures to non-failure
print(fleet["target"].value_counts())
print(
    "\nPercent of failures in the dataset: "
    + str(fleet["target"].value_counts()[1] / len(fleet["target"]))
)
print(
    "Number of vehicles with 1+ failures: "
    + str(fleet[fleet["target"] == 1]["vehicle_id"].drop_duplicates().count())
    + "\n"
)

# view the percentage distribution of target column
print(fleet["target"].value_counts(normalize=True))

We can see that the percentage of observations with class label 0 (no failure) and 1 (failure) is 80.42% and 19.58%, respectively, so this is a class-imbalanced problem. For PrM, class imbalance is often a problem because failures happen infrequently and businesses do not want to allow more failures than necessary. There are a variety of techniques for dealing with class imbalance in data, such as SMOTE. For this use case, we will leverage the built-in hyperparameters of SageMaker's Estimator to deal with the imbalance. We discuss this more in a later section.
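As a rough sketch of one such option (the 80/20 proportions below mirror the dataset, but the resulting weight is only a candidate value that a domain expert should validate), the ratio of negatives to positives can be used as a starting point for Linear Learner's `positive_example_weight_mult` hyperparameter, which can also simply be set to `"balanced"`:

```python
import pandas as pd

# hypothetical target column mirroring the roughly 80/20 split described above
target = pd.Series([0] * 80 + [1] * 20)

counts = target.value_counts()

# ratio of negatives to positives: one candidate value for Linear Learner's
# positive_example_weight_mult hyperparameter
pos_weight = counts[0] / counts[1]
```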

[ ]:
# dividing the per-vehicle failure count by 100 assumes 100 observations per
# vehicle; adjust the divisor to the per-vehicle record count in your data
p = fleet.groupby(["vehicle_id"])["target"].sum().rename("percentage of failures")
fail_percent = pd.DataFrame(p / 100)
print(fail_percent.sort_values("percentage of failures", ascending=False).head(20))
# fail_percent.plot(kind='box')
[ ]:
# check for missing values
print(fleet.isnull().sum())

# check sensor readings for zeros
fleet[fleet.loc[:, "voltage":"resistance"].values == 0]
[ ]:
# # optional: load in the fleet dataset from above
# fleet = pd.read_csv('fleet_data.csv')
fleet.datetime = pd.to_datetime(fleet.datetime)
[ ]:
# add lag features for voltage, current and resistance
# we create a single lag here; widen the range (e.g., range(1, 3)) to add more lags
for i in range(1, 2):
    fleet["voltage_lag_" + str(i)] = (
        fleet.groupby("vehicle_id")["voltage"].shift(i).bfill(limit=7)
    )
    fleet["current_lag_" + str(i)] = (
        fleet.groupby("vehicle_id")["current"].shift(i).bfill(limit=7)
    )
    fleet["resistance_lag_" + str(i)] = (
        fleet.groupby("vehicle_id")["resistance"].shift(i).bfill(limit=7)
    )
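To see why the `groupby` before `shift` matters, here is a minimal sketch on a toy frame (hypothetical vehicle IDs and readings); `.bfill(limit=7)` is the non-deprecated equivalent of `fillna(method="bfill", limit=7)`:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "vehicle_id": [1, 1, 1, 2, 2],
        "voltage": [10.0, 11.0, 12.0, 20.0, 21.0],
    }
)

# shift is applied per vehicle, so vehicle 2's first reading is not
# "borrowed" from vehicle 1; the backward fill then fills each group's
# leading NaN from the next reading of the same vehicle
toy["voltage_lag_1"] = toy.groupby("vehicle_id")["voltage"].shift(1).bfill(limit=7)
```

Note that the backward fill runs over the whole column, not per group, so with differently shaped data it could fill from an adjacent vehicle's block; a groupwise fill would avoid that.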
[ ]:
# create rolling stats for voltage, current and resistance group by vehicle_id
stats = pd.DataFrame()
grouped = fleet.groupby("vehicle_id")

# windows set to 4
# you could also add in additional rolling window lengths based on the machinery and domain knowledge
mean = [
    (col + "_" + "rolling_mean_" + str(win), grouped[col].rolling(window=win).mean())
    for win in [4]
    for col in ["voltage", "current", "resistance"]
]
std = [
    (col + "_" + "rolling_std_" + str(win), grouped[col].rolling(window=win).std())
    for win in [4]
    for col in ["voltage", "current", "resistance"]
]
df_mean = pd.DataFrame.from_dict(collections.OrderedDict(mean))
df_std = pd.DataFrame.from_dict(collections.OrderedDict(std))
stats = (
    pd.concat([df_mean, df_std], axis=1)
    .reset_index()
    .set_index("level_1")
    .bfill(limit=7)
)  # fill backward
stats.head(5)
[ ]:
fleet_lagged = pd.concat([fleet, stats.drop(columns=["vehicle_id"])], axis=1)
fleet_lagged.head(2)
[ ]:
# let's look at the descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution
round(fleet_lagged.describe(), 2).T

Visualization of the Data Distributions

[ ]:
# plot a single vehicle's histograms
# we will look at vehicle_id 2 as it has 1+ failures
def plot_engine_hists(sensor_data):
    cols = sensor_data.columns
    n_cols = min(len(cols), 4)
    n_rows = int(np.ceil(len(cols) / n_cols))

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, 15))
    plt.tight_layout()
    axes = axes.flatten()
    for col, ax in zip(cols, axes):
        sns.histplot(sensor_data[col], ax=ax, kde=True, stat="density")
        ax.set_xlabel(col)
        ax.set_ylabel("p")


plot_engine_hists(fleet_lagged[fleet_lagged["vehicle_id"] == 2].loc[:, "voltage":])

You should get a grid of histograms, one per feature, similar to the image below.

image0

[ ]:
# drop the original categorical columns (make, model, year, vehicle_class and engine_type) that were used for one-hot encoding
features = fleet_lagged.drop(columns=["make", "model", "year", "vehicle_class", "engine_type"])
features.to_csv("features.csv", index=False)
features_created_prm = True
%store features_created_prm
[ ]:
features = pd.read_csv("features.csv")

Although we have kept the EDA and feature engineering limited here, there is much more that could be done. Additional analysis could explore the relationships between make and model and/or engine type and failure rates. Further analysis could also be informed by discussions with domain experts and their in-depth understanding of the systems based on experience.

Now let’s split our data into train, test and validation

For PrM, we will want to split the data using a time-dependent record-splitting strategy, since the data consists of time series sensor readings. We will make the splits by choosing points in time based on the desired sizes of the training, test and validation sets. To prevent any records in the training set from sharing time windows with records in the test set, we remove any records at the boundary.
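The splitting strategy above can be sketched on a toy time-indexed frame (hypothetical data, using an 80/20 train/holdout split for brevity):

```python
import pandas as pd

# toy time-indexed records
df = pd.DataFrame(
    {
        "datetime": pd.date_range("2021-01-01", periods=10, freq="D"),
        "value": range(10),
    }
)

train_size = int(len(df) * 0.8)
ordered = df.sort_values("datetime")

train = ordered.iloc[:train_size]
holdout = ordered.iloc[train_size:]

# drop any holdout records that do not fall strictly after the last
# training timestamp, preventing leakage across the boundary
holdout = holdout.loc[holdout["datetime"] > train["datetime"].max()]
```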

[ ]:
# we will devote 80% to training, and we will save 10% for test and ~10% for validation (less the dropped records to avoid data leakage)
train_size = int(len(features) * 0.80)
val_size = int(len(features) * 0.10)

# order by datetime in order to split on time
ordered = features.sort_values("datetime")

# make train, test and validation splits
train, test, val = (
    ordered[0:train_size],
    ordered[train_size : train_size + val_size],
    ordered.tail(val_size),
)
train = train.sort_values(["vehicle_id", "datetime"])

# make sure there is no data leakage between train, test and validation
test = test.loc[test["datetime"] > train["datetime"].max()]
val = val.loc[val["datetime"] > test["datetime"].max()]

print("First train datetime: ", train["datetime"].min())
print("Last train datetime: ", train["datetime"].max(), "\n")
print("First test datetime: ", test["datetime"].min())
print("Last test datetime: ", test["datetime"].max(), "\n")
print("First validation datetime: ", val["datetime"].min())
print("Last validation datetime: ", val["datetime"].max())
[ ]:
train = train.drop(["datetime", "vehicle_id"], axis=1)

test = test.sort_values(["vehicle_id", "datetime"])
test = test.drop(["datetime", "vehicle_id"], axis=1)

val = val.sort_values(["vehicle_id", "datetime"])
val = val.drop(["datetime", "vehicle_id"], axis=1)
[ ]:
print("Total Observations: ", len(ordered))
print("Number of observations in the training data:", len(train))
print("Number of observations in the test data:", len(test))
print("Number of observations in the validation data:", len(val))

Converting data to the appropriate format for Estimator

The Amazon SageMaker implementation of Linear Learner accepts either CSV format or recordIO-wrapped protobuf. We will start by scaling the features and then save the train, test and validation sets to CSV files. If you are using your own data and it is too large to fit in memory, protobuf might be a better option than CSV. For more information on data formats for training, please refer to Common Data Formats for Training.
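For CSV training input, SageMaker's built-in algorithms expect the label in the first column with no header row. A minimal sketch of arranging a frame that way (the column names here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame(
    {"feature_a": [0.1, 0.9], "feature_b": [0.4, 0.2], "target": [0, 1]}
)

# move the label to the first column, then write without header or index
ordered_cols = ["target"] + [c for c in df.columns if c != "target"]
df[ordered_cols].to_csv("train_example.csv", header=False, index=False)
```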

[ ]:
# scale all features for train, test and validation
from sklearn import preprocessing

scaler = preprocessing.MinMaxScaler(feature_range=(0.0, 1.0))
train = pd.DataFrame(scaler.fit_transform(train))
test = pd.DataFrame(scaler.transform(test))
val = pd.DataFrame(scaler.transform(val))

train.to_csv("train.csv", header=False, index=False)
test.to_csv("test.csv", header=False, index=False)
test.loc[:, 1:].to_csv("test_x.csv", header=False, index=False)
val.to_csv("validation.csv", header=False, index=False)

Next Notebook : Train

Once you have selected some models that you would like to try out, SageMaker Experiments can be a great tool to track and compare all of the models before selecting the best model to deploy. We will set up an experiment using SageMaker experiments to track all the model training iterations for the Linear Learner Estimator we will try. You can read more about SageMaker Experiments to learn about experiment features, tracking and comparing outputs.