Using TensorFlow Scripts in SageMaker - Quickstart
This notebook’s CI test result for us-west-2 is as follows. CI test results in other regions can be found at the end of the notebook.
Starting with TensorFlow version 1.11, you can use SageMaker’s TensorFlow containers to train TensorFlow scripts the same way you would train outside SageMaker. This feature is named Script Mode.
This example uses multi-layer recurrent neural networks (LSTM, RNN) for character-level language models in Python using TensorFlow. You can use the same technique for other scripts or repositories, including TensorFlow Model Zoo and TensorFlow benchmark scripts.
Get the data
For training data, we use plain text versions of Sherlock Holmes stories. Let’s create a folder named sherlock to store our dataset:
[ ]:
import os
data_dir = os.path.join(os.getcwd(), "sherlock")
os.makedirs(data_dir, exist_ok=True)
We need to download the dataset to this folder:
[ ]:
!wget https://sherlock-holm.es/stories/plain-text/cnus.txt --force-directories --output-document=sherlock/input.txt
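If wget is not available in your environment, the same download can be done with the Python standard library. This is a minimal sketch that writes to the same sherlock/input.txt path:
import urllib.request

# Download the Sherlock Holmes corpus into the data directory created above
urllib.request.urlretrieve(
    "https://sherlock-holm.es/stories/plain-text/cnus.txt",
    os.path.join(data_dir, "input.txt"),
)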
Preparing the training script
For the training script, let's use the Git integration of the SageMaker Python SDK. That is, you can specify a training script stored in a GitHub, CodeCommit, or other Git repository as the entry point for the estimator, so that you don't have to download the scripts locally. If you do so, the source directory and any dependencies should be in the same repository.
To use Git integration, pass a dict git_config as a parameter when you create the TensorFlow Estimator object. In the git_config parameter, you specify the fields repo, branch, and commit to locate the specific repo you want to use. If authentication is required to access the repo, you can specify the fields 2FA_enabled, username, password, and token accordingly.
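For illustration only, a git_config for a private repo pinned to a specific commit might look like the following. The repo URL, commit SHA, and token below are placeholders, not real values:
git_config = {
    "repo": "https://github.com/your-org/your-private-repo.git",  # placeholder repo URL
    "branch": "main",
    "commit": "0123456789abcdef0123456789abcdef01234567",  # placeholder commit SHA
    "2FA_enabled": True,  # with 2FA enabled, authenticate with a token instead of username/password
    "token": "<your-personal-access-token>",  # placeholder; never hard-code real secrets in notebooks
}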
The scripts we want to use for this example are stored in the GitHub repo https://github.com/awslabs/amazon-sagemaker-examples/tree/training-scripts, under the branch training-scripts. It is a public repo, so we don't need authentication to access it. Let's specify the git_config argument here:
[ ]:
git_config = {
    "repo": "https://github.com/awslabs/amazon-sagemaker-examples.git",
    "branch": "training-scripts",
}
Note that we did not specify commit in git_config here, so the latest commit of the specified repo and branch will be used by default.
The scripts we will use are under the char-rnn-tensorflow directory in the repo. The directory also includes a README.md with an overview of the project, requirements, and basic usage:
Basic Usage
To train with default parameters on the tinyshakespeare corpus, run python train.py. To access all the parameters, use python train.py --help.
train.py uses the argparse library and defines, among others, the following arguments:
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
# Data and model checkpoints directories
parser.add_argument('--data_dir', type=str, default='data/tinyshakespeare', help='data directory containing input.txt with training examples')
parser.add_argument('--save_dir', type=str, default='save', help='directory to store checkpointed models')
...
args = parser.parse_args()
When SageMaker training finishes, it deletes all data generated inside the container, with the exception of the directories /opt/ml/model and /opt/ml/output. To ensure that model data is not lost during training, training scripts are invoked in SageMaker with an additional argument --model_dir. The training script should save the model data that results from the training job to this directory.
The training script executes in the container as shown below:
python train.py --num_epochs 1 --data_dir /opt/ml/input/data/training --model_dir /opt/ml/model
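For reference, here is a minimal sketch (not the actual char-rnn-tensorflow script) of how a training script can accept --model_dir and checkpoint there, assuming the TensorFlow 1.x API used in this example:
import argparse
import os

import tensorflow as tf  # TensorFlow 1.x API, matching framework_version="1.15.2"

parser = argparse.ArgumentParser()
parser.add_argument('--num_epochs', type=int, default=1)
parser.add_argument('--data_dir', type=str, default='/opt/ml/input/data/training')
parser.add_argument('--model_dir', type=str, default='/opt/ml/model')  # injected by SageMaker
args = parser.parse_args()

# ... build the real graph and run the training loop here ...
step = tf.Variable(0, name='global_step')  # stand-in for real model variables

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    # Anything saved under args.model_dir is preserved and uploaded when the job ends
    saver.save(sess, os.path.join(args.model_dir, 'model.ckpt'))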
Test locally using SageMaker Python SDK TensorFlow Estimator
You can use the SageMaker Python SDK TensorFlow estimator (https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/README.rst#training-with-tensorflow) to easily train locally and in SageMaker.
Let’s start by setting the training script arguments --num_epochs and --data_dir as hyperparameters. Remember that we don’t need to provide --model_dir:
[ ]:
hyperparameters = {"num_epochs": 1, "data_dir": "/opt/ml/input/data/training"}
This notebook shows how to use the SageMaker Python SDK to run your code in a local container before deploying to SageMaker’s managed training or hosting environments. Just change your estimator’s instance_type to local or local_gpu. For more information, see: https://github.com/aws/sagemaker-python-sdk#local-mode.
In order to use this feature, you’ll need to install docker-compose (and nvidia-docker if training with a GPU). Running the following script installs docker-compose or nvidia-docker-compose and configures the notebook environment for you.
Note, you can only run a single local notebook at a time.
[ ]:
!/bin/bash ./setup.sh
To train locally, you set train_instance_type to local:
[ ]:
train_instance_type = "local"
We create the TensorFlow Estimator, passing the git_config argument. Note that we are using Git integration here, so source_dir should be a relative path inside the Git repo; otherwise it could be a relative or absolute local path. The TensorFlow Estimator is created as follows:
[ ]:
import os
import sagemaker
from sagemaker.tensorflow import TensorFlow
estimator = TensorFlow(
    entry_point="train.py",
    source_dir="char-rnn-tensorflow",
    git_config=git_config,
    instance_type=train_instance_type,
    instance_count=1,
    hyperparameters=hyperparameters,
    role=sagemaker.get_execution_role(),  # passes to the container the AWS role that you are using on this notebook
    framework_version="1.15.2",
    py_version="py3",
)
To start a training job, we call estimator.fit(inputs), where inputs is a dictionary whose keys, named channels, have values pointing to the data location. estimator.fit(inputs) downloads the TensorFlow container (Python 3, CPU version) locally and simulates a SageMaker training job. When training starts, the TensorFlow container executes train.py, passing hyperparameters and model_dir as script arguments, executing the example as follows:
python train.py --num_epochs 1 --data_dir /opt/ml/input/data/training --model_dir /opt/ml/model
[ ]:
inputs = {"training": f"file://{data_dir}"}
estimator.fit(inputs)
Let’s explain the values of --data_dir and --model_dir in more detail:
- /opt/ml/input/data/training is the directory inside the container where the training data is downloaded. The data is downloaded to this folder because training is the channel name defined in estimator.fit({'training': inputs}); a sketch of this channel-to-path mapping follows this list. See training data for more information.
- /opt/ml/model: use this directory to save models, checkpoints, or any other data. Any data saved in this folder is saved in the S3 bucket defined for training. See model data for more information.
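As a hypothetical illustration of how channel names map to container paths, suppose we passed two channels (the eval channel and the S3 paths below are made up for this example):
inputs = {
    "training": "s3://my-bucket/data/train",  # mounted at /opt/ml/input/data/training
    "eval": "s3://my-bucket/data/eval",       # mounted at /opt/ml/input/data/eval
}
estimator.fit(inputs)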
Reading additional information from the container
Often, a user script needs additional information from the container that is not available in hyperparameters. SageMaker containers write this information as environment variables that are available inside the script.
For example, the example above can read information about the training channel provided in the training job request by adding the environment variable SM_CHANNEL_TRAINING as the default value for the --data_dir argument:
import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # read the location of the "training" input channel from the environment variable set by SageMaker
    parser.add_argument('--data_dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
Script mode displays the list of available environment variables in the training logs. You can find the entire list here.
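For instance, a script could read a few of the commonly provided variables like this (a sketch; SM_MODEL_DIR, SM_NUM_GPUS, and SM_CURRENT_HOST are among the variables SageMaker sets):
import os

model_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')   # where to save the final model
num_gpus = int(os.environ.get('SM_NUM_GPUS', 0))              # number of GPUs on the instance
current_host = os.environ.get('SM_CURRENT_HOST', 'algo-1')    # name of this host in the training cluster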
Training in SageMaker
After you test the training job locally, upload the dataset to an S3 bucket so SageMaker can access the data during training:
[ ]:
import sagemaker
inputs = sagemaker.Session().upload_data(path="sherlock", key_prefix="datasets/sherlock")
The returned variable inputs above is a string with an S3 location that SageMaker Training has permission to read data from.
[ ]:
inputs
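The value follows the usual S3 URI form; with the SageMaker default bucket it looks something like this (the region and account ID below are placeholders):
s3://sagemaker-us-west-2-111122223333/datasets/sherlock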
To train in SageMaker:
- change the estimator argument instance_type to any SageMaker ML instance type available for training.
- set the training channel to an S3 location.
[ ]:
estimator = TensorFlow(
    entry_point="train.py",
    source_dir="char-rnn-tensorflow",
    git_config=git_config,
    instance_type="ml.c4.xlarge",  # executes training on an ml.c4.xlarge instance
    instance_count=1,
    hyperparameters=hyperparameters,
    role=sagemaker.get_execution_role(),
    framework_version="1.15.2",
    py_version="py3",
)

estimator.fit({"training": inputs})
Notebook CI Test Results
This notebook was tested in multiple regions. The test results are as follows, except for us-west-2 which is shown at the top of the notebook.