Distributed data parallel BERT training with TensorFlow 2 and SageMaker distributed
Amazon SageMaker’s distributed library can be used to train deep learning models faster and at lower cost. The data parallel feature in this library (smdistributed.dataparallel) is a distributed data parallel training framework for PyTorch, TensorFlow, and MXNet.
This notebook example shows how to use smdistributed.dataparallel with TensorFlow (version 2.4.1) on Amazon SageMaker to train a BERT model using an Amazon FSx for Lustre file system as the data source.
The outline of steps is as follows:
Stage the dataset in Amazon S3. The original dataset for BERT pretraining consists of text passages from BooksCorpus (800M words) (Zhu et al. 2015) and English Wikipedia (2,500M words). Follow the original guidelines by NVIDIA to prepare the training data in hdf5 format: https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/README.md#getting-the-data
Create an Amazon FSx for Lustre file system and import the data into the file system from S3
Build the Docker training image and push it to Amazon ECR
Configure data input channels for SageMaker
Configure hyperparameters
Define training metrics
Define the training job, set the distribution strategy for smdistributed.dataparallel, and start training
NOTE: With a large training dataset, we recommend using Amazon FSx as the input file system for the SageMaker training job. FSx file input significantly cuts down training startup time on SageMaker because it avoids downloading the training data each time you start the training job (as happens with S3 input for a SageMaker training job), and it provides good data read throughput.
NOTE: This example requires SageMaker Python SDK v2.X.
Amazon SageMaker Initialization
Initialize the notebook instance. Get the AWS region and the SageMaker execution role.
The following code cell defines role, which is the IAM role ARN used to create and run SageMaker training and hosting jobs. This is the same IAM role used to create this SageMaker Notebook instance. role must have permission to create a SageMaker training job and endpoint. For granular policies you can use to grant these permissions, see Amazon SageMaker Roles. If you do not require fine-tuned permissions for this demo, you can use the IAM managed policy AmazonSageMakerFullAccess to complete this demo.
As described above, since we will be using FSx, make sure to attach FSx access permissions to this IAM role.
[ ]:
%%time
! python3 -m pip install --upgrade sagemaker
import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
import boto3
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()
role = get_execution_role()  # provide a pre-existing role ARN as an alternative to creating a new role
print(f"SageMaker Execution Role:{role}")
client = boto3.client("sts")
account = client.get_caller_identity()["Account"]
print(f"AWS account:{account}")
session = boto3.session.Session()
region = session.region_name
print(f"AWS region:{region}")
To verify that the role above has required permissions:
Go to the IAM console: https://console.aws.amazon.com/iam/home.
Select Roles.
Enter the role name in the search box to search for that role.
Select the role.
Use the Permissions tab to verify this role has required permissions attached.
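The IAM console search box expects a role name, while the role variable printed earlier holds a full ARN. The helper below is a minimal sketch of how the name could be extracted from the ARN. The function name role_name_from_arn and the example ARN are hypothetical and used only for illustration.

```python
def role_name_from_arn(role_arn: str) -> str:
    """Return the role name portion of an IAM role ARN.

    An IAM role ARN has the form
    arn:aws:iam::<account-id>:role/<optional/path/><role-name>.
    """
    _, sep, resource = role_arn.partition(":role/")
    if not sep or not resource:
        raise ValueError(f"Not an IAM role ARN: {role_arn!r}")
    # Roles created with a path (for example service-role/) keep the name last.
    return resource.split("/")[-1]


# Hypothetical ARN used only for illustration.
example_arn = (
    "arn:aws:iam::123456789012:role/service-role/"
    "AmazonSageMaker-ExecutionRole-Example"
)
print(role_name_from_arn(example_arn))  # AmazonSageMaker-ExecutionRole-Example
```

In the notebook you would pass role instead of example_arn and paste the result into the search box.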
Prepare SageMaker Training Images
By default, SageMaker uses the latest Amazon Deep Learning Container (DLC) TensorFlow training image. In this step, we use it as a base image and install the additional dependencies required for training the BERT model.
In the GitHub repository https://github.com/HerringForks/DeepLearningExamples.git we have made a smdistributed.dataparallel TensorFlow 2 BERT training script available for your use. This repository will be cloned into the training image for running the model training.
Build and Push Docker Image to ECR
Run the commands below to build the Docker image and push it to ECR.
[ ]:
image = "<IMAGE_NAME>" # Example: tf2-smdataparallel-bert-sagemaker
tag = "<IMAGE_TAG>" # Example: latest
[ ]:
!pygmentize ./Dockerfile
[ ]:
!pygmentize ./build_and_push.sh
[ ]:
%%time
! chmod +x build_and_push.sh; bash build_and_push.sh {region} {image} {tag}
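The build step fails in confusing ways if image or tag still hold their <...> placeholder values. The following is a minimal sketch of a guard you could run before building. It composes the image URI in the same account.dkr.ecr.region.amazonaws.com/image:tag form used later in this notebook. The helper name ecr_image_uri and the example values are hypothetical.

```python
import re


def ecr_image_uri(account: str, region: str, image: str, tag: str) -> str:
    """Compose an ECR image URI, rejecting values left as <PLACEHOLDER>."""
    fields = {"account": account, "region": region, "image": image, "tag": tag}
    for name, value in fields.items():
        # Empty strings and untouched placeholders such as "<IMAGE_NAME>" are rejected.
        if not value or re.fullmatch(r"<[A-Z_]+>", value):
            raise ValueError(f"{name} is not set (got {value!r})")
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{image}:{tag}"


# Hypothetical values used only for illustration.
print(
    ecr_image_uri(
        "123456789012", "us-west-2", "tf2-smdataparallel-bert-sagemaker", "latest"
    )
)
```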
Preparing FSx Input for SageMaker
Download and prepare your training dataset on S3.
Follow the steps listed here to create an FSx file system linked with the S3 bucket that holds your training data: https://docs.aws.amazon.com/fsx/latest/LustreGuide/create-fs-linked-data-repo.html. Make sure to add an endpoint to your VPC allowing S3 access.
Follow the steps listed here to configure your SageMaker training job to use FSx: https://aws.amazon.com/blogs/machine-learning/speed-up-training-on-amazon-sagemaker-using-amazon-efs-or-amazon-fsx-for-lustre-file-systems/
Important Caveats
You need to use the same subnet, VPC, and security group used with FSx when launching the SageMaker notebook instance. The same configuration will be used by your SageMaker training job.
Make sure you set appropriate inbound/outbound rules in the security group. Specifically, opening up the ports listed in the following guide is necessary for SageMaker to access the FSx file system in the training job: https://docs.aws.amazon.com/fsx/latest/LustreGuide/limit-access-security-groups.html
Make sure the SageMaker IAM role used to launch this SageMaker training job has access to AmazonFSx.
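As a rough illustration of the security group caveat, the sketch below checks whether a set of allowed inbound port ranges covers a set of required ports. The port values in ASSUMED_LUSTRE_PORTS are an assumption for this example only; take the authoritative list from the linked FSx guide. The names PortRange and missing_ports are hypothetical, and the rules are plain tuples rather than anything read from AWS.

```python
from typing import Iterable, List, Tuple

PortRange = Tuple[int, int]  # inclusive (from_port, to_port)

# ASSUMPTION: example values only; confirm against the linked FSx guide.
ASSUMED_LUSTRE_PORTS: List[PortRange] = [(988, 988), (1018, 1023)]


def missing_ports(
    required: Iterable[PortRange], allowed: Iterable[PortRange]
) -> List[int]:
    """Return the required ports that no allowed range covers."""
    allowed_ports = set()
    for low, high in allowed:
        allowed_ports.update(range(low, high + 1))
    return [
        port
        for low, high in required
        for port in range(low, high + 1)
        if port not in allowed_ports
    ]


# A security group that only opens port 988 leaves the second range uncovered.
print(missing_ports(ASSUMED_LUSTRE_PORTS, [(988, 988)]))
```

An empty list means every required port is covered by at least one rule.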
SageMaker TensorFlow Estimator function options
In the following code block, you can update the estimator function to use a different instance type, instance count, and distribution strategy. You’re also passing in the training script you reviewed in the previous cell.
Instance types
smdistributed.dataparallel supports model training on SageMaker with the following instance types only. For best performance, we recommend an instance type that supports Amazon Elastic Fabric Adapter (ml.p3dn.24xlarge or ml.p4d.24xlarge).
ml.p3.16xlarge
ml.p3dn.24xlarge [Recommended]
ml.p4d.24xlarge [Recommended]
Instance count
To get the best performance and the most out of smdistributed.dataparallel, you should use at least 2 instances, but you can also use 1 for testing this example.
Distribution strategy
Note that to use DDP mode, you update the distribution strategy and set it to use smdistributed dataparallel.
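The three options above can be checked together before an estimator is created. The following is a minimal sketch, assuming the supported instance types are exactly the three listed above and using the same distribution dictionary that is passed to the estimator later in this notebook. The helper name dataparallel_config is hypothetical and not part of the SageMaker SDK.

```python
SUPPORTED_INSTANCE_TYPES = ("ml.p3.16xlarge", "ml.p3dn.24xlarge", "ml.p4d.24xlarge")


def dataparallel_config(instance_type: str, instance_count: int) -> dict:
    """Validate the instance settings and return estimator keyword arguments."""
    if instance_type not in SUPPORTED_INSTANCE_TYPES:
        raise ValueError(
            f"{instance_type} is not in the supported list: {SUPPORTED_INSTANCE_TYPES}"
        )
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "instance_type": instance_type,
        "instance_count": instance_count,
        # Enables the smdistributed.dataparallel distribution strategy.
        "distribution": {"smdistributed": {"dataparallel": {"enabled": True}}},
    }


config = dataparallel_config("ml.p3dn.24xlarge", 2)
print(config["distribution"])
```

The returned dictionary could be unpacked into the estimator call with **config.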
Training script
In the GitHub repository https://github.com/HerringForks/deep-learning-models.git we have made a reference smdistributed.dataparallel TensorFlow BERT training script available for your use. Clone the repository.
[ ]:
# Clone herring forks repository for reference implementation BERT with TensorFlow2-SMDataParallel
!rm -rf deep-learning-models
!git clone --recursive https://github.com/HerringForks/deep-learning-models.git
[ ]:
from sagemaker.tensorflow import TensorFlow
[ ]:
instance_type = "ml.p3dn.24xlarge" # Other supported instance type: ml.p3.16xlarge, ml.p4d.24xlarge
instance_count = 2 # You can use 2, 4, 8 etc.
docker_image = f"{account}.dkr.ecr.{region}.amazonaws.com/{image}:{tag}" # YOUR_ECR_IMAGE_BUILT_WITH_ABOVE_DOCKER_FILE
username = "AWS"
subnets = ["<SUBNET_ID>"]  # Should be the same as the subnet used for FSx. Example: subnet-0f9XXXX
security_group_ids = ["<SECURITY_GROUP_ID>"]  # Should be the same as the security group used for FSx. Example: sg-03ZZZZZZ
job_name = "smdataparallel-bert-tf2-fsx-2p3dn"  # This job name is used as prefix to the SageMaker training job. Makes it easy to look for your training job in the SageMaker training job console.
file_system_id = "<FSX_ID>" # FSx file system ID with your training dataset. Example: 'fs-0bYYYYYY'
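Before submitting the job, it can help to confirm that the network and file system placeholders above have been replaced with real identifiers. The following is a minimal sketch, assuming that subnet, security group, and FSx IDs consist of the prefixes subnet-, sg-, and fs- followed by lowercase hexadecimal characters. The helper name malformed_ids and the sample values are hypothetical.

```python
import re

# ASSUMPTION: IDs are a fixed prefix followed by lowercase hex characters.
ID_PATTERNS = {
    "subnet": re.compile(r"subnet-[0-9a-f]+"),
    "security_group": re.compile(r"sg-[0-9a-f]+"),
    "file_system": re.compile(r"fs-[0-9a-f]+"),
}


def malformed_ids(values: dict) -> list:
    """Return a description of every ID that does not match its expected pattern."""
    problems = []
    for kind, ids in values.items():
        for value in ids:
            if not ID_PATTERNS[kind].fullmatch(value):
                problems.append(f"{kind}: {value!r}")
    return problems


# Hypothetical values: one real-looking ID and two untouched placeholders.
print(
    malformed_ids(
        {
            "subnet": ["subnet-0f9abc12"],
            "security_group": ["<SECURITY_GROUP_ID>"],
            "file_system": ["<FSX_ID>"],
        }
    )
)
```

In the notebook you would pass subnets, security_group_ids, and [file_system_id] and stop if the returned list is not empty.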
[ ]:
SM_DATA_ROOT = "/opt/ml/input/data/train"
hyperparameters = {
    "train_dir": "/".join(
        [
            SM_DATA_ROOT,
            "tfrecords/train/max_seq_len_128_max_predictions_per_seq_20_masked_lm_prob_15",
        ]
    ),
    "val_dir": "/".join(
        [
            SM_DATA_ROOT,
            "tfrecords/validation/max_seq_len_128_max_predictions_per_seq_20_masked_lm_prob_15",
        ]
    ),
    "log_dir": "/".join([SM_DATA_ROOT, "checkpoints/bert/logs"]),
    "checkpoint_dir": "/".join([SM_DATA_ROOT, "checkpoints/bert"]),
    "load_from": "scratch",
    "model_type": "bert",
    "model_size": "large",
    "per_gpu_batch_size": 64,
    "max_seq_length": 128,
    "max_predictions_per_seq": 20,
    "optimizer": "lamb",
    "learning_rate": 0.005,
    "end_learning_rate": 0.0003,
    "hidden_dropout_prob": 0.1,
    "attention_probs_dropout_prob": 0.1,
    "gradient_accumulation_steps": 1,
    "learning_rate_decay_power": 0.5,
    "warmup_steps": 2812,
    "total_steps": 2000,
    "log_frequency": 10,
    "run_name": job_name,
    "squad_frequency": 0,
}
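With data parallel training, the batch size set in the hyperparameters applies to each GPU, so the effective batch size grows with the cluster. The following is a back-of-the-envelope sketch, assuming 8 GPUs per instance for each supported instance type, one worker per GPU, and that gradient accumulation multiplies the effective batch. The helper name global_batch_size is hypothetical, and the training script may account for these factors differently.

```python
# ASSUMPTION: each supported instance type exposes 8 GPUs.
GPUS_PER_INSTANCE = {
    "ml.p3.16xlarge": 8,
    "ml.p3dn.24xlarge": 8,
    "ml.p4d.24xlarge": 8,
}


def global_batch_size(
    per_gpu_batch_size: int,
    instance_type: str,
    instance_count: int,
    gradient_accumulation_steps: int = 1,
) -> int:
    """Return the number of sequences consumed per optimizer step across all GPUs."""
    world_size = GPUS_PER_INSTANCE[instance_type] * instance_count
    return per_gpu_batch_size * world_size * gradient_accumulation_steps


batch = global_batch_size(64, "ml.p3dn.24xlarge", 2)
print(batch)         # 1024 sequences per step
print(batch * 2000)  # 2048000 sequences over total_steps=2000
```

Under these assumptions, doubling instance_count doubles the effective batch size, which is worth keeping in mind when reusing the learning rate settings on a larger cluster.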
[ ]:
estimator = TensorFlow(
    entry_point="albert/run_pretraining.py",
    role=role,
    image_uri=docker_image,
    source_dir="deep-learning-models/models/nlp",
    framework_version="2.4.1",
    py_version="py37",
    instance_count=instance_count,
    instance_type=instance_type,
    sagemaker_session=sagemaker_session,
    subnets=subnets,
    hyperparameters=hyperparameters,
    security_group_ids=security_group_ids,
    debugger_hook_config=False,
    # Training using smdistributed.dataparallel Distributed Training Framework
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
[ ]:
# Configure FSx Input for your SageMaker Training job
from sagemaker.inputs import FileSystemInput
# YOUR_MOUNT_PATH_FOR_TRAINING_DATA
# NOTE: '/fsx/' will be the root mount path. Example: '/fsx/albert'
file_system_directory_path = "<FSX_DIRECTORY_PATH>"
file_system_access_mode = "rw"
file_system_type = "FSxLustre"
train_fs = FileSystemInput(
file_system_id=file_system_id,
file_system_type=file_system_type,
directory_path=file_system_directory_path,
file_system_access_mode=file_system_access_mode,
)
data_channels = {"train": train_fs}
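The channel name used as the dictionary key determines where the file system directory appears inside the training container: SageMaker mounts each input channel under /opt/ml/input/data/<channel name>. The following sketch shows how the "train" channel lines up with the SM_DATA_ROOT value used in the hyperparameters. The helper name channel_path is hypothetical.

```python
import posixpath

SM_INPUT_DATA_ROOT = "/opt/ml/input/data"


def channel_path(channel_name: str, *relative_parts: str) -> str:
    """Return the in-container path for a file under the given input channel."""
    return posixpath.join(SM_INPUT_DATA_ROOT, channel_name, *relative_parts)


print(channel_path("train"))                      # /opt/ml/input/data/train
print(channel_path("train", "checkpoints/bert"))  # /opt/ml/input/data/train/checkpoints/bert
```

Because the checkpoint and log directories in the hyperparameters sit under this same mount, the training job writes back to the file system, which is presumably why the access mode above is set to "rw".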
[ ]:
# Submit SageMaker training job
estimator.fit(inputs=data_channels, job_name=job_name)
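Training job names must be unique within an AWS account and region, so submitting again with the same explicit job_name is likely to be rejected. The following is a minimal sketch of one way to derive a fresh name from a prefix by appending a UTC timestamp and checking it against the training job name rules (1 to 63 characters, alphanumerics and hyphens). The helper name unique_job_name is hypothetical.

```python
import re
import time
from typing import Optional

# Training job names: 1-63 characters, alphanumerics separated by hyphens.
JOB_NAME_PATTERN = re.compile(r"[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}")


def unique_job_name(prefix: str, now: Optional[float] = None) -> str:
    """Append a UTC timestamp to prefix and validate the resulting job name."""
    # time.gmtime(None) uses the current time; a fixed value makes tests repeatable.
    stamp = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime(now))
    name = f"{prefix}-{stamp}"
    if not JOB_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Invalid training job name: {name!r}")
    return name


print(unique_job_name("smdataparallel-bert-tf2-fsx-2p3dn"))
```

You could then call estimator.fit(inputs=data_channels, job_name=unique_job_name(job_name)) when re-running the notebook.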