Project developed for the AWS Machine Learning Engineer Scholarship offered by Udacity (2023)
The right deployment configuration is a very important step in machine learning operations, as it can avoid problems such as high costs and poor performance. Examples of configurations for production deployment of a model include compute resources, such as the machine instance type and the number of instances for training and deployment, and security, since a poor security configuration can lead to data leaks or performance issues. By implementing the right configuration, we can have a high-throughput and low-latency machine learning model in production.
Finding SageMaker in AWS
In SageMaker we then create a notebook instance via Notebook -> Notebook Instances -> the Create notebook instance button
Then we create a new instance, choosing a notebook instance name and type. For this project, the ml.m5.xlarge instance type was selected
Below you can see a notebook instance called mlops that has already been created
With the notebook instance created, we can upload a Jupyter notebook, the hpo.py script to train our deep learning image classification model, the inference2.py script, and an image for testing. Below you can find the list of files that we need to upload:
train_and_deploy-solution.ipynb
hpo.py
inference2.py
lab.jpg
Before we can start training our model, we first need to create an S3 bucket to which we will upload our training, validation, and test data. So let's do it now :)
Finding S3
Next, we create a new bucket by clicking the Create bucket button and giving our S3 bucket a unique name
As we can see, our bucket was created in S3
The code snippet below shows how to download the data using the wget command and upload it to AWS S3 using the cp command.
%%capture
!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
!unzip dogImages.zip
!aws s3 cp dogImages s3://mlopsimageclassification/data/ --recursive
Note: This code is located in the train_and_deploy-solution.ipynb notebook
Below we can see that the data was successfully uploaded to S3
Now that we have our data in S3 we can train our model.
So let's start by reviewing some important information you will see in the Jupyter notebook:
SM_CHANNEL_TRAINING: where the data used to train the model is located in AWS S3
SM_MODEL_DIR: where the model artifact will be saved in S3
SM_OUTPUT_DATA_DIR: where the output will be saved in S3
Here we are passing some S3 paths which will be used by the notebook instance to get the data, save the model, and save the output:
os.environ['SM_CHANNEL_TRAINING']='s3://mlopsimageclassification/data/'
os.environ['SM_MODEL_DIR']='s3://mlopsimageclassification/model/'
os.environ['SM_OUTPUT_DATA_DIR']='s3://mlopsimageclassification/output/'
Here we can see how to access the environment variables in the hpo.py script:
import argparse
import os

if __name__=='__main__':
    parser=argparse.ArgumentParser()
    parser.add_argument('--learning_rate', type=float)
    parser.add_argument('--batch_size', type=int)
    parser.add_argument('--data', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
    parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--output_dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    args=parser.parse_args()
For this model, two hyperparameters were tuned: learning rate and batch size.
hyperparameter_ranges = {
"learning_rate": ContinuousParameter(0.001, 0.1),
"batch_size": CategoricalParameter([32, 64, 128, 256, 512]),
}
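Besides the ranges, the tuner needs an objective metric to optimize and a regex telling SageMaker how to extract it from the training logs. A minimal sketch of these definitions, assuming the training script prints a test loss line (the metric name and regex here are illustrative):

# Illustrative objective metric setup; the actual names and regex depend on
# what hpo.py logs during training
objective_metric_name = "average test loss"
objective_type = "Minimize"
metric_definitions = [{"Name": "average test loss",
                       "Regex": "Test set: Average loss: ([0-9.]+)"}]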
Below you can see how the hyperparameter tuner and estimator were defined. Notice that we are using a Python script (hpo.py) as the entry point for the estimator; this script contains the code needed to train the model with different hyperparameter values.
estimator = PyTorch(
    entry_point="hpo.py",
    base_job_name='pytorch_dog_hpo',
    role=role,
    framework_version="1.4.0",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    py_version='py3'
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=2,
    max_parallel_jobs=1,  # we only have one ml.g4dn.xlarge instance available
    objective_type=objective_type
)
tuner.fit({"training": "s3://mlopsimageclassification/data/"})
Note: We are passing the S3 path where the data for training, validation, and testing is located to the HyperparameterTuner's fit method
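Once tuning finishes, the best job can be retrieved from the tuner. A minimal sketch, assuming a recent version of the SageMaker Python SDK:

# Name of the training job that achieved the best objective value
print(tuner.best_training_job())
# Estimator re-attached to that best job, ready for deployment
best_estimator = tuner.best_estimator()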
After we start the model training, we can see the training job status at SageMaker -> Training -> Training Jobs
Without multi-instance
Notice that training the model without multi-instance training enabled took 21 minutes to complete
Deploying model
We can check the deployed model in SageMaker -> Inference -> Endpoints
Notice that the model was deployed with one initial instance of the ml.m5.large instance type
predictor = pytorch_model.deploy(initial_instance_count=1, instance_type='ml.m5.large')
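The deploy call above references a pytorch_model object. A minimal sketch of how it might be constructed, assuming the inference2.py script uploaded earlier is the inference entry point and the artifact comes from the training job:

from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(
    model_data=estimator.model_data,  # S3 path to the trained model.tar.gz
    role=role,
    entry_point='inference2.py',
    framework_version='1.4.0',
    py_version='py3'
)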
Below we can see the deployed model
With multi-instance
EC2, like other AWS services, can be found by searching for it by name in the AWS console
Now we can create our new instance by clicking the Launch instances button
First, we must give a name to our instance
We are now selecting an Amazon Machine Image (AMI), which is a supported and maintained image provided by AWS that contains the necessary information to launch an instance. Since we will be training a deep learning model with PyTorch, we need to select an AMI that supports PyTorch for deep learning.
We can have an overview of the AMI information in this image
Next, we need to choose an EC2 instance that is supported by this AMI. According to the documentation, this type of AMI supports the following instances: G3, P3, P3dn, P4d, G5, and G4dn.
EC2 requires a key pair that can be used, for example, to SSH into our instance from another service. A good example would be SSHing into our instance from AWS Cloud9.
Note: To simplify things, other configurations will be left at their default values
Now that we have created our instance, we can connect to it by following the three images below
If everything works well, we are connected to our instance. The last step is to activate the PyTorch virtual environment by typing source activate pytorch in the terminal
Now the fun part :)
First we need to download the dataset to EC2 by running the following commands in the terminal
wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
unzip dogImages.zip
Since we are downloading our data directly to EC2, we can retrieve the paths in the training script as follows. This is a key difference between training the model in a notebook instance versus training it on EC2:
data = 'dogImages'
train_data_path = os.path.join(data, 'train')
test_data_path = os.path.join(data, 'test')
validation_data_path = os.path.join(data, 'valid')
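These paths can then feed standard PyTorch data loaders. A minimal sketch, assuming torchvision is available on the AMI (the transform values and batch size are illustrative):

import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
# ImageFolder expects one subdirectory per class, as in the dog breed dataset
train_data = datasets.ImageFolder(train_data_path, transform=transform)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)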
Next, we are going to create a directory to save the trained models
mkdir TrainedModels
Now, we need to create a Python file and paste the training code into it
Use vim to create an empty file
vim solution.py
Use the following command so that we can paste our code into solution.py without vim re-indenting it
:set paste
Copy the code located in https://github.com/mathewsrc/Operationalizing-an-AWS-ML-Project/blob/master/ec2train1.py and paste into solution.py
Finally, type :wq! and press Enter to save the file and exit vim
Now we can run our script to train the model
python solution.py
Both services have their own advantages:
EC2 instances can be easily scaled up or down based on computing needs; they can be customized to meet specific requirements such as the framework (PyTorch or TensorFlow), number of CPUs, memory size, and GPU support; and they can be optimized for high-performance computing, which can greatly reduce the time it takes to train large machine learning models.
Notebook instances have their own advantages too, such as quick setup, since they come pre-configured with popular machine learning frameworks and libraries, easy collaboration, and integration with other AWS services such as AWS SageMaker, which provides many of the tools required for machine learning engineering and operations.
The following images show how to create an AWS Lambda Function:
Finding Lambda Functions
Creating a Lambda Function
To create a Lambda Function, click on the Create function button
Deploying a Lambda Function
To update our Lambda Function we need to click on the Deploy button. The Deploy button is located to the right of the Test button
Lambda Function configuration
Notice that we have the ability to adjust the memory and storage requirements based on our specific needs.
After we create the Lambda Function, we can replace the default code with the code located at https://github.com/mathewsrc/Operationalizing-an-AWS-ML-Project/blob/master/lamdafunction.py
Since the Lambda function will invoke a SageMaker endpoint, we need to grant the Lambda function permission to access SageMaker:
import boto3
import json

# endpoint_Name and bs (the request payload) are defined earlier in the function
runtime = boto3.Session().client('sagemaker-runtime')
response = runtime.invoke_endpoint(EndpointName=endpoint_Name,
                                   ContentType="application/json",
                                   Accept='application/json',
                                   Body=json.dumps(bs))
result = response['Body'].read().decode('utf-8')
sss = json.loads(result)
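Wrapped in a handler, the snippet could look like this minimal sketch (the endpoint name is an illustrative assumption; the incoming event carries the image URL from the test JSON shown later):

import json
import boto3

endpoint_Name = 'pytorch-inference-endpoint'  # illustrative: use your endpoint name

def lambda_handler(event, context):
    runtime = boto3.Session().client('sagemaker-runtime')
    # Forward the incoming event (e.g. {"url": ...}) to the SageMaker endpoint
    response = runtime.invoke_endpoint(EndpointName=endpoint_Name,
                                       ContentType="application/json",
                                       Accept='application/json',
                                       Body=json.dumps(event))
    result = response['Body'].read().decode('utf-8')
    return {'statusCode': 200, 'body': json.dumps(json.loads(result))}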
Adding SageMaker access permission to Lambda Function
We need to add a new policy to our Lambda function so that it can access SageMaker. This can be done through AWS IAM.
First, select Roles
Next, we need to find our Lambda Function's role and click on it
Click on the Add permissions button and then on the Attach policies button
Finally, we should search for SageMaker and select an appropriate policy. While the full access option may be the simplest choice, it's important to remember that granting excessive permissions to a service can pose security risks. Therefore, it's advisable to carefully consider the level of access required for your specific use case.
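The same attachment can also be done programmatically. A minimal sketch with boto3, assuming the full-access policy and an illustrative role name:

import boto3

iam = boto3.client('iam')
iam.attach_role_policy(
    RoleName='my-lambda-role',  # illustrative: use your Lambda function's role name
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
)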
Now, with the right permissions, we can create a test for our Lambda Function.
First click on Test button
Now give the test a name
Replace the default JSON with the following JSON data, as shown in the image below
{ "url": "https://s3.amazonaws.com/cdn-origin-etr.akc.org/wp-content/uploads/2017/11/20113314/Carolina-Dog-standing-outdoors.jpg" }
By default, a Lambda Function can only respond to one request at a time. One way to change that is to use concurrency so that the Lambda Function can respond to multiple requests at once. Before adding concurrency, we need to configure a version, which can be done in the Configuration tab.
Add a description for the new version and click on Publish button
Now we can add concurrency in Configuration -> Provisioned concurrency -> Edit button
Our final task is to select an integer value for concurrency. Provisioned concurrency initializes a specified number of execution environments, enabling them to respond immediately. Therefore, a higher concurrency level results in reduced latency. In this example, the concurrency was set to two, so this Lambda Function can handle two requests at once, which might not be enough for services with high demand.
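Provisioned concurrency can also be configured programmatically. A minimal sketch with boto3, assuming an illustrative function name and the version published above:

import boto3

client = boto3.client('lambda')
client.put_provisioned_concurrency_config(
    FunctionName='my-lambda-function',  # illustrative: use your function's name
    Qualifier='1',                      # the published version
    ProvisionedConcurrentExecutions=2
)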
Auto scaling is a powerful feature of SageMaker that allows for dynamic adjustment of the number of instances used with deployed models based on changes in workload. With auto scaling, SageMaker automatically increases or decreases the number of instances, ensuring that we only pay for the instances that are actively running.
We can enable auto-scaling in SageMaker -> Endpoints -> Endpoint runtime settings
We can increase the maximum number of instances for our endpoint
We can define a scaling policy to control how auto-scaling works. In this example, the 'Target value' was set to 20, meaning that when our endpoint receives 20 requests simultaneously, auto-scaling will be triggered, and the number of instances will be increased. The 'Scale In' and 'Scale Out' parameters were both set to 30 seconds, which controls the amount of time auto-scaling should wait before increasing or decreasing the number of instances.
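The same policy can be defined with boto3's Application Auto Scaling client. A minimal sketch, assuming an illustrative endpoint/variant name and the values described above:

import boto3

client = boto3.client('application-autoscaling')
resource_id = 'endpoint/my-endpoint/variant/AllTraffic'  # illustrative endpoint/variant

# Register the endpoint variant's instance count as a scalable target
client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=4
)
# Target tracking on invocations per instance, with 30 s cooldowns
client.put_scaling_policy(
    PolicyName='invocations-target-tracking',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 20.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleInCooldown': 30,
        'ScaleOutCooldown': 30
    }
)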
Now we see that our endpoint has auto-scaling enabled
To avoid unnecessary costs, we can delete all services and instances used in this project. Below you can see how to terminate, delete, or stop services and instances in AWS.
Deleting the notebook instance. Notebook instances are located in SageMaker -> Notebooks -> Notebook Instances
Deleting EC2 instances. EC2 instances are located in EC2 -> Instances
Deleting Lambda Functions. Lambda Functions are located in Lambda -> Functions
Deleting Endpoints. Endpoints are located in SageMaker -> Endpoints
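Endpoints can also be deleted programmatically; assuming the predictor from the deployment step is still in scope:

# Delete the endpoint so its instances stop accruing charges
predictor.delete_endpoint()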
https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html
https://docs.aws.amazon.com/s3/index.html?nc2=h_ql_doc_s3
https://docs.aws.amazon.com/lambda/latest/dg/welcome.html
https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html
https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html