Project developed for the AWS Machine Learning Engineer Scholarship offered by Udacity (2023)
The right deployment configuration is a very important step in machine learning operations, as it can avoid problems such as high costs and poor performance. Examples of configurations for production deployment of a model include compute resources, such as the machine instance type and the number of instances for training and deployment, and security, since a poor security configuration can lead to data leaks or performance issues. By implementing the right configuration, we can have a high-throughput and low-latency machine learning model in production.
Finding SageMaker in AWS
In SageMaker we then create a notebook instance via Notebook -> Notebook Instances -> the Create notebook instance button
Then we create a new instance, choosing a notebook instance name and type. For this project, the ml.m5.xlarge instance type was selected
Below you can see a notebook instance called mlops that has already been created
With the notebook instance created, we can upload a Jupyter notebook, the hpo.py script to train our deep learning image classification model, the inference2.py script, and an image for testing. Below you can find the list of files that we need to upload:
train_and_deploy-solution.ipynb
hpo.py
inference2.py
lab.jpg
Before we can start training our model, we first need to create an S3 bucket to which we will upload our training, validation, and test data. So let's do it now :)
Finding S3
Next, we create a new bucket by clicking the Create bucket button and giving our S3 bucket a unique name
As we can see, our bucket was created in S3
The code snippet below shows how to download the data using the wget command and upload it to AWS S3 using the cp command.
%%capture
!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
!unzip dogImages.zip
!aws s3 cp dogImages s3://mlopsimageclassification/data/ --recursive
Note: This code is located in the train_and_deploy-solution.ipynb notebook
Below we can see that the data was successfully uploaded to S3
Now that we have our data in S3 we can train our model.
So let's start by reviewing some important information you will see in the Jupyter notebook:
SM_CHANNEL_TRAINING: where the data used to train the model is located in AWS S3
SM_MODEL_DIR: where the model artifact will be saved in S3
SM_OUTPUT_DATA_DIR: where the output will be saved in S3
Here we are passing some S3 paths which will be used by the notebook instance to get the data, save the model, and save the output:
os.environ['SM_CHANNEL_TRAINING']='s3://mlopsimageclassification/data/'
os.environ['SM_MODEL_DIR']='s3://mlopsimageclassification/model/'
os.environ['SM_OUTPUT_DATA_DIR']='s3://mlopsimageclassification/output/'
Here we can see how to access the environment variables in the hpo.py script:
import argparse
import os

if __name__=='__main__':
    parser=argparse.ArgumentParser()
    parser.add_argument('--learning_rate', type=float)
    parser.add_argument('--batch_size', type=int)
    parser.add_argument('--data', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
    parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--output_dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    args=parser.parse_args()
For this model, two hyperparameters were tuned: learning rate and batch size.
hyperparameter_ranges = {
"learning_rate": ContinuousParameter(0.001, 0.1),
"batch_size": CategoricalParameter([32, 64, 128, 256, 512]),
}
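Besides the ranges, the tuner needs an objective metric to optimize and a regex telling SageMaker how to extract it from the training logs. A minimal sketch of these definitions, assuming the training script prints a test loss line (the metric name and regex here are illustrative):

# Illustrative objective metric setup; the actual names and regex depend on
# what hpo.py logs during training
objective_metric_name = "average test loss"
objective_type = "Minimize"
metric_definitions = [{"Name": "average test loss",
                       "Regex": "Test set: Average loss: ([0-9.]+)"}]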
Below you can see how the hyperparameter tuner and estimator were defined. Notice that we are using a Python script (hpo.py) as the entry point for the estimator; this script contains the code needed to train the model with different hyperparameter values.
estimator = PyTorch(
    entry_point="hpo.py",
    base_job_name='pytorch_dog_hpo',
    role=role,
    framework_version="1.4.0",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    py_version='py3'
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=2,
    max_parallel_jobs=1,  # we only have one ml.g4dn.xlarge instance available
    objective_type=objective_type
)
tuner.fit({"training": "s3://mlopsimageclassification/data/"})
Note: We are passing the S3 path where the data for training, validation, and testing is located to the HyperparameterTuner's fit method
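Once tuning finishes, the best job can be retrieved from the tuner. A minimal sketch, assuming a recent version of the SageMaker Python SDK:

# Name of the training job that achieved the best objective value
print(tuner.best_training_job())
# Estimator re-attached to that best job, ready for deployment
best_estimator = tuner.best_estimator()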
After we start the model training, we can see the training job status at SageMaker -> Training -> Training Jobs
Without multi-instance
Notice that training the model without multi-instance training enabled took 21 minutes to complete
Deploying model
We can check the deployed model in SageMaker -> Inference -> Endpoints
Notice that the model was deployed with one initial instance of the ml.m5.large instance type
predictor = pytorch_model.deploy(initial_instance_count=1, instance_type='ml.m5.large')
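The deploy call above references a pytorch_model object. A minimal sketch of how it might be constructed, assuming the inference2.py script uploaded earlier is the inference entry point and the artifact comes from the training job:

from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(
    model_data=estimator.model_data,  # S3 path to the trained model.tar.gz
    role=role,
    entry_point='inference2.py',
    framework_version='1.4.0',
    py_version='py3'
)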
Below we can see the deployed model
With multi-instance
EC2, like other AWS services, can be found by searching for it by name in the AWS console
Now we can create our new instance by clicking the Launch instances button
First, we must give a name to our instance
We are now selecting an Amazon Machine Image (AMI), which is a supported and maintained image provided by AWS that contains the necessary information to launch an instance. Since we will be training a deep learning model with PyTorch, we need to select an AMI that supports PyTorch for deep learning.
We can have an overview of the AMI information in this image
Next, we need to choose an EC2 instance that is supported by this AMI. According to the documentation, this type of AMI supports the following instances: G3, P3, P3dn, P4d, G5, and G4dn.
EC2 requires a key pair that can be used, for example, to SSH into our instance from another service. A good example would be SSHing into our instance from AWS Cloud9.
Note: To simplify things, other configurations will be left at their default values
Now that we have created our instance, we can connect to it by following the three images below
If everything works well, we are connected to our instance. The last step is to activate the PyTorch virtual environment by typing source activate pytorch in the terminal
Now the fun part :)
First we need to download the dataset to EC2 by running the following commands in the terminal
wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
unzip dogImages.zip
Since we are downloading our data directly to EC2, we can retrieve the paths in the training script as follows. This is a key difference between training the model in a notebook instance versus training it on EC2:
data = 'dogImages'
train_data_path = os.path.join(data, 'train')
test_data_path = os.path.join(data, 'test')
validation_data_path = os.path.join(data, 'valid')
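These paths can then feed standard PyTorch data loaders. A minimal sketch, assuming torchvision is available on the AMI (the transform values and batch size are illustrative):

import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
# ImageFolder expects one subdirectory per class, as in the dog breed dataset
train_data = datasets.ImageFolder(train_data_path, transform=transform)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)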
Next, we are going to create a directory to save the trained models
mkdir TrainedModels
Now, we need to create a Python file and paste the training code into it
Use vim to create an empty file
vim solution.py
Use the following command so that we can paste our code into solution.py without vim re-indenting it
:set paste
Copy the code located in https://github.com/mathewsrc/Operationalizing-an-AWS-ML-Project/blob/master/ec2train1.py and paste into solution.py
Finally, type :wq! and press Enter to save the file and exit vim
Now we can run our script to train the model
python solution.py
Both services have their own advantages:
EC2 instances can be easily scaled up or down based on computing needs; they can be customized to meet specific requirements such as the framework (PyTorch or TensorFlow), number of CPUs, memory size, and GPU support; and they can be optimized for high-performance computing, which can greatly reduce the time it takes to train large machine learning models.
Notebook instances have their own advantages too, such as quick setup, since they come pre-configured with popular machine learning frameworks and libraries, easy collaboration, and integration with other AWS services such as AWS SageMaker, which provides many of the tools required for machine learning engineering and operations.
The following images show how to create an AWS Lambda Function:
Finding Lambda Functions
Creating a Lambda Function
To create a Lambda Function, click on the Create function button
Deploying a Lambda Function
To update our Lambda Function we need to click on the Deploy button. The Deploy button is located to the right of the Test button
Lambda Function configuration
Notice that we have the ability to adjust the memory and storage requirements based on our specific needs.
After we create the Lambda Function, we can replace the default code with the code located at https://github.com/mathewsrc/Operationalizing-an-AWS-ML-Project/blob/master/lamdafunction.py
Since the Lambda function will invoke a SageMaker endpoint, we need to grant the Lambda function permission to access SageMaker:
import boto3
import json

# endpoint_Name and bs (the request payload) are defined earlier in the function
runtime = boto3.Session().client('sagemaker-runtime')
response = runtime.invoke_endpoint(EndpointName=endpoint_Name,
                                   ContentType="application/json",
                                   Accept='application/json',
                                   Body=json.dumps(bs))
result = response['Body'].read().decode('utf-8')
sss = json.loads(result)
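Wrapped in a handler, the snippet could look like this minimal sketch (the endpoint name is an illustrative assumption; the incoming event carries the image URL from the test JSON shown later):

import json
import boto3

endpoint_Name = 'pytorch-inference-endpoint'  # illustrative: use your endpoint name

def lambda_handler(event, context):
    runtime = boto3.Session().client('sagemaker-runtime')
    # Forward the incoming event (e.g. {"url": ...}) to the SageMaker endpoint
    response = runtime.invoke_endpoint(EndpointName=endpoint_Name,
                                       ContentType="application/json",
                                       Accept='application/json',
                                       Body=json.dumps(event))
    result = response['Body'].read().decode('utf-8')
    return {'statusCode': 200, 'body': json.dumps(json.loads(result))}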
Adding SageMaker access permission to Lambda Function
We need to add a new policy to our Lambda function so that it can access SageMaker. This can be done through AWS IAM.
First, select Roles
Next, we need to find our Lambda Function's role and click on it
Click on the Add permissions button and then on the Attach policies button
Finally, we should search for SageMaker and select an appropriate policy. While the full access option may be the simplest choice, it's important to remember that granting excessive permissions to a service can pose security risks. Therefore, it's advisable to carefully consider the level of access required for your specific use case.
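The same attachment can also be done programmatically. A minimal sketch with boto3, assuming the full-access policy and an illustrative role name:

import boto3

iam = boto3.client('iam')
iam.attach_role_policy(
    RoleName='my-lambda-role',  # illustrative: use your Lambda function's role name
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess'
)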
Now, with the right permissions, we can create a test for our Lambda Function.
First click on Test button
Now give the test a name
Replace the default JSON with the following JSON data, as shown in the image below
{ "url": "https://s3.amazonaws.com/cdn-origin-etr.akc.org/wp-content/uploads/2017/11/20113314/Carolina-Dog-standing-outdoors.jpg" }
By default, a Lambda Function can only respond to one request at a time. One way to change that is to use concurrency so that the Lambda Function can respond to multiple requests at once. Before adding concurrency, we need to configure a version, which can be done in the Configuration tab.
Add a description for the new version and click on Publish button
Now we can add concurrency in Configuration -> Provisioned concurrency -> Edit button
Our final task is to select an integer value for concurrency. Provisioned concurrency initializes a specified number of execution environments, enabling them to respond immediately. Therefore, a higher concurrency level results in reduced latency. In this example, the concurrency was set to two, so this Lambda Function can handle two requests at once, which might not be enough for services with high demand.
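Provisioned concurrency can also be configured programmatically. A minimal sketch with boto3, assuming an illustrative function name and the version published above:

import boto3

client = boto3.client('lambda')
client.put_provisioned_concurrency_config(
    FunctionName='my-lambda-function',  # illustrative: use your function's name
    Qualifier='1',                      # the published version
    ProvisionedConcurrentExecutions=2
)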
Auto scaling is a powerful feature of SageMaker that allows for dynamic adjustment of the number of instances used with deployed models based on changes in workload. With auto scaling, SageMaker automatically increases or decreases the number of instances, ensuring that we only pay for the instances that are actively running.
We can enable auto-scaling in SageMaker -> Endpoints -> Endpoint runtime settings
We can increase the maximum number of instances for our endpoint
We can define a scaling policy to control how auto-scaling works. In this example, the 'Target value' was set to 20, meaning that when our endpoint receives 20 requests simultaneously, auto-scaling will be triggered, and the number of instances will be increased. The 'Scale In' and 'Scale Out' parameters were both set to 30 seconds, which controls the amount of time auto-scaling should wait before increasing or decreasing the number of instances.
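The same policy can be defined with boto3's Application Auto Scaling client. A minimal sketch, assuming an illustrative endpoint/variant name and the values described above:

import boto3

client = boto3.client('application-autoscaling')
resource_id = 'endpoint/my-endpoint/variant/AllTraffic'  # illustrative endpoint/variant

# Register the endpoint variant's instance count as a scalable target
client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=4
)
# Target tracking on invocations per instance, with 30 s cooldowns
client.put_scaling_policy(
    PolicyName='invocations-target-tracking',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 20.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleInCooldown': 30,
        'ScaleOutCooldown': 30
    }
)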
Now we see that our endpoint has auto-scaling enabled
To avoid unnecessary costs, we can delete all services and instances used in this project. Below you can see how to terminate, delete, or stop services and instances in AWS.
Deleting the notebook instance. Notebook instances are located in SageMaker -> Notebooks -> Notebook Instances
Deleting EC2 instances. EC2 instances are located in EC2 -> Instances
Deleting Lambda Functions. Lambda Functions are located in Lambda -> Functions
Deleting Endpoints. Endpoints are located in SageMaker -> Endpoints
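Endpoints can also be deleted programmatically; assuming the predictor from the deployment step is still in scope:

# Delete the endpoint so its instances stop accruing charges
predictor.delete_endpoint()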
https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html
https://docs.aws.amazon.com/s3/index.html?nc2=h_ql_doc_s3
https://docs.aws.amazon.com/lambda/latest/dg/welcome.html
https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html
https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html
https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html