
Operationalizing-an-AWS-ML-Project

This project uses SageMaker features to adjust, improve, configure, and prepare an image classification model for production-grade deployment. It was developed for the AWS Machine Learning Engineer Scholarship offered by Udacity (2023).

Choosing the right configuration for deployment is an important step in machine learning operations, as it can prevent problems such as high costs and poor performance. Configuration decisions for a production deployment include compute resources (for example, the machine instance type and the number of instances used for training and deployment) and security, since a poorly secured configuration can lead to data leaks or performance issues. With the right configuration in place, we can run a high-throughput, low-latency machine learning model in production.


Setup notebook instance

Finding SageMaker in AWS

findsagemaker

In SageMaker, we then create a notebook instance under Notebook -> Notebook Instances, using the Create notebook instance button

createanotebookinstance

Then we create a new instance, choosing a notebook instance name and type. For this project, the ml.m5.xlarge instance type was selected

setupnotebooksetinstance

Below you can see a notebook instance called mlops that has already been created

notebookcreated

With the notebook instance created, we can upload a Jupyter notebook, the script used to train our deep learning image classification model (hpo.py), the inference script (inference2.py), and an image for testing. Below you can find the list of files that we need to upload:

train_and_deploy-solution.ipynb
hpo.py
inference2.py
lab.jpg

Before we can start training our model, we first need to create an S3 bucket to which we will upload our training, validation, and test data. So let's do it now :)


Setup S3

Finding s3

finds3

Next, we create a new bucket by clicking the Create bucket button and giving our S3 bucket a unique name

creates3bucket

As we can see, our bucket was created in S3

s3bucket

Uploading data to S3

The code snippet below shows how to download the data using the wget command and upload it to AWS S3 using the cp command.

%%capture
# Download the dataset, unzip it, and copy it to the S3 bucket
!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
!unzip dogImages.zip
!aws s3 cp dogImages s3://mlopsimageclassification/data/ --recursive

Note: This code is located in the train_and_deploy-solution.ipynb notebook
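To confirm that the upload succeeded, we can also list the bucket contents from the notebook (a quick optional check):

# List the uploaded objects with a total object count and size
!aws s3 ls s3://mlopsimageclassification/data/ --recursive --summarize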

Below we can see that the data was successfully uploaded to S3

datains3

Now that we have our data in S3, we can train our model.

So let's start by reviewing some important information you will see in the Jupyter notebook


Hyperparameter tuning

Defining environment variables for hyperparameter tuning

SM_CHANNEL_TRAINING: where the data used to train the model is located in AWS S3

SM_MODEL_DIR: where the model artifact will be saved in S3

SM_OUTPUT_DATA_DIR: where the output will be saved in S3

Here we pass some S3 paths that the notebook instance will use to fetch the data and save the model artifact and output:

os.environ['SM_CHANNEL_TRAINING']='s3://mlopsimageclassification/data/'
os.environ['SM_MODEL_DIR']='s3://mlopsimageclassification/model/'
os.environ['SM_OUTPUT_DATA_DIR']='s3://mlopsimageclassification/output/'

Here we can see how the environment variables are accessed in the hpo.py script:

import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--learning_rate', type=float)
    parser.add_argument('--batch_size', type=int)
    parser.add_argument('--data', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
    parser.add_argument('--model_dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--output_dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])

    args = parser.parse_args()

For this model, two hyperparameters were tuned: learning rate and batch size.

hyperparameter_ranges = {
    "learning_rate": ContinuousParameter(0.001, 0.1),
    "batch_size": CategoricalParameter([32, 64, 128, 256, 512]),
}
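For reference, the parameter range classes and the tuner used in this section come from the SageMaker Python SDK; the corresponding imports would look like this:

from sagemaker.pytorch import PyTorch
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
)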

Below you can see how the hyperparameter tuner and estimator were defined. Notice that we use a Python script (hpo.py) as the entry point for the estimator; this script contains the code needed to train the model with different hyperparameter values.

estimator = PyTorch(
    entry_point="hpo.py",
    base_job_name='pytorch_dog_hpo',
    role=role,
    framework_version="1.4.0",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    py_version='py3'
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=2,
    max_parallel_jobs=1,  # only one ml.g4dn.xlarge instance is available
    objective_type=objective_type
)

tuner.fit({"training": "s3://mlopsimageclassification/data/"})
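The tuner above also references objective_metric_name, metric_definitions, and objective_type, which are defined elsewhere in the notebook. A minimal sketch of what such definitions could look like (the metric name and regex are assumptions; they must match whatever hpo.py actually logs):

# The tuner minimizes the test loss printed by hpo.py; the regex
# extracts the value from the training logs (both are assumptions)
objective_metric_name = "average test loss"
objective_type = "Minimize"
metric_definitions = [
    {
        "Name": "average test loss",
        "Regex": "Test set: Average loss: ([0-9\\.]+)",
    }
]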

Note: We pass the S3 path where the training, validation, and test data are located to the HyperparameterTuner fit method

After we start the model training, we can see the training job status in SageMaker -> Training -> Training Jobs

trainingjobs

Training the model with the best hyperparameter values

Without multi-instance

Notice that training the model without multi-instance training enabled took 21 minutes to complete

trainingwithoutmultiinstance

trainingjobwithoutmultiinstanceconfigs
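For reference, a minimal sketch of how the best hyperparameters found by the tuner could be retrieved and reused for this final training job (best_estimator comes from the SageMaker Python SDK; the notebook may do this differently):

# Retrieve the estimator attached to the tuner's best training job
best_estimator = tuner.best_estimator()
best_hyperparameters = best_estimator.hyperparameters()

# Re-train using the winning hyperparameter values
estimator = PyTorch(
    entry_point="hpo.py",
    base_job_name='pytorch_dog_best',
    role=role,
    framework_version="1.4.0",
    instance_count=1,
    instance_type="ml.g4dn.xlarge",
    py_version='py3',
    hyperparameters=best_hyperparameters,
)
estimator.fit({"training": "s3://mlopsimageclassification/data/"})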

Deploying model

We can check the deployed model in SageMaker -> Inference -> Endpoints

Notice that the model was deployed with one initial instance of type ml.m5.large

predictor = pytorch_model.deploy(initial_instance_count=1, instance_type='ml.m5.large')
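The pytorch_model object used above is built from the trained model artifact. A minimal sketch, assuming inference2.py serves as the inference entry point:

from sagemaker.pytorch import PyTorchModel

# Wrap the training artifact in a deployable model (a sketch; the
# artifact location and entry point are assumptions)
pytorch_model = PyTorchModel(
    model_data=estimator.model_data,
    role=role,
    entry_point="inference2.py",
    framework_version="1.4.0",
    py_version="py3",
)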

Below we can see the deployed model

modeldeployedwithoutmultiinstance
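Once deployed, the endpoint can be queried directly from the notebook. A sketch, assuming the endpoint accepts the same JSON payload that the Lambda function sends later in this README:

import json

# Hypothetical request: a JSON body carrying an image URL
payload = {"url": "https://s3.amazonaws.com/cdn-origin-etr.akc.org/wp-content/uploads/2017/11/20113314/Carolina-Dog-standing-outdoors.jpg"}
response = predictor.predict(
    json.dumps(payload),
    initial_args={"ContentType": "application/json"},
)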

With multi-instance
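To enable multi-instance training, the estimator definition stays the same except for instance_count. A minimal sketch:

# Same estimator as before, but the training job is now
# distributed across two instances instead of one
estimator_multi = PyTorch(
    entry_point="hpo.py",
    base_job_name='pytorch_dog_multi',
    role=role,
    framework_version="1.4.0",
    instance_count=2,  # enables multi-instance training
    instance_type="ml.g4dn.xlarge",
    py_version='py3',
)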

trainingjobmultiinstance

trainingjobmultiinstanceconfigs


EC2 Setup

EC2, like other AWS services, can be found by searching for it by name in the AWS console

findec2

Now we can create our new instance by clicking the Launch instances button

ec2instance

First, we must give a name to our instance

setupec2name

We are now selecting an Amazon Machine Image (AMI), which is a supported and maintained image provided by AWS that contains the necessary information to launch an instance. Since we will be training a deep learning model with PyTorch, we need to select an AMI that supports PyTorch for deep learning.

setupec2choosingami

We can have an overview of the AMI information in this image

amidetails

Next, we need to choose an EC2 instance that is supported by this AMI. According to the documentation, this type of AMI supports the following instances: G3, P3, P3dn, P4d, G5, and G4dn.

setupec2choosinginstance

EC2 requires a key pair that can be used, for example, to SSH into our instance from another service. A good example would be SSHing into our instance from AWS Cloud9.

setupec2creatingapairkey

setupec2createkeypair

Note: To simplify things, other configurations will be left at their default values

Now that we have created our instance, we can connect to it by following the three images below

connectingtoec2choosinginstance

ec2connecting

ec2connecting2

If everything works well, we are connected to our instance. The last step is to activate the PyTorch virtual environment by typing source activate pytorch in the terminal

ec2activatepytorchvirtualenviroment

Now the fun part :)

First we need to download the dataset to EC2 by running the following commands in the terminal:

wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
unzip dogImages.zip

Since we downloaded our data directly to EC2, we can retrieve the paths in the training script as follows. This is a key difference between training the model in a notebook instance versus training it on EC2:

import os

data = 'dogImages'
train_data_path = os.path.join(data, 'train')
test_data_path = os.path.join(data, 'test')
validation_data_path = os.path.join(data, 'valid')
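These local paths can then feed the training code directly. A minimal sketch, assuming torchvision-style ImageFolder datasets (the transforms are placeholders; the real ones live in the training script):

import torch
from torchvision import datasets, transforms

# Placeholder preprocessing pipeline
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_data = datasets.ImageFolder(train_data_path, transform=transform)
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)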

Next, we are going to create a directory to save the trained models

mkdir TrainedModels

Now, we need to create a Python file and paste the training code into it

Use vim to create an empty file:

vim solution.py

Use the following vim command so that our code can be pasted into solution.py without auto-indentation problems:

:set paste

Copy the code located at https://github.com/mathewsrc/Operationalizing-an-AWS-ML-Project/blob/master/ec2train1.py and paste it into solution.py

Then save and exit by typing :wq! and pressing Enter

Now we can run our script to train the model

python solution.py

EC2 vs Notebook instance for training models

Both services have their own advantages:

EC2 instances can be easily scaled up or down based on computing needs; they can be customized to meet specific requirements such as the framework (PyTorch or TensorFlow), the number of CPUs, memory size, and GPU support; and they can be optimized for high-performance computing, which can greatly reduce the time it takes to train large machine learning models.

Notebook instances have their own advantages too, such as quick setup, as they come pre-configured with popular machine learning frameworks and libraries; easy collaboration; and integration with other AWS services such as AWS SageMaker, which provides many of the tools required for machine learning engineering and operations.


Lambda Functions Setup

The following images show how to create an AWS Lambda function:

Finding Lambda Functions

findlambda

Creating a Lambda Function

To create a Lambda function, click the Create a function button

create a function

Deploying a Lambda Function

To update our Lambda function, we need to click the Deploy button, which is located to the right of the Test button

lambdadeployfunction

Lambda Function configuration

Notice that we have the ability to adjust the memory and storage requirements based on our specific needs.

lambdaconfiguration

After we create the Lambda function, we can replace the default code with the code located at https://github.com/mathewsrc/Operationalizing-an-AWS-ML-Project/blob/master/lamdafunction.py

Since the Lambda function will invoke a SageMaker endpoint, we need to grant it permission to access SageMaker. The snippet below shows how the function invokes the endpoint:

import json
import boto3

# Client used to call the SageMaker endpoint
runtime = boto3.Session().client('sagemaker-runtime')

response = runtime.invoke_endpoint(
    EndpointName=endpoint_Name,
    ContentType="application/json",
    Accept='application/json',
    Body=json.dumps(bs),
)

# Decode the JSON response returned by the endpoint
result = response['Body'].read().decode('utf-8')
sss = json.loads(result)
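For context, a minimal sketch of how this call might sit inside the full handler in lamdafunction.py (the endpoint name, the event shape, and the response structure are assumptions for illustration):

import json
import boto3

endpoint_Name = 'pytorch-inference-endpoint'  # hypothetical endpoint name

runtime = boto3.Session().client('sagemaker-runtime')

def lambda_handler(event, context):
    # The test event defined later in this README passes the image
    # URL under the "url" key, so the event is forwarded as-is
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_Name,
        ContentType="application/json",
        Accept='application/json',
        Body=json.dumps(event),
    )
    result = json.loads(response['Body'].read().decode('utf-8'))
    return {
        'statusCode': 200,
        'body': json.dumps(result),
    }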

Adding SageMaker access permission to Lambda Function

We need to add a new policy to our Lambda function so that it can access SageMaker. This can be done through AWS IAM.

findiam

First, select Roles

iamroletab

Next, we need to find the role used by our Lambda function and click on it

imaselectinglambdafunctionrole

Click the Add permissions button and then the Attach policies button

iamaddpermissions

Finally, we should search for SageMaker and select an appropriate policy. While the full access option may be the simplest choice, it's important to remember that granting excessive permissions to a service can pose security risks. Therefore, it's advisable to carefully consider the level of access required for your specific use case.

iamsagemakerpermissionsforlambda
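For example, instead of attaching a full-access policy, a scoped-down policy could allow only endpoint invocation. A sketch (the Resource could be narrowed further to the specific endpoint's ARN):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sagemaker:InvokeEndpoint",
      "Resource": "*"
    }
  ]
}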

Now, with the right permissions in place, we can create a test for our Lambda function.

First, click the Test button

testlambdafunction

Now give a name for the test

lambdafunctionconfiguringtest

Replace the default JSON with the following JSON data, as shown in the image below:

{ "url": "https://s3.amazonaws.com/cdn-origin-etr.akc.org/wp-content/uploads/2017/11/20113314/Carolina-Dog-standing-outdoors.jpg" }

addjsontotestlambda

Scaling Lambda Function and Endpoint

Adding concurrency to Lambda Function

By default, each execution environment of a Lambda function handles one request at a time. Using provisioned concurrency, we can keep multiple execution environments initialized so that the Lambda function can respond to multiple requests at once. Before adding concurrency, we need to publish a version, which can be done in the Configuration tab.

lambdaversionsetup

Add a description for the new version and click the Publish button

creatinganewversionforlambda

Now we can add concurrency under Configuration -> Provisioned concurrency, using the Edit button

provisionedconcurrencyforlambda

Our final task is to select an integer value for the concurrency. Provisioned concurrency initializes a specified number of execution environments, enabling them to respond immediately; a higher concurrency level therefore results in lower latency. In this example, the concurrency was set to two, so this Lambda function can handle two requests at once, which might not be enough for services with high demand.

lambdaconcurrencyconfig


Auto-Scaling endpoint

Auto scaling is a powerful feature of SageMaker that allows for dynamic adjustment of the number of instances used with deployed models based on changes in workload. With auto scaling, SageMaker automatically increases or decreases the number of instances, ensuring that we only pay for the instances that are actively running.

We can enable auto-scaling in SageMaker -> Endpoints -> Endpoint runtime settings

endpointruntimesettings

We can increase the maximum number of instances for our endpoint

autoscalingnumberinstances

We can define a scaling policy to control how auto-scaling works. In this example, the target value was set to 20, meaning that when our endpoint receives 20 requests simultaneously, auto-scaling will be triggered and the number of instances will increase. The scale-in and scale-out cool down periods were both set to 30 seconds; these control how long auto-scaling waits before decreasing or increasing the number of instances.

scallingpolicy
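The same behavior can also be configured programmatically through the Application Auto Scaling API. A minimal sketch mirroring the console settings above (the endpoint and variant names are hypothetical):

import boto3

client = boto3.client('application-autoscaling')

# Hypothetical identifiers: endpoint name and production variant
resource_id = 'endpoint/mlops-endpoint/variant/AllTraffic'

# Register the variant's instance count as a scalable target
client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=3,
)

# Target tracking policy: scale when invocations per instance exceed 20
client.put_scaling_policy(
    PolicyName='invocations-target-tracking',
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 20.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
        },
        'ScaleInCooldown': 30,
        'ScaleOutCooldown': 30,
    },
)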

Now we can see that our endpoint has auto-scaling enabled

autoscalingcreated


Deleting EC2 instances, Lambda Functions, and Endpoints

To avoid unnecessary costs, we can delete all services and instances used in this project. Below you can see how to terminate, delete, or stop services and instances in AWS.

Deleting the notebook instance. Notebook instances are located in SageMaker -> Notebooks -> Notebook Instances

stopndeletenotebookinstance

Deleting EC2 instances. EC2 instances are located in EC2 -> Instances

stopnterminateec2

Deleting Lambda Functions. Lambda Functions are located in Lambda -> Functions

deletelambda

Deleting Endpoints. Endpoints are located in SageMaker -> Endpoints

deletingendpoint
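Endpoints can also be deleted programmatically from the notebook, for example with the predictor created earlier:

# Remove the endpoint so its instances stop incurring charges
predictor.delete_endpoint()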

References

https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html

https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts.html

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html

https://docs.aws.amazon.com/s3/index.html?nc2=h_ql_doc_s3

https://docs.aws.amazon.com/lambda/latest/dg/welcome.html

https://docs.aws.amazon.com/lambda/latest/dg/configuration-concurrency.html

https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html

https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html
