
aws-samples/aws-sagemaker-pipelines-skin-classification

MLOps for existing workflow with SageMaker Pipelines: Skin lesion classification using Computer Vision

This repository shows how to customize a built-in SageMaker MLOps pipeline template to a user-defined workflow. In this case we address a computer vision use case for skin lesion classification. This is a step-by-step guide on how to adapt existing code to the CI/CD pipeline in AWS SageMaker Studio.

Background

Amazon SageMaker is a fully managed service for building, training, and deploying Machine Learning (ML) models. Among its features, SageMaker Pipelines is a continuous integration and continuous delivery (CI/CD) service designed for ML use cases; it can be used to create, automate, and manage end-to-end ML workflows. We are going to use the SageMaker Studio project template "MLOps template for model building, training, and deployment", which creates a pipeline and the infrastructure you need for continuous integration and continuous deployment (CI/CD) of ML models. SageMaker Pipelines comes with SageMaker Python SDK integration, so you can build each step of your pipeline using a Python-based interface. The SageMaker Python SDK is an easy starting point that allows data scientists to train and deploy models using popular deep learning frameworks, with algorithms provided by Amazon or with their own algorithms built into SageMaker-compatible Docker images. For details on how to create and access the template, refer to the SageMaker documentation.
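
To make the Python-based interface concrete, here is a minimal sketch of defining a one-step pipeline with the SageMaker Python SDK. The processor choice, bucket, and script names below are illustrative assumptions, not the template's actual code.

import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import ProcessingStep

role = sagemaker.get_execution_role()  # assumes a SageMaker Studio session

# Any built-in framework processor works here; scikit-learn is just an example.
processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

step_process = ProcessingStep(
    name="PreprocessData",
    processor=processor,
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",  # hypothetical input
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train",
                              source="/opt/ml/processing/train")],
    code="preprocess.py",  # hypothetical script name
)

# Registers (or updates) the pipeline definition in SageMaker.
pipeline = Pipeline(name="example-pipeline", steps=[step_process])
pipeline.upsert(role_arn=role)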

Implementation

I. AWS SageMaker Studio setup

  1. Open SageMaker Studio. This can be done for existing users or while creating new ones. For a detailed walkthrough of setting up SageMaker Studio, see the SageMaker documentation.
  2. In SageMaker Studio, choose Projects from the SageMaker resources menu.

  3. On the Projects page, you can launch a pre-configured SageMaker MLOps template. Choose MLOps template for model building, training, and deployment.


  4. On the next page, provide a Project Name and a short Description, then select Create Project. The project will take a while to be created.

II. Prepare the dataset

  1. Go to the Harvard Dataverse.
  2. Select "Access Dataset" in the top right and review the license (Creative Commons Attribution-NonCommercial 4.0 International Public License).
  3. If you accept the license, select "Original Format Zip" and download the archive.
  4. Create an S3 bucket and choose a name starting with "sagemaker" (this allows SageMaker to access the bucket without any extra permissions). You can enable access logging and encryption for security best practices. Upload dataverse_files.zip to the bucket (a programmatic alternative is sketched after this list) and save the S3 bucket path for later use.
  5. Make a note of the name of the bucket you stored the data in, and the names of any subsequent folders; they will be needed later.
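
As an alternative to the console upload in step 4, the archive can be uploaded with boto3. This is a minimal sketch; the bucket and prefix names are placeholders to replace with your own.

import boto3

bucket = "sagemaker-skin-cancer-example"  # placeholder; must start with "sagemaker"
prefix = "skin_cancer_bucket_prefix"      # placeholder folder name; note it for later

s3 = boto3.client("s3")
s3.upload_file("dataverse_files.zip", bucket, f"{prefix}/dataverse_files.zip")
print(f"Dataset available at s3://{bucket}/{prefix}/dataverse_files.zip")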

III. Preparing for data preprocessing

Since we will be using MXNet and OpenCV in our preprocessing step, we use a pre-built MXNet Docker image and install the remaining dependencies with the requirements.txt file. To do so, copy that file and paste it under pipelines/skin in the modelbuild repository.
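
For reference, the sketch below shows how a processing job based on the pre-built MXNet image can be configured with the SageMaker Python SDK. The framework version, instance type, and job name are assumptions; framework processors such as MXNetProcessor install a requirements.txt found in source_dir before running the script.

import sagemaker
from sagemaker.mxnet.processing import MXNetProcessor

processor = MXNetProcessor(
    framework_version="1.8.0",       # assumed MXNet version
    py_version="py37",
    role=sagemaker.get_execution_role(),
    instance_type="ml.m5.xlarge",    # assumed instance type
    instance_count=1,
    base_job_name="skin-preprocess", # hypothetical job name
)

# source_dir is packaged and shipped to the job; a requirements.txt inside it
# (e.g., listing opencv-python) is installed before preprocess.py runs.
processor.run(
    code="preprocess.py",
    source_dir="pipelines/skin",
)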

IV. Changing the Pipelines template

  1. Create a folder inside the default bucket.
  2. Make sure the SageMaker Studio execution role has access to the default bucket as well as the bucket containing the dataset.
  3. From the list of projects, choose the one that was just created.
  4. On the Repositories tab, select the hyperlinks to clone the CodeCommit repositories into your SageMaker Studio instance.


  5. Navigate to the pipelines directory inside the modelbuild directory and rename the abalone directory to skin.
  6. Now open the codebuild-buildspec.yml file in the modelbuild directory and modify the run-pipeline path from run-pipeline --module-name pipelines.abalone.pipeline (line 15) to this:

run-pipeline --module-name pipelines.skin.pipeline \

  7. Save the file.
  8. Replace three files in the pipelines/skin directory (pipeline.py, preprocess.py, and evaluate.py) with the files from this repository.

  9. Update the preprocess.py file (lines 183-186) with the S3 location (SKIN_CANCER_BUCKET) and folder name (SKIN_CANCER_BUCKET_PATH) where the dataverse_files.zip archive was uploaded to S3 at the end of Step II:
  • skin_cancer_bucket='monai-bucket-skin-cancer' (replace this with your bucket name)
  • skin_cancer_bucket_path='skin_cancer_bucket_prefix' (replace this with the prefix to the dataset inside the bucket)
  • skin_cancer_files='dataverse_files' (replace this with the name of the zip without extension)
  • skin_cancer_files_ext='dataverse_files.zip' (replace this with the name of the zip with extension)

In the example above, the dataset would be stored under: s3://monai-bucket-skin-cancer/skin_cancer_bucket_prefix/dataverse_files.zip
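
The following is a minimal sketch of how preprocess.py might use these four settings to fetch and unpack the archive. The local paths and exact file layout are assumptions; the repository's actual script may differ.

import os
import zipfile
import boto3

skin_cancer_bucket = "monai-bucket-skin-cancer"        # your bucket name
skin_cancer_bucket_path = "skin_cancer_bucket_prefix"  # prefix inside the bucket
skin_cancer_files = "dataverse_files"                  # zip name without extension
skin_cancer_files_ext = "dataverse_files.zip"          # zip name with extension

# Download the archive from S3 into the processing container, then unpack it.
local_dir = "/opt/ml/processing/input"                 # assumed working directory
local_zip = os.path.join(local_dir, skin_cancer_files_ext)
boto3.client("s3").download_file(
    skin_cancer_bucket,
    f"{skin_cancer_bucket_path}/{skin_cancer_files_ext}",
    local_zip,
)
with zipfile.ZipFile(local_zip) as zf:
    zf.extractall(os.path.join(local_dir, skin_cancer_files))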


V. Triggering a pipeline run

Pushing committed changes to the CodeCommit repository (done in the Studio Source Control tab) triggers a new pipeline run, because an Amazon EventBridge rule monitors for commits. We can monitor the run by choosing the pipeline inside the SageMaker project.
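
Besides the Studio UI, pipeline runs can also be listed with boto3. In this sketch the pipeline name is a placeholder; find yours in the Pipelines section of the project.

import boto3

sm = boto3.client("sagemaker")

# List recent executions of the project's pipeline and their statuses.
response = sm.list_pipeline_executions(PipelineName="skin-classification-pipeline")
for summary in response["PipelineExecutionSummaries"]:
    print(summary["PipelineExecutionArn"], summary["PipelineExecutionStatus"])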

  1. To commit the changes, navigate to the Git section in the left panel and follow these steps:
     a. Stage all changes. You don't need to keep track of the -checkpoint files; you can add an entry with *checkpoint.* to the .gitignore file to ignore them.
     b. Commit the changes by providing a summary, your name, and an email address.
     c. Push the changes.


  2. Navigate back to the project and select the Pipelines section.
  3. Double-click the executing pipeline to display its steps; you will be able to monitor the step that is currently running.


  4. When the pipeline run is complete, you can go back to the project screen and choose the Model groups tab. You can then inspect the metadata attached to the model artifacts.
  5. If everything looks good, you can click the Update Status tab and manually approve the model (a programmatic alternative is sketched after this list). The default ModelApprovalStatus is set to PendingManualApproval. If the model has greater than 60% accuracy, it is added to the model registry, but not deployed until the manual approval is complete. You can then go to Endpoints in the SageMaker menu, where you will see a staging endpoint being created. After a while the endpoint will be listed with the InService status.
  6. To deploy the endpoint into production, go to CodePipeline and click the modeldeploy pipeline, which is currently in progress. At the end of the DeployStaging phase, you need to manually approve the deployment. Once that is done, you will see the production endpoint being deployed under SageMaker Endpoints. After a while the endpoint will also be InService.
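
For reference, the manual approval in step 5 can also be performed programmatically. In this hedged sketch the model package group name is a placeholder; in your account it is derived from the project name.

import boto3

sm = boto3.client("sagemaker")

# Find the most recent model package in the project's model package group.
packages = sm.list_model_packages(
    ModelPackageGroupName="skin-classification-models",  # hypothetical group name
    SortBy="CreationTime",
    SortOrder="Descending",
)
latest_arn = packages["ModelPackageSummaryList"][0]["ModelPackageArn"]

# In this template, flipping the status to Approved triggers the modeldeploy pipeline.
sm.update_model_package(ModelPackageArn=latest_arn, ModelApprovalStatus="Approved")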


Dataset

The HAM10000 dataset is a large collection of multi-source dermatoscopic images of common pigmented skin lesions, available from the Harvard Dataverse.


Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
