Fine-tuning Foundation Models on Amazon EKS for AI/ML Workloads

In this all-in-one demo on Amazon EKS, we use JupyterHub to guide you through the initial stages of serving a pre-trained model and fine-tuning it with new data. We then shift gears to utilize Ray Operator for optimized serving and Karpenter for custom resource allocation. The demo showcases how to achieve cost-effectiveness and multi-tenancy with various Ray clusters, all while walking you through each step in a Jupyter notebook.

Prerequisites

AWS Credentials configured
AWS CLI
kubectl
Helm
Terraform
Spot Instance Linked Role

Create Spot Instance Linked Role

aws iam create-service-linked-role --aws-service-name spot.amazonaws.com
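The create call returns an error if the role already exists in the account. A minimal sketch of an idempotent variant follows, assuming the role created for `spot.amazonaws.com` is named `AWSServiceRoleForEC2Spot`; the function name `ensure_spot_role` is hypothetical and not part of this repository.

```shell
# Create the Spot service-linked role only when it is missing.
# ASSUMPTION: the role for spot.amazonaws.com is named AWSServiceRoleForEC2Spot.
ensure_spot_role() {
  if aws iam get-role --role-name AWSServiceRoleForEC2Spot >/dev/null 2>&1; then
    echo "exists"
  else
    # Only reached when the lookup above failed (role missing or no access).
    aws iam create-service-linked-role --aws-service-name spot.amazonaws.com >/dev/null &&
      echo "created"
  fi
}
```

Call `ensure_spot_role` once per account; it prints `exists` or `created` depending on which path was taken.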

High Level Architecture

Environment Setup

Training the models in this demonstration requires 8-10 g5.4xlarge Amazon EC2 instances, which together add up to as many as 160 vCPUs of the G5 instance family (16 vCPUs per instance). Make sure your AWS account has enough Service Quota available to accommodate the resources being created.

You can use this deep link to validate your current AWS Service Quota limits. Check the Applied quota value and, if needed, click Request increase at account-level in the top right.
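The vCPU arithmetic above can also be checked from the command line. This is a rough sketch: the helper names `required_vcpus` and `quota_covers` are hypothetical, and the quota code in the comment is an assumption that should be confirmed in the Service Quotas console. Spot capacity is tracked under a separate quota.

```shell
# vCPUs needed for N instances that each expose V vCPUs.
required_vcpus() {
  echo $(( $1 * $2 ))
}

# Succeeds when the applied quota covers the need.
# The CLI may print the quota as a decimal such as "192.0", so drop the fraction.
quota_covers() {
  applied=${1%.*}
  [ "$applied" -ge "$2" ]
}

# Fetching the applied quota (not executed here).
# ASSUMPTION: L-DB2E81BA is the code for "Running On-Demand G and VT instances".
#   aws service-quotas get-service-quota --service-code ec2 \
#     --quota-code L-DB2E81BA --query 'Quota.Value' --output text
```

For example, `required_vcpus 10 16` prints `160`, and `quota_covers 192.0 160` succeeds while `quota_covers 64.0 160` fails.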

Apply Terraform Script

terraform init
terraform plan
terraform apply --auto-approve

This command provisions an EKS cluster along with the following components:

JupyterHub: For development and analysis
NVIDIA GPU Operator: Lets Kubernetes cluster administrators manage GPU nodes just like CPU nodes, instead of provisioning a special OS image
Ray Operator: To manage Ray clusters
Karpenter: For automatic scaling
Kube Prometheus Stack: For observability
Apache Airflow: To automate the end-to-end ML pipeline, fetching DAGs from this Git repository
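If you would rather review the plan before anything is created, one possible variant of the Terraform commands above saves the plan to a file and then applies exactly that file. The function name `provision` and the file name `tfplan` are illustrative and not part of this repository.

```shell
# Initialize, save a plan, then apply the saved plan.
# Each step runs only if the previous one succeeded.
provision() {
  terraform init &&
    terraform plan -out=tfplan &&
    terraform apply tfplan
}
```

Applying a saved plan does not prompt for approval, so inspect the `terraform plan` output (or run `terraform show tfplan`) before the apply step if you split the function up.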

Exporting terraform outputs

Since we will be pushing code to Amazon S3, export the bucket name as BUCKET_NAME:

export BUCKET_NAME=$(terraform output -raw bucket_name)

Keep this bucket name handy; we will use it throughout the demo.
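Because later steps depend on this variable, it can help to confirm that the export actually produced a value, for instance when the command was run outside the terraform directory. A small sketch; the function name `check_bucket_name` is hypothetical.

```shell
# Fail early when BUCKET_NAME is unset or empty.
check_bucket_name() {
  if [ -z "${BUCKET_NAME:-}" ]; then
    echo "BUCKET_NAME is empty; re-run the export from the terraform directory" >&2
    return 1
  fi
  echo "Using bucket: $BUCKET_NAME"
}
```

Running `check_bucket_name` after the export prints the bucket in use, or returns a non-zero status with a hint on stderr.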

Update Kubeconfig

terraform output -raw configure_kubectl | bash

Validate Cluster Setup

kubectl get nodes

You should see output similar to:

NAME                                        STATUS   ROLES    AGE   VERSION
ip-10-8-10-118.us-west-2.compute.internal   Ready    <none>   50m   v1.27.5-eks-43840fb
ip-10-8-17-213.us-west-2.compute.internal   Ready    <none>   51m   v1.27.5
ip-10-8-17-85.us-west-2.compute.internal    Ready    <none>   51m   v1.27.5
ip-10-8-21-138.us-west-2.compute.internal   Ready    <none>   51m   v1.27.5
ip-10-8-22-37.us-west-2.compute.internal    Ready    <none>   51m   v1.27.5
ip-10-8-30-194.us-west-2.compute.internal   Ready    <none>   22m   v1.27.5-eks-43840fb
ip-10-8-32-55.us-west-2.compute.internal    Ready    <none>   24m   v1.27.5-eks-43840fb
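Rather than reading the STATUS column by eye, the same check can be scripted. The sketch below counts nodes whose status is anything other than exactly `Ready`; the function name `count_not_ready` is hypothetical.

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print how many
# nodes have a STATUS (second column) other than exactly "Ready".
# Note: a cordoned node ("Ready,SchedulingDisabled") is counted as not ready.
count_not_ready() {
  awk '$2 != "Ready" { n++ } END { print n + 0 }'
}

# Typical use (not executed here):
#   kubectl get nodes --no-headers | count_not_ready
```

A result of `0` means every node reported `Ready`.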

Validate that all Pods are Running and in the Ready state.

kubectl get pods -A
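With this many components installed, the Pod list is long. As a convenience, the following sketch filters it down to Pods that still need attention, treating `Completed` Pods as fine. The function name `unready_pods` is hypothetical, and the column positions assume the default `kubectl get pods -A` layout (NAMESPACE, NAME, READY, STATUS).

```shell
# Read `kubectl get pods -A --no-headers` output on stdin and print
# namespace/name for every Pod that is neither Completed nor Running with
# all of its containers ready (READY column of the form "n/n").
unready_pods() {
  awk '{
    split($3, r, "/")
    if ($4 != "Completed" && ($4 != "Running" || r[1] != r[2]))
      print $1 "/" $2
  }'
}

# Typical use (not executed here):
#   kubectl get pods -A --no-headers | unready_pods
```

Empty output means every Pod is either Running and fully ready, or Completed.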

You're now ready to proceed with the demonstration.

Modules in This Demonstration

The demonstration is broken down into two modules, each focusing on a specific aspect of fine-tuning Foundation Models like Falcon 7B on Amazon EKS. By the end of this demonstration, you'll have learned how to use notebooks powered by JupyterHub to craft your training and serving scripts and run them on specific Ray clusters.

Demonstration Flow Diagram

1. Crafting scripts in Jupyter Notebook and submitting using RayJobSubmission

2. Serving the fine-tuned model with contextual data using Ray Operator

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Tear down

To tear down your environment, run the tear-down.sh script, located in the terraform/scripts directory, from the terraform directory.

sh scripts/tear-down.sh

