AWS Machine Learning Blog

Announcing the Amazon S3 plugin for PyTorch

November 2023: On 11/22/2023, AWS announced the Amazon S3 Connector for PyTorch, a new connector that delivers high throughput for PyTorch training jobs that access data in Amazon S3. We recommend customers use the new connector for PyTorch training jobs that read and write data in Amazon S3. The Amazon S3 Connector for PyTorch delivers a new implementation of PyTorch’s dataset primitive that you can use to load training data from Amazon S3. It also includes a checkpointing interface to save and load checkpoints directly to Amazon S3, without first saving to local storage and writing custom code to upload to Amazon S3. To learn more, read the What’s New post and the landing page in the GitHub repository.

The Amazon S3 plugin for PyTorch is an open-source library built for use with the PyTorch deep learning framework to stream data from Amazon Simple Storage Service (Amazon S3). With this feature available in PyTorch Deep Learning Containers, you can use data from S3 buckets directly with the PyTorch dataset and dataloader APIs without first downloading it to local storage.

What is the Amazon S3 plugin for PyTorch?

The Amazon S3 plugin for PyTorch is designed to be a high-performance PyTorch dataset library for efficiently accessing data stored in S3 buckets. It provides streaming access to data of any size and therefore eliminates the need to provision local storage capacity. The library is designed to take advantage of the high throughput that Amazon S3 offers, with minimal latency.

It also provides a way to transfer data from Amazon S3 in parallel when you need maximum performance, without worrying about thread safety or managing multiple connections to Amazon S3. You can also stream data from .zip or .tar archives and shuffle the dataset within shards or across shards as required. The Amazon S3 plugin for PyTorch works seamlessly with an existing PyTorch code base because the S3Dataset and S3IterableDataset classes provided by this plugin are implementations of PyTorch’s Dataset and IterableDataset interfaces, so you don’t need to change existing code to make it work with Amazon S3.

The library itself is file format agnostic and presents objects in Amazon S3 as a binary buffer (blob). You can apply any additional transformations on the data received from Amazon S3. You can also easily extend S3Dataset or S3IterableDataset to consume data from Amazon S3 and perform data processing as needed.
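Because the plugin hands back raw bytes, decoding is left to your code. The following is a minimal sketch, in plain Python, of turning one such blob into CSV rows. The helper name decode_csv_blob, the sample key, and the sample bytes are hypothetical and not part of the plugin; the (name, bytes) pair only mirrors what the examples later in this post unpack from S3Dataset.

```python
import csv
import io


def decode_csv_blob(name, blob):
    """Decode one (object name, raw bytes) pair into a list of CSV rows.

    Hypothetical helper for illustration; not part of the plugin.
    """
    text = blob.decode("utf-8")
    rows = list(csv.reader(io.StringIO(text)))
    return name, rows


# Stand-in for the bytes of an object fetched from Amazon S3.
sample_blob = b"label,width,height\ncat,100,80\ndog,120,90\n"
name, rows = decode_csv_blob("s3://example-bucket/meta.csv", sample_blob)
```

The same pattern applies to other formats: swap the body of the helper for an image decoder, a Parquet reader, or whatever matches your objects.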

Benefits of using the Amazon S3 plugin for PyTorch

The plugin offers the following benefits:

  • Support for both map-style and iterable-style dataset interfaces – PyTorch supports two different types of datasets. The Amazon S3 plugin for PyTorch also provides the flexibility to use either map-style or iterable-style dataset interfaces based on your needs:
    • Map-style dataset – Represents a map from indexes or keys to data samples. It provides random access capabilities.
    • Iterable-style dataset – Represents an iterable over data samples. This type of dataset is particularly suitable for cases where random reads are expensive or even improbable, and where the batch size depends on the fetched data.
  • Support for various data formats – Training data can be in a variety of different formats, such as CSV, Parquet, and JPEG. This plugin is file format agnostic and presents objects in Amazon S3 as a binary buffer (blob). You can apply any additional transformations on the data received from Amazon S3.
  • Support for shuffling – In deep learning, you may need to shuffle data across shards and within shards to reduce variance. This plugin provides a way to shuffle data in-memory within shards using ShuffleDataset or across shards by providing the input parameter shuffle_urls while extending S3IterableDataset.
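To make the difference between the two dataset styles concrete, here is a minimal sketch using plain Python stand-ins rather than the plugin or PyTorch classes. The class names MapStyleSamples and IterableStyleSamples are hypothetical; PyTorch's Dataset and IterableDataset follow the same __getitem__/__len__ and __iter__ protocols.

```python
class MapStyleSamples:
    """Map-style: random access by index via __getitem__ and __len__."""

    def __init__(self, samples):
        self._samples = list(samples)

    def __len__(self):
        return len(self._samples)

    def __getitem__(self, idx):
        return self._samples[idx]


class IterableStyleSamples:
    """Iterable-style: samples are only produced in sequence via __iter__."""

    def __init__(self, shards):
        self._shards = shards

    def __iter__(self):
        for shard in self._shards:
            # Each shard holds several samples, read in order.
            yield from shard


map_ds = MapStyleSamples(["a", "b", "c"])
iter_ds = IterableStyleSamples([["a", "b"], ["c"]])

third = map_ds[2]          # random access by index
streamed = list(iter_ds)   # sequential access only
```

A map-style dataset fits the one-object-per-sample layout, while an iterable-style dataset fits shards that must be read front to back.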

Building blocks

The Amazon S3 plugin for PyTorch provides a native experience of using data from Amazon S3 to PyTorch without adding complexity in your code. To achieve this, it relies heavily on the AWS SDK. AWS provides high-level utilities for managing transfers to and from Amazon S3 through the AWS SDK. This plugin uses standard TransferManager APIs from the AWS_SDK_CPP package underneath to communicate with Amazon S3. These APIs make extensive use of Amazon S3 multipart download capabilities to achieve enhanced throughput and reliability, and are also thread safe.

When dealing with large content sizes and high bandwidth, this can significantly increase throughput. TransferManager is also responsible for managing resources such as connections and threads, and hides the complexity of transferring files behind simple APIs.
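The idea behind a multipart download is to split one object into byte ranges that can be fetched in parallel. The sketch below only illustrates that splitting step in plain Python; it is not TransferManager's actual logic, and the function names and the part size are assumptions chosen for the example.

```python
def split_into_ranges(object_size, part_size):
    """Return inclusive (start, end) byte ranges covering object_size bytes."""
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + part_size, object_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def to_range_header(byte_range):
    """Format a range the way an HTTP Range header expects it."""
    start, end = byte_range
    return f"bytes={start}-{end}"


# A 25-byte object fetched in 10-byte parts needs three ranged requests.
parts = split_into_ranges(object_size=25, part_size=10)
headers = [to_range_header(p) for p in parts]
```

Each range can then be requested on its own connection and the parts reassembled in order, which is where the throughput gain for large objects comes from.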

To use TransferManager, the plugin has C++ APIs underneath for the following actions:

  • Validating access to S3 buckets
  • Parsing S3 paths
  • Checking file existence
  • Getting file sizes
  • Listing files
  • Reading files

To provide easy access to PyTorch users, the plugin uses Pybind11 to wrap the preceding C++ functions and make them available to be used as PyTorch dataset constructs.
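As an illustration of one of the actions listed above, parsing S3 paths, here is a minimal Python sketch. The plugin performs this step in C++; the helper parse_s3_url and the bucket and key names below are hypothetical and only show the shape of the operation.

```python
from urllib.parse import urlparse


def parse_s3_url(url):
    """Split 's3://bucket/key' into (bucket, key). Hypothetical helper."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not a valid S3 URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# A single object and a prefix are parsed the same way.
bucket, key = parse_s3_url("s3://example-bucket/train/shard-0001.tar")
prefix_bucket, prefix = parse_s3_url("s3://example-bucket/train/")
```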

The Amazon S3 plugin for PyTorch is available to use through pre-configured PyTorch Docker images, or directly from the GitHub repository.

Configuration

Before reading data from the S3 bucket, you need to provide the bucket Region parameter AWS_REGION. By default, a Regional endpoint is used for Amazon S3, with the Region controlled by AWS_REGION.

If AWS_REGION isn’t specified, us-west-2 is used by default. You can specify it either by running export AWS_REGION=us-east-1 or through code with os.environ['AWS_REGION'] = 'us-east-1'.
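The Region rule described above can be sketched in a few lines of Python. The helper resolve_region is hypothetical and only mirrors the behavior stated in this post (use AWS_REGION if set, otherwise us-west-2); it does not show how the plugin reads the variable internally.

```python
import os

DEFAULT_REGION = "us-west-2"  # default stated in this post


def resolve_region(environ):
    """Return the Region implied by the environment, per the rule above."""
    return environ.get("AWS_REGION", DEFAULT_REGION)


# Set the Region in code before creating any dataset.
os.environ["AWS_REGION"] = "us-east-1"

explicit = resolve_region(os.environ)  # AWS_REGION is set
fallback = resolve_region({})          # AWS_REGION is missing
```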

To read objects in a bucket that isn’t publicly accessible, you must provide AWS credentials through one of the following methods:

Use the library

Getting started with this library is easy, as we demonstrate in the following example.

First, log in to Amazon Elastic Container Registry (Amazon ECR):

aws ecr get-login-password --region $REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com

You can use the following commands to run the container. You must use nvidia-docker for GPU images.

nvidia-docker run -it <GPU training container> # for GPU
OR
docker run -it <CPU training container> # for CPU

Use the map-style dataset

If each object in Amazon S3 contains a single training sample, then you can use the map-style dataset (S3Dataset). To partition data across nodes and to shuffle data, you can use this dataset with the PyTorch distributed sampler. Additionally, you can apply preprocessing to the data in Amazon S3 by extending the S3Dataset class. The following example code uses map-style S3Dataset for image datasets:

from awsio.python.lib.io.s3.s3dataset import S3Dataset
from torch.utils.data import DataLoader
from torchvision import transforms
from PIL import Image
import io

class S3ImageSet(S3Dataset):
    def __init__(self, urls, transform=None):
        super().__init__(urls)
        self.transform = transform

    def __getitem__(self, idx):
        img_name, img = super(S3ImageSet, self).__getitem__(idx)
        # Convert bytes object to image
        img = Image.open(io.BytesIO(img)).convert('RGB')
        
        # Apply preprocessing functions on data
        if self.transform is not None:
            img = self.transform(img)
        return img

batch_size = 32

preproc = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    transforms.Resize((100, 100))
])

# urls can be an S3 prefix containing images or a list of individual S3 image paths
urls = 's3://path/to/s3_prefix/'

dataset = S3ImageSet(urls, transform=preproc)
dataloader = DataLoader(dataset,
        batch_size=batch_size,
        num_workers=64)

Please replace the S3 paths with your actual path. This same code is available in the amazon-s3-plugin-for-pytorch GitHub repo. You can run this example with the following code:

git clone https://github.com/aws/amazon-s3-plugin-for-pytorch.git
cd amazon-s3-plugin-for-pytorch/examples
python s3_cv_transform.py
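This section mentions using the map-style dataset with the PyTorch distributed sampler to partition data across nodes. The sketch below shows the underlying idea in plain Python: every node shuffles the same index list with a shared, epoch-based seed and then takes its own strided slice. The function partition_indices is hypothetical and simplified (for example, it does not pad the index list so that every node gets the same count); in a real job you would pass a torch.utils.data.distributed.DistributedSampler to the DataLoader instead.

```python
import random


def partition_indices(num_samples, num_replicas, rank, epoch=0, shuffle=True):
    """Return the sample indices one node would read in a given epoch.

    Simplified illustration: every rank shuffles with the same seed, then
    takes a strided slice, so the ranks see disjoint subsets.
    """
    indices = list(range(num_samples))
    if shuffle:
        random.Random(epoch).shuffle(indices)
    return indices[rank::num_replicas]


rank0 = partition_indices(num_samples=10, num_replicas=2, rank=0, epoch=3)
rank1 = partition_indices(num_samples=10, num_replicas=2, rank=1, epoch=3)
```

Because a map-style dataset supports random access, each node only fetches the S3 objects whose indices it was assigned.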

Use the iterable-style dataset

If each object in Amazon S3 contains multiple training samples (such as archive files containing multiple small files), we recommend using the iterable-style dataset implementation (S3IterableDataset).

Consider using a .tar file for image classification. You can load it easily by writing a custom Python generator function using the iterator returned by S3IterableDataset. (To create shards from a file dataset, refer to the following GitHub repo.) See the following code:

from torch.utils.data import IterableDataset
from awsio.python.lib.io.s3.s3dataset import S3IterableDataset
from PIL import Image
import io
from torchvision import transforms

class ImageS3(IterableDataset):
    def __init__(self, urls, shuffle_urls=False, transform=None):
        self.s3_iter_dataset = S3IterableDataset(urls,
                                                 shuffle_urls)
        self.transform = transform

    def data_generator(self):
        try:
            while True:
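                # Assumes each label file is immediately followed by its image file.
                # Files are read in alphabetical order, so swap the two reads below
                # if your image names sort before your label names.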
                label_fname, label_fobj = next(self.s3_iter_dataset_iterator)
                image_fname, image_fobj = next(self.s3_iter_dataset_iterator)
                
                label = int(label_fobj)
                image_np = Image.open(io.BytesIO(image_fobj)).convert('RGB')
                
                # Apply torch vision transforms if provided
                if self.transform is not None:
                    image_np = self.transform(image_np)
                yield image_np, label

        except StopIteration:
            return
            
    def __iter__(self):
        self.s3_iter_dataset_iterator = iter(self.s3_iter_dataset)
        return self.data_generator()
        
    def set_epoch(self, epoch):
        self.s3_iter_dataset.set_epoch(epoch)

# urls can be a S3 prefix containing all the shards or a list of S3 paths for all the shards 
urls = ["s3://path/to/file1.tar", "s3://path/to/file2.tar"]

# Example Torchvision transforms to apply on data    
preproc = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
    transforms.Resize((100, 100))
])

dataset = ImageS3(urls, transform=preproc)

Please replace the S3 paths with your actual path. You can easily use this dataset with DataLoader for parallel data loading and preprocessing:

from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, num_workers=4, batch_size=32)
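The generator in the ImageS3 example relies on the label and image files arriving in a fixed order. If that order is not guaranteed in your shards, one option is to pair files by name instead. The following is a minimal sketch under stated assumptions: files arrive as (filename, bytes) pairs, as in the example above, and a label and its image share a file name stem with hypothetical extensions .cls and .jpg. The helper pair_label_and_image is not part of the plugin.

```python
import os


def pair_label_and_image(file_iter, label_ext=".cls"):
    """Yield (image_bytes, label) from a stream of (filename, bytes) pairs.

    Files that share a stem are treated as one sample, whichever of the
    two arrives first.
    """
    pending = {}
    for fname, data in file_iter:
        stem, ext = os.path.splitext(fname)
        if stem in pending:
            _, other_data = pending.pop(stem)
            if ext == label_ext:
                # Current file is the label; the buffered one is the image.
                yield other_data, int(data)
            else:
                # Current file is the image; the buffered one is the label.
                yield data, int(other_data)
        else:
            pending[stem] = (ext, data)


stream = [
    ("0001.cls", b"3"), ("0001.jpg", b"img-1"),  # label first
    ("0002.jpg", b"img-2"), ("0002.cls", b"5"),  # image first
]
pairs = list(pair_label_and_image(iter(stream)))
```

The trade-off is a small amount of buffering in exchange for not depending on how file names sort inside each archive.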

We can shuffle the sequence of fetching shards by setting shuffle_urls=True and calling the set_epoch method at the beginning of every epoch:

dataset = ImageS3(urls, transform=preproc, shuffle_urls=True)
for epoch in range(epochs):
    dataset.set_epoch(epoch)
    # training code ...

The preceding code only shuffles the sequence of shards; the individual training samples within the shards are fetched in the same order. To shuffle the order of training samples across shards, use ShuffleDataset. ShuffleDataset maintains a buffer of data samples read from multiple shards and returns a random sample from it. The count of samples to be buffered is specified by buffer_size. To use ShuffleDataset, update the preceding example as follows:

from awsio.python.lib.io.s3.s3dataset import ShuffleDataset

dataset = ShuffleDataset(ImageS3(urls), buffer_size=4000)
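The buffering behavior described for ShuffleDataset can be sketched in plain Python as follows. This shows the general shuffle-buffer technique and is not claimed to be the plugin's exact implementation; the function buffered_shuffle and its seed parameter are hypothetical.

```python
import random


def buffered_shuffle(samples, buffer_size, seed=None):
    """Yield samples in randomized order using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for sample in samples:
        if len(buffer) < buffer_size:
            buffer.append(sample)
            continue
        # Buffer is full: emit a random buffered sample and replace it.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = sample
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
unshuffled = list(buffered_shuffle(range(5), buffer_size=1))
```

A larger buffer mixes samples from more shards at the cost of more memory; a buffer of size 1 leaves the order unchanged.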

This same code is available in the amazon-s3-plugin-for-pytorch GitHub repo. You can run this example with the following code:

git clone https://github.com/aws/amazon-s3-plugin-for-pytorch.git
cd amazon-s3-plugin-for-pytorch/examples
python s3_cv_iterable_shuffle_example.py

Conclusion

In this post, we showed you how to use S3Dataset and S3IterableDataset to stream data directly from S3 buckets and perform training with PyTorch. We demonstrated this solution for a computer vision dataset, but you can apply the same methods to other use cases, such as natural language processing, where the dataset consists of text files.

Laying the foundation to access datasets while training can be critical for many enterprises that want to stop storing data locally and still get the desired performance. With the availability of the Amazon S3 plugin for PyTorch, you can now stream data from S3 buckets and perform the large-scale data processing needed for training in PyTorch.

The Amazon S3 plugin for PyTorch was designed for ease of use and flexibility with PyTorch.

To learn more about how to use this package, we recommend starting with our example use cases.

As we further develop and extend the Amazon S3 plugin for PyTorch, we welcome community participation through questions, requests, and contributions. Head over to the aws/amazon-s3-plugin-for-pytorch GitHub repository to get started!


About the Authors

Roshani Nagmote is a Software Developer for AWS Deep Learning. She focuses on building distributed Deep Learning systems and innovative tools to make Deep Learning accessible for all. In her spare time, she enjoys hiking and exploring new places, and she is a huge dog lover.

Rajesh Parangi Sharabhalingappa is a Senior Software Engineer at AWS Deep Learning. He works on platforms and libraries to make deep learning training easier for customers. Outside of work, he enjoys cycling.

Khaled ElGalaind is the engineering manager for AWS Deep Engine Benchmarking, focusing on performance improvements for Amazon Machine Learning customers. Khaled is passionate about democratizing deep learning. Outside of work, he enjoys volunteering with the Boy Scouts, BBQ, and hiking in Yosemite.

Aditya Bindal is a Senior Product Manager for AWS Deep Learning. He works on products that make it easier for customers to train deep learning models on AWS. In his spare time, he enjoys spending time with his daughter, playing tennis, reading historical fiction, and traveling.