AWS Machine Learning Blog

Fine-tune and deploy a Wav2Vec2 model for speech recognition with Hugging Face and Amazon SageMaker

Automatic speech recognition (ASR) is a commonly used machine learning (ML) technology in our daily lives and business scenarios. Applications such as voice-controlled assistants like Alexa and Siri, and voice-to-text applications like automatic subtitling for videos and transcribing meetings, are all powered by this technology. These applications take audio clips as input and convert speech signals to text, also referred as speech-to-text applications. In recent years, ASR services such as Amazon Transcribe let customers add speech to text capabilities with no prior machine learning experience required. Amazon Transcribe is a fully managed AI service that provides customers with pre-trained models that are continually improving, cost flexibility, and scalability across languages, regions, and volume.

For data scientists who want to build their own ASR models, many of the latest models can achieve a good performance, such as transformer-based models Wav2Vec2 and Speech2Text. Transformer is a sequence-to-sequence deep learning architecture originally proposed for machine translation. Now it’s extended to solve all kinds of natural language processing (NLP) tasks, such as text classification, text summarization, and ASR. The transformer architecture yields very good model performance and results in various NLP tasks; however, the models’ sizes (the number of parameters) as well as the amount of data they’re pre-trained on increase exponentially when pursuing better performance. It becomes very time-consuming and costly to train a transformer from scratch, for example training a BERT model from scratch could take 4 days and cost $6,912 (for more information, see The Staggering Cost of Training SOTA AI Models). Hugging Face, an AI company, provides an open-source platform where developers can share and reuse thousands of pre-trained transformer models. With the transfer learning technique, you can fine-tune your model with a small set of labeled data for a target use case. This reduces the overall compute cost, speeds up the development lifecycle, and lessens the carbon footprint of the community.

AWS announced collaboration with Hugging Face in 2021. Developers can easily work with Hugging Face models on Amazon SageMaker and benefit from both worlds. You can fine-tune and optimize all models from Hugging Face, and SageMaker provides managed training and inference services that offer high performance resources and high scalability via Amazon SageMaker distributed training libraries. This collaboration can help you accelerate your NLP tasks’ productization journey and realize business benefits.

This post shows how to use SageMaker to easily fine-tune the latest Wav2Vec2 model from Hugging Face, and then deploy the model with a custom-defined inference process to a SageMaker managed inference endpoint. Finally, you can test the model performance with sample audio clips, and review the corresponding transcription as output.

Wav2Vec2 background

Wav2Vec2 is a transformer-based architecture for ASR tasks and was released in September 2020. The following diagram shows its simplified architecture. For more details, see the original paper. As the diagram shows, the model is composed of a multi-layer convolutional network (CNN) as a feature extractor, which takes an input audio signal and outputs audio representations, also considered as features. They are fed into a transformer network to generate contextualized representations. This part of training can be self-supervised; the transformer can be trained with unlabeled speech and learn from it. Then the model is fine-tuned on labeled data with the Connectionist Temporal Classification (CTC) algorithm for specific ASR tasks. The base model we use in this post is Wav2Vec2-Base-960h, fine-tuned on 960 hours of Librispeech on 16 kHz sampled speech audio.

CTC is a character-based algorithm. During training, it’s able to demarcate each character of the transcription in the speech automatically, so the timeframe alignment isn’t required between audio signal and transcription. For example, if the audio clip says “Hello World,” we don’t need to know in which second the word “hello” is located. It saves a lot of labeling effort for ASR use cases. For more information about how the algorithm works, refer to Sequence Modeling With CTC.

Solution overview

In this post, we use the SUPERB (Speech processing Universal PERformance Benchmark) dataset available from the Hugging Face Datasets library, and fine-tune the Wav2Vec2 model and deploy it as a SageMaker endpoint for real-time inference for an ASR task. SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks.

The following diagram provides a high-level view of the solution workflow.

First, we show how to load and preprocess the SUPERB dataset in a SageMaker environment in order to obtain a tokenizer and feature extractor, which are required for fine-tuning the Wav2Vec2 model. Then we use SageMaker Script Mode for training and inference steps, which allows you to define and use custom training and inference scripts, and SageMaker provides supported Hugging Face framework Docker containers. For more information about training and serving Hugging Face models on SageMaker, see Use Hugging Face with Amazon SageMaker. This functionality is available through the development of Hugging Face AWS Deep Learning Containers (DLCs).

The notebook and code from this post are available on GitHub. The notebook is tested in both Amazon SageMaker Studio and SageMaker notebook environments.

Data preprocessing

In this section, we walk through the steps to preprocess the data.

Process the dataset

In this post we use SUPERB dataset, which you can load from the Hugging Face Datasets library directly using the load_dataset function. The SUPERB dataset also includes speaker_id and chapter_id; we remove these columns and only keep audio files and transcriptions to fine-tune the Wav2Vec2 model for an ASR task, which transcribes speech to text. To speed up the fine-tuning process for this example, we only take the test dataset from the original dataset, then split it into train and test datasets. See the following code:

data = load_dataset("superb", 'asr', ignore_verifications=True) 
data = data.remove_columns(['speaker_id', 'chapter_id', 'id'])
# reduce the data volume for this example. only take the test data from the original dataset for fine-tune
data = data['test'] 

train_test = data.train_test_split(test_size=0.2)
dataset = DatasetDict({
    'train': train_test['train'],
    'test': train_test['test']})

After we process the data, the dataset structure is as follows:

DatasetDict({
    train: Dataset({
        features: ['file', 'audio', 'text'],
        num_rows: 2096
    })
    test: Dataset({
        features: ['file', 'audio', 'text'],
        num_rows: 524
    })
})

Let’s print one data point from the train dataset and examine the information in each feature. ‘file’ is the audio file path where it’s saved and cached in the local repository. ‘audio’ contains three components: ‘path’ is the same as ‘file’, ‘array’ is the numerical representation of the raw waveform of the audio file in NumPy array format, and ‘sampling_rate’ shows the number of samples of audio recorded every second. ‘text’ is the transcript of the audio file.

print(dataset['train'][0])
result: 
{ {'file': '/root/.cache/huggingface/datasets/downloads/extracted/e0f3d50e856945385982ba36b58615b72eef9b2ba5a2565bdcc225b70f495eed/LibriSpeech/test-clean/7021/85628/7021-85628-0000.flac',
 'audio': {'path': '/root/.cache/huggingface/datasets/downloads/extracted/e0f3d50e856945385982ba36b58615b72eef9b2ba5a2565bdcc225b70f495eed/LibriSpeech/test-clean/7021/85628/7021-85628-0000.flac',
  'array': array([-0.00018311, -0.00024414, -0.00018311, ...,  0.00061035,
          0.00064087,  0.00061035], dtype=float32),
  'sampling_rate': 16000},
 'text': 'but anders cared nothing about that'}

Build a vocabulary file

The Wav2Vec2 model uses the CTC algorithm to train deep neural networks in sequence problems, and its output is a single letter or blank. It uses a character-based tokenizer. Therefore, we extract distinct letters from the dataset and build the vocabulary file using the following code:

def extract_characters(batch):
  texts = " ".join(batch["text"])
  vocab = list(set(texts))
  return {"vocab": [vocab], "texts": [texts]}

vocabs = dataset.map(extract_characters, batched=True, batch_size=-1, 
                   keep_in_memory=True, remove_columns= dataset.column_names["train"])

vocab_list = list(set(vocabs["train"]["vocab"][0]) | set(vocabs["test"]["vocab"][0]))
vocab_dict = {v: k for k, v in enumerate(vocab_list)}
vocab_dict["|"] = vocab_dict[" "]
del vocab_dict[" "]

vocab_dict["[UNK]"] = len(vocab_dict) # add "unknown" token 
vocab_dict["[PAD]"] = len(vocab_dict) # add a padding token that corresponds to CTC's "blank token"

with open('vocab.json', 'w') as vocab_file:
    json.dump(vocab_dict, vocab_file)

Create a tokenizer and feature extractor

The Wav2Vec2 model contains a tokenizer and feature extractor. In this step, we use the vocab.json file that we created from the previous step to create the Wav2Vec2CTCTokenizer. We use Wav2Vec2FeatureExtractor to make sure that the dataset used in fine-tuning has the same audio sampling rate as the dataset used for pre-training. Finally, we create a Wav2Vec2 processor that can wrap the feature extractor and the tokenizer into one single processor. See the following code:

# create Wav2Vec2 tokenizer
tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                  pad_token="[PAD]", word_delimiter_token="|")

# create Wav2Vec2 feature extractor
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000, 
                                             padding_value=0.0, do_normalize=True, return_attention_mask=False)
# create a processor pipeline 
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

Prepare the train and test datasets

Next, we extract the array representation of the audio files and its sampling_rate from the dataset and process them using the processor, in order to have train and test data that can be consumed by the model:

# extract the numerical representation from the dataset
def extract_array_samplingrate(batch):
    batch["speech"] = batch['audio']['array'].tolist()
    batch["sampling_rate"] = batch['audio']['sampling_rate']
    batch["target_text"] = batch["text"]
    return batch

dataset = dataset.map(extract_array_samplingrate, 
                      remove_columns=dataset.column_names["train"])

# process the dataset with processor pipeline that created above
def process_dataset(batch):  
    batch["input_values"] = processor(batch["speech"], 
                            sampling_rate=batch["sampling_rate"][0]).input_values

    with processor.as_target_processor():
        batch["labels"] = processor(batch["target_text"]).input_ids
    return batch

data_processed = dataset.map(process_dataset, 
                    remove_columns=dataset.column_names["train"], batch_size=8, 
                    batched=True)

train_dataset = data_processed['train']
test_dataset = data_processed['test']

Then we upload the train and test data to Amazon Simple Storage Service (Amazon S3) using the following code:

from datasets.filesystems import S3FileSystem
s3 = S3FileSystem()

# save train_dataset to s3
training_input_path = f's3://{BUCKET}/{PREFIX}/train'
train_dataset.save_to_disk(training_input_path,fs=s3)

# save test_dataset to s3
test_input_path = f's3://{BUCKET}/{PREFIX}/test'
test_dataset.save_to_disk(test_input_path,fs=s3)

Fine-tune the Hugging Face model (Wav2Vec2)

We use SageMaker Hugging Face DLC script mode to construct the training and inference job, which allows you to write custom training and serving code and using Hugging Face framework containers that are maintained and supported by AWS.

When we create a training job using the script mode, the entry_point script, hyperparameters, its dependencies (inside requirements.txt), and input data (train and test datasets) are copied into the container. Then it invokes the entry_point training script, where the train and test datasets are loaded, training steps are performed, and model artifacts are saved in /opt/ml/model in the container. After training, artifacts in this directory are uploaded to Amazon S3 for later model hosting.

You can inspect the training script in the GitHub repo, in the scripts/ directory.

Create an estimator and start a training job

We use the Hugging Face estimator class to train our model. When creating the estimator, you need to specify the following parameters:

  • entry_point – The name of the training script. It loads data from the input channels, configures training with hyperparameters, trains a model, and saves the model.
  • source_dir – The location of the training scripts.
  • transformers_version – The Hugging Face Transformers library version we want to use.
  • pytorch_version – The PyTorch version that’s compatible with the Transformers library.

For this use case and dataset, we use one ml.p3.2xlarge instance, and the training job is able to finish in around 2 hours. You can select a more powerful instance with more memory and GPU to reduce the training time; however, it incurs more cost.

When you create a Hugging Face estimator, you can configure hyperparameters and provide a custom parameter into the training script, such as vocab_url in this example. Also, you can specify the metrics in the estimator, parse the logs of these metrics, and send them to Amazon CloudWatch to monitor and track the training performance. For more details, see Monitor and Analyze Training Jobs Using Amazon CloudWatch Metrics.

from sagemaker.huggingface import HuggingFace

#create an unique id to tag training job, model name and endpoint name. 
id = int(time.time())

TRAINING_JOB_NAME = f"huggingface-wav2vec2-training-{id}"
vocab_url = f"s3://{BUCKET}/{PREFIX}/vocab.json"

hyperparameters = {'epochs':10, # you can increase the epoch number to improve model accuracy
                   'train_batch_size': 8,
                   'model_name': "facebook/wav2vec2-base",
                   'vocab_url': vocab_url
                  }
                  
# define metrics definitions
metric_definitions=[
        {'Name': 'eval_loss', 'Regex': "'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
        {'Name': 'eval_wer', 'Regex': "'eval_wer': ([0-9]+(.|e\-)[0-9]+),?"},
        {'Name': 'eval_runtime', 'Regex': "'eval_runtime': ([0-9]+(.|e\-)[0-9]+),?"},
        {'Name': 'eval_samples_per_second', 'Regex': "'eval_samples_per_second': ([0-9]+(.|e\-)[0-9]+),?"},
        {'Name': 'epoch', 'Regex': "'epoch': ([0-9]+(.|e\-)[0-9]+),?"}]

OUTPUT_PATH= f's3://{BUCKET}/{PREFIX}/{TRAINING_JOB_NAME}/output/'

huggingface_estimator = HuggingFace(entry_point='train.py',
                                    source_dir='./scripts',
                                    output_path= OUTPUT_PATH, 
                                    instance_type='ml.p3.2xlarge',
                                    instance_count=1,
                                    transformers_version='4.6.1',
                                    pytorch_version='1.7.1',
                                    py_version='py36',
                                    role=ROLE,
                                    hyperparameters = hyperparameters,
                                    metric_definitions = metric_definitions,
                                   )

#Starts the training job using the fit function, training takes approximately 2 hours to complete.
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path},
                          job_name=TRAINING_JOB_NAME)

In the following figure of CloudWatch training job logs, you can see that, after 10 epochs of training, the model evaluation metrics WER (word error rate) can achieve around 0.17 for the subset of the SUPERB dataset. WER is a commonly used metric to evaluate speech recognition model performance, and the objective is to minimize it. You can increase the number of epochs or use the full SUPERB dataset to improve the model further.

Deploy the model as an endpoint on SageMaker and run inference

In this section, we walk through the steps to deploy the model and perform inference.

Inference script

We use the SageMaker Hugging Face Inference Toolkit to host our fine-tuned model. It provides default functions for preprocessing, predicting, and postprocessing for certain tasks. However, the default capabilities can’t inference our model properly. Therefore, we defined the custom functions model_fn(), input_fn(), predict_fn(), and output_fn() in the inference.py script to override the default settings with custom requirements. For more details, refer to the GitHub repo.

As of January 2022, the Inference Toolkit can inference tasks from architectures that end with 'TapasForQuestionAnswering', 'ForQuestionAnswering', 'ForTokenClassification', 'ForSequenceClassification', 'ForMultipleChoice', 'ForMaskedLM', 'ForCausalLM', 'ForConditionalGeneration', 'MTModel', 'EncoderDecoderModel','GPT2LMHeadModel', and 'T5WithLMHeadModel'. The Wav2Vec2 model is not currently supported.

You can inspect the full inference script in the GitHub repo, in the scripts/ directory.

Create a Hugging Face model from the estimator

We use the Hugging Face Model class to create a model object, which you can deploy to a SageMaker endpoint. When creating the model, specify the following parameters:

  • entry_point – The name of the inference script. The methods defined in the inference script are implemented to the endpoint.
  • source_dir – The location of the inference scripts.
  • transformers_version – The Hugging Face Transformers library version we want to use. It should be consistent with the training step.
  • pytorch_version – The PyTorch version that is compatible with the Transformers library. It should be consistent with the training step.
  • model_data – The Amazon S3 location of a SageMaker model data .tar.gz file.
from sagemaker.huggingface import HuggingFaceModel

huggingface_model = HuggingFaceModel(
        entry_point = 'inference.py',
        source_dir='./scripts',
        name = f'huggingface-wav2vec2-model-{id}',
        transformers_version='4.6.1', 
        pytorch_version='1.7.1', 
        py_version='py36',
        model_data=huggingface_estimator.model_data,
        role=ROLE,
    )

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge", 
    endpoint_name = f'huggingface-wav2vec2-endpoint-{id}'
)

When you create a predictor by using the model.deploy function, you can change the instance count and instance type based on your performance requirements.

Inference audio files

After you deploy the endpoint, you can run prediction tests to check the model performance. You can download an audio file from the S3 bucket by using the following code:

import boto3
s3 = boto3.client('s3')
s3.download_file(BUCKET, 'huggingface-blog/sample_audio/xxx.wav', 'downloaded.wav')
file_name ='downloaded.wav'

Alternatively, you can download a sample audio file to run the inference request:

import soundfile
!wget https://datashare.ed.ac.uk/bitstream/handle/10283/343/MKH800_19_0001.wav
file_name ='MKH800_19_0001.wav'
speech_array, sampling_rate = soundfile.read(file_name)
json_request_data = {"speech_array": speech_array.tolist(),
                     "sampling_rate": sampling_rate}

prediction = predictor.predict(json_request_data)
print(prediction)

The predicted result is as follows:

['"she had your dark suit in grecy wash water all year"', 'application/json']

Clean up

When you’re finished using the solution, delete the SageMaker endpoint to avoid ongoing charges:

predictor.delete_endpoint()

Conclusion

Automatic speech recognition technology has matured in recent years and ASR services such as Amazon Transcribe let customers add speech to text capabilities with no prior machine learning experience required. In this post, we show how data scientists who want to build their own ASR models can fine-tune the pre-trained Wav2Vec2 model on SageMaker using a Hugging Face estimator, and also how to host the model on SageMaker as a real-time inference endpoint using the SageMaker Hugging Face Inference Toolkit. For both training and inference steps, we provided custom defined scripts for greater flexibility, which are enabled and supported by SageMaker Hugging Face DLCs. You can use the method from this post to fine-tune a We2Vec2 model with your own datasets, or to fine-tune and deploy a different transformer model from Hugging Face.

Check out the notebook and code of this project from GitHub, and let us know your comments. For more comprehensive information, see Hugging Face on SageMaker and Use Hugging Face with Amazon SageMaker.

In addition, Hugging Face and AWS announced a partnership in 2022 that makes it even easier to train Hugging Face models on SageMaker. This functionality is available through the development of Hugging Face AWS DLCs. These containers include the Hugging Face Transformers, Tokenizers, and Datasets libraries, which allow us to use these resources for training and inference jobs. For a list of the available DLC images, see Available Deep Learning Containers Images. They are maintained and regularly updated with security patches. You can find many examples of how to train Hugging Face models with these DLCs and the Hugging Face Python SDK in the following GitHub repo.


About the Author

Ying Hou, PhD, is a Machine Learning Prototyping Architect at AWS. Her main areas of interests are deep learning, computer vision, NLP, and time series data prediction. In her spare time, she enjoys reading novels and hiking in national parks in the UK.