AWS for Industries

Using Structural Variant Analysis on AWS with Amazon FSx for Lustre in Novel Therapeutic Discovery

This post is coauthored by Adam Tebbe (VP of Computational Data Science and Technology), Eva Fast (Senior Computational Biologist), and Sarthak Vilas Patel (Senior Data Engineer) from Goldfinch Bio, Inc., and Henrique Silva (Machine Learning Lead) from AWS Advanced Consulting Partner Loka. Goldfinch Bio is an early-stage biotechnology company working to develop novel, genetically validated therapeutics to treat rare forms of kidney disease. Loka is an Advanced APN software consulting company based in Silicon Valley that helps the most ambitious startups bring their innovations to market faster.

Introduction

Chronic kidney disease (CKD) affects 780 million people globally, or one in every ten people. In an attempt to discover novel therapeutics for this underserved patient population, many of whom have no good treatment options, Goldfinch Bio has compiled numerous genomic data sets. Few such data sets are available to commercial organizations in the public domain, so Goldfinch invested significant money, time, and resources to perform whole genome sequencing on thousands of patients with kidney disease. Much of the analysis performed to date on these data sets has relied solely on the identification of single nucleotide polymorphisms (SNPs, single base pair changes in the DNA). Our solution was to build a Structural Variant Analysis capability on AWS to help us find a cure for CKD.

With the recent development of the Broad’s GATK-SV pipeline, the performance of Amazon FSx for Lustre, and the support of AWS Advanced Consulting Partner Loka, Goldfinch Bio has been able to further utilize its existing sequencing data to identify structural variants that will hopefully lead to new medicines for patients suffering from chronic kidney disease. As a result, the analytical value of our existing genomic data increased overnight. Goldfinch Bio firmly believes that analysis of structural variation will be a fundamental piece of the genomic puzzle with the potential to provide a missing link between genomic variation and kidney disease.

Solution Overview

The GATK-SV pipeline was developed by the Broad Institute, written in Workflow Definition Language (WDL), and orchestrated using the Cromwell engine. We knew that the large intermediate and reference files required by the pipeline would be best served by a high-performance file system. We also needed a solutions integrator with deep cloud experience to tie it all together in a reproducible way, so that our scientists at Goldfinch could use it on their own.

Pipeline and HPC Orchestration – We chose the Broad’s GATK-SV pipeline and AWS’s own Genomics Workflows on AWS for orchestration. The pipeline was specifically developed to call, filter, and integrate structural variants across large cohorts using short-read sequencing data. This is particularly important because existing data can be leveraged, whereas alternative sequencing approaches would require a significant investment. The Genomics Workflows on AWS reference architecture makes it easy to stand up Cromwell environments on AWS Batch.

High speed storage – GATK-SV is a very complex and resource-intensive pipeline that runs roughly 12,500 AWS Batch jobs and generates millions of files and objects along the way (figures from one execution on 156 samples from the 1000 Genomes Project). The pipeline was initially ported to work with Amazon S3: for each Batch job, it pulled down (localized) the required files, performed the task, and then uploaded the output and staging files back to S3. This was a major performance bottleneck, which led us to explore options that would scale better and perform faster. Amazon FSx for Lustre was a great fit for this scenario because of its scalability, performance, provisioning and mounting capabilities, and wide range of file system types (SSD, HDD, and Scratch), along with customizable throughput ranges.

Additionally, the operating cost is low compared to other alternatives. We integrated FSx with AWS Batch and Cromwell (the workflow orchestration engine), updated the infrastructure templates to handle FSx, and saw a performance boost right away. As the benchmark table later in this post shows, execution time dropped from more than 3 days to 1.3 days, which saves effort, reduces the overall cost of the EC2 instances used, and enables quicker analysis of the results. We can now use either FSx or S3, depending on the use case and the performance and cost requirements.
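To put the reported runtimes in perspective, here is a small back-of-the-envelope sketch in Python. It assumes a flat 3.0 days for the S3 run, which is only a lower bound for the "3+ days" quoted above, so the real improvement is at least this large. The function names are illustrative and not part of the pipeline.

```python
def speedup(baseline_days: float, optimized_days: float) -> float:
    """Return how many times faster the optimized run is than the baseline."""
    if baseline_days <= 0 or optimized_days <= 0:
        raise ValueError("durations must be positive")
    return baseline_days / optimized_days


def runtime_reduction(baseline_days: float, optimized_days: float) -> float:
    """Return the fraction of wall-clock time saved (between 0.0 and 1.0)."""
    if baseline_days <= 0 or optimized_days <= 0:
        raise ValueError("durations must be positive")
    return 1.0 - optimized_days / baseline_days


# Assumption: 3.0 days stands in for the "3+ days" S3 runtime (a lower bound).
S3_DAYS = 3.0
FSX_DAYS = 1.3

if __name__ == "__main__":
    print(f"speedup: {speedup(S3_DAYS, FSX_DAYS):.2f}x")
    print(f"runtime saved: {runtime_reduction(S3_DAYS, FSX_DAYS):.0%}")
```

Under that assumption, the FSx run is at least about 2.3 times faster, saving a little more than half of the wall-clock time.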

Deeply Qualified Solutions Integrator – Modifying the pipeline and storage required changes not only to the code in the WDL files, but also to Cromwell, the workflow orchestration engine, and to the AWS infrastructure templates. Loka is an Advanced Tier partner with a deep understanding of both cloud and scientific workflow definitions, as well as strong familiarity with storage. Their specialized knowledge of HPC, AWS Batch, open-source packages, open data sets, and techniques like parallelization, which enables thousands of jobs to run simultaneously, meant Goldfinch would have a best-of-breed solution delivered faster than if we had attempted it on our own.

Deploying GATK-SV

The following steps will help you reproduce our architecture in your own AWS environment. The steps here are high-level guidance.

You can find more detailed deployment instructions on GitHub.

Step 1) Deploying the Genomics Workflows on AWS

  1. Open and follow the Genomics Workflows on AWS deployment guide for
  2. When deploying the Genomics Workflow Core stack, supply the following
    1. For CreateFSx, choose Yes
    2. For Cromwell FSxStorageType, choose Scratch
    3. For FSxStorageVolumeSize, type 24000 (with MELT) or 16800 (without MELT)
  3. When deploying the Cromwell Resources stack, supply the following
    1. For FSxFileSystemID, supply the id from the gwfcore template output tab
    2. For FSxFileSystemMount, supply the id from the gwfcore template output tab
    3. For FSxSecurityGroupId, supply the id from the gwfcore template output tab
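As a sanity check on the volume sizes in Step 1, the sketch below validates a requested capacity and estimates baseline throughput. It assumes the template provisions a SCRATCH_2 file system (the post only says "Scratch"), for which FSx for Lustre accepts 1,200 GiB or multiples of 2,400 GiB and provides a baseline of 200 MB/s per TiB of storage. Treat these constants and the function names as assumptions, and confirm them against the current FSx documentation.

```python
GIB_PER_TIB = 1024

# Assumption: SCRATCH_2 baseline throughput of 200 MB/s per TiB of storage.
SCRATCH2_MBPS_PER_TIB = 200


def is_valid_scratch2_capacity(size_gib: int) -> bool:
    """Check a capacity against the assumed SCRATCH_2 rule:
    1,200 GiB, or any positive multiple of 2,400 GiB."""
    return size_gib == 1200 or (size_gib > 0 and size_gib % 2400 == 0)


def baseline_throughput_mbps(size_gib: int) -> float:
    """Estimate aggregate baseline throughput in MB/s for a given capacity."""
    return size_gib / GIB_PER_TIB * SCRATCH2_MBPS_PER_TIB


if __name__ == "__main__":
    # The two sizes suggested in Step 1 (with and without MELT).
    for size in (24000, 16800):
        print(size, is_valid_scratch2_capacity(size),
              f"{baseline_throughput_mbps(size):.1f} MB/s")
```

Both suggested sizes are multiples of 2,400, so they pass the check. Because Lustre throughput scales with capacity, the larger volume used for MELT runs also yields proportionally more baseline throughput.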

Step 2) Deploying GATK-SV onto your Cromwell host

We have built a small CloudFormation template that deploys an AWS SSM command document to be run against your Cromwell server. This command document clones the Broad’s GATK-SV repo and makes some tweaks to the Cromwell config files to support connecting with Amazon FSx.

  1. Deploy the SSM command document:
    wget https://raw.githubusercontent.com/goldfinchbio/aws-gatk-sv/master/templates/cf_ssm_document_setup.yaml
    aws cloudformation deploy --stack-name "gatk-sv-ssm-deploy" --template-file cf_ssm_document_setup.yaml --capabilities CAPABILITY_IAM
  2. Open the AWS Systems Manager console
  3. In the navigation pane, under Documents, choose ‘Owned by me’
  4. Search for gatk-sv-ssm-deploy in the search bar
  5. Click the Execute Automation button
    1. For Instance Id, choose the listed Cromwell server
    2. For S3OrFSXPath, use the mount name in the Stack Output from Step 1

Step 3) Execute and monitor the pipeline

  1. Start a shell session on the Cromwell server.
  2. Run cromshell submit commands
    1. The below will run the pipeline for 156 (1000 Genomes) samples.
      cromshell submit /home/ec2-user/gatk-sv/gatk_run/wdl/GATKSVPipelineBatch.wdl /home/ec2-user/gatk-sv/gatk_run/aws_GATKSVPipelineBatch.json /home/ec2-user/gatk-sv/gatk_run/opts.json /home/ec2-user/gatk-sv/gatk_run/wdl/dep.zip
      
  3. Monitor pipeline
    1. cromshell status
    2. Alternatively, consider using the get_batch_status.py script, which gathers information from AWS Batch and CloudWatch logs to give a consolidated view of resources and job completion details, along with pipeline-level and module-level summaries.
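To illustrate the kind of consolidated view described above, here is a minimal Python sketch that tallies AWS Batch job states. It is not the repository's get_batch_status.py, whose interface is not described in this post. The function name and sample records are hypothetical; the `jobName` and `status` keys mirror the job summaries that the AWS Batch ListJobs API returns.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping

# AWS Batch job states that mean a job will not run any further.
FINISHED_STATES = ("SUCCEEDED", "FAILED")


def summarize_jobs(jobs: Iterable[Mapping[str, str]]) -> Dict[str, object]:
    """Count jobs per status and compute the fraction that have finished."""
    counts = Counter(job["status"] for job in jobs)
    total = sum(counts.values())
    finished = sum(counts[state] for state in FINISHED_STATES)
    return {
        "total": total,
        "by_status": dict(counts),
        "finished_fraction": finished / total if total else 0.0,
    }


# Hypothetical records; in practice these could be collected from
# boto3's batch.list_jobs(...)["jobSummaryList"] for each job status.
example_jobs = [
    {"jobName": "module-a-shard-0", "status": "SUCCEEDED"},
    {"jobName": "module-a-shard-1", "status": "SUCCEEDED"},
    {"jobName": "module-b-shard-0", "status": "RUNNING"},
    {"jobName": "module-b-shard-1", "status": "FAILED"},
]

if __name__ == "__main__":
    print(summarize_jobs(example_jobs))
```

With roughly 12,500 jobs per run, a per-status tally like this is often easier to scan than individual job entries.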

The AWS-GATK-SV reference architecture diagram featuring Cromwell, AWS Batch, and Amazon FSx.

Measuring quality, performance, and cost

Quality

Using a trial dataset of 156 individuals from the 1000 Genomes Project, we were able to show concordance of our results with the original pipeline.

Concordance between our pipeline and The Broad’s published standard

DEL – Deletion, DUP – Duplication, CNV – copy number variant, INS – Insertion, INV – Inversions, CPX – Complex SV, OTH – (Breakends and Translocation)

The number of structural variants of each type, such as deletions (DEL) or duplications (DUP), is consistent between the four conditions (see figure above). We performed two FSx runs, with and without the proprietary structural variant caller MELT; as expected, omitting MELT reduced the number of insertions (INS, 19,001 vs. 10,091).

To further investigate consistency between pipelines, we compared the exact positions in the genome where structural variants were detected by our three callsets (one on S3 and two on FSx) and by the original pipeline (Broad). This evaluation is very stringent because structural variants that are offset by only a few base pairs are also counted as false positives or false negatives. Comparing Amazon FSx (with MELT) with the Broad gold standard, we reached a precision of 0.95 and a recall (sensitivity) of 0.92. Comparing the S3 and FSx (without MELT) runs, with S3 as the reference dataset, resulted in a precision of 0.95 and a recall (sensitivity) of 0.96.
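The exact-position comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the evaluation code used for the figures above: each variant is reduced to a hypothetical (chromosome, start, end, type) record, and only identical records count as matches.

```python
from typing import Iterable, NamedTuple, Tuple


class SV(NamedTuple):
    """Hypothetical minimal record for a structural variant call."""
    chrom: str
    start: int
    end: int
    svtype: str


def precision_recall(calls: Iterable[SV],
                     reference: Iterable[SV]) -> Tuple[float, float]:
    """Exact-match precision and recall of `calls` against `reference`."""
    call_set, ref_set = set(calls), set(reference)
    tp = len(call_set & ref_set)   # called and in the reference
    fp = len(call_set - ref_set)   # called but not in the reference
    fn = len(ref_set - call_set)   # in the reference but not called
    precision = tp / (tp + fp) if call_set else 0.0
    recall = tp / (tp + fn) if ref_set else 0.0
    return precision, recall


# Toy data: the third call is offset by 2 bp from the reference variant,
# so it counts as a false positive AND leaves a false negative behind.
reference = [SV("chr1", 1000, 2000, "DEL"),
             SV("chr2", 500, 900, "DUP"),
             SV("chr3", 300, 700, "DEL")]
calls = [SV("chr1", 1000, 2000, "DEL"),
         SV("chr2", 500, 900, "DUP"),
         SV("chr3", 302, 700, "DEL"),
         SV("chr4", 100, 150, "INS")]

if __name__ == "__main__":
    p, r = precision_recall(calls, reference)
    print(f"precision={p:.2f} recall={r:.2f}")
```

In the toy data, a call shifted by two base pairs lowers both metrics at once. This is why an exact-match comparison is stringent, and why scores of 0.92 to 0.96 indicate close agreement.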

Performance and Cost

Furthermore, our optimizations decreased runtime and reduced our overall costs by a third. The table below shows how long each module took to complete on Amazon FSx and S3, with and without MELT.

GATK-SV module runtimes and costs with Amazon FSx and MELT

Conclusion

The analysis of structural variation in short-read sequencing data will hopefully allow the research community to unlock additional value from existing data and to study classes of genomic rearrangements that explain disease phenotypes not yet understood when considering only single nucleotide polymorphisms. Resources such as the UK Biobank and All of Us are increasing the number of population-scale whole genome sequencing datasets, which lend themselves well to structural variant research. Because this is a rapidly evolving methodology, it can be overwhelming for researchers with limited technical resources to take advantage of a pipeline like GATK-SV. It is our hope that novel drug targets can be discovered to help patients in need of therapeutic intervention. By releasing the customizations and optimizations of this pipeline as an open-source reference architecture, we hope the community can apply these improvements to additional data and arrive at novel scientific insights.

Want to know more?

Check out the detailed deployment guide on our GATK-SV GitHub repo.

Contact AWS Advanced Consulting Partner Loka for GATK-SV support or customization in your own environment.

Listen to the April 4th AWS Health Innovation podcast featuring Adam Tebbe of Goldfinch Biopharma and Bobby Mukherjee, Founder and CEO of Loka, on iTunes, Spotify, or Stitcher.

Adam Tebbe

Adam Tebbe is a VP of Computational Data Science and Technology at Goldfinch Bio, Inc. Adam leads the computational group at Goldfinch Bio, Inc. He has an extensive background in IT, informatics, software engineering and data science. Adam is interested in applications of technology and platform development to enable scientific discovery that will lead to transformations in patient care. Adam has been developing software, tools, and platforms in the cloud for more than 10 years, with a background working across technical teams in small biotech and large pharma.

Eva Fast

Eva Fast is a Senior Computational Biologist at Goldfinch Bio, Inc. Coming from an experimental biology background, she got excited by the rapid data growth within healthcare and decided to switch her focus to computational biology. She has experience in various genetics, genomics, clinical data, and imaging analyses and enjoys collaborations to transform these workflows into scalable pipelines. In her free time, you can find her on one of her bikes in and around the Boston area.

Henrique Silva

Henrique is a Machine Learning Lead at Loka. His career started in a robotics department in academia and evolved into the machine learning field, where he has developed several projects involving cutting-edge technologies and high-performance cloud architectures. In the last couple of years he has applied his skills to large amounts of data and high performance computing in the life sciences, contributing to several open source projects. When not working on a new data engineering challenge, Henrique enjoys playing Paddle Tennis, known around the world as "Padel".

Sarthak Vilas Patel

Sarthak Vilas Patel is a Senior Data Engineer at Goldfinch Bio, Inc. He has expertise in architecting, building, maintaining, testing, and supporting highly scalable applications, infrastructures and CI/CD pipelines in diverse industries. He is passionate about cloud computing, solving complex problems and learning new tools and technologies. In his spare time, he enjoys exploring new places, watching movies and going for hikes, trails and long walks.