AWS for Industries

BayerCLAW – Open-Source, Serverless Orchestrator for Scientific Workflows on AWS

Guest blog authored by Jack Tabaska and Ian Davis from the Bayer Crop Science team.

At Bayer Crop Science we are applying modern genomic and data science methods to the challenges of global food production. Our research routinely produces enormous volumes of raw data that must be processed quickly and cost-effectively. Automated analysis pipelines (also known as workflows) are crucial for making this happen.

For such workflows, AWS describes the power of combining AWS Step Functions and AWS Batch in Genomics Workflows on AWS. To abstract and simplify that architecture, and to enable scientists to easily build their own analysis workflows, we developed BayerCLAW (Cloud Automated Workflows). Scientists create a simple YAML file, and BayerCLAW automates the creation of all architectural components and implementation details.

At Bayer, we use BayerCLAW for genomics and ETL workloads; we expect it to be suitable for other scientific HPC applications, including high content screening, digital pathology/radiology, and computational chemistry workloads.  As such, we have released BayerCLAW as open source software on GitHub at https://github.com/Bayer-Group/BayerCLAW.

BayerCLAW consists of two main components: a workflow definition language, which describes a workflow as a series of shell script-like steps; and an orchestration engine, which runs submitted jobs through the pipeline. The motivations and technical implementation of BayerCLAW are outlined below.

Why BayerCLAW?

Although we experimented with many of the existing automated analysis pipeline systems, we did not find any that met all our needs.  Thus, we developed BayerCLAW to meet the following five criteria:

  • Simple to author: Scientists simply create containers (or reuse existing ones), list commands in a YAML file, and copy their data to S3.  We find most scientists can read the documentation for BayerCLAW and begin running jobs within about 2 hours, in contrast to other systems that took days just to install.  This allows scientists to focus on their research, instead of operations.
  • Simple to operate: BayerCLAW workflows are data-driven, so they are automatically triggered by the arrival of data files in Amazon Simple Storage Service (S3).
  • Efficient to maintain: BayerCLAW is fully serverless thanks to AWS Batch and AWS Step Functions.  There are no servers to patch and maintain, and no wasted costs from long-running scheduler or worker nodes.  This saves Bayer thousands of dollars per year in recurring infrastructure and personnel costs relative to server-based workflow solutions.
  • Cost effective: By default, BayerCLAW stores data in S3 and uses Amazon EC2 Spot instances for job execution, cutting our compute and storage costs by more than 40% relative to traditional HPC workflows in the cloud.  This allows Bayer scientists to test more hypotheses within the same budget.
  • Reliable at scale: Some of our workflows run thousands of jobs per day in production with no human involvement.  A few percent of Spot jobs are interrupted, potentially affecting up to a quarter of our workflow executions; however, automated retries ensure BayerCLAW workflows complete as expected without manual intervention.  To date, we have had no workflows fail to complete due to Spot terminations.  Furthermore, robust automated logging to AWS CloudWatch ensures that scientists can easily debug true problems with their input data or code.

Our users agree that BayerCLAW meets these design goals for ease of use, efficiency, and reliability:

“BayerCLAW has greatly increased our team’s productivity when creating pipelines by reducing deployment time, facilitating troubleshooting and most importantly being fun to use.”  – José Juan, data scientist

“BayerCLAW allows me to focus on the relationships between my steps and the logic of those steps, since the infrastructure details are handled for me.  It is also easier to get up to speed with others’ workflows as I can simply start by reviewing the stack YAML or execution diagram in Step Functions and drill down from there.”  – Andy, data engineer

“We have used BayerCLAW for several different scientific and data engineering workflows.  It allows us to focus on the problem we are trying to solve instead of the details of the infrastructure.  The gain in focus on the problem and the time saved on the infrastructure have both been invaluable.”  – David, data engineer

Architecture

Figure 1. BayerCLAW architecture

Building and running a workflow in BayerCLAW requires the following:

  • An AWS account with the BayerCLAW resource stack installed. See https://github.com/Bayer-Group/BayerCLAW/doc/deployment.md for installation instructions.
  • One or more Docker images containing the software your workflow will use. These can be kept in any Docker repository, such as Docker Hub or Amazon Elastic Container Registry (ECR); one way to build and push an image to ECR is sketched after this list.
  • A workflow template specifying the steps of your workflow and the commands to run at each step.
  • An S3 location (known as the repository) to hold intermediate and output files.
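
As an illustration of the container requirement, the commands below sketch one way to build an image and push it to Amazon ECR. The repository name, AWS account ID, and region are placeholders; any registry that your AWS Batch jobs can pull from will work equally well.

# Hypothetical example: build a Docker image and push it to Amazon ECR.
# The account ID (123456789012), region, and repository name are placeholders.
docker build -t my_shovill_docker .
aws ecr create-repository --repository-name my_shovill_docker
aws ecr get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
docker tag my_shovill_docker:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/my_shovill_docker:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my_shovill_docker:latest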

To construct a workflow stack, submit your workflow template to AWS CloudFormation. An AWS Lambda-based macro (compiler) translates the workflow into a native CloudFormation template, which builds an AWS Step Functions state machine, AWS Batch job definitions, and other required resources. Figure 1 depicts the BayerCLAW architecture.
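
In practice, submitting the template can be a single AWS CLI call, sketched below with assumed file and stack names; see the deployment documentation for the exact procedure in your environment.

# Hypothetical example: create a workflow stack from a workflow template.
# The template file name and stack name are placeholders; --capabilities may be
# required if the generated stack creates IAM resources.
aws cloudformation deploy \
    --template-file assembly-workflow.yaml \
    --stack-name microbial-assembly-workflow \
    --capabilities CAPABILITY_NAMED_IAM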

To execute a workflow, upload the job data file to the Amazon S3 launcher bucket. A launcher bucket supports multiple state machines based on the prefix of the uploaded job data file. Uploading the file triggers an Amazon EventBridge rule, which initiates a Step Functions execution. Step Functions then executes the workflow as a series of AWS Batch jobs. Figure 2 describes the process of launching the workflow.

Figure 2. Launching a workflow
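
For instance, with an assumed launcher bucket name and a workflow deployed under the prefix assembly-workflow, an execution could be started with a single upload:

# Hypothetical example: launch an execution by uploading a job data file.
# The launcher bucket name and the workflow prefix are placeholders.
aws s3 cp SAMPLE_12345.json s3://my-bayerclaw-launcher-bucket/assembly-workflow/SAMPLE_12345.json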

The BayerCLAW runner program, which resides on each Batch job’s Amazon EC2 instance, downloads input files to the local disk, executes the step’s commands in a bash shell, and uploads output files to the repository.
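
Conceptually, for a single step the runner behaves roughly like the shell sketch below. The bucket, file, and tool names are placeholders, and users never write this code themselves; it is shown only to illustrate the download-run-upload cycle that the runner performs.

# Conceptual sketch only, not actual BayerCLAW code.
aws s3 cp s3://my-input-bucket/sample1/reads.fq ./reads.fq          # download declared inputs
my_analysis_tool --in reads.fq --out results.txt                    # run the step's commands in a bash shell
aws s3 cp results.txt s3://my-repo-bucket/repo/sample1/results.txt  # upload declared outputs to the repository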

You can track job executions using the Step Functions console. This provides a simple way to visualize the job status, and links to Amazon CloudWatch logs for individual steps. Figure 3 shows the AWS Console view of our workflow.

Figure 3. Step Functions visualization of a workflow execution
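
The same information is available from the AWS CLI for scripted monitoring; the state machine ARN below is a placeholder for the one created by your workflow stack.

# Hypothetical example: list recent executions of a workflow's state machine.
aws stepfunctions list-executions \
    --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:assembly-workflow \
    --max-results 10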

Walkthrough

The Workflow Language

Workflows in BayerCLAW are defined using a YAML-based specification format. Below is a spec for a simple three-step microbial genome assembly and annotation pipeline:

Transform: BC_Compiler

params:
  job_name: ${job.SAMPLE_ID}
  repository: s3://microbial-genome-bucket/repo/${job.SAMPLE_ID}

steps:
  -
    Assemble:
      image: my_shovill_docker
      inputs:
        reads1: ${job.READS1}
        reads2: ${job.READS2}
      commands:
        # this command writes its output to contigs.fa
        - shovill -R1 ${reads1} -R2 ${reads2} --outdir .
      outputs:
        contigs: contigs.fa
      compute:
        cpus: 4
        memory: 40 Gb
        spot: true
  -
    Annotate:
      image: my_prokka_docker
      inputs:
        contigs: contigs.fa
      commands:
        # this command writes output files named annot.*
        - prokka --outdir . --force --prefix annot ${contigs}
      outputs:
        prots: annot.faa
        annots: annot.gff
  -
    Blast:
      image: my_ncbi_blast_image
      inputs:
        prots: annot.faa
      commands:
        - blastp -query ${prots} -db $BC_EFS/uniprot.fasta -out ${blast_out}
      outputs:
        blast_out: prots_v_uniprot.txt

The job data file used to run this workflow might look like this:

{
    "SAMPLE_ID": "SAMPLE_12345",
    "READS1": "s3://sequencing-data-bucket/SAMPLE_12345/reads1.fq",
    "READS2": "s3://sequencing-data-bucket/SAMPLE_12345/reads2.fq"
}

Here are some of the important features of the workflow specification:

The Transform: line. This line is required at the top of every workflow spec. It tells CloudFormation to use BayerCLAW’s compiler Lambda to transform this spec into a CloudFormation template.

The params block. This block tells BayerCLAW the location of the job’s repository and, optionally, a name for the execution. Note the use of ${job.SAMPLE_ID} in the params block above: this tells BayerCLAW to look up the field named SAMPLE_ID in the job data file and substitute it into the strings provided. Using the sample job data above, the job’s name would be SAMPLE_12345 and the repository would be located at s3://microbial-genome-bucket/repo/SAMPLE_12345.

The steps block. The steps block contains a list of steps to be executed. The fields in a step specification include:

  • image: The name of a Docker image to use. Images in your account’s ECR repository can be referenced simply by name; for images in other locations, use a Docker-style path such as docker.io/library/ubuntu.
  • inputs and outputs: These blocks contain a set of key-value pairs. The values of the inputs block are S3 objects that will be downloaded to the local disk before BayerCLAW starts running commands. Bare filenames are assumed to exist in the repository. Similarly, values of the outputs block are local files that will be uploaded to the repository when the commands are finished. The keys of the inputs and outputs blocks can be used as symbolic names that get substituted into the commands.
  • commands: This is a list of Unix-style command lines, much like a shell script. The sample spec above shows only single-command steps, but multi-command steps are also allowed. Also, note that BayerCLAW can optionally mount an EFS filesystem and expose the path to it as the environment variable $BC_EFS. This allows users to utilize very large files or files that are shared with other processes.
  • compute: This optional block provides hints to AWS Batch for allocating compute resources. By default, each step is allocated 1 CPU and 1 gigabyte of memory. You can use the compute block to change these requests. In addition, the compute block is used to specify the use of spot EC2 instances. For cost savings, the default is to use spot instances for every step, with automatic retry if interrupted. If you have a process that must run without interruption, you can set the spot field to false to run the step on an on-demand EC2 instance.

A detailed description of the BayerCLAW workflow language can be found at https://github.com/Bayer-Group/BayerCLAW/doc/language.md .

Additional Features

Scatter/Gather. In scientific computing, large but “embarrassingly parallelizable” computations are common. BayerCLAW therefore enables users to split a job into an arbitrary number of branches that run in parallel. This is done through the use of a scatter step, as shown below:

- Scatterize:
      scatter:
          contigs: contigs*.fa
      steps:
          - Annotate:
              image: prokka
              inputs:
                  contigs: ${scatter.contigs}
              commands:
                - prokka --outdir . --force --prefix annot ${contigs}
              outputs:
                  # These names do not collide with each other because each child
                  # execution gets its own folder in the repository:
                  faa_file: annot.faa
                  gff_file: annot.gff
      outputs:
          protein_seqs: annot.faa
          annotations: annot.gff

Here, it is assumed that a previous step has placed multiple files in the repository named contigs1.fa, contigs2.fa, and so on.  The Unix-like glob contigs*.fa in the scatter field tells BayerCLAW to run the Annotate step on each of these files in parallel. When all of the branches are finished, paths to the output files are recorded in a JSON-formatted manifest file named after the scatter step (e.g. Scatterize_manifest.json):

{
  "protein_seqs": [
    "s3://my-bucket/repo/SAMPLE_12345/Scatterize/00000/annot.faa",
    "s3://my-bucket/repo/SAMPLE_12345/Scatterize/00001/annot.faa",
    "s3://my-bucket/repo/SAMPLE_12345/Scatterize/00002/annot.faa",
    // etc...
  ],
  "annotations": [
    "s3://my-bucket/repo/SAMPLE_12345/Scatterize/00000/annot.gff",
    "s3://my-bucket/repo/SAMPLE_12345/Scatterize/00001/annot.gff",
    "s3://my-bucket/repo/SAMPLE_12345/Scatterize/00002/annot.gff",
    // etc...
  ]
}

Other types of scattering are possible, such as scattering on a list of parameter values or even scattering on the contents of a previous step’s output file.

QC Checks. Often, it is a best practice to check the intermediate results of a workflow and stop the execution if the results are of poor quality. In BayerCLAW, Batch job steps can have an optional field named qc_check to specify conditions for early termination:

  -
    Assemble:
      image: my_shovill_docker
      inputs:
        reads1: ${job.READS1}
        reads2: ${job.READS2}
      commands:
        # this command writes its output to contigs.fa
        - shovill -R1 ${reads1} -R2 ${reads2} --outdir .
        - qc_checker ${contigs} > ${qc_out} 
      outputs:
        contigs: contigs.fa
        qc_out: qc_out.json
      qc_check:
        qc_result_file: qc_out.json
        stop_early_if: "float(qc_result) < 0.5"
      compute:
        cpus: 4
        memory: 40 Gb
        spot: true

Here, the qc_checker program checks the output from the shovill run and writes its output to a JSON file named qc_out.json:

{
  "qc_result": 0.9,
  "other": "stuff"
}

The qc_check field tells BayerCLAW to use the qc_result value from qc_out.json to evaluate the stop_early_if condition (a Python expression that evaluates to a Boolean value). If the stop_early_if condition evaluates to True, the Step Functions execution will be aborted.

And more. Other important features of BayerCLAW include:

  • Subpipes, which allow users to create reusable workflow modules;
  • Native Step Functions steps, so a workflow can, for instance, execute a Lambda function or utilize the powerful Parallel step;
  • SNS notifications of workflow execution events.

Summary

We have described BayerCLAW, a simple, reliable, serverless analysis pipeline orchestrator that we at Bayer Crop Science use to implement many of our automated workflows. While we developed it in the context of genomic research, we believe it is suitable for other types of research as well. BayerCLAW is open source software and is available on GitHub at https://github.com/Bayer-Group/BayerCLAW . Extensive documentation, including installation instructions, a tutorial, and a comprehensive description of all of BayerCLAW’s features, can be found in the doc directory of the GitHub repository.

Jack Tabaska

Jack Tabaska is a Senior Data Engineer in Bayer’s Crop Science division. Jack has over 20 years of experience developing automated data analysis methods for genomics and the life sciences. Most recently, he has focused on making AWS Cloud resources available to the broader data science community within Crop Science. Jack holds a PhD in Molecular, Cellular, and Developmental Biology from the University of Colorado at Boulder.

Ian Davis

Ian Davis leads the Research Computing team at Bayer Crop Science, which creates cloud computing infrastructure and research data resources to support bioinformaticians and data scientists in the Plant Biotechnology R&D group. Ian holds a PhD in biochemistry and has 20 years of experience in diverse areas of computational biology, including protein structure, drug docking, natural products discovery, computer vision, genomics, and regulation of gene expression. He is an AWS Certified Solutions Architect who was running genomics analyses on EC2 before it was cool (circa 2011).