Migrating workloads from AWS Data Pipeline to Amazon MWAA - Amazon Managed Workflows for Apache Airflow

Migrating workloads from AWS Data Pipeline to Amazon MWAA

AWS launched the AWS Data Pipeline service in 2012. At that time, customers wanted a service that let them use a variety of compute options to move data between different data sources. As data transfer needs changed over time, so have the solutions to those needs. You now have the option to choose the solution that most closely meets your business requirements. You can migrate your workloads to any of the following AWS services:

  • Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to manage workflow orchestration for Apache Airflow.

  • Use Step Functions to orchestrate workflows between multiple AWS services.

  • Use AWS Glue to run and orchestrate Apache Spark applications.

The option you choose depends on your current workload on AWS Data Pipeline. This topic explains how to migrate from AWS Data Pipeline to Amazon MWAA.

Choosing Amazon MWAA

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that lets you set up and operate end-to-end data pipelines in the cloud at scale. Apache Airflow is an open-source tool used to programmatically author, schedule, and monitor sequences of processes and tasks referred to as workflows. With Amazon MWAA, you can use Apache Airflow and the Python programming language to create workflows without having to manage the underlying infrastructure for scalability, availability, and security. Amazon MWAA automatically scales its workflow capacity to meet your needs, and is integrated with AWS security services to help provide you with fast and secure access to your data.

The following highlights some of the benefits of migrating from AWS Data Pipeline to Amazon MWAA:

  • Enhanced scalability and performance – Amazon MWAA provides a flexible and scalable framework for defining and executing workflows. This allows users to handle large and complex workflows with ease, and take advantage of features such as dynamic task scheduling, data-driven workflows and parallelism.

  • Improved monitoring and logging – Amazon MWAA integrates with Amazon CloudWatch to enhance monitoring and logging of your workflows. Amazon MWAA automatically sends system metrics and logs to CloudWatch. This means you can track the progress and performance of your workflows in real-time, and identify any issues that arise.

  • Better integrations with AWS services and third-party software – Amazon MWAA integrates with a variety of other AWS services, such as Amazon S3, AWS Glue, and Amazon Redshift, as well as third-party software such as DBT, Snowflake, and Databricks. This lets you process and transfer data across different environments and services.

  • Open-source data pipeline tool – Amazon MWAA leverages the same open-source Apache Airflow product you are familiar with. Apache Airflow is a purpose-built tool designed to handle all aspects of data pipeline management, including ingestion, processing, transferring, integrity testing, quality checks, and ensuring data lineage.

  • Modern and flexible architecture – Amazon MWAA leverages containerization and cloud-native, serverless technologies. This means more flexibility and portability, as well as easier deployment and management of your workflow environments.

Architecture and concept mapping

AWS Data Pipeline and Amazon MWAA have different architectures and components, which can affect the migration process and the way workflows are defined and executed. This section provides an overview of the architecture and components of both services, and highlights some of the key differences.

Both AWS Data Pipeline and Amazon MWAA are fully managed services. When you migrate your workloads to Amazon MWAA you might need to learn new concepts to model your existing workflows using Apache Airflow. However, you will not need to manage infrastructure, patch workers, or apply operating system updates.

The following maps key concepts in AWS Data Pipeline to those in Amazon MWAA. Use this information as a starting point to design a migration plan.

Pipeline definition

  • AWS Data Pipeline uses a JSON-based configuration file that defines the workflow. Amazon MWAA uses Python-based Directed Acyclic Graphs (DAGs) that define the workflow.

Pipeline execution environment

  • In AWS Data Pipeline, workflows run on Amazon EC2 instances that AWS Data Pipeline provisions and manages on your behalf. Amazon MWAA uses Amazon ECS containerized environments to run tasks.

Pipeline components

  • In AWS Data Pipeline, activities are processing tasks that run as part of the workflow. In Amazon MWAA, operators (tasks) are the fundamental processing units of a workflow.

  • In AWS Data Pipeline, preconditions contain conditional statements that must be true before an activity can run. In Amazon MWAA, sensors (tasks) represent conditional statements that can wait for a resource or task to complete before running.

  • A resource in AWS Data Pipeline refers to the AWS compute resource that performs the work a pipeline activity specifies; Amazon EC2 and Amazon EMR are two available resources. In Amazon MWAA, tasks in a DAG can target a variety of compute resources, including Amazon ECS, Amazon EMR, and Amazon EKS. Amazon MWAA executes Python operations on workers that run on Amazon ECS.

Pipeline execution

  • AWS Data Pipeline supports scheduling runs with regular rate-based and cron-based patterns. Amazon MWAA supports scheduling with cron expressions and presets, as well as custom timetables.

  • In AWS Data Pipeline, an instance refers to each run of the pipeline. In Amazon MWAA, a DAG run refers to each run of an Apache Airflow workflow.

  • In AWS Data Pipeline, an attempt refers to a retry of a failed operation. Amazon MWAA supports retries that you define at either the DAG level or the task level.
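The concept mapping above can be sketched as a minimal Apache Airflow DAG. This is an illustrative example rather than a drop-in migration: the DAG ID, bucket name, object key, and task callable are all hypothetical, and the DAG assumes Apache Airflow 2.x with the Amazon provider package installed (as on Amazon MWAA).

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor


def process_data():
    # Placeholder for the work a Data Pipeline activity performed.
    print("processing data")


# schedule plays the role of a Data Pipeline cron-based schedule;
# retries in default_args correspond to Data Pipeline attempts.
with DAG(
    dag_id="data_pipeline_migration_example",  # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    # Sensor ~ Data Pipeline precondition: wait for an S3 object to exist.
    wait_for_input = S3KeySensor(
        task_id="wait_for_input",
        bucket_name="example-bucket",   # hypothetical bucket
        bucket_key="input/data.csv",    # hypothetical key
    )

    # Operator ~ Data Pipeline activity.
    process = PythonOperator(task_id="process", python_callable=process_data)

    wait_for_input >> process
```

Saving a file like this to the DAGs folder of your Amazon MWAA environment's Amazon S3 bucket makes the workflow available to the Airflow scheduler.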

Example implementations

In many cases you will be able to reuse resources you are currently orchestrating with AWS Data Pipeline after migrating to Amazon MWAA. The following list contains example implementations using Amazon MWAA for the most common AWS Data Pipeline use cases.

For additional tutorials and examples, see the following:
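One common migration chore is re-expressing Data Pipeline rate-based schedules (for example, "every 15 minutes") as cron expressions for an Airflow DAG's schedule. The helper below is a hypothetical sketch of that translation for a few simple rates; it is not part of either service's API.

```python
# Hypothetical helper: translate a simple AWS Data Pipeline rate-based
# schedule ("every <value> <unit>") into a cron expression usable as an
# Airflow DAG schedule. Only a few illustrative rates are handled.

def rate_to_cron(value: int, unit: str) -> str:
    """Translate an 'every <value> <unit>' rate into a cron expression."""
    if unit == "minutes" and 1 <= value < 60:
        return f"*/{value} * * * *"   # e.g. every 15 minutes
    if unit == "hours" and 1 <= value < 24:
        return f"0 */{value} * * *"   # e.g. every 6 hours, on the hour
    if unit == "days" and value == 1:
        return "0 0 * * *"            # daily at midnight (same as @daily)
    raise ValueError(f"unsupported rate: every {value} {unit}")


print(rate_to_cron(15, "minutes"))  # */15 * * * *
print(rate_to_cron(6, "hours"))     # 0 */6 * * *
```

More irregular rates (for example, "every 90 minutes") have no single cron equivalent; Airflow custom timetables cover those cases.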

Pricing comparison

Pricing for AWS Data Pipeline is based on the number of pipelines, as well as how much you use each pipeline. Activities that you run more than once a day (high frequency) cost $1 per month per activity. Activities that you run once a day or less (low frequency) cost $0.60 per month per activity. Inactive pipelines are priced at $1 per pipeline. For more information, see the AWS Data Pipeline pricing page.

Pricing for Amazon MWAA is based on the duration of time that your managed Apache Airflow environment exists, and any additional auto scaling required to provide more worker or scheduler capacity. You pay for your Amazon MWAA environment usage on an hourly basis (billed at one-second resolution), with varying fees depending on the size of the environment. Amazon MWAA auto-scales the number of workers based on your environment configuration. AWS calculates the cost of additional workers separately. For more information on the hourly cost of using various Amazon MWAA environment sizes, see the Amazon MWAA pricing page.
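To make the AWS Data Pipeline side of the comparison concrete, the snippet below computes a monthly bill from the per-activity prices quoted above. The activity counts are made up for illustration, and the Amazon MWAA side is deliberately left out because its hourly environment rates are not listed here (see the pricing page for those).

```python
# Monthly AWS Data Pipeline cost from the prices quoted above:
#   $1.00/month per high-frequency activity (runs more than once a day),
#   $0.60/month per low-frequency activity (runs once a day or less),
#   $1.00 per inactive pipeline.
HIGH_FREQUENCY_PRICE = 1.00
LOW_FREQUENCY_PRICE = 0.60
INACTIVE_PIPELINE_PRICE = 1.00


def monthly_data_pipeline_cost(high_freq: int, low_freq: int, inactive: int) -> float:
    """Total monthly charge in USD for the given activity and pipeline counts."""
    return (high_freq * HIGH_FREQUENCY_PRICE
            + low_freq * LOW_FREQUENCY_PRICE
            + inactive * INACTIVE_PIPELINE_PRICE)


# Hypothetical workload: 10 high-frequency activities, 5 low-frequency
# activities, and 2 inactive pipelines.
print(monthly_data_pipeline_cost(10, 5, 2))  # 15.0
```

Running the same workload estimate for Amazon MWAA would instead multiply environment hours by the hourly rate for your environment size, plus any additional worker hours.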

Related resources

For more information and best practices for using Amazon MWAA, see the following resources: