awslabs/aws-glue-apache-hudi-operational-data-processing-framework

Operational Data Processing Framework using AWS Glue and Apache Hudi

The Operational Data Processing Framework (ODP Framework) contains three components: (1) File Manager, (2) File Processor, and (3) Configuration Manager. Each component runs independently to solve a portion of the operational data processing use case. The source code is organized in three folders, one per component. If you customize and adopt this framework for your own use cases, we recommend promoting each folder to its own code repository in your version control system. Consider the following repository names:

  1. aws-glue-hudi-odp-framework-file-manager
  2. aws-glue-hudi-odp-framework-file-processor
  3. aws-glue-hudi-odp-framework-config-manager

With this modular approach, you can deploy each component to your data lake environment independently, following your preferred CI/CD processes. As illustrated in the Overall Architecture section, these components are deployed in conjunction with a Change Data Capture (CDC) solution. For completeness, we assume that AWS DMS is used to migrate data from operational databases to Amazon S3, but we skip its implementation specifics.



Data Lake Reference Architecture

A Data Lake supports a variety of Analytics and Machine Learning (ML) use cases involving internal and external data producers and consumers. We use a simplified, generic Data Lake reference architecture, illustrated in the diagram below. To ingest data from Operational Databases into the Amazon S3 staging bucket of the data lake, you can use either AWS Database Migration Service (DMS) or any AWS partner solution from AWS Marketplace that supports Change Data Capture (CDC). AWS Glue is used to create source-aligned and consumer-aligned datasets, and separate Glue jobs handle the Feature Engineering part of ML Engineering and Operations. Amazon Athena is used for interactive querying, and AWS Lake Formation and the Glue Data Catalog are used for governance.

In our architecture, we used AWS Database Migration Service (DMS) to ingest data from Operational Data Sources (ODS) into the S3 staging layer of the data lake. We used AWS Glue to run the data ingestion and transformation pipelines. To populate the Raw zone of the data lake, we used Apache Hudi as the incremental data processing solution, in conjunction with Apache Parquet. The Apache Hudi Connector for AWS Glue makes it easy to use Apache Hudi within the Glue ecosystem (Glue ETL and Glue Data Catalog).
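To make the Raw zone step more concrete, the following is a minimal sketch of how a Glue job might assemble Hudi write options for an upsert. It is not the framework's actual implementation. The function names, the table name `orders`, the key columns `order_id` and `updated_at`, the database `raw_db`, and the Copy-on-Write table type are all hypothetical choices for illustration. The `hoodie.*` keys are standard Hudi datasource options. The `df` argument is assumed to be a Spark DataFrame created inside a Glue job that has Hudi enabled.

```python
from typing import Dict, Optional


def build_hudi_upsert_options(
    table_name: str,
    record_key: str,
    precombine_field: str,
    partition_field: Optional[str] = None,
    catalog_database: Optional[str] = None,
) -> Dict[str, str]:
    """Return Hudi datasource options for an upsert (hypothetical helper).

    Assumption: a Copy-on-Write table; the framework may use a different type.
    """
    options = {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.operation": "upsert",
        # Column that uniquely identifies a record in the source table.
        "hoodie.datasource.write.recordkey.field": record_key,
        # When two incoming records share a key, the larger value wins.
        "hoodie.datasource.write.precombine.field": precombine_field,
    }
    if partition_field:
        options["hoodie.datasource.write.partitionpath.field"] = partition_field
    if catalog_database:
        # Register or refresh the table in the Glue Data Catalog on write.
        options.update(
            {
                "hoodie.datasource.hive_sync.enable": "true",
                "hoodie.datasource.hive_sync.database": catalog_database,
                "hoodie.datasource.hive_sync.table": table_name,
                "hoodie.datasource.hive_sync.use_jdbc": "false",
                "hoodie.datasource.hive_sync.mode": "hms",
            }
        )
    return options


def write_to_raw_zone(df, target_path: str, options: Dict[str, str]) -> None:
    """Upsert a Spark DataFrame into a Hudi table at target_path.

    Assumption: df is a pyspark.sql.DataFrame and Hudi is on the classpath.
    """
    df.write.format("hudi").options(**options).mode("append").save(target_path)


if __name__ == "__main__":
    # Hypothetical table and column names, for illustration only.
    opts = build_hudi_upsert_options(
        "orders", "order_id", "updated_at", catalog_database="raw_db"
    )
    for key in sorted(opts):
        print(f"{key} = {opts[key]}")
```

Keeping the options in a plain dictionary means they can be unit tested without a Spark session and could be driven from per-table configuration. Only `write_to_raw_zone` needs to run inside Glue.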

Architecture


ODP Framework Deep Dive and Deployment

Refer to:

  1. File Manager README.
  2. File Processor README.

ODP Framework Demo

Refer to ODP Framework Demo.


Authors

The following people were involved in the design, architecture, development, and testing of this solution:

  1. Srinivas Kandi, Data Architect, Amazon Web Services Inc.
  2. Ravi Itha, Principal Consultant, Amazon Web Services Inc.

License

This project is licensed under the Apache-2.0 License.


About

Operational Data Processing Framework developed using AWS Glue and Apache Hudi. This framework is suitable for Data Lake and Modern Data Platform implementations on the AWS Cloud.
