Operational Data Processing Framework using AWS Glue and Apache Hudi

The Operational Data Processing Framework (ODP Framework) contains three components: (1) File Manager, (2) File Processor, and (3) Configuration Manager. Each component runs independently to solve a portion of the operational data processing use case. The source code is organized in three folders (file_manager, file_processor, and config_manager), one for each component. If you customize and adopt this framework for your use cases, we recommend promoting these components to three separate code repositories in your version control system. Consider the following repository names:

aws-glue-hudi-odp-framework-file-manager
aws-glue-hudi-odp-framework-file-processor
aws-glue-hudi-odp-framework-config-manager

With this modular approach, you can deploy each component independently to your data lake environment by following your preferred CI/CD processes. As illustrated in the Overall Architecture section, these components are deployed in conjunction with a Change Data Capture (CDC) solution. For the sake of completeness, we assume that AWS DMS is used to migrate data from operational databases to Amazon S3, but we skip its implementation specifics.

Data Lake Reference Architecture

A Data Lake serves a variety of Analytics and Machine Learning (ML) use cases that involve internal and external data producers and consumers. We use a simplified, generic Data Lake reference architecture, illustrated in the diagram below. To ingest data from Operational Databases into the Amazon S3 staging bucket of the data lake, you can use either AWS Database Migration Service (DMS) or any AWS partner solution from AWS Marketplace that supports Change Data Capture (CDC). AWS Glue is used to create source-aligned and consumer-aligned datasets, and separate Glue jobs handle the Feature Engineering part of ML Engineering and Operations. Amazon Athena is used for interactive querying, and AWS Lake Formation and the Glue Data Catalog are used for governance.
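
To make the CDC step concrete, here is a minimal sketch of how a stream of change records can be collapsed into the current state of a table. It assumes the AWS DMS convention of an `Op` column holding `I`, `U`, or `D`. The key column `id`, the ordering column `dms_ts`, and the function `apply_cdc` are hypothetical names chosen for illustration and are not part of the ODP Framework.

```python
from typing import Dict, Iterable, List


def apply_cdc(records: Iterable[dict], key: str = "id", ts: str = "dms_ts") -> List[dict]:
    """Collapse CDC rows to the latest surviving row per key.

    Assumes each row carries an "Op" field: I (insert), U (update), D (delete).
    The `key` and `ts` column names are illustrative assumptions.
    """
    latest: Dict[object, dict] = {}
    # Replay changes in commit order so later changes win.
    for row in sorted(records, key=lambda r: r[ts]):
        if row["Op"] == "D":
            latest.pop(row[key], None)  # a delete removes the row
        else:
            # inserts and updates both overwrite the stored row
            latest[row[key]] = {k: v for k, v in row.items() if k != "Op"}
    return sorted(latest.values(), key=lambda r: r[key])


changes = [
    {"Op": "I", "id": 1, "name": "alice", "dms_ts": 1},
    {"Op": "I", "id": 2, "name": "bob", "dms_ts": 2},
    {"Op": "U", "id": 1, "name": "alicia", "dms_ts": 3},
    {"Op": "D", "id": 2, "name": "bob", "dms_ts": 4},
]
current = apply_cdc(changes)
print(current)
```

In a real pipeline this merge is typically delegated to the table format rather than done by hand; the sketch only shows the semantics that the downstream processing has to honor.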

In our architecture, we used AWS Database Migration Service (DMS) to ingest data from Operational Data Sources (ODS) into the S3 staging layer of the data lake. We used AWS Glue to run data ingestion and transformation pipelines. To populate the Raw zones of the data lake, we used Apache Hudi as an incremental data processing solution in conjunction with Apache Parquet. The Apache Hudi Connector for AWS Glue makes it easy to use Apache Hudi within the Glue ecosystem (Glue ETL and Glue Data Catalog).
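
As a rough illustration of what an incremental Hudi write from a Glue job involves, the sketch below assembles a set of Hudi datasource options for an upsert. The option keys are standard Apache Hudi configuration names. The helper `hudi_upsert_options`, its parameters, and the example table, database, and column names are assumptions made for this example and may differ from what the File Processor actually uses.

```python
from typing import Dict, Optional


def hudi_upsert_options(
    table: str,
    database: str,
    record_key: str,
    precombine_field: str,
    partition_field: Optional[str] = None,
) -> Dict[str, str]:
    """Build Hudi write options for an upsert into a copy-on-write table.

    Hypothetical helper: the option keys are Hudi configs, the function is not
    part of the framework.
    """
    opts = {
        "hoodie.table.name": table,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        # Column that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # When two rows share a key, the larger value of this column wins.
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Register and update the table in the catalog after each write.
        "hoodie.datasource.hive_sync.enable": "true",
        "hoodie.datasource.hive_sync.database": database,
        "hoodie.datasource.hive_sync.table": table,
        "hoodie.datasource.hive_sync.use_jdbc": "false",
        "hoodie.datasource.hive_sync.mode": "hms",
    }
    if partition_field:
        opts["hoodie.datasource.write.partitionpath.field"] = partition_field
    return opts


# Example values are placeholders, not names from the framework.
opts = hudi_upsert_options("orders", "raw_db", "order_id", "updated_at")
print(opts["hoodie.datasource.write.operation"])

# Inside a Glue or Spark job with Hudi available, the options would be applied
# roughly like this (the target path is a placeholder):
#   df.write.format("hudi").options(**opts).mode("append").save("s3://<bucket>/<prefix>/")
```

Keeping the options in one function makes the record key and precombine field explicit per table, which is the kind of per-table setting a configuration component would naturally supply.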

ODP Framework Deep Dive and Deployment

Refer to:

ODP Framework Demo

Refer to ODP Framework Demo.

Authors

The following people were involved in the design, architecture, development, and testing of this solution:

Srinivas Kandi, Data Architect, Amazon Web Services Inc.
Ravi Itha, Principal Consultant, Amazon Web Services Inc.

License

This project is licensed under the Apache-2.0 License.

Repository: awslabs/aws-glue-apache-hudi-operational-data-processing-framework