Amazon S3 Small Object Compaction with AWS Lambda and AWS Step Functions

This solution deploys a serverless application to combine ("compact") small objects stored in a given Amazon S3 prefix into a single larger object. Larger objects enable cost-effective use of S3 storage classes that have a minimum billable object size (e.g. 128 KB). Compaction can also improve performance when querying the data directly with Amazon Athena.

The sample code is written using the AWS Cloud Development Kit in Python.

There are two solutions in this project to demonstrate the benefits of using AWS Step Functions Distributed Map functionality to invoke Lambda functions in parallel and compact the objects faster:

A single Lambda function which iterates over a list of Amazon S3 prefixes to compact.
An AWS Step Functions state machine which uses Distributed Map to invoke a Lambda function for each prefix in a given list of prefixes to compact.
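
As a rough illustration of what compacting a single prefix involves, the sketch below lists every object under a prefix, concatenates the bodies, and writes one combined object. It is not the project's Lambda code. The function name compact_prefix and its parameters are hypothetical, s3_client is assumed to be a boto3 S3 client (or any object exposing the same get_paginator, get_object and put_object calls), and the objects are assumed to be uncompressed, newline-delimited text that fits in memory.

```python
def compact_prefix(s3_client, source_bucket, prefix, target_bucket, target_key):
    """Concatenate every object under `prefix` into one object.

    Returns the number of source objects that were combined.
    Assumes uncompressed, newline-delimited text small enough to hold in memory.
    """
    parts = []
    # list_objects_v2 returns at most 1,000 keys per call, so paginate.
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=source_bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches the prefix.
        for item in page.get("Contents", []):
            response = s3_client.get_object(Bucket=source_bucket, Key=item["Key"])
            # Strip trailing newlines so each part is joined by exactly one.
            parts.append(response["Body"].read().rstrip(b"\n"))

    if not parts:
        return 0  # nothing to compact, so write nothing

    s3_client.put_object(
        Bucket=target_bucket,
        Key=target_key,
        Body=b"\n".join(parts) + b"\n",
    )
    return len(parts)
```

Compressed inputs (for example gzipped log files) would need to be decompressed before joining, and very large prefixes would call for a streaming or multipart approach rather than an in-memory join.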

Deploying

Prerequisites:

Python 3.7 or later including pip and virtualenv
Node.js 14.15.0 or later

Install the latest version of the AWS CDK Toolkit using the Node Package Manager:

npm install -g aws-cdk

Create and activate a Python virtual environment:

python3 -m venv .venv
source .venv/bin/activate

Install the required Python package dependencies inside the virtualenv:

pip install -r requirements.txt

Deploy the CDK application passing in the context variables:

source_s3_uri = the S3 URI (bucket and optional prefix) containing the objects to compact.
target_s3_uri = the S3 URI (bucket and optional prefix) to write the compacted objects to. The application is granted s3:PutObject permissions on this bucket.
previous_days = the number of days of historical data to compact. This value also sets how often, in days, the EventBridge scheduled job runs.

cdk deploy -c source_s3_uri=s3://my-source-bucket/prefix/ -c target_s3_uri=s3://my-target-bucket/prefix/ -c previous_days=30
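
The URI values passed above combine a bucket name and a key prefix. A minimal sketch of splitting such a URI with the standard library is shown here. The helper name parse_s3_uri is hypothetical and is not taken from the project's code.

```python
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix/')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The URL path starts with '/', which is not part of the S3 key prefix.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = parse_s3_uri("s3://my-source-bucket/prefix/")
print(bucket, prefix)
```

A bare bucket name such as my-source-bucket has no s3:// scheme and is rejected by this helper.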

To change the date format used for the expected S3 prefix partitioning, modify the cdk.context.json file. The default is %Y/%m/%d, e.g. my/prefix/2024/01/14/file.json.
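
To make the partitioning concrete, the sketch below expands a base prefix, a strftime date format and a number of days into the list of dated prefixes to compact. The function name date_prefixes is hypothetical, and the choice to cover the days ending yesterday (rather than including today) is an assumption, not something confirmed by the project.

```python
from datetime import date, timedelta


def date_prefixes(base_prefix, date_format, days, today=None):
    """Return one dated prefix per day for the `days` days before `today`."""
    today = today or date.today()
    # Make sure a non-empty base prefix ends with exactly one separator.
    if base_prefix and not base_prefix.endswith("/"):
        base_prefix += "/"
    return [
        f"{base_prefix}{(today - timedelta(days=offset)).strftime(date_format)}/"
        for offset in range(1, days + 1)  # assumption: start from yesterday
    ]


for prefix in date_prefixes("my/prefix/", "%Y/%m/%d", 3, today=date(2024, 1, 15)):
    print(prefix)
# my/prefix/2024/01/14/
# my/prefix/2024/01/13/
# my/prefix/2024/01/12/
```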

Testing

The deployment creates an EventBridge scheduled rule that invokes the Step Functions state machine every previous_days days, as specified at deployment. This rule is DISABLED by default. To enable it, set enabled=True on the lambdaTriggerRule in compaction_stack.py and redeploy the CDK application.

The standalone Lambda function S3ObjectCompactionStack-standaloneCompactFunction can be invoked using the following test event:

{
  "s3_source_uri": "s3://my-source-bucket/AWSLogs/123456789012/CloudTrail/eu-west-1/",
  "s3_destination_uri": "s3://my-dest-bucket/",
  "date_format": "%Y/%m/%d",
  "duration": 30
}

Depending on the volume of data, this may take several minutes to run because each S3 prefix is processed in sequence.

The Step Functions state machine CompactionStateMachine can be invoked manually with the following input:

{
  "s3_source_uri": "s3://my-source-bucket/AWSLogs/123456789012/CloudTrail/eu-west-1/",
  "s3_destination_uri": "s3://my-dest-bucket/",
  "date_format": "%Y/%m/%d",
  "duration": 30
}

This will process the same data in significantly less time because the prefixes are compacted in parallel.
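
The difference between the two approaches can be illustrated locally. The sketch below is an analogy only: it uses a thread pool from the Python standard library in place of Step Functions Distributed Map, and compact_one stands in for whatever function compacts a single prefix. All names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def compact_sequentially(prefixes, compact_one):
    """Process one prefix after another, as the standalone function does."""
    return {prefix: compact_one(prefix) for prefix in prefixes}


def compact_in_parallel(prefixes, compact_one, max_workers=8):
    """Process prefixes concurrently; a local stand-in for Distributed Map."""
    prefixes = list(prefixes)  # iterated twice below
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each prefix
        # with its own result.
        return dict(zip(prefixes, pool.map(compact_one, prefixes)))
```

Both functions return the same mapping of prefix to result. When compact_one spends most of its time waiting on S3, the sequential version takes roughly the sum of the per-prefix times, while the parallel version is bounded by the slowest prefixes and the worker limit.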

Generating test data

A test utility generates random data and seeds it across date partitions in a source S3 bucket:

./test/generate_test_data.py -r 1000 -f 10000 -b my-test-bucket
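
For readers who want to see the general shape of such test data without touching S3, the sketch below builds small random JSON-lines objects in memory and spreads them across date-partitioned keys. It does not reproduce generate_test_data.py or its options. The function name, parameters, key layout and record fields are all assumptions made for illustration.

```python
import json
import random
from datetime import date, timedelta


def generate_test_objects(num_files, records_per_file, days,
                          prefix="test-data/", seed=None, today=None):
    """Return {key: body_bytes} with keys spread over the last `days` days."""
    rng = random.Random(seed)  # seedable, so runs can be reproduced
    today = today or date.today()
    objects = {}
    for i in range(num_files):
        # Pick a random day partition within the window (today included).
        day = today - timedelta(days=rng.randrange(days))
        key = f"{prefix}{day.strftime('%Y/%m/%d')}/file-{i:06d}.json"
        lines = [
            json.dumps({"id": rng.getrandbits(32), "value": rng.random()})
            for _ in range(records_per_file)
        ]
        objects[key] = ("\n".join(lines) + "\n").encode("utf-8")
    return objects
```

Each returned body could then be uploaded with an S3 client's put_object call, using the dictionary key as the object key.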

Cleaning Up

Resources can be deleted by running:

cdk destroy

