
Example AWS Architecture: Streaming data to events

This repo contains a complete, working architecture on AWS that will take in streaming data, store it in S3 and DynamoDB, and then emit events with their contents.

The example aims to be a realistic, working demonstration of this pattern rather than a completely minimal one. The state of this work is "solid", but it can (and should!) be improved and adapted to your actual needs. Please see the Improvements and TODO section for concise ideas and open suggestions.

Please note that with the trivial database structure in use, you might get database records that are overwritten if they are written at the same epoch time. You can fix this by moving to a more resilient table format, for example by adding a range key.
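
For illustration, below is a minimal sketch (using the AWS SDK v3 document client) of a write against a composite-key table: the epoch timestamp as partition key plus a unique id as range key, so two records written in the same second get distinct keys. The table and attribute names are hypothetical and not the repo's actual schema.

```typescript
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
import { DynamoDBDocumentClient, PutCommand } from '@aws-sdk/lib-dynamodb';
import { randomUUID } from 'crypto';

const docClient = DynamoDBDocumentClient.from(new DynamoDBClient({}));

/**
 * Store a record using a composite primary key: the epoch timestamp as
 * partition key and a unique id as range (sort) key, so two records written
 * in the same second no longer overwrite each other.
 */
export async function storeRecord(data: Record<string, unknown>): Promise<void> {
  await docClient.send(
    new PutCommand({
      TableName: 'example-event-store', // hypothetical table name
      Item: {
        timestamp: Date.now(), // partition key (epoch time); hypothetical attribute name
        id: randomUUID(),      // range key that makes the full primary key unique
        ...data
      }
    })
  );
}
```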

How it is structured

The repo consists of three separate services (or stacks, if you will):

  1. ingest: The main solution that handles intake of data and output of events.
  2. consumer: A basic demo application that can be triggered by incoming events of a given type, as well as via a regular API Gateway.
  3. generator: A utility application to generate test data and send it to the ingest solution, either using Kinesis Firehose or SQS.

This README contains the general project information; refer to each service's own README for the specifics.

How it works

  • Ingest context
    • Take in streaming data via Kinesis Firehose (using "direct PUT events")
    • Transform the data with Lambda (Transformer function); a sketch of this step follows the list
    • Automatically dump the transformed data into an S3 bucket ("data lake")
    • Place the transformed data into a first-in-first-out SQS queue
    • The queue will trigger a Lambda (EventStorer function) that will store the data in DynamoDB, similar to an event store
    • DynamoDB has the Streams feature turned on, which will trigger a third Lambda (IntegrationEventEmitter function) on any database changes
    • The integration event is emitted through EventBridge on a custom event bus
  • Consumer context
    • A basic "consumer" service using Lambda (EventResponder function) will be triggered by the integration event
    • API Gateway is also deployed in front of the service, so users can access the function/service
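
To make the first step concrete, here is a minimal sketch of a Firehose data-transformation Lambda, assuming the standard Firehose transformation event and response shape; the actual Transformer function in ingest/ does more than this and differs in its details.

```typescript
/**
 * A Firehose data-transformation handler: decode each base64 record,
 * transform it, and return it re-encoded with the same recordId and a
 * result of 'Ok' (or 'Dropped' / 'ProcessingFailed').
 */
export async function handler(event: {
  records: { recordId: string; data: string }[];
}) {
  const records = event.records.map((record) => {
    // Firehose delivers each payload base64-encoded
    const payload = JSON.parse(Buffer.from(record.data, 'base64').toString('utf8'));

    // ...whatever transformation the pipeline needs...
    const transformed = { ...payload, transformedAt: Date.now() };

    return {
      recordId: record.recordId,
      result: 'Ok',
      // Appending a newline keeps the S3 objects newline-delimited
      // (see the Improvements note about newline separators)
      data: Buffer.from(JSON.stringify(transformed) + '\n').toString('base64')
    };
  });

  return { records };
}
```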

Properties of the solution

  • Well-known, high-scaling components
  • Mix of eventually consistent (DynamoDB, Lambda) and strongly consistent (Kinesis Firehose, FIFO queue) components
  • The input constraint is the Kinesis Firehose service limits
  • The output constraint is the EventBridge service limits (see the sketch below)
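
To make the output path concrete, here is a minimal sketch of emitting an integration event on a custom EventBridge bus with the AWS SDK v3. The bus name, source, and detail type are hypothetical, not necessarily what the IntegrationEventEmitter uses.

```typescript
import { EventBridgeClient, PutEventsCommand } from '@aws-sdk/client-eventbridge';

const eventBridge = new EventBridgeClient({});

/**
 * Emit one integration event on a custom EventBridge bus. PutEvents accepts
 * batches of up to 10 entries per call, which is part of the output limit
 * referred to above.
 */
export async function emitIntegrationEvent(detail: Record<string, unknown>): Promise<void> {
  await eventBridge.send(
    new PutEventsCommand({
      Entries: [
        {
          EventBusName: 'example-custom-bus', // hypothetical bus name
          Source: 'example.ingest',           // hypothetical source
          DetailType: 'RecordStored',         // hypothetical detail type
          Detail: JSON.stringify(detail)
        }
      ]
    })
  );
}
```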

Tooling

This project uses:

  • Terraform to describe (and be able to deploy) the infrastructure as code
  • Serverless Framework to describe (and be able to deploy) the application parts and services as code
  • TypeScript to write most of the application code
  • (Optional) Checkov annotations can be seen in the Terraform modules; not strictly required

Diagram

See the diagram below for how the overall flow moves through the various AWS components.

[Architecture diagram]


Prerequisites

  • Amazon Web Services (AWS) account with sufficient permissions so that you can deploy infrastructure. A naive but simple policy would be full rights for at least CloudWatch, Lambda, API Gateway, DynamoDB, X-Ray, SQS, Kinesis, EventBridge, and S3.
  • Recent Node version installed, ideally version 16 or later.
  • Terraform installed.
  • Ideally Terraform Cloud (local or remote workspace). If you're not using Terraform Cloud, remove the cloud block from ingest/infra/main.tf.

Configuration

Before making any changes to names and such, please do a "find-all" in the repo to find references to any values you might want to change. Search especially for 123412341234 and replace it with your AWS account number.

You should take a look at the Terraform modules (under ingest/infra/) and check that the settings, values, and regions are correct and/or optimized for your needs.

Also, if you're not using Terraform Cloud, remove the cloud block from ingest/infra/main.tf.

Deployment order

  • Deploy ingest
  • Deploy the Terraform infrastructure in ingest/infra
  • Deploy consumer
  • Run generator

Improvements and TODO

  • The use of Base64/ASCII/UTF8 conversions across the events/functions can likely be improved a lot
  • generator/index.js - The current way of turning data into an ArrayBuffer results in garbled data.
  • ingest/infra/modules/kinesis/main.tf - Customer-managed KMS?
  • ingest/infra/modules/s3/main.tf - Multi-region replication? TTL, S3 Glacier?
  • ingest/infra/modules/sqs/main.tf - Add DLQ (Dead Letter Queue)?
  • ingest/infra/modules/dynamodb/main.tf - Replicas (Global Tables) seem to create problems? (currently turned off)
  • ingest/src/Transformer/adapters/web/Transformer.ts - Ensure Kinesis config adds newline separators; see https://aws.amazon.com/blogs/big-data/build-seamless-data-streaming-pipelines-with-amazon-kinesis-data-streams-and-amazon-kinesis-data-firehose-for-amazon-dynamodb-tables/
  • ingest/src/EventStorer/adapters/web/EventStorer.ts: Might be faster to run const dynamo = createNewDynamoRepository() in global scope? (See the sketch after this list)
  • Tests?
  • Take incoming events and copy/store them as test data
  • DynamoDB: key field should have something else inserted into it?
  • DynamoDB: Use range key for timestamp?
  • EventBridge: Use "Resources" field?
  • EventBridge: Emit more events or for more details ("systems")?
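
As an illustration of the EventStorer note above, here is a sketch of moving the repository creation to module (global) scope so it is reused across warm invocations instead of being rebuilt per request. The import path and the repository's add method are hypothetical, not the repo's actual layout.

```typescript
// Illustrative import path; the actual module layout in the repo differs.
import { createNewDynamoRepository } from './infrastructure/createNewDynamoRepository';

// Created once per Lambda container (at cold start) and reused across
// warm invocations, instead of being recreated inside the handler.
const dynamo = createNewDynamoRepository();

export async function handler(event: { Records: { body: string }[] }) {
  for (const record of event.Records) {
    // 'add' is a hypothetical repository method for storing one record
    await dynamo.add(JSON.parse(record.body));
  }
}
```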

