aws-samples/amazon-emr-with-delta-lake

Delta Lake with Amazon EMR

This guide helps you quickly explore the main features of Delta Lake. It provides code snippets that show how to read from and write to Delta tables with Amazon EMR.
For more details, see the video "Incremental Data Processing using Delta Lake with EMR".
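
As a rough illustration of the kind of read and write operations the guide covers, here is a minimal sketch. It assumes a Spark session that already has the Delta Lake configuration applied (in an EMR Studio PySpark notebook, `spark` is normally predefined). The helper name `write_and_read_delta` and the `demo-table` path suffix are hypothetical, chosen only for this example; the bucket name reuses the example from the Quickstart.

```python
def write_and_read_delta(spark, table_path):
    """Write a tiny DataFrame as a Delta table, then read it back.

    `spark` is an existing SparkSession with Delta Lake enabled;
    `table_path` is a hypothetical location such as an S3 URI.
    """
    # Build a small DataFrame with a single `id` column (values 0 to 4).
    df = spark.range(5)

    # Write it in Delta format; "overwrite" replaces any table already at this path.
    df.write.format("delta").mode("overwrite").save(table_path)

    # Read the Delta table back as a DataFrame.
    return spark.read.format("delta").load(table_path)


# Example call inside a notebook where `spark` is already defined:
# write_and_read_delta(spark, "s3://learn-deltalake-2022/demo-table").show()
```

Passing `spark` in as an argument keeps the helper independent of how the session was created, so the same function should work in a notebook or in a submitted job.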

Quickstart

  1. Create an S3 bucket for Delta Lake (e.g. learn-deltalake-2022)
  2. Create an EMR cluster using AWS CDK (see the instructions for details)
  3. Create an EMR Studio using AWS CDK (see the instructions for details)
  4. Open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/
  5. Open the EMR Studio and create an EMR Studio Workspace
  6. Launch the EMR Studio Workspace
  7. Attach the EMR Cluster to a Jupyter Notebook
  8. Upload deltalake-with-emr-demo.ipynb into the Jupyter Notebook
  9. Set the kernel to PySpark and run each cell
  10. To run Amazon Athena queries on Delta Lake, check this

Key Configurations

  • Amazon EMR Applications

    • Hadoop
    • Hive
    • JupyterHub
    • JupyterEnterpriseGateway
    • Livy
    • Apache Spark (>= 3.0)
  • Apache Spark (PySpark)

    {
      "conf": {
        "spark.jars.packages": "io.delta:delta-core_2.12:{version}",
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog"
      }
    }
    
    • ⚠️ YOU MUST REPLACE {version} with the appropriate one
    • For more details, check this
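
To avoid hand-editing the {version} placeholder, the configuration can be generated. The following is a minimal sketch, not part of the original demo: the helper name `build_delta_conf` is hypothetical, and `2.0.0` is only an example version that should be replaced with one matching your Spark release. It assumes the resulting JSON is pasted into a notebook cell that accepts this `{"conf": ...}` shape, such as a sparkmagic `%%configure -f` cell.

```python
import json


def build_delta_conf(delta_version, scala_version="2.12"):
    """Return the Spark session configuration for Delta Lake as a dict.

    `delta_version` replaces the {version} placeholder; `scala_version`
    defaults to 2.12, the value used in the configuration above.
    """
    return {
        "conf": {
            # Maven coordinates of the Delta Lake core package.
            "spark.jars.packages": f"io.delta:delta-core_{scala_version}:{delta_version}",
            # Enables Delta Lake SQL commands.
            "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
            # Routes the default catalog through Delta Lake.
            "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        }
    }


# Serialize with json.dumps so the output is guaranteed to be valid JSON.
payload = json.dumps(build_delta_conf("2.0.0"), indent=2)
print(payload)
```

Generating the payload through `json.dumps` also guards against small syntax slips, such as a trailing comma, that would make the cell fail to parse.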

Compatibility with Apache Spark

ℹ️ The following table was last updated on 26 Aug 2022.

Delta Lake version    Apache Spark version
2.0.x                 3.2.x
1.2.x                 3.2.x
1.1.x                 3.2.x
1.0.x                 3.1.x
0.7.x and 0.8.x       3.0.x
Below 0.7.x           2.4.2 - 2.4.<latest>
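
The table above can also be expressed as a small lookup, which may help when picking a {version} programmatically. This is a sketch only: the names `COMPATIBILITY` and `spark_series_for` are hypothetical, and the mapping simply transcribes the rows above as of 26 Aug 2022, so newer Delta Lake releases are deliberately reported as unknown rather than guessed.

```python
# (major, minor) of the Delta Lake version -> compatible Apache Spark series,
# transcribed from the compatibility table above.
COMPATIBILITY = {
    (2, 0): "3.2.x",
    (1, 2): "3.2.x",
    (1, 1): "3.2.x",
    (1, 0): "3.1.x",
    (0, 8): "3.0.x",
    (0, 7): "3.0.x",
}


def spark_series_for(delta_version):
    """Return the Spark series listed for a Delta Lake version such as "1.0.1"."""
    major, minor = (int(part) for part in delta_version.split(".")[:2])

    # Everything below 0.7.x shares a single row in the table.
    if (major, minor) < (0, 7):
        return "2.4.2 - 2.4.<latest>"

    try:
        return COMPATIBILITY[(major, minor)]
    except KeyError:
        # The version is newer than (or missing from) the table; do not guess.
        raise ValueError(
            f"Delta Lake {delta_version} is not covered by the table"
        ) from None


print(spark_series_for("1.0.1"))
```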

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.