AWS Storage Blog

Run queries up to 9x faster using Trino with Amazon S3 Select on Amazon EMR

Customers building data lakes continue to innovate in the ways that they store and access their data. For these customers, performance is critical, particularly when they are accessing large amounts of data. For example, data scientists, data analysts, and data engineers running queries from open source frameworks like Trino want to accelerate access to their data, which lets them spend more time analysing their data and generating insights.

Amazon S3 Select is an Amazon Simple Storage Service (Amazon S3) feature that makes it easy to retrieve specific data from an object using simple SQL expressions without having to retrieve the entire object. Trino is an open source SQL query engine that can be used to run interactive analytics on data stored in Amazon S3. By using Trino with S3 Select, you retrieve only a subset of data from an object, reducing the amount of data returned from Amazon S3 and accelerating query performance.

On November 21, 2022, AWS announced its upstream contributions to open source Trino, which improves query performance when accessing CSV and JSON data formats. Now, you can use the S3 Select Pushdown performance enhancements to reduce the amount of data that must be transferred and processed by Trino. S3 Select Pushdown enables Trino to “push down” the computational work of projection operations (for example, selecting a subset of columns) and predicate operations (for example, applying filtering in a WHERE clause ) to Amazon S3. This allows queries to retrieve only the required data from Amazon S3.

In this post, we discuss the performance benchmarks on Trino release 397 with S3 Select using TPC-DS-like benchmark queries at 3 TB scale. We show that queries run up to 9x faster as a result of pushing down the computational load of scanning and filtering data to Amazon S3 when compared to using Trino without S3 Select.

How to configure the Trino Hive connector

You enable S3 Select Pushdown using the s3_select_pushdown_enabled Hive session property, or using the hive.s3select-pushdown.enabled configuration property. The session property overrides the config property, which lets you enable or disable S3 Select Pushdown on a per-query basis.

Results observed using TPC-DS-like benchmarks

To evaluate the performance improvements on Trino with S3 Select, we ran all 99 TPC-DS-like benchmark queries at 3 TB scale on a 33-node r5.16xlarge EMR v 6.8.0 cluster patched with Trino release 397 and all of the data stored on Amazon S3.

We ran all queries successfully multiple times with and without S3 Select. To ensure consistent results, we monitored the standard deviations for each query run. The average deviation in query runtimes across all 99 queries was 0.70 seconds with S3 Select and 0.99 seconds without S3 Select.

First, we’ll compare the total aggregate runtime and the total aggregate amount of data processed for all 99 queries in the TPC-DS-like 3 TB CSV (uncompressed) dataset. The first graph shows the total aggregate runtime in seconds with and without S3 Select:

The following graph shows the total aggregate data (in terabytes) processed with and without S3 Select:

The next graph shows the total aggregate data (in terabytes) processed with and without S3 Select:

the total aggregate data (in terabytes) processed with and without S3 Select

In our tests, we found that S3 Select sped up all 99 queries. The maximum query acceleration with S3 Select was 9.2x, the minimum query acceleration with S3 Select was 1.1x, and the average query acceleration was 2.5x. The following graph shows the query speedup for each of the 99 queries:

CSV Quey Speedup with S3 Select

In our tests, we found that S3 Select reduced the amount of bytes processed by Trino for all 99 queries. For example, we saw a reduction of 17 TB (99%) of processed data with S3 Select on Query 9. During testing, the average reduction in the amount of processed data per query with S3 Select was 2 TB, and the total reduction in processed data across all 99 queries was 200 TB (21x better) with S3 Select.

The following graph shows the reduction of data processed for each of the 99 queries with S3 Select:

the reduction of data processed for each of the 99 queries with S3 Select

The performance results provided in this post required no tuning to Amazon EMR or Trino, and all of the results are from default configurations. The following is the default Trino configuration used with our EMR cluster.

/etc/trino/conf/config.properties:
coordinator=true
node-scheduler.include-coordinator=false
http-server.threads.max=500
discovery-server.enabled=true
sink.max-buffer-size=1GB
query.max-memory=6606029MB
query.max-memory-per-node=214061170033B
query.max-history=40
query.min-expire-age=30m
query.client.timeout=30m
query.stage-count-warning-threshold=100
query.max-stage-count=150
http-server.http.port=8889
http-server.log.path=/var/log/trino/http-request.log
http-server.log.max-size=67108864B
http-server.log.max-history=5
log.max-size=268435456B
log.max-history=5
jmx.rmiregistry.port=9080
jmx.rmiserver.port=9081

Next, let’s look at the enhancements that we made to Trino that contributed to these results.

Performance enhancements to the Trino Hive connector

Our contributions to Trino improve how Trino sends requests to Amazon S3 by enhancing its use of S3 Select. There are two contributing factors that accelerate the query runtime when S3 Select is used. First, S3 Select reduces the number of bytes transferred between Amazon S3 and Trino by pushing down the filtering to Amazon S3. With S3 Select, Trino retrieves a pre-filtered subset of Amazon S3 data because filtering and projection is performed by S3 Select. Second, using S3 Select to push down the computation work of filtering to Amazon S3 increases Trino’s ability to parallelise projection and predicate operations.

Conclusion

In this post, we presented our results from running our TPC-DS-like 3TB scale benchmark. With the S3 Select Pushdown performance optimizations available in Trino release 397 and later, you can run queries faster than before by using Trino with S3 Select to “pushdown” the computational work of projection and predicate operations to Amazon S3. Our benchmark testing demonstrated up to a 9x performance improvement in query runtime (2.5x on average), and a 21x overall reduction in the total data processed.

If you have a data lake built on Amazon S3 and use Trino today, then you can use S3 Select’s filtering capability to quickly and easily run interactive ad-hoc queries. For more information, see the Trino release notes to learn about the enhancements to the Hive connector.

By default, S3 Select Pushdown is disabled in the Trino Hive connector. Its performance depends on the amount of data filtered by the query. For example, by filtering a large number of rows, you will achieve better performance. We recommend that you benchmark your workloads with and without S3 Select to see if using it may be suitable for your specific use case. For more information on S3 Select pricing, please visit the Amazon S3 pricing page.

Boni Bruno

Boni Bruno

Boni Bruno is a Principal Architect and Workload Specialist at AWS. He enjoys developing solution-driven architectures and sharing informative content to the AWS Storage and Analytics community. Prior to AWS, he was the Chief Solutions Architect for Dell Technologies’ Unstructured Data Solutions Division, where he built numerous solutions around big data processing, storage, machine learning, analytics, and various HPC applications.

Eric Henderson

Eric Henderson

Eric Henderson is a Principal Product Manager with Amazon S3 focusing on S3’s serverless compute and event-driven technology. He loves building products that solve problems for customers, the art of product management, and continuously learning. Outside of work, he enjoys mentoring, coaching, and home brewing.