Amazon EMR 7.5 runtime for Apache Spark and Iceberg can run Spark workloads 3.6 times faster than Spark 3.5.3 and Iceberg 1.6.1

December 28, 2024

The Amazon EMR runtime for Apache Spark offers a high-performance runtime environment while maintaining 100% API compatibility with open source Apache Spark and the Apache Iceberg table format. Amazon EMR on EC2, Amazon EMR Serverless, Amazon EMR on Amazon EKS, Amazon EMR on AWS Outposts, and AWS Glue all use the optimized runtimes.

In this post, we demonstrate the performance benefits of using the Amazon EMR 7.5 runtime for Spark and Iceberg compared to open source Spark 3.5.3 with Iceberg 1.6.1 tables on the TPC-DS 3 TB benchmark v2.13.

Iceberg is a popular open source high-performance format for large analytic tables. Our benchmarks demonstrate that Amazon EMR can run TPC-DS 3 TB workloads 3.6 times faster, reducing the runtime from 1.54 hours to 0.42 hours. Additionally, the cost efficiency improves by 2.9 times, with the total cost decreasing from $16.00 to $5.39 when using Amazon Elastic Compute Cloud (Amazon EC2) On-Demand r5d.4xlarge instances, providing observable gains for data processing tasks.

This is a further 32% increase from the optimizations shipped in Amazon EMR 7.1, covered in a previous post, Amazon EMR 7.1 runtime for Apache Spark and Iceberg can run Spark workloads 2.7 times faster than Apache Spark 3.5.1 and Iceberg 1.5.2. Since then, we’ve continued adding DataSource V2 support for eight more existing query optimizations in the EMR runtime for Spark.

In addition to these DataSource V2 specific improvements, we’ve made more optimizations to Spark operators since Amazon EMR 7.1 that also contribute to the additional speedup.

Benchmark results for Amazon EMR 7.5 compared to open source Spark 3.5.3 and Iceberg 1.6.1

To evaluate the Spark engine’s performance with the Iceberg table format, we performed benchmark tests using the 3 TB TPC-DS dataset, version 2.13 (our results derived from the TPC-DS dataset are not directly comparable to the official TPC-DS results due to setup differences). Benchmark tests for the EMR runtime for Spark and Iceberg were conducted on Amazon EMR 7.5 EC2 clusters vs. open source Spark 3.5.3 and Iceberg 1.6.1 on EC2 clusters.

The setup instructions and technical details are available in our GitHub repository. To minimize the influence of external catalogs like AWS Glue and Hive, we used the Hadoop catalog for the Iceberg tables. This uses the underlying file system, specifically Amazon S3, as the catalog. We can define this setup by setting the catalog’s type property to hadoop, for example spark.sql.catalog.hadoop_catalog.type=hadoop for a catalog named hadoop_catalog. The fact tables used the default partitioning by the date column, with the number of partitions varying from 200–2,100. No precalculated statistics were used for these tables.
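As an illustration of this catalog setup, the following Python sketch assembles the Spark properties for a Hadoop-type Iceberg catalog and renders them as --conf arguments. The property keys mirror the ones used in the commands later in this post. The helper names hadoop_catalog_conf and to_conf_args, and the example-bucket warehouse path, are hypothetical and used only for illustration.

```python
def hadoop_catalog_conf(catalog_name, warehouse):
    """Return Spark properties for an Iceberg catalog backed by the file system.

    catalog_name and warehouse are caller-supplied; nothing here is specific
    to the benchmark repository.
    """
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # "hadoop" selects the file-system (Hadoop) catalog instead of Glue or Hive.
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }


def to_conf_args(conf):
    """Flatten a property dict into ['--conf', 'key=value', ...]."""
    args = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Hypothetical warehouse location, for illustration only.
    conf = hadoop_catalog_conf("hadoop_catalog", "s3://example-bucket/warehouse/")
    print(" ".join(to_conf_args(conf)))
```

Keeping the properties in one dictionary makes it easier to pass the same catalog definition to both the open source and the Amazon EMR runs.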

We ran a total of 104 SparkSQL queries in three sequential rounds, and the average runtime of each query across these rounds was taken for comparison. The average runtime for the three rounds on Amazon EMR 7.5 with Iceberg enabled was 0.42 hours, demonstrating a 3.6-fold speed increase compared to open source Spark 3.5.3 and Iceberg 1.6.1. The following figure presents the total runtimes in seconds.

The following table summarizes the metrics.

Metric	Amazon EMR 7.5 on EC2	Amazon EMR 7.1 on EC2	Open Source Spark 3.5.3 and Iceberg 1.6.1
Average runtime in seconds	1535.62	2033.17	5546.16
Geometric mean over queries in seconds	8.30046	10.13153	20.40555
Cost*	$5.39	$7.18	$16.00

*Detailed cost estimates are discussed later in this post.
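The headline ratios can be rechecked from the average runtimes in the preceding table. The following Python sketch is only a sanity check of that arithmetic; the function and variable names are illustrative and not part of the benchmark tooling.

```python
# Average total runtimes in seconds, copied from the metrics table in this post.
AVG_RUNTIME_SECONDS = {
    "emr_7_5": 1535.62,
    "emr_7_1": 2033.17,
    "oss_spark_3_5_3": 5546.16,
}


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def to_hours(seconds):
    """Convert seconds to hours."""
    return seconds / 3600.0


if __name__ == "__main__":
    emr75 = AVG_RUNTIME_SECONDS["emr_7_5"]
    vs_oss = speedup(AVG_RUNTIME_SECONDS["oss_spark_3_5_3"], emr75)
    vs_emr71 = speedup(AVG_RUNTIME_SECONDS["emr_7_1"], emr75)
    print(f"EMR 7.5 vs open source Spark: {vs_oss:.2f}x")
    print(f"EMR 7.5 vs EMR 7.1: {(vs_emr71 - 1) * 100:.0f}% faster")
    print(f"EMR 7.5 runtime: {to_hours(emr75):.2f} hours")
```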

The following chart demonstrates the per-query performance improvement of Amazon EMR 7.5 relative to open source Spark 3.5.3 and Iceberg 1.6.1. The extent of the speedup varies from one query to another, with Amazon EMR outperforming open source Spark with Iceberg tables by up to 9.4 times for q93, the largest gain. The horizontal axis arranges the TPC-DS 3 TB benchmark queries in descending order based on the performance improvement seen with Amazon EMR, and the vertical axis depicts the magnitude of this speedup as a ratio.
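The ordering used in that chart can be reproduced from two sets of per-query timings. The following Python sketch shows one way to do it. The timings in the usage example are hypothetical placeholders, not benchmark results, and the function name is illustrative.

```python
def rank_by_speedup(baseline_times, candidate_times):
    """Return (query, speedup ratio) pairs sorted from largest to smallest ratio.

    Both arguments map a query name to its runtime in the same unit.
    Only queries present in both mappings are compared.
    """
    shared = baseline_times.keys() & candidate_times.keys()
    ratios = {q: baseline_times[q] / candidate_times[q] for q in shared}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical timings for illustration only; not measured values.
    open_source = {"q1-v2.13": 10.0, "q2-v2.13": 30.0, "q93-v2.13": 94.0}
    emr = {"q1-v2.13": 5.0, "q2-v2.13": 10.0, "q93-v2.13": 10.0}
    for query, ratio in rank_by_speedup(open_source, emr):
        print(f"{query}: {ratio:.1f}x")
```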

Cost comparison

Our benchmark provides the total runtime and geometric mean data to assess the performance of Spark and Iceberg in a complex, real-world decision support scenario. For additional insights, we also examine the cost aspect. We calculate cost estimates using formulas that account for EC2 On-Demand instances, Amazon Elastic Block Store (Amazon EBS), and Amazon EMR expenses.

Amazon EC2 cost (includes SSD cost) = number of instances * r5d.4xlarge hourly rate * job runtime in hours
- r5d.4xlarge hourly rate = $1.152 per hour in us-east-1
Root Amazon EBS cost = number of instances * Amazon EBS per GB-hourly rate * root EBS volume size * job runtime in hours
Amazon EMR cost = number of instances * r5d.4xlarge Amazon EMR price * job runtime in hours
- r5d.4xlarge Amazon EMR price = $0.27 per hour
Total cost = Amazon EC2 cost + root Amazon EBS cost + Amazon EMR cost
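The formulas above can be written as a small Python function. This is a minimal sketch: the EC2 and Amazon EMR rates are the ones quoted in this post, but the post does not state the Amazon EBS per GB-hour rate, so the value used in the usage example (0.10 USD per GB-month spread over 720 hours) is an assumption for illustration. The function and constant names are also illustrative.

```python
R5D_4XLARGE_EC2_RATE = 1.152  # USD per instance-hour in us-east-1, from this post
R5D_4XLARGE_EMR_RATE = 0.27   # USD per instance-hour, from this post


def benchmark_cost(instances, runtime_hours, ebs_gb, ebs_rate_per_gb_hour,
                   emr_rate=R5D_4XLARGE_EMR_RATE,
                   ec2_rate=R5D_4XLARGE_EC2_RATE):
    """Apply the three cost formulas and return each component plus the total."""
    ec2 = instances * ec2_rate * runtime_hours
    ebs = instances * ebs_rate_per_gb_hour * ebs_gb * runtime_hours
    emr = instances * emr_rate * runtime_hours
    return {"ec2": ec2, "ebs": ebs, "emr": emr, "total": ec2 + ebs + emr}


if __name__ == "__main__":
    # ASSUMPTION: EBS rate is not given in the post; this value is illustrative.
    assumed_ebs_rate = 0.10 / 720

    # Open source Spark run: nine instances, 1.54 hours, 20 GB root volumes,
    # and no Amazon EMR charge (emr_rate=0).
    oss = benchmark_cost(9, 1.54, 20, assumed_ebs_rate, emr_rate=0.0)
    for name, value in oss.items():
        print(f"{name}: ${value:.2f}")
```

Published figures may differ by a cent or two from what this function returns, depending on how the runtime in hours is rounded before the multiplication.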

The calculations reveal that the Amazon EMR 7.5 benchmark yields a 2.9-fold cost efficiency improvement over open source Spark 3.5.3 and Iceberg 1.6.1 in running the benchmark job.

Metric	Amazon EMR 7.5	Amazon EMR 7.1	Open Source Spark 3.5.3 and Iceberg 1.6.1
Runtime in hours	0.426	0.564	1.540
Number of EC2 instances (includes primary node)	9	9	9
Amazon EBS size	20 GB	20 GB	20 GB
Amazon EC2 (total runtime cost)	$4.35	$5.81	$15.97
Amazon EBS cost	$0.01	$0.01	$0.04
Amazon EMR cost	$1.02	$1.36	$0
Total cost	$5.38	$7.18	$16.01
Cost savings	Amazon EMR 7.5 is 2.9 times better	Amazon EMR 7.1 is 2.2 times better	Baseline

In addition to the time-based metrics discussed so far, data from Spark event logs shows that Amazon EMR scanned approximately 3.4 times less data from Amazon S3 and 4.1 times fewer records than the open source version in the TPC-DS 3 TB benchmark. This reduction in Amazon S3 data scanning contributes directly to cost savings for Amazon EMR workloads.

Run open source Spark benchmarks on Iceberg tables

We used separate EC2 clusters, each equipped with nine r5d.4xlarge instances, for testing both open source Spark 3.5.3 and Amazon EMR 7.5 for the Iceberg workload. The primary node was equipped with 16 vCPU and 128 GB of memory, and the eight worker nodes together had 128 vCPU and 1024 GB of memory. We conducted tests using the Amazon EMR default settings to showcase the typical user experience and minimally adjusted the settings of Spark and Iceberg to maintain a balanced comparison.

The following table summarizes the Amazon EC2 configurations for the primary node and eight worker nodes of type r5d.4xlarge.

EC2 Instance	vCPU	Memory (GiB)	Instance Storage (GB)	EBS Root Volume (GB)
r5d.4xlarge	16	128	2 x 300 NVMe SSD	20

Prerequisites

The following prerequisites are required to run the benchmarking:

Using the instructions in the emr-spark-benchmark GitHub repo, set up the TPC-DS source data in your S3 bucket and on your local computer.
Build the benchmark application following the steps provided in Steps to build spark-benchmark-assembly application and copy the benchmark application to your S3 bucket. Alternatively, copy spark-benchmark-assembly-3.5.3.jar to your S3 bucket.
Create Iceberg tables from the TPC-DS source data. Follow the instructions on GitHub to create Iceberg tables using the Hadoop catalog. For example, the following code uses an EMR 7.5 cluster with Iceberg enabled to create the tables:

aws emr add-steps \
--cluster-id  --steps Type=Spark,Name="Create Iceberg Tables",\
Args=[--class,com.amazonaws.eks.tpcds.CreateIcebergTables,--conf,spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,\
--conf,spark.sql.catalog.hadoop_catalog=org.apache.iceberg.spark.SparkCatalog,\
--conf,spark.sql.catalog.hadoop_catalog.type=hadoop,\
--conf,spark.sql.catalog.hadoop_catalog.warehouse=s3:////,\
--conf,spark.sql.catalog.hadoop_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO,\
s3:////spark-benchmark-assembly-3.5.3.jar,s3://blogpost-sparkoneks-us-east-1/blog/BLOG_TPCDS-TEST-3T-partitioned/,\
/home/hadoop/tpcds-kit/tools,parquet,3000,true,,true,true],ActionOnFailure=CONTINUE --region

Note the Hadoop catalog warehouse location and database name from the preceding step. We use the same Iceberg tables to run benchmarks with Amazon EMR 7.5 and open source Spark.

This benchmark application is built from the branch tpcds-v2.13_iceberg. If you’re building a new benchmark application, switch to the correct branch after downloading the source code from the GitHub repo.

Create and configure a YARN cluster on Amazon EC2

To compare Iceberg performance between Amazon EMR on Amazon EC2 and open source Spark on Amazon EC2, follow the instructions in the emr-spark-benchmark GitHub repo to create an open source Spark cluster on Amazon EC2 using Flintrock with eight worker nodes.

Based on the cluster selection for this test, the following configurations are used:

Make sure to replace the placeholder in the yarn-site.xml file with the primary node’s IP address of your Flintrock cluster.

Run the TPC-DS benchmark with Spark 3.5.3 and Iceberg 1.6.1

Complete the following steps to run the TPC-DS benchmark:

Log in to the open source cluster primary node using flintrock login $CLUSTER_NAME.
Submit your Spark job:
1. Choose the correct Iceberg catalog warehouse location and database that has the created Iceberg tables.
2. The results are created in s3:///benchmark_run.
3. You can track progress in /media/ephemeral0/spark_run.log.

spark-submit \
--master yarn \
--deploy-mode client \
--class com.amazonaws.eks.tpcds.BenchmarkSQL \
--conf spark.driver.cores=4 \
--conf spark.driver.memory=10g \
--conf spark.executor.cores=16 \
--conf spark.executor.memory=100g \
--conf spark.executor.instances=8 \
--conf spark.network.timeout=2000 \
--conf spark.executor.heartbeatInterval=300s \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.shuffle.service.enabled=false \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.InstanceProfileCredentialsProvider \
--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.jars.packages=org.apache.hadoop:hadoop-aws:3.3.4,org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,org.apache.iceberg:iceberg-aws-bundle:1.6.1 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.local.type=hadoop \
--conf spark.sql.catalog.local.warehouse=s3a://// \
--conf spark.sql.defaultCatalog=local \
--conf spark.sql.catalog.local.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
spark-benchmark-assembly-3.5.3.jar \
s3:///benchmark_run 3000 1 false \
q1-v2.13,q10-v2.13,q11-v2.13,q12-v2.13,q13-v2.13,q14a-v2.13,q14b-v2.13,q15-v2.13,q16-v2.13,\
q17-v2.13,q18-v2.13,q19-v2.13,q2-v2.13,q20-v2.13,q21-v2.13,q22-v2.13,q23a-v2.13,q23b-v2.13,\
q24a-v2.13,q24b-v2.13,q25-v2.13,q26-v2.13,q27-v2.13,q28-v2.13,q29-v2.13,q3-v2.13,q30-v2.13,\
q31-v2.13,q32-v2.13,q33-v2.13,q34-v2.13,q35-v2.13,q36-v2.13,q37-v2.13,q38-v2.13,q39a-v2.13,\
q39b-v2.13,q4-v2.13,q40-v2.13,q41-v2.13,q42-v2.13,q43-v2.13,q44-v2.13,q45-v2.13,q46-v2.13,\
q47-v2.13,q48-v2.13,q49-v2.13,q5-v2.13,q50-v2.13,q51-v2.13,q52-v2.13,q53-v2.13,q54-v2.13,\
q55-v2.13,q56-v2.13,q57-v2.13,q58-v2.13,q59-v2.13,q6-v2.13,q60-v2.13,q61-v2.13,q62-v2.13,\
q63-v2.13,q64-v2.13,q65-v2.13,q66-v2.13,q67-v2.13,q68-v2.13,q69-v2.13,q7-v2.13,q70-v2.13,\
q71-v2.13,q72-v2.13,q73-v2.13,q74-v2.13,q75-v2.13,q76-v2.13,q77-v2.13,q78-v2.13,q79-v2.13,\
q8-v2.13,q80-v2.13,q81-v2.13,q82-v2.13,q83-v2.13,q84-v2.13,q85-v2.13,q86-v2.13,q87-v2.13,\
q88-v2.13,q89-v2.13,q9-v2.13,q90-v2.13,q91-v2.13,q92-v2.13,q93-v2.13,q94-v2.13,q95-v2.13,\
q96-v2.13,q97-v2.13,q98-v2.13,q99-v2.13,ss_max-v2.13 \
true > /media/ephemeral0/spark_run.log 2>&1 &!

Summarize the results

After the Spark job finishes, retrieve the test result file from the output S3 bucket at s3:///benchmark_run/timestamp=xxxx/summary.csv/xxx.csv. This can be done either through the Amazon S3 console by navigating to the specified bucket location or by using the AWS Command Line Interface (AWS CLI). The Spark benchmark application organizes the data by creating a timestamp folder and placing a summary file within a folder labeled summary.csv. The output CSV files contain four columns without headers:

Query name
Median time
Minimum time
Maximum time

With the data from three separate test runs with one iteration each time, we can calculate the average and geometric mean of the benchmark runtimes.
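A minimal Python sketch of that calculation follows, assuming each run's summary file has already been downloaded and read into a string. It treats the median column as the query's runtime for that run, averages each query across runs, and then reports the total of those averages and their geometric mean. The function names and the sample rows are hypothetical.

```python
import csv
import io
import statistics


def parse_summary(csv_text):
    """Map query name -> median time from a headerless four-column summary CSV."""
    times = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        name, median = row[0], row[1]  # columns 3 and 4 (min, max) are unused here
        times[name] = float(median)
    return times


def aggregate_runs(runs):
    """Combine per-run {query: time} dicts into summary statistics."""
    queries = list(runs[0].keys())
    per_query_avg = {
        q: statistics.mean([run[q] for run in runs]) for q in queries
    }
    averages = list(per_query_avg.values())
    return {
        "per_query_avg": per_query_avg,
        "total_runtime": sum(averages),
        "geometric_mean": statistics.geometric_mean(averages),
    }


if __name__ == "__main__":
    # Hypothetical sample rows for illustration only; not benchmark output.
    run_1 = "q1-v2.13,2.0,2.0,2.0\nq2-v2.13,8.0,8.0,8.0\n"
    run_2 = "q1-v2.13,4.0,4.0,4.0\nq2-v2.13,8.0,8.0,8.0\n"
    stats = aggregate_runs([parse_summary(run_1), parse_summary(run_2)])
    print(stats)
```

The geometric mean is less sensitive than the total to a handful of long-running queries, which is why both numbers are usually reported together.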

Run the TPC-DS benchmark with the EMR runtime for Spark

Most of the instructions are similar to Steps to run Spark Benchmarking with a few Iceberg-specific details.

Prerequisites

Complete the following prerequisite steps:

Run aws configure to configure the AWS CLI shell to point to the benchmarking AWS account. Refer to Configure the AWS CLI for instructions.
Upload the benchmark application JAR file to Amazon S3.

Deploy the EMR cluster and run the benchmark job

Complete the following steps to run the benchmark job:

Use the AWS CLI command as shown in Deploy EMR on EC2 Cluster and run benchmark job to spin up an EMR on EC2 cluster. Make sure to enable Iceberg. See Create an Iceberg cluster for more details. Choose the correct Amazon EMR version, root volume size, and the same resource configuration as the open source Flintrock setup. Refer to create-cluster for a detailed description of the AWS CLI options.
Store the cluster ID from the response. We need this for the next step.
Submit the benchmark job in Amazon EMR using add-steps from the AWS CLI:
1. Replace the cluster ID placeholder in the command with the cluster ID from Step 2.
2. The benchmark application is at s3:///spark-benchmark-assembly-3.5.3.jar.
3. Choose the correct Iceberg catalog warehouse location and database that has the created Iceberg tables. This should be the same as the one used for the open source TPC-DS benchmark run.
4. The results will be in s3:///benchmark_run.

aws emr add-steps --cluster-id \
--steps Type=Spark,Name="SPARK Iceberg EMR TPCDS Benchmark Job",\
Args=[--class,com.amazonaws.eks.tpcds.BenchmarkSQL,\
--conf,spark.driver.cores=4,\
--conf,spark.driver.memory=10g,\
--conf,spark.executor.cores=16,\
--conf,spark.executor.memory=100g,\
--conf,spark.executor.instances=8,\
--conf,spark.network.timeout=2000,\
--conf,spark.executor.heartbeatInterval=300s,\
--conf,spark.dynamicAllocation.enabled=false,\
--conf,spark.shuffle.service.enabled=false,\
--conf,spark.sql.iceberg.data-prefetch.enabled=true,\
--conf,spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,\
--conf,spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog,\
--conf,spark.sql.catalog.local.type=hadoop,\
--conf,spark.sql.catalog.local.warehouse=s3:///,\
--conf,spark.sql.defaultCatalog=local,\
--conf,spark.sql.catalog.local.io-impl=org.apache.iceberg.aws.s3.S3FileIO,\
s3:///spark-benchmark-assembly-3.5.3.jar,\
s3:///benchmark_run,3000,1,false,\
'q1-v2.13,q10-v2.13,q11-v2.13,q12-v2.13,q13-v2.13,q14a-v2.13,q14b-v2.13,q15-v2.13,q16-v2.13,q17-v2.13,q18-v2.13,q19-v2.13,q2-v2.13,q20-v2.13,q21-v2.13,q22-v2.13,q23a-v2.13,q23b-v2.13,q24a-v2.13,q24b-v2.13,q25-v2.13,q26-v2.13,q27-v2.13,q28-v2.13,q29-v2.13,q3-v2.13,q30-v2.13,q31-v2.13,q32-v2.13,q33-v2.13,q34-v2.13,q35-v2.13,q36-v2.13,q37-v2.13,q38-v2.13,q39a-v2.13,q39b-v2.13,q4-v2.13,q40-v2.13,q41-v2.13,q42-v2.13,q43-v2.13,q44-v2.13,q45-v2.13,q46-v2.13,q47-v2.13,q48-v2.13,q49-v2.13,q5-v2.13,q50-v2.13,q51-v2.13,q52-v2.13,q53-v2.13,q54-v2.13,q55-v2.13,q56-v2.13,q57-v2.13,q58-v2.13,q59-v2.13,q6-v2.13,q60-v2.13,q61-v2.13,q62-v2.13,q63-v2.13,q64-v2.13,q65-v2.13,q66-v2.13,q67-v2.13,q68-v2.13,q69-v2.13,q7-v2.13,q70-v2.13,q71-v2.13,q72-v2.13,q73-v2.13,q74-v2.13,q75-v2.13,q76-v2.13,q77-v2.13,q78-v2.13,q79-v2.13,q8-v2.13,q80-v2.13,q81-v2.13,q82-v2.13,q83-v2.13,q84-v2.13,q85-v2.13,q86-v2.13,q87-v2.13,q88-v2.13,q89-v2.13,q9-v2.13,q90-v2.13,q91-v2.13,q92-v2.13,q93-v2.13,q94-v2.13,q95-v2.13,q96-v2.13,q97-v2.13,q98-v2.13,q99-v2.13,ss_max-v2.13',\
true,],ActionOnFailure=CONTINUE --region

Summarize the results

After the step is complete, you can see the summarized benchmark result at s3:///benchmark_run/timestamp=xxxx/summary.csv/xxx.csv in the same way as the previous run and compute the average and geometric mean of the query runtimes.

Clean up

To prevent any future charges, delete the resources you created by following the instructions provided in the Cleanup section of the GitHub repository.

Summary

Amazon EMR is consistently enhancing the EMR runtime for Spark when used with Iceberg tables, achieving performance that’s 3.6 times faster than open source Spark 3.5.3 and Iceberg 1.6.1 with EMR 7.5 on TPC-DS 3 TB, v2.13. This is a further increase of 32% from EMR 7.1. We encourage you to keep up to date with the latest Amazon EMR releases to fully benefit from ongoing performance improvements.

To stay informed, subscribe to the AWS Big Data Blog’s RSS feed, where you can find updates on the EMR runtime for Spark and Iceberg, as well as tips on configuration best practices and tuning recommendations.

About the Authors

Atul Felix Payapilly is a software development engineer for Amazon EMR at Amazon Web Services.

Udit Mehrotra is an Engineering Manager for EMR at Amazon Web Services.
