Links to Amazon EMR on EKS best practices guides on GitHub

We built the Amazon EMR on EKS Best Practices Guide through open source community collaboration so that we can iterate quickly and provide recommendations for creating and running a virtual cluster. We recommend that you use the guide for the topics in the following sections. Choose the links in each section to go to the GitHub site.

Security

Note

For more information on security with Amazon EMR on EKS, see Amazon EMR on EKS security best practices.

Encryption best practices: how to use encryption for data at rest and in transit.

Managing network security describes how to configure security groups for pods for Amazon EMR on EKS while you connect to data sources that are hosted in AWS services like Amazon RDS and Amazon Redshift.

Using AWS Secrets Manager to store secrets.

PySpark job submission

PySpark job submission: describes the packaging formats that you can use for PySpark applications, such as zip, egg, wheel, and pex.

Storage

Using EBS volumes: how to use static and dynamic provisioning for jobs that need EBS volumes.

Using Amazon FSx for Lustre volumes: how to use static and dynamic provisioning for jobs that need Amazon FSx for Lustre volumes.

Using instance store volumes: how to use instance store volumes for job processing.

Metastore integration

Using Hive metastore: offers different ways to use Hive metastore.

Using AWS Glue: offers different ways to configure AWS Glue catalog.

Debugging

Using Spark debugging: how to change the log level.

Connecting to Spark UI on the driver pod.

How to use self-hosted Spark history server with Amazon EMR on EKS.

Troubleshooting Amazon EMR on EKS issues

Troubleshooting.

Node placement

Using Kubernetes node selectors for single-AZ and other use cases.

Using Fargate node placement.

Performance

Using Dynamic Resource Allocation (DRA).

By default, spark.dynamicAllocation.preallocateExecutors is enabled in Amazon EMR Spark. When spark.dynamicAllocation.initialExecutors and spark.dynamicAllocation.minExecutors are not set, Spark may request a large number of executors at startup based on estimated task counts, even for small workloads. To avoid excessive container churn, use one of the following approaches:

Set spark.dynamicAllocation.initialExecutors or spark.dynamicAllocation.minExecutors to a value appropriate for your workload size.
Set spark.dynamicAllocation.preallocateExecutors.maxEstimatedTasks to a lower value to limit the number of executors requested at startup.
Set spark.dynamicAllocation.preallocateExecutors to false to disable executor preallocation entirely.
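The three approaches above can be sketched as Spark property sets rendered into a string of --conf flags, the form that a spark-submit style parameter string takes. This is a minimal illustration, not a recommended configuration: the helper name to_spark_submit_parameters and the numeric values (2 and 100) are hypothetical, and the sketch assumes dynamic allocation is already configured elsewhere in the job.

```python
from typing import Mapping


def to_spark_submit_parameters(conf: Mapping[str, object]) -> str:
    """Render Spark properties as a space-separated string of --conf flags.

    Hypothetical helper for illustration; it is not part of any AWS SDK.
    """
    parts = []
    for key, value in conf.items():
        # Spark boolean properties are written as lowercase true/false.
        rendered = str(value).lower() if isinstance(value, bool) else str(value)
        parts.append(f"--conf {key}={rendered}")
    return " ".join(parts)


# Approach 1: set the initial and minimum executor counts explicitly.
# The value 2 is an assumed example; size it for your own workload.
sized_start = {
    "spark.dynamicAllocation.initialExecutors": 2,
    "spark.dynamicAllocation.minExecutors": 2,
}

# Approach 2: lower the estimated-task cap used for preallocation.
# The value 100 is an assumed example.
capped_estimate = {
    "spark.dynamicAllocation.preallocateExecutors.maxEstimatedTasks": 100,
}

# Approach 3: disable executor preallocation entirely.
no_preallocation = {
    "spark.dynamicAllocation.preallocateExecutors": False,
}

if __name__ == "__main__":
    for label, conf in [
        ("sized start", sized_start),
        ("capped estimate", capped_estimate),
        ("no preallocation", no_preallocation),
    ]:
        print(f"{label}: {to_spark_submit_parameters(conf)}")
```

Only one of the three property sets is needed for a given job; which one fits depends on how predictable the workload size is.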

EKS best practices for the Amazon VPC Container Network Interface (CNI) plugin, Cluster Autoscaler, and CoreDNS.

Cost optimization

Using Spot Instances: Amazon EC2 Spot Instance best practices and how to use the Spark node decommission feature.

Using AWS Outposts

Running Amazon EMR on EKS using AWS Outposts
