

# Running Spark jobs with spark-submit
spark-submit

Amazon EMR releases 6.10.0 and higher support `spark-submit` as a command-line tool that you can use to submit and execute Spark applications to an Amazon EMR on EKS cluster.

**Note**  
Amazon EMR calculates pricing on Amazon EKS based on vCPU and memory consumption. This calculation applies to driver and executor pods. This calculation starts from when you download your Amazon EMR application image until the Amazon EKS pod terminates and is rounded to the nearest second.

**Topics**
+ [

# Setting up spark-submit for Amazon EMR on EKS
](spark-submit-setup.md)
+ [

# Getting started with spark-submit for Amazon EMR on EKS
](spark-submit-gs.md)
+ [

# Verify Spark driver service account security requirements for spark-submit
](spark-submit-security.md)

# Setting up spark-submit for Amazon EMR on EKS
Setting up

Complete the following tasks to get set up before you can run an application with spark-submit on Amazon EMR on EKS. If you've already signed up for Amazon Web Services (AWS) and have used Amazon EKS, you are almost ready to use Amazon EMR on EKS. If you've already completed any of the prerequisites, you can skip those and move on to the next one.
+ **[Install or update to the latest version of the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html) ** – If you've already installed the AWS CLI, confirm that you have the latest version.
+ **[Set up kubectl and eksctl](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html) ** – eksctl is a command line tool that you use to communicate with Amazon EKS.
+ **[Get started with Amazon EKS – eksctl](https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html) ** – Follow the steps to create a new Kubernetes cluster with nodes in Amazon EKS.
+ **[Select an Amazon EMR base image URI](docker-custom-images-tag.md) (release 6.10.0 or higher)** – the `spark-submit` command is supported with Amazon EMR releases 6.10.0 and higher.
+ Confirm that the driver service account has appropriate permissions to create and watch executor pods. For more information, see [Verify Spark driver service account security requirements for spark-submit](spark-submit-security.md).
+ Set up your local [AWS credentials profile](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html).
+ From the Amazon EKS console, choose your EKS cluster, then find the EKS cluster endpoint, located under **Overview**, **Details**, then **API server endpoint**.

# Getting started with spark-submit for Amazon EMR on EKS
Getting started

Amazon EMR 6.10.0 and higher supports spark-submit for running Spark applications on an Amazon EKS cluster. The section that follows shows you how to submit a command for a Spark application.

## Run a Spark application
Run a Spark application

To run the Spark application, follow these steps:

1. Before you can run a Spark application with the `spark-submit` command, complete the steps in [Setting up spark-submit for Amazon EMR on EKS](spark-submit-setup.md). 

1. Run a container with an Amazon EMR on EKS base image. See [How to select a base image URI](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/docker-custom-images-tag.html) for more information.

   ```
   kubectl run -it containerName --image=EMRonEKSImage --command -n namespace /bin/bash
   ```

1. Set the values for the following environment variables:

   ```
   export SPARK_HOME=spark-home
   export MASTER_URL=k8s://Amazon EKS-cluster-endpoint
   ```

1. Now, submit the Spark application with the following command:

   ```
   $SPARK_HOME/bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master $MASTER_URL \
    --conf spark.kubernetes.container.image=895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.10.0:latest \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
    --deploy-mode cluster \
    --conf spark.kubernetes.namespace=spark-operator \
    local:///usr/lib/spark/examples/jars/spark-examples.jar 20
   ```

For more information about submitting applications to Spark, see [Submitting applications](https://spark.apache.org/docs/latest/submitting-applications.html) in the Apache Spark documentation.

**Important**  
`spark-submit` only supports cluster mode as the submission mechanism.

# Verify Spark driver service account security requirements for spark-submit
Security

The Spark driver pod uses a Kubernetes service account to access the Kubernetes API server to create and watch executor pods. Driver service account must have appropriate permissions to list, create, edit, patch and delete pods in your cluster. You can verify that you can list these resources by running the following command:

```
kubectl auth can-i list|create|edit|delete|patch pods
```

Verify that you have the necessary permissions by running each command.

```
kubectl auth can-i list pods
kubectl auth can-i create pods
kubectl auth can-i edit pods
kubectl auth can-i delete pods
kubectl auth can-i patch pods
```

The following rules apply to this service role: 

```
 rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - "*"
- apiGroups:
  - ""
  resources:
  - services
  verbs:
  - "*"
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - "*"
- apiGroups:
  - ""
  resources:
  - persistentvolumeclaims
  verbs:
  - "*"
```

# Setting up IAM roles for service accounts (IRSA) for spark-submit
IAM roles for service roles for spark-submit

The following sections explain how to set up IAM roles for service accounts (IRSA) to authenticate and authorize Kubernetes service accounts so you can run Spark applications stored in Amazon S3.

## Prerequisites


Before trying any of the examples in this documentation, make sure that you have completed the following prerequisites:
+ [Finished setting up spark-submit](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/spark-submit-setup.html)
+ [Created an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html) and [uploaded](https://docs.aws.amazon.com/AmazonS3/latest/userguide/uploading-an-object-bucket.html) the spark application jar

## Configuring a Kubernetes service account to assume an IAM role
Configuring a Kubernetes service account

The following steps cover how to configure a Kubernetes service account to assume an AWS Identity and Access Management (IAM) role. After you configure the pods to use the service account, they can then access any AWS service that the role has permissions to access.

1. Create a policy file to allow read-only access to the Amazon S3 object you [uploaded](https://docs.aws.amazon.com/AmazonS3/latest/userguide/uploading-an-object-bucket.html):

   ```
   cat >my-policy.json <<EOF
   {
       "Version": "2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "s3:GetObject",
                   "s3:ListBucket"
               ],
               "Resource": [
                   "arn:aws:s3:::<my-spark-jar-bucket>",
                   "arn:aws:s3:::<my-spark-jar-bucket>/*"
               ]
           }
       ]
   }
   EOF
   ```

1. Create the IAM policy.

   ```
   aws iam create-policy --policy-name my-policy --policy-document file://my-policy.json
   ```

1. Create an IAM role and associate it with a Kubernetes service account for the Spark driver

   ```
   eksctl create iamserviceaccount --name my-spark-driver-sa --namespace spark-operator \
   --cluster my-cluster --role-name "my-role" \
   --attach-policy-arn arn:aws:iam::111122223333:policy/my-policy --approve
   ```

1. Create a YAML file with the required [permissions](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/spark-submit-security.html) for the Spark driver service account:

   ```
   cat >spark-rbac.yaml <<EOF
   apiVersion: rbac.authorization.k8s.io/v1
   kind: Role
   metadata:
     namespace: default
     name: emr-containers-role-spark
   rules:
   - apiGroups:
     - ""
     resources:
     - pods
     verbs:
     - "*"
   - apiGroups:
     - ""
     resources:
     - services
     verbs:
     - "*"
   - apiGroups:
     - ""
     resources:
     - configmaps
     verbs:
     - "*"
   - apiGroups:
     - ""
     resources:
     - persistentvolumeclaims
     verbs:
     - "*"
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: RoleBinding
   metadata:
     name: spark-role-binding
     namespace: default
   roleRef:
     apiGroup: rbac.authorization.k8s.io
     kind: Role
     name: emr-containers-role-spark
   subjects:
   - kind: ServiceAccount
     name: emr-containers-sa-spark
     namespace: default
   EOF
   ```

1. Apply the cluster role binding configurations.

   ```
   kubectl apply -f spark-rbac.yaml
   ```

1. The `kubectl` command should return confirmation of the created account.

   ```
   serviceaccount/emr-containers-sa-spark created
   clusterrolebinding.rbac.authorization.k8s.io/emr-containers-role-spark configured
   ```

## Running the Spark application
Running the Spark application

Amazon EMR 6.10.0 and higher supports spark-submit for running Spark applications on an Amazon EKS cluster. To run the Spark application, follow these steps:

1. Make sure that you have completed the steps in [ Setting up spark-submit for Amazon EMR on EKS](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/spark-submit-setup.html).

1. Set the values for the following environment variables:

   ```
   export SPARK_HOME=spark-home
   export MASTER_URL=k8s://Amazon EKS-cluster-endpoint
   ```

1. Now, submit the Spark application with the following command:

   ```
   $SPARK_HOME/bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master $MASTER_URL \
    --conf spark.kubernetes.container.image=895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.15.0:latest \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=emr-containers-sa-spark \
    --deploy-mode cluster \
    --conf spark.kubernetes.namespace=default \
    --conf "spark.driver.extraClassPath=/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*" \
    --conf "spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" \
    --conf "spark.executor.extraClassPath=/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*" \
    --conf "spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" \
    --conf spark.hadoop.fs.s3.customAWSCredentialsProvider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
    --conf spark.hadoop.fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem \
    --conf spark.hadoop.fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3.EMRFSDelegate \
    --conf spark.hadoop.fs.s3.buffer.dir=/mnt/s3 \
    --conf spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds="2000" \
    --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem="2" \
    --conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem="true" \
    s3://my-pod-bucket/spark-examples.jar 20
   ```

1. After the spark driver finishes the Spark job, you should see a log line at the end of the submission indicating that the Spark job has finished.

   ```
   23/11/24 17:02:14 INFO LoggingPodStatusWatcherImpl: Application org.apache.spark.examples.SparkPi with submission ID default:org-apache-spark-examples-sparkpi-4980808c03ff3115-driver finished
   23/11/24 17:02:14 INFO ShutdownHookManager: Shutdown hook called
   ```

## Cleanup


When you're done running your applications, you can perform cleanup with the following command.

```
kubectl delete -f spark-rbac.yaml
```