

# Verify Spark driver service account security requirements for spark-submit
Security

The Spark driver pod uses a Kubernetes service account to access the Kubernetes API server to create and watch executor pods. Driver service account must have appropriate permissions to list, create, edit, patch and delete pods in your cluster. You can verify that you can list these resources by running the following command:

```
kubectl auth can-i list|create|edit|delete|patch pods
```

Verify that you have the necessary permissions by running each command.

```
kubectl auth can-i list pods
kubectl auth can-i create pods
kubectl auth can-i edit pods
kubectl auth can-i delete pods
kubectl auth can-i patch pods
```

The following rules apply to this service role: 

```
 rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - "*"
- apiGroups:
  - ""
  resources:
  - services
  verbs:
  - "*"
- apiGroups:
  - ""
  resources:
  - configmaps
  verbs:
  - "*"
- apiGroups:
  - ""
  resources:
  - persistentvolumeclaims
  verbs:
  - "*"
```

# Setting up IAM roles for service accounts (IRSA) for spark-submit
IAM roles for service roles for spark-submit

The following sections explain how to set up IAM roles for service accounts (IRSA) to authenticate and authorize Kubernetes service accounts so you can run Spark applications stored in Amazon S3.

## Prerequisites


Before trying any of the examples in this documentation, make sure that you have completed the following prerequisites:
+ [Finished setting up spark-submit](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/spark-submit-setup.html)
+ [Created an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html) and [uploaded](https://docs.aws.amazon.com/AmazonS3/latest/userguide/uploading-an-object-bucket.html) the spark application jar

## Configuring a Kubernetes service account to assume an IAM role
Configuring a Kubernetes service account

The following steps cover how to configure a Kubernetes service account to assume an AWS Identity and Access Management (IAM) role. After you configure the pods to use the service account, they can then access any AWS service that the role has permissions to access.

1. Create a policy file to allow read-only access to the Amazon S3 object you [uploaded](https://docs.aws.amazon.com/AmazonS3/latest/userguide/uploading-an-object-bucket.html):

   ```
   cat >my-policy.json <<EOF
   {
       "Version": "2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "s3:GetObject",
                   "s3:ListBucket"
               ],
               "Resource": [
                   "arn:aws:s3:::<my-spark-jar-bucket>",
                   "arn:aws:s3:::<my-spark-jar-bucket>/*"
               ]
           }
       ]
   }
   EOF
   ```

1. Create the IAM policy.

   ```
   aws iam create-policy --policy-name my-policy --policy-document file://my-policy.json
   ```

1. Create an IAM role and associate it with a Kubernetes service account for the Spark driver

   ```
   eksctl create iamserviceaccount --name my-spark-driver-sa --namespace spark-operator \
   --cluster my-cluster --role-name "my-role" \
   --attach-policy-arn arn:aws:iam::111122223333:policy/my-policy --approve
   ```

1. Create a YAML file with the required [permissions](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/spark-submit-security.html) for the Spark driver service account:

   ```
   cat >spark-rbac.yaml <<EOF
   apiVersion: rbac.authorization.k8s.io/v1
   kind: Role
   metadata:
     namespace: default
     name: emr-containers-role-spark
   rules:
   - apiGroups:
     - ""
     resources:
     - pods
     verbs:
     - "*"
   - apiGroups:
     - ""
     resources:
     - services
     verbs:
     - "*"
   - apiGroups:
     - ""
     resources:
     - configmaps
     verbs:
     - "*"
   - apiGroups:
     - ""
     resources:
     - persistentvolumeclaims
     verbs:
     - "*"
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: RoleBinding
   metadata:
     name: spark-role-binding
     namespace: default
   roleRef:
     apiGroup: rbac.authorization.k8s.io
     kind: Role
     name: emr-containers-role-spark
   subjects:
   - kind: ServiceAccount
     name: emr-containers-sa-spark
     namespace: default
   EOF
   ```

1. Apply the cluster role binding configurations.

   ```
   kubectl apply -f spark-rbac.yaml
   ```

1. The `kubectl` command should return confirmation of the created account.

   ```
   serviceaccount/emr-containers-sa-spark created
   clusterrolebinding.rbac.authorization.k8s.io/emr-containers-role-spark configured
   ```

## Running the Spark application
Running the Spark application

Amazon EMR 6.10.0 and higher supports spark-submit for running Spark applications on an Amazon EKS cluster. To run the Spark application, follow these steps:

1. Make sure that you have completed the steps in [ Setting up spark-submit for Amazon EMR on EKS](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/spark-submit-setup.html).

1. Set the values for the following environment variables:

   ```
   export SPARK_HOME=spark-home
   export MASTER_URL=k8s://Amazon EKS-cluster-endpoint
   ```

1. Now, submit the Spark application with the following command:

   ```
   $SPARK_HOME/bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master $MASTER_URL \
    --conf spark.kubernetes.container.image=895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.15.0:latest \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=emr-containers-sa-spark \
    --deploy-mode cluster \
    --conf spark.kubernetes.namespace=default \
    --conf "spark.driver.extraClassPath=/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*" \
    --conf "spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" \
    --conf "spark.executor.extraClassPath=/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*" \
    --conf "spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" \
    --conf spark.hadoop.fs.s3.customAWSCredentialsProvider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
    --conf spark.hadoop.fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem \
    --conf spark.hadoop.fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3.EMRFSDelegate \
    --conf spark.hadoop.fs.s3.buffer.dir=/mnt/s3 \
    --conf spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds="2000" \
    --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem="2" \
    --conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem="true" \
    s3://my-pod-bucket/spark-examples.jar 20
   ```

1. After the spark driver finishes the Spark job, you should see a log line at the end of the submission indicating that the Spark job has finished.

   ```
   23/11/24 17:02:14 INFO LoggingPodStatusWatcherImpl: Application org.apache.spark.examples.SparkPi with submission ID default:org-apache-spark-examples-sparkpi-4980808c03ff3115-driver finished
   23/11/24 17:02:14 INFO ShutdownHookManager: Shutdown hook called
   ```

## Cleanup


When you're done running your applications, you can perform cleanup with the following command.

```
kubectl delete -f spark-rbac.yaml
```