

# Managing Amazon EMR on EKS job runs
<a name="emr-eks-jobs-manage"></a>

The following sections cover topics that help you manage your Amazon EMR on EKS job runs. These include configuring job run parameters when you use the AWS CLI, configuring how your log data is stored, running Spark SQL scripts to run queries, understanding job run states, and knowing how to monitor jobs. You can work through these topics, generally in order, if you want to set up and complete a job run to process data.

**Topics**
+ [Managing job runs with the AWS CLI](emr-eks-jobs-CLI.md)
+ [Running Spark SQL scripts through the StartJobRun API](emr-eks-jobs-spark-sql-parameters.md)
+ [Job run states](emr-eks-jobs-states.md)
+ [Viewing jobs in the Amazon EMR console](emr-eks-jobs-console.md)
+ [Common errors when running jobs](emr-eks-jobs-error.md)

# Managing job runs with the AWS CLI
<a name="emr-eks-jobs-CLI"></a>

This topic covers how to manage job runs with the AWS Command Line Interface (AWS CLI). It goes into detail regarding properties, like security parameters, the driver, and various override settings. It also includes subtopics that cover various ways to configure logging.

**Topics**
+ [Options for configuring a job run](#emr-eks-jobs-parameters)
+ [Configure a job run to use Amazon S3 logs](emr-eks-jobs-s3.md)
+ [Configure a job run to use Amazon CloudWatch Logs](emr-eks-jobs-cloudwatch.md)
+ [List job runs](#emr-eks-jobs-list)
+ [Describe a job run](#emr-eks-jobs-describe)
+ [Cancel a job run](#emr-eks-jobs-cancel)

## Options for configuring a job run
<a name="emr-eks-jobs-parameters"></a>

Use the following options to configure job run parameters:
+ `--execution-role-arn`: You must provide an IAM role that is used for running jobs. For more information, see [Using job execution roles with Amazon EMR on EKS](iam-execution-role.md). 
+ `--release-label`: You can deploy Amazon EMR on EKS with Amazon EMR versions 5.32.0 and 6.2.0 and later. Amazon EMR on EKS is not supported in previous Amazon EMR release versions. For more information, see [Amazon EMR on EKS releases](emr-eks-releases.md). 
+ `--job-driver`: The job driver is used to provide input on the main job. This is a union type field where you can pass only one value for the job type that you want to run. Supported job types include:
  + Spark submit jobs - Used to run a command through Spark submit. You can use this job type to run Scala, PySpark, SparkR, Spark SQL, and any other supported jobs through Spark submit. This job type has the following parameters:
    + Entrypoint - This is the HCFS (Hadoop compatible file system) reference to the main jar/py file you want to run.
    + EntryPointArguments - This is an array of arguments that you want to pass to your main jar/py file. You should handle reading these parameters using your entrypoint code. Each argument in the array must be separated by a comma. EntryPointArguments cannot contain brackets or parentheses, such as (), {}, or [].
    + SparkSubmitParameters - These are the additional Spark parameters that you want to send to the job. Use this parameter to override default Spark properties, such as driver memory or number of executors, with arguments like `--conf` or `--class`. For additional information, see [Launching Applications with spark-submit](https://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit).
  + Spark SQL jobs - Used to run a SQL query file through Spark SQL. You can use this job type to run SparkSQL jobs. This job type has the following parameters:
    + Entrypoint - This is the HCFS (Hadoop compatible file system) reference to the SQL query file you want to run.

      For a list of additional Spark parameters you can use for a Spark SQL job, see [Running Spark SQL scripts through the StartJobRun API](emr-eks-jobs-spark-sql-parameters.md).
+ `--configuration-overrides`: You can override the default configurations for applications by supplying a configuration object. You can use a shorthand syntax to provide the configuration or you can reference the configuration object in a JSON file. Configuration objects consist of a classification, properties, and optional nested configurations. Properties consist of the settings you want to override in that file. You can specify multiple classifications for multiple applications in a single JSON object. The configuration classifications that are available vary by Amazon EMR release version. For a list of configuration classifications that are available for each release version of Amazon EMR, see [Amazon EMR on EKS releases](emr-eks-releases.md).

  If you pass the same configuration in an application override and in Spark submit parameters, the Spark submit parameters take precedence. The complete configuration priority list follows, in order of highest priority to lowest priority.
  + Configuration supplied when creating `SparkSession`.
  + Configuration supplied as part of `sparkSubmitParameters` using `--conf`.
  + Configuration provided as part of application overrides.
  + Optimized configurations chosen by Amazon EMR for the release.
  + Default open source configurations for the application.

  To monitor job runs using Amazon CloudWatch or Amazon S3, you must provide the corresponding configuration details. For more information, see [Configure a job run to use Amazon S3 logs](emr-eks-jobs-s3.md) and [Configure a job run to use Amazon CloudWatch Logs](emr-eks-jobs-cloudwatch.md). If the S3 bucket or CloudWatch log group does not exist, then Amazon EMR creates it before uploading logs.
+ For an additional list of Kubernetes configuration options, see [Spark Properties on Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html#configuration). 

  The following Spark configurations are not supported.
  + `spark.kubernetes.authenticate.driver.serviceAccountName`
  + `spark.kubernetes.authenticate.executor.serviceAccountName`
  + `spark.kubernetes.namespace`
  + `spark.kubernetes.driver.pod.name`
  + `spark.kubernetes.container.image.pullPolicy`
  + `spark.kubernetes.container.image`
**Note**  
You can use `spark.kubernetes.container.image` for customized Docker images. For more information, see [Customizing Docker images for Amazon EMR on EKS](docker-custom-images.md).
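To illustrate how these options fit together, the following is a minimal sketch of a `start-job-run` request that uses a Spark submit job driver. The virtual cluster ID, role ARN, and S3 paths are placeholders that you replace with your own values. Because Spark submit parameters take precedence over application overrides, the effective driver memory in this request is 4G, not the 2G set in the `spark-defaults` classification.

```
{
  "name": "my-spark-submit-job",
  "virtualClusterId": "virtual-cluster-id",
  "executionRoleArn": "arn:aws:iam::111122223333:role/execution-role",
  "releaseLabel": "emr-6.2.0-latest",
  "jobDriver": {
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://amzn-s3-demo-bucket/scripts/my-script.py",
      "entryPointArguments": ["input-path", "output-path"],
      "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.driver.memory=4G"
    }
  },
  "configurationOverrides": {
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.driver.memory": "2G"
        }
      }
    ]
  }
}
```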

# Configure a job run to use Amazon S3 logs
<a name="emr-eks-jobs-s3"></a>

To monitor job progress and to troubleshoot failures, you must configure your jobs to send log information to Amazon S3, Amazon CloudWatch Logs, or both. This topic helps you get started publishing application logs to Amazon S3 for jobs that are launched with Amazon EMR on EKS.

**S3 logs IAM policy**

Before your jobs can send log data to Amazon S3, the following permissions must be included in the permissions policy for the job execution role. Replace *amzn-s3-demo-logging-bucket* with the name of your logging bucket.

------
#### [ JSON ]

****  

```
{
  "Version":"2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-logging-bucket",
        "arn:aws:s3:::amzn-s3-demo-logging-bucket/*"
      ],
      "Sid": "AllowS3Putobject"
    }
  ]
}
```

------

**Note**  
Amazon EMR on EKS can also create an Amazon S3 bucket. If an Amazon S3 bucket is not available, include the `"s3:CreateBucket"` permission in the IAM policy.

After you've given your execution role the proper permissions to send logs to Amazon S3, your log data is sent to the following Amazon S3 locations when `s3MonitoringConfiguration` is passed in the `monitoringConfiguration` section of a `start-job-run` request, as shown in [Managing job runs with the AWS CLI](emr-eks-jobs-CLI.md).
+ Submitter Logs - /*logUri*/*virtual-cluster-id*/jobs/*job-id*/containers/*pod-name*/(stderr.gz/stdout.gz)
+ Driver Logs - /*logUri*/*virtual-cluster-id*/jobs/*job-id*/containers/*spark-application-id*/spark-*job-id*-driver/(stderr.gz/stdout.gz)
+ Executor Logs - /*logUri*/*virtual-cluster-id*/jobs/*job-id*/containers/*spark-application-id*/*executor-pod-name*/(stderr.gz/stdout.gz)
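Because the path layout above is deterministic, you can assemble log locations programmatically when automating troubleshooting. The following is a minimal sketch; `driver_log_prefix` is a hypothetical helper for illustration, not part of any AWS SDK.

```python
def driver_log_prefix(log_uri, virtual_cluster_id, job_id, spark_application_id):
    """Build the S3 prefix where Spark driver logs for a job run are uploaded,
    following the layout described above. All arguments are plain strings."""
    base = log_uri.rstrip("/")
    return (f"{base}/{virtual_cluster_id}/jobs/{job_id}"
            f"/containers/{spark_application_id}/spark-{job_id}-driver")

# The stdout log for the driver then lives at <prefix>/stdout.gz.
prefix = driver_log_prefix("s3://amzn-s3-demo-bucket/logs", "vc-1234", "job-5678", "spark-abcd")
print(prefix + "/stdout.gz")
```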

# Configure a job run to use Amazon CloudWatch Logs
<a name="emr-eks-jobs-cloudwatch"></a>

To monitor job progress and to troubleshoot failures, you must configure your jobs to send log information to Amazon S3, Amazon CloudWatch Logs, or both. This topic helps you get started using CloudWatch Logs on your jobs that are launched with Amazon EMR on EKS. For more information about CloudWatch Logs, see [Monitoring Log Files](https://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/WhatIsCloudWatchLogs.html) in the Amazon CloudWatch User Guide.

**CloudWatch Logs IAM policy**

For your jobs to send log data to CloudWatch Logs, the following permissions must be included in the permissions policy for the job execution role. Replace *my_log_group_name* and *my_log_stream_prefix* with the names of your CloudWatch log group and log stream prefix, respectively. Amazon EMR on EKS creates the log group and log stream if they do not exist, as long as the execution role ARN has the appropriate permissions. 

------
#### [ JSON ]

****  

```
{
  "Version":"2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogStream",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource": [
        "arn:aws:logs:*:*:*"
      ],
      "Sid": "AllowLOGSCreatelogstream"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:PutLogEvents"
      ],
      "Resource": [
        "arn:aws:logs:*:*:log-group:my_log_group_name:log-stream:my_log_stream_prefix/*"
      ],
      "Sid": "AllowLOGSPutlogevents"
    }
  ]
}
```

------

**Note**  
Amazon EMR on EKS can also create the log group. If the log group does not exist, include the `"logs:CreateLogGroup"` permission in the IAM policy.

After you've given your execution role the proper permissions, your application sends its log data to CloudWatch Logs when `cloudWatchMonitoringConfiguration` is passed in the `monitoringConfiguration` section of a `start-job-run` request, as shown in [Managing job runs with the AWS CLI](emr-eks-jobs-CLI.md).

In the `StartJobRun` API, *log_group_name* is the log group name for CloudWatch, and *log_stream_prefix* is the log stream name prefix for CloudWatch. You can view and search these logs in the AWS Management Console.
+ Submitter logs - *logGroup*/*logStreamPrefix*/*virtual-cluster-id*/jobs/*job-id*/containers/*pod-name*/(stderr/stdout)
+ Driver logs - *logGroup*/*logStreamPrefix*/*virtual-cluster-id*/jobs/*job-id*/containers/*spark-application-id*/spark-*job-id*-driver/(stderr/stdout)
+ Executor logs - *logGroup*/*logStreamPrefix*/*virtual-cluster-id*/jobs/*job-id*/containers/*spark-application-id*/*executor-pod-name*/(stderr/stdout)
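For example, a `monitoringConfiguration` section like the following fragment (the group and prefix names are placeholders) routes the submitter, driver, and executor logs to streams under the given log group and prefix:

```
"monitoringConfiguration": {
  "cloudWatchMonitoringConfiguration": {
    "logGroupName": "my_log_group_name",
    "logStreamNamePrefix": "my_log_stream_prefix"
  }
}
```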

## List job runs
<a name="emr-eks-jobs-list"></a>

You can run `list-job-runs` to show the states of job runs, as the following example demonstrates. 

```
aws emr-containers list-job-runs --virtual-cluster-id <cluster-id>
```

## Describe a job run
<a name="emr-eks-jobs-describe"></a>

You can run `describe-job-run` to get more details about the job, such as job state, state details, and job name, as the following example demonstrates. 

```
aws emr-containers describe-job-run --virtual-cluster-id cluster-id --id job-run-id
```

## Cancel a job run
<a name="emr-eks-jobs-cancel"></a>

You can run `cancel-job-run` to cancel running jobs, as the following example demonstrates.

```
aws emr-containers cancel-job-run --virtual-cluster-id cluster-id --id job-run-id
```

# Running Spark SQL scripts through the StartJobRun API
<a name="emr-eks-jobs-spark-sql-parameters"></a>

Amazon EMR on EKS releases 6.7.0 and higher include a Spark SQL job driver so that you can run Spark SQL scripts through the `StartJobRun` API. You can supply SQL entry-point files to directly run Spark SQL queries on Amazon EMR on EKS with the `StartJobRun` API, without any modifications to existing Spark SQL scripts. The following table lists the Spark parameters that are supported for Spark SQL jobs through the `StartJobRun` API. Use these parameters to override default Spark properties.


| Option | Description | 
| --- | --- | 
| --name NAME | Name of the application. | 
| --jars JARS | Comma-separated list of jars to include on the driver and executor classpaths. | 
| --packages | Comma-separated list of Maven coordinates of jars to include on the driver and executor classpaths. | 
| --exclude-packages | Comma-separated list of groupId:artifactId pairs to exclude while resolving the dependencies provided in --packages, to avoid dependency conflicts. | 
| --repositories | Comma-separated list of additional remote repositories to search for the Maven coordinates given with --packages. | 
| --files FILES | Comma-separated list of files to be placed in the working directory of each executor. | 
| --conf PROP=VALUE | Spark configuration property. | 
| --properties-file FILE | Path to a file from which to load extra properties. | 
| --driver-memory MEM | Memory for driver. Default 1024MB. | 
| --driver-java-options | Extra Java options to pass to the driver. | 
| --driver-library-path | Extra library path entries to pass to the driver. | 
| --driver-class-path | Extra classpath entries to pass to the driver. | 
| --executor-memory MEM | Memory per executor. Default 1GB. | 
| --driver-cores NUM | Number of cores used by the driver. | 
| --total-executor-cores NUM | Total cores for all executors. | 
| --executor-cores NUM | Number of cores used by each executor. | 
| --num-executors NUM | Number of executors to launch. | 
| -hivevar <key=value> | Variable substitution to apply to Hive commands, for example, -hivevar A=B | 
| -hiveconf <property=value> | Value to use for the given property. | 

For a Spark SQL job, create a `start-job-run-request.json` file and specify the required parameters for your job run, as in the following example:

```
{
  "name": "myjob", 
  "virtualClusterId": "123456",  
  "executionRoleArn": "iam_role_name_for_job_execution", 
  "releaseLabel": "emr-6.7.0-latest", 
  "jobDriver": {
    "sparkSqlJobDriver": {
      "entryPoint": "entryPoint_location",
      "sparkSqlParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
    }
  }, 
  "configurationOverrides": {
    "applicationConfiguration": [
      {
        "classification": "spark-defaults", 
        "properties": {
          "spark.driver.memory":"2G"
         }
      }
    ], 
    "monitoringConfiguration": {
      "persistentAppUI": "ENABLED", 
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "my_log_group", 
        "logStreamNamePrefix": "log_stream_prefix"
      }, 
      "s3MonitoringConfiguration": {
        "logUri": "s3://my_s3_log_location"
      }
    }
  }
}
```

# Job run states
<a name="emr-eks-jobs-states"></a>

When you submit a job run to an Amazon EMR on EKS job queue, the job run enters the `PENDING` state. It then passes through the following states until it succeeds (exits with code `0`) or fails (exits with a non-zero code). 

Job runs can have the following states:
+ `PENDING` ‐ The initial job state when the job run is submitted to Amazon EMR on EKS. The job is waiting to be submitted to the virtual cluster, and Amazon EMR on EKS is working on submitting this job.
+ `SUBMITTED` ‐ A job run that has been successfully submitted to the virtual cluster. The cluster scheduler then tries to run this job on the cluster.
+ `RUNNING` ‐ A job run that is running in the virtual cluster. In Spark applications, this means that the Spark driver process is in the `running` state.
+ `FAILED` ‐ A job run that failed to be submitted to the virtual cluster or that completed unsuccessfully. Look at StateDetails and FailureReason to find additional information about this job failure.
+ `COMPLETED` ‐ A job run that has completed successfully.
+ `CANCEL_PENDING` ‐ A job run has been requested for cancellation. Amazon EMR on EKS is trying to cancel the job on the virtual cluster.
+ `CANCELLED` ‐ A job run that was cancelled successfully.
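When polling job runs programmatically (for example, with `describe-job-run`), it helps to distinguish terminal states from in-flight ones. The following is a small sketch based on the state list above; the helper name is illustrative, not part of any SDK.

```python
# States from which a job run will not transition again.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}

def is_terminal(state):
    """Return True when a job run has finished (successfully, with a failure,
    or by cancellation) and its state will no longer change."""
    return state in TERMINAL_STATES

# A CANCEL_PENDING job is still in flight: the cancellation has been
# requested but has not yet completed.
print(is_terminal("CANCEL_PENDING"))
```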

# Viewing jobs in the Amazon EMR console
<a name="emr-eks-jobs-console"></a>

Job run data is available to view, so you can monitor each job as it passes through the states. To view jobs in the Amazon EMR console, perform the following steps.

1. In the left navigation menu of the Amazon EMR console, under **Amazon EMR on EKS**, choose **Virtual clusters**.

1. From the list of virtual clusters, select the virtual cluster for which you want to view jobs.

1. In the **Job runs** table, select **View logs** to view the details of a job run.

**Note**  
Support for the one-click experience is enabled by default. It can be turned off by setting `persistentAppUI` to `DISABLED` in `monitoringConfiguration` during job submission. For more information, see [View Persistent Application User Interfaces](https://docs.aws.amazon.com/emr/latest/ManagementGuide/app-history-spark-UI.html).

# Common errors when running jobs
<a name="emr-eks-jobs-error"></a>

The following errors may occur when you call the `StartJobRun` API. The table lists each error and provides mitigation steps so that you can address issues quickly.


| Error Message | Error Condition | Recommended Next Step | 
| --- | --- | --- | 
|  error: argument --*argument* is required  | Required parameters are missing. | Add the missing arguments to the API request. | 
| An error occurred (AccessDeniedException) when calling the StartJobRun operation: User: ARN is not authorized to perform: emr-containers:StartJobRun | Execution role is missing. | See [Using job execution roles with Amazon EMR on EKS](iam-execution-role.md).  | 
|  An error occurred (AccessDeniedException) when calling the StartJobRun operation: User: *ARN* is not authorized to perform: emr-containers:StartJobRun  |  Caller doesn't have permission to the execution role [valid / not valid format] via condition keys.  | See [Using job execution roles with Amazon EMR on EKS](iam-execution-role.md).  | 
|  An error occurred (AccessDeniedException) when calling the StartJobRun operation: User: *ARN* is not authorized to perform: emr-containers:StartJobRun  |  Job submitter and Execution role ARN are from different accounts.  | Ensure that job submitter and execution role ARN are from the same AWS account. | 
|  1 validation error detected: Value *Role* at 'executionRoleArn' failed to satisfy the ARN regular expression pattern: ^arn:(aws[a-zA-Z0-9-]\*):iam::(\d{12})?:(role((\u002F)\|(\u002F[\u0021-\u007F]+\u002F))[\w+=,.@-]+)  |  Caller has permissions for the execution role via condition keys, but the role does not satisfy the constraints of the ARN format.  | Provide the execution role following the ARN format. See [Using job execution roles with Amazon EMR on EKS](iam-execution-role.md).  | 
|  An error occurred (ResourceNotFoundException) when calling the StartJobRun operation: Virtual cluster *Virtual Cluster ID* doesn't exist.  |  Virtual cluster ID is not found.  | Provide a virtual cluster ID registered with Amazon EMR on EKS. | 
|  An error occurred (ValidationException) when calling the StartJobRun operation: Virtual cluster state *state* is not valid to create resource JobRun.  |  Virtual cluster is not ready to execute job.  | See [Virtual cluster states](virtual-cluster.md#virtual-cluster-states).  | 
|  An error occurred (ResourceNotFoundException) when calling the StartJobRun operation: Release *RELEASE* doesn't exist.  |  The release specified in job submission is incorrect.  | See [Amazon EMR on EKS releases](emr-eks-releases.md).  | 
|  An error occurred (AccessDeniedException) when calling the StartJobRun operation: User: *ARN* is not authorized to perform: emr-containers:StartJobRun on resource: *ARN* with an explicit deny. An error occurred (AccessDeniedException) when calling the StartJobRun operation: User: *ARN* is not authorized to perform: emr-containers:StartJobRun on resource: *ARN*  | User is not authorized to call StartJobRun. | See [Using job execution roles with Amazon EMR on EKS](iam-execution-role.md).  | 
|  An error occurred (ValidationException) when calling the StartJobRun operation: configurationOverrides.monitoringConfiguration.s3MonitoringConfiguration.logUri failed to satisfy constraint : %s  |  S3 path URI syntax is not valid.  | logUri should be in the format of s3://...  | 

The following errors may occur when you call the `DescribeJobRun` API before the job runs.


| Error Message | Error Condition | Recommended Next Step | 
| --- | --- | --- | 
|  stateDetails: JobRun submission failed. Classification *classification* not supported. failureReason: VALIDATION_ERROR state: FAILED.  | Parameters in StartJobRun are not valid. | See [Amazon EMR on EKS releases](emr-eks-releases.md).  | 
|  stateDetails: Cluster *EKS Cluster ID* does not exist. failureReason: CLUSTER_UNAVAILABLE state: FAILED  | The EKS cluster is not available. | Check if the EKS cluster exists and has the right permissions. For more information, see [Setting up Amazon EMR on EKS](setting-up.md). | 
|  stateDetails: Cluster *EKS Cluster ID* does not have sufficient permissions. failureReason: CLUSTER_UNAVAILABLE state: FAILED  |  Amazon EMR does not have permissions to access the EKS cluster.  | Verify that permissions are set up for Amazon EMR on the registered namespace. For more information, see [Setting up Amazon EMR on EKS](setting-up.md). | 
|  stateDetails: Cluster *EKS Cluster ID* is currently not reachable. failureReason: CLUSTER_UNAVAILABLE state: FAILED  |  The EKS cluster is not reachable.  | Check if the EKS cluster exists and has the right permissions. For more information, see [Setting up Amazon EMR on EKS](setting-up.md). | 
|  stateDetails: JobRun submission failed due to an internal error. failureReason: INTERNAL_ERROR state: FAILED  |  An internal error has occurred with the EKS cluster.  | N/A | 
|  stateDetails: Cluster *EKS Cluster ID* does not have sufficient resources. failureReason: USER_ERROR state: FAILED  |  There are insufficient resources in the EKS cluster to run the job.  | Add more capacity to the EKS node group or set up the EKS Cluster Autoscaler. For more information, see [Cluster Autoscaler](https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html). | 

The following errors may occur when you call the `DescribeJobRun` API after the job runs.


| Error Message | Error Condition | Recommended Next Step | 
| --- | --- | --- | 
|  stateDetails: Trouble monitoring your JobRun. Cluster *EKS Cluster ID* does not exist. failureReason: CLUSTER_UNAVAILABLE state: FAILED  | The EKS cluster does not exist. | Check if the EKS cluster exists and has the right permissions. For more information, see [Setting up Amazon EMR on EKS](setting-up.md). | 
|  stateDetails: Trouble monitoring your JobRun. Cluster *EKS Cluster ID* does not have sufficient permissions. failureReason: CLUSTER_UNAVAILABLE state: FAILED  | Amazon EMR does not have permissions to access the EKS cluster. | Verify that permissions are set up for Amazon EMR on the registered namespace. For more information, see [Setting up Amazon EMR on EKS](setting-up.md). | 
|  stateDetails: Trouble monitoring your JobRun. Cluster *EKS Cluster ID* is currently not reachable. failureReason: CLUSTER_UNAVAILABLE state: FAILED  |  The EKS cluster is not reachable.  | Check if the EKS cluster exists and has the right permissions. For more information, see [Setting up Amazon EMR on EKS](setting-up.md). | 
|  stateDetails: Trouble monitoring your JobRun due to an internal error. failureReason: INTERNAL_ERROR state: FAILED  |  An internal error has occurred and is preventing JobRun monitoring.  | N/A | 

The following error may occur when a job cannot start and waits in the `SUBMITTED` state for 15 minutes. This can be caused by a lack of cluster resources.


| Error Message | Error Condition | Recommended Next Step | 
| --- | --- | --- | 
|  cluster timeout  | The job has been in the SUBMITTED state for 15 minutes or more. | You can override the default setting of 15 minutes for this parameter with the configuration override shown below.  | 

Use the following configuration to change the cluster timeout setting to 30 minutes. Notice that you provide the new `job-start-timeout` value in seconds:

```
{
  "configurationOverrides": {
    "applicationConfiguration": [
      {
        "classification": "emr-containers-defaults",
        "properties": {
          "job-start-timeout": "1800"
        }
      }
    ]
  }
}