

# Amazon EMR
<a name="automation-ref-emr"></a>

 AWS Systems Manager Automation provides predefined runbooks for Amazon EMR. For more information about runbooks, see [Working with runbooks](https://docs.aws.amazon.com/systems-manager/latest/userguide/automation-documents.html). For information about how to view runbook content, see [View runbook content](automation-runbook-reference.md#view-automation-json). 

**Topics**
+ [`AWSSupport-AnalyzeEMRLogs`](automation-awssupport-analyzeemrlogs.md)
+ [`AWSSupport-DiagnoseEMRLogsWithAthena`](awssupport-diagnose-emr-logs-with-athena.md)

# `AWSSupport-AnalyzeEMRLogs`
<a name="automation-awssupport-analyzeemrlogs"></a>

 **Description** 

This runbook helps identify errors while running a job on an Amazon EMR cluster. The runbook analyzes a list of defined logs on the file system and looks for a list of predefined keywords. These log entries are used to create Amazon CloudWatch Events events so you can take any needed actions based on the events. Optionally, the runbook publishes log entries to the Amazon CloudWatch Logs log group of your choosing. This runbook currently looks for the following errors and patterns in log files:
+  container_out_of_memory: YARN container ran out of memory, running job may fail.
+  yarn_nodemanager_health: CORE or TASK node is running low on disk space and will not be able to run tasks.
+  node_state_change: CORE or TASK node is unreachable by the MASTER node.
+  step_failure: An EMR Step has failed.
+  no_core_nodes_running: No CORE nodes are currently running, cluster is unhealthy.
+  hdfs_missing_blocks: There are missing HDFS blocks which could lead to data loss.
+  hdfs_high_util: HDFS Utilization is high, which may affect jobs and cluster health.
+  instance_controller_restart: Instance-Controller process has restarted. This process is essential for cluster health.
+  instance_controller_restart_legacy: Instance-Controller process has restarted. This process is essential for cluster health.
+  high_load: High Load Average detected, may affect node health reporting or result in timeouts or slowdowns.
+  yarn_node_blacklisted: CORE or TASK node has been blacklisted by YARN from running tasks.
+  yarn_node_lost: CORE or TASK node has been marked as LOST by YARN, possible connectivity issues.

 Instances associated with the `ClusterID` that you specify must be managed by AWS Systems Manager. You can run this automation once, schedule the automation to run at a specific time interval, or remove a schedule created previously by an automation. This runbook supports Amazon EMR release versions 5.20 to 6.30. 

 [Run this Automation (console)](https://console.aws.amazon.com/systems-manager/automation/execute/AWSSupport-AnalyzeEMRLogs) 

**Document type**

Automation

**Owner**

Amazon

**Platforms**

Linux, macOS, Windows

**Parameters**
+ AutomationAssumeRole

  Type: String

  Description: (Optional) The Amazon Resource Name (ARN) of the AWS Identity and Access Management (IAM) role that allows Systems Manager Automation to perform the actions on your behalf. If no role is specified, Systems Manager Automation uses the permissions of the user that starts this runbook.
+ ClusterID

  Type: String

  Description: (Required) The ID of the cluster whose node logs you want to analyze.
+ Operation

  Type: String

  Valid values: Run Once | Schedule | Remove Schedule

  Description: (Required) The operation to perform on the cluster.
+ IntervalTime

  Type: String

  Valid values: 5 minutes | 10 minutes | 15 minutes

   Description: (Optional) The duration of time between running the automation. This parameter is only applicable if you specify `Schedule` for the `Operation` parameter. 
+ LogToCloudWatchLogs

  Type: String

  Valid values: yes | no

   Description: (Optional) If you specify `yes` for the value of this parameter, the automation creates a CloudWatch Logs log group with the name specified in the `CloudWatchLogGroup` parameter to store any matched log entries. 
+ CloudWatchLogGroup

  Type: String

   Description: (Optional) The name of the CloudWatch Logs log group you want to store any matched log entries in. This parameter is only applicable if you specify `yes` for the `LogToCloudWatchLogs` parameter. 
+ CreateLogInsightsDashboard

  Type: String

  Valid values: yes | no

   Description: (Optional) If you specify `yes`, a CloudWatch dashboard is created if it does not already exist. This parameter is only applicable if you specify `yes` for the `LogToCloudWatchLogs` parameter. 
+ CreateMetricFilters

  Type: String

  Valid values: yes | no

   Description: (Optional) Specify `yes` if you want to create metric filters for the CloudWatch Logs log group. This parameter is only applicable if you specify `yes` for the `LogToCloudWatchLogs` parameter. 
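
The parameters above map onto the `Parameters` argument of the Systems Manager `StartAutomationExecution` API, where every value is passed as a list of strings. The following is a minimal sketch of that mapping, not an official client. The helper names `build_analyze_emr_logs_parameters` and `start_analyze_emr_logs` are hypothetical, and the second helper assumes that `boto3` is installed and that AWS credentials are configured.

```python
from typing import Dict, List, Optional

# Valid values copied from the parameter descriptions above.
VALID_OPERATIONS = ("Run Once", "Schedule", "Remove Schedule")
VALID_INTERVALS = ("5 minutes", "10 minutes", "15 minutes")


def build_analyze_emr_logs_parameters(
    cluster_id: str,
    operation: str = "Run Once",
    interval_time: Optional[str] = None,
    log_group: Optional[str] = None,
    create_dashboard: bool = False,
    create_metric_filters: bool = False,
) -> Dict[str, List[str]]:
    """Build the Parameters map for AWSSupport-AnalyzeEMRLogs (hypothetical helper)."""
    if operation not in VALID_OPERATIONS:
        raise ValueError(f"Operation must be one of {VALID_OPERATIONS}")
    params: Dict[str, List[str]] = {
        "ClusterID": [cluster_id],
        "Operation": [operation],
    }
    # IntervalTime only applies when Operation is Schedule.
    if operation == "Schedule" and interval_time is not None:
        if interval_time not in VALID_INTERVALS:
            raise ValueError(f"IntervalTime must be one of {VALID_INTERVALS}")
        params["IntervalTime"] = [interval_time]
    # The CloudWatch Logs options only apply when LogToCloudWatchLogs is yes.
    if log_group is not None:
        params["LogToCloudWatchLogs"] = ["yes"]
        params["CloudWatchLogGroup"] = [log_group]
        params["CreateLogInsightsDashboard"] = ["yes" if create_dashboard else "no"]
        params["CreateMetricFilters"] = ["yes" if create_metric_filters else "no"]
    return params


def start_analyze_emr_logs(parameters: Dict[str, List[str]], region_name: str) -> str:
    """Start the automation and return its execution ID."""
    import boto3  # imported lazily so the builder above works without boto3

    ssm = boto3.client("ssm", region_name=region_name)
    response = ssm.start_automation_execution(
        DocumentName="AWSSupport-AnalyzeEMRLogs",
        Parameters=parameters,
    )
    return response["AutomationExecutionId"]
```

For example, `build_analyze_emr_logs_parameters("j-EXAMPLE", operation="Schedule", interval_time="5 minutes")` produces a map that could be passed to `start_analyze_emr_logs`. The cluster ID shown is a placeholder.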

**Required IAM permissions**

The `AutomationAssumeRole` parameter requires the following actions to use the runbook successfully.
+  `ssm:StartAutomationExecution` 
+  `ssm:GetDocument` 
+  `ssm:ListDocuments` 
+  `ssm:DescribeAutomationExecutions` 
+  `ssm:DescribeAutomationStepExecutions` 
+  `ssm:GetAutomationExecution` 
+  `ssm:DescribeInstanceInformation` 
+  `ssm:ListCommandInvocations` 
+  `ssm:ListCommands` 
+  `ssm:SendCommand` 
+  `iam:CreateRole` 
+  `iam:DeleteRole` 
+  `iam:GetRolePolicy` 
+  `iam:PutRolePolicy` 
+  `iam:DeleteRolePolicy` 
+  `iam:PassRole` 
+  `cloudformation:DescribeStacks` 
+  `cloudformation:DeleteStack` 
+  `cloudformation:CreateStack` 
+  `events:DeleteRule` 
+  `events:RemoveTargets` 
+  `events:PutTargets` 
+  `events:PutRule` 
+  `events:DescribeRule` 
+  `logs:DescribeLogGroups` 
+  `logs:CreateLogGroup` 
+  `logs:PutMetricFilter` 
+  `cloudwatch:PutDashboard` 
+  `elasticmapreduce:ListInstances` 
+  `elasticmapreduce:DescribeCluster` 
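
As a starting point for the role passed in `AutomationAssumeRole`, the actions listed above can be assembled into an IAM policy document. The sketch below is illustrative only: the helper name `build_policy` is hypothetical, and `"Resource": "*"` is an assumption made for brevity that you would normally narrow to specific resources, especially for `iam:PassRole`.

```python
import json
from typing import Dict, List

# Actions copied from the list above (IAM action names are case-insensitive).
REQUIRED_ACTIONS: List[str] = [
    "ssm:StartAutomationExecution", "ssm:GetDocument", "ssm:ListDocuments",
    "ssm:DescribeAutomationExecutions", "ssm:DescribeAutomationStepExecutions",
    "ssm:GetAutomationExecution", "ssm:DescribeInstanceInformation",
    "ssm:ListCommandInvocations", "ssm:ListCommands", "ssm:SendCommand",
    "iam:CreateRole", "iam:DeleteRole", "iam:GetRolePolicy",
    "iam:PutRolePolicy", "iam:DeleteRolePolicy", "iam:PassRole",
    "cloudformation:DescribeStacks", "cloudformation:DeleteStack",
    "cloudformation:CreateStack",
    "events:DeleteRule", "events:RemoveTargets", "events:PutTargets",
    "events:PutRule", "events:DescribeRule",
    "logs:DescribeLogGroups", "logs:CreateLogGroup", "logs:PutMetricFilter",
    "cloudwatch:PutDashboard",
    "elasticmapreduce:ListInstances", "elasticmapreduce:DescribeCluster",
]


def build_policy(actions: List[str], resource: str = "*") -> Dict:
    """Return a single-statement IAM policy document (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions),
                # Assumption: a wildcard resource; scope this down before use.
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(build_policy(REQUIRED_ACTIONS), indent=4))
```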

 **Document Steps** 
+  `aws:executeAwsApi` - Gathers information about the Amazon EMR cluster specified in the `ClusterID` parameter. 
+  `aws:branch` - Branches based on input. 
  +  If the provided operation is `Run Once` or `Schedule` : 
    +  `aws:assertAwsResourceProperty` - Verifies the cluster is available. 
    +  `aws:executeAwsApi` - Gathers the IDs of all instances running in the cluster. 
    +  `aws:assertAwsResourceProperty` - Verifies the SSM Agent is running on all instances in the cluster. 
    +  `aws:branch` - Branches based on whether you specified to run the automation once or on a schedule. 
      +  If the provided operation is `Run Once` : 
        +  `aws:branch` - Branches based on the value specified in the `LogToCloudWatchLogs` parameter. 
          +  If `LogToCloudWatchLogs` value is `yes` : 
            +  `aws:executeScript` - Checks if a CloudWatch Logs log group with the name specified in the `CloudWatchLogGroup` parameter already exists. If not, the group is created with the name specified. 
            +  `aws:branch` - Branches based on the value specified in the `CreateMetricFilters` parameter. 
              +  If `CreateMetricFilters` value is `yes` : 
                +  `aws:executeAwsApi` - 12 steps are run for each metric filter. 
                +  `aws:branch` - Branches based on the value specified in the `CreateLogInsightsDashboard` parameter. 
                  +  If `CreateLogInsightsDashboard` value is `yes` : 
                    +  `aws:executeAwsApi` - Creates a CloudWatch dashboard with the same name specified in the `CloudWatchLogGroup` parameter, if it does not already exist. 
                  +  If `CreateLogInsightsDashboard` value is `no` : 
                    +  `aws:runCommand` - Runs a shell script to find log patterns on each instance in the cluster. 
              +  If `CreateMetricFilters` value is `no` : 
                +  `aws:branch` - Branches based on the value specified in the `CreateLogInsightsDashboard` parameter. 
                  +  If `CreateLogInsightsDashboard` value is `yes` : 
                    +  `aws:executeAwsApi` - Creates a CloudWatch dashboard with the same name specified in the `CloudWatchLogGroup` parameter, if it does not already exist. 
                  +  If `CreateLogInsightsDashboard` value is `no` : 
                    +  `aws:runCommand` - Runs a shell script to find log patterns on each instance in the cluster. 
          +  If `LogToCloudWatchLogs` value is `no` : 
            +  `aws:executeAwsApi` - Runs a shell script to find log patterns on each instance in the cluster. 
      +  If the provided operation is `Schedule` : 
        +  `aws:createStack` - Creates an Amazon EventBridge event that targets this runbook. 
  +  If the provided operation is `Remove Schedule` : 
    +  `aws:executeAwsApi` - Verifies a schedule exists for the cluster. 
    +  `aws:deleteStack` - Deletes the schedule. 

 **Outputs** 

GetClusterInformation.ClusterName

GetClusterInformation.ClusterState

ListingClusterInstances.InstanceIDs

CreatingScheduleCloudFormationStack.StackStatus

RemovingScheduleByDeletingScheduleCloudFormationStack.StackStatus

CheckIfLogGroupExists.output

FindLogPatternOnEMRNode.CommandId
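
The outputs listed above can be read back with the Systems Manager `GetAutomationExecution` API once the automation finishes. The following sketch polls until the execution reaches a terminal status and then returns the outputs map. The function name `wait_for_outputs` and the choice of statuses treated as terminal are assumptions made for illustration. The `ssm_client` argument is expected to be a `boto3` SSM client (or any object with a compatible `get_automation_execution` method).

```python
import time
from typing import Callable, Dict, List, Tuple

# Assumption: the statuses this sketch treats as "finished".
TERMINAL_STATUSES = {
    "Success", "Failed", "Cancelled", "TimedOut", "Rejected",
    "CompletedWithSuccess", "CompletedWithFailure", "Exited",
}


def wait_for_outputs(
    ssm_client,
    execution_id: str,
    poll_seconds: float = 15.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, Dict[str, List[str]]]:
    """Poll an automation execution and return (status, outputs)."""
    for _ in range(max_polls):
        response = ssm_client.get_automation_execution(
            AutomationExecutionId=execution_id
        )
        execution = response["AutomationExecution"]
        status = execution["AutomationExecutionStatus"]
        if status in TERMINAL_STATUSES:
            # Outputs maps names such as "GetClusterInformation.ClusterName"
            # to a list of string values.
            return status, execution.get("Outputs", {})
        sleep(poll_seconds)
    raise TimeoutError(f"Execution {execution_id} did not finish in time")
```

With a real client, the returned map could then be indexed by an output name from the list above, for example `outputs.get("FindLogPatternOnEMRNode.CommandId")`.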

# `AWSSupport-DiagnoseEMRLogsWithAthena`
<a name="awssupport-diagnose-emr-logs-with-athena"></a>

**Description** 

The `AWSSupport-DiagnoseEMRLogsWithAthena` runbook helps diagnose Amazon EMR logs using Amazon Athena in integration with AWS Glue Data Catalog. Amazon Athena is used to query the Amazon EMR log files for containers, node logs, or both, with optional parameters for specific date ranges or keyword-based searches.

The runbook can automatically retrieve the Amazon EMR log location for an existing cluster, or you can specify the Amazon S3 log location. To analyze the logs, the runbook: 
+ Creates an AWS Glue database and executes Amazon Athena Data Definition Language (DDL) queries on the Amazon EMR Amazon S3 log location to create tables for cluster logs and a list of known issues. 
+ Executes Data Manipulation Language (DML) queries to search for known issue patterns in the Amazon EMR logs. The queries return a list of detected issues, their occurrence count, and the number of matched keywords by Amazon S3 file path. 
+ Uploads the results to an Amazon S3 bucket that you specify, under the prefix `saw_diagnose_EMR_known_issues`. 
+ Returns the Amazon Athena query results, highlighting findings, recommendations, and references to Amazon Knowledge Center (KC) articles sourced from a predefined subset. 
+ Deletes the AWS Glue database and the known issues files uploaded to the Amazon S3 bucket upon completion or failure. 

 **How does it work?** 

 The `AWSSupport-DiagnoseEMRLogsWithAthena` runbook analyzes Amazon EMR logs using Amazon Athena to detect errors and highlight findings, recommendations, and relevant Knowledge Center articles. 

The runbook performs the following steps: 
+ Get the Amazon EMR cluster log location and size, using either the cluster ID or the Amazon S3 location that you provide.
+ Provide an Athena cost estimate based on the size of the log location.
+ Request approval from the designated IAM principals before running Athena queries and continuing to the next steps.
+ Upload known issues to the specified Amazon S3 bucket, and create an AWS Glue database and tables.
+ Execute Athena queries on the Amazon EMR log data. Depending on the provided inputs, queries can filter by date range, by keywords, by both, or run without filters.
+ Analyze results to highlight findings, recommendations, and relevant KC articles.
+ Output links to the Amazon Athena DML query results.
+ Clean up the environment by removing the created database, tables, and uploaded known issues.

**Document type**

Automation

**Owner**

Amazon

**Platforms**

/

**Required IAM permissions**

The `AutomationAssumeRole` parameter requires the following actions to use the runbook successfully:
+ athena:GetQueryExecution
+ athena:StartQueryExecution
+ athena:GetPreparedStatement
+ athena:CreatePreparedStatement
+ glue:GetDatabase
+ glue:CreateDatabase
+ glue:DeleteDatabase
+ glue:CreateTable
+ glue:GetTable
+ glue:DeleteTable 
+ elasticmapreduce:DescribeCluster
+ s3:ListBucket
+ s3:GetBucketVersioning
+ s3:ListBucketVersions
+ s3:GetBucketPublicAccessBlock
+ s3:GetBucketPolicyStatus
+ s3:GetObject
+ s3:GetBucketLocation
+ pricing:GetProducts
+ pricing:GetAttributeValues
+ pricing:DescribeServices
+ pricing:ListPriceLists

**Important**  
 To restrict access to only the resources that this automation needs, attach the following policy to the IAM role that trusts the Systems Manager service. Replace the partition, Region, and account ID in the policy with the values for the partition, Region, and account where the runbook runs.


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "elasticmapreduce:DescribeCluster",
                "glue:GetDatabase",
                "athena:GetQueryExecution",
                "athena:StartQueryExecution",
                "athena:GetPreparedStatement",
                "athena:CreatePreparedStatement",
                "s3:ListBucket",
                "s3:GetBucketVersioning",
                "s3:ListBucketVersions",
                "s3:GetBucketPublicAccessBlock",
                "s3:GetBucketPolicyStatus",
                "s3:GetObject",
                "s3:GetBucketLocation",
                "pricing:GetProducts",
                "pricing:GetAttributeValues",
                "pricing:DescribeServices",
                "pricing:ListPriceLists"
            ],
            "Resource": "*"
        },
        {
            "Sid": "RestrictPutObjects",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::*/*/results/*",
                "arn:aws:s3:::*/*/saw_diagnose_emr_known_issues/*"
            ]
        },
        {
            "Sid": "RestrictDeleteAccess",
            "Effect": "Allow",
            "Action": [
                "s3:DeleteObject",
                "s3:DeleteObjectVersion"
            ],
            "Resource": [
                "arn:aws:s3:::*/*/saw_diagnose_emr_known_issues/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:CreateDatabase",
                "glue:DeleteDatabase"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:111122223333:database/saw_diagnose_emr_database_*",
                "arn:aws:glue:us-east-1:111122223333:table/saw_diagnose_emr_database_*/*",
                "arn:aws:glue:us-east-1:111122223333:userDefinedFunction/saw_diagnose_emr_database_*/*",
                "arn:aws:glue:us-east-1:111122223333:catalog"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateTable",
                "glue:GetTable",
                "glue:DeleteTable"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:111122223333:table/saw_diagnose_emr_database_*/saw_diagnose_emr_known_issues",
                "arn:aws:glue:us-east-1:111122223333:table/saw_diagnose_emr_database_*/saw_diagnose_emr_logs_table",
                "arn:aws:glue:us-east-1:111122223333:table/saw_diagnose_emr_database_*/j_*",
                "arn:aws:glue:us-east-1:111122223333:database/saw_diagnose_emr_database_*",
                "arn:aws:glue:us-east-1:111122223333:catalog"
            ]
        }
    ]
}
```
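
The policy above uses `aws`, `us-east-1`, and `111122223333` as placeholder partition, Region, and account values. One way to substitute your own values is to rewrite the ARN fields programmatically, as in the following sketch. The helper names `localize_arn` and `localize_policy` are hypothetical. Amazon S3 ARNs, which have empty Region and account fields, only have their partition replaced.

```python
import copy
from typing import Dict


def localize_arn(arn: str, partition: str, region: str, account_id: str) -> str:
    """Replace the partition, Region, and account fields of an ARN."""
    # ARN layout: arn:partition:service:region:account-id:resource
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"Not an ARN: {arn!r}")
    parts[1] = partition
    if parts[3]:  # leave empty Region fields (for example, S3) untouched
        parts[3] = region
    if parts[4]:  # leave empty account fields (for example, S3) untouched
        parts[4] = account_id
    return ":".join(parts)


def localize_policy(policy: Dict, partition: str, region: str, account_id: str) -> Dict:
    """Return a copy of the policy with every resource ARN localized."""
    localized = copy.deepcopy(policy)
    for statement in localized["Statement"]:
        resources = statement["Resource"]
        as_list = [resources] if isinstance(resources, str) else resources
        rewritten = [
            r if r == "*" else localize_arn(r, partition, region, account_id)
            for r in as_list
        ]
        statement["Resource"] = rewritten[0] if isinstance(resources, str) else rewritten
    return localized
```

Loading the JSON policy with `json.loads`, passing it through `localize_policy`, and serializing it again with `json.dumps` would yield a policy ready to attach to the role.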


 **Instructions** 

Follow these steps to configure the automation:

1. Navigate to [AWSSupport-DiagnoseEMRLogsWithAthena](https://console.aws.amazon.com/systems-manager/documents/AWSSupport-DiagnoseEMRLogsWithAthena/description) in AWS Systems Manager under Documents.

1. Select Execute automation.

1. For the input parameters enter the following:
   + **AutomationAssumeRole (Optional):**

     The Amazon Resource Name (ARN) of the AWS Identity and Access Management (IAM) role that allows Systems Manager Automation to perform the actions on your behalf. If no role is specified, Systems Manager Automation uses the permissions of the user that starts this runbook.
   + **ClusterID (Required):**

     The Amazon EMR cluster ID.
   + **S3LogLocation (Optional):**

     The Amazon S3 location of the Amazon EMR logs. Enter the location as a path-style Amazon S3 URL, for example: `s3://amzn-s3-demo-bucket/myfolder/j-1K48XXXXXXHCB/`. Provide this parameter if the Amazon EMR cluster has been terminated for more than `30` days.
   + **S3BucketName (Required):**

      The name of the Amazon S3 bucket where the automation uploads the list of known issues and the output of the Amazon Athena queries. The bucket should have [Block Public Access enabled](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-control-block-public-access.html) and be in the same AWS Region and account as the Amazon EMR cluster.
   + **Approvers (Required):**

     The list of AWS authenticated principals who are able to either approve or reject the action. You can specify principals by using any of the following formats: user name, user ARN, IAM role ARN, or IAM assume role ARN. The maximum number of approvers is 10.
   + **FetchNodeLogsOnly (Optional):**

      If set to `true`, the automation diagnoses only the Amazon EMR node logs. The default value is `false`.
   + **FetchContainersLogsOnly (Optional):**

      If set to `true`, the automation diagnoses only the Amazon EMR container logs. The default value is `false`.
   + **EndSearchDate (Optional):**

      The end date for log searches. If provided, the automation will exclusively search for logs generated up to the specified date in the format YYYY-MM-DD (for example: `2024-12-30`).
   + **DaysToCheck (Optional):**

      When `EndSearchDate` is provided, this parameter is required to determine the number of days to retrospectively search for logs from the specified `EndSearchDate`. The maximum value is `30` days. The default value is `1`.
   + **SearchKeywords (Optional):**

      The list of keywords to search in the logs, separated by commas. The keywords cannot contain single or double quotes.  
![\[Input parameters form for AWS Systems Manager Automation with various fields and options.\]](http://docs.aws.amazon.com/systems-manager-automation-runbooks/latest/userguide/images/awssupport-diagnose-emr-logs-with-athena_input_parameters.png)

1. Select **Execute.**

1. The automation initiates.

1. The document performs the following steps:
   + **getLogLocation:**

     Retrieves the Amazon S3 log location by querying the specified Amazon EMR Cluster ID. If the automation is unable to query the log location from the Amazon EMR cluster ID, the runbook uses the `S3LogLocation` input parameter.
   + **branchOnValidLog:**

     Verifies the Amazon EMR logs location. If the location is valid, the automation proceeds to estimate the potential Amazon Athena costs of executing queries on the Amazon EMR logs.
   + **estimateAthenaCosts:**

     Determines the size of Amazon EMR logs and provides a cost estimate for executing Athena scans on the log dataset. For non-commercial regions (non-AWS partitions), this step just provides the log size without estimating costs. Costs can be calculated using the Athena pricing documentation in the specified region.
   + **approveAutomation:**

     Waits for approval from the designated IAM principals before proceeding with the next steps of the automation. The approval notification contains the estimated cost of the Amazon Athena scan on the Amazon EMR logs, and details about the resources that the automation provisions.
   + **uploadKnownIssuesExecuteAthenaQueries:**

     Uploads the predefined known issues to the Amazon S3 bucket specified in the `S3BucketName` parameter. Creates AWS Glue database and tables. Executes Amazon Athena queries in the AWS Glue database based on the input parameters.
   + **getQueryExecutionStatus:**

     Waits until the Amazon Athena query execution is in `SUCCEEDED` state. The Amazon Athena DML query searches for errors and exceptions in Amazon EMR cluster logs.
   + **analyzeAthenaResults:**

     Analyzes the Amazon Athena results to provide findings, recommendations, and Knowledge Center (KC) articles sourced from a predefined set of mappings.
   + **getAnalyzeResultsQuery1ExecutionStatus:**

     Waits until the query execution is in `SUCCEEDED` state. The Amazon Athena DML query analyzes the results from the previous DML query. This analysis query returns matched exceptions with resolutions and KC articles.
   + **getAnalyzeResultsQuery2ExecutionStatus:**

     Waits until the query execution is in `SUCCEEDED` state. The Amazon Athena DML query analyzes the results from the previous DML query. This analysis query will return a list of exceptions/errors detected in each Amazon S3 log path.
   + **printAthenaQueriesMessage:**

     Prints links for the Amazon Athena DML queries results.
   + **cleanupResources:**

     Cleans up resources by deleting the created AWS Glue database and the known issues files that were created in the Amazon EMR logs bucket.

1. After the automation completes, review the Outputs section for the detailed results of the execution:

   **Output provides three links for Athena query results:**
   + List of all errors and frequently occurring exceptions found in the Amazon EMR cluster logs, along with the corresponding log locations (Amazon S3 prefix).
   + Summary of unique known exceptions matched in the Amazon EMR logs, along with recommended resolutions and KC articles to help in troubleshooting.
   + Details on where specific errors and exceptions appear in the Amazon S3 log paths, to support further diagnosis.  
![\[Output section showing query links for exception summaries and analysis in AWS logs.\]](http://docs.aws.amazon.com/systems-manager-automation-runbooks/latest/userguide/images/awssupport-diagnose-emr-logs-with-athena_outputs.png)
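
The input parameters described in step 3 can also be assembled programmatically before calling the Systems Manager `StartAutomationExecution` API with `DocumentName="AWSSupport-DiagnoseEMRLogsWithAthena"`. The sketch below only builds and validates the parameter map against the constraints stated in this section. The helper name `build_diagnose_parameters` is hypothetical, and passing `DaysToCheck` as a string and `Approvers` as a list of strings are assumptions about the runbook's parameter types.

```python
import re
from datetime import datetime
from typing import Dict, List, Optional, Sequence

_DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def build_diagnose_parameters(
    cluster_id: str,
    s3_bucket_name: str,
    approvers: Sequence[str],
    s3_log_location: Optional[str] = None,
    end_search_date: Optional[str] = None,
    days_to_check: int = 1,
    search_keywords: Optional[Sequence[str]] = None,
) -> Dict[str, List[str]]:
    """Build the Parameters map for the runbook (hypothetical helper)."""
    if not 1 <= len(approvers) <= 10:
        raise ValueError("Provide between 1 and 10 approvers")
    params: Dict[str, List[str]] = {
        "ClusterID": [cluster_id],
        "S3BucketName": [s3_bucket_name],
        "Approvers": list(approvers),
    }
    if s3_log_location is not None:
        params["S3LogLocation"] = [s3_log_location]
    if end_search_date is not None:
        if not _DATE_PATTERN.fullmatch(end_search_date):
            raise ValueError("EndSearchDate must use the format YYYY-MM-DD")
        # strptime raises ValueError for impossible dates such as 2024-02-31.
        datetime.strptime(end_search_date, "%Y-%m-%d")
        if not 1 <= days_to_check <= 30:
            raise ValueError("DaysToCheck must be between 1 and 30")
        params["EndSearchDate"] = [end_search_date]
        params["DaysToCheck"] = [str(days_to_check)]
    if search_keywords:
        for keyword in search_keywords:
            # Quotes are not allowed; commas would break the comma-separated list.
            if "'" in keyword or '"' in keyword or "," in keyword:
                raise ValueError(f"Invalid keyword: {keyword!r}")
        params["SearchKeywords"] = [",".join(search_keywords)]
    return params
```

Validating locally in this way surfaces malformed dates or keywords before an approval request is sent to the designated principals.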

 **References** 

Systems Manager Automation
+ [Run this Automation (console)](https://console.aws.amazon.com/systems-manager/documents/AWSSupport-DiagnoseEMRLogsWithAthena/description)
+ [Run an automation](https://docs.aws.amazon.com//systems-manager/latest/userguide/automation-working-executing.html)
+ [Setting up an Automation](https://docs.aws.amazon.com//systems-manager/latest/userguide/automation-setup.html)
+ [Support Automation Workflows landing page](https://aws.amazon.com/premiumsupport/technology/saw/)

AWS service documentation
+ Refer to [Troubleshooting Amazon EMR Clusters](https://docs.aws.amazon.com//emr/latest/ManagementGuide/emr-troubleshoot.html) for more information.