

# Troubleshoot a slow Amazon EMR cluster
<a name="emr-troubleshoot-slow"></a>

 This section walks you through the process of troubleshooting a cluster that is still running, but is taking a long time to return results. For more information about what to do if the cluster has terminated with an error code, see [Troubleshoot an Amazon EMR cluster that has failed with an error code](emr-troubleshoot-failed.md). 

 Amazon EMR enables you to specify the number and kind of instances in the cluster. These specifications are the primary means of affecting the speed with which your data processing completes. One thing you might consider is re-running the cluster, this time specifying EC2 instances with greater resources, or specifying a larger number of instances in the cluster. For more information, see [Configure Amazon EMR cluster hardware and networking](emr-plan-instances.md). 

 The following topics walk you through the process of identifying alternative causes of a slow cluster. 

**Topics**
+ [Step 1: Gather data about the issue with the Amazon EMR cluster](emr-troubleshoot-slow-1.md)
+ [Step 2: Check the EMR cluster environment](emr-troubleshoot-slow-2.md)
+ [Step 3: Examine the log files for the Amazon EMR cluster](emr-troubleshoot-slow-3.md)
+ [Step 4: Check Amazon EMR cluster and instance health](emr-troubleshoot-slow-4.md)
+ [Step 5: Check for suspended groups](emr-troubleshoot-slow-5.md)
+ [Step 6: Review configuration settings for the Amazon EMR cluster](emr-troubleshoot-slow-6.md)
+ [Step 7: Examine input data for the Amazon EMR cluster](emr-troubleshoot-slow-7.md)

# Step 1: Gather data about the issue with the Amazon EMR cluster
<a name="emr-troubleshoot-slow-1"></a>

 The first step in troubleshooting a cluster is to gather information about what went wrong and the current status and configuration of the cluster. This information will be used in the following steps to confirm or rule out possible causes of the issue. 

## Define the problem
<a name="emr-troubleshoot-slow-1-problem"></a>

 A clear definition of the problem is the first place to begin. Some questions to ask yourself: 
+  What did I expect to happen? What happened instead? 
+  When did this problem first occur? How often has it happened since? 
+  Has anything changed in how I configure or run my cluster? 

## Cluster details
<a name="emr-troubleshoot-slow-1-cluster"></a>

 The following cluster details are useful in helping track down issues. For more information on how to gather this information, see [View Amazon EMR cluster status and details](emr-manage-view-clusters.md). 
+  Identifier of the cluster. (Also called a job flow identifier.) 
+  AWS Region and Availability Zone the cluster was launched into. 
+  State of the cluster, including details of the last state change. 
+  Type and number of EC2 instances specified for the master, core, and task nodes. 

# Step 2: Check the EMR cluster environment
<a name="emr-troubleshoot-slow-2"></a>

Check your environment to see whether there are service outages or whether you have exceeded an AWS service limit.

**Topics**
+ [Check for service outages](#emr-troubleshoot-slow-2-outages)
+ [Check usage limits](#emr-troubleshoot-slow-2-limits)
+ [Check the Amazon VPC subnet configuration](#emr-troubleshoot-slow-2-vpc)
+ [Restart the cluster](#emr-troubleshoot-slow-2-restart)

## Check for service outages
<a name="emr-troubleshoot-slow-2-outages"></a>

 Amazon EMR uses several Amazon Web Services internally. It runs virtual servers on Amazon EC2, stores data and scripts on Amazon S3, and reports metrics to CloudWatch. Events that disrupt these services are rare, but when they occur, they can cause issues in Amazon EMR. 

 Before you go further, check the [Service Health Dashboard](http://status.aws.amazon.com). Check the Region where you launched your cluster to see whether there are disruption events in any of these services. 

## Check usage limits
<a name="emr-troubleshoot-slow-2-limits"></a>

 If you are launching a large cluster, have launched many clusters simultaneously, or you are a user sharing an AWS account with other users, the cluster may have failed because you exceeded an AWS service limit. 

 Amazon EC2 limits the number of virtual server instances running in a single AWS Region to 20 on-demand or reserved instances. If you launch a cluster with more than 20 nodes, or launch a cluster that causes the total number of EC2 instances active on your AWS account to exceed 20, the cluster will not be able to launch all of the EC2 instances it requires and may fail. When this happens, Amazon EMR returns an `EC2 QUOTA EXCEEDED` error. You can request that AWS increase the number of EC2 instances that you can run on your account by submitting a [Request to Increase Amazon EC2 Instance Limit](http://aws.amazon.com/contact-us/ec2-request/) application. 

 Another thing that may cause you to exceed your usage limits is the delay between when a cluster is terminated and when it releases all of its resources. Depending on its configuration, it may take 5 to 20 minutes for a cluster to fully terminate and release its allocated resources. If you get an `EC2 QUOTA EXCEEDED` error when you attempt to launch a cluster, it may be because resources from a recently terminated cluster have not yet been released. In this case, you can either [request that your Amazon EC2 quota be increased](http://aws.amazon.com/contact-us/ec2-request/), or you can wait twenty minutes and relaunch the cluster. 

 Amazon S3 limits the number of buckets created on an account to 100. If your cluster creates a new bucket that exceeds this limit, the bucket creation will fail and may cause the cluster to fail. 

## Check the Amazon VPC subnet configuration
<a name="emr-troubleshoot-slow-2-vpc"></a>

If your cluster was launched in an Amazon VPC subnet, the subnet needs to be configured as described in [Configure networking in a VPC for Amazon EMR](emr-plan-vpc-subnet.md). In addition, check that the subnet you launch the cluster into has enough free elastic IP addresses to assign one to each node in the cluster.

## Restart the cluster
<a name="emr-troubleshoot-slow-2-restart"></a>

 The slowdown in processing may be caused by a transient condition. Consider terminating and restarting the cluster to see if performance improves. 

# Step 3: Examine the log files for the Amazon EMR cluster
<a name="emr-troubleshoot-slow-3"></a>

 The next step is to examine the log files in order to locate an error code or other indication of the issue that your cluster experienced. For information on the log files available, where to find them, and how to view them, see [View Amazon EMR log files](emr-manage-view-web-log-files.md). 

 It may take some investigative work to determine what happened. Hadoop runs the work of the jobs in task attempts on various nodes in the cluster. Amazon EMR can initiate speculative task attempts, terminating the task attempts that do not complete first. This generates significant activity that is logged to the controller, stderr, and syslog log files as it happens. In addition, multiple task attempts run simultaneously, but a log file can only display results linearly. 

 Start by checking the bootstrap action logs for errors or unexpected configuration changes during the launch of the cluster. From there, look in the step logs to identify Hadoop jobs launched as part of a step with errors. Examine the Hadoop job logs to identify the failed task attempts. The task attempt log will contain details about what caused a task attempt to fail. 

The following sections describe how to use the various log files to identify errors in your cluster.

## Check the bootstrap action logs
<a name="emr-troubleshoot-slow-3-bootstrap-logs"></a>

 Bootstrap actions run scripts on the cluster as it is launched. They are commonly used to install additional software on the cluster or to alter configuration settings from the default values. Checking these logs may provide insight into errors that occurred during set up of the cluster as well as configuration settings changes that could affect performance. 

## Check the step logs
<a name="emr-troubleshoot-slow-3-step-logs"></a>

 There are four types of step logs. 
+ **controller—**Contains files generated by Amazon EMR that arise from errors encountered while trying to run your step. If your step fails while loading, you can find the stack trace in this log. Errors loading or accessing your application are often described here, as are missing mapper file errors. 
+  **stderr—**Contains error messages that occurred while processing the step. Application loading errors are often described here. This log sometimes contains a stack trace. 
+ **stdout—**Contains status generated by your mapper and reducer executables. Application loading errors are often described here. This log sometimes contains application error messages.
+ **syslog—**Contains logs from non-Amazon software, such as Apache and Hadoop. Streaming errors are often described here.

 Check stderr for obvious errors. If stderr displays a short list of errors, the step came to a quick stop with an error thrown. This is most often caused by an error in the mapper and reducer applications being run in the cluster. 

 Examine the last lines of controller and syslog for notices of errors or failures. Follow up on any notices about failed tasks, particularly any that say "Job Failed". 
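The scan described above can be sketched in a few lines of Python. The error patterns below are illustrative assumptions, not an exhaustive list of what Amazon EMR writes to its logs; extend them to match the failures you actually see:

```python
import re

# Hypothetical helper: scan step log text (for example, the contents of a
# stderr or syslog file) for common failure indicators. The patterns are
# illustrative assumptions, not an exhaustive list.
ERROR_PATTERNS = [
    re.compile(r"\bERROR\b"),
    re.compile(r"\bFATAL\b"),
    re.compile(r"Exception\b"),
    re.compile(r"Job Failed"),
]

def find_failure_lines(log_text):
    """Return (line_number, line) pairs that match a failure pattern."""
    hits = []
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        if any(p.search(line) for p in ERROR_PATTERNS):
            hits.append((lineno, line.strip()))
    return hits
```

Running this over each downloaded step log gives you a quick shortlist of lines to investigate, with their positions in the file, instead of paging through the full output by hand.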

## Check the task attempt logs
<a name="emr-troubleshoot-slow-3-task-logs"></a>

 If the previous analysis of the step logs turned up one or more failed tasks, investigate the logs of the corresponding task attempts for more detailed error information. 

## Check the Hadoop daemon logs
<a name="emr-troubleshoot-slow-3-hadoop-logs"></a>

 In rare cases, Hadoop itself might fail. To see if that is the case, you must look at the Hadoop logs. They are located at `/var/log/hadoop/` on each node. 

 You can use the JobTracker logs to map a failed task attempt to the node it was run on. Once you know the node associated with the task attempt, you can check the health of the EC2 instance hosting that node to see if there were any issues such as running out of CPU or memory. 
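If you have many failed attempts, you can automate that mapping with a short script. The JobTracker log line format assumed below is a sketch (it varies by Hadoop version), and the attempt ID and tracker host in the pattern are hypothetical examples; adjust the regular expression to match what your logs actually contain:

```python
import re

# Assumed JobTracker log line shape (varies by Hadoop version):
#   ... Adding task (MAP) 'attempt_<id>' to tip ..., for tracker 'tracker_<host>:...'
# The pattern below is a sketch under that assumption.
ATTEMPT_RE = re.compile(
    r"Adding task.*'(?P<attempt>attempt_\w+)'.*tracker 'tracker_(?P<host>[^:']+)"
)

def attempts_by_host(log_text):
    """Map each task attempt ID found in the log to the host that ran it."""
    mapping = {}
    for line in log_text.splitlines():
        m = ATTEMPT_RE.search(line)
        if m:
            mapping[m.group("attempt")] = m.group("host")
    return mapping
```

Once the failed attempt resolves to a host name, you can look up the corresponding EC2 instance and check its CPU, memory, and disk history.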

# Step 4: Check Amazon EMR cluster and instance health
<a name="emr-troubleshoot-slow-4"></a>

 An Amazon EMR cluster is made up of nodes running on Amazon EC2 instances. If those instances become resource-bound (such as running out of CPU or memory), experience network connectivity issues, or are terminated, the speed of cluster processing suffers. 

 There are up to three types of nodes in a cluster: 
+  **master node** — manages the cluster. If it experiences a performance issue, the entire cluster is affected. 
+  **core nodes** — process map-reduce tasks and maintain the Hadoop Distributed Filesystem (HDFS). If one of these nodes experiences a performance issue, it can slow down HDFS operations as well as map-reduce processing. You can add additional core nodes to a cluster to improve performance, but cannot remove core nodes. For more information, see [Manually resize a running Amazon EMR cluster](emr-manage-resize.md). 
+  **task nodes** — process map-reduce tasks. These are purely computational resources and do not store data. You can add task nodes to a cluster to speed up performance, or remove task nodes that are not needed. For more information, see [Manually resize a running Amazon EMR cluster](emr-manage-resize.md). 

 When you look at the health of a cluster, you should look at both the performance of the cluster overall, as well as the performance of individual instances. There are several tools you can use: 

## Check cluster health with CloudWatch
<a name="emr-troubleshoot-slow-4-cw"></a>

 Every Amazon EMR cluster reports metrics to CloudWatch. These metrics provide summary performance information about the cluster, such as the total load, HDFS utilization, running tasks, remaining tasks, corrupt blocks, and more. Looking at the CloudWatch metrics gives you the big picture about what is going on with your cluster and can provide insight into what is causing the slowdown in processing. In addition to using CloudWatch to analyze an existing performance issue, you can set alarms that cause CloudWatch to alert you if a future performance issue occurs. For more information, see [Monitoring Amazon EMR metrics with CloudWatch](UsingEMR_ViewingMetrics.md). 
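As a sketch of what such an alarm might look like with boto3, the parameters below define a hypothetical alarm on the `HDFSUtilization` metric. The alarm name, cluster ID placeholder, and thresholds are assumptions you would replace with values appropriate to your workload:

```python
# A minimal sketch of a CloudWatch alarm definition for an EMR cluster.
# The alarm name, cluster ID, and thresholds are hypothetical examples.
alarm_params = {
    "AlarmName": "emr-hdfs-utilization-high",       # hypothetical name
    "Namespace": "AWS/ElasticMapReduce",
    "MetricName": "HDFSUtilization",
    "Dimensions": [{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],
    "Statistic": "Average",
    "Period": 300,                                  # 5-minute periods
    "EvaluationPeriods": 3,                         # alarm after 15 minutes
    "Threshold": 80.0,                              # percent of HDFS capacity used
    "ComparisonOperator": "GreaterThanThreshold",
}

# To create the alarm (requires AWS credentials and the boto3 package):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```

An alarm like this notifies you when HDFS is filling up, rather than you discovering the problem after jobs have already slowed down.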

## Check job status and HDFS health
<a name="emr-troubleshoot-slow-4-web-ui"></a>

Use the **Application user interfaces** tab on the cluster details page to view YARN application details. For certain applications, you can drill into further detail and access logs directly. This is particularly useful for Spark applications. For more information, see [View Amazon EMR application history](emr-cluster-application-history.md).

Hadoop provides a series of web interfaces you can use to view information. For more information about how to access these web interfaces, see [View web interfaces hosted on Amazon EMR clusters](emr-web-interfaces.md). 
+  JobTracker — provides information about the progress of jobs being processed by the cluster. You can use this interface to identify when a job has become stuck. 
+  HDFS NameNode — provides information about the percentage of HDFS utilization and available space on each node. You can use this interface to identify when HDFS is becoming resource bound and requires additional capacity. 
+  TaskTracker — provides information about the tasks of the job being processed by the cluster. You can use this interface to identify when a task has become stuck. 

## Check instance health with Amazon EC2
<a name="emr-troubleshoot-slow-4-ec2"></a>

 Another way to look for information about the status of the instances in your cluster is to use the Amazon EC2 console. Because each node in the cluster runs on an EC2 instance, you can use tools provided by Amazon EC2 to check their status. For more information, see [View cluster instances in Amazon EC2](UsingEMR_Tagging.md). 

# Step 5: Check for suspended groups
<a name="emr-troubleshoot-slow-5"></a>

 An instance group becomes suspended when it encounters too many errors while trying to launch nodes. For example, if new nodes repeatedly fail while performing bootstrap actions, the instance group will — after some time — go into the `SUSPENDED` state rather than continuously attempt to provision new nodes. 

 A node could fail to come up if: 
+ Hadoop or the cluster is somehow broken and does not accept a new node into the cluster
+ A bootstrap action fails on the new node
+ The node is not functioning correctly and fails to check in with Hadoop

If an instance group is in the `SUSPENDED` state, and the cluster is in a `WAITING` state, you can add a cluster step to reset the desired number of core and task nodes. Adding the step resumes processing of the cluster and puts the instance group back into a `RUNNING` state. 

For more information about how to reset a cluster in a suspended state, see [Suspended state](emr-manage-resize.md#emr-manage-resizeSuspended). 

# Step 6: Review configuration settings for the Amazon EMR cluster
<a name="emr-troubleshoot-slow-6"></a>

 Configuration settings specify details about how a cluster runs, such as how many times to retry a task and how much memory is available for sorting. When you launch a cluster using Amazon EMR, there are Amazon EMR-specific settings in addition to the standard Hadoop configuration settings. The configuration settings are stored on the master node of the cluster. You can check the configuration settings to ensure that your cluster has the resources it requires to run efficiently. 

 Amazon EMR defines default Hadoop configuration settings that it uses to launch a cluster. The values are based on the AMI and the instance type you specify for the cluster. You can modify the configuration settings from the default values using a bootstrap action or by specifying new values in job execution parameters. For more information, see [Create bootstrap actions to install additional software with an Amazon EMR cluster](emr-plan-bootstrap.md). To determine whether a bootstrap action changed the configuration settings, check the bootstrap action logs. 

 Amazon EMR logs the Hadoop settings used to execute each job. The log data is stored in a file named `job_job-id_conf.xml` under the `/mnt/var/log/hadoop/history/` directory of the master node, where *job-id* is replaced by the identifier of the job. If you've enabled log archiving, this data is copied to Amazon S3 in the `logs/date/jobflow-id/jobs` folder, where *date* is the date the job ran, and *jobflow-id* is the identifier of the cluster. 
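Because this file is standard Hadoop configuration XML (`<property>` entries containing a `<name>` and a `<value>`), you can pull out the settings of interest with a short script rather than reading the whole document. The helper below is a minimal sketch:

```python
import xml.etree.ElementTree as ET

def read_hadoop_conf(xml_text, names):
    """Extract selected settings from a Hadoop job_*_conf.xml document.

    Returns a dict mapping each requested setting name to its value,
    omitting any names that are not present in the file.
    """
    root = ET.fromstring(xml_text)
    wanted = set(names)
    found = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name in wanted:
            found[name] = prop.findtext("value")
    return found
```

For example, passing the file contents along with `["dfs.replication", "io.sort.mb"]` returns just those two settings, which you can then compare against the defaults for your instance type.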

 The following Hadoop job configuration settings are especially useful for investigating performance issues. For more information about the Hadoop configuration settings and how they affect the behavior of Hadoop, go to [http://hadoop.apache.org/docs/](http://hadoop.apache.org/docs/). 

**Warning**  
Setting `dfs.replication` to 1 on clusters with fewer than four nodes can lead to HDFS data loss if a single node goes down. We recommend you use a cluster with at least four core nodes for production workloads.
Amazon EMR will not allow clusters to scale core nodes below `dfs.replication`. For example, if `dfs.replication = 2`, the minimum number of core nodes is 2.
When you use Managed Scaling, Auto-scaling, or choose to manually resize your cluster, we recommend that you set `dfs.replication` to 2 or higher.


| Configuration setting | Description | 
| --- | --- | 
| dfs.replication | The number of HDFS nodes to which a single block is copied, similar to RAID mirroring. Determines how many HDFS nodes contain a copy of each block.  | 
| io.sort.mb | Total memory available for sorting. This value should be 10x io.sort.factor. This setting can also be used to calculate the total sort memory used by a task node: io.sort.mb multiplied by mapred.tasktracker.map.tasks.maximum. | 
| io.sort.spill.percent | The threshold, during a sort, at which the disk starts to be used because the allotted sort memory is getting full. | 
| mapred.child.java.opts | Deprecated. Use mapred.map.child.java.opts and mapred.reduce.child.java.opts instead. The Java options TaskTracker uses when launching a JVM for a task to execute within. A common parameter is "-Xmx" for setting max memory size. | 
| mapred.map.child.java.opts | The Java options TaskTracker uses when launching a JVM for a map task to execute within. A common parameter is "-Xmx" for setting max memory heap size. | 
| mapred.map.tasks.speculative.execution | Determines whether map task attempts of the same task may be launched in parallel. | 
| mapred.reduce.tasks.speculative.execution | Determines whether reduce task attempts of the same task may be launched in parallel. | 
| mapred.map.max.attempts | The maximum number of times a map task can be attempted. If all fail, then the map task is marked as failed. | 
| mapred.reduce.child.java.opts | The Java options TaskTracker uses when launching a JVM for a reduce task to execute within. A common parameter is "-Xmx" for setting max memory heap size. | 
| mapred.reduce.max.attempts | The maximum number of times a reduce task can be attempted. If all fail, then the reduce task is marked as failed. | 
| mapred.reduce.slowstart.completed.maps | The proportion of map tasks that should complete before reduce tasks are attempted. Not waiting long enough may cause "Too many fetch-failure" errors in attempts. | 
| mapred.reuse.jvm.num.tasks | A task runs within a single JVM. Specifies how many tasks may reuse the same JVM. | 
| mapred.tasktracker.map.tasks.maximum | The maximum number of map tasks that can execute in parallel per task node. | 
| mapred.tasktracker.reduce.tasks.maximum | The maximum number of reduce tasks that can execute in parallel per task node. | 

 If your cluster tasks are memory-intensive, you can enhance performance by using fewer tasks per core node and reducing your job tracker heap size. 
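As a back-of-the-envelope check, the rule of thumb from the table above (io.sort.mb multiplied by mapred.tasktracker.map.tasks.maximum) can be computed directly. The figures below are hypothetical examples, not recommended values:

```python
def estimated_sort_memory_mb(io_sort_mb, max_map_tasks):
    """Rough upper bound on sort-buffer memory per task node, following
    the io.sort.mb x mapred.tasktracker.map.tasks.maximum rule of thumb."""
    return io_sort_mb * max_map_tasks

# Hypothetical example: a 200 MB sort buffer with 8 concurrent map slots
# per node implies up to 1600 MB of sort-buffer memory on that node.
per_node_mb = estimated_sort_memory_mb(200, 8)
```

If the result approaches the physical memory of the instance type (after subtracting daemon and JVM heap overhead), the node is likely to swap or spill heavily, and reducing either setting may help.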

# Step 7: Examine input data for the Amazon EMR cluster
<a name="emr-troubleshoot-slow-7"></a>

 Look at your input data. Is it distributed evenly among your key values? If your data is heavily skewed towards one or a few key values, the processing load may be mapped to a small number of nodes, while other nodes idle. This imbalanced distribution of work can result in slower processing times. 

 An example of an imbalanced data set would be running a cluster to alphabetize words, but having a data set that contained only words beginning with the letter "a". When the work was mapped out, the node processing values beginning with "a" would be overwhelmed, while nodes processing words beginning with other letters would go idle. 
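One way to spot this kind of skew before launching a cluster is to sample the input keys and compare the heaviest key against the average. A minimal sketch, assuming a sample of the keys fits in memory:

```python
from collections import Counter

def key_skew(keys):
    """Ratio of the heaviest key's count to the mean count per key.

    A result near 1.0 indicates an even distribution; values far above
    1.0 suggest the load will pile onto a small number of nodes.
    """
    counts = Counter(keys)
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean
```

In the alphabetizing example above, a sample dominated by words beginning with "a" would produce a skew ratio well above 1.0, flagging the imbalance before any cluster time is spent.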