

# AWS Glue Spark and PySpark jobs
<a name="spark_and_pyspark"></a>

AWS Glue supports Spark and PySpark jobs. A Spark job runs in an Apache Spark environment managed by AWS Glue and processes data in batches. A streaming ETL job is similar to a Spark job, except that it performs ETL on data streams using the Apache Spark Structured Streaming framework. Some Spark job features are not available to streaming ETL jobs.

The following sections provide information on AWS Glue Spark and PySpark jobs.

**Topics**
+ [Configuring job properties for Spark jobs in AWS Glue](add-job.md)
+ [Editing Spark scripts in the AWS Glue console](edit-script-spark.md)
+ [Jobs (legacy)](console-edit-script.md)
+ [Tracking processed data using job bookmarks](monitor-continuations.md)
+ [Storing Spark shuffle data](monitor-spark-shuffle-manager.md)
+ [Monitoring AWS Glue Spark jobs](monitor-spark.md)
+ [Generative AI troubleshooting for Apache Spark in AWS Glue](troubleshoot-spark.md)
+ [Using materialized views with AWS Glue](materialized-views.md)

# Configuring job properties for Spark jobs in AWS Glue
<a name="add-job"></a>

When you define your job on the AWS Glue console, you provide values for properties to control the AWS Glue runtime environment. 

## Defining job properties for Spark jobs
<a name="create-job"></a>

The following list describes the properties of a Spark job. For the properties of a Python shell job, see [Defining job properties for Python shell jobs](add-job-python.md#create-job-python-properties). For properties of a streaming ETL job, see [Defining job properties for a streaming ETL job](add-job-streaming.md#create-job-streaming-properties).

The properties are listed in the order in which they appear in the **Add job** wizard on the AWS Glue console.

**Name**  
Provide a UTF-8 string with a maximum length of 255 characters. 

**Description**  
Provide an optional description of up to 2048 characters. 

**IAM Role**  
Specify the IAM role that is used for authorization to resources used to run the job and access data stores. For more information about permissions for running jobs in AWS Glue, see [Identity and access management for AWS Glue](security-iam.md).

**Type**  
The type of ETL job. This is set automatically based on the type of data sources you select.  
+ **Spark** runs an Apache Spark ETL script with the job command `glueetl`.
+ **Spark Streaming** runs an Apache Spark streaming ETL script with the job command `gluestreaming`. For more information, see [Streaming ETL jobs in AWS Glue](add-job-streaming.md).
+ **Python shell** runs a Python script with the job command `pythonshell`. For more information, see [Configuring job properties for Python shell jobs in AWS Glue](add-job-python.md).
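As a sketch, each job type corresponds to a `Command.Name` value in the `CreateJob` API. The following dictionaries illustrate the mapping; the Amazon S3 script locations are placeholders, not real paths:

```python
# Each job type maps to a Command.Name value in the AWS Glue CreateJob API.
# The S3 script locations below are placeholders.
spark_etl_command = {
    "Name": "glueetl",
    "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/etl_job.py",
    "PythonVersion": "3",
}
streaming_command = {
    "Name": "gluestreaming",
    "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/streaming_job.py",
    "PythonVersion": "3",
}
python_shell_command = {
    "Name": "pythonshell",
    "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/shell_job.py",
    "PythonVersion": "3.9",
}
```

You would pass one of these dictionaries as the `Command` argument when creating a job with the AWS SDK or CLI; see the Jobs API reference for the full request shape.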

**AWS Glue version**  
AWS Glue version determines the versions of Apache Spark and Python that are available to the job, as specified in the following table.      
<a name="table-glue-versions"></a>[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/add-job.html)
Jobs that are created without specifying an AWS Glue version default to AWS Glue 5.1.

**Language**  
The code in the ETL script defines your job's logic. The script can be coded in Python or Scala. You can choose whether the script that the job runs is generated by AWS Glue or provided by you. You provide the script name and location in Amazon Simple Storage Service (Amazon S3). Confirm that there isn't a file with the same name as the script directory in the path. To learn more about writing scripts, see [AWS Glue programming guide](edit-script.md).

**Worker type**  
The resources available on AWS Glue workers are measured in DPUs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. The following worker types are available:  
+ **G.025X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 0.25 DPU (2 vCPUs, 4 GB of memory) with 84GB disk (approximately 34GB free). We recommend this worker type for low volume streaming jobs. This worker type is only available for AWS Glue version 3.0 or later streaming jobs.
+ **G.1X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 1 DPU (4 vCPUs, 16 GB of memory) with 94GB disk (approximately 44GB free). We recommend this worker type for workloads such as data transforms, joins, and queries; it offers a scalable and cost-effective way to run most jobs.
+ **G.2X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 2 DPU (8 vCPUs, 32 GB of memory) with 138GB disk (approximately 78GB free). We recommend this worker type for workloads such as data transforms, joins, and queries; it offers a scalable and cost-effective way to run most jobs.
+ **G.4X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 4 DPU (16 vCPUs, 64 GB of memory) with 256GB disk (approximately 230GB free). We recommend this worker type for jobs whose workloads contain your most demanding transforms, aggregations, joins, and queries. 
+ **G.8X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 8 DPU (32 vCPUs, 128 GB of memory) with 512GB disk (approximately 485GB free). We recommend this worker type for jobs whose workloads contain your most demanding transforms, aggregations, joins, and queries.
+ **G.12X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 12 DPU (48 vCPUs, 192 GB of memory) with 768GB disk (approximately 741GB free). We recommend this worker type for jobs with very large and resource-intensive workloads that require significant compute capacity. 
+ **G.16X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 16 DPU (64 vCPUs, 256 GB of memory) with 1024GB disk (approximately 996GB free). We recommend this worker type for jobs with the largest and most resource-intensive workloads that require maximum compute capacity. 
+ **R.1X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 1 DPU with memory-optimized configuration. We recommend this worker type for memory-intensive workloads that frequently encounter out-of-memory errors or require high memory-to-CPU ratios. 
+ **R.2X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 2 DPU with memory-optimized configuration. We recommend this worker type for memory-intensive workloads that frequently encounter out-of-memory errors or require high memory-to-CPU ratios. 
+ **R.4X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 4 DPU with memory-optimized configuration. We recommend this worker type for large memory-intensive workloads that frequently encounter out-of-memory errors or require high memory-to-CPU ratios. 
+ **R.8X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 8 DPU with memory-optimized configuration. We recommend this worker type for very large memory-intensive workloads that frequently encounter out-of-memory errors or require high memory-to-CPU ratios. 
**Worker Type Specifications**  
The following table provides detailed specifications for all available G worker types:    
**G Worker Type Specifications**    
<a name="table-worker-specifications"></a>[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/add-job.html)
**Important:** G.12X and G.16X worker types, as well as all R worker types (R.1X through R.8X), have higher startup latency.  
You are charged an hourly rate based on the number of DPUs used to run your ETL jobs. For more information, see the [AWS Glue pricing page](https://aws.amazon.com/glue/pricing/).  
For AWS Glue version 1.0 or earlier jobs, when you configure a job using the console and specify a **Worker type** of **Standard**, the **Maximum capacity** is set and the **Number of workers** becomes the value of **Maximum capacity** - 1. If you use the AWS Command Line Interface (AWS CLI) or AWS SDK, you can specify the **Max capacity** parameter, or you can specify both **Worker type** and the **Number of workers**.  
For AWS Glue version 2.0 or later jobs, you cannot specify a **Maximum capacity**. Instead, you should specify a **Worker type** and the **Number of workers**.  
**G.4X** and **G.8X** worker types are available only for AWS Glue version 3.0 or later Spark ETL jobs in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Spain), Europe (Stockholm), and South America (São Paulo).  
**G.12X**, **G.16X**, and **R.1X** through **R.8X** worker types are available only for AWS Glue version 4.0 or later Spark ETL jobs in the following AWS Regions: US East (N. Virginia), US West (Oregon), US East (Ohio), Europe (Ireland), and Europe (Frankfurt). Additional regions will be supported in future releases.
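Because billing is based on DPU-hours, it can help to see the arithmetic. The following sketch uses the DPU ratings from the G worker types listed above; the helper functions themselves are illustrative, not part of any AWS API:

```python
# DPU per worker for the G worker types described above.
DPU_PER_WORKER = {
    "G.025X": 0.25, "G.1X": 1, "G.2X": 2, "G.4X": 4,
    "G.8X": 8, "G.12X": 12, "G.16X": 16,
}

def total_dpus(worker_type: str, num_workers: int) -> float:
    """Total DPUs allocated for a job run with this configuration."""
    return DPU_PER_WORKER[worker_type] * num_workers

def dpu_hours(worker_type: str, num_workers: int, hours: float) -> float:
    """DPU-hours consumed by a run, the unit that hourly billing is based on."""
    return total_dpus(worker_type, num_workers) * hours
```

For example, 10 `G.2X` workers (2 DPU each) running for 30 minutes consume `dpu_hours("G.2X", 10, 0.5)` = 10.0 DPU-hours.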

**Requested number of workers**  
For most worker types, you must specify the number of workers that are allocated when the job runs. 

**Job bookmark**  
Specify how AWS Glue processes state information when the job runs. You can have it remember previously processed data, update state information, or ignore state information. For more information, see [Tracking processed data using job bookmarks](monitor-continuations.md).

**Job run queuing**  
Specifies whether job runs are queued to run later when they cannot run immediately due to service quotas.  
When selected, job run queuing is enabled for the job. If not selected, job runs are not considered for queuing.  
If this setting does not match the value set on the job run, then the value from the job run field is used.

**Flex execution**  
When you configure a job using AWS Glue Studio or the API, you may specify a standard or flexible job execution class. Your jobs may have varying degrees of priority and time sensitivity. The standard execution class is ideal for time-sensitive workloads that require fast job startup and dedicated resources.  
The flexible execution class is appropriate for non-urgent jobs such as pre-production jobs, testing, and one-time data loads. Flexible job runs are supported for jobs using AWS Glue version 3.0 or later and `G.1X` or `G.2X` worker types. The new worker types (`G.12X`, `G.16X`, and `R.1X` through `R.8X`) do not support flexible execution.  

[![AWS Videos](http://img.youtube.com/vi/FnHCoTuDLXU/0.jpg)](http://www.youtube.com/watch?v=FnHCoTuDLXU)

Flex job runs are billed based on the number of workers running at any point in time. Workers may be added or removed for a running flexible job run. Instead of billing as a simple calculation of `Max Capacity` * `Execution Time`, each worker contributes for the time it ran during the job run. The bill is the sum of (`Number of DPUs per worker` * `time each worker ran`).  
For more information, see the help panel in AWS Glue Studio, or [Jobs](aws-glue-api-jobs-job.md) and [Job runs](aws-glue-api-jobs-runs.md).
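The flexible billing rule can be sketched as a sum over workers. This is illustrative arithmetic only; the worker runtimes below are made-up values:

```python
def flex_dpu_hours(dpus_per_worker: float, worker_hours: list[float]) -> float:
    # Flex billing: each worker contributes only for the time it actually ran,
    # rather than Max Capacity * total execution time.
    return sum(dpus_per_worker * hours for hours in worker_hours)

# Three G.1X workers (1 DPU each) that ran 1.0, 0.5, and 0.25 hours:
bill_units = flex_dpu_hours(1.0, [1.0, 0.5, 0.25])  # 1.75 DPU-hours
```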

**Number of retries**  
Specify the number of times, from 0 to 10, that AWS Glue should automatically restart the job if it fails. Jobs that reach the timeout limit are not restarted.

**Job timeout**  
Sets the maximum execution time in minutes. The maximum is 7 days (10,080 minutes); job runs that exceed this limit are stopped with an exception.  
When the value is left blank, the timeout defaults to 2,880 minutes.  
Any existing AWS Glue job with a timeout value greater than 7 days is defaulted to 7 days. For instance, if you specified a timeout of 20 days for a batch job, it is stopped on the 7th day.  
**Best practices for job timeouts**  
Jobs are billed based on execution time. To avoid unexpected charges, configure timeout values appropriate for the expected execution time of your job. 

**Advanced Properties**    
**Script filename**  
A unique script name for your job. Cannot be named **Untitled job**.  
**Script path**  
The Amazon S3 location of the script. The path must be in the form `s3://bucket/prefix/path/`. It must end with a slash (`/`) and not include any files.  
**Job metrics**  
Turn on or turn off the creation of Amazon CloudWatch metrics when this job runs. To see profiling data, you must enable this option. For more information about how to turn on and visualize metrics, see [Job monitoring and debugging](monitor-profile-glue-job-cloudwatch-metrics.md).   
**Job observability metrics**  
Turn on the creation of additional observability CloudWatch metrics when this job runs. For more information, see [Monitoring with AWS Glue Observability metrics](monitor-observability.md).  
**Continuous logging**  
Turn on continuous logging to Amazon CloudWatch. If this option is not enabled, logs are available only after the job completes. For more information, see [Logging for AWS Glue jobs](monitor-continuous-logging.md).  
**Spark UI**  
Turn on the use of Spark UI for monitoring this job. For more information, see [Enabling the Apache Spark web UI for AWS Glue jobs](monitor-spark-ui-jobs.md).   
**Spark UI logs path**  
The path to write logs when Spark UI is enabled.  
**Spark UI logging and monitoring configuration**  
Choose one of the following options:  
+ *Standard*: write logs using the AWS Glue job run ID as the filename. Turn on Spark UI monitoring in the AWS Glue console.
+ *Legacy*: write logs using 'spark-application-{timestamp}' as the filename. Do not turn on Spark UI monitoring.
+ *Standard and legacy*: write logs to both the standard and legacy locations. Turn on Spark UI monitoring in the AWS Glue console.  
**Maximum concurrency**  
Sets the maximum number of concurrent runs that are allowed for this job. The default is 1. An error is returned when this threshold is reached. The maximum value you can specify is controlled by a service limit. For example, if a previous run of a job is still running when a new instance is started, you might want to return an error to prevent two instances of the same job from running concurrently.   
**Temporary path**  
Provide the location of a working directory in Amazon S3 where temporary intermediate results are written when AWS Glue runs the script. Confirm that there isn't a file with the same name as the temporary directory in the path. This directory is used when AWS Glue reads and writes to Amazon Redshift and by certain AWS Glue transforms.  
AWS Glue creates a temporary bucket for jobs if a bucket doesn't already exist in a region. This bucket might permit public access. You can either modify the bucket in Amazon S3 to set the public access block, or delete the bucket later after all jobs in that region have completed.  
**Delay notification threshold (minutes)**  
Sets the threshold (in minutes) before a delay notification is sent. You can set this threshold to send notifications when a `RUNNING`, `STARTING`, or `STOPPING` job run takes more than an expected number of minutes.  
**Security configuration**  
Choose a security configuration from the list. A security configuration specifies how the data at the Amazon S3 target is encrypted: no encryption, server-side encryption with AWS KMS-managed keys (SSE-KMS), or Amazon S3-managed encryption keys (SSE-S3).  
**Server-side encryption**  
If you select this option, when the ETL job writes to Amazon S3, the data is encrypted at rest using SSE-S3 encryption. Both your Amazon S3 data target and any data that is written to an Amazon S3 temporary directory is encrypted. This option is passed as a job parameter. For more information, see [Protecting Data Using Server-Side Encryption with Amazon S3-Managed Encryption Keys (SSE-S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingServerSideEncryption.html) in the *Amazon Simple Storage Service User Guide*.  
This option is ignored if a security configuration is specified.  
**Use Glue data catalog as the Hive metastore**  
Select to use the AWS Glue Data Catalog as the Hive metastore. The IAM role used for the job must have the `glue:CreateDatabase` permission. A database called “default” is created in the Data Catalog if it does not exist.

**Connections**  
Choose a VPC configuration to access Amazon S3 data sources located in your virtual private cloud (VPC). You can create and manage network connections in AWS Glue. For more information, see [Connecting to data](glue-connections.md). 

**Libraries**    
**Python library path, Dependent JARs path, and Referenced files path**  
Specify these options if your script requires them. You can define the comma-separated Amazon S3 paths for these options when you define the job. You can override these paths when you run the job. For more information, see [Providing your own custom scripts](console-custom-created.md).  
**Job parameters**  
A set of key-value pairs that are passed as named parameters to the script. These are default values that are used when the script is run, but you can override them in triggers or when you run the job. You must prefix the key name with `--`; for example: `--myKey`. You pass job parameters as a map when using the AWS Command Line Interface.  
For examples, see Python parameters in [Passing and accessing Python parameters in AWS Glue](aws-glue-programming-python-calling.md#aws-glue-programming-python-calling-parameters).  
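In the script, these parameters are typically read with `getResolvedOptions` from `awsglue.utils`, which requires the Glue runtime. The following is a rough local stand-in that only illustrates the `--key value` convention, not the real implementation:

```python
def resolve_options(argv, option_names):
    # Simplified stand-in for awsglue.utils.getResolvedOptions:
    # pick out "--name value" pairs from the argument list.
    resolved = {}
    for name in option_names:
        i = argv.index("--" + name)
        resolved[name] = argv[i + 1]
    return resolved

# A job started with --myKey myValue would see roughly this argument list:
argv = ["my_script.py", "--JOB_NAME", "my-job", "--myKey", "myValue"]
opts = resolve_options(argv, ["JOB_NAME", "myKey"])
```

In a real Glue script you would call `getResolvedOptions(sys.argv, ["JOB_NAME", "myKey"])` instead.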
**Tags**  
Tag your job with a **Tag key** and an optional **Tag value**. After tag keys are created, they are read-only. Use tags on some resources to help you organize and identify them. For more information, see [AWS tags in AWS Glue](monitor-tags.md). 

## Restrictions for jobs that access Lake Formation managed tables
<a name="lf-table-restrictions"></a>

Keep in mind the following notes and restrictions when creating jobs that read from or write to tables managed by AWS Lake Formation: 
+ The following features are not supported in jobs that access tables with cell-level filters:
  + [Job bookmarks](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html) and [bounded execution](https://docs.aws.amazon.com/glue/latest/dg/bounded-execution.html)
  + [Push-down predicates](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html#aws-glue-programming-etl-partitions-pushdowns)
  + [Server-side catalog partition predicates](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html#aws-glue-programming-etl-partitions-cat-predicates)
  + [enableUpdateCatalog](https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html)

# Editing Spark scripts in the AWS Glue console
<a name="edit-script-spark"></a>

A script contains the code that extracts data from sources, transforms it, and loads it into targets. AWS Glue runs a script when it starts a job.

AWS Glue ETL scripts can be coded in Python or Scala. Python scripts use a language that is an extension of the PySpark Python dialect for extract, transform, and load (ETL) jobs. The script contains *extended constructs* to deal with ETL transformations. When you automatically generate the source code logic for your job, a script is created. You can edit this script, or you can provide your own script to process your ETL work.

 For information about defining and editing scripts in AWS Glue, see [AWS Glue programming guide](edit-script.md).

## Additional libraries or files
<a name="w2aac37c11c12c13b9"></a>

If your script requires additional libraries or files, you can specify them as follows:

**Python library path**  
Comma-separated Amazon Simple Storage Service (Amazon S3) paths to Python libraries that are required by the script.  
Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.

**Dependent jars path**  
Comma-separated Amazon S3 paths to JAR files that are required by the script.  
Currently, only pure Java or Scala (2.11) libraries can be used.

**Referenced files path**  
Comma-separated Amazon S3 paths to additional files (for example, configuration files) that are required by the script.
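When you create a job through the API or CLI, these three paths map to reserved job arguments. A sketch with placeholder paths:

```python
# Reserved job arguments for libraries and files (all paths are placeholders).
default_arguments = {
    "--extra-py-files": "s3://amzn-s3-demo-bucket/libs/helpers.py,s3://amzn-s3-demo-bucket/libs/util.py",
    "--extra-jars": "s3://amzn-s3-demo-bucket/jars/dependency.jar",
    "--extra-files": "s3://amzn-s3-demo-bucket/config/settings.json",
}
```

This dictionary would be passed as `DefaultArguments` in a `CreateJob` request, and the values can be overridden when you run the job.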

# Jobs (legacy)
<a name="console-edit-script"></a>

A script contains the code that performs extract, transform, and load (ETL) work. You can provide your own script, or AWS Glue can generate a script with guidance from you. For information about creating your own scripts, see [Providing your own custom scripts](console-custom-created.md).

You can edit a script in the AWS Glue console. When you edit a script, you can add sources, targets, and transforms.

**To edit a script**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). Then choose the **Jobs** tab.

1. Choose a job in the list, and then choose **Action**, **Edit script** to open the script editor.

   You can also access the script editor from the job details page. Choose the **Script** tab, and then choose **Edit script**.

   

## Script editor
<a name="console-edit-script-editor"></a>

The AWS Glue script editor lets you insert, modify, and delete sources, targets, and transforms in your script. The script editor displays both the script and a diagram to help you visualize the flow of data.

To create a diagram for the script, choose **Generate diagram**. AWS Glue uses annotation lines in the script beginning with `##` to render the diagram. To correctly represent your script in the diagram, you must keep the parameters in the annotations and the parameters in the Apache Spark code in sync.

The script editor lets you add code templates wherever your cursor is positioned in the script. At the top of the editor, choose from the following options:
+ To add a source table to the script, choose **Source**.
+ To add a target table to the script, choose **Target**.
+ To add a target location to the script, choose **Target location**.
+ To add a transform to the script, choose **Transform**. For information about the functions that are called in your script, see [Program AWS Glue ETL scripts in PySpark](aws-glue-programming-python.md).
+ To add a Spigot transform to the script, choose **Spigot**.

In the inserted code, modify the `parameters` in both the annotations and Apache Spark code. For example, if you add a **Spigot** transform, verify that the `path` is replaced in both the `@args` annotation line and the `output` code line.

The **Logs** tab shows the logs that are associated with your job as it runs. The most recent 1,000 lines are displayed.

The **Schema** tab shows the schema of the selected sources and targets, when available in the Data Catalog. 

# Tracking processed data using job bookmarks
<a name="monitor-continuations"></a>

AWS Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a *job bookmark*. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data. With job bookmarks, you can process new data when rerunning on a scheduled interval. A job bookmark is composed of the states for various elements of jobs, such as sources, transformations, and targets. For example, your ETL job might read new partitions in an Amazon S3 file. AWS Glue tracks which partitions the job has processed successfully to prevent duplicate processing and duplicate data in the job's target data store.

Job bookmarks are implemented for JDBC data sources, the Relationalize transform, and some Amazon Simple Storage Service (Amazon S3) sources. The following table lists the Amazon S3 source formats that AWS Glue supports for job bookmarks.


| AWS Glue version | Amazon S3 source formats | 
| --- | --- | 
| Version 0.9 | JSON, CSV, Apache Avro, XML | 
| Version 1.0 and later | JSON, CSV, Apache Avro, XML, Parquet, ORC | 

For information about AWS Glue versions, see [Defining job properties for Spark jobs](add-job.md#create-job).

The job bookmarks feature has additional functionalities when accessed through AWS Glue scripts. When browsing your generated script, you may see transformation contexts, which are related to this feature. For more information, see [Using job bookmarks](programming-etl-connect-bookmarks.md).

**Topics**
+ [Using job bookmarks in AWS Glue](#monitor-continuations-implement)
+ [Operational details of the job bookmarks feature](#monitor-continuations-script)

## Using job bookmarks in AWS Glue
<a name="monitor-continuations-implement"></a>

The job bookmark option is passed as a parameter when the job is started. The following table describes the options for setting job bookmarks on the AWS Glue console.


****  

| Job bookmark | Description | 
| --- | --- | 
| Enable | Causes the job to update the state after a run to keep track of previously processed data. If your job has a source with job bookmark support, it will keep track of processed data, and when a job runs, it processes new data since the last checkpoint. | 
| Disable | Job bookmarks are not used, and the job always processes the entire dataset. You are responsible for managing the output from previous job runs. This is the default. | 
| Pause |  Process incremental data since the last successful run, or the data in the range identified by the following sub-options, without updating the state of the last bookmark. You are responsible for managing the output from previous job runs. The two sub-options are: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html) The job bookmark state is not updated when this option is specified. The sub-options are optional; however, when used, both sub-options must be provided.  | 

For details about the parameters passed to a job on the command line, and specifically for job bookmarks, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

For Amazon S3 input sources, AWS Glue job bookmarks check the last modified time of the objects to verify which objects need to be reprocessed. If your input source data has been modified since your last job run, the files are reprocessed when you run the job again.

For JDBC sources, the following rules apply:
+ For each table, AWS Glue uses one or more columns as bookmark keys to determine new and processed data. The bookmark keys combine to form a single compound key.
+ AWS Glue by default uses the primary key as the bookmark key, provided that it is sequentially increasing or decreasing (with no gaps).
+ You can specify the columns to use as bookmark keys in your AWS Glue script. For more information about using Job bookmarks in AWS Glue scripts, see [Using job bookmarks](programming-etl-connect-bookmarks.md).
+ AWS Glue doesn't support using columns with case-sensitive names as job bookmark keys.
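For example, the bookmark key columns for a JDBC source can be passed through `additional_options`. The sketch below uses a hypothetical `order_id` column; the surrounding call is shown as a comment because it requires the Glue runtime:

```python
# Hypothetical bookmark key configuration for a JDBC source.
additional_options = {
    "jobBookmarkKeys": ["order_id"],    # column(s) to use as the bookmark key
    "jobBookmarkKeysSortOrder": "asc",  # "asc" or "desc"
}

# In a Glue script this would be passed to the source, for example:
# datasource0 = glueContext.create_dynamic_frame.from_catalog(
#     database="mydb", table_name="orders",
#     transformation_ctx="datasource0",
#     additional_options=additional_options)
```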

You can rewind your job bookmarks for your AWS Glue Spark ETL jobs to any previous job run. You can support data backfilling scenarios better by rewinding your job bookmarks to any previous job run, resulting in the subsequent job run reprocessing data only from the bookmarked job run.

If you intend to reprocess all the data using the same job, reset the job bookmark. To reset the job bookmark state, use the AWS Glue console, the [ResetJobBookmark action (Python: reset_job_bookmark)](aws-glue-api-jobs-runs.md#aws-glue-api-jobs-runs-ResetJobBookmark) API operation, or the AWS CLI. For example, enter the following command using the AWS CLI:

```
aws glue reset-job-bookmark --job-name my-job-name
```

When you rewind or reset a bookmark, AWS Glue does not clean the target files because there could be multiple targets and targets are not tracked with job bookmarks. Only source files are tracked with job bookmarks. You can create different output targets when rewinding and reprocessing the source files to avoid duplicate data in your output.

AWS Glue keeps track of job bookmarks by job. If you delete a job, the job bookmark is deleted.

In some cases, you might have enabled AWS Glue job bookmarks but your ETL job is reprocessing data that was already processed in an earlier run. For information about resolving common causes of this error, see [Troubleshooting Glue common setup errors](glue-troubleshooting-errors.md).

## Operational details of the job bookmarks feature
<a name="monitor-continuations-script"></a>

This section describes more of the operational details of using job bookmarks.

Job bookmarks store the states for a job. Each instance of the state is keyed by a job name and a version number. When a script invokes `job.init`, it retrieves its state and always gets the latest version. Within a state, there are multiple state elements, which are specific to each source, transformation, and sink instance in the script. These state elements are identified by a transformation context that is attached to the corresponding element (source, transformation, or sink) in the script. The state elements are saved atomically when `job.commit` is invoked from the user script. The script gets the job name and the control option for the job bookmarks from the arguments.

The state elements in the job bookmark are source, transformation, or sink-specific data. For example, suppose that you want to read incremental data from an Amazon S3 location that is being constantly written to by an upstream job or process. In this case, the script must determine what has been processed so far. The job bookmark implementation for the Amazon S3 source saves information so that when the job runs again, it can filter only the new objects using the saved information and recompute the state for the next run of the job. A timestamp is used to filter the new files.

In addition to the state elements, job bookmarks have a *run number*, an *attempt number*, and a *version number*. The run number is a monotonically increasing number that is incremented for every successful run of the job. The attempt number records the attempts for a job run, and is incremented only when there is a run after a failed attempt. The version number increases monotonically and tracks the updates to the job bookmark.
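A toy model of how the three counters move, purely illustrative and not the service implementation (the reset-per-run behavior of the attempt number is an assumption):

```python
class BookmarkCounters:
    """Toy model of a job bookmark's run, attempt, and version numbers."""

    def __init__(self):
        self.run_number = 0
        self.attempt_number = 0
        self.version = 0

    def record_success(self):
        self.run_number += 1      # incremented for every successful run
        self.attempt_number = 0   # assumed to reset for the next run
        self.version += 1         # every bookmark update bumps the version

    def record_retry_after_failure(self):
        self.attempt_number += 1  # incremented only for a run after a failed attempt
        self.version += 1

bm = BookmarkCounters()
bm.record_success()              # run 1 succeeds
bm.record_retry_after_failure()  # the next run fails once and is retried
```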

In the AWS Glue service database, the bookmark states for all the transformations are stored together as key-value pairs:

```
{
  "job_name" : ...,
  "run_id": ...,
  "run_number": ...,
  "attempt_number": ...,
  "states": {
    "transformation_ctx1" : {
      bookmark_state1
    },
    "transformation_ctx2" : {
      bookmark_state2
    }
  }
}
```

**Best practices**  
The following are best practices for using job bookmarks.
+ *Do not change the data source property with the bookmark enabled*. For example, there is a datasource0 pointing to an Amazon S3 input path A, and the job has been reading from a source which has been running for several rounds with the bookmark enabled. If you change the input path of datasource0 to Amazon S3 path B without changing the `transformation_ctx`, the AWS Glue job will use the old bookmark state stored. That will result in missing or skipping files in the input path B as AWS Glue would assume that those files had been processed in previous runs. 
+ *Use a catalog table with bookmarks for better partition management*. Bookmarks work for data sources both from the Data Catalog and from options. However, it's difficult to remove or add new partitions with the from-options approach. Using a catalog table with crawlers can provide better automation to track newly added [partitions](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html#tables-partition) and gives you the flexibility to select particular partitions with a [pushdown predicate](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html).
+ *Use the [AWS Glue Amazon S3 file lister](https://aws.amazon.com/premiumsupport/knowledge-center/glue-oom-java-heap-space-error/) for large datasets*. A bookmark lists all files under each input partition and does the filtering, so if there are too many files under a single partition the bookmark can cause a driver out-of-memory (OOM) error. Use the AWS Glue Amazon S3 file lister to avoid listing all files in memory at once.
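The timestamp-based filtering described above can be sketched in plain Python (illustrative only; the function and variable names are invented, and the real bookmark implementation tracks more state than a single timestamp):

```python
from datetime import datetime, timezone

# Sketch of bookmark-style filtering: keep only objects modified
# after the timestamp saved by the previous run.
def filter_new_objects(objects, last_bookmark_ts):
    """objects: list of (key, last_modified) tuples."""
    return [key for key, last_modified in objects
            if last_modified > last_bookmark_ts]

previous_run = datetime(2024, 1, 1, tzinfo=timezone.utc)
listing = [
    ("raw/old.json", datetime(2023, 12, 31, tzinfo=timezone.utc)),
    ("raw/new.json", datetime(2024, 1, 2, tzinfo=timezone.utc)),
]
print(filter_new_objects(listing, previous_run))  # ['raw/new.json']
```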

# Storing Spark shuffle data
<a name="monitor-spark-shuffle-manager"></a>

Shuffling is an important step in a Spark job whenever data is rearranged between partitions. It is required because wide transformations such as `join`, `groupByKey`, `reduceByKey`, and `repartition` need information from other partitions to complete processing. Spark gathers the required data from each partition and combines it into a new partition. During a shuffle, data is written to disk and transferred across the network. As a result, the shuffle operation is bound by local disk capacity. Spark throws a `No space left on device` or `MetadataFetchFailedException` error when there is not enough disk space left on the executor, and there is no recovery.

**Note**  
The AWS Glue Spark shuffle plugin with Amazon S3 is supported only for AWS Glue ETL jobs. 

**Solution**  
With AWS Glue, you can now use Amazon S3 to store Spark shuffle data. Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. This solution disaggregates compute and storage for your Spark jobs, and gives complete elasticity and low-cost shuffle storage, allowing you to run your most shuffle-intensive workloads reliably.

![\[Spark workflow showing Map and Reduce stages using Amazon S3 for shuffle data storage.\]](http://docs.aws.amazon.com/glue/latest/dg/images/gs-s3-shuffle-diagram.png)


AWS Glue provides a Cloud Shuffle Storage Plugin for Apache Spark that uses Amazon S3. You can turn on Amazon S3 shuffling to run your AWS Glue jobs reliably without failures if they are known to be bound by local disk capacity during large shuffle operations. In some cases, shuffling to Amazon S3 is marginally slower than shuffling to local disk (or Amazon EBS), particularly if a large number of small partitions or shuffle files are written to Amazon S3.

## Prerequisites for using Cloud Shuffle Storage Plugin
<a name="monitor-spark-shuffle-manager-prereqs"></a>

 In order to use the Cloud Shuffle Storage Plugin with AWS Glue ETL jobs, you need the following: 
+ An Amazon S3 bucket located in the same region as your job run, for storing the intermediate shuffle and spilled data. The Amazon S3 prefix of shuffle storage can be specified with `--conf spark.shuffle.glue.s3ShuffleBucket=s3://shuffle-bucket/prefix/`, as in the following example:

  ```
  --conf spark.shuffle.glue.s3ShuffleBucket=s3://glue-shuffle-123456789-us-east-1/glue-shuffle-data/
  ```
+  Set an Amazon S3 storage lifecycle policy on the *prefix* (such as `glue-shuffle-data`), because the shuffle manager does not clean up the files after the job is done. The intermediate shuffle and spilled data should be deleted after a job is finished, so set a short lifecycle policy on the prefix. Instructions for setting up an Amazon S3 lifecycle policy are available at [Setting lifecycle configuration on a bucket](https://docs.aws.amazon.com//AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html) in the *Amazon Simple Storage Service User Guide*.
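For example, a one-day expiration rule scoped to the shuffle prefix could look like the following sketch. The rule shape matches what boto3's `put_bucket_lifecycle_configuration` accepts; the bucket and prefix names are the placeholders used elsewhere on this page.

```python
# Sketch: a lifecycle rule that expires shuffle objects after 1 day.
# The bucket and prefix names are placeholders for your own values.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-glue-shuffle-data",
            "Filter": {"Prefix": "glue-shuffle-data/"},
            "Status": "Enabled",
            "Expiration": {"Days": 1},
        }
    ]
}

# With boto3, this could be applied as (requires AWS credentials):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="glue-shuffle-123456789-us-east-1",
#     LifecycleConfiguration=lifecycle_configuration,
# )
```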

## Using AWS Glue Spark shuffle manager from the AWS console
<a name="monitor-spark-shuffle-manager-using-console"></a>

To set up the AWS Glue Spark shuffle manager using the AWS Glue console or AWS Glue Studio when configuring a job, choose the **--write-shuffle-files-to-s3** job parameter to turn on Amazon S3 shuffling for the job.

![\[Job parameters interface showing --write-shuffle-files- parameter and option to add more.\]](http://docs.aws.amazon.com/glue/latest/dg/images/gs-s3-shuffle.png)


## Using AWS Glue Spark shuffle plugin
<a name="monitor-spark-shuffle-manager-using"></a>

The following job parameters turn on and tune the AWS Glue shuffle manager. These parameters are flags, so any values provided are not considered.
+ `--write-shuffle-files-to-s3` — The main flag, which enables the AWS Glue Spark shuffle manager to use Amazon S3 buckets for writing and reading shuffle data. When the flag is not specified, the shuffle manager is not used.
+ `--write-shuffle-spills-to-s3` — (Supported only on AWS Glue version 2.0). An optional flag that allows you to offload spill files to Amazon S3 buckets, which provides additional resiliency to your Spark job. This is only required for large workloads that spill a lot of data to disk. When the flag is not specified, no intermediate spill files are written.
+ `--conf spark.shuffle.glue.s3ShuffleBucket=s3://<shuffle-bucket>` — Another optional flag that specifies the Amazon S3 bucket where you write the shuffle files. The default is `--TempDir`/shuffle-data. AWS Glue 3.0 and later support writing shuffle files to multiple buckets by specifying buckets with a comma delimiter, as in `--conf spark.shuffle.glue.s3ShuffleBucket=s3://shuffle-bucket-1/prefix,s3://shuffle-bucket-2/prefix/`. Using multiple buckets improves performance. 

You need to provide security configuration settings to enable encryption at rest for the shuffle data. For more information about security configurations, see [Setting up encryption in AWS Glue](set-up-encryption.md). AWS Glue supports all other shuffle-related configurations provided by Spark.
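Put together, the flags above might be supplied as job default arguments like the following sketch (the bucket name is the placeholder used earlier on this page; whether you include the spill flag depends on your AWS Glue version and workload):

```python
# Sketch: job parameters that enable the AWS Glue S3 shuffle plugin.
# The bucket name is a placeholder. A dict of this shape could be
# passed as DefaultArguments when creating a job with an AWS SDK.
shuffle_job_arguments = {
    # Main flag: write and read shuffle data in Amazon S3.
    "--write-shuffle-files-to-s3": "true",
    # Optional (AWS Glue 2.0 only): also offload spill files to S3.
    "--write-shuffle-spills-to-s3": "true",
    # Optional: choose the shuffle bucket/prefix explicitly.
    "--conf": (
        "spark.shuffle.glue.s3ShuffleBucket="
        "s3://glue-shuffle-123456789-us-east-1/glue-shuffle-data/"
    ),
}
```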

**Software binaries for the Cloud Shuffle Storage plugin**  
You can also download the software binaries of the Cloud Shuffle Storage Plugin for Apache Spark under the Apache 2.0 license and run it on any Spark environment. The new plugin comes with out-of-the box support for Amazon S3, and can also be easily configured to use other forms of cloud storage such as [Google Cloud Storage and Microsoft Azure Blob Storage](https://github.com/aws-samples/aws-glue-samples/blob/master/docs/cloud-shuffle-plugin/README.md). For more information, see [Cloud Shuffle Storage Plugin for Apache Spark](https://docs.aws.amazon.com/glue/latest/dg/cloud-shuffle-storage-plugin.html).

**Notes and limitations**  
The following are notes or limitations for the AWS Glue shuffle manager:
+  AWS Glue shuffle manager doesn't automatically delete the (temporary) shuffle data files stored in your Amazon S3 bucket after a job is completed. To ensure data protection, follow the instructions in [Prerequisites for using Cloud Shuffle Storage Plugin](#monitor-spark-shuffle-manager-prereqs) before enabling the Cloud Shuffle Storage Plugin. 
+ You can use this feature if your data is skewed.

# Cloud Shuffle Storage Plugin for Apache Spark
<a name="cloud-shuffle-storage-plugin"></a>

The Cloud Shuffle Storage Plugin is an Apache Spark plugin compatible with the [`ShuffleDataIO` API](https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/shuffle/api/ShuffleDataIO.java) that allows storing shuffle data on cloud storage systems (such as Amazon S3). It helps you supplement or replace local disk storage capacity for large shuffle operations, commonly triggered by transformations such as `join`, `reduceByKey`, `groupByKey`, and `repartition` in your Spark applications, thereby reducing common failures and the price/performance dislocation of your serverless data analytics jobs and pipelines.

**AWS Glue**  
AWS Glue versions 3.0 and 4.0 come with the plugin pre-installed and ready to enable shuffling to Amazon S3 without any extra steps. For more information, see [AWS Glue Spark shuffle plugin with Amazon S3](https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-shuffle-manager.html) to enable the feature for your Spark applications.

**Other Spark environments**  
The plugin requires the following Spark configurations to be set on other Spark environments:
+ `--conf spark.shuffle.sort.io.plugin.class=com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin`: This informs Spark to use this plugin for Shuffle IO.
+ `--conf spark.shuffle.storage.path=s3://bucket-name/shuffle-file-dir`: The path where your shuffle files will be stored.

**Note**  
The plugin overwrites one Spark core class. As a result, the plugin jar needs to be loaded before Spark jars. You can do this using `userClassPathFirst` in on-prem YARN environments if the plugin is used outside AWS Glue.

## Bundling the plugin with your Spark applications
<a name="cloud-shuffle-storage-plugin-bundling"></a>

You can bundle the plugin with your Spark applications and Spark distributions (versions 3.1 and above) by adding the plugin dependency in your Maven `pom.xml` while developing your Spark applications locally. For more information on the plugin and Spark versions, see [Plugin versions](#cloud-shuffle-storage-plugin-versions).

```
<repositories>
   ...
    <repository>
        <id>aws-glue-etl-artifacts</id>
        <url>https://aws-glue-etl-artifacts.s3.amazonaws.com/release/</url>
    </repository>
</repositories>
...
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>chopper-plugin</artifactId>
    <version>3.1-amzn-LATEST</version>
</dependency>
```

You can alternatively download the binaries from AWS Glue Maven artifacts directly and include them in your Spark application as follows.

```
#!/bin/bash
sudo wget -v https://aws-glue-etl-artifacts.s3.amazonaws.com/release/com/amazonaws/chopper-plugin/3.1-amzn-LATEST/chopper-plugin-3.1-amzn-LATEST.jar -P /usr/lib/spark/jars/
```

**Example spark-submit**

```
spark-submit --deploy-mode cluster \
--conf spark.shuffle.storage.s3.path=s3://<ShuffleBucket>/<shuffle-dir> \
--conf spark.driver.extraClassPath=<Path to plugin jar> \
--conf spark.executor.extraClassPath=<Path to plugin jar> \
--class <your test class name> s3://<ShuffleBucket>/<Your application jar>
```

## Optional configurations
<a name="cloud-shuffle-storage-plugin-optional"></a>

These are optional configuration values that control Amazon S3 shuffle behavior. 
+ `spark.shuffle.storage.s3.enableServerSideEncryption`: Enable/disable S3 SSE for shuffle and spill files. Default value is `true`.
+ `spark.shuffle.storage.s3.serverSideEncryption.algorithm`: The SSE algorithm to be used. Default value is `AES256`.
+ `spark.shuffle.storage.s3.serverSideEncryption.kms.key`: The KMS key ARN when `aws:kms` SSE is enabled.

Along with these configurations, you may need to set configurations such as `spark.hadoop.fs.s3.enableServerSideEncryption` and **other environment-specific configurations** to ensure appropriate encryption is applied for your use case.

## Plugin versions
<a name="cloud-shuffle-storage-plugin-versions"></a>

This plugin is supported for the Spark versions associated with each AWS Glue version. The following table shows the AWS Glue version, Spark version and associated plugin version with Amazon S3 location for the plugin's software binary.


| AWS Glue version | Spark version | Plugin version | Amazon S3 location | 
| --- | --- | --- | --- | 
| 3.0 | 3.1 | 3.1-amzn-LATEST |  s3://aws-glue-etl-artifacts/release/com/amazonaws/chopper-plugin/3.1-amzn-0/chopper-plugin-3.1-amzn-LATEST.jar  | 
| 4.0 | 3.3 | 3.3-amzn-LATEST |  s3://aws-glue-etl-artifacts/release/com/amazonaws/chopper-plugin/3.3-amzn-0/chopper-plugin-3.3-amzn-LATEST.jar  | 

## License
<a name="cloud-shuffle-storage-plugin-binary-license"></a>

The software binary for this plugin is licensed under the Apache-2.0 License.

# Monitoring AWS Glue Spark jobs
<a name="monitor-spark"></a>

**Topics**
+ [Spark Metrics available in AWS Glue Studio](#console-jobs-details-metrics-spark)
+ [Monitoring jobs using the Apache Spark web UI](monitor-spark-ui.md)
+ [Monitoring with AWS Glue job run insights](monitor-job-insights.md)
+ [Monitoring with Amazon CloudWatch](monitor-cloudwatch.md)
+ [Job monitoring and debugging](monitor-profile-glue-job-cloudwatch-metrics.md)

## Spark Metrics available in AWS Glue Studio
<a name="console-jobs-details-metrics-spark"></a>

The **Metrics** tab shows metrics collected when a job runs and profiling is turned on. The following graphs are shown in Spark jobs: 
+ ETL Data Movement
+ Memory Profile: Driver and Executors

Choose **View additional metrics** to show the following graphs:
+ ETL Data Movement
+ Memory Profile: Driver and Executors
+ Data Shuffle Across Executors
+ CPU Load: Driver and Executors
+ Job Execution: Active Executors, Completed Stages & Maximum Needed Executors

Data for these graphs is pushed to CloudWatch metrics if the job is configured to collect metrics. For more information about how to turn on metrics and interpret the graphs, see [Job monitoring and debugging](monitor-profile-glue-job-cloudwatch-metrics.md). 

**Example ETL data movement graph**  
The ETL Data Movement graph shows the following metrics:  
+ The number of bytes read from Amazon S3 by all executors—[`glue.ALL.s3.filesystem.read_bytes`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.read_bytes)
+ The number of bytes written to Amazon S3 by all executors—[`glue.ALL.s3.filesystem.write_bytes`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.write_bytes)

![\[The graph for ETL Data Movement in the Metrics tab of the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job_detailed_etl.png)


**Example Memory profile graph**  
The Memory Profile graph shows the following metrics:  
+ The fraction of memory used by the JVM heap (scale: 0–1) for the driver, an executor identified by *executorId*, or all executors—
  + [`glue.driver.jvm.heap.usage`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.jvm.heap.usage)
  + [`glue.executorId.jvm.heap.usage`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.executorId.jvm.heap.usage)
  + [`glue.ALL.jvm.heap.usage`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.jvm.heap.usage)

![\[The graph for Memory Profile in the Metrics tab of the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job_detailed_mem.png)


**Example Data shuffle across executors graph**  
The Data Shuffle Across Executors graph shows the following metrics:  
+ The number of bytes read by all executors to shuffle data between them—[`glue.driver.aggregate.shuffleLocalBytesRead`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.shuffleLocalBytesRead)
+ The number of bytes written by all executors to shuffle data between them—[`glue.driver.aggregate.shuffleBytesWritten`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.shuffleBytesWritten)

![\[The graph for Data Shuffle Across Executors in the Metrics tab of the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job_detailed_data.png)


**Example CPU load graph**  
The CPU Load graph shows the following metrics:  
+ The fraction of CPU system load used (scale: 0–1) by the driver, an executor identified by *executorId*, or all executors—
  + [`glue.driver.system.cpuSystemLoad`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.system.cpuSystemLoad)
  + [`glue.executorId.system.cpuSystemLoad`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.executorId.system.cpuSystemLoad)
  + [`glue.ALL.system.cpuSystemLoad`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.system.cpuSystemLoad)

![\[The graph for CPU Load in the Metrics tab of the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job_detailed_cpu.png)


**Example Job execution graph**  
The Job Execution graph shows the following metrics:  
+ The number of actively running executors—[`glue.driver.ExecutorAllocationManager.executors.numberAllExecutors`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.ExecutorAllocationManager.executors.numberAllExecutors)
+ The number of completed stages—[`glue.driver.aggregate.numCompletedStages`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.numCompletedStages)
+ The number of maximum needed executors—[`glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors)

![\[The graph for Job Execution in the Metrics tab of the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job_detailed_exec.png)


# Monitoring jobs using the Apache Spark web UI
<a name="monitor-spark-ui"></a>

You can use the Apache Spark web UI to monitor and debug AWS Glue ETL jobs running on the AWS Glue job system, and also Spark applications running on AWS Glue development endpoints. The Spark UI enables you to check the following for each job:
+ The event timeline of each Spark stage
+ A directed acyclic graph (DAG) of the job
+ Physical and logical plans for SparkSQL queries
+ The underlying Spark environmental variables for each job

For more information about using the Spark Web UI, see [Web UI](https://spark.apache.org/docs/3.3.0/web-ui.html) in the Spark documentation. For guidance on how to interpret Spark UI results to improve the performance of your job, see [Best practices for performance tuning AWS Glue for Apache Spark jobs](https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-aws-glue-for-apache-spark/introduction.html) in AWS Prescriptive Guidance.

 You can see the Spark UI in the AWS Glue console. This is available when an AWS Glue job runs on AWS Glue 3.0 or later versions with logs generated in the Standard (rather than legacy) format, which is the default for newer jobs. If you have log files greater than 0.5 GB, you can enable rolling log support for job runs on AWS Glue 4.0 or later versions to simplify log archiving, analysis, and troubleshooting.

You can enable the Spark UI by using the AWS Glue console or the AWS Command Line Interface (AWS CLI). When you enable the Spark UI, AWS Glue ETL jobs and Spark applications on AWS Glue development endpoints can back up Spark event logs to a location that you specify in Amazon Simple Storage Service (Amazon S3). You can use the backed-up event logs in Amazon S3 with the Spark UI, both in real time as the job is running and after the job is complete. As long as the logs remain in Amazon S3, you can view them in the Spark UI on the AWS Glue console. 

## Permissions
<a name="monitor-spark-ui-limitations-permissions"></a>

To use the Spark UI in the AWS Glue console, you can use `UseGlueStudio` or add all of the individual service APIs. All of the APIs are needed to use the Spark UI completely; however, users can access individual Spark UI features by adding only the corresponding service APIs to their IAM permissions, for fine-grained access. 

 `RequestLogParsing` is the most critical as it performs the parsing of logs. The remaining APIs are for reading the respective parsed data. For example, `GetStages` provides access to the data about all stages of a Spark job. 

The list of Spark UI service APIs mapped to `UseGlueStudio` is shown in the following sample policy. The policy provides access to Spark UI features only. To add more permissions, such as for Amazon S3 and IAM, see [Creating Custom IAM Policies for AWS Glue Studio](https://docs.aws.amazon.com/glue/latest/dg/getting-started-min-privs.html#getting-started-all-gs-privs.html). 

When using a Spark UI service API, use the following namespace: `glue:<ServiceAPI>`. 

```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Sid": "AllowGlueStudioSparkUI",
      "Effect": "Allow",
      "Action": [
        "glue:RequestLogParsing",
        "glue:GetLogParsingStatus",
        "glue:GetEnvironment",
        "glue:GetJobs",
        "glue:GetJob",
        "glue:GetStage",
        "glue:GetStages",
        "glue:GetStageFiles",
        "glue:BatchGetStageFiles",
        "glue:GetStageAttempt",
        "glue:GetStageAttemptTaskList",
        "glue:GetStageAttemptTaskSummary",
        "glue:GetExecutors",
        "glue:GetExecutorsThreads",
        "glue:GetStorage",
        "glue:GetStorageUnit",
        "glue:GetQueries",
        "glue:GetQuery"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```

## Limitations
<a name="monitor-spark-ui-limitations"></a>
+ Spark UI in the AWS Glue console is not available for job runs that occurred before November 20, 2023, because they are in the legacy log format.
+  Spark UI in the AWS Glue console supports rolling logs for AWS Glue 4.0, such as those generated by default in streaming jobs. The maximum sum of all generated rolled log event files is 2 GB. For AWS Glue jobs without rolled log support, the maximum log event file size supported for SparkUI is 0.5 GB. 
+  Serverless Spark UI is not available for Spark event logs stored in an Amazon S3 bucket that can only be accessed by your VPC. 

## Example: Apache Spark web UI
<a name="monitor-spark-ui-limitations-example"></a>

This example shows you how to use the Spark UI to understand your job's performance. The screenshots show the Spark web UI as provided by a self-managed Spark history server; the Spark UI in the AWS Glue console provides similar views. For more information about using the Spark web UI, see [Web UI](https://spark.apache.org/docs/3.3.0/web-ui.html) in the Spark documentation.

The following is an example of a Spark application that reads from two data sources, performs a join transform, and writes the result to Amazon S3 in Parquet format.

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import count, when, expr, col, sum, isnull
from pyspark.sql.functions import countDistinct
from awsglue.dynamicframe import DynamicFrame
 
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
 
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
 
job = Job(glueContext)
job.init(args['JOB_NAME'])
 
df_persons = spark.read.json("s3://awsglue-datasets/examples/us-legislators/all/persons.json")
df_memberships = spark.read.json("s3://awsglue-datasets/examples/us-legislators/all/memberships.json")
 
df_joined = df_persons.join(df_memberships, df_persons.id == df_memberships.person_id, 'fullouter')
df_joined.write.parquet("s3://aws-glue-demo-sparkui/output/")
 
job.commit()
```

The following DAG visualization shows the different stages in this Spark job.

![\[Screenshot of Spark UI showing 2 completed stages for job 0.\]](http://docs.aws.amazon.com/glue/latest/dg/images/spark-ui1.png)


The following event timeline for a job shows the start, execution, and termination of different Spark executors.

![\[Screenshot of Spark UI showing the completed, failed, and active stages of different Spark executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/spark-ui2.png)


The following screen shows the details of the SparkSQL query plans:
+ Parsed logical plan
+ Analyzed logical plan
+ Optimized logical plan
+ Physical plan for execution

![\[SparkSQL query plans: parsed, analyzed, and optimized logical plan and physical plans for execution.\]](http://docs.aws.amazon.com/glue/latest/dg/images/spark-ui3.png)


**Topics**
+ [Permissions](#monitor-spark-ui-limitations-permissions)
+ [Limitations](#monitor-spark-ui-limitations)
+ [Example: Apache Spark web UI](#monitor-spark-ui-limitations-example)
+ [Enabling the Apache Spark web UI for AWS Glue jobs](monitor-spark-ui-jobs.md)
+ [Launching the Spark history server](monitor-spark-ui-history.md)

# Enabling the Apache Spark web UI for AWS Glue jobs
<a name="monitor-spark-ui-jobs"></a>

You can use the Apache Spark web UI to monitor and debug AWS Glue ETL jobs running on the AWS Glue job system. You can configure the Spark UI using the AWS Glue console or the AWS Command Line Interface (AWS CLI).

Every 30 seconds, AWS Glue backs up the Spark event logs to the Amazon S3 path that you specify.

**Topics**
+ [Configuring the Spark UI (console)](#monitor-spark-ui-jobs-console)
+ [Configuring the Spark UI (AWS CLI)](#monitor-spark-ui-jobs-cli)
+ [Configuring the Spark UI for sessions using Notebooks](#monitor-spark-ui-sessions)
+ [Enable rolling logs](#monitor-spark-ui-rolling-logs)

## Configuring the Spark UI (console)
<a name="monitor-spark-ui-jobs-console"></a>

Follow these steps to configure the Spark UI by using the AWS Management Console. When creating an AWS Glue job, Spark UI is enabled by default.

**To turn on the Spark UI when you create or edit a job**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Jobs**.

1. Choose **Add job**, or select an existing one.

1. In **Job details**, open the **Advanced properties**.

1. Under the **Spark UI** tab, choose **Write Spark UI logs to Amazon S3**.

1. Specify an Amazon S3 path for storing the Spark event logs for the job. Note that if you use a security configuration in the job, the encryption also applies to the Spark UI log file. For more information, see [Encrypting data written by AWS Glue](encryption-security-configuration.md).

1. Under **Spark UI logging and monitoring configuration**:
   + Select **Standard** if you are generating logs to view in the AWS Glue console.
   + Select **Legacy** if you are generating logs to view on a Spark history server.
   + You can also choose to generate both.

## Configuring the Spark UI (AWS CLI)
<a name="monitor-spark-ui-jobs-cli"></a>

To generate logs for viewing with the Spark UI in the AWS Glue console, use the AWS CLI to pass the following job parameters to AWS Glue jobs. For more information, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

```
'--enable-spark-ui': 'true',
'--spark-event-logs-path': 's3://s3-event-log-path'
```

To distribute logs to their legacy locations, set the `--enable-spark-ui-legacy-path` parameter to `"true"`. If you do not want to generate logs in both formats, remove the `--enable-spark-ui` parameter.
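For example, when creating a job with an AWS SDK, the same parameters could be passed as default arguments (a sketch; the S3 path is the placeholder from the snippet above):

```python
# Sketch: default arguments that enable Spark UI event logs in the
# Standard format. The S3 path is a placeholder for your own bucket.
spark_ui_arguments = {
    "--enable-spark-ui": "true",
    "--spark-event-logs-path": "s3://s3-event-log-path",
    # Add '--enable-spark-ui-legacy-path': 'true' to also generate
    # logs in the legacy format for a Spark history server.
}
```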

## Configuring the Spark UI for sessions using Notebooks
<a name="monitor-spark-ui-sessions"></a>

**Warning**  
AWS Glue interactive sessions do not currently support Spark UI in the console. Configure a Spark history server.

 If you use AWS Glue notebooks, set up the Spark UI configuration before starting the session. To do this, use the `%%configure` cell magic: 

```
%%configure {"--enable-spark-ui": "true", "--spark-event-logs-path": "s3://path"}
```

## Enable rolling logs
<a name="monitor-spark-ui-rolling-logs"></a>

 Enabling SparkUI and rolling log event files for AWS Glue jobs provides several benefits: 
+  Rolling Log Event Files – With rolling log event files enabled, AWS Glue generates separate log files for each step of the job execution, making it easier to identify and troubleshoot issues specific to a particular stage or transformation. 
+  Better Log Management – Rolling log event files help in managing log files more efficiently. Instead of having a single, potentially large log file, the logs are split into smaller, more manageable files based on the job execution stages. This can simplify log archiving, analysis, and troubleshooting. 
+  Improved Fault Tolerance – If a AWS Glue job fails or is interrupted, the rolling log event files can provide valuable information about the last successful stage, making it easier to resume the job from that point rather than starting from scratch. 
+  Cost Optimization – By enabling rolling log event files, you can save on storage costs associated with log files. Instead of storing a single, potentially large log file, you store smaller, more manageable log files, which can be more cost-effective, especially for long-running or complex jobs. 

 In a new environment, users can explicitly enable rolling logs through: 

```
'--conf': 'spark.eventLog.rolling.enabled=true'
```

or

```
'--conf': 'spark.eventLog.rolling.enabled=true --conf spark.eventLog.rolling.maxFileSize=128m'
```

 When rolling logs are activated, `spark.eventLog.rolling.maxFileSize` specifies the maximum size of the event log file before it rolls over. If this optional parameter is not specified, the default value is 128 MB. The minimum is 10 MB. 

 The maximum sum of all generated rolled log event files is 2 GB. For AWS Glue jobs without rolling log support, the maximum log event file size supported for SparkUI is 0.5 GB. 

You can turn off rolling logs for a streaming job by passing an additional configuration. Note that very large log files may be costly to maintain.

To turn off rolling logs, provide the following configuration:

```
'--spark-ui-event-logs-path': 'true',
'--conf': 'spark.eventLog.rolling.enabled=false'
```

# Launching the Spark history server
<a name="monitor-spark-ui-history"></a>

You can use a Spark history server to visualize Spark logs on your own infrastructure. You can see the same visualizations in the AWS Glue console for AWS Glue job runs on AWS Glue 4.0 or later versions with logs generated in the Standard (rather than legacy) format. For more information, see [Monitoring jobs using the Apache Spark web UI](monitor-spark-ui.md).

You can launch the Spark history server using an AWS CloudFormation template that hosts the server on an EC2 instance, or launch it locally using Docker.

**Topics**
+ [Launching the Spark history server and viewing the Spark UI using AWS CloudFormation](#monitor-spark-ui-history-cfn)
+ [Launching the Spark history server and viewing the Spark UI using Docker](#monitor-spark-ui-history-local)

## Launching the Spark history server and viewing the Spark UI using AWS CloudFormation
<a name="monitor-spark-ui-history-cfn"></a>

You can use an AWS CloudFormation template to start the Apache Spark history server and view the Spark web UI. These templates are samples that you should modify to meet your requirements.

**To start the Spark history server and view the Spark UI using CloudFormation**

1. Choose one of the **Launch Stack** buttons in the following table. This launches the stack on the CloudFormation console.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html)

1. On the **Specify template** page, choose **Next**.

1. On the **Specify stack details** page, enter the **Stack name**. Enter additional information under **Parameters**.

   1. 

**Spark UI configuration**

      Provide the following information:
      + **IP address range** — The IP address range that can be used to view the Spark UI. If you want to restrict access from a specific IP address range, you should use a custom value. 
      + **History server port** — The port for the Spark UI. You can use the default value.
      + **Event log directory** — Choose the location where Spark event logs are stored from the AWS Glue job or development endpoints. You must use **s3a://** for the event logs path scheme.
      + **Spark package location** — You can use the default value.
      + **Keystore path** — SSL/TLS keystore path for HTTPS. If you want to use a custom keystore file, you can specify the S3 path `s3://path_to_your_keystore_file` here. If you leave this parameter empty, a self-signed certificate based keystore is generated and used.
      + **Keystore password** — Enter an SSL/TLS keystore password for HTTPS.

   1. 

**EC2 instance configuration**

      Provide the following information:
      + **Instance type** — The type of Amazon EC2 instance that hosts the Spark history server. Because this template launches an Amazon EC2 instance in your account, Amazon EC2 costs are charged to your account separately.
      + **Latest AMI ID** — The AMI ID of Amazon Linux 2 for the Spark history server instance. You can use the default value.
      + **VPC ID** — The virtual private cloud (VPC) ID for the Spark history server instance. You can use any of the VPCs available in your account. Using a default VPC with a [default Network ACL](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-network-acls.html#default-network-acl) is not recommended. For more information, see [Default VPC and Default Subnets](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html) and [Creating a VPC](https://docs.aws.amazon.com/vpc/latest/userguide/working-with-vpcs.html#Create-VPC) in the *Amazon VPC User Guide*.
      + **Subnet ID** — The subnet ID for the Spark history server instance. You can use any of the subnets in your VPC. Your client must be able to reach the subnet over the network. If you want to access the server over the internet, use a public subnet that has an internet gateway in its route table.

   1. Choose **Next**.

1. On the **Configure stack options** page, to use the current user credentials for determining how CloudFormation can create, modify, or delete resources in the stack, choose **Next**. You can also specify a role in the **Permissions** section to use instead of the current user permissions, and then choose **Next**.

1. On the **Review** page, review the template. 

   Select **I acknowledge that CloudFormation might create IAM resources**, and then choose **Create stack**.

1. Wait for the stack to be created.

1. Open the **Outputs** tab.

   1. Copy the URL of **SparkUiPublicUrl** if you are using a public subnet.

   1. Copy the URL of **SparkUiPrivateUrl** if you are using a private subnet.

1. Open a web browser, and paste in the URL. This lets you access the server using HTTPS on the specified port. Your browser may not recognize the server's certificate, in which case you have to override its protection and proceed anyway. 
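If you prefer scripting the launch, the same stack can be created with the AWS CLI. This is a sketch, not part of the official procedure: the template file name, parameter keys, bucket name, and CIDR range below are placeholders; check the template you chose in step 1 for its actual parameter names.

```shell
# Placeholders: template file name, parameter keys, bucket, and CIDR range.
aws cloudformation create-stack \
  --stack-name spark-history-server \
  --template-body file://spark-history-server.template.yaml \
  --capabilities CAPABILITY_IAM \
  --parameters \
    ParameterKey=EventLogDir,ParameterValue=s3a://amzn-s3-demo-bucket/spark-logs/ \
    ParameterKey=IpAddressRange,ParameterValue=203.0.113.0/24

# Wait for creation to finish, then read the Spark UI URL from the outputs.
aws cloudformation wait stack-create-complete --stack-name spark-history-server
aws cloudformation describe-stacks --stack-name spark-history-server \
  --query "Stacks[0].Outputs"
```

As in the console procedure, the event log path must use the `s3a://` scheme.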

## Launching the Spark history server and viewing the Spark UI using Docker
<a name="monitor-spark-ui-history-local"></a>

If you prefer local access (rather than running an EC2 instance for the Apache Spark history server), you can also use Docker to start the Apache Spark history server and view the Spark UI locally. This Dockerfile is a sample that you should modify to meet your requirements. 

 **Prerequisites** 

For information about how to install Docker on your laptop, see the [Docker Engine community](https://docs.docker.com/install/).

**To start the Spark history server and view the Spark UI locally using Docker**

1. Download files from GitHub.

   Download the Dockerfile and `pom.xml` from [AWS Glue code samples](https://github.com/aws-samples/aws-glue-samples/tree/master/utilities/Spark_UI/).

1. Determine if you want to use your user credentials or federated user credentials to access AWS.
   + To use the current user credentials for accessing AWS, get the values to use for `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` in the `docker run` command. For more information, see [Managing access keys for IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html) in the *IAM User Guide*.
   + To use SAML 2.0 federated users for accessing AWS, get the values for `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`. For more information, see [Requesting temporary security credentials](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html) in the *IAM User Guide*.

1. Determine the location of your event log directory, to use in the `docker run` command.

1. Build the Docker image using the files in the local directory, using the name `glue/sparkui` and the tag `latest`.

   ```
   $ docker build -t glue/sparkui:latest . 
   ```

1. Create and start the docker container.

   In the following commands, use the values obtained previously in steps 2 and 3.

   1. To create the Docker container using your user credentials, use a command similar to the following:

      ```
      docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_eventlog \
       -Dspark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY_ID -Dspark.hadoop.fs.s3a.secret.key=AWS_SECRET_ACCESS_KEY" \
       -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
      ```

   1. To create the Docker container using temporary credentials, use `org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider` as the provider, and provide the credential values obtained in step 2. For more information, see [Using Session Credentials with TemporaryAWSCredentialsProvider](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Using_Session_Credentials_with_TemporaryAWSCredentialsProvider) in the *Hadoop: Integration with Amazon Web Services* documentation.

      ```
      docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_eventlog \
       -Dspark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY_ID -Dspark.hadoop.fs.s3a.secret.key=AWS_SECRET_ACCESS_KEY \
       -Dspark.hadoop.fs.s3a.session.token=AWS_SESSION_TOKEN \
       -Dspark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider" \
       -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
      ```
**Note**  
These configuration parameters come from the [Hadoop-AWS Module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html). You may need to add specific configuration based on your use case. For example, users in isolated regions need to configure `spark.hadoop.fs.s3a.endpoint`.

1. Open `http://localhost:18080` in your browser to view the Spark UI locally.
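The `spark.history.fs.logDirectory` value in the `docker run` commands above must use the `s3a://` scheme, even though the console shows your event log location with an `s3://` prefix. A minimal shell sketch for rewriting the prefix (the bucket and path here are placeholders):

```shell
# Placeholder path; replace with your actual event log location.
EVENT_LOG_S3="s3://amzn-s3-demo-bucket/spark-ui/eventlog/"
# Strip the s3:// prefix and re-add it as s3a:// for the Hadoop S3A connector.
EVENT_LOG_S3A="s3a://${EVENT_LOG_S3#s3://}"
echo "$EVENT_LOG_S3A"   # s3a://amzn-s3-demo-bucket/spark-ui/eventlog/
```

Pass the resulting `s3a://` path as `path_to_eventlog` in the `docker run` command.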

# Monitoring with AWS Glue job run insights
<a name="monitor-job-insights"></a>

AWS Glue job run insights is a feature in AWS Glue that simplifies job debugging and optimization for your AWS Glue jobs. AWS Glue provides the [Spark UI](https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui.html) and [CloudWatch logs and metrics](https://docs.aws.amazon.com/glue/latest/dg/monitor-cloudwatch.html) for monitoring your AWS Glue jobs. With this feature, you get the following information about your AWS Glue job's execution:
+ Line number of your AWS Glue job script that had a failure.
+ Spark action that executed last in the Spark query plan just before the failure of your job.
+ Spark exception events related to the failure presented in a time-ordered log stream.
+ Root cause analysis and recommended action (such as tuning your script) to fix the issue.
+ Common Spark events (log messages relating to a Spark action) with a recommended action that addresses the root cause.

All these insights are available to you using two new log streams in the CloudWatch logs for your AWS Glue jobs.

## Requirements
<a name="monitor-job-insights-requirements"></a>

The AWS Glue job run insights feature is available for AWS Glue versions 2.0 and above. You can follow the [migration guide](https://docs.aws.amazon.com/glue/latest/dg/migrating-version-30.html) for your existing jobs to upgrade them from older AWS Glue versions.

## Enabling job run insights for an AWS Glue ETL job
<a name="monitor-job-insights-enable"></a>

You can enable job run insights through AWS Glue Studio or the CLI.

### AWS Glue Studio
<a name="monitor-job-insights-requirements"></a>

When creating a job via AWS Glue Studio, you can enable or disable job run insights under the **Job Details** tab. Check that the **Generate job insights** box is selected.

![\[Enabling job run insights in AWS Glue Studio.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-job-run-insights-1.png)


### Command line
<a name="monitor-job-insights-enable-cli"></a>

If creating a job via the CLI, you can start a job run with a single new [job parameter](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html): `--enable-job-insights` set to `true`.

By default, the job run insights log streams are created under the same default log group used by [AWS Glue continuous logging](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging.html), that is, `/aws-glue/jobs/logs-v2/`. You can set up a custom log group name, log filters, and log group configurations using the same set of arguments as for continuous logging. For more information, see [Enabling Continuous Logging for AWS Glue Jobs](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html).
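For example, here is a sketch of starting a run with insights enabled from the AWS CLI. The job name is a placeholder; `--arguments` supplies job parameters for that run only.

```shell
# Placeholder job name; the --arguments JSON maps job parameters to values.
aws glue start-job-run \
  --job-name my-etl-job \
  --arguments '{"--enable-job-insights":"true"}'
```

The same key/value pair can instead be set on the job definition itself so every run is covered.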

## Accessing the job run insights log streams in CloudWatch
<a name="monitor-job-insights-access"></a>

With the job run insights feature enabled, up to two log streams are created when a job run fails. When a job finishes successfully, neither stream is generated.

1. *Exception analysis log stream*: `<job-run-id>-job-insights-rca-driver`. This stream provides the following:
   + Line number of your AWS Glue job script that caused the failure.
   + Spark action that executed last in the Spark query plan (DAG).
   + Concise, time-ordered events from the Spark driver and executors that are related to the exception. You can find details such as complete error messages, the failed Spark task, and its executor ID, which help you focus on the specific executor's log stream for a deeper investigation if needed.

1. *Rule-based insights stream*: `<job-run-id>-job-insights-rule-driver`. This stream provides the following:
   + Root cause analysis and recommendations on how to fix the errors (such as using a specific job parameter to optimize the performance).
   + Relevant Spark events serving as the basis for root cause analysis and a recommended action.

**Note**  
The first stream exists only if Spark exception events are available for a failed job run, and the second stream exists only if insights are available for the failed job run. For example, if your job finishes successfully, neither stream is generated; if your job fails but no service-defined rule matches your failure scenario, then only the first stream is generated.

If the job is created from AWS Glue Studio, the links to the above streams are also available under the job run details tab (Job run insights) as "Concise and consolidated error logs" and "Error analysis and guidance".

![\[The Job Run Details page containing links to the log streams.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-job-run-insights-2.png)
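You can also read the streams directly with the AWS CLI. A sketch, assuming the default log group and a placeholder job run ID:

```shell
# Placeholder run ID; the default insights log group matches continuous logging.
RUN_ID=jr_0123456789abcdef
aws logs get-log-events \
  --log-group-name /aws-glue/jobs/logs-v2/ \
  --log-stream-name "${RUN_ID}-job-insights-rca-driver" \
  --query "events[].message" --output text
```

Substitute `-job-insights-rule-driver` for the stream suffix to read the rule-based insights instead.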


## Example for AWS Glue job run insights
<a name="monitor-job-insights-example"></a>

In this section we present an example of how the job run insights feature can help you resolve an issue in a failed job. In this example, a user forgot to import the required module (TensorFlow) in an AWS Glue job that analyzes and builds a machine learning model on their data.

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.types import *
from pyspark.sql.functions import udf,col

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

data_set_1 = [1, 2, 3, 4]
data_set_2 = [5, 6, 7, 8]

scoresDf = spark.createDataFrame(data_set_1, IntegerType())

def data_multiplier_func(factor, data_vector):
    import tensorflow as tf
    with tf.compat.v1.Session() as sess:
        x1 = tf.constant(factor)
        x2 = tf.constant(data_vector)
        result = tf.multiply(x1, x2)
        return sess.run(result).tolist()

data_multiplier_udf = udf(lambda x:data_multiplier_func(x, data_set_2), ArrayType(IntegerType(),False))
factoredDf = scoresDf.withColumn("final_value", data_multiplier_udf(col("value")))
print(factoredDf.collect())
```

Without the job run insights feature, as the job fails, you only see this message thrown by Spark:

`An error occurred while calling o111.collectToPython. Traceback (most recent call last):`

The message is ambiguous and limits your debugging experience. In this case, this feature provides you with additional insights in two CloudWatch log streams:

1. The `job-insights-rca-driver` log stream:
   + *Exception events*: This log stream provides you the Spark exception events related to the failure collected from the Spark driver and different distributed workers. These events help you understand the time-ordered propagation of the exception as faulty code executes across Spark tasks, executors, and stages distributed across the AWS Glue workers.
   + *Line numbers*: This log stream identifies line 21, which made the call to import the missing Python module that caused the failure; it also identifies line 24, the call to the Spark action `collect()`, as the last executed line in your script.  
![\[The job-insights-rca-driver log stream.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-job-run-insights-3.png)

1. The `job-insights-rule-driver` log stream:
   + *Root cause and recommendation*: In addition to the line number and last executed line number for the fault in your script, this log stream shows the root cause analysis and a recommendation to follow the AWS Glue documentation and set up the necessary job parameters to use an additional Python module in your AWS Glue job. 
   + *Basis event*: This log stream also shows the Spark exception event that was evaluated with the service-defined rule to infer the root cause and provide a recommendation.  
![\[The job-insights-rule-driver log stream.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-job-run-insights-4.png)
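Acting on such a recommendation typically means supplying the missing module through a job parameter. A sketch using the `--additional-python-modules` job parameter (the job name is a placeholder):

```shell
# Placeholder job name; --additional-python-modules installs extra Python
# modules before the script runs, so the UDF's "import tensorflow" succeeds.
aws glue start-job-run \
  --job-name my-tensorflow-job \
  --arguments '{"--additional-python-modules":"tensorflow"}'
```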

# Monitoring with Amazon CloudWatch
<a name="monitor-cloudwatch"></a>

You can monitor AWS Glue using Amazon CloudWatch, which collects and processes raw data from AWS Glue into readable, near-real-time metrics. These statistics are recorded for a period of two weeks so that you can access historical information for a better perspective on how your application or service is performing. By default, AWS Glue metrics data is sent to CloudWatch automatically. For more information, see [What Is Amazon CloudWatch?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/WhatIsCloudWatch.html) in the *Amazon CloudWatch User Guide*, and [AWS Glue metrics](monitoring-awsglue-with-cloudwatch-metrics.md#awsglue-metrics).

 **Continuous logging** 

AWS Glue also supports real-time continuous logging for AWS Glue jobs. When continuous logging is enabled for a job, you can view the real-time logs on the AWS Glue console or the CloudWatch console dashboard. For more information, see [Logging for AWS Glue jobs](monitor-continuous-logging.md).

 **Observability metrics** 

 When **Job observability metrics** is enabled, additional Amazon CloudWatch metrics are generated when the job is run. Use AWS Glue observability metrics to generate insights into what is happening inside your AWS Glue jobs, to improve triaging and analysis of issues. 

**Topics**
+ [Monitoring AWS Glue using Amazon CloudWatch metrics](monitoring-awsglue-with-cloudwatch-metrics.md)
+ [Setting up Amazon CloudWatch alarms on AWS Glue job profiles](monitor-profile-glue-job-cloudwatch-alarms.md)
+ [Logging for AWS Glue jobs](monitor-continuous-logging.md)
+ [Monitoring with AWS Glue Observability metrics](monitor-observability.md)

# Monitoring AWS Glue using Amazon CloudWatch metrics
<a name="monitoring-awsglue-with-cloudwatch-metrics"></a>

You can profile and monitor AWS Glue operations using AWS Glue job profiler. It collects and processes raw data from AWS Glue jobs into readable, near real-time metrics stored in Amazon CloudWatch. These statistics are retained and aggregated in CloudWatch so that you can access historical information for a better perspective on how your application is performing.

**Note**  
 You may incur additional charges when you enable job metrics and CloudWatch custom metrics are created. For more information, see [Amazon CloudWatch pricing](https://aws.amazon.com/cloudwatch/pricing/). 

## AWS Glue metrics overview
<a name="metrics-overview"></a>

When you interact with AWS Glue, it sends metrics to CloudWatch. You can view these metrics using the AWS Glue console (the preferred method), the CloudWatch console dashboard, or the AWS Command Line Interface (AWS CLI). 

**To view metrics using the AWS Glue console dashboard**

You can view summary or detailed graphs of metrics for a job, or detailed graphs for a job run. 

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Job run monitoring**.

1. In **Job runs**, choose **Actions** to stop a job that is currently running, view a job, or rewind a job bookmark.

1. Select a job, then choose **View run details** to view additional information about the job run.

**To view metrics using the CloudWatch console dashboard**

Metrics are grouped first by the service namespace, and then by the various dimension combinations within each namespace.

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Metrics**.

1. Choose the **Glue** namespace.

**To view metrics using the AWS CLI**
+ At a command prompt, use the following command.

  ```
  aws cloudwatch list-metrics --namespace Glue
  ```

AWS Glue reports metrics to CloudWatch every 30 seconds, and the CloudWatch metrics dashboards are configured to display them every minute. The AWS Glue metrics represent delta values from the previously reported values. Where appropriate, metrics dashboards aggregate (sum) the 30-second values to obtain a value for the entire last minute.
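For example, to retrieve one of these metrics yourself with the AWS CLI, sum the 30-second delta values over 1-minute periods, mirroring the dashboard behavior. The job name and time range below are placeholders:

```shell
# Placeholder job name and time window; Sum over 60-second periods matches
# how the AWS Glue Metrics Dashboard aggregates this delta metric.
aws cloudwatch get-metric-statistics \
  --namespace Glue \
  --metric-name glue.driver.aggregate.elapsedTime \
  --dimensions Name=JobName,Value=my-etl-job \
               Name=JobRunId,Value=ALL \
               Name=Type,Value=count \
  --start-time 2023-01-01T00:00:00Z \
  --end-time 2023-01-01T01:00:00Z \
  --period 60 \
  --statistics Sum
```

With reports arriving every 30 seconds, each 60-second period aggregates up to two data points.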

### AWS Glue metrics behavior for Spark jobs
<a name="metrics-overview-spark"></a>

 AWS Glue metrics are enabled at initialization of a `GlueContext` in a script and are generally updated only at the end of an Apache Spark task. They represent the aggregate values across all completed Spark tasks so far.

However, the Spark metrics that AWS Glue passes on to CloudWatch are generally absolute values representing the current state at the time they are reported. AWS Glue reports them to CloudWatch every 30 seconds, and the metrics dashboards generally show the average across the data points received in the last 1 minute.

AWS Glue metric names are all preceded by one of the following prefixes:
+ `glue.driver.` – Metrics whose names begin with this prefix either represent AWS Glue metrics that are aggregated from all executors at the Spark driver, or Spark metrics corresponding to the Spark driver.
+ `glue.`*executorId*`.` – The *executorId* is the number of a specific Spark executor. It corresponds to the executors listed in the logs.
+ `glue.ALL.` – Metrics whose names begin with this prefix aggregate values from all Spark executors.
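For example, here is a sketch that lists only driver-prefixed metrics for one job (the job name is a placeholder):

```shell
# Placeholder job name; the JMESPath filter keeps only glue.driver.* metrics.
aws cloudwatch list-metrics --namespace Glue \
  --dimensions Name=JobName,Value=my-etl-job \
  --query "Metrics[?starts_with(MetricName, 'glue.driver.')].MetricName"
```

Replace `glue.driver.` with `glue.ALL.` or a specific executor number to inspect the other prefix families.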

## AWS Glue metrics
<a name="awsglue-metrics"></a>

AWS Glue profiles and sends the following metrics to CloudWatch every 30 seconds, and the AWS Glue Metrics Dashboard reports them once a minute:


| Metric | Description | 
| --- | --- | 
|  `glue.driver.aggregate.bytesRead` |  The number of bytes read from all data sources by all completed Spark tasks running in all executors. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID. or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.  Unit: Bytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) This metric can be used the same way as the `glue.ALL.s3.filesystem.read_bytes` metric, with the difference that this metric is updated at the end of a Spark task and captures non-S3 data sources as well.  | 
|  `glue.driver.aggregate.elapsedTime` |  The ETL elapsed time in milliseconds (does not include the job bootstrap times). Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID. or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Milliseconds Can be used to determine how long it takes a job run to run on average. Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.aggregate.numCompletedStages` |  The number of completed stages in the job. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID. or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.aggregate.numCompletedTasks` |  The number of completed tasks in the job. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID. or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.aggregate.numFailedTasks` |  The number of failed tasks. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID. or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) The data can be used to set alarms for increased failures that might suggest abnormalities in data, cluster or scripts.  | 
|  `glue.driver.aggregate.numKilledTasks` |  The number of tasks killed. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID. or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.aggregate.recordsRead` |  The number of records read from all data sources by all completed Spark tasks running in all executors. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID. or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) This metric can be used in a similar way to the `glue.ALL.s3.filesystem.read_bytes` metric, with the difference that this metric is updated at the end of a Spark task.  | 
|   `glue.driver.aggregate.shuffleBytesWritten` |  The number of bytes written by all executors to shuffle data between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes written for this purpose during the previous minute). Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID. or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Bytes Can be used to monitor: Data shuffle in jobs (large joins, groupBy, repartition, coalesce). Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.aggregate.shuffleLocalBytesRead` |  The number of bytes read by all executors to shuffle data between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes read for this purpose during the previous minute). Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID. or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Bytes Can be used to monitor: Data shuffle in jobs (large joins, groupBy, repartition, coalesce). Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.BlockManager.disk.diskSpaceUsed_MB` |  The number of megabytes of disk space used across all executors. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID. or `ALL`), and `Type` (gauge). Valid Statistics: Average. This is a Spark metric, reported as an absolute value. Unit: Megabytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.ExecutorAllocationManager.executors.numberAllExecutors` |  The number of actively running job executors. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID. or `ALL`), and `Type` (gauge). Valid Statistics: Average. This is a Spark metric, reported as an absolute value. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors` |  The number of maximum (actively running and pending) job executors needed to satisfy the current load. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID. or `ALL`), and `Type` (gauge). Valid Statistics: Maximum. This is a Spark metric, reported as an absolute value. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.jvm.heap.usage`  `glue.`*executorId*`.jvm.heap.usage`  `glue.ALL.jvm.heap.usage`  |  The fraction of memory used by the JVM heap for this driver (scale: 0-1) for driver, executor identified by executorId, or ALL executors. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID. or `ALL`), and `Type` (gauge). Valid Statistics: Average. This is a Spark metric, reported as an absolute value. Unit: Percentage Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.jvm.heap.used`  `glue.`*executorId*`.jvm.heap.used`  `glue.ALL.jvm.heap.used`  |  The number of memory bytes used by the JVM heap for the driver, the executor identified by *executorId*, or ALL executors. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID. or `ALL`), and `Type` (gauge). Valid Statistics: Average. This is a Spark metric, reported as an absolute value. Unit: Bytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.s3.filesystem.read_bytes`  `glue.`*executorId*`.s3.filesystem.read_bytes`  `glue.ALL.s3.filesystem.read_bytes`  |  The number of bytes read from Amazon S3 by the driver, an executor identified by *executorId*, or ALL executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes read during the previous minute). Valid dimensions: `JobName`, `JobRunId`, and `Type` (gauge). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard a SUM statistic is used for aggregation. The area under the curve on the AWS Glue Metrics Dashboard can be used to visually compare bytes read by two different job runs. Unit: Bytes. Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Resulting data can be used for: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.s3.filesystem.write_bytes`  `glue.`*executorId*`.s3.filesystem.write_bytes`  `glue.ALL.s3.filesystem.write_bytes`  |  The number of bytes written to Amazon S3 by the driver, an executor identified by *executorId*, or ALL executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes written during the previous minute). Valid dimensions: `JobName`, `JobRunId`, and `Type` (gauge). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard a SUM statistic is used for aggregation. The area under the curve on the AWS Glue Metrics Dashboard can be used to visually compare bytes written by two different job runs. Unit: Bytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.streaming.numRecords` |  The number of records that are received in a micro-batch. This metric is only available for AWS Glue streaming jobs with AWS Glue version 2.0 and above. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: Sum, Maximum, Minimum, Average, Percentile. Unit: Count. Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.streaming.batchProcessingTimeInMs` |  The time it takes to process the batches, in milliseconds. This metric is only available for AWS Glue streaming jobs with AWS Glue version 2.0 and above. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: Sum, Maximum, Minimum, Average, Percentile. Unit: Count. Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.system.cpuSystemLoad`  `glue.`*executorId*`.system.cpuSystemLoad`  `glue.ALL.system.cpuSystemLoad`  |  The fraction of CPU system load used (scale: 0-1) by the driver, an executor identified by *executorId*, or ALL executors. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Average. This metric is reported as an absolute value. Unit: Percentage. Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 

## Dimensions for AWS Glue Metrics
<a name="awsglue-metricdimensions"></a>

AWS Glue metrics use the AWS Glue namespace and provide metrics for the following dimensions:


| Dimension | Description | 
| --- | --- | 
|  `JobName`  |  This dimension filters for metrics of all job runs of a specific AWS Glue job.  | 
|  `JobRunId`  |  This dimension filters for metrics of a specific AWS Glue job run by a JobRun ID, or `ALL`.  | 
|  `Type`  |  This dimension filters for metrics by either `count` (an aggregate number) or `gauge` (a value at a point in time).  | 

For more information, see the [Amazon CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/).

# Setting up Amazon CloudWatch alarms on AWS Glue job profiles
<a name="monitor-profile-glue-job-cloudwatch-alarms"></a>

AWS Glue metrics are also available in Amazon CloudWatch. You can set up alarms on any AWS Glue metric for scheduled jobs. 

A few common scenarios for setting up alarms are as follows:
+ Jobs running out of memory (OOM): Set an alarm when the memory usage exceeds the normal average for either the driver or an executor for an AWS Glue job.
+ Straggling executors: Set an alarm when the number of executors falls below a certain threshold for a large duration of time in an AWS Glue job.
+ Data backlog or reprocessing: Compare the metrics from individual jobs in a workflow using a CloudWatch math expression. You can then trigger an alarm on the resulting expression value (such as the ratio of bytes written by a job and bytes read by a following job).
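The OOM scenario above can be sketched in Python. This is a minimal, illustrative example: the job name and threshold are hypothetical, and it only builds the parameter map for the CloudWatch `PutMetricAlarm` call (the actual boto3 call is shown commented out, since it requires AWS credentials and permissions).

```python
# Sketch: alarm when average driver heap usage stays above a threshold.
# The job name and threshold here are examples, not prescribed values.
def build_oom_alarm(job_name: str, threshold: float = 0.9) -> dict:
    """Build PutMetricAlarm parameters for the Glue driver heap-usage metric."""
    return {
        "AlarmName": f"{job_name}-driver-oom",
        "Namespace": "Glue",
        "MetricName": "glue.driver.jvm.heap.usage",
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": "ALL"},
            {"Name": "Type", "Value": "gauge"},
        ],
        "Statistic": "Average",
        "Period": 300,            # evaluate 5-minute averages
        "EvaluationPeriods": 3,   # require three consecutive breaches
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }

params = build_oom_alarm("my-etl-job")
# boto3.client("cloudwatch").put_metric_alarm(**params)  # actual call, once reviewed
```

A similar parameter map, with a `LessThanThreshold` comparison on an executor-count metric, would cover the straggling-executors scenario.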

For detailed instructions on setting alarms, see [Create or Edit a CloudWatch Alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ConsoleAlarms.html) in the *[Amazon CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/)*. 

For monitoring and debugging scenarios using CloudWatch, see [Job monitoring and debugging](monitor-profile-glue-job-cloudwatch-metrics.md).

# Logging for AWS Glue jobs
<a name="monitor-continuous-logging"></a>

 In AWS Glue 5.0, all jobs have real-time logging capabilities. Additionally, you can specify custom configuration options to tailor the logging behavior. These options include setting the Amazon CloudWatch log group name, the Amazon CloudWatch log stream prefix (which will precede the AWS Glue job run ID and driver/executor ID), and the log conversion pattern for log messages. These configurations allow you to aggregate logs in custom Amazon CloudWatch log groups with different expiration policies. Furthermore, you can analyze the logs more effectively by using custom log stream prefixes and conversion patterns. This level of customization enables you to optimize log management and analysis according to your specific requirements. 

## Logging behavior in AWS Glue 5.0
<a name="monitor-logging-behavior-glue-50"></a>

 By default, system logs, Spark daemon logs, and user AWS Glue Logger logs are written to the `/aws-glue/jobs/error` log group in Amazon CloudWatch. On the other hand, user stdout (standard output) and stderr (standard error) logs are written to the `/aws-glue/jobs/output` log group by default. 

## Custom logging
<a name="monitor-logging-custom"></a>

 You can customize the default log group and log stream prefixes using the following job arguments: 
+  `--custom-logGroup-prefix`: Allows you to specify a custom prefix for the `/aws-glue/jobs/error` and `/aws-glue/jobs/output` log groups. If you provide a custom prefix, the log group names will be in the following format: 
  +  `/aws-glue/jobs/error` will be `<custom prefix>/error` 
  +  `/aws-glue/jobs/output` will be `<custom prefix>/output` 
+  `--custom-logStream-prefix`: Allows you to specify a custom prefix for the log stream names within the log groups. If you provide a custom prefix, the log stream names will be in the following format: 
  +  `jobrunid-driver` will be `<custom log stream prefix>-driver` 
  +  `jobrunid-executorNum` will be `<custom log stream prefix>-executorNum` 

 Validation rules and limitations for custom prefixes: 
+  The entire log stream name must be between 1 and 512 characters long. 
+  The custom prefix itself is restricted to 400 characters. 
+  The custom prefix must match the regular expression pattern `[^:*]*` (the special characters '_', '-', and '/' are allowed; ':' and '*' are not). 
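As a sketch, the length rules above can be checked before a job run with a small validator. This only enforces the documented length limits; the `-driver` suffix assumed here follows the stream-name format described earlier, and the character-set check is left to CloudWatch itself.

```python
# Length limits from the custom-prefix validation rules documented above.
MAX_PREFIX_LEN = 400
MAX_STREAM_LEN = 512

def validate_stream_prefix(prefix: str, suffix: str = "-driver") -> bool:
    """Check a custom log stream prefix against the documented length limits.

    Only the length rules are enforced here; allowed characters are
    governed by CloudWatch log stream naming rules.
    """
    if not prefix or len(prefix) > MAX_PREFIX_LEN:
        return False
    # The full stream name (prefix plus driver/executor suffix) must fit.
    return 1 <= len(prefix) + len(suffix) <= MAX_STREAM_LEN

assert validate_stream_prefix("my-team/etl")
assert not validate_stream_prefix("x" * 401)
```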

## Logging application-specific messages using the custom script logger
<a name="monitor-logging-script"></a>

You can use the AWS Glue logger to log any application-specific messages in the script that are sent in real time to the driver log stream.

The following example shows a Python script.

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
logger.info("info message")
logger.warn("warn message")
logger.error("error message")
```

The following example shows a Scala script.

```
import com.amazonaws.services.glue.log.GlueLogger

object GlueApp {
  def main(sysArgs: Array[String]) {
    val logger = new GlueLogger
    logger.info("info message")
    logger.warn("warn message")
    logger.error("error message")
  }
}
```

## Enabling the progress bar to show job progress
<a name="monitor-logging-progress"></a>

AWS Glue provides a real-time progress bar under the `JOB_RUN_ID-progress-bar` log stream to check AWS Glue job run status. Currently it supports only jobs that initialize `glueContext`. If you run a pure Spark job without initializing `glueContext`, the AWS Glue progress bar does not appear.

The progress bar shows the following progress update every 5 seconds.

```
[Stage Number (Stage Name): > (numCompletedTasks + numActiveTasks) / totalNumOfTasksInThisStage]
```

## Security configuration with Amazon CloudWatch logging
<a name="monitor-security-config-logging"></a>

 When a security configuration is enabled for Amazon CloudWatch logs, AWS Glue creates log groups with specific naming patterns that incorporate the security configuration name. 

### Log group naming with security configuration
<a name="monitor-log-group-naming"></a>

 The default and custom log groups will be as follows: 
+  **Default error log group:** `/aws-glue/jobs/Security-Configuration-Name-role/glue-job-role/error` 
+  **Default output log group:** `/aws-glue/jobs/Security-Configuration-Name-role/glue-job-role/output` 
+  **Custom error log group (AWS Glue 5.0):** `custom-log-group-prefix/Security-Configuration-Name-role/glue-job-role/error` 
+  **Custom output log group (AWS Glue 5.0):** `custom-log-group-prefix/Security-Configuration-Name-role/glue-job-role/output` 

### Required IAM Permissions
<a name="monitor-logging-iam-permissions"></a>

 You need to add the `logs:AssociateKmsKey` permission to your IAM role, if you enable a security configuration with Amazon CloudWatch Logs. If that permission is not included, continuous logging will be disabled. 

 Also, to configure encryption for Amazon CloudWatch Logs, follow the instructions at [Encrypt Log Data in Amazon CloudWatch Logs Using AWS Key Management Service](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/encrypt-log-data-kms.html) in the *Amazon CloudWatch Logs User Guide*. 

### Additional Information
<a name="additional-info"></a>

 For more information on creating security configurations, see [Managing security configurations on the AWS Glue console](https://docs.aws.amazon.com/glue/latest/dg/console-security-configurations.html). 

**Topics**
+ [Logging behavior in AWS Glue 5.0](#monitor-logging-behavior-glue-50)
+ [Custom logging](#monitor-logging-custom)
+ [Logging application-specific messages using the custom script logger](#monitor-logging-script)
+ [Enabling the progress bar to show job progress](#monitor-logging-progress)
+ [Security configuration with Amazon CloudWatch logging](#monitor-security-config-logging)
+ [Enabling continuous logging for AWS Glue 4.0 and earlier jobs](monitor-continuous-logging-enable.md)
+ [Viewing logs for AWS Glue jobs](monitor-continuous-logging-view.md)

# Enabling continuous logging for AWS Glue 4.0 and earlier jobs
<a name="monitor-continuous-logging-enable"></a>

**Note**  
 In AWS Glue 4.0 and earlier versions, continuous logging was an available feature. However, with the introduction of AWS Glue 5.0, all jobs have real-time logging capability. For more details on the logging capabilities and configuration options in AWS Glue 5.0, see [Logging for AWS Glue jobs](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging.html). 

You can enable continuous logging when you create a new job or edit an existing job in the AWS Glue console, or through the AWS Command Line Interface (AWS CLI).

You can also specify custom configuration options such as the Amazon CloudWatch log group name, the CloudWatch log stream prefix (which precedes the AWS Glue job run ID and driver/executor ID), and the log conversion pattern for log messages. These configurations help you aggregate logs in custom CloudWatch log groups with different expiration policies, and analyze them further with custom log stream prefixes and conversion patterns. 

**Topics**
+ [Using the AWS Management Console](#monitor-continuous-logging-enable-console)
+ [Logging application-specific messages using the custom script logger](#monitor-continuous-logging-script)
+ [Enabling the progress bar to show job progress](#monitor-continuous-logging-progress)
+ [Security configuration with continuous logging](#monitor-logging-encrypt-log-data)

## Using the AWS Management Console
<a name="monitor-continuous-logging-enable-console"></a>

Follow these steps to use the console to enable continuous logging when creating or editing an AWS Glue job.

**To create a new AWS Glue job with continuous logging**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **ETL jobs**.

1. Choose **Visual ETL**.

1. In the **Job details** tab, expand the **Advanced properties** section.

1. Under **Continuous logging** select **Enable logs in CloudWatch**.

**To enable continuous logging for an existing AWS Glue job**

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Jobs**.

1. Choose an existing job from the **Jobs** list.

1. Choose **Action**, **Edit job**.

1. In the **Job details** tab, expand the **Advanced properties** section.

1. Under **Continuous logging** select **Enable logs in CloudWatch**.

### Using the AWS CLI
<a name="monitor-continuous-logging-cli"></a>

To enable continuous logging, pass the following special job parameters to your AWS Glue job in the same way as other job parameters. For more information, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

```
'--enable-continuous-cloudwatch-log': 'true'
```

You can specify a custom Amazon CloudWatch log group name. If not specified, the default log group name is `/aws-glue/jobs/logs-v2`.

```
'--continuous-log-logGroup': 'custom_log_group_name'
```

You can specify a custom Amazon CloudWatch log stream prefix. If not specified, the default log stream prefix is the job run ID.

```
'--continuous-log-logStreamPrefix': 'custom_log_stream_prefix'
```

You can specify a custom continuous logging conversion pattern. If not specified, the default conversion pattern is `%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n`. Note that the conversion pattern only applies to driver logs and executor logs. It does not affect the AWS Glue progress bar.

```
'--continuous-log-conversionPattern': 'custom_log_conversion_pattern'
```
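Putting the parameters above together, the following Python sketch assembles the arguments map for a job run. The log group and stream prefix values are hypothetical examples; the boto3 `start_job_run` call is shown commented out because it requires credentials and a real job.

```python
# Assemble the special continuous-logging job parameters documented above.
def continuous_logging_args(log_group: str, stream_prefix: str) -> dict:
    """Build the Arguments map that enables continuous logging for a job run."""
    return {
        "--enable-continuous-cloudwatch-log": "true",
        "--continuous-log-logGroup": log_group,
        "--continuous-log-logStreamPrefix": stream_prefix,
        # Optionally override the default conversion pattern:
        # "--continuous-log-conversionPattern": "%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n",
    }

args = continuous_logging_args("/my-team/glue-logs", "nightly-etl")
# boto3.client("glue").start_job_run(JobName="my-job", Arguments=args)
```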

## Logging application-specific messages using the custom script logger
<a name="monitor-continuous-logging-script"></a>

You can use the AWS Glue logger to log any application-specific messages in the script that are sent in real time to the driver log stream.

The following example shows a Python script.

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
logger.info("info message")
logger.warn("warn message")
logger.error("error message")
```

The following example shows a Scala script.

```
import com.amazonaws.services.glue.log.GlueLogger

object GlueApp {
  def main(sysArgs: Array[String]) {
    val logger = new GlueLogger
    logger.info("info message")
    logger.warn("warn message")
    logger.error("error message")
  }
}
```

## Enabling the progress bar to show job progress
<a name="monitor-continuous-logging-progress"></a>

AWS Glue provides a real-time progress bar under the `JOB_RUN_ID-progress-bar` log stream to check AWS Glue job run status. Currently it supports only jobs that initialize `glueContext`. If you run a pure Spark job without initializing `glueContext`, the AWS Glue progress bar does not appear.

The progress bar shows the following progress update every 5 seconds.

```
[Stage Number (Stage Name): > (numCompletedTasks + numActiveTasks) / totalNumOfTasksInThisStage]
```

## Security configuration with continuous logging
<a name="monitor-logging-encrypt-log-data"></a>

If a security configuration is enabled for CloudWatch logs, AWS Glue will create a log group named as follows for continuous logs:

```
<Log-Group-Name>-<Security-Configuration-Name>
```

The default and custom log groups will be as follows:
+ The default continuous log group will be `/aws-glue/jobs/error-<Security-Configuration-Name>`
+ The custom continuous log group will be `<custom-log-group-name>-<Security-Configuration-Name>`
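The naming rule above can be expressed as a small helper (illustrative only; it simply restates the documented `<Log-Group-Name>-<Security-Configuration-Name>` format):

```python
def continuous_log_group(security_config: str, base: str = "/aws-glue/jobs/error") -> str:
    """Return the continuous log group name Glue uses when a security configuration is attached."""
    return f"{base}-{security_config}"

# Default group with a security configuration named "my-sec-config":
name = continuous_log_group("my-sec-config")
# → "/aws-glue/jobs/error-my-sec-config"
```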

You need to add the `logs:AssociateKmsKey` permission to your IAM role, if you enable a security configuration with CloudWatch Logs. If that permission is not included, continuous logging will be disabled. Also, to configure encryption for CloudWatch Logs, follow the instructions at [Encrypt Log Data in CloudWatch Logs Using AWS Key Management Service](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/encrypt-log-data-kms.html) in the *Amazon CloudWatch Logs User Guide*.

For more information on creating security configurations, see [Managing security configurations on the AWS Glue console](console-security-configurations.md).

**Note**  
 You may incur additional charges when you enable logging and additional CloudWatch log events are created. For more information, see [ Amazon CloudWatch pricing ](https://aws.amazon.com/cloudwatch/pricing/). 

# Viewing logs for AWS Glue jobs
<a name="monitor-continuous-logging-view"></a>

You can view real-time logs using the AWS Glue console or the Amazon CloudWatch console.

**To view real-time logs using the AWS Glue console dashboard**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Jobs**.

1. Add or start an existing job. Choose **Action**, **Run job**.

   When you start running a job, you navigate to a page that contains information about the running job:
   + The **Logs** tab shows the older aggregated application logs.
   + The **Logs** tab shows a real-time progress bar when the job is running with `glueContext` initialized.
   + The **Logs** tab also contains the **Driver logs**, which capture real-time Apache Spark driver logs, and application logs from the script logged using the AWS Glue application logger when the job is running.

1. For older jobs, you can also view the real-time logs under the **Job History** view by choosing **Logs**. This action takes you to the CloudWatch console that shows all Spark driver, executor, and progress bar log streams for that job run.

**To view real-time logs using the CloudWatch console dashboard**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Logs**.

1. Choose the **/aws-glue/jobs/error/** log group.

1. In the **Filter** box, paste the job run ID.

   You can view the driver logs, executor logs, and progress bar (if using the **Standard filter**).

# Monitoring with AWS Glue Observability metrics
<a name="monitor-observability"></a>

**Note**  
AWS Glue Observability metrics is available on AWS Glue 4.0 and later versions.

 Use AWS Glue Observability metrics to generate insights into what is happening inside your AWS Glue for Apache Spark jobs to improve triaging and analysis of issues. Observability metrics are visualized through Amazon CloudWatch dashboards and can be used to help perform root cause analysis for errors and for diagnosing performance bottlenecks. You can reduce the time spent debugging issues at scale so you can focus on resolving issues faster and more effectively. 

 AWS Glue Observability provides Amazon CloudWatch metrics categorized into the following four groups: 
+  **Reliability (error classes)** – easily identify the most common failure reasons in a given time range that you may want to address. 
+  **Performance (skewness)** – identify performance bottlenecks and apply tuning techniques. For example, when you experience degraded performance due to job skewness, you may want to enable Spark Adaptive Query Execution and fine-tune the skew join threshold. 
+  **Throughput (per-source/sink throughput)** – monitor trends of data reads and writes. You can also configure Amazon CloudWatch alarms for anomalies. 
+  **Resource utilization (workers, memory, and disk utilization)** – efficiently find jobs with low capacity utilization. You may want to enable AWS Glue auto scaling for those jobs. 

## Getting started with AWS Glue Observability metrics
<a name="monitor-observability-getting-started"></a>

**Note**  
 The new metrics are enabled by default in the AWS Glue Studio console. 

**To configure observability metrics in AWS Glue Studio:**

1. Log in to the AWS Glue console and choose **ETL jobs** from the console menu.

1. Choose a job by clicking on the job name in the **Your jobs** section.

1. Choose the **Job details** tab.

1. Scroll to the bottom and choose **Advanced properties**, then **Job observability metrics**.  
![\[The screenshot shows the Job details tab Advanced properties. The Job observability metrics option is highlighted.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job-details-observability-metrics.png)

**To enable AWS Glue Observability metrics using AWS CLI:**
+  Add the following key-value pair to the `--default-arguments` map in the input JSON file: 

  ```
  "--enable-observability-metrics": "true"
  ```
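Equivalently, the flag can be merged into a job's existing default arguments in Python before calling the Glue API. This is a hedged sketch: the job name and existing arguments are hypothetical, and the `update_job` call is commented out because it requires credentials.

```python
def with_observability(default_args: dict) -> dict:
    """Return a copy of a job's DefaultArguments with observability metrics enabled."""
    args = dict(default_args)  # don't mutate the caller's map
    args["--enable-observability-metrics"] = "true"
    return args

new_args = with_observability({"--job-language": "python"})
# boto3.client("glue").update_job(JobName="my-job",
#     JobUpdate={"Role": "...", "Command": {...}, "DefaultArguments": new_args})
```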

## Using AWS Glue observability
<a name="monitor-observability-cloudwatch"></a>

 Because AWS Glue observability metrics are provided through Amazon CloudWatch, you can use the Amazon CloudWatch console, AWS CLI, SDK, or API to query the observability metric data points. See [Using Glue Observability for monitoring resource utilization to reduce cost](https://aws.amazon.com/blogs/big-data/enhance-monitoring-and-debugging-for-aws-glue-jobs-using-new-job-observability-metrics/) for an example use case for AWS Glue observability metrics. 

### Using AWS Glue observability in the Amazon CloudWatch console
<a name="monitor-observability-cloudwatch-console"></a>

**To query and visualize metrics in the Amazon CloudWatch console:**

1.  Open the Amazon CloudWatch console and choose **All metrics**. 

1.  Under custom namespaces, choose **AWS Glue**. 

1.  Choose **Job Observability Metrics**, **Observability Metrics Per Source**, or **Observability Metrics Per Sink**. 

1. Search for the specific metric name, job name, job run ID, and select them.

1. Under the **Graphed metrics** tab, configure your preferred statistic, period, and other options.  
![\[The screenshot shows the Amazon CloudWatch console and metrics graph.\]](http://docs.aws.amazon.com/glue/latest/dg/images/cloudwatch-console-metrics.png)

**To query an Observability metric using AWS CLI:**

1.  Create a metric definition JSON file and replace `your-Glue-job-name` and `your-Glue-job-run-id` with your own values. 

   ```
   $ cat multiplequeries.json
   [
       {
           "Id": "avgWorkerUtil_0",
           "MetricStat": {
               "Metric": {
                   "Namespace": "Glue",
                   "MetricName": "glue.driver.workerUtilization",
                   "Dimensions": [
                       {
                           "Name": "JobName",
                           "Value": "<your-Glue-job-name-A>"
                       },
                       {
                           "Name": "JobRunId",
                           "Value": "<your-Glue-job-run-id-A>"
                       },
                       {
                           "Name": "Type",
                           "Value": "gauge"
                       },
                       {
                           "Name": "ObservabilityGroup",
                           "Value": "resource_utilization"
                       }
                   ]
               },
               "Period": 1800,
               "Stat": "Minimum",
               "Unit": "None"
           }
       },
       {
           "Id": "avgWorkerUtil_1",
           "MetricStat": {
               "Metric": {
                   "Namespace": "Glue",
                   "MetricName": "glue.driver.workerUtilization",
                   "Dimensions": [
                       {
                           "Name": "JobName",
                           "Value": "<your-Glue-job-name-B>"
                       },
                       {
                           "Name": "JobRunId",
                           "Value": "<your-Glue-job-run-id-B>"
                       },
                       {
                           "Name": "Type",
                           "Value": "gauge"
                       },
                       {
                           "Name": "ObservabilityGroup",
                           "Value": "resource_utilization"
                       }
                   ]
               },
               "Period": 1800,
               "Stat": "Minimum",
               "Unit": "None"
           }
       }
   ]
   ```

1.  Run the `get-metric-data` command: 

   ```
   $ aws cloudwatch get-metric-data --metric-data-queries file://multiplequeries.json \
        --start-time '2023-10-28T18:20' \
        --end-time '2023-10-28T19:10' \
        --region us-east-1
   {
       "MetricDataResults": [
           {
               "Id": "avgWorkerUtil_0",
               "Label": "<your-label-for-A>",
               "Timestamps": [
                   "2023-10-28T18:20:00+00:00"
               ],
               "Values": [
                   0.06718750000000001
               ],
               "StatusCode": "Complete"
           },
           {
               "Id": "avgWorkerUtil_1",
               "Label": "<your-label-for-B>",
               "Timestamps": [
                   "2023-10-28T18:50:00+00:00"
               ],
               "Values": [
                   0.5959183673469387
               ],
               "StatusCode": "Complete"
           }
       ],
       "Messages": []
   }
   ```
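The `MetricDataResults` returned above can also be compared programmatically. This sketch operates on a response shaped like the sample output (the dictionary below mirrors that structure with rounded example values) and extracts each query's first datapoint:

```python
def min_utilization_by_id(response: dict) -> dict:
    """Map each completed query Id to its first returned datapoint value."""
    return {
        r["Id"]: r["Values"][0]
        for r in response["MetricDataResults"]
        if r["StatusCode"] == "Complete" and r["Values"]
    }

# A response shaped like the get-metric-data output above (values rounded).
sample = {
    "MetricDataResults": [
        {"Id": "avgWorkerUtil_0", "Values": [0.067], "StatusCode": "Complete"},
        {"Id": "avgWorkerUtil_1", "Values": [0.596], "StatusCode": "Complete"},
    ]
}
util = min_utilization_by_id(sample)
# A low value for a run suggests it may benefit from AWS Glue auto scaling.
```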

## Observability metrics
<a name="monitor-observability-metrics-definitions"></a>

 AWS Glue Observability profiles and sends the following metrics to Amazon CloudWatch every 30 seconds. Some of these metrics are visible on the AWS Glue Studio job runs monitoring page. 


| Metric | Description | Category | 
| --- | --- | --- | 
| glue.driver.skewness.stage |  Metric Category: job\_performance The Spark stage execution skewness: this metric indicates how long the maximum task duration in a given stage is compared to the median task duration in that stage. It captures execution skewness, which might be caused by input data skew or by a transformation (for example, a skewed join). The values of this metric fall in the range [0, infinity), where 0 means the ratio of the maximum to the median task execution time, among all tasks in the stage, is less than a certain stage skewness factor. The default stage skewness factor is `5`, and it can be overridden via the Spark configuration `spark.metrics.conf.driver.source.glue.jobPerformance.skewnessFactor`. A stage skewness value of 1 means the ratio is twice the stage skewness factor.  The value of stage skewness is updated every 30 seconds to reflect the current skewness. The value at the end of the stage reflects the final stage skewness. This stage-level metric is used to calculate the job-level metric `glue.driver.skewness.job`. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (job\_performance) Valid Statistics: Average, Maximum, Minimum, Percentile Unit: Count  | job\_performance | 
| glue.driver.skewness.job |  Metric Category: job\_performance  Job skewness is the maximum of the weighted skewness over all stages. The stage skewness (glue.driver.skewness.stage) is weighted by stage duration. This avoids the corner case where a very skewed stage actually runs for a very short time relative to other stages (so its skewness is not significant for overall job performance and is not worth the effort to address).  This metric is updated upon completion of each stage, so the last value reflects the actual overall job skewness. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (job\_performance) Valid Statistics: Average, Maximum, Minimum, Percentile Unit: Count  | job\_performance | 
| glue.succeed.ALL |  Metric Category: error  The total number of successful job runs, to complete the picture of failure categories. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (count), and ObservabilityGroup (error) Valid Statistics: SUM Unit: Count  | error | 
| glue.error.ALL |  Metric Category: error  The total number of job run errors, to complete the picture of failure categories. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (count), and ObservabilityGroup (error) Valid Statistics: SUM Unit: Count  | error | 
| glue.error.[error category] |  Metric Category: error  This is actually a set of metrics that are updated only when a job run fails. The error categorization helps with triaging and debugging. When a job run fails, the error causing the failure is categorized and the corresponding error category metric is set to 1. This helps you perform failure analysis over time, as well as error analysis across all jobs to identify the most common failure categories and start addressing them. AWS Glue has 28 error categories, including OUT\_OF\_MEMORY (driver and executor), PERMISSION, SYNTAX, and THROTTLING error categories. Error categories also include COMPILATION, LAUNCH, and TIMEOUT error categories. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (count), and ObservabilityGroup (error) Valid Statistics: SUM Unit: Count  | error | 
| glue.driver.workerUtilization |  Metric Category: resource\_utilization  The percentage of allocated workers that are actually used. If utilization is low, auto scaling can help. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average, Maximum, Minimum, Percentile Unit: Percentage  | resource\_utilization | 
| glue.driver.memory.heap.[available \| used] |  Metric Category: resource\_utilization  The driver's available/used heap memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.driver.memory.heap.used.percentage |  Metric Category: resource\_utilization  The driver's used (%) heap memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.driver.memory.non-heap.[available \| used] |  Metric Category: resource\_utilization  The driver's available/used non-heap memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.driver.memory.non-heap.used.percentage |  Metric Category: resource\_utilization  The driver's used (%) non-heap memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.driver.memory.total.[available \| used] |  Metric Category: resource\_utilization  The driver's available/used total memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.driver.memory.total.used.percentage |  Metric Category: resource\_utilization  The driver's used (%) total memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.ALL.memory.heap.[available \| used] |  Metric Category: resource\_utilization  The executors' available/used heap memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.ALL.memory.heap.used.percentage |  Metric Category: resource\_utilization  The executors' used (%) heap memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.ALL.memory.non-heap.[available \| used] |  Metric Category: resource\_utilization  The executors' available/used non-heap memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.ALL.memory.non-heap.used.percentage |  Metric Category: resource\$1utilization  The executors' used (%) non-heap memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID. or ALL), Type (gauge), and ObservabilityGroup (resource\$1utilization) Valid Statistics: Average Unit: Percentage  | resource\$1utilization | 
| glue.ALL.memory.total.[available \$1 used] |  Metric Category: resource\$1utilization  The executors' available/used total memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID. or ALL), Type (gauge), and ObservabilityGroup (resource\$1utilization) Valid Statistics: Average Unit: Bytes  | resource\$1utilization | 
| glue.ALL.memory.total.used.percentage |  Metric Category: resource\$1utilization  The executors' used (%) total memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID. or ALL), Type (gauge), and ObservabilityGroup (resource\$1utilization) Valid Statistics: Average Unit: Percentage  | resource\$1utilization | 
| glue.driver.disk.[available\$1GB \$1 used\$1GB] |  Metric Category: resource\$1utilization  The driver's available/used disk space during the job run. This helps to understand disk usage trends, especially over time, which can help avoid potential failures, in addition to debugging not enought disk space related failures. Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID. or ALL), Type (gauge), and ObservabilityGroup (resource\$1utilization) Valid Statistics: Average Unit: Gigabytes  | resource\$1utilization | 
| glue.driver.disk.used.percentage] |  Metric Category: resource\$1utilization  The driver's available/used disk space during the job run. This helps to understand disk usage trends, especially over time, which can help avoid potential failures, in addition to debugging not enought disk space related failures. Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID. or ALL), Type (gauge), and ObservabilityGroup (resource\$1utilization) Valid Statistics: Average Unit: Percentage  | resource\$1utilization | 
| glue.ALL.disk.[available\$1GB \$1 used\$1GB] |  Metric Category: resource\$1utilization  The executors' available/used disk space. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID. or ALL), Type (gauge), and ObservabilityGroup (resource\$1utilization) Valid Statistics: Average Unit: Gigabytes  | resource\$1utilization | 
| glue.ALL.disk.used.percentage |  Metric Category: resource\$1utilization  The executors' available/used/used(%) disk space. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID. or ALL), Type (gauge), and ObservabilityGroup (resource\$1utilization) Valid Statistics: Average Unit: Percentage  | resource\$1utilization | 
| glue.driver.bytesRead |  Metric Category: throughput  The number of bytes read per input source in this job run, as well as for ALL sources. This helps you understand the data volume and its changes over time, which helps address issues such as data skew. Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Source (the source data location) Valid Statistics: Average Unit: Bytes  | throughput | 
| glue.driver.[recordsRead \| filesRead]  |  Metric Category: throughput  The number of records/files read per input source in this job run, as well as for ALL sources. This helps you understand the data volume and its changes over time, which helps address issues such as data skew. Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Source (the source data location) Valid Statistics: Average Unit: Count  | throughput | 
| glue.driver.partitionsRead  |  Metric Category: throughput  The number of partitions read per Amazon S3 input source in this job run, as well as for ALL sources. Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Source (the source data location) Valid Statistics: Average Unit: Count  | throughput | 
| glue.driver.bytesWritten |  Metric Category: throughput  The number of bytes written per output sink in this job run, as well as for ALL sinks. This helps you understand the data volume and how it evolves over time, which helps address issues such as processing skew. Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Sink (the sink data location) Valid Statistics: Average Unit: Bytes  | throughput | 
| glue.driver.[recordsWritten \| filesWritten] |  Metric Category: throughput  The number of records/files written per output sink in this job run, as well as for ALL sinks. This helps you understand the data volume and how it evolves over time, which helps address issues such as processing skew. Valid dimensions: JobName (the name of the AWS Glue Job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Sink (the sink data location) Valid Statistics: Average Unit: Count  | throughput | 
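
As a sketch of how you might retrieve one of these metrics programmatically, the following builds a CloudWatch `GetMetricStatistics` request for the driver heap-usage metric. The job name is a placeholder; the dimensions and statistic mirror the "Valid dimensions" and "Valid Statistics" entries in the table above, and the actual API call is shown in a comment because it requires AWS credentials.

```python
from datetime import datetime, timedelta, timezone

# Build a GetMetricStatistics request for one observability metric.
# "my-glue-job" is a placeholder; AWS Glue publishes job metrics under
# the "Glue" CloudWatch namespace.
end = datetime.now(timezone.utc)
params = {
    "Namespace": "Glue",
    "MetricName": "glue.driver.memory.heap.used.percentage",
    "Dimensions": [
        {"Name": "JobName", "Value": "my-glue-job"},  # placeholder job name
        {"Name": "JobRunId", "Value": "ALL"},         # or a specific JobRun ID
        {"Name": "Type", "Value": "gauge"},
        {"Name": "ObservabilityGroup", "Value": "resource-utilization"},
    ],
    "StartTime": end - timedelta(hours=1),
    "EndTime": end,
    "Period": 60,               # one-minute resolution
    "Statistics": ["Average"],  # the valid statistic for this metric
}

# To run the query (requires AWS credentials and boto3):
#   import boto3
#   cloudwatch = boto3.client("cloudwatch")
#   response = cloudwatch.get_metric_statistics(**params)
#   for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
#       print(point["Timestamp"], point["Average"])
```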

## Error categories
<a name="monitor-observability-error-categories"></a>


| Error categories | Description | 
| --- | --- | 
| COMPILATION\_ERROR | Errors that arise during the compilation of Scala code. | 
| CONNECTION\_ERROR | Errors that arise when connecting to a service, remote host, database, and so on. | 
| DISK\_NO\_SPACE\_ERROR |  Errors that arise when there is no disk space left on the driver or an executor.  | 
| OUT\_OF\_MEMORY\_ERROR | Errors that arise when there is no memory left on the driver or an executor. | 
| IMPORT\_ERROR | Errors that arise when importing dependencies. | 
| INVALID\_ARGUMENT\_ERROR | Errors that arise when the input arguments are invalid or illegal. | 
| PERMISSION\_ERROR | Errors that arise when permission to a service, data, and so on, is lacking.  | 
| RESOURCE\_NOT\_FOUND\_ERROR |  Errors that arise when data, a location, and so on, does not exist.   | 
| QUERY\_ERROR | Errors that arise from Spark SQL query execution.  | 
| SYNTAX\_ERROR | Errors that arise when there is a syntax error in the script.  | 
| THROTTLING\_ERROR | Errors that arise when hitting a service concurrency limit or exceeding a service quota. | 
| DATA\_LAKE\_FRAMEWORK\_ERROR | Errors that arise from an AWS Glue natively supported data lake framework, such as Hudi or Iceberg. | 
| UNSUPPORTED\_OPERATION\_ERROR | Errors that arise when performing an unsupported operation. | 
| RESOURCES\_ALREADY\_EXISTS\_ERROR | Errors that arise when a resource to be created or added already exists. | 
| GLUE\_INTERNAL\_SERVICE\_ERROR | Errors that arise when there is an AWS Glue internal service issue.  | 
| GLUE\_OPERATION\_TIMEOUT\_ERROR | Errors that arise when an AWS Glue operation times out. | 
| GLUE\_VALIDATION\_ERROR | Errors that arise when a required value could not be validated for the AWS Glue job. | 
| GLUE\_JOB\_BOOKMARK\_VERSION\_MISMATCH\_ERROR | Errors that arise when the same job runs on the same source bucket and writes to the same or a different destination concurrently (concurrency > 1). | 
| LAUNCH\_ERROR | Errors that arise during the AWS Glue job launch phase. | 
| DYNAMODB\_ERROR | Generic errors that arise from the Amazon DynamoDB service. | 
| GLUE\_ERROR | Generic errors that arise from the AWS Glue service. | 
| LAKEFORMATION\_ERROR | Generic errors that arise from the AWS Lake Formation service. | 
| REDSHIFT\_ERROR | Generic errors that arise from the Amazon Redshift service. | 
| S3\_ERROR | Generic errors that arise from the Amazon S3 service. | 
| SYSTEM\_EXIT\_ERROR | Generic system exit error. | 
| TIMEOUT\_ERROR | Generic errors that arise when the job fails with an operation timeout. | 
| UNCLASSIFIED\_SPARK\_ERROR | Generic errors that arise from Spark. | 
| UNCLASSIFIED\_ERROR | Default error category. | 

## Limitations
<a name="monitoring-observability-limitations"></a>

**Note**  
`glueContext` must be initialized to publish the metrics.

In the Source dimension, the value is either an Amazon S3 path or a table name, depending on the source type. In addition, if the source is JDBC and the query option is used, the query string is set in the Source dimension. If the value is longer than 500 characters, it is truncated to 500 characters. The following limitations apply to the value: 
+ Non-ASCII characters will be removed.
+ If the source name doesn't contain any ASCII characters, it is converted to `<non-ASCII input>`.
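
The normalization described above can be sketched as follows. This is an illustrative reimplementation of the documented behavior, not the service's actual code:

```python
def normalize_source_value(value: str, max_len: int = 500) -> str:
    """Sketch of the documented Source-dimension normalization."""
    # Non-ASCII characters are removed.
    ascii_only = "".join(ch for ch in value if ord(ch) < 128)
    # If no ASCII characters remain, the value becomes a fixed placeholder.
    if not ascii_only:
        return "<non-ASCII input>"
    # Values longer than 500 characters are truncated.
    return ascii_only[:max_len]

print(normalize_source_value("s3://bucket/path/データ"))  # non-ASCII suffix removed
print(normalize_source_value("データ"))                   # → <non-ASCII input>
```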

### Limitations and considerations for throughput metrics
<a name="monitoring-observability-considerations"></a>
+  DataFrames and DataFrame-based DynamicFrames (for example, JDBC, or reading from Parquet on Amazon S3) are supported. However, RDD-based DynamicFrames (for example, reading CSV or JSON on Amazon S3) are not supported. Technically, all reads and writes visible on the Spark UI are supported. 
+  The `recordsRead` metric is emitted when the data source is a catalog table and the format is JSON, CSV, text, or Iceberg. 
+  The `glue.driver.throughput.recordsWritten`, `glue.driver.throughput.bytesWritten`, and `glue.driver.throughput.filesWritten` metrics are not available for JDBC and Iceberg tables. 
+  Metrics may be delayed. If a job finishes in about one minute, there may be no throughput metrics in Amazon CloudWatch Metrics. 

# Job monitoring and debugging
<a name="monitor-profile-glue-job-cloudwatch-metrics"></a>

You can collect metrics about AWS Glue jobs and visualize them on the AWS Glue and Amazon CloudWatch consoles to identify and fix issues. Profiling your AWS Glue jobs requires the following steps:

1.  Enable metrics: 

   1.  Enable the **Job metrics** option in the job definition. You can enable profiling in the AWS Glue console or as a parameter to the job. For more information, see [Defining job properties for Spark jobs](add-job.md#create-job) or [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md). 

   1.  Enable the **AWS Glue Observability metrics** option in the job definition. You can enable observability in the AWS Glue console or as a parameter to the job. For more information, see [Monitoring with AWS Glue Observability metrics](monitor-observability.md). 

1. Confirm that the job script initializes a `GlueContext`. For example, the following script snippet initializes a `GlueContext` and shows where profiled code is placed in the script. This general format is used in the debugging scenarios that follow. 

   ```
   import sys
   from awsglue.transforms import *
   from awsglue.utils import getResolvedOptions
   from pyspark.context import SparkContext
   from awsglue.context import GlueContext
   from awsglue.job import Job
   import time
   
   ## @params: [JOB_NAME]
   args = getResolvedOptions(sys.argv, ['JOB_NAME'])
   
   sc = SparkContext()
   glueContext = GlueContext(sc)
   spark = glueContext.spark_session
   job = Job(glueContext)
   job.init(args['JOB_NAME'], args)
   
   ...
   ...
   code-to-profile
   ...
   ...
   
   
   job.commit()
   ```

1. Run the job.

1. Visualize the metrics:

   1. Visualize job metrics on the AWS Glue console and identify abnormal metrics for the driver or an executor.

   1. Check observability metrics in the Job run monitoring page, job run details page, or on Amazon CloudWatch. For more information, see [Monitoring with AWS Glue Observability metrics](monitor-observability.md).

1. Narrow down the root cause using the identified metric.

1. Optionally, confirm the root cause using the log stream of the identified driver or job executor.

 **Use cases for AWS Glue observability metrics** 
+  [Debugging OOM exceptions and job abnormalities](monitor-profile-debug-oom-abnormalities.md) 
+  [Debugging demanding stages and straggler tasks](monitor-profile-debug-straggler.md) 
+  [Monitoring the progress of multiple jobs](monitor-debug-multiple.md) 
+  [Monitoring for DPU capacity planning](monitor-debug-capacity.md) 
+  [ Using AWS Glue Observability for monitoring resource utilization to reduce cost ](https://aws.amazon.com/blogs/big-data/enhance-monitoring-and-debugging-for-aws-glue-jobs-using-new-job-observability-metrics) 

# Debugging OOM exceptions and job abnormalities
<a name="monitor-profile-debug-oom-abnormalities"></a>

You can debug out-of-memory (OOM) exceptions and job abnormalities in AWS Glue. The following sections describe scenarios for debugging out-of-memory exceptions of the Apache Spark driver or a Spark executor. 
+ [Debugging a driver OOM exception](#monitor-profile-debug-oom-driver)
+ [Debugging an executor OOM exception](#monitor-profile-debug-oom-executor)

## Debugging a driver OOM exception
<a name="monitor-profile-debug-oom-driver"></a>

In this scenario, a Spark job is reading a large number of small files from Amazon Simple Storage Service (Amazon S3). It converts the files to Apache Parquet format and then writes them out to Amazon S3. The Spark driver is running out of memory. The input Amazon S3 data has more than 1 million files in different Amazon S3 partitions. 

The profiled code is as follows:

```
data = spark.read.format("json").option("inferSchema", False).load("s3://input_path")
data.write.format("parquet").save(output_path)
```

### Visualize the profiled metrics on the AWS Glue console
<a name="monitor-debug-oom-visualize"></a>

The following graph shows the memory usage as a percentage for the driver and executors. This usage is plotted as one data point that is averaged over the values reported in the last minute. You can see in the memory profile of the job that the [driver memory](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.jvm.heap.usage) crosses the safe threshold of 50 percent usage quickly. On the other hand, the [average memory usage](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.jvm.heap.usage) across all executors is still less than 4 percent. This clearly shows an abnormality with driver execution in this Spark job. 

![\[The memory usage in percentage for the driver and executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-memoryprofile.png)


The job run soon fails, and the following error appears in the **History** tab on the AWS Glue console: `Command Failed with Exit Code 1`. This error string means that the job failed due to a systemic error—which in this case is the driver running out of memory.

![\[The error message shown on the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-errorstring.png)


On the console, choose the **Error logs** link on the **History** tab to confirm the finding about driver OOM from the CloudWatch Logs. Search for "**Error**" in the job's error logs to confirm that it was indeed an OOM exception that failed the job:

```
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 12039"...
```

On the **History** tab for the job, choose **Logs**. You can find the following trace of driver execution in the CloudWatch Logs at the beginning of the job. The Spark driver tries to list all the files in all the directories, constructs an `InMemoryFileIndex`, and launches one task per file. This in turn results in the Spark driver having to maintain a large amount of state in memory to track all the tasks. It caches the complete list of a large number of files for the in-memory index, resulting in a driver OOM.
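
A back-of-envelope calculation makes the problem concrete. The per-file overhead below is a rough assumption chosen for illustration, not a measured value:

```python
# Rough estimate of driver-side state when one task is launched per input file.
num_files = 1_000_000          # more than 1 million input files, per the scenario above
bytes_per_file_state = 1_000   # assumed: path, partition info, and task bookkeeping per file

driver_state_gb = num_files * bytes_per_file_state / 1e9
print(f"~{driver_state_gb:.0f} GB of driver state just to track the file index and tasks")
```

Even at this conservative per-file overhead, the index and task-tracking state alone approaches the driver's heap budget, which is why the driver hits OOM before any executor does.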

### Fix the processing of multiple files using grouping
<a name="monitor-debug-oom-fix"></a>

You can fix the processing of the multiple files by using the *grouping* feature in AWS Glue. Grouping is automatically enabled when you use dynamic frames and when the input dataset has a large number of files (more than 50,000). Grouping allows you to coalesce multiple files together into a group, and it allows a task to process the entire group instead of a single file. As a result, the Spark driver stores significantly less state in memory to track fewer tasks. For more information about manually enabling grouping for your dataset, see [Reading input files in larger groups](grouping-input-files.md).

To check the memory profile of the AWS Glue job, profile the following code with grouping enabled:

```
df = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["s3://input_path"], "recurse":True, 'groupFiles': 'inPartition'}, format="json")
datasink = glueContext.write_dynamic_frame.from_options(frame = df, connection_type = "s3", connection_options = {"path": output_path}, format = "parquet", transformation_ctx = "datasink")
```
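
In addition to `groupFiles`, you can set a target group size explicitly with the `groupSize` connection option (a byte count passed as a string). The 128 MB value below is an illustrative choice, not a recommended default:

```python
# Connection options with an explicit target group size, in bytes, as a string.
# 134217728 bytes = 128 MB — an illustrative target for this sketch.
connection_options = {
    "paths": ["s3://input_path"],   # placeholder path from the example above
    "recurse": True,
    "groupFiles": "inPartition",
    "groupSize": "134217728",
}

# To read with these options inside a Glue job (requires a GlueContext):
#   df = glueContext.create_dynamic_frame_from_options(
#       "s3", connection_options, format="json")
```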

You can monitor the memory profile and the ETL data movement in the AWS Glue job profile.

The driver runs below the threshold of 50 percent memory usage over the entire duration of the AWS Glue job. The executors stream the data from Amazon S3, process it, and write it out to Amazon S3. As a result, they consume less than 5 percent memory at any point in time.

![\[The memory profile showing the issue is fixed.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-memoryprofile-fixed.png)


The data movement profile below shows the total number of Amazon S3 bytes that are [read](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.read_bytes) and [written](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.write_bytes) in the last minute by all executors as the job progresses. Both follow a similar pattern as the data is streamed across all the executors. The job finishes processing all one million files in less than three hours.

![\[The data movement profile showing the issue is fixed.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-etlmovement.png)


## Debugging an executor OOM exception
<a name="monitor-profile-debug-oom-executor"></a>

In this scenario, you can learn how to debug OOM exceptions that could occur in Apache Spark executors. The following code uses the Spark MySQL reader to read a large table of about 34 million rows into a Spark dataframe. It then writes it out to Amazon S3 in Parquet format. You can provide the connection properties and use the default Spark configurations to read the table.

```
val connectionProperties = new Properties()
connectionProperties.put("user", user)
connectionProperties.put("password", password)
connectionProperties.put("Driver", "com.mysql.jdbc.Driver")
val sparkSession = glueContext.sparkSession
val dfSpark = sparkSession.read.jdbc(url, tableName, connectionProperties)
dfSpark.write.format("parquet").save(output_path)
```

### Visualize the profiled metrics on the AWS Glue console
<a name="monitor-debug-oom-visualize-2"></a>

If the slope of the memory usage graph is positive and crosses 50 percent, and the job fails before the next metric is emitted, then memory exhaustion is a good candidate for the cause. The following graph shows that within a minute of execution, the [average memory usage](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.jvm.heap.usage) across all executors spikes up quickly above 50 percent. The usage reaches up to 92 percent, and the container running the executor is stopped by Apache Hadoop YARN. 

![\[The average memory usage across all executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-2-memoryprofile.png)


As the following graph shows, there is always a [single executor](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.ExecutorAllocationManager.executors.numberAllExecutors) running until the job fails. This is because a new executor is launched to replace the stopped executor. The JDBC data source reads are not parallelized by default because it would require partitioning the table on a column and opening multiple connections. As a result, only one executor reads in the complete table sequentially.

![\[The job execution shows a single executor running until the job fails.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-2-execution.png)
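
As the paragraph above describes, one way to parallelize the JDBC read is to partition the table on a numeric column so that Spark opens multiple connections. These are standard Spark JDBC options; the column name and bounds below are hypothetical and must match your table:

```python
# Partitioned JDBC read: Spark opens one connection per partition and reads
# ranges of the partition column in parallel. Column name and bounds are
# hypothetical placeholders for this sketch.
jdbc_options = {
    "url": "jdbc:mysql://host:3306/db_name",  # placeholder connection URL
    "dbtable": "table_name",
    "user": "user",
    "password": "password",
    "partitionColumn": "id",    # a numeric, ideally indexed, column
    "lowerBound": "0",
    "upperBound": "34000000",   # roughly the table's row-count range
    "numPartitions": "8",       # 8 parallel reads instead of 1
}

# Inside a job with a SparkSession available:
#   df = spark.read.format("jdbc").options(**jdbc_options).load()
#   df.write.format("parquet").save(output_path)
```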


As the following graph shows, Spark tries to launch a new task four times before failing the job. You can see the [memory profile](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.jvm.heap.used) of three executors. Each executor quickly uses up all of its memory. The fourth executor runs out of memory, and the job fails. As a result, its metric is not reported immediately.

![\[The memory profiles of the executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-2-exec-memprofile.png)


You can confirm from the error string on the AWS Glue console that the job failed due to OOM exceptions, as shown in the following image.

![\[The error message shown on the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-2-errorstring.png)


**Job output logs:** To further confirm your finding of an executor OOM exception, look at the CloudWatch Logs. When you search for **Error**, you find the four executors being stopped in roughly the same time windows as shown on the metrics dashboard. All are terminated by YARN as they exceed their memory limits.

Executor 1

```
18/06/13 16:54:29 WARN YarnAllocator: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:54:29 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:54:29 ERROR YarnClusterScheduler: Lost executor 1 on ip-10-1-2-175.ec2.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:54:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-10-1-2-175.ec2.internal, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
```

Executor 2

```
18/06/13 16:55:35 WARN YarnAllocator: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:55:35 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:55:35 ERROR YarnClusterScheduler: Lost executor 2 on ip-10-1-2-16.ec2.internal: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:55:35 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1, ip-10-1-2-16.ec2.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
```

Executor 3

```
18/06/13 16:56:37 WARN YarnAllocator: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:56:37 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:56:37 ERROR YarnClusterScheduler: Lost executor 3 on ip-10-1-2-189.ec2.internal: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:56:37 WARN TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2, ip-10-1-2-189.ec2.internal, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
```

Executor 4

```
18/06/13 16:57:18 WARN YarnAllocator: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:57:18 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:57:18 ERROR YarnClusterScheduler: Lost executor 4 on ip-10-1-2-96.ec2.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:57:18 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3, ip-10-1-2-96.ec2.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
```

### Fix the fetch size setting using AWS Glue dynamic frames
<a name="monitor-debug-oom-fix-2"></a>

The executor ran out of memory while reading the JDBC table because the default configuration for the Spark JDBC fetch size is zero. This means that the JDBC driver on the Spark executor tries to fetch the 34 million rows from the database together and cache them, even though Spark streams through the rows one at a time. With Spark, you can avoid this scenario by setting the fetch size parameter to a non-zero default value.
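
In plain Spark, that fix can be sketched as follows. The fetch size of 1,000 rows mirrors the dynamic-frame default and is an illustrative value; the connection details are placeholders:

```python
# Setting a non-zero JDBC fetch size so the executor streams rows in batches
# instead of trying to materialize the whole table. 1000 is an illustrative value.
jdbc_options = {
    "url": "jdbc:mysql://host:3306/db_name",  # placeholder connection URL
    "dbtable": "table_name",
    "user": "user",
    "password": "password",
    "fetchsize": "1000",  # rows fetched (and cached) per round trip
}

# Inside a job with a SparkSession available:
#   df = spark.read.format("jdbc").options(**jdbc_options).load()
#   df.write.format("parquet").save(output_path)
```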

You can also fix this issue by using AWS Glue dynamic frames instead. By default, dynamic frames use a fetch size of 1,000 rows, which is typically a sufficient value. As a result, the executor does not take more than 7 percent of its total memory. The AWS Glue job finishes in less than two minutes with only a single executor. While using AWS Glue dynamic frames is the recommended approach, it is also possible to set the fetch size using the Apache Spark `fetchsize` property. See the [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#jdbc-to-other-databases).

```
val (url, database, tableName) = {
  ("jdbc_url", "db_name", "table_name")
}
val source = glueContext.getSource(format, sourceJson)
val df = source.getDynamicFrame
// Write the dynamic frame out to Amazon S3 in Parquet format. The original
// sample mixed in Python syntax here; this uses the Scala GlueContext API.
glueContext.getSinkWithFormat(
  connectionType = "s3",
  options = JsonOptions(s"""{"path": "$output_path"}"""),
  format = "parquet"
).writeDynamicFrame(df)
```

**Normal profiled metrics:** The [executor memory](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.jvm.heap.usage) with AWS Glue dynamic frames never exceeds the safe threshold, as shown in the following image. It streams in the rows from the database and caches only 1,000 rows in the JDBC driver at any point in time. An out of memory exception does not occur.

![\[AWS Glue console showing executor memory below the safe threshold.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-2-memoryprofile-fixed.png)


# Debugging demanding stages and straggler tasks
<a name="monitor-profile-debug-straggler"></a>

You can use AWS Glue job profiling to identify demanding stages and straggler tasks in your extract, transform, and load (ETL) jobs. A straggler task takes much longer than the rest of the tasks in a stage of an AWS Glue job. As a result, the stage takes longer to complete, which also delays the total execution time of the job.

## Coalescing small input files into larger output files
<a name="monitor-profile-debug-straggler-scenario-1"></a>

A straggler task can occur when there is a non-uniform distribution of work across the different tasks, or a data skew results in one task processing more data.

You can profile the following code—a common pattern in Apache Spark—to coalesce a large number of small files into larger output files. For this example, the input dataset is 32 GB of JSON Gzip compressed files. The output dataset has roughly 190 GB of uncompressed JSON files. 

The profiled code is as follows:

```
datasource0 = spark.read.format("json").load("s3://input_path")
df = datasource0.coalesce(1)
df.write.format("json").save(output_path)
```

### Visualize the profiled metrics on the AWS Glue console
<a name="monitor-debug-straggler-visualize"></a>

You can profile your job to examine four different sets of metrics:
+ ETL data movement
+ Data shuffle across executors
+ Job execution
+ Memory profile

**ETL data movement**: In the **ETL Data Movement** profile, the bytes are [read](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.read_bytes) fairly quickly by all the executors in the first stage that completes within the first six minutes. However, the total job execution time is around one hour, mostly consisting of the data [writes](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.write_bytes).

![\[Graph showing the ETL Data Movement profile.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-1.png)


**Data shuffle across executors:** The number of bytes [read](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.shuffleLocalBytesRead) and [written](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.shuffleBytesWritten) during shuffling also shows a spike before Stage 2 ends, as indicated by the **Job Execution** and **Data Shuffle** metrics. After the data shuffles from all executors, the reads and writes proceed from executor number 3 only.

![\[The metrics for data shuffle across the executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-2.png)


**Job execution:** As the following graph shows, all other executors are idle and are eventually relinquished by 10:09. At that point, the total number of executors decreases to only one. This clearly shows that executor number 3 is running the straggler task, which takes the longest execution time and contributes most of the job execution time.

![\[The execution metrics for the active executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-3.png)


**Memory profile:** After the first two stages, only [executor number 3](monitoring-awsglue-with-cloudwatch-metrics.md#glue.executorId.jvm.heap.used) is actively consuming memory to process the data. The remaining executors are simply idle or have been relinquished shortly after the completion of the first two stages. 

![\[The metrics for the memory profile after the first two stages.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-4.png)


### Fix straggling executors using grouping
<a name="monitor-debug-straggler-fix"></a>

You can avoid straggling executors by using the *grouping* feature in AWS Glue. Use grouping to distribute the data uniformly across all the executors and coalesce files into larger files using all the available executors on the cluster. For more information, see [Reading input files in larger groups](grouping-input-files.md).

To check the ETL data movements in the AWS Glue job, profile the following code with grouping enabled:

```
df = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["s3://input_path"], "recurse":True, 'groupFiles': 'inPartition'}, format="json")
datasink = glueContext.write_dynamic_frame.from_options(frame = df, connection_type = "s3", connection_options = {"path": output_path}, format = "json", transformation_ctx = "datasink4")
```

**ETL data movement:** The data writes are now streamed in parallel with the data reads throughout the job execution time. As a result, the job finishes within eight minutes—much faster than previously.

![\[The ETL data movements showing the issue is fixed.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-5.png)


**Data shuffle across executors:** As the input files are coalesced during the reads using the grouping feature, there is no costly data shuffle after the data reads.

![\[The data shuffle metrics showing the issue is fixed.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-6.png)


**Job execution:** The job execution metrics show that the total number of active executors running and processing data remains fairly constant. There is no single straggler in the job. All executors are active and are not relinquished until the completion of the job. Because there is no intermediate shuffle of data across the executors, there is only a single stage in the job.

![\[The metrics for the Job Execution widget showing that there are no stragglers in the job.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-7.png)


**Memory profile:** The metrics show the [active memory consumption](monitoring-awsglue-with-cloudwatch-metrics.md#glue.executorId.jvm.heap.used) across all executors—reconfirming that there is activity across all executors. As data is streamed in and written out in parallel, the total memory footprint of all executors is roughly uniform and well below the safe threshold for all executors.

![\[The memory profile metrics showing active memory consumption across all executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-8.png)


# Monitoring the progress of multiple jobs
<a name="monitor-debug-multiple"></a>

You can profile multiple AWS Glue jobs together and monitor the flow of data between them. This is a common workflow pattern that requires monitoring individual job progress, the data processing backlog, data reprocessing, and job bookmarks.

**Topics**
+ [Profiled code](#monitor-debug-multiple-profile)
+ [Visualize the profiled metrics on the AWS Glue console](#monitor-debug-multiple-visualize)
+ [Fix the processing of files](#monitor-debug-multiple-fix)

## Profiled code
<a name="monitor-debug-multiple-profile"></a>

In this workflow, you have two jobs: an Input job and an Output job. The Input job is scheduled to run every 30 minutes using a periodic trigger. The Output job is scheduled to run after each successful run of the Input job. These scheduled jobs are controlled using job triggers.

![\[Console screenshot showing the job triggers controlling the scheduling of the Input and Output jobs.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-1.png)


**Input job**: This job reads in data from an Amazon Simple Storage Service (Amazon S3) location, transforms it using `ApplyMapping`, and writes it to a staging Amazon S3 location. The following code is profiled code for the Input job:

```
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options = {"paths": ["s3://input_path"], "useS3ListImplementation":True,"recurse":True}, format="json")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [map_spec])
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": staging_path, "compression": "gzip"}, format = "json")
```

**Output job**: This job reads the output of the Input job from the staging location in Amazon S3, transforms it again, and writes it to a destination:

```
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options = {"paths": [staging_path], "useS3ListImplementation":True,"recurse":True}, format="json")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [map_spec])
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": output_path}, format = "json")
```

## Visualize the profiled metrics on the AWS Glue console
<a name="monitor-debug-multiple-visualize"></a>

The following dashboard superimposes the Amazon S3 bytes written metric from the Input job onto the Amazon S3 bytes read metric on the same timeline for the Output job. The timeline shows different job runs of the Input and Output jobs. The Input job (shown in red) starts every 30 minutes. The Output Job (shown in brown) starts at the completion of the Input Job, with a Max Concurrency of 1. 

![\[Graph showing the data read and written.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-4.png)


In this example, [job bookmarks](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html) are not enabled. No transformation contexts are used to enable job bookmarks in the script code. 

**Job History**: The Input and Output jobs have multiple runs, as shown on the **History** tab, starting from 12:00 PM.

The Input job on the AWS Glue console looks like this:

![\[Console screenshot showing the History tab of the Input job.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-2.png)


The following image shows the Output job:

![\[Console screenshot showing the History tab of the Output job.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-3.png)


**First job runs**: As shown in the Data Bytes Read and Written graph below, the first job runs of the Input and Output jobs between 12:00 and 12:30 show roughly the same area under the curves. Those areas represent the Amazon S3 bytes written by the Input job and the Amazon S3 bytes read by the Output job. This is also confirmed by the ratio of Amazon S3 bytes read by the Output job to Amazon S3 bytes written by the Input job (each summed over 30 minutes, the job trigger frequency for the Input job). The data point for this ratio for the Input job run that started at 12:00 PM is also 1.
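The flow ratio plotted in these graphs can be sketched as a simple computation (illustrative only; the byte counts below are hypothetical sample values, not real job metrics):

```python
def flow_ratio(bytes_read_by_output, bytes_written_by_input):
    """Ratio of bytes the Output job read to bytes the Input job wrote
    in the same 30-minute window. A ratio near 1 means no backlog or
    reprocessing; above 1 means older staging data is being reread."""
    return bytes_read_by_output / bytes_written_by_input

# First window (12:00-12:30): the Output job reads exactly what was written.
print(flow_ratio(4_000_000_000, 4_000_000_000))  # 1.0

# Second window (12:30-13:00): the Output job rereads earlier staging data.
print(flow_ratio(10_000_000_000, 4_000_000_000))  # 2.5
```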

The following graph shows the data flow ratio across all the job runs:

![\[Graph showing the data flow ratio: bytes written and bytes read.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-5.png)


**Second job runs**: In the second job run, there is a clear difference in the number of bytes read by the Output job compared to the number of bytes written by the Input job. (Compare the area under the curve across the two job runs for the Output job, or compare the areas in the second run of the Input and Output jobs.) The ratio of the bytes read and written shows that the Output Job read about 2.5x the data written by the Input job in the second span of 30 minutes from 12:30 to 13:00. This is because the Output Job reprocessed the output of the first job run of the Input job because job bookmarks were not enabled. A ratio above 1 shows that there is an additional backlog of data that was processed by the Output job.

**Third job runs**: The Input job is fairly consistent in terms of the number of bytes written (see the area under the red curves). However, the third job run of the Input job ran longer than expected (see the long tail of the red curve). As a result, the third job run of the Output job started late. The third job run processed only a fraction of the data accumulated in the staging location in the remaining 30 minutes between 13:00 and 13:30. The ratio of the byte flow shows that it processed only 0.83 of the data written by the third job run of the Input job (see the ratio at 13:00).

**Overlap of Input and Output jobs**: The fourth job run of the Input job started at 13:30 as scheduled, before the third job run of the Output job finished. There is a partial overlap between these two job runs. However, the third job run of the Output job captures only the files that it listed in the staging location of Amazon S3 when it began around 13:17. This consists of all the data output by the earlier job runs of the Input job. The actual ratio at 13:30 is around 2.75: the third job run of the Output job processed about 2.75x the data written by the fourth job run of the Input job from 13:30 to 14:00.

As these images show, the Output job is reprocessing data from the staging location from all prior job runs of the Input job. As a result, the fourth job run for the Output job is the longest and overlaps with the entire fifth job run of the Input job.

## Fix the processing of files
<a name="monitor-debug-multiple-fix"></a>

You should ensure that the Output job processes only the files that haven't been processed by previous job runs of the Output job. To do this, enable job bookmarks and set the transformation context in the Output job, as follows:

```
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options = {"paths": [staging_path], "useS3ListImplementation":True,"recurse":True}, format="json", transformation_ctx = "bookmark_ctx")
```

With job bookmarks enabled, the Output job doesn't reprocess the data in the staging location from all the previous job runs of the Input job. In the following image showing the data read and written, the area under the brown curve is fairly consistent and similar to that under the red curves. 

![\[Graph showing the data read and written as red and brown lines.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-6.png)


The ratios of byte flow also remain close to 1 because no additional data is processed.

![\[Graph showing the data flow ratio: bytes written and bytes read\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-7.png)


A job run for the Output job starts and captures the files in the staging location before the next Input job run starts putting more data into the staging location. As long as it continues to do this, it processes only the files captured from the previous Input job run, and the ratio stays close to 1.


Suppose that the Input job takes longer than expected, and as a result, the Output job captures files in the staging location from two Input job runs. The ratio is then higher than 1 for that Output job run. However, the following job runs of the Output job don't process any files that are already processed by the previous job runs of the Output job.

# Monitoring for DPU capacity planning
<a name="monitor-debug-capacity"></a>

You can use job metrics in AWS Glue to estimate the number of data processing units (DPUs) that can be used to scale out an AWS Glue job.

**Note**  
This page is only applicable to AWS Glue versions 0.9 and 1.0. Later versions of AWS Glue contain cost-saving features that introduce additional considerations when you plan capacity. 

**Topics**
+ [Profiled code](#monitor-debug-capacity-profile)
+ [Visualize the profiled metrics on the AWS Glue console](#monitor-debug-capacity-visualize)
+ [Determine the optimal DPU capacity](#monitor-debug-capacity-fix)

## Profiled code
<a name="monitor-debug-capacity-profile"></a>

The following script reads an Amazon Simple Storage Service (Amazon S3) partition containing 428 gzipped JSON files, applies a mapping to change the field names, converts the data to Apache Parquet format, and writes it to Amazon S3. You provision the default of 10 DPUs and run this job. 

```
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options = {"paths": [input_path], "useS3ListImplementation":True,"recurse":True}, format="json")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [map_spec])
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": output_path}, format = "parquet")
```

## Visualize the profiled metrics on the AWS Glue console
<a name="monitor-debug-capacity-visualize"></a>

**Job run 1:** This job run shows how to determine whether the cluster has under-provisioned DPUs. The job execution functionality in AWS Glue shows the total [number of actively running executors](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.ExecutorAllocationManager.executors.numberAllExecutors), the [number of completed stages](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.numCompletedStages), and the [number of maximum needed executors](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors).

The number of maximum needed executors is computed by adding the total number of running tasks and pending tasks, and dividing by the tasks per executor. This result is a measure of the total number of executors required to satisfy the current load. 

In contrast, the number of actively running executors measures how many executors are running active Apache Spark tasks. As the job progresses, the maximum needed executors can change and typically goes down towards the end of the job as the pending task queue diminishes.
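The computation just described can be sketched as a small helper (illustrative only; the task counts below are hypothetical sample values):

```python
import math

def max_needed_executors(running_tasks, pending_tasks, tasks_per_executor):
    """Executors required to satisfy the current load: running plus
    pending tasks, divided by tasks per executor, rounded up."""
    return math.ceil((running_tasks + pending_tasks) / tasks_per_executor)

# Hypothetical snapshot: 20 running tasks, 88 pending, 4 tasks per executor.
print(max_needed_executors(20, 88, 4))  # 27
```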

The horizontal red line in the following graph shows the number of maximum allocated executors, which depends on the number of DPUs that you allocate for the job. In this case, you allocate 10 DPUs for the job run. One DPU is reserved for management. Nine DPUs run two executors each, and one executor is reserved for the Spark driver. The Spark driver runs inside the primary application. So, the number of maximum allocated executors is 2 x 9 - 1 = 17 executors.

![\[The job metrics showing active executors and maximum needed executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-capacity-1.png)


As the graph shows, the number of maximum needed executors starts at 107 at the beginning of the job, whereas the number of active executors remains 17. This is the same as the number of maximum allocated executors with 10 DPUs. The ratio between the maximum needed executors and maximum allocated executors (adding 1 to both for the Spark driver) gives you the under-provisioning factor: 108/18 = 6x. You can provision 6 (under-provisioning ratio) x 9 (current DPU capacity - 1) + 1 DPU = 55 DPUs to scale out the job to run it with maximum parallelism and finish faster. 
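These sizing rules can be captured in a short sketch (illustrative only, and specific to the AWS Glue 0.9/1.0 Standard worker arithmetic used in this example):

```python
import math

def allocated_executors(dpus):
    """Maximum allocated executors for Glue 0.9/1.0 Standard workers:
    one DPU is reserved for management, each remaining DPU runs two
    executors, and one executor is reserved for the Spark driver."""
    return 2 * (dpus - 1) - 1

def recommended_dpus(max_needed, current_dpus):
    """Scale the current capacity by the under-provisioning factor
    (1 is added to both executor counts to account for the Spark driver)."""
    factor = math.ceil((max_needed + 1) / (allocated_executors(current_dpus) + 1))
    return factor * (current_dpus - 1) + 1

print(allocated_executors(10))    # 17 executors with 10 DPUs
print(recommended_dpus(107, 10))  # 55 DPUs for maximum parallelism
```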

The AWS Glue console displays the detailed job metrics as a static line representing the original number of maximum allocated executors. The console computes the maximum allocated executors from the job definition for the metrics. By contrast, for detailed job run metrics, the console computes the maximum allocated executors from the job run configuration, specifically the DPUs allocated for the job run. To view metrics for an individual job run, select the job run and choose **View run metrics**.

![\[The job metrics showing ETL data movement.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-capacity-2.png)


Looking at the Amazon S3 bytes [read](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.read_bytes) and [written](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.write_bytes), notice that the job spends all six minutes streaming in data from Amazon S3 and writing it out in parallel. All the cores on the allocated DPUs are reading and writing to Amazon S3. The maximum number of needed executors being 107 also matches the number of files in the input Amazon S3 path—428. Each executor can launch four Spark tasks to process four input files (JSON gzipped).
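The file-count check mentioned above is simple arithmetic; a quick sketch using this example's figures:

```python
import math

# This example's input: 428 gzipped JSON files in the Amazon S3 path.
# Each executor launches 4 Spark tasks, processing one input file per task.
input_files = 428
tasks_per_executor = 4

needed_executors = math.ceil(input_files / tasks_per_executor)
print(needed_executors)  # 107, matching the maximum needed executors metric
```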

## Determine the optimal DPU capacity
<a name="monitor-debug-capacity-fix"></a>

Based on the results of the previous job run, you can increase the total number of allocated DPUs to 55, and see how the job performs. The job finishes in less than three minutes—half the time it required previously. The job scale-out is not linear in this case because it is a short running job. Jobs with long-lived tasks or a large number of tasks (a large number of max needed executors) benefit from a close-to-linear DPU scale-out performance speedup.

![\[Graph showing increasing the total number of allocated DPUs\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-capacity-3.png)


As the above image shows, the total number of active executors reaches the maximum allocated—107 executors. Similarly, the maximum needed executors is never above the maximum allocated executors. The maximum needed executors number is computed from the actively running and pending task counts, so it might be smaller than the number of active executors. This is because there can be executors that are partially or completely idle for a short period of time and are not yet decommissioned.

![\[Graph showing the total number of active executors reaching the maximum allocated.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-capacity-4.png)


This job run uses 6x more executors to read and write from Amazon S3 in parallel. As a result, this job run uses more Amazon S3 bandwidth for both reads and writes, and finishes faster. 

### Identify overprovisioned DPUs
<a name="monitor-debug-capacity-over"></a>

Next, you can determine whether scaling out the job with 100 DPUs (99 x 2 = 198 executors) helps it scale any further. As the following graph shows, the job still takes three minutes to finish. Similarly, the job does not scale out beyond 107 executors (the 55-DPU configuration), and the remaining 91 executors are overprovisioned and not used at all. This shows that increasing the number of DPUs might not always improve performance, as is evident from the maximum needed executors.

![\[Graph showing that job performance does not always increase by increasing the number of DPUs.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-capacity-5.png)


### Compare time differences
<a name="monitor-debug-capacity-time"></a>

The following table summarizes the execution times of three job runs with 10 DPUs, 55 DPUs, and 100 DPUs. You can find the DPU capacity that improves the job execution time by using the estimates you established from monitoring the first job run.


| Job ID | Number of DPUs | Execution time | 
| --- | --- | --- | 
| jr\_c894524c8ef5048a4d9... | 10 | 6 min. | 
| jr\_1a466cf2575e7ffe6856... | 55 | 3 min. | 
| jr\_34fa1ed4c6aa9ff0a814... | 100 | 3 min. | 

# Generative AI troubleshooting for Apache Spark in AWS Glue
<a name="troubleshoot-spark"></a>

 Generative AI Troubleshooting for Apache Spark jobs in AWS Glue is a capability that helps data engineers and data scientists diagnose and fix issues in their Spark applications. Using machine learning and generative AI technologies, this feature analyzes issues in Spark jobs and provides a detailed root cause analysis along with actionable recommendations to resolve those issues. Generative AI troubleshooting for Apache Spark is available for jobs running on AWS Glue version 4.0 and later. 


|  | 
| --- |
|  Transform your Apache Spark troubleshooting with our AI-powered Troubleshooting Agent, now supporting all major deployment modes including AWS Glue, Amazon EMR-EC2, Amazon EMR-Serverless and Amazon SageMaker AI Notebooks. This powerful agent eliminates complex debugging processes by combining natural language interactions, real-time workload analysis, and smart code recommendations into a seamless experience. For implementation details, refer to [What is Apache Spark Troubleshooting Agent for Amazon EMR](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/spark-troubleshoot.html). View the second demonstration in [Using the Troubleshooting Agent](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/spark-troubleshooting-using-troubleshooting-agent.html) for AWS Glue troubleshooting examples.  | 

## How does Generative AI Troubleshooting for Apache Spark work?
<a name="troubleshoot-spark-how-it-works"></a>

 For your failed Spark jobs, Generative AI Troubleshooting analyzes the job metadata and the precise metrics and logs associated with the error signature of your job to generate a root cause analysis, and recommends specific solutions and best practices to help address job failures. 

## Setting up Generative AI Troubleshooting for Apache Spark for your jobs
<a name="w2aac37c11c12c33c13"></a>

### Configuring IAM permissions
<a name="troubleshoot-spark-iam-permissions"></a>

 To use Spark troubleshooting for your jobs in AWS Glue, you need IAM permissions for the APIs that the feature calls. You can grant these permissions by attaching the following custom AWS policy to your IAM identity (such as a user, role, or group). 

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartCompletion",
        "glue:GetCompletion"
      ],
      "Resource": [
        "arn:aws:glue:*:*:completion/*",
        "arn:aws:glue:*:*:job/*"
      ]
    }
  ]
}
```

------

**Note**  
 The IAM policy includes the two APIs used to enable this experience through the AWS Glue Studio console: `StartCompletion` and `GetCompletion`. 

### Assigning permissions
<a name="troubleshoot-spark-assigning-permissions"></a>

 To provide access, add permissions to your users, groups, or roles: 
+  For users and groups in IAM Identity Center: Create a permission set. Follow the instructions in [ Create a permission set ](https://docs.aws.amazon.com/singlesignon/latest/userguide/howtocreatepermissionset.html) in the IAM Identity Center User Guide. 
+  For users managed in IAM through an identity provider: Create a role for identity federation. Follow the instructions in [ Creating a role for a third-party identity provider (federation) ](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp.html) in the IAM User Guide. 
+  For IAM users: Create a role that your user can assume. Follow the instructions in [ Creating a role for an IAM user ](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user.html) in the IAM User Guide. 

## Running troubleshooting analysis from a failed job run
<a name="troubleshoot-spark-run-analysis"></a>

 You can access the troubleshooting feature through multiple paths in the AWS Glue console. Here's how to get started: 

### Option 1: From the Jobs List page
<a name="troubleshoot-spark-from-jobs-list"></a>

1.  Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1.  In the navigation pane, choose **ETL Jobs**. 

1.  Locate your failed job in the jobs list. 

1.  Select the **Runs** tab in the job details section. 

1.  Choose the failed job run that you want to analyze. 

1.  Choose **Troubleshoot with AI** to start the analysis. 

1.  When the troubleshooting analysis is complete, you can view the root-cause analysis and recommendations in the **Troubleshooting analysis** tab at the bottom of the screen. 

![\[The GIF shows an end to end implementation of a failed run and the troubleshoot with AI feature running.\]](http://docs.aws.amazon.com/glue/latest/dg/images/troubleshoot_spark_option_1_jobs_list.gif)


### Option 2: Using the Job Run Monitoring page
<a name="troubleshoot-spark-job-run-monitoring-page"></a>

1.  Navigate to the **Job run monitoring** page. 

1.  Locate your failed job run. 

1.  Choose the **Actions** drop-down menu. 

1.  Choose **Troubleshoot with AI**. 

![\[The GIF shows an end to end implementation of a failed run and the troubleshoot with AI feature running.\]](http://docs.aws.amazon.com/glue/latest/dg/images/troubleshoot_spark_option_2_job_monitoring.gif)


### Option 3: From the Job Run Details page
<a name="troubleshoot-spark-job-run-details-page"></a>

1.  Navigate to your failed job run's details page by either clicking **View details** on a failed run from the **Runs** tab or selecting the job run from the **Job run monitoring** page. 

1.  In the job run details page, find the **Troubleshooting analysis** tab. 

## Supported troubleshooting categories
<a name="troubleshoot-spark-supported-troubleshooting-categories"></a>

 This service focuses on four primary categories of issues that data engineers and developers frequently encounter in their Spark applications: 
+  **Resource setup and access errors:** When running Spark applications in AWS Glue, resource setup and access errors are among the most common yet challenging issues to diagnose. These errors often occur when your Spark application attempts to interact with AWS resources but encounters permission issues, missing resources, or configuration problems. 
+  **Spark driver and executor memory issues:** Memory-related errors in Apache Spark jobs can be complex to diagnose and resolve. These errors often manifest when your data processing requirements exceed the available memory resources, either on the driver node or executor nodes. 
+  **Spark disk capacity issues:** Storage-related errors in AWS Glue Spark jobs often emerge during shuffle operations, data spilling, or when dealing with large-scale data transformations. These errors can be particularly tricky because they might not manifest until your job has been running for a while, potentially wasting valuable compute time and resources. 
+  **Query execution errors:** Query failures in Spark SQL and DataFrame operations can be difficult to troubleshoot because error messages may not clearly point to the root cause, and queries that work fine with small datasets can suddenly fail at scale. These errors become even more challenging when they occur deep within complex transformation pipelines, where the actual issue may stem from data quality problems in earlier stages rather than the query logic itself. 

**Note**  
 Before implementing any suggested changes in your production environment, review the suggested changes thoroughly. The service provides recommendations based on patterns and best practices, but your specific use case might require additional considerations. 

## Supported regions
<a name="troubleshoot-spark-supported-regions"></a>

Generative AI troubleshooting for Apache Spark is available in the following regions:
+ **Africa**: Cape Town (af-south-1)
+ **Asia Pacific**: Hong Kong (ap-east-1), Tokyo (ap-northeast-1), Seoul (ap-northeast-2), Osaka (ap-northeast-3), Mumbai (ap-south-1), Singapore (ap-southeast-1), Sydney (ap-southeast-2), and Jakarta (ap-southeast-3)
+ **Europe**: Frankfurt (eu-central-1), Stockholm (eu-north-1), Milan (eu-south-1), Ireland (eu-west-1), London (eu-west-2), and Paris (eu-west-3)
+ **Middle East**: Bahrain (me-south-1) and UAE (me-central-1)
+ **North America**: Canada (ca-central-1)
+ **South America**: São Paulo (sa-east-1)
+ **United States**: N. Virginia (us-east-1), Ohio (us-east-2), N. California (us-west-1), and Oregon (us-west-2)

# Using materialized views with AWS Glue
<a name="materialized-views"></a>

AWS Glue version 5.1 and later supports creating and managing Apache Iceberg materialized views in the AWS Glue Data Catalog. A materialized view is a managed table that stores the precomputed result of a SQL query in Apache Iceberg format and incrementally updates as the underlying source tables change. You can use materialized views to simplify data transformation pipelines and accelerate query performance for complex analytical workloads.

When you create a materialized view using Spark in AWS Glue, the view definition and metadata are stored in the AWS Glue Data Catalog. The precomputed results are stored as Apache Iceberg tables in Amazon S3 Tables buckets or Amazon S3 general purpose buckets within your account. The AWS Glue Data Catalog automatically monitors source tables and refreshes materialized views using managed compute infrastructure.

**Topics**
+ [How materialized views work with AWS Glue](#materialized-views-how-they-work)
+ [Prerequisites](#materialized-views-prerequisites)
+ [Configuring Spark to use materialized views](#materialized-views-configuring-spark)
+ [Creating materialized views](#materialized-views-creating)
+ [Querying materialized views](#materialized-views-querying)
+ [Refreshing materialized views](#materialized-views-refreshing)
+ [Managing materialized views](#materialized-views-managing)
+ [Permissions for materialized views](#materialized-views-permissions)
+ [Monitoring materialized view operations](#materialized-views-monitoring)
+ [Example: Complete workflow](#materialized-views-complete-workflow)
+ [Considerations and limitations](#materialized-views-considerations-limitations)

## How materialized views work with AWS Glue
<a name="materialized-views-how-they-work"></a>

Materialized views integrate with AWS Glue through Apache Spark's Iceberg support in AWS Glue jobs and AWS Glue Studio notebooks. When you configure your Spark session to use the AWS Glue Data Catalog, you can create materialized views using standard SQL syntax. The Spark optimizer can automatically rewrite queries to use materialized views when they provide better performance, eliminating the need to manually modify application code.

The AWS Glue Data Catalog handles all operational aspects of materialized view maintenance, including:
+ Detecting changes in source tables using Apache Iceberg's metadata layer
+ Scheduling and executing refresh operations using managed Spark compute
+ Determining whether to perform full or incremental refresh based on the data changes
+ Storing precomputed results in Apache Iceberg format for multi-engine access

You can query materialized views from AWS Glue using the same Spark SQL interfaces you use for regular tables. The precomputed data is also accessible from other services including Amazon Athena and Amazon Redshift.

## Prerequisites
<a name="materialized-views-prerequisites"></a>

To use materialized views with AWS Glue, you need:
+ An AWS account
+ AWS Glue version 5.1 or later
+ Source tables in Apache Iceberg format registered in the AWS Glue Data Catalog
+ AWS Lake Formation permissions configured for source tables and target databases
+ An S3 Tables bucket or S3 general purpose bucket registered with AWS Lake Formation for storing materialized view data
+ An IAM role with permissions to access AWS Glue Data Catalog and Amazon S3

## Configuring Spark to use materialized views
<a name="materialized-views-configuring-spark"></a>

To create and manage materialized views in AWS Glue, configure your Spark session with the required Iceberg extensions and catalog settings. The configuration method varies depending on whether you're using AWS Glue jobs or AWS Glue Studio notebooks.

### Configuring AWS Glue jobs
<a name="materialized-views-configuring-glue-jobs"></a>

When creating or updating an AWS Glue job, add the following configuration parameters as job parameters:

#### For S3 Tables buckets
<a name="materialized-views-s3-tables-buckets"></a>

```
job = glue.create_job(
    Name='materialized-view-job',
    Role='arn:aws:iam::111122223333:role/GlueServiceRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://amzn-s3-demo-bucket/scripts/mv-script.py',
        'PythonVersion': '3'
    },
    DefaultArguments={
        '--enable-glue-datacatalog': 'true',
        '--conf': 'spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions '
                  '--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog '
                  '--conf spark.sql.catalog.glue_catalog.type=glue '
                  '--conf spark.sql.catalog.glue_catalog.warehouse=s3://amzn-s3-demo-bucket/warehouse '
                  '--conf spark.sql.catalog.glue_catalog.glue.region=us-east-1 '
                  '--conf spark.sql.catalog.glue_catalog.glue.id=111122223333 '
                  '--conf spark.sql.catalog.glue_catalog.glue.account-id=111122223333 '
                  '--conf spark.sql.catalog.glue_catalog.glue.lakeformation-enabled=true '
                  '--conf spark.sql.catalog.s3t_catalog=org.apache.iceberg.spark.SparkCatalog '
                  '--conf spark.sql.catalog.s3t_catalog.type=glue '
                  '--conf spark.sql.catalog.s3t_catalog.glue.id=111122223333:s3tablescatalog/my-table-bucket '
                  '--conf spark.sql.catalog.s3t_catalog.glue.account-id=111122223333 '
                  '--conf spark.sql.catalog.s3t_catalog.glue.lakeformation-enabled=true '
                  '--conf spark.sql.catalog.s3t_catalog.warehouse=s3://amzn-s3-demo-bucket/mv-warehouse '
                  '--conf spark.sql.catalog.s3t_catalog.glue.region=us-east-1 '
                  '--conf spark.sql.defaultCatalog=s3t_catalog '
                  '--conf spark.sql.optimizer.answerQueriesWithMVs.enabled=true '
                  '--conf spark.sql.materializedViews.metadataCache.enabled=true'
    },
    GlueVersion='5.1'
)
```

#### For S3 general purpose buckets
<a name="materialized-views-s3-general-purpose-buckets"></a>

```
job = glue.create_job(
    Name='materialized-view-job',
    Role='arn:aws:iam::111122223333:role/GlueServiceRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://amzn-s3-demo-bucket/scripts/mv-script.py',
        'PythonVersion': '3'
    },
    DefaultArguments={
        '--enable-glue-datacatalog': 'true',
        '--conf': 'spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions '
                  '--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog '
                  '--conf spark.sql.catalog.glue_catalog.type=glue '
                  '--conf spark.sql.catalog.glue_catalog.warehouse=s3://amzn-s3-demo-bucket/warehouse '
                  '--conf spark.sql.catalog.glue_catalog.glue.region=us-east-1 '
                  '--conf spark.sql.catalog.glue_catalog.glue.id=111122223333 '
                  '--conf spark.sql.catalog.glue_catalog.glue.account-id=111122223333 '
                  '--conf spark.sql.catalog.glue_catalog.glue.lakeformation-enabled=true '
                  '--conf spark.sql.defaultCatalog=glue_catalog '
                  '--conf spark.sql.optimizer.answerQueriesWithMVs.enabled=true '
                  '--conf spark.sql.materializedViews.metadataCache.enabled=true'
    },
    GlueVersion='5.1'
)
```
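Hand-joining the long `--conf` value across adjacent string literals is easy to get wrong: a stray comma between literals turns the dictionary value into a syntax error. As a sketch, you could assemble the value from a plain dict instead. The helper below is illustrative, not an AWS API; the settings shown mirror the examples above.

```
# Hypothetical helper: build the single '--conf' job-parameter value that
# AWS Glue expects from a dict of Spark settings. Glue accepts only one
# --conf job parameter, so every setting after the first is chained with
# a literal '--conf ' prefix inside the value.
def build_conf_argument(settings):
    items = [f"{key}={value}" for key, value in settings.items()]
    return " --conf ".join(items)

settings = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.type": "glue",
    "spark.sql.defaultCatalog": "glue_catalog",
    "spark.sql.optimizer.answerQueriesWithMVs.enabled": "true",
}
conf_value = build_conf_argument(settings)
# Pass the result as: DefaultArguments={'--conf': conf_value, ...}
```

Because Python dicts preserve insertion order, the settings appear in the order you declare them, which keeps the generated argument easy to diff against the examples above.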

### Configuring AWS Glue Studio notebooks
<a name="materialized-views-configuring-glue-studio-notebooks"></a>

In AWS Glue Studio notebooks, configure your Spark session using the %%configure magic command at the beginning of your notebook:

```
%%configure
{
    "conf": {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.glue_catalog.type": "glue",
        "spark.sql.catalog.glue_catalog.warehouse": "s3://amzn-s3-demo-bucket/warehouse",
        "spark.sql.catalog.glue_catalog.glue.region": "us-east-1",
        "spark.sql.catalog.glue_catalog.glue.id": "111122223333",
        "spark.sql.catalog.glue_catalog.glue.account-id": "111122223333",
        "spark.sql.catalog.glue_catalog.glue.lakeformation-enabled": "true",
        "spark.sql.defaultCatalog": "glue_catalog",
        "spark.sql.optimizer.answerQueriesWithMVs.enabled": "true",
        "spark.sql.materializedViews.metadataCache.enabled": "true"
    }
}
```

### Enabling incremental refresh
<a name="materialized-views-enabling-incremental-refresh"></a>

To enable incremental refresh optimization, add the following configuration properties to your job parameters or notebook configuration:

```
--conf spark.sql.optimizer.incrementalMVRefresh.enabled=true
--conf spark.sql.optimizer.incrementalMVRefresh.deltaThresholdCheckEnabled=false
```

### Configuration parameters
<a name="materialized-views-configuration-parameters"></a>

The following configuration parameters control materialized view behavior:
+ `spark.sql.extensions` – Enables Iceberg Spark session extensions required for materialized view support.
+ `spark.sql.optimizer.answerQueriesWithMVs.enabled` – Enables automatic query rewrite to use materialized views. Set to true to activate this optimization.
+ `spark.sql.materializedViews.metadataCache.enabled` – Enables caching of materialized view metadata for query optimization. Set to true to improve query rewrite performance.
+ `spark.sql.optimizer.incrementalMVRefresh.enabled` – Enables incremental refresh optimization. Set to true to process only changed data during refresh operations.
+ `spark.sql.optimizer.answerQueriesWithMVs.decimalAggregateCheckEnabled` – Controls validation of decimal aggregate operations in query rewrite. Set to false to disable certain decimal overflow checks.

## Creating materialized views
<a name="materialized-views-creating"></a>

You create materialized views using the CREATE MATERIALIZED VIEW SQL statement in AWS Glue jobs or notebooks. The view definition specifies the transformation logic as a SQL query that references one or more source tables.

### Creating a basic materialized view in AWS Glue jobs
<a name="materialized-views-creating-basic-glue-jobs"></a>

The following example demonstrates creating a materialized view in an AWS Glue job script. Use fully qualified table names (the three-part naming convention) in the view definition:

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Create materialized view
spark.sql("""
    CREATE MATERIALIZED VIEW customer_orders
    AS 
    SELECT 
        customer_name, 
        COUNT(*) as order_count, 
        SUM(amount) as total_amount 
    FROM glue_catalog.sales.orders
    GROUP BY customer_name
""")
```

### Creating a materialized view with automatic refresh
<a name="materialized-views-creating-automatic-refresh"></a>

To configure automatic refresh, specify a refresh schedule when you create the view. Use fully qualified table names (the three-part naming convention) in the view definition:

```
spark.sql("""
    CREATE MATERIALIZED VIEW customer_orders
    SCHEDULE REFRESH EVERY 1 HOUR
    AS 
    SELECT 
        customer_name, 
        COUNT(*) as order_count, 
        SUM(amount) as total_amount 
    FROM glue_catalog.sales.orders
    GROUP BY customer_name
""")
```

### Creating a materialized view with cross-catalog references
<a name="materialized-views-creating-cross-catalog"></a>

When your source tables are in a different catalog than your materialized view, use fully qualified table names (the three-part naming convention) in both the view name and the view definition:

```
spark.sql("""
    CREATE MATERIALIZED VIEW s3t_catalog.analytics.customer_summary
    AS 
    SELECT 
        customer_name, 
        COUNT(*) as order_count, 
        SUM(amount) as total_amount 
    FROM glue_catalog.sales.orders
    GROUP BY customer_name
""")
```
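When composing these three-part names programmatically, a small helper can also enforce the identifier restriction noted later under considerations and limitations (only alphanumeric characters and underscores are supported). This function is purely illustrative, not part of any AWS API:

```
# Hypothetical helper: compose the catalog.database.table name that
# cross-catalog view definitions require, rejecting identifiers the
# materialized view feature does not support.
def qualified_name(catalog, database, table):
    parts = (catalog, database, table)
    for part in parts:
        # Only alphanumerics and underscores are allowed in identifiers
        if not part.replace("_", "").isalnum():
            raise ValueError(f"unsupported identifier: {part!r}")
    return ".".join(parts)

source_table = qualified_name("glue_catalog", "sales", "orders")
view_name = qualified_name("s3t_catalog", "analytics", "customer_summary")
```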

### Creating materialized views in AWS Glue Studio notebooks
<a name="materialized-views-creating-glue-studio-notebooks"></a>

In AWS Glue Studio notebooks, you can use the %%sql magic command to create materialized views. Use fully qualified table names (the three-part naming convention) in the view definition:

```
%%sql
CREATE MATERIALIZED VIEW customer_orders
AS 
SELECT 
    customer_name, 
    COUNT(*) as order_count, 
    SUM(amount) as total_amount 
FROM glue_catalog.sales.orders
GROUP BY customer_name
```

## Querying materialized views
<a name="materialized-views-querying"></a>

After creating a materialized view, you can query it like any other table using standard SQL SELECT statements in your AWS Glue jobs or notebooks.

### Querying in AWS Glue jobs
<a name="materialized-views-querying-glue-jobs"></a>

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Query materialized view
result = spark.sql("SELECT * FROM customer_orders")
result.show()
```

### Querying in AWS Glue Studio notebooks
<a name="materialized-views-querying-glue-studio-notebooks"></a>

```
%%sql
SELECT * FROM customer_orders
```

### Automatic query rewrite
<a name="materialized-views-automatic-query-rewrite"></a>

When automatic query rewrite is enabled, the Spark optimizer analyzes your queries and automatically uses materialized views when they can improve performance. For example, if you execute the following query:

```
result = spark.sql("""
    SELECT 
        customer_name, 
        COUNT(*) as order_count, 
        SUM(amount) as total_amount 
    FROM orders
    GROUP BY customer_name
""")
```

The Spark optimizer automatically rewrites this query to use the customer_orders materialized view instead of processing the base orders table, provided the materialized view is current.

### Verifying automatic query rewrite
<a name="materialized-views-verifying-automatic-query-rewrite"></a>

To verify whether a query uses automatic query rewrite, use the EXPLAIN EXTENDED command:

```
spark.sql("""
    EXPLAIN EXTENDED
    SELECT customer_name, COUNT(*) as order_count, SUM(amount) as total_amount 
    FROM orders
    GROUP BY customer_name
""").show(truncate=False)
```

In the execution plan, look for the materialized view name in the BatchScan operation. If the plan shows BatchScan glue_catalog.analytics.customer_orders instead of BatchScan glue_catalog.sales.orders, the query has been automatically rewritten to use the materialized view.

Note that automatic query rewrite requires time for the Spark metadata cache to populate after creating a materialized view. This process typically completes within 30 seconds.
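If you capture the plan text programmatically, checking for the rewrite reduces to a substring match on the BatchScan operator. The helper below is an illustrative sketch, not a Spark or AWS API; the plan strings stand in for output captured from EXPLAIN EXTENDED.

```
# Hypothetical check: does the physical plan scan the materialized view?
def plan_uses_materialized_view(plan_text, qualified_view_name):
    return f"BatchScan {qualified_view_name}" in plan_text

# Stand-ins for plan text captured from EXPLAIN EXTENDED
rewritten_plan = "== Physical Plan ==\nBatchScan glue_catalog.analytics.customer_orders[...]"
original_plan = "== Physical Plan ==\nBatchScan glue_catalog.sales.orders[...]"
```

A check like this is handy in integration tests that wait for the metadata cache to populate before asserting that rewrite is active.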

## Refreshing materialized views
<a name="materialized-views-refreshing"></a>

You can refresh materialized views using two methods: full refresh or incremental refresh. Full refresh recomputes the entire materialized view from all base table data, while incremental refresh processes only the data that has changed since the last refresh.

### Manual full refresh in AWS Glue jobs
<a name="materialized-views-manual-full-refresh-glue-jobs"></a>

To perform a full refresh of a materialized view:

```
spark.sql("REFRESH MATERIALIZED VIEW customer_orders FULL")

# Verify updated results
result = spark.sql("SELECT * FROM customer_orders")
result.show()
```

### Manual incremental refresh in AWS Glue jobs
<a name="materialized-views-manual-incremental-refresh-glue-jobs"></a>

To perform an incremental refresh, ensure incremental refresh is enabled in your Spark session configuration, then execute:

```
spark.sql("REFRESH MATERIALIZED VIEW customer_orders")

# Verify updated results
result = spark.sql("SELECT * FROM customer_orders")
result.show()
```

The AWS Glue Data Catalog automatically determines whether incremental refresh is applicable based on the view definition and the amount of changed data. If incremental refresh is not possible, the operation falls back to full refresh.

### Refreshing in AWS Glue Studio notebooks
<a name="materialized-views-refreshing-glue-studio-notebooks"></a>

In notebooks, use the %%sql magic command:

```
%%sql
REFRESH MATERIALIZED VIEW customer_orders FULL
```

### Verifying incremental refresh execution
<a name="materialized-views-verifying-incremental-refresh"></a>

To confirm that incremental refresh executed successfully, enable debug logging in your AWS Glue job:

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Enable debug logging on the Spark context so the refresh marker
# is emitted to the job logs
sc.setLogLevel("DEBUG")

# Execute refresh
spark.sql("REFRESH MATERIALIZED VIEW customer_orders")
```

Search for the following message in the AWS Glue job logs:

```
DEBUG RefreshMaterializedViewExec: Executed Incremental Refresh
```
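After downloading the job's CloudWatch log events, you could scan for that marker with a few lines of Python. The marker string comes from the log message above; the scanning helper itself is an illustrative assumption, not an AWS API.

```
# Marker emitted when an incremental refresh actually executed (see above)
INCREMENTAL_MARKER = "RefreshMaterializedViewExec: Executed Incremental Refresh"

# Hypothetical helper: return True if any log line contains the marker
def refresh_was_incremental(log_lines):
    return any(INCREMENTAL_MARKER in line for line in log_lines)

sample_lines = [
    "INFO GlueContext: session started",
    "DEBUG RefreshMaterializedViewExec: Executed Incremental Refresh",
]
```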

## Managing materialized views
<a name="materialized-views-managing"></a>

AWS Glue provides SQL commands for managing the lifecycle of materialized views in your jobs and notebooks.

### Describing a materialized view
<a name="materialized-views-describing"></a>

To view metadata about a materialized view, including its definition, refresh status, and last refresh timestamp:

```
spark.sql("DESCRIBE EXTENDED customer_orders").show(truncate=False)
```

### Altering a materialized view
<a name="materialized-views-altering"></a>

To modify the refresh schedule of an existing materialized view:

```
spark.sql("""
    ALTER MATERIALIZED VIEW customer_orders 
    ADD SCHEDULE REFRESH EVERY 2 HOURS
""")
```

To remove automatic refresh:

```
spark.sql("""
    ALTER MATERIALIZED VIEW customer_orders 
    DROP SCHEDULE
""")
```

### Dropping a materialized view
<a name="materialized-views-dropping"></a>

To delete a materialized view:

```
spark.sql("DROP MATERIALIZED VIEW customer_orders")
```

This command removes the materialized view definition from the AWS Glue Data Catalog and deletes the underlying Iceberg table data from your S3 bucket.

### Listing materialized views
<a name="materialized-views-listing"></a>

To list all materialized views in a database:

```
spark.sql("SHOW VIEWS FROM analytics").show()
```

## Permissions for materialized views
<a name="materialized-views-permissions"></a>

To create and manage materialized views, you must configure AWS Lake Formation permissions. The IAM role creating the materialized view (the definer role) requires specific permissions on source tables and target databases.

### Required permissions for the definer role
<a name="materialized-views-required-permissions-definer-role"></a>

The definer role must have the following Lake Formation permissions:
+ On source tables – SELECT or ALL permissions without row, column, or cell filters
+ On the target database – CREATE_TABLE permission
+ On the AWS Glue Data Catalog – GetTable and CreateTable API permissions

When you create a materialized view, the definer role's ARN is stored in the view definition. The AWS Glue Data Catalog assumes this role when executing automatic refresh operations. If the definer role loses access to source tables, refresh operations will fail until permissions are restored.

### IAM permissions for AWS Glue jobs
<a name="materialized-views-iam-permissions-glue-jobs"></a>

Your AWS Glue job's IAM role requires the following permissions:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetCatalog",
                "glue:GetCatalogs",
                "glue:GetTable",
                "glue:GetTables",
                "glue:CreateTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "cloudwatch:PutMetricData"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:*:*:*:/aws-glue/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess"
            ],
            "Resource": "*"
        }
    ]
}
```

The role that configures automatic refresh for a materialized view must have the iam:PassRole permission on the role used for refresh operations.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iam:PassRole"
      ],
      "Resource": [
        "arn:aws:iam::111122223333:role/materialized-view-role-name"
      ]
    }
  ]
}
```

To let AWS Glue automatically refresh the materialized view for you, the role must also have the following trust policy, which allows the service to assume the role.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "glue.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

If the materialized view is stored in an S3 Tables bucket, you also need to add the following permission to the role.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3tables:PutTableMaintenanceConfiguration"
      ],
      "Resource": "arn:aws:s3tables:*:111122223333:*"
    }
  ]
}
```

### Granting access to materialized views
<a name="materialized-views-granting-access"></a>

To grant other users access to query a materialized view, use AWS Lake Formation to grant SELECT permission on the materialized view table. Users can query the materialized view without requiring direct access to the underlying source tables.

For detailed information about configuring Lake Formation permissions, see Granting and revoking permissions on Data Catalog resources in the AWS Lake Formation Developer Guide.
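As a sketch, the Lake Formation grant can be issued with the boto3 grant_permissions API. Building the request as plain data first makes it easy to review before calling AWS; the principal ARN, database, and table names below are placeholders.

```
# Hypothetical helper: build grant_permissions parameters for SELECT
# on a materialized view's table in the AWS Glue Data Catalog.
def select_grant_params(principal_arn, catalog_id, database, table):
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "Table": {
                "CatalogId": catalog_id,
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": ["SELECT"],
    }

params = select_grant_params(
    "arn:aws:iam::111122223333:role/AnalystRole",  # placeholder principal
    "111122223333",
    "analytics",
    "customer_summary",
)
# With credentials configured, you would then call:
# boto3.client("lakeformation").grant_permissions(**params)
```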

## Monitoring materialized view operations
<a name="materialized-views-monitoring"></a>

The AWS Glue Data Catalog publishes metrics and logs for materialized view refresh operations to Amazon CloudWatch. You can monitor refresh status, duration, and data volume processed through CloudWatch metrics.

### Viewing job logs
<a name="materialized-views-viewing-job-logs"></a>

To view logs for AWS Glue jobs that create or refresh materialized views:

1. Open the AWS Glue console.

1. Choose Jobs from the navigation pane.

1. Select your job and choose Runs.

1. Select a specific run and choose Logs to view CloudWatch logs.

### Setting up alarms
<a name="materialized-views-setting-up-alarms"></a>

To receive notifications when refresh operations fail or exceed expected duration, create CloudWatch alarms on materialized view metrics. You can also configure Amazon EventBridge rules to trigger automated responses to refresh events.

## Example: Complete workflow
<a name="materialized-views-complete-workflow"></a>

The following example demonstrates a complete workflow for creating and using a materialized view in AWS Glue.

### Example AWS Glue job script
<a name="materialized-views-example-glue-job-script"></a>

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Create database and base table
spark.sql("CREATE DATABASE IF NOT EXISTS sales")
spark.sql("USE sales")

spark.sql("""
    CREATE TABLE IF NOT EXISTS orders (
        id INT,
        customer_name STRING,
        amount DECIMAL(10,2),
        order_date DATE
    )
""")

# Insert sample data
spark.sql("""
    INSERT INTO orders VALUES 
        (1, 'John Doe', 150.00, DATE('2024-01-15')),
        (2, 'Jane Smith', 200.50, DATE('2024-01-16')),
        (3, 'Bob Johnson', 75.25, DATE('2024-01-17'))
""")

# Create materialized view
spark.sql("""
    CREATE MATERIALIZED VIEW customer_summary
    AS 
    SELECT 
        customer_name, 
        COUNT(*) as order_count, 
        SUM(amount) as total_amount 
    FROM glue_catalog.sales.orders
    GROUP BY customer_name
""")

# Query the materialized view
print("Initial materialized view data:")
spark.sql("SELECT * FROM customer_summary").show()

# Insert additional data
spark.sql("""
    INSERT INTO orders VALUES 
        (4, 'Jane Smith', 350.00, DATE('2024-01-18')),
        (5, 'Bob Johnson', 100.25, DATE('2024-01-19'))
""")

# Refresh the materialized view
spark.sql("REFRESH MATERIALIZED VIEW customer_summary FULL")

# Query updated results
print("Updated materialized view data:")
spark.sql("SELECT * FROM customer_summary").show()

job.commit()
```

### Example AWS Glue Studio notebook
<a name="materialized-views-example-glue-studio-notebook"></a>

```
%%configure
{
    "conf": {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.glue_catalog.type": "glue",
        "spark.sql.catalog.glue_catalog.warehouse": "s3://amzn-s3-demo-bucket/warehouse",
        "spark.sql.catalog.glue_catalog.glue.region": "us-east-1",
        "spark.sql.catalog.glue_catalog.glue.id": "111122223333",
        "spark.sql.catalog.glue_catalog.glue.account-id": "111122223333",
        "spark.sql.catalog.glue_catalog.glue.lakeformation-enabled": "true",
        "spark.sql.defaultCatalog": "glue_catalog",
        "spark.sql.optimizer.answerQueriesWithMVs.enabled": "true",
        "spark.sql.materializedViews.metadataCache.enabled": "true"
    }
}
```

```
%%sql
CREATE DATABASE IF NOT EXISTS sales
```

```
%%sql
USE sales
```

```
%%sql
CREATE TABLE IF NOT EXISTS orders (
    id INT,
    customer_name STRING,
    amount DECIMAL(10,2),
    order_date DATE
)
```

```
%%sql
INSERT INTO orders VALUES 
    (1, 'John Doe', 150.00, DATE('2024-01-15')),
    (2, 'Jane Smith', 200.50, DATE('2024-01-16')),
    (3, 'Bob Johnson', 75.25, DATE('2024-01-17'))
```

```
%%sql
CREATE MATERIALIZED VIEW customer_summary
AS 
SELECT 
    customer_name, 
    COUNT(*) as order_count, 
    SUM(amount) as total_amount 
FROM glue_catalog.sales.orders
GROUP BY customer_name
```

```
%%sql
SELECT * FROM customer_summary
```

```
%%sql
INSERT INTO orders VALUES 
    (4, 'Jane Smith', 350.00, DATE('2024-01-18')),
    (5, 'Bob Johnson', 100.25, DATE('2024-01-19'))
```

```
%%sql
REFRESH MATERIALIZED VIEW customer_summary FULL
```

```
%%sql
SELECT * FROM customer_summary
```

## Considerations and limitations
<a name="materialized-views-considerations-limitations"></a>

Consider the following when using materialized views with AWS Glue:
+ Materialized views require AWS Glue version 5.1 or later.
+ Source tables must be Apache Iceberg tables registered in the AWS Glue Data Catalog. Apache Hive, Apache Hudi, and Linux Foundation Delta Lake tables are not supported at launch.
+ Source tables must reside in the same Region and account as the materialized view.
+ All source tables must be governed by AWS Lake Formation. IAM-only permissions and hybrid access are not supported.
+ Materialized views cannot reference AWS Glue Data Catalog views, multi-dialect views, or other materialized views as source tables.
+ The view definer role must have full read access (SELECT or ALL permission) on all source tables without row, column, or cell filters applied.
+ Materialized views are eventually consistent with source tables. During the refresh window, queries may return stale data. Execute manual refresh for immediate consistency.
+ The minimum automatic refresh interval is one hour.
+ Incremental refresh supports a restricted subset of SQL operations. The view definition must be a single SELECT-FROM-WHERE-GROUP BY-HAVING block and cannot contain set operations, subqueries, the DISTINCT keyword in SELECT or aggregate functions, window functions, or joins other than INNER JOIN.
+ Incremental refresh does not support user-defined functions or certain built-in functions. Only a subset of Spark SQL built-in functions are supported.
+ Query automatic rewrite only considers materialized views whose definitions belong to a restricted SQL subset similar to incremental refresh restrictions.
+ Identifiers containing special characters other than alphanumeric characters and underscores are not supported in CREATE MATERIALIZED VIEW queries. This applies to all identifier types including catalog/namespace/table names, column and struct field names, CTEs, and aliases.
+ Materialized view columns starting with the __ivm prefix are reserved for system use. AWS reserves the right to modify or remove these columns in future releases.
+ The SORT BY, LIMIT, OFFSET, CLUSTER BY, and ORDER BY clauses are not supported in materialized view definitions.
+ Cross-Region and cross-account source tables are not supported.
+ Tables referenced in the view query must use the three-part naming convention (for example, glue_catalog.my_db.my_table) because automatic refresh does not use default catalog and database settings.
+ Full refresh operations overwrite the entire table and make previous snapshots unavailable.
+ Non-deterministic functions such as rand() or current_timestamp() are not supported in materialized view definitions.
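Several of these restrictions can be screened for before you submit a CREATE MATERIALIZED VIEW statement. The following string-based check is only a rough pre-flight sketch under the limitations listed above, not a real SQL parser, and the token list is intentionally incomplete.

```
# Clauses and non-deterministic functions disallowed in view definitions
# (a partial list based on the limitations above)
DISALLOWED_TOKENS = (
    "SORT BY", "LIMIT", "OFFSET", "CLUSTER BY", "ORDER BY",
    "RAND(", "CURRENT_TIMESTAMP(",
)

# Hypothetical pre-flight check: return the disallowed tokens found
def definition_violations(view_sql):
    upper_sql = view_sql.upper()
    return [token for token in DISALLOWED_TOKENS if token in upper_sql]

valid_sql = (
    "SELECT customer_name, SUM(amount) AS total_amount "
    "FROM glue_catalog.sales.orders GROUP BY customer_name"
)
invalid_sql = "SELECT * FROM glue_catalog.sales.orders ORDER BY amount LIMIT 10"
```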