

# Working with Spark jobs in AWS Glue
<a name="etl-jobs-section"></a>

Provides information on AWS Glue for Spark ETL jobs.

**Topics**
+ [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md)
+ [AWS Glue Spark and PySpark jobs](spark_and_pyspark.md)
+ [AWS Glue worker types](worker-types.md)
+ [Streaming ETL jobs in AWS Glue](add-job-streaming.md)
+ [Record matching with AWS Lake Formation FindMatches](machine-learning.md)
+ [Migrate Apache Spark programs to AWS Glue](glue-author-migrate-apache-spark.md)

# Using job parameters in AWS Glue jobs
<a name="aws-glue-programming-etl-glue-arguments"></a>

When creating an AWS Glue job, you set some standard fields, such as `Role` and `WorkerType`. You can provide additional configuration information through the `Argument` fields (**Job Parameters** in the console). In these fields, you can provide AWS Glue jobs with the arguments (parameters) listed in this topic. 

 For more information about the AWS Glue Job API, see [Jobs](aws-glue-api-jobs-job.md). 

**Note**  
Job arguments have a maximum size limit of 260 KB. A validation check raises an error if the argument size is greater than 260 KB. 



## Setting job parameters
<a name="w2aac37c11b8c11"></a>

You can configure a job through the console on the **Job details** tab, under the **Job Parameters** heading. You can also configure a job through the AWS CLI by setting `DefaultArguments` or `NonOverridableArguments` on a job, or setting `Arguments` on a job run. Arguments set on the job will be passed in every time the job is run, while arguments set on the job run will only be passed in for that individual run. 

For example, the following is the syntax for running a job using `--arguments` to set a job parameter.

```
$ aws glue start-job-run --job-name "CSV to CSV" --arguments='--scriptLocation="s3://my_glue/libraries/test_lib.py"'
```

## Accessing job parameters
<a name="w2aac37c11b8c13"></a>

When writing AWS Glue scripts, you may want to access job parameter values to alter the behavior of your own code. We provide helper methods to do so in our libraries. These methods resolve job run parameter values that override job parameter values. When resolving parameters set in multiple places, job `NonOverridableArguments` will override job run `Arguments`, which will override job `DefaultArguments`.
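As a rough sketch, this precedence behaves like an ordered dictionary merge (illustrative only; `resolve_arguments` and the sample values below are assumptions for this sketch, not part of any AWS Glue API):

```python
def resolve_arguments(default_args, run_args, non_overridable_args):
    """Merge job parameters in increasing order of precedence:
    job DefaultArguments < job-run Arguments < job NonOverridableArguments."""
    resolved = dict(default_args)
    resolved.update(run_args)
    resolved.update(non_overridable_args)
    return resolved

args = resolve_arguments(
    {"--TempDir": "s3://bucket/tmp", "--job-language": "python"},  # job DefaultArguments
    {"--TempDir": "s3://bucket/run-tmp"},                          # job-run Arguments
    {"--job-language": "python"},                                  # job NonOverridableArguments
)
print(args["--TempDir"])  # the job-run value wins over the job default
```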

**In Python:**

In Python jobs, we provide a function named `getResolvedOptions`. For more information, see [Accessing parameters using `getResolvedOptions`](aws-glue-api-crawler-pyspark-extensions-get-resolved-options.md). Job parameters are available in the `sys.argv` variable.

**In Scala:**

In Scala jobs, we provide an object named `GlueArgParser`. For more information, see [AWS Glue Scala GlueArgParser APIs](glue-etl-scala-apis-glue-util-glueargparser.md). Job parameters are available in the `sysArgs` variable.
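For intuition, resolving parameters from the process arguments behaves roughly like the following sketch, which pulls named values out of a `sys.argv`-style list. The helper `get_resolved_options` here is a stand-in written for illustration, not the real `awsglue` implementation:

```python
import sys  # in a real job you would pass sys.argv instead of the simulated list below


def get_resolved_options(argv, option_names):
    """Illustrative sketch: resolve '--NAME value' pairs from argv,
    roughly how job parameters are surfaced to a script."""
    resolved = {}
    for name in option_names:
        flag = "--" + name
        if flag in argv:
            resolved[name] = argv[argv.index(flag) + 1]
    return resolved

# Simulated argv, as a Glue job might see it:
argv = ["script.py", "--JOB_NAME", "my-job", "--stage", "prod"]
args = get_resolved_options(argv, ["JOB_NAME", "stage"])
print(args["stage"])  # prod
```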

## Job parameter reference
<a name="job-parameter-reference"></a>

**AWS Glue recognizes the following argument names that you can use to set up the script environment for your jobs and job runs:**

**`--additional-python-modules`**  
A comma-delimited list representing a set of Python packages to be installed. You can install packages from PyPI or provide a custom distribution. A PyPI package entry takes the format `package==version`, with the PyPI name and version of your target package. A custom distribution entry is the Amazon S3 path to the distribution.  
Entries use Python version matching to match package and version, so you must use two equal signs, such as `==`. Other version-matching operators exist; for more information, see [PEP 440](https://peps.python.org/pep-0440/#version-matching).   
To pass module installation options to `pip3`, use the [--python-modules-installer-option](#python-modules-installer-option) parameter.
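The documented entry formats can be sanity-checked with a small sketch. This is a hypothetical validator, not AWS Glue's actual one, and the regex only approximates PEP 440 version matching:

```python
import re

# Each entry is either a pinned PyPI requirement (package==version) or an
# S3 path to a custom distribution, per the formats described above.
PYPI_ENTRY = re.compile(r"^[A-Za-z0-9._-]+==[A-Za-z0-9.!+*-]+$")

def classify_module_entry(entry):
    if entry.startswith("s3://"):
        return "custom-distribution"
    if PYPI_ENTRY.match(entry):
        return "pypi"
    raise ValueError(f"unrecognized entry: {entry!r}")

value = "scikit-learn==1.3.2,s3://my-bucket/dists/mylib-0.1-py3-none-any.whl"
print([classify_module_entry(e) for e in value.split(",")])
```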

**`--auto-scale-within-microbatch`**  
The default value is true. This parameter can be used only for AWS Glue streaming jobs, which process the streaming data in a series of micro-batches, and auto scaling must be enabled. When this value is set to false, AWS Glue computes the exponential moving average of batch duration for completed micro-batches and compares it with the window size to determine whether to scale the number of executors up or down. Scaling happens only when a micro-batch completes. When this value is set to true, AWS Glue scales up during a micro-batch when the number of Spark tasks remains the same for 30 seconds, or when the current batch processing time is greater than the window size. The number of executors drops if an executor has been idle for more than 60 seconds, or if the exponential moving average of batch duration is low. 

**`--class`**  
The Scala class that serves as the entry point for your Scala script. This applies only if your `--job-language` is set to `scala`.

**`--continuous-log-conversionPattern`**  
Specifies a custom conversion log pattern for a job enabled for continuous logging. The conversion pattern applies only to driver logs and executor logs. It does not affect the AWS Glue progress bar.

**`--continuous-log-logGroup`**  
Specifies a custom Amazon CloudWatch log group name for a job enabled for continuous logging.

**`--continuous-log-logStreamPrefix`**  
 Specifies a custom CloudWatch log stream prefix for a job enabled for continuous logging.

**`--customer-driver-env-vars` and `--customer-executor-env-vars`**  
These parameters set environment variables on the operating system of each worker (driver or executor, respectively). You can use these parameters when building platforms and custom frameworks on top of AWS Glue, so that your users can write jobs on top of them. Setting these two flags lets you set different environment variables on the driver and executors without having to inject the same logic into the job script itself.   
**Example usage**  
The following is an example of using these parameters:

```
"--customer-driver-env-vars", "CUSTOMER_KEY1=VAL1,CUSTOMER_KEY2=\"val2,val2 val2\"",
"--customer-executor-env-vars", "CUSTOMER_KEY3=VAL3,KEY4=VAL4"
```
Setting these in the job run argument is equivalent to running the following commands:  
In the driver:  
+ export CUSTOMER_KEY1=VAL1
+ export CUSTOMER_KEY2="val2,val2 val2"
In the executor:  
+ export CUSTOMER_KEY3=VAL3
Then, in the job script itself, you can retrieve the environment variables using `os.environ.get("CUSTOMER_KEY1")` or `System.getenv("CUSTOMER_KEY1")`.   
**Enforced syntax**  
Observe the following standards when defining environment variables:
+ Each key must have the `CUSTOMER_` prefix.

  For example: for `"CUSTOMER_KEY3=VAL3,KEY4=VAL4"`, `KEY4=VAL4` will be ignored and not set.
+ Each key and value pair must be delineated with a single comma.

  For example: `"CUSTOMER_KEY3=VAL3,CUSTOMER_KEY4=VAL4"`
+ If the "value" has spaces or commas, then it must be defined within quotations.

  For example: `CUSTOMER_KEY2=\"val2,val2 val2\"`
This syntax closely models the standards of setting bash environment variables.
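A parser honoring these rules can be sketched as follows. This is illustrative only; `parse_env_vars` is a hypothetical helper, not how AWS Glue actually parses the value:

```python
import re

def parse_env_vars(spec):
    """Parse a comma-delimited KEY=VALUE spec, honoring quoted values that
    may contain commas or spaces, and dropping keys without the required
    CUSTOMER_ prefix (mirroring the enforced syntax described above)."""
    pairs = re.findall(r'([A-Za-z0-9_]+)=("[^"]*"|[^,]*)', spec)
    env = {}
    for key, value in pairs:
        if not key.startswith("CUSTOMER_"):
            continue  # e.g. KEY4=VAL4 is ignored and not set
        env[key] = value.strip('"')
    return env

print(parse_env_vars('CUSTOMER_KEY1=VAL1,CUSTOMER_KEY2="val2,val2 val2",KEY4=VAL4'))
```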

**`--datalake-formats`**  
Supported in AWS Glue 3.0 and later versions.  
Specifies the data lake framework to use. AWS Glue adds the required JAR files for the frameworks that you specify into the `classpath`. For more information, see [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md).  
You can specify one or more of the following values, separated by a comma:  
+ `hudi`
+ `delta`
+ `iceberg`
For example, pass the following argument to specify all three frameworks.  

```
'--datalake-formats': 'hudi,delta,iceberg'
```

**`--disable-proxy-v2`**  
Disables the service proxy so that AWS service calls to Amazon S3, CloudWatch, and AWS Glue originating from your script go through your VPC. For more information, see [Configuring AWS calls to go through your VPC](https://docs.aws.amazon.com/glue/latest/dg/connection-VPC-disable-proxy.html). To disable the service proxy, set the value of this parameter to `true`.

**`--enable-auto-scaling`**  
Turns on auto scaling and per-worker billing when you set the value to `true`.

**`--enable-continuous-cloudwatch-log`**  
Enables real-time continuous logging for AWS Glue jobs. You can view real-time Apache Spark job logs in CloudWatch.

**`--enable-continuous-log-filter`**  
Specifies a standard filter (`true`) or no filter (`false`) when you create or edit a job enabled for continuous logging. Choosing the standard filter prunes out non-useful Apache Spark driver/executor and Apache Hadoop YARN heartbeat log messages. Choosing no filter gives you all the log messages.

**`--enable-glue-datacatalog`**  
Enables you to use the AWS Glue Data Catalog as an Apache Spark Hive metastore. To enable this feature, set the value to `true`.

**`--enable-job-insights`**  
Enables additional error analysis monitoring with AWS Glue job run insights. For details, see [Monitoring with AWS Glue job run insights](monitor-job-insights.md). By default, the value is set to `true` and job run insights are enabled.  
This option is available for AWS Glue version 2.0 and 3.0.

**`--enable-lakeformation-fine-grained-access`**  
Enables fine-grained access control for AWS Glue jobs. For more information, see [Using AWS Glue with AWS Lake Formation for fine-grained access control](security-lf-enable.md).

**`--enable-metrics`**  
Enables the collection of metrics for job profiling for this job run. These metrics are available on the AWS Glue console and the Amazon CloudWatch console. The value of this parameter is not relevant. To enable this feature, you can provide this parameter with any value, but `true` is recommended for clarity. To disable this feature, remove this parameter from your job configuration.

**`--enable-observability-metrics`**  
Enables a set of observability metrics that generate insights into what is happening inside each job run, shown on the **Job runs monitoring** page in the AWS Glue console and in the Amazon CloudWatch console. To enable this feature, set the value of this parameter to `true`. To disable it, set it to `false` or remove this parameter from your job configuration. 

**`--enable-rename-algorithm-v2`**  
Sets the EMRFS rename algorithm version to version 2. When a Spark job uses dynamic partition overwrite mode, there is a possibility that a duplicate partition is created. For instance, you can end up with a duplicate partition such as `s3://bucket/table/location/p1=1/p1=1`. Here, `p1` is the partition that is being overwritten. Rename algorithm version 2 fixes this issue.  
This option is only available on AWS Glue version 1.0.

**`--enable-s3-parquet-optimized-committer`**  
Enables the EMRFS S3-optimized committer for writing Parquet data into Amazon S3. You can supply the parameter/value pair via the AWS Glue console when creating or updating an AWS Glue job. Setting the value to **true** enables the committer. By default, the flag is turned on in AWS Glue 3.0 and off in AWS Glue 2.0.  
For more information, see [Using the EMRFS S3-optimized Committer](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html).

**`--enable-spark-ui`**  
When set to `true`, turns on the feature to use the Spark UI to monitor and debug AWS Glue ETL jobs.

**`--executor-cores`**  
The number of Spark tasks that can run in parallel. This option is supported on AWS Glue 3.0 and later. The value should not exceed twice the number of vCPUs on the worker type: 8 on `G.1X`, 16 on `G.2X`, 32 on `G.4X`, 64 on `G.8X`, 96 on `G.12X`, and 128 on `G.16X`; and 8 on `R.1X`, 16 on `R.2X`, 32 on `R.4X`, and 64 on `R.8X`. Exercise caution when updating this configuration, because increased task parallelism adds memory and disk pressure and can throttle the source and target systems (for example, it would cause more concurrent connections on Amazon RDS), which could impact job performance.
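The documented upper bound follows from the per-worker vCPU counts (limit = 2 × vCPUs). A sketch, where `max_executor_cores` is a hypothetical helper and the vCPU counts are derived from the limits listed above:

```python
# vCPUs per worker type, implied by the documented --executor-cores limits
# (each limit above is twice the worker's vCPU count).
VCPUS = {
    "G.1X": 4, "G.2X": 8, "G.4X": 16, "G.8X": 32,
    "G.12X": 48, "G.16X": 64,
    "R.1X": 4, "R.2X": 8, "R.4X": 16, "R.8X": 32,
}

def max_executor_cores(worker_type):
    """Documented upper bound for --executor-cores on a given worker type."""
    return 2 * VCPUS[worker_type]

print(max_executor_cores("G.2X"))  # 16
```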

**`--extra-files`**  
The Amazon S3 paths to additional files, such as configuration files that AWS Glue copies to the working directory of your script on the driver node before running it. Multiple values must be complete paths separated by a comma (`,`). The value can be individual files or directory locations. This option is not supported for Python Shell job types.

**`--extra-jars`**  
The Amazon S3 paths to additional files that AWS Glue copies to the driver and executors. AWS Glue also adds these files to the Java classpath before executing your script. Multiple values must be complete paths separated by a comma (`,`). The extension does not need to be `.jar`.

**`--extra-py-files`**  
The Amazon S3 paths to additional Python modules that AWS Glue adds to the Python path on the driver node before running your script. Multiple values must be complete paths separated by a comma (`,`). Only individual files are supported, not a directory path.

**`--job-bookmark-option`**  
Controls the behavior of a job bookmark. The following option values can be set.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html)
For example, to enable a job bookmark, pass the following argument.  

```
'--job-bookmark-option': 'job-bookmark-enable'
```

**`--job-language`**  
The script programming language. This value must be either `scala` or `python`. If this parameter is not present, the default is `python`.

**`--python-modules-installer-option`**  
A plaintext string that defines options to be passed to `pip3` when installing modules with [--additional-python-modules](#additional-python-modules). Provide options as you would in the command line, separated by spaces and prefixed by dashes. For more information about usage, see [Installing additional Python modules with pip in AWS Glue 2.0 or later](aws-glue-programming-python-libraries.md#addl-python-modules-support).  
This option is not supported for AWS Glue jobs when you use Python 3.9.

**`--scriptLocation`**  
The Amazon Simple Storage Service (Amazon S3) location where your ETL script is located (in the form `s3://path/to/my/script.py`). This parameter overrides a script location set in the `JobCommand` object.

**`--spark-event-logs-path`**  
Specifies an Amazon S3 path. When using the Spark UI monitoring feature, AWS Glue flushes the Spark event logs to this Amazon S3 path every 30 seconds to a bucket that can be used as a temporary directory for storing Spark UI events.

**`--TempDir`**  
Specifies an Amazon S3 path to a bucket that can be used as a temporary directory for the job.  
For example, to set a temporary directory, pass the following argument.  

```
'--TempDir': 's3-path-to-directory'
```
AWS Glue creates a temporary bucket for jobs if a bucket doesn't already exist in a Region. This bucket might permit public access. You can either modify the bucket in Amazon S3 to set the public access block, or delete the bucket later after all jobs in that Region have completed.

**`--use-postgres-driver`**  
When this value is set to `true`, AWS Glue prioritizes the Postgres JDBC driver in the class path to avoid a conflict with the Amazon Redshift JDBC driver. This option is available only in AWS Glue version 2.0.

**`--user-jars-first`**  
When this value is set to `true`, AWS Glue prioritizes your extra JAR files in the classpath. This option is available only in AWS Glue version 2.0 or later.

**`--conf`**  
Controls Spark config parameters. It is for advanced use cases.

**`--encryption-type`**  
Legacy parameter. Configure the corresponding behavior by using security configurations. For more information about security configurations, see [Encrypting data written by AWS Glue](encryption-security-configuration.md).

AWS Glue uses the following arguments internally and you should never use them:
+ `--debug` — Internal to AWS Glue. Do not set.
+ `--mode` — Internal to AWS Glue. Do not set.
+ `--JOB_NAME` — Internal to AWS Glue. Do not set.
+ `--endpoint` — Internal to AWS Glue. Do not set.



## Customizing the script environment
<a name="w2aac37c11b8c17"></a>

AWS Glue supports bootstrapping an environment with Python's `site` module using `sitecustomize` to perform site-specific customizations. Bootstrapping your own initialization functions is recommended for advanced use cases only and is supported on a best-effort basis on AWS Glue 4.0. 

 The environment variable prefix, `GLUE_CUSTOMER`, is reserved for customer use. 

# AWS Glue Spark and PySpark jobs
<a name="spark_and_pyspark"></a>

AWS Glue supports Spark and PySpark jobs. A Spark job is run in an Apache Spark environment managed by AWS Glue. It processes data in batches. A streaming ETL job is similar to a Spark job, except that it performs ETL on data streams. It uses the Apache Spark Structured Streaming framework. Some Spark job features are not available to streaming ETL jobs.

The following sections provide information on AWS Glue Spark and PySpark jobs.

**Topics**
+ [Configuring job properties for Spark jobs in AWS Glue](add-job.md)
+ [Editing Spark scripts in the AWS Glue console](edit-script-spark.md)
+ [Jobs (legacy)](console-edit-script.md)
+ [Tracking processed data using job bookmarks](monitor-continuations.md)
+ [Storing Spark shuffle data](monitor-spark-shuffle-manager.md)
+ [Monitoring AWS Glue Spark jobs](monitor-spark.md)
+ [Generative AI troubleshooting for Apache Spark in AWS Glue](troubleshoot-spark.md)
+ [Using materialized views with AWS Glue](materialized-views.md)

# Configuring job properties for Spark jobs in AWS Glue
<a name="add-job"></a>

When you define your job on the AWS Glue console, you provide values for properties to control the AWS Glue runtime environment. 

## Defining job properties for Spark jobs
<a name="create-job"></a>

The following list describes the properties of a Spark job. For the properties of a Python shell job, see [Defining job properties for Python shell jobs](add-job-python.md#create-job-python-properties). For properties of a streaming ETL job, see [Defining job properties for a streaming ETL job](add-job-streaming.md#create-job-streaming-properties).

The properties are listed in the order in which they appear on the **Add job** wizard on the AWS Glue console.

**Name**  
Provide a UTF-8 string with a maximum length of 255 characters. 

**Description**  
Provide an optional description of up to 2048 characters. 

**IAM Role**  
Specify the IAM role that is used for authorization to resources used to run the job and access data stores. For more information about permissions for running jobs in AWS Glue, see [Identity and access management for AWS Glue](security-iam.md).

**Type**  
The type of ETL job. This is set automatically based on the type of data sources you select.  
+ **Spark** runs an Apache Spark ETL script with the job command `glueetl`.
+ **Spark Streaming** runs an Apache Spark streaming ETL script with the job command `gluestreaming`. For more information, see [Streaming ETL jobs in AWS Glue](add-job-streaming.md).
+ **Python shell** runs a Python script with the job command `pythonshell`. For more information, see [Configuring job properties for Python shell jobs in AWS Glue](add-job-python.md).

**AWS Glue version**  
AWS Glue version determines the versions of Apache Spark and Python that are available to the job, as specified in the following table.      
<a name="table-glue-versions"></a>[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/add-job.html)
Jobs that are created without specifying an AWS Glue version default to AWS Glue 5.1.

**Language**  
The code in the ETL script defines your job's logic. The script can be coded in Python or Scala. You can choose whether the script that the job runs is generated by AWS Glue or provided by you. You provide the script name and location in Amazon Simple Storage Service (Amazon S3). Confirm that there isn't a file with the same name as the script directory in the path. To learn more about writing scripts, see [AWS Glue programming guide](edit-script.md).

**Worker type**  
The following worker types are available:  
The resources available on AWS Glue workers are measured in DPUs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory.  
+ **G.025X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 0.25 DPU (2 vCPUs, 4 GB of memory) with 84GB disk (approximately 34GB free). We recommend this worker type for low volume streaming jobs. This worker type is only available for AWS Glue version 3.0 or later streaming jobs.
+ **G.1X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 1 DPU (4 vCPUs, 16 GB of memory) with 94GB disk (approximately 44GB free). We recommend this worker type for workloads such as data transforms, joins, and queries; it offers a scalable and cost-effective way to run most jobs.
+ **G.2X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 2 DPU (8 vCPUs, 32 GB of memory) with 138GB disk (approximately 78GB free). We recommend this worker type for workloads such as data transforms, joins, and queries; it offers a scalable and cost-effective way to run most jobs.
+ **G.4X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 4 DPU (16 vCPUs, 64 GB of memory) with 256GB disk (approximately 230GB free). We recommend this worker type for jobs whose workloads contain your most demanding transforms, aggregations, joins, and queries. 
+ **G.8X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 8 DPU (32 vCPUs, 128 GB of memory) with 512GB disk (approximately 485GB free). We recommend this worker type for jobs whose workloads contain your most demanding transforms, aggregations, joins, and queries.
+ **G.12X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 12 DPU (48 vCPUs, 192 GB of memory) with 768GB disk (approximately 741GB free). We recommend this worker type for jobs with very large and resource-intensive workloads that require significant compute capacity. 
+ **G.16X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 16 DPU (64 vCPUs, 256 GB of memory) with 1024GB disk (approximately 996GB free). We recommend this worker type for jobs with the largest and most resource-intensive workloads that require maximum compute capacity. 
+ **R.1X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 1 DPU with memory-optimized configuration. We recommend this worker type for memory-intensive workloads that frequently encounter out-of-memory errors or require high memory-to-CPU ratios. 
+ **R.2X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 2 DPU with memory-optimized configuration. We recommend this worker type for memory-intensive workloads that frequently encounter out-of-memory errors or require high memory-to-CPU ratios. 
+ **R.4X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 4 DPU with memory-optimized configuration. We recommend this worker type for large memory-intensive workloads that frequently encounter out-of-memory errors or require high memory-to-CPU ratios. 
+ **R.8X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 8 DPU with memory-optimized configuration. We recommend this worker type for very large memory-intensive workloads that frequently encounter out-of-memory errors or require high memory-to-CPU ratios. 
**Worker Type Specifications**  
The following table provides detailed specifications for all available G worker types:    
**G Worker Type Specifications**    
<a name="table-worker-specifications"></a>[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/add-job.html)
**Important:** G.12X and G.16X worker types, as well as all R worker types (R.1X through R.8X), have higher startup latency.  
You are charged an hourly rate based on the number of DPUs used to run your ETL jobs. For more information, see the [AWS Glue pricing page](https://aws.amazon.com/glue/pricing/).  
For AWS Glue version 1.0 or earlier jobs, when you configure a job using the console and specify a **Worker type** of **Standard**, the **Maximum capacity** is set and the **Number of workers** becomes the value of **Maximum capacity** - 1. If you use the AWS Command Line Interface (AWS CLI) or AWS SDK, you can specify the **Max capacity** parameter, or you can specify both **Worker type** and the **Number of workers**.  
For AWS Glue version 2.0 or later jobs, you cannot specify a **Maximum capacity**. Instead, you should specify a **Worker type** and the **Number of workers**.  
**G.4X** and **G.8X** worker types are available only for AWS Glue version 3.0 or later Spark ETL jobs in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Spain), Europe (Stockholm), and South America (São Paulo).  
**G.12X**, **G.16X**, and **R.1X** through **R.8X** worker types are available only for AWS Glue version 4.0 or later Spark ETL jobs in the following AWS Regions: US East (N. Virginia), US West (Oregon), US East (Ohio), Europe (Ireland), and Europe (Frankfurt). Additional regions will be supported in future releases.
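The G worker bullets above can be summarized in a small lookup for capacity planning. This is a sketch built from the descriptions in this section; `G_WORKERS` and `cluster_dpus` are hypothetical names, not an AWS API:

```python
# (DPU, vCPUs, memory GB, disk GB) per G worker type, taken from the
# descriptions above. Hourly cost scales with total DPUs.
G_WORKERS = {
    "G.025X": (0.25, 2, 4, 84),
    "G.1X":   (1, 4, 16, 94),
    "G.2X":   (2, 8, 32, 138),
    "G.4X":   (4, 16, 64, 256),
    "G.8X":   (8, 32, 128, 512),
    "G.12X":  (12, 48, 192, 768),
    "G.16X":  (16, 64, 256, 1024),
}

def cluster_dpus(worker_type, number_of_workers):
    """Total DPUs allocated for a run with the given worker type and count."""
    dpu, _vcpus, _mem, _disk = G_WORKERS[worker_type]
    return dpu * number_of_workers

print(cluster_dpus("G.2X", 10))  # 20
```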

**Requested number of workers**  
For most worker types, you must specify the number of workers that are allocated when the job runs. 

**Job bookmark**  
Specify how AWS Glue processes state information when the job runs. You can have it remember previously processed data, update state information, or ignore state information. For more information, see [Tracking processed data using job bookmarks](monitor-continuations.md).

**Job run queuing**  
Specifies whether job runs are queued to run later when they cannot run immediately due to service quotas.  
When checked, job run queuing is enabled for the job runs. If the option is not selected, the job runs are not considered for queuing.  
If this setting does not match the value set in the job run, then the value from the job run field is used.

**Flex execution**  
When you configure a job using AWS Glue Studio or the API, you can specify a standard or flexible job execution class. Your jobs may have varying degrees of priority and time sensitivity. The standard execution class is ideal for time-sensitive workloads that require fast job startup and dedicated resources.  
The flexible execution class is appropriate for non-urgent jobs such as pre-production jobs, testing, and one-time data loads. Flexible job runs are supported for jobs using AWS Glue version 3.0 or later and `G.1X` or `G.2X` worker types. The new worker types (`G.12X`, `G.16X`, and `R.1X` through `R.8X`) do not support flexible execution.  

[![AWS Videos](http://img.youtube.com/vi/FnHCoTuDLXU/0.jpg)](http://www.youtube.com/watch?v=FnHCoTuDLXU)

Flex job runs are billed based on the number of workers running at any point in time. Workers may be added to or removed from a running flexible job run. Instead of billing as a simple calculation of `Max Capacity` × `Execution Time`, each worker contributes for the time it ran during the job run. The bill is the sum of (`Number of DPUs per worker` × `time each worker ran`).  
For more information, see the help panel in AWS Glue Studio, or [Jobs](aws-glue-api-jobs-job.md) and [Job runs](aws-glue-api-jobs-runs.md).
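The flex billing formula above amounts to the following sketch. `flex_dpu_hours` is a hypothetical helper, and the numbers are made up for illustration:

```python
# Flex billing: each worker is billed for the time it actually ran, so the
# total is the sum of (DPUs per worker) * (hours that worker ran), rather
# than Max Capacity * Execution Time.
def flex_dpu_hours(dpu_per_worker, worker_runtimes_hours):
    return sum(dpu_per_worker * t for t in worker_runtimes_hours)

# A G.1X (1 DPU per worker) flex run where three workers ran 1.0h, 0.5h, 0.25h:
print(flex_dpu_hours(1, [1.0, 0.5, 0.25]))  # 1.75
```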

**Number of retries**  
Specify the number of times, from 0 to 10, that AWS Glue should automatically restart the job if it fails. Jobs that reach the timeout limit are not restarted.

**Job timeout**  
Sets the maximum execution time in minutes. The maximum is 7 days, or 10,080 minutes; job runs that exceed the timeout are stopped with an exception.  
When the value is left blank, the timeout defaults to 2,880 minutes.  
Any existing AWS Glue jobs that had a timeout value greater than 7 days are capped at 7 days. For instance, if you specified a timeout of 20 days for a batch job, it will be stopped on the 7th day.  
**Best practices for job timeouts**  
Jobs are billed based on execution time. To avoid unexpected charges, configure timeout values appropriate for the expected execution time of your job. 
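The timeout rules above reduce to a small calculation (a sketch; `effective_timeout` is a hypothetical helper, not an AWS Glue API):

```python
# Effective timeout per the rules above: defaults to 2880 minutes when unset,
# and is capped at 7 days (10080 minutes) regardless of the requested value.
MAX_TIMEOUT_MIN = 10_080      # 7 days
DEFAULT_TIMEOUT_MIN = 2_880   # 48 hours

def effective_timeout(requested_minutes=None):
    if requested_minutes is None:
        return DEFAULT_TIMEOUT_MIN
    return min(requested_minutes, MAX_TIMEOUT_MIN)

print(effective_timeout(20 * 24 * 60))  # a 20-day request is capped at 10080
```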

**Advanced Properties**    
**Script filename**  
A unique script name for your job. Cannot be named **Untitled job**.  
**Script path**  
The Amazon S3 location of the script. The path must be in the form `s3://bucket/prefix/path/`. It must end with a slash (`/`) and not include any files.  
**Job metrics**  
Turn on or turn off the creation of Amazon CloudWatch metrics when this job runs. To see profiling data, you must enable this option. For more information about how to turn on and visualize metrics, see [Job monitoring and debugging](monitor-profile-glue-job-cloudwatch-metrics.md).   
**Job observability metrics**  
Turn on the creation of additional observability CloudWatch metrics when this job runs. For more information, see [Monitoring with AWS Glue Observability metrics](monitor-observability.md).  
**Continuous logging**  
Turn on continuous logging to Amazon CloudWatch. If this option is not enabled, logs are available only after the job completes. For more information, see [Logging for AWS Glue jobs](monitor-continuous-logging.md).  
**Spark UI**  
Turn on the use of Spark UI for monitoring this job. For more information, see [Enabling the Apache Spark web UI for AWS Glue jobs](monitor-spark-ui-jobs.md).   
**Spark UI logs path**  
The path to write logs when Spark UI is enabled.  
**Spark UI logging and monitoring configuration**  
Choose one of the following options:  
+ *Standard*: write logs using the AWS Glue job run ID as the filename. Turn on Spark UI monitoring in the AWS Glue console.
+ *Legacy*: write logs using 'spark-application-{timestamp}' as the filename. Do not turn on Spark UI monitoring.
+ *Standard and legacy*: write logs to both the standard and legacy locations. Turn on Spark UI monitoring in the AWS Glue console.  
**Maximum concurrency**  
Sets the maximum number of concurrent runs that are allowed for this job. The default is 1. An error is returned when this threshold is reached. The maximum value you can specify is controlled by a service limit. For example, if a previous run of a job is still running when a new instance is started, you might want to return an error to prevent two instances of the same job from running concurrently.   
**Temporary path**  
Provide the location of a working directory in Amazon S3 where temporary intermediate results are written when AWS Glue runs the script. Confirm that there isn't a file with the same name as the temporary directory in the path. This directory is used when AWS Glue reads and writes to Amazon Redshift and by certain AWS Glue transforms.  
AWS Glue creates a temporary bucket for jobs if a bucket doesn't already exist in a region. This bucket might permit public access. You can either modify the bucket in Amazon S3 to set the public access block, or delete the bucket later after all jobs in that region have completed.  
**Delay notification threshold (minutes)**  
Sets the threshold (in minutes) before a delay notification is sent. You can set this threshold to send notifications when a `RUNNING`, `STARTING`, or `STOPPING` job run takes more than an expected number of minutes.  
**Security configuration**  
Choose a security configuration from the list. A security configuration specifies how the data at the Amazon S3 target is encrypted: no encryption, server-side encryption with AWS KMS-managed keys (SSE-KMS), or Amazon S3-managed encryption keys (SSE-S3).  
**Server-side encryption**  
If you select this option, when the ETL job writes to Amazon S3, the data is encrypted at rest using SSE-S3 encryption. Both your Amazon S3 data target and any data that is written to an Amazon S3 temporary directory is encrypted. This option is passed as a job parameter. For more information, see [Protecting Data Using Server-Side Encryption with Amazon S3-Managed Encryption Keys (SSE-S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingServerSideEncryption.html) in the *Amazon Simple Storage Service User Guide*.  
This option is ignored if a security configuration is specified.  
**Use AWS Glue Data Catalog as the Hive metastore**  
Select this option to use the AWS Glue Data Catalog as the Hive metastore. The IAM role used for the job must have the `glue:CreateDatabase` permission. A database called "default" is created in the Data Catalog if it does not exist.

**Connections**  
Choose a VPC configuration to access Amazon S3 data sources located in your virtual private cloud (VPC). You can create and manage a Network connection in AWS Glue. For more information, see [Connecting to data](glue-connections.md). 

**Libraries**    
**Python library path, Dependent JARs path, and Referenced files path**  
Specify these options if your script requires them. You can define the comma-separated Amazon S3 paths for these options when you define the job. You can override these paths when you run the job. For more information, see [Providing your own custom scripts](console-custom-created.md).  
**Job parameters**  
A set of key-value pairs that are passed as named parameters to the script. These are default values that are used when the script is run, but you can override them in triggers or when you run the job. You must prefix the key name with `--`; for example: `--myKey`. You pass job parameters as a map when using the AWS Command Line Interface.  
For examples, see Python parameters in [Passing and accessing Python parameters in AWS Glue](aws-glue-programming-python-calling.md#aws-glue-programming-python-calling-parameters).  
**Tags**  
Tag your job with a **Tag key** and an optional **Tag value**. After tag keys are created, they are read-only. Use tags on some resources to help you organize and identify them. For more information, see [AWS tags in AWS Glue](monitor-tags.md). 
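The override behavior described under **Job parameters** — run-time values override job defaults, and `NonOverridableArguments` always win — can be sketched in plain Python. This is an illustration of the resolution order only, not the AWS Glue implementation; the parameter names and values are hypothetical.

```python
def resolve_arguments(default_args, run_args, non_overridable_args):
    """Merge job parameters in ascending precedence:
    DefaultArguments < run Arguments < NonOverridableArguments."""
    resolved = dict(default_args)          # lowest precedence
    resolved.update(run_args)              # run-time overrides
    resolved.update(non_overridable_args)  # always wins
    return resolved

# Hypothetical example values
resolved = resolve_arguments(
    default_args={"--input_path": "s3://amzn-s3-demo-bucket/default/", "--format": "csv"},
    run_args={"--input_path": "s3://amzn-s3-demo-bucket/run-specific/"},
    non_overridable_args={"--format": "parquet"},
)
print(resolved["--input_path"])  # s3://amzn-s3-demo-bucket/run-specific/
print(resolved["--format"])      # parquet
```

In a real job script, you would instead read the resolved values with the helper methods described in [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).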

## Restrictions for jobs that access Lake Formation managed tables
<a name="lf-table-restrictions"></a>

Keep in mind the following notes and restrictions when creating jobs that read from or write to tables managed by AWS Lake Formation: 
+ The following features are not supported in jobs that access tables with cell-level filters:
  + [Job bookmarks](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html) and [bounded execution](https://docs.aws.amazon.com/glue/latest/dg/bounded-execution.html)
  + [Push-down predicates](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html#aws-glue-programming-etl-partitions-pushdowns)
  + [Server-side catalog partition predicates](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html#aws-glue-programming-etl-partitions-cat-predicates)
  + [enableUpdateCatalog](https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html)

# Editing Spark scripts in the AWS Glue console
<a name="edit-script-spark"></a>

A script contains the code that extracts data from sources, transforms it, and loads it into targets. AWS Glue runs a script when it starts a job.

AWS Glue ETL scripts can be coded in Python or Scala. Python scripts use a language that is an extension of the PySpark Python dialect for extract, transform, and load (ETL) jobs. The script contains *extended constructs* to deal with ETL transformations. When you automatically generate the source code logic for your job, a script is created. You can edit this script, or you can provide your own script to process your ETL work.

 For information about defining and editing scripts in AWS Glue, see [AWS Glue programming guide](edit-script.md).

## Additional libraries or files
<a name="w2aac37c11c12c13b9"></a>

If your script requires additional libraries or files, you can specify them as follows:

**Python library path**  
Comma-separated Amazon Simple Storage Service (Amazon S3) paths to Python libraries that are required by the script.  
Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.

**Dependent jars path**  
Comma-separated Amazon S3 paths to JAR files that are required by the script.  
Currently, only pure Java or Scala (2.11) libraries can be used.

**Referenced files path**  
Comma-separated Amazon S3 paths to additional files (for example, configuration files) that are required by the script.

# Jobs (legacy)
<a name="console-edit-script"></a>

A script contains the code that performs extract, transform, and load (ETL) work. You can provide your own script, or AWS Glue can generate a script with guidance from you. For information about creating your own scripts, see [Providing your own custom scripts](console-custom-created.md).

You can edit a script in the AWS Glue console. When you edit a script, you can add sources, targets, and transforms.

**To edit a script**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). Then choose the **Jobs** tab.

1. Choose a job in the list, and then choose **Action**, **Edit script** to open the script editor.

   You can also access the script editor from the job details page. Choose the **Script** tab, and then choose **Edit script**.

   

## Script editor
<a name="console-edit-script-editor"></a>

The AWS Glue script editor lets you insert, modify, and delete sources, targets, and transforms in your script. The script editor displays both the script and a diagram to help you visualize the flow of data.

To create a diagram for the script, choose **Generate diagram**. AWS Glue uses annotation lines in the script beginning with `##` to render the diagram. To correctly represent your script in the diagram, you must keep the parameters in the annotations and the parameters in the Apache Spark code in sync.

The script editor lets you add code templates wherever your cursor is positioned in the script. At the top of the editor, choose from the following options:
+ To add a source table to the script, choose **Source**.
+ To add a target table to the script, choose **Target**.
+ To add a target location to the script, choose **Target location**.
+ To add a transform to the script, choose **Transform**. For information about the functions that are called in your script, see [Program AWS Glue ETL scripts in PySpark](aws-glue-programming-python.md).
+ To add a Spigot transform to the script, choose **Spigot**.

In the inserted code, modify the `parameters` in both the annotations and Apache Spark code. For example, if you add a **Spigot** transform, verify that the `path` is replaced in both the `@args` annotation line and the `output` code line.

The **Logs** tab shows the logs that are associated with your job as it runs. The most recent 1,000 lines are displayed.

The **Schema** tab shows the schema of the selected sources and targets, when available in the Data Catalog. 

# Tracking processed data using job bookmarks
<a name="monitor-continuations"></a>

AWS Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a *job bookmark*. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data. With job bookmarks, you can process new data when rerunning on a scheduled interval. A job bookmark is composed of the states for various elements of jobs, such as sources, transformations, and targets. For example, your ETL job might read new partitions in an Amazon S3 file. AWS Glue tracks which partitions the job has processed successfully to prevent duplicate processing and duplicate data in the job's target data store.

Job bookmarks are implemented for JDBC data sources, the Relationalize transform, and some Amazon Simple Storage Service (Amazon S3) sources. The following table lists the Amazon S3 source formats that AWS Glue supports for job bookmarks.


| AWS Glue version | Amazon S3 source formats | 
| --- | --- | 
| Version 0.9 | JSON, CSV, Apache Avro, XML | 
| Version 1.0 and later | JSON, CSV, Apache Avro, XML, Parquet, ORC | 

For information about AWS Glue versions, see [Defining job properties for Spark jobs](add-job.md#create-job).

The job bookmarks feature has additional functionalities when accessed through AWS Glue scripts. When browsing your generated script, you may see transformation contexts, which are related to this feature. For more information, see [Using job bookmarks](programming-etl-connect-bookmarks.md).

**Topics**
+ [Using job bookmarks in AWS Glue](#monitor-continuations-implement)
+ [Operational details of the job bookmarks feature](#monitor-continuations-script)

## Using job bookmarks in AWS Glue
<a name="monitor-continuations-implement"></a>

The job bookmark option is passed as a parameter when the job is started. The following table describes the options for setting job bookmarks on the AWS Glue console.


****  

| Job bookmark | Description | 
| --- | --- | 
| Enable | Causes the job to update the state after a run to keep track of previously processed data. If your job has a source with job bookmark support, it will keep track of processed data, and when a job runs, it processes new data since the last checkpoint. | 
| Disable | Job bookmarks are not used, and the job always processes the entire dataset. You are responsible for managing the output from previous job runs. This is the default. | 
| Pause |  Process incremental data since the last successful run, or the data in the range identified by the following sub-options, without updating the state of the last bookmark. You are responsible for managing the output from previous job runs. The two sub-options are: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html) The job bookmark state is not updated when this option is specified. The sub-options are optional; however, when used, both sub-options must be provided.  | 

For details about the parameters passed to a job on the command line, and specifically for job bookmarks, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

For Amazon S3 input sources, AWS Glue job bookmarks check the last modified time of the objects to verify which objects need to be reprocessed. If your input source data has been modified since your last job run, the files are reprocessed when you run the job again.
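The last-modified check can be illustrated with a small Python sketch. This is an illustration of the idea only, not the actual AWS Glue implementation; the object keys and timestamps are made up.

```python
from datetime import datetime, timezone

def objects_to_process(objects, last_bookmark):
    """Return objects modified after the saved bookmark timestamp,
    plus the new bookmark timestamp to persist for the next run."""
    new_objects = [o for o in objects if o["last_modified"] > last_bookmark]
    new_bookmark = max((o["last_modified"] for o in objects), default=last_bookmark)
    return new_objects, new_bookmark

# Hypothetical S3 listing and previously saved bookmark
listing = [
    {"key": "part-0001.json", "last_modified": datetime(2023, 5, 1, tzinfo=timezone.utc)},
    {"key": "part-0002.json", "last_modified": datetime(2023, 5, 3, tzinfo=timezone.utc)},
]
bookmark = datetime(2023, 5, 2, tzinfo=timezone.utc)

new_objects, bookmark = objects_to_process(listing, bookmark)
# Only part-0002.json is newer than the bookmark, so only it is reprocessed.
```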

For JDBC sources, the following rules apply:
+ For each table, AWS Glue uses one or more columns as bookmark keys to determine new and processed data. The bookmark keys combine to form a single compound key.
+ AWS Glue by default uses the primary key as the bookmark key, provided that it is sequentially increasing or decreasing (with no gaps).
+ You can specify the columns to use as bookmark keys in your AWS Glue script. For more information about using Job bookmarks in AWS Glue scripts, see [Using job bookmarks](programming-etl-connect-bookmarks.md).
+ AWS Glue doesn't support using columns with case-sensitive names as job bookmark keys.
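Conceptually, a sequentially increasing bookmark key lets each run read only rows past the last recorded value. The sketch below is illustrative only — AWS Glue tracks and applies the bookmark internally, and the SQL shown is not what the service emits; the table and column names are hypothetical.

```python
def incremental_query(table, bookmark_key, last_value):
    """Build an illustrative query that reads only rows past the
    bookmark. AWS Glue handles this internally; this SQL is a sketch."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {bookmark_key} > {last_value} "
        f"ORDER BY {bookmark_key}"
    )

# After the run, the largest order_id seen becomes the new bookmark value.
print(incremental_query("orders", "order_id", 10500))
```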

You can rewind your job bookmarks for your AWS Glue Spark ETL jobs to any previous job run. This supports data backfill scenarios: rewinding the bookmark to a previous job run causes the subsequent job run to reprocess data only from that bookmarked job run onward.

If you intend to reprocess all the data using the same job, reset the job bookmark. To reset the job bookmark state, use the AWS Glue console, the [ResetJobBookmark action (Python: reset_job_bookmark)](aws-glue-api-jobs-runs.md#aws-glue-api-jobs-runs-ResetJobBookmark) API operation, or the AWS CLI. For example, enter the following command using the AWS CLI:

```
aws glue reset-job-bookmark --job-name my-job-name
```

When you rewind or reset a bookmark, AWS Glue does not clean the target files because there could be multiple targets and targets are not tracked with job bookmarks. Only source files are tracked with job bookmarks. You can create different output targets when rewinding and reprocessing the source files to avoid duplicate data in your output.

AWS Glue keeps track of job bookmarks by job. If you delete a job, the job bookmark is deleted.

In some cases, you might have enabled AWS Glue job bookmarks but your ETL job is reprocessing data that was already processed in an earlier run. For information about resolving common causes of this error, see [Troubleshooting Glue common setup errors](glue-troubleshooting-errors.md).

## Operational details of the job bookmarks feature
<a name="monitor-continuations-script"></a>

This section describes more of the operational details of using job bookmarks.

Job bookmarks store the states for a job. Each instance of the state is keyed by a job name and a version number. When a script invokes `job.init`, it retrieves its state and always gets the latest version. Within a state, there are multiple state elements, which are specific to each source, transformation, and sink instance in the script. These state elements are identified by a transformation context that is attached to the corresponding element (source, transformation, or sink) in the script. The state elements are saved atomically when `job.commit` is invoked from the user script. The script gets the job name and the control option for the job bookmarks from the arguments.

The state elements in the job bookmark are source, transformation, or sink-specific data. For example, suppose that you want to read incremental data from an Amazon S3 location that is being constantly written to by an upstream job or process. In this case, the script must determine what has been processed so far. The job bookmark implementation for the Amazon S3 source saves information so that when the job runs again, it can filter only the new objects using the saved information and recompute the state for the next run of the job. A timestamp is used to filter the new files.

In addition to the state elements, job bookmarks have a *run number*, an *attempt number*, and a *version number*. The run number tracks the run of the job, and the attempt number records the attempts for a job run. The job run number is a monotonically increasing number that is incremented for every successful run. The attempt number tracks the attempts for each run, and is only incremented when there is a run after a failed attempt. The version number increases monotonically and tracks the updates to a job bookmark.
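A minimal sketch of how those counters advance — illustrative only, not the service implementation:

```python
def advance_bookmark(bookmark, run_succeeded):
    """Illustrative update of the bookmark counters described above."""
    bookmark = dict(bookmark)
    bookmark["version_number"] += 1        # every bookmark update bumps the version
    if run_succeeded:
        bookmark["run_number"] += 1        # only successful runs advance the run number
        bookmark["attempt_number"] = 0
    else:
        bookmark["attempt_number"] += 1    # counts retries after a failed attempt
    return bookmark
```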

In the AWS Glue service database, the bookmark states for all the transformations are stored together as key-value pairs:

```
{
  "job_name" : ...,
  "run_id": ...,
  "run_number": ..,
  "attempt_number": ...
  "states": {
    "transformation_ctx1" : {
      bookmark_state1
    },
    "transformation_ctx2" : {
      bookmark_state2
    }
  }
}
```

**Best practices**  
The following are best practices for using job bookmarks.
+ *Do not change the data source property with the bookmark enabled*. For example, suppose that `datasource0` points to Amazon S3 input path A, and the job has been reading from that source for several runs with the bookmark enabled. If you change the input path of `datasource0` to Amazon S3 path B without changing the `transformation_ctx`, the AWS Glue job uses the old stored bookmark state. That results in files in input path B being missed or skipped, because AWS Glue assumes those files were processed in previous runs. 
+ *Use a catalog table with bookmarks for better partition management*. Bookmarks work for data sources defined both from the Data Catalog and from options. However, it's difficult to remove or add partitions with the from-options approach. Using a catalog table with crawlers can provide better automation to track the newly added [partitions](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html#tables-partition) and give you the flexibility to select particular partitions with a [pushdown predicate](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html).
+ *Use the [AWS Glue Amazon S3 file lister](https://aws.amazon.com/premiumsupport/knowledge-center/glue-oom-java-heap-space-error/) for large datasets*. A bookmark lists all files under each input partition and does the filtering, so if there are too many files under a single partition, the bookmark can cause an out-of-memory (OOM) error on the driver. Use the AWS Glue Amazon S3 file lister to avoid listing all files in memory at once.

# Storing Spark shuffle data
<a name="monitor-spark-shuffle-manager"></a>

Shuffling is an important step in a Spark job whenever data is rearranged between partitions. This is required because wide transformations such as `join`, `groupByKey`, `reduceByKey`, and `repartition` require information from other partitions to complete processing. Spark gathers the required data from each partition and combines it into a new partition. During a shuffle, data is written to disk and transferred across the network. As a result, the shuffle operation is bound by local disk capacity. Spark throws a `No space left on device` or `MetadataFetchFailedException` error when there is not enough disk space left on the executor and there is no recovery.
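The reason wide transformations shuffle is that every record with the same key must end up in the same output partition. A tiny Python sketch of hash partitioning illustrates this; Spark's actual partitioner uses its own hash function, so Python's built-in `hash` stands in here purely for illustration.

```python
def target_partition(key, num_partitions):
    """Hash partitioning: every record with the same key lands in the
    same output partition, which is why a shuffle must move data
    between executors over the network."""
    return hash(key) % num_partitions

# All records with key "user-42" go to one partition, regardless of
# which input partition they started in.
records = [("user-42", 1), ("user-7", 1), ("user-42", 2)]
by_partition = {}
for key, value in records:
    by_partition.setdefault(target_partition(key, 4), []).append((key, value))
```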

**Note**  
 AWS Glue Spark shuffle plugin with Amazon S3 is only supported for AWS Glue ETL jobs. 

**Solution**  
With AWS Glue, you can now use Amazon S3 to store Spark shuffle data. Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. This solution disaggregates compute and storage for your Spark jobs, and gives complete elasticity and low-cost shuffle storage, allowing you to run your most shuffle-intensive workloads reliably.

![\[Spark workflow showing Map and Reduce stages using Amazon S3 for shuffle data storage.\]](http://docs.aws.amazon.com/glue/latest/dg/images/gs-s3-shuffle-diagram.png)


We are introducing a new Cloud Shuffle Storage Plugin for Apache Spark to use Amazon S3. You can turn on Amazon S3 shuffling to run your AWS Glue jobs reliably without failures if they are known to be bound by the local disk capacity for large shuffle operations. In some cases, shuffling to Amazon S3 is marginally slower than local disk (or EBS) if you have a large number of small partitions or shuffle files written out to Amazon S3.

## Prerequisites for using Cloud Shuffle Storage Plugin
<a name="monitor-spark-shuffle-manager-prereqs"></a>

To use the Cloud Shuffle Storage Plugin with AWS Glue ETL jobs, you need the following: 
+ An Amazon S3 bucket located in the same region as your job run, for storing the intermediate shuffle and spilled data. The Amazon S3 prefix of shuffle storage can be specified with `--conf spark.shuffle.glue.s3ShuffleBucket=s3://shuffle-bucket/prefix/`, as in the following example:

  ```
  --conf spark.shuffle.glue.s3ShuffleBucket=s3://glue-shuffle-123456789-us-east-1/glue-shuffle-data/
  ```
+  Amazon S3 storage lifecycle policies set on the *prefix* (such as `glue-shuffle-data`), because the shuffle manager does not clean up the files after the job is done. The intermediate shuffle and spilled data should be deleted after a job is finished, so set a short lifecycle policy on the prefix. Instructions for setting up an Amazon S3 lifecycle policy are available at [Setting lifecycle configuration on a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html) in the *Amazon Simple Storage Service User Guide*.
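A lifecycle rule for the shuffle prefix might look like the following sketch. The dictionary shape matches the S3 `PutBucketLifecycleConfiguration` API (as used by boto3's `put_bucket_lifecycle_configuration`); the rule ID, prefix, and retention period are example values you should adapt.

```python
# Illustrative lifecycle configuration that expires shuffle objects
# under the example prefix after 7 days. Apply it with, for example,
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="glue-shuffle-123456789-us-east-1",
#     LifecycleConfiguration=lifecycle_configuration,
# )
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-glue-shuffle-data",        # example rule name
            "Filter": {"Prefix": "glue-shuffle-data/"},
            "Status": "Enabled",
            "Expiration": {"Days": 7},               # example retention period
        }
    ]
}
```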

## Using AWS Glue Spark shuffle manager from the AWS console
<a name="monitor-spark-shuffle-manager-using-console"></a>

To set up the AWS Glue Spark shuffle manager using the AWS Glue console or AWS Glue Studio when configuring a job, choose the **--write-shuffle-files-to-s3** job parameter to turn on Amazon S3 shuffling for the job.

![\[Job parameters interface showing --write-shuffle-files- parameter and option to add more.\]](http://docs.aws.amazon.com/glue/latest/dg/images/gs-s3-shuffle.png)


## Using AWS Glue Spark shuffle plugin
<a name="monitor-spark-shuffle-manager-using"></a>

The following job parameters turn on and tune the AWS Glue shuffle manager. These parameters are flags, so any values provided are not considered.
+ `--write-shuffle-files-to-s3` — The main flag, which enables the AWS Glue Spark shuffle manager to use Amazon S3 buckets for writing and reading shuffle data. When the flag is not specified, the shuffle manager is not used.
+ `--write-shuffle-spills-to-s3` — (Supported only on AWS Glue version 2.0). An optional flag that allows you to offload spill files to Amazon S3 buckets, which provides additional resiliency to your Spark job. This is only required for large workloads that spill a lot of data to disk. When the flag is not specified, no intermediate spill files are written.
+ `--conf spark.shuffle.glue.s3ShuffleBucket=s3://<shuffle-bucket>` — Another optional flag that specifies the Amazon S3 bucket where you write the shuffle files. By default, shuffle files are written to `--TempDir`/shuffle-data. AWS Glue 3.0 and later supports writing shuffle files to multiple buckets by specifying them with a comma delimiter, as in `--conf spark.shuffle.glue.s3ShuffleBucket=s3://shuffle-bucket-1/prefix,s3://shuffle-bucket-2/prefix/`. Using multiple buckets improves performance. 

You need to provide security configuration settings to enable encryption at rest for the shuffle data. For more information about security configurations, see [Setting up encryption in AWS Glue](set-up-encryption.md). AWS Glue supports all other shuffle-related configurations provided by Spark.

**Software binaries for the Cloud Shuffle Storage plugin**  
You can also download the software binaries of the Cloud Shuffle Storage Plugin for Apache Spark under the Apache 2.0 license and run it on any Spark environment. The new plugin comes with out-of-the box support for Amazon S3, and can also be easily configured to use other forms of cloud storage such as [Google Cloud Storage and Microsoft Azure Blob Storage](https://github.com/aws-samples/aws-glue-samples/blob/master/docs/cloud-shuffle-plugin/README.md). For more information, see [Cloud Shuffle Storage Plugin for Apache Spark](https://docs.aws.amazon.com/glue/latest/dg/cloud-shuffle-storage-plugin.html).

**Notes and limitations**  
The following are notes or limitations for the AWS Glue shuffle manager:
+  AWS Glue shuffle manager doesn't automatically delete the (temporary) shuffle data files stored in your Amazon S3 bucket after a job is completed. To ensure data protection, follow the instructions in [Prerequisites for using Cloud Shuffle Storage Plugin](#monitor-spark-shuffle-manager-prereqs) before enabling the Cloud Shuffle Storage Plugin. 
+ You can use this feature if your data is skewed.

# Cloud Shuffle Storage Plugin for Apache Spark
<a name="cloud-shuffle-storage-plugin"></a>

The Cloud Shuffle Storage Plugin is an Apache Spark plugin compatible with the [`ShuffleDataIO` API](https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/shuffle/api/ShuffleDataIO.java) which allows storing shuffle data on cloud storage systems (such as Amazon S3). It helps you to supplement or replace local disk storage capacity for large shuffle operations, commonly triggered by transformations such as `join`, `reduceByKey`, `groupByKey` and `repartition` in your Spark applications, thereby reducing common failures or price/performance dislocation of your serverless data analytics jobs and pipelines.

**AWS Glue**  
AWS Glue versions 3.0 and 4.0 come with the plugin pre-installed and ready to enable shuffling to Amazon S3 without any extra steps. For more information about enabling the feature for your Spark applications, see [AWS Glue Spark shuffle plugin with Amazon S3](https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-shuffle-manager.html).

**Other Spark environments**  
The plugin requires the following Spark configurations to be set on other Spark environments:
+ `--conf spark.shuffle.sort.io.plugin.class=com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin`: This informs Spark to use this plugin for Shuffle IO.
+ `--conf spark.shuffle.storage.path=s3://bucket-name/shuffle-file-dir`: The path where your shuffle files will be stored.

**Note**  
The plugin overwrites one Spark core class. As a result, the plugin jar needs to be loaded before Spark jars. You can do this using `userClassPathFirst` in on-prem YARN environments if the plugin is used outside AWS Glue.

## Bundling the plugin with your Spark applications
<a name="cloud-shuffle-storage-plugin-bundling"></a>

You can bundle the plugin with your Spark applications and Spark distributions (versions 3.1 and above) by adding the plugin dependency in your Maven `pom.xml` while developing your Spark applications locally. For more information on the plugin and Spark versions, see [Plugin versions](#cloud-shuffle-storage-plugin-versions).

```
<repositories>
   ...
    <repository>
        <id>aws-glue-etl-artifacts</id>
        <url>https://aws-glue-etl-artifacts.s3.amazonaws.com/release/</url>
    </repository>
</repositories>
...
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>chopper-plugin</artifactId>
    <version>3.1-amzn-LATEST</version>
</dependency>
```

You can alternatively download the binaries from AWS Glue Maven artifacts directly and include them in your Spark application as follows.

```
#!/bin/bash
sudo wget -v https://aws-glue-etl-artifacts.s3.amazonaws.com/release/com/amazonaws/chopper-plugin/3.1-amzn-LATEST/chopper-plugin-3.1-amzn-LATEST.jar -P /usr/lib/spark/jars/
```

**Example spark-submit**

```
spark-submit --deploy-mode cluster \
  --conf spark.shuffle.storage.s3.path=s3://<ShuffleBucket>/<shuffle-dir> \
  --conf spark.driver.extraClassPath=<Path to plugin jar> \
  --conf spark.executor.extraClassPath=<Path to plugin jar> \
  --class <your test class name> s3://<ShuffleBucket>/<Your application jar>
```

## Optional configurations
<a name="cloud-shuffle-storage-plugin-optional"></a>

These are optional configuration values that control Amazon S3 shuffle behavior. 
+ `spark.shuffle.storage.s3.enableServerSideEncryption`: Enable/disable S3 SSE for shuffle and spill files. Default value is `true`.
+ `spark.shuffle.storage.s3.serverSideEncryption.algorithm`: The SSE algorithm to be used. Default value is `AES256`.
+ `spark.shuffle.storage.s3.serverSideEncryption.kms.key`: The KMS key ARN when SSE aws:kms is enabled.

Along with these configurations, you may need to set configurations such as `spark.hadoop.fs.s3.enableServerSideEncryption` and **other environment-specific configurations** to ensure appropriate encryption is applied for your use case.
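Assembled as Spark configuration pairs, an SSE-KMS setup for shuffle files might look like the following sketch. The KMS key ARN is a placeholder, and the property names are the ones listed above.

```python
# Illustrative Spark configuration enabling SSE-KMS for shuffle and
# spill files. The KMS key ARN below is a placeholder — substitute
# your own key.
shuffle_encryption_conf = {
    "spark.shuffle.storage.s3.enableServerSideEncryption": "true",
    "spark.shuffle.storage.s3.serverSideEncryption.algorithm": "aws:kms",
    "spark.shuffle.storage.s3.serverSideEncryption.kms.key":
        "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
}

# Render the pairs as spark-submit --conf arguments.
spark_submit_args = " ".join(
    f"--conf {key}={value}" for key, value in shuffle_encryption_conf.items()
)
```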

## Plugin versions
<a name="cloud-shuffle-storage-plugin-versions"></a>

This plugin is supported for the Spark versions associated with each AWS Glue version. The following table shows the AWS Glue version, Spark version and associated plugin version with Amazon S3 location for the plugin's software binary.


| AWS Glue version | Spark version | Plugin version | Amazon S3 location | 
| --- | --- | --- | --- | 
| 3.0 | 3.1 | 3.1-amzn-LATEST |  s3://aws-glue-etl-artifacts/release/com/amazonaws/chopper-plugin/3.1-amzn-0/chopper-plugin-3.1-amzn-LATEST.jar  | 
| 4.0 | 3.3 | 3.3-amzn-LATEST |  s3://aws-glue-etl-artifacts/release/com/amazonaws/chopper-plugin/3.3-amzn-0/chopper-plugin-3.3-amzn-LATEST.jar  | 

## License
<a name="cloud-shuffle-storage-plugin-binary-license"></a>

The software binary for this plugin is licensed under the Apache-2.0 License.

# Monitoring AWS Glue Spark jobs
<a name="monitor-spark"></a>

**Topics**
+ [Spark Metrics available in AWS Glue Studio](#console-jobs-details-metrics-spark)
+ [Monitoring jobs using the Apache Spark web UI](monitor-spark-ui.md)
+ [Monitoring with AWS Glue job run insights](monitor-job-insights.md)
+ [Monitoring with Amazon CloudWatch](monitor-cloudwatch.md)
+ [Job monitoring and debugging](monitor-profile-glue-job-cloudwatch-metrics.md)

## Spark Metrics available in AWS Glue Studio
<a name="console-jobs-details-metrics-spark"></a>

The **Metrics** tab shows metrics collected when a job runs and profiling is turned on. The following graphs are shown in Spark jobs: 
+ ETL Data Movement
+ Memory Profile: Driver and Executors

Choose **View additional metrics** to show the following graphs:
+ ETL Data Movement
+ Memory Profile: Driver and Executors
+ Data Shuffle Across Executors
+ CPU Load: Driver and Executors
+ Job Execution: Active Executors, Completed Stages & Maximum Needed Executors

Data for these graphs is pushed to CloudWatch metrics if the job is configured to collect metrics. For more information about how to turn on metrics and interpret the graphs, see [Job monitoring and debugging](monitor-profile-glue-job-cloudwatch-metrics.md). 

**Example ETL data movement graph**  
The ETL Data Movement graph shows the following metrics:  
+ The number of bytes read from Amazon S3 by all executors—[`glue.ALL.s3.filesystem.read_bytes`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.read_bytes)
+ The number of bytes written to Amazon S3 by all executors—[`glue.ALL.s3.filesystem.write_bytes`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.write_bytes)

![\[The graph for ETL Data Movement in the Metrics tab of the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job_detailed_etl.png)


**Example Memory profile graph**  
The Memory Profile graph shows the following metrics:  
+ The fraction of JVM heap memory used (scale: 0–1) by the driver, an executor identified by *executorId*, or all executors—
  + [`glue.driver.jvm.heap.usage`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.jvm.heap.usage)
  + [`glue.executorId.jvm.heap.usage`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.executorId.jvm.heap.usage)
  + [`glue.ALL.jvm.heap.usage`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.jvm.heap.usage)

![\[The graph for Memory Profile in the Metrics tab of the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job_detailed_mem.png)


**Example Data shuffle across executors graph**  
The Data Shuffle Across Executors graph shows the following metrics:  
+ The number of bytes read by all executors to shuffle data between them—[`glue.driver.aggregate.shuffleLocalBytesRead`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.shuffleLocalBytesRead)
+ The number of bytes written by all executors to shuffle data between them—[`glue.driver.aggregate.shuffleBytesWritten`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.shuffleBytesWritten)

![\[The graph for Data Shuffle Across Executors in the Metrics tab of the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job_detailed_data.png)


**Example CPU load graph**  
The CPU Load graph shows the following metrics:  
+ The fraction of CPU system load used (scale: 0–1) by the driver, an executor identified by *executorId*, or all executors—
  + [`glue.driver.system.cpuSystemLoad`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.system.cpuSystemLoad)
  + [`glue.executorId.system.cpuSystemLoad`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.executorId.system.cpuSystemLoad)
  + [`glue.ALL.system.cpuSystemLoad`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.system.cpuSystemLoad)

![\[The graph for CPU Load in the Metrics tab of the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job_detailed_cpu.png)


**Example Job execution graph**  
The Job Execution graph shows the following metrics:  
+ The number of actively running executors—[`glue.driver.ExecutorAllocationManager.executors.numberAllExecutors`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.ExecutorAllocationManager.executors.numberAllExecutors)
+ The number of completed stages—[`glue.driver.aggregate.numCompletedStages`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.numCompletedStages)
+ The maximum number of executors needed—[`glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors)

![\[The graph for Job Execution in the Metrics tab of the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job_detailed_exec.png)


# Monitoring jobs using the Apache Spark web UI
<a name="monitor-spark-ui"></a>

You can use the Apache Spark web UI to monitor and debug AWS Glue ETL jobs running on the AWS Glue job system, and also Spark applications running on AWS Glue development endpoints. The Spark UI enables you to check the following for each job:
+ The event timeline of each Spark stage
+ A directed acyclic graph (DAG) of the job
+ Physical and logical plans for SparkSQL queries
+ The underlying Spark environmental variables for each job

For more information about using the Spark Web UI, see [Web UI](https://spark.apache.org/docs/3.3.0/web-ui.html) in the Spark documentation. For guidance on how to interpret Spark UI results to improve the performance of your job, see [Best practices for performance tuning AWS Glue for Apache Spark jobs](https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-aws-glue-for-apache-spark/introduction.html) in AWS Prescriptive Guidance.

 You can see the Spark UI in the AWS Glue console. This is available when an AWS Glue job runs on AWS Glue 3.0 or later versions with logs generated in the Standard (rather than legacy) format, which is the default for newer jobs. If you have log files greater than 0.5 GB, you can enable rolling log support for job runs on AWS Glue 4.0 or later versions to simplify log archiving, analysis, and troubleshooting.

You can enable the Spark UI by using the AWS Glue console or the AWS Command Line Interface (AWS CLI). When you enable the Spark UI, AWS Glue ETL jobs and Spark applications on AWS Glue development endpoints can back up Spark event logs to a location that you specify in Amazon Simple Storage Service (Amazon S3). You can use the backed-up event logs in Amazon S3 with the Spark UI, both in real time as the job runs and after the job is complete. As long as the logs remain in Amazon S3, the Spark UI in the AWS Glue console can display them. 

## Permissions
<a name="monitor-spark-ui-limitations-permissions"></a>

 To use the Spark UI in the AWS Glue console, you can grant `UseGlueStudio` or add all of the individual service APIs. All of the APIs are needed to use the Spark UI completely; however, users can access individual Spark UI features by adding only the corresponding service APIs to their IAM permissions for fine-grained access. 

 `RequestLogParsing` is the most critical API because it parses the logs. The remaining APIs read the respective parsed data. For example, `GetStages` provides access to data about all stages of a Spark job. 

 The sample policy below maps the list of Spark UI service APIs to `UseGlueStudio` and provides access to only the Spark UI features. When using a Spark UI service API, use the following namespace: `glue:<ServiceAPI>`. To add more permissions, such as Amazon S3 and IAM, see [Creating Custom IAM Policies for AWS Glue Studio](https://docs.aws.amazon.com/glue/latest/dg/getting-started-min-privs.html#getting-started-all-gs-privs). 

------
#### [ JSON ]

****  

```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Sid": "AllowGlueStudioSparkUI",
      "Effect": "Allow",
      "Action": [
        "glue:RequestLogParsing",
        "glue:GetLogParsingStatus",
        "glue:GetEnvironment",
        "glue:GetJobs",
        "glue:GetJob",
        "glue:GetStage",
        "glue:GetStages",
        "glue:GetStageFiles",
        "glue:BatchGetStageFiles",
        "glue:GetStageAttempt",
        "glue:GetStageAttemptTaskList",
        "glue:GetStageAttemptTaskSummary",
        "glue:GetExecutors",
        "glue:GetExecutorsThreads",
        "glue:GetStorage",
        "glue:GetStorageUnit",
        "glue:GetQueries",
        "glue:GetQuery"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```

------
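As a sketch, the sample policy above can also be assembled programmatically, which is convenient if you maintain the Spark UI API list in code. The action list mirrors the sample policy; attaching the resulting JSON to a role (for example, with `iam.put_role_policy`) is left to your usual IAM tooling.

```python
import json

# A sketch that assembles the sample policy above from the Spark UI API
# list, applying the glue: namespace to each action.

SPARK_UI_ACTIONS = [
    "RequestLogParsing", "GetLogParsingStatus", "GetEnvironment",
    "GetJobs", "GetJob", "GetStage", "GetStages", "GetStageFiles",
    "BatchGetStageFiles", "GetStageAttempt", "GetStageAttemptTaskList",
    "GetStageAttemptTaskSummary", "GetExecutors", "GetExecutorsThreads",
    "GetStorage", "GetStorageUnit", "GetQueries", "GetQuery",
]

def spark_ui_policy() -> str:
    """Return a JSON policy allowing only the Spark UI service APIs."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowGlueStudioSparkUI",
            "Effect": "Allow",
            # Each service API uses the glue: namespace.
            "Action": [f"glue:{action}" for action in SPARK_UI_ACTIONS],
            "Resource": ["*"],
        }],
    }, indent=2)
```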

## Limitations
<a name="monitor-spark-ui-limitations"></a>
+ Spark UI in the AWS Glue console is not available for job runs that occurred before 20 Nov 2023 because they are in the legacy log format.
+  Spark UI in the AWS Glue console supports rolling logs for AWS Glue 4.0, such as those generated by default in streaming jobs. The maximum total size of all generated rolled log event files is 2 GB. For AWS Glue jobs without rolling log support, the maximum log event file size supported for the Spark UI is 0.5 GB. 
+  Serverless Spark UI is not available for Spark event logs stored in an Amazon S3 bucket that can only be accessed by your VPC. 

## Example: Apache Spark web UI
<a name="monitor-spark-ui-limitations-example"></a>

This example shows you how to use the Spark UI to understand your job performance. Screen shots show the Spark web UI as provided by a self-managed Spark history server. Spark UI in the AWS Glue console provides similar views. For more information about using the Spark Web UI, see [Web UI](https://spark.apache.org/docs/3.3.0/web-ui.html) in the Spark documentation.

The following is an example of a Spark application that reads from two data sources, performs a join transform, and writes the result to Amazon S3 in Parquet format.

```
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
 
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
 
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
 
job = Job(glueContext)
job.init(args['JOB_NAME'])
 
df_persons = spark.read.json("s3://awsglue-datasets/examples/us-legislators/all/persons.json")
df_memberships = spark.read.json("s3://awsglue-datasets/examples/us-legislators/all/memberships.json")
 
df_joined = df_persons.join(df_memberships, df_persons.id == df_memberships.person_id, 'fullouter')
df_joined.write.parquet("s3://aws-glue-demo-sparkui/output/")
 
job.commit()
```

The following DAG visualization shows the different stages in this Spark job.

![\[Screenshot of Spark UI showing 2 completed stages for job 0.\]](http://docs.aws.amazon.com/glue/latest/dg/images/spark-ui1.png)


The following event timeline for a job shows the start, execution, and termination of different Spark executors.

![\[Screenshot of Spark UI showing the completed, failed, and active stages of different Spark executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/spark-ui2.png)


The following screen shows the details of the SparkSQL query plans:
+ Parsed logical plan
+ Analyzed logical plan
+ Optimized logical plan
+ Physical plan for execution

![\[SparkSQL query plans: parsed, analyzed, and optimized logical plan and physical plans for execution.\]](http://docs.aws.amazon.com/glue/latest/dg/images/spark-ui3.png)


**Topics**
+ [Permissions](#monitor-spark-ui-limitations-permissions)
+ [Limitations](#monitor-spark-ui-limitations)
+ [Example: Apache Spark web UI](#monitor-spark-ui-limitations-example)
+ [Enabling the Apache Spark web UI for AWS Glue jobs](monitor-spark-ui-jobs.md)
+ [Launching the Spark history server](monitor-spark-ui-history.md)

# Enabling the Apache Spark web UI for AWS Glue jobs
<a name="monitor-spark-ui-jobs"></a>

You can use the Apache Spark web UI to monitor and debug AWS Glue ETL jobs running on the AWS Glue job system. You can configure the Spark UI using the AWS Glue console or the AWS Command Line Interface (AWS CLI).

Every 30 seconds, AWS Glue backs up the Spark event logs to the Amazon S3 path that you specify.

**Topics**
+ [Configuring the Spark UI (console)](#monitor-spark-ui-jobs-console)
+ [Configuring the Spark UI (AWS CLI)](#monitor-spark-ui-jobs-cli)
+ [Configuring the Spark UI for sessions using Notebooks](#monitor-spark-ui-sessions)
+ [Enable rolling logs](#monitor-spark-ui-rolling-logs)

## Configuring the Spark UI (console)
<a name="monitor-spark-ui-jobs-console"></a>

Follow these steps to configure the Spark UI by using the AWS Management Console. When you create an AWS Glue job, the Spark UI is enabled by default.

**To turn on the Spark UI when you create or edit a job**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Jobs**.

1. Choose **Add job**, or select an existing one.

1. In **Job details**, open the **Advanced properties**.

1. Under the **Spark UI** tab, choose **Write Spark UI logs to Amazon S3**.

1. Specify an Amazon S3 path for storing the Spark event logs for the job. Note that if you use a security configuration in the job, the encryption also applies to the Spark UI log file. For more information, see [Encrypting data written by AWS Glue](encryption-security-configuration.md).

1. Under **Spark UI logging and monitoring configuration**:
   + Select **Standard** if you are generating logs to view in the AWS Glue console.
   + Select **Legacy** if you are generating logs to view on a Spark history server.
   + You can also choose to generate both.

## Configuring the Spark UI (AWS CLI)
<a name="monitor-spark-ui-jobs-cli"></a>

To generate logs that you can view with the Spark UI in the AWS Glue console, use the AWS CLI to pass the following job parameters to AWS Glue jobs. For more information, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

```
'--enable-spark-ui': 'true',
'--spark-event-logs-path': 's3://s3-event-log-path'
```

To distribute logs to their legacy locations, set the `--enable-spark-ui-legacy-path` parameter to `"true"`. If you do not want to generate logs in both formats, remove the `--enable-spark-ui` parameter.
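As a minimal sketch, these job parameters can be assembled in code before being passed as a job's `DefaultArguments` or a job run's `Arguments` (for example, through the AWS SDK or the AWS CLI). The helper name and S3 path are illustrative.

```python
# A sketch of assembling the Spark UI job parameters shown above.
# The helper name is illustrative; pass the result as a job's
# DefaultArguments or a job run's Arguments.

def spark_ui_arguments(event_logs_path: str, legacy: bool = False) -> dict:
    """Return job parameters that enable Spark UI event logging."""
    args = {
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": event_logs_path,
    }
    if legacy:
        # Also distribute logs to the legacy location.
        args["--enable-spark-ui-legacy-path"] = "true"
    return args
```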

## Configuring the Spark UI for sessions using Notebooks
<a name="monitor-spark-ui-sessions"></a>

**Warning**  
AWS Glue interactive sessions do not currently support Spark UI in the console. Configure a Spark history server.

 If you use AWS Glue notebooks, set up the Spark UI configuration before starting the session. To do this, use the `%%configure` cell magic: 

```
%%configure { "--enable-spark-ui": "true", "--spark-event-logs-path": "s3://path" }
```

## Enable rolling logs
<a name="monitor-spark-ui-rolling-logs"></a>

 Enabling the Spark UI and rolling log event files for AWS Glue jobs provides several benefits: 
+  Rolling Log Event Files – With rolling log event files enabled, AWS Glue generates separate log files for each step of the job execution, making it easier to identify and troubleshoot issues specific to a particular stage or transformation. 
+  Better Log Management – Rolling log event files help in managing log files more efficiently. Instead of having a single, potentially large log file, the logs are split into smaller, more manageable files based on the job execution stages. This can simplify log archiving, analysis, and troubleshooting. 
+  Improved Fault Tolerance – If an AWS Glue job fails or is interrupted, the rolling log event files can provide valuable information about the last successful stage, making it easier to resume the job from that point rather than starting from scratch. 
+  Cost Optimization – By enabling rolling log event files, you can save on storage costs associated with log files. Instead of storing a single, potentially large log file, you store smaller, more manageable log files, which can be more cost-effective, especially for long-running or complex jobs. 

 In a new environment, you can explicitly enable rolling logs by passing the following configuration: 

```
'--conf': 'spark.eventLog.rolling.enabled=true'
```

or

```
'--conf': 'spark.eventLog.rolling.enabled=true --conf spark.eventLog.rolling.maxFileSize=128m'
```

 When rolling logs are enabled, `spark.eventLog.rolling.maxFileSize` specifies the maximum size of an event log file before it rolls over. This optional parameter defaults to 128 MB if not specified; the minimum is 10 MB. 
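As a sketch, the rolling-log configuration can be built in code with the documented 10 MB minimum validated up front. The helper name is illustrative; pass the result as a job parameter.

```python
# A sketch of building the rolling-log '--conf' job parameter, validating
# the documented 10 MB minimum up front. The helper name is illustrative.

def rolling_log_conf(max_file_size_mb: int = 128) -> dict:
    """Return the '--conf' job parameter that enables rolling logs."""
    if max_file_size_mb < 10:
        raise ValueError("spark.eventLog.rolling.maxFileSize minimum is 10 MB")
    return {
        "--conf": (
            "spark.eventLog.rolling.enabled=true "
            f"--conf spark.eventLog.rolling.maxFileSize={max_file_size_mb}m"
        ),
    }
```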

 The maximum total size of all generated rolled log event files is 2 GB. For AWS Glue jobs without rolling log support, the maximum log event file size supported for the Spark UI is 0.5 GB. 

You can turn off rolling logs for a streaming job by passing an additional configuration. Note that very large log files may be costly to maintain.

To turn off rolling logs, provide the following configuration:

```
'--spark-ui-event-logs-path': 'true',
'--conf': 'spark.eventLog.rolling.enabled=false'
```

# Launching the Spark history server
<a name="monitor-spark-ui-history"></a>

You can use a Spark history server to visualize Spark logs on your own infrastructure. You can see the same visualizations in the AWS Glue console for AWS Glue job runs on AWS Glue 4.0 or later versions with logs generated in the Standard (rather than legacy) format. For more information, see [Monitoring jobs using the Apache Spark web UI](monitor-spark-ui.md).

You can launch the Spark history server using a AWS CloudFormation template that hosts the server on an EC2 instance, or launch locally using Docker.

**Topics**
+ [Launching the Spark history server and viewing the Spark UI using AWS CloudFormation](#monitor-spark-ui-history-cfn)
+ [Launching the Spark history server and viewing the Spark UI using Docker](#monitor-spark-ui-history-local)

## Launching the Spark history server and viewing the Spark UI using AWS CloudFormation
<a name="monitor-spark-ui-history-cfn"></a>

You can use an AWS CloudFormation template to start the Apache Spark history server and view the Spark web UI. These templates are samples that you should modify to meet your requirements.

**To start the Spark history server and view the Spark UI using CloudFormation**

1. Choose one of the **Launch Stack** buttons in the following table. This launches the stack on the CloudFormation console.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html)

1. On the **Specify template** page, choose **Next**.

1. On the **Specify stack details** page, enter the **Stack name**. Enter additional information under **Parameters**.

   1. **Spark UI configuration**

      Provide the following information:
      + **IP address range** — The IP address range that can be used to view the Spark UI. If you want to restrict access from a specific IP address range, you should use a custom value. 
      + **History server port** — The port for the Spark UI. You can use the default value.
      + **Event log directory** — Choose the location where Spark event logs are stored from the AWS Glue job or development endpoints. You must use **s3a://** for the event logs path scheme.
      + **Spark package location** — You can use the default value.
      + **Keystore path** — SSL/TLS keystore path for HTTPS. If you want to use a custom keystore file, you can specify the S3 path `s3://path_to_your_keystore_file` here. If you leave this parameter empty, a self-signed certificate based keystore is generated and used.
      + **Keystore password** — Enter a SSL/TLS keystore password for HTTPS.

   1. **EC2 instance configuration**

      Provide the following information:
      + **Instance type** — The type of Amazon EC2 instance that hosts the Spark history server. Because this template launches an Amazon EC2 instance in your account, Amazon EC2 costs are charged to your account separately.
      + **Latest AMI ID** — The AMI ID of Amazon Linux 2 for the Spark history server instance. You can use the default value.
      + **VPC ID** — The virtual private cloud (VPC) ID for the Spark history server instance. You can use any of the VPCs available in your account. Using a default VPC with a [default Network ACL](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-network-acls.html#default-network-acl) is not recommended. For more information, see [Default VPC and Default Subnets](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html) and [Creating a VPC](https://docs.aws.amazon.com/vpc/latest/userguide/working-with-vpcs.html#Create-VPC) in the *Amazon VPC User Guide*.
      + **Subnet ID** — The subnet ID for the Spark history server instance. You can use any of the subnets in your VPC, but your client must be able to reach the subnet over the network. To access the server over the internet, use a public subnet that has an internet gateway in its route table.

   1. Choose **Next**.

1. On the **Configure stack options** page, choose **Next** to use the current user credentials for determining how CloudFormation can create, modify, or delete resources in the stack. You can also specify a role in the **Permissions** section to use instead of the current user permissions, and then choose **Next**.

1. On the **Review** page, review the template. 

   Select **I acknowledge that CloudFormation might create IAM resources**, and then choose **Create stack**.

1. Wait for the stack to be created.

1. Open the **Outputs** tab.

   1. Copy the URL of **SparkUiPublicUrl** if you are using a public subnet.

   1. Copy the URL of **SparkUiPrivateUrl** if you are using a private subnet.

1. Open a web browser, and paste in the URL. This lets you access the server using HTTPS on the specified port. Your browser may not recognize the server's certificate, in which case you have to override its protection and proceed anyway. 

## Launching the Spark history server and viewing the Spark UI using Docker
<a name="monitor-spark-ui-history-local"></a>

If you prefer local access (rather than hosting the Apache Spark history server on an EC2 instance), you can use Docker to start the Apache Spark history server and view the Spark UI locally. This Dockerfile is a sample that you should modify to meet your requirements. 

 **Prerequisites** 

For information about how to install Docker on your laptop, see the [Docker Engine community](https://docs.docker.com/install/).

**To start the Spark history server and view the Spark UI locally using Docker**

1. Download files from GitHub.

   Download the Dockerfile and `pom.xml` from [ AWS Glue code samples](https://github.com/aws-samples/aws-glue-samples/tree/master/utilities/Spark_UI/).

1. Determine if you want to use your user credentials or federated user credentials to access AWS.
   + To use the current user credentials for accessing AWS, get the values to use for `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` in the `docker run` command. For more information, see [Managing access keys for IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html) in the *IAM User Guide*.
   + To use SAML 2.0 federated users for accessing AWS, get the values for `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`. For more information, see [Requesting temporary security credentials](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html) in the *IAM User Guide*.

1. Determine the location of your event log directory, to use in the `docker run` command.

1. Build the Docker image using the files in the local directory, using the name `glue/sparkui` and the tag `latest`.

   ```
   $ docker build -t glue/sparkui:latest . 
   ```

1. Create and start the docker container.

   In the following commands, use the values obtained previously in steps 2 and 3.

   1. To create the docker container using your user credentials, use a command similar to the following:

      ```
      docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_eventlog
       -Dspark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY_ID -Dspark.hadoop.fs.s3a.secret.key=AWS_SECRET_ACCESS_KEY"
       -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
      ```

   1. To create the docker container using temporary credentials, use `org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider` as the provider, and provide the credential values obtained in step 2. For more information, see [Using Session Credentials with TemporaryAWSCredentialsProvider](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Using_Session_Credentials_with_TemporaryAWSCredentialsProvider) in the *Hadoop: Integration with Amazon Web Services* documentation.

      ```
      docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_eventlog
       -Dspark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY_ID -Dspark.hadoop.fs.s3a.secret.key=AWS_SECRET_ACCESS_KEY
       -Dspark.hadoop.fs.s3a.session.token=AWS_SESSION_TOKEN
       -Dspark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"
       -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
      ```
**Note**  
These configuration parameters come from the [Hadoop-AWS Module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html). You may need to add specific configuration based on your use case. For example, users in isolated Regions need to configure `spark.hadoop.fs.s3a.endpoint`.

1. Open `http://localhost:18080` in your browser to view the Spark UI locally.
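Before opening the browser, you can check that the container is serving requests by probing the history server's REST API. This sketch assumes the default port (18080) and Spark's standard `/api/v1/applications` monitoring endpoint.

```python
import urllib.request

# A sketch that checks whether the local Spark history server answers
# its REST API. Assumes the default port and Spark's standard
# /api/v1/applications monitoring endpoint.

def history_server_ready(port: int = 18080, timeout: float = 2.0) -> bool:
    """Return True if the local Spark history server responds."""
    url = f"http://localhost:{port}/api/v1/applications"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Connection refused or timed out: the server is not ready.
        return False
```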

# Monitoring with AWS Glue job run insights
<a name="monitor-job-insights"></a>

AWS Glue job run insights is a feature in AWS Glue that simplifies job debugging and optimization for your AWS Glue jobs. AWS Glue provides [Spark UI](https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui.html) and [CloudWatch logs and metrics](https://docs.aws.amazon.com/glue/latest/dg/monitor-cloudwatch.html) for monitoring your AWS Glue jobs. With this feature, you get this information about your AWS Glue job's execution:
+ Line number of your AWS Glue job script that had a failure.
+ Spark action that executed last in the Spark query plan just before the failure of your job.
+ Spark exception events related to the failure presented in a time-ordered log stream.
+ Root cause analysis and recommended action (such as tuning your script) to fix the issue.
+ Common Spark events (log messages relating to a Spark action) with a recommended action that addresses the root cause.

All these insights are available to you using two new log streams in the CloudWatch logs for your AWS Glue jobs.

## Requirements
<a name="monitor-job-insights-requirements"></a>

The AWS Glue job run insights feature is available for AWS Glue versions 2.0 and above. You can follow the [migration guide](https://docs.aws.amazon.com/glue/latest/dg/migrating-version-30.html) for your existing jobs to upgrade them from older AWS Glue versions.

## Enabling job run insights for an AWS Glue ETL job
<a name="monitor-job-insights-enable"></a>

You can enable job run insights through AWS Glue Studio or the CLI.

### AWS Glue Studio
<a name="monitor-job-insights-requirements"></a>

When creating a job via AWS Glue Studio, you can enable or disable job run insights under the **Job Details** tab. Check that the **Generate job insights** box is selected.

![\[Enabling job run insights in AWS Glue Studio.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-job-run-insights-1.png)


### Command line
<a name="monitor-job-insights-enable-cli"></a>

If creating a job via the CLI, you can start a job run with a single new [job parameter](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html): `--enable-job-insights` set to `true`.

By default, the job run insights log streams are created under the same default log group used by [AWS Glue continuous logging](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging.html), that is, `/aws-glue/jobs/logs-v2/`. You can set up a custom log group name, log filters, and log group configurations using the same set of arguments as continuous logging. For more information, see [Enabling Continuous Logging for AWS Glue Jobs](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html).
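As a sketch, the insights parameter can be assembled in code alongside the continuous-logging arguments before starting a job run. The log group value below is a placeholder; pass the result as the `Arguments` of a job run.

```python
from typing import Optional

# A sketch of assembling job-run arguments that enable job run insights,
# optionally with a custom continuous-logging log group. The log group
# name is a placeholder value.

def insights_arguments(log_group: Optional[str] = None) -> dict:
    """Return job-run arguments that enable job run insights."""
    args = {"--enable-job-insights": "true"}
    if log_group:
        # Same argument that AWS Glue continuous logging uses.
        args["--continuous-log-logGroup"] = log_group
    return args
```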

## Accessing the job run insights log streams in CloudWatch
<a name="monitor-job-insights-access"></a>

With the job run insights feature enabled, two log streams may be created when a job run fails. When a job finishes successfully, neither stream is generated.

1. *Exception analysis log stream*: `<job-run-id>-job-insights-rca-driver`. This stream provides the following:
   + Line number of your AWS Glue job script that caused the failure.
   + Spark action that executed last in the Spark query plan (DAG).
   + Concise time-ordered events from the Spark driver and executors that are related to the exception. You can find details such as complete error messages, the failed Spark task and its executors ID that help you to focus on the specific executor's log stream for a deeper investigation if needed.

1. *Rule-based insights log stream*: `<job-run-id>-job-insights-rule-driver`. This stream provides the following: 
   + Root cause analysis and recommendations on how to fix the errors (such as using a specific job parameter to optimize the performance).
   + Relevant Spark events serving as the basis for root cause analysis and a recommended action.

**Note**  
The first stream will exist only if any exception Spark events are available for a failed job run, and the second stream will exist only if any insights are available for the failed job run. For example, if your job finishes successfully, neither of the streams will be generated; if your job fails but there isn't a service-defined rule that can match with your failure scenario, then only the first stream will be generated.

If the job is created from AWS Glue Studio, the links to the above streams are also available under the job run details tab (Job run insights) as "Concise and consolidated error logs" and "Error analysis and guidance".
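Based on the stream naming described above, a small helper can derive both stream names from a job run ID so you can fetch them with CloudWatch Logs tooling (for example, `get_log_events` against `/aws-glue/jobs/logs-v2`); the helper name is illustrative.

```python
# A sketch deriving the two insight log stream names from a job run ID,
# following the naming described above. Use the names with CloudWatch
# Logs tooling (for example, get_log_events on /aws-glue/jobs/logs-v2).

def insight_log_streams(job_run_id: str) -> dict:
    """Return the exception-analysis and rule-based insight stream names."""
    return {
        "exception_analysis": f"{job_run_id}-job-insights-rca-driver",
        "rule_based_insights": f"{job_run_id}-job-insights-rule-driver",
    }
```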

![\[The Job Run Details page containing links to the log streams.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-job-run-insights-2.png)


## Example for AWS Glue job run insights
<a name="monitor-job-insights-example"></a>

In this section we present an example of how the job run insights feature can help you resolve an issue in your failed job. In this example, a user forgot to import the required module (tensorflow) in an AWS Glue job to analyze and build a machine learning model on their data.

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.types import *
from pyspark.sql.functions import udf,col

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

data_set_1 = [1, 2, 3, 4]
data_set_2 = [5, 6, 7, 8]

scoresDf = spark.createDataFrame(data_set_1, IntegerType())

def data_multiplier_func(factor, data_vector):
    import tensorflow as tf
    with tf.compat.v1.Session() as sess:
        x1 = tf.constant(factor)
        x2 = tf.constant(data_vector)
        result = tf.multiply(x1, x2)
        return sess.run(result).tolist()

data_multiplier_udf = udf(lambda x:data_multiplier_func(x, data_set_2), ArrayType(IntegerType(),False))
factoredDf = scoresDf.withColumn("final_value", data_multiplier_udf(col("value")))
print(factoredDf.collect())
```

Without the job run insights feature, as the job fails, you only see this message thrown by Spark:

`An error occurred while calling o111.collectToPython. Traceback (most recent call last):`

The message is ambiguous and limits your debugging experience. In this case, this feature provides you with additional insights in two CloudWatch log streams:

1. The `job-insights-rca-driver` log stream:
   + *Exception events*: This log stream provides you the Spark exception events related to the failure collected from the Spark driver and different distributed workers. These events help you understand the time-ordered propagation of the exception as faulty code executes across Spark tasks, executors, and stages distributed across the AWS Glue workers.
   + *Line numbers*: This log stream identifies line 21, which made the call to import the missing Python module that caused the failure; it also identifies line 24, the call to the Spark action `collect()`, as the last executed line in your script.  
![\[The job-insights-rca-driver log stream.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-job-run-insights-3.png)

1. The `job-insights-rule-driver` log stream:
   + *Root cause and recommendation*: In addition to the line number and last executed line number for the fault in your script, this log stream shows the root cause analysis and recommendation for you to follow the AWS Glue doc and set up the necessary job parameters in order to use an additional Python module in your AWS Glue job. 
   + *Basis event*: This log stream also shows the Spark exception event that was evaluated with the service-defined rule to infer the root cause and provide a recommendation.  
![\[The job-insights-rule-driver log stream.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-job-run-insights-4.png)

# Monitoring with Amazon CloudWatch
<a name="monitor-cloudwatch"></a>

You can monitor AWS Glue using Amazon CloudWatch, which collects and processes raw data from AWS Glue into readable, near-real-time metrics. These statistics are recorded for a period of two weeks so that you can access historical information for a better perspective on how your web application or service is performing. By default, AWS Glue metrics data is sent to CloudWatch automatically. For more information, see [What Is Amazon CloudWatch?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/WhatIsCloudWatch.html) in the *Amazon CloudWatch User Guide*, and [AWS Glue metrics](monitoring-awsglue-with-cloudwatch-metrics.md#awsglue-metrics).

 **Continuous logging** 

AWS Glue also supports real-time continuous logging for AWS Glue jobs. When continuous logging is enabled for a job, you can view the real-time logs on the AWS Glue console or the CloudWatch console dashboard. For more information, see [Logging for AWS Glue jobs](monitor-continuous-logging.md).

 **Observability metrics** 

 When **Job observability metrics** is enabled, additional Amazon CloudWatch metrics are generated when the job is run. Use AWS Glue observability metrics to generate insights into what is happening inside your AWS Glue jobs to improve the triaging and analysis of issues. 

**Topics**
+ [Monitoring AWS Glue using Amazon CloudWatch metrics](monitoring-awsglue-with-cloudwatch-metrics.md)
+ [Setting up Amazon CloudWatch alarms on AWS Glue job profiles](monitor-profile-glue-job-cloudwatch-alarms.md)
+ [Logging for AWS Glue jobs](monitor-continuous-logging.md)
+ [Monitoring with AWS Glue Observability metrics](monitor-observability.md)

# Monitoring AWS Glue using Amazon CloudWatch metrics
<a name="monitoring-awsglue-with-cloudwatch-metrics"></a>

You can profile and monitor AWS Glue operations using AWS Glue job profiler. It collects and processes raw data from AWS Glue jobs into readable, near real-time metrics stored in Amazon CloudWatch. These statistics are retained and aggregated in CloudWatch so that you can access historical information for a better perspective on how your application is performing.

**Note**  
 You may incur additional charges when you enable job metrics and CloudWatch custom metrics are created. For more information, see [ Amazon CloudWatch pricing ](https://aws.amazon.com/cloudwatch/pricing/). 

## AWS Glue metrics overview
<a name="metrics-overview"></a>

When you interact with AWS Glue, it sends metrics to CloudWatch. You can view these metrics using the AWS Glue console (the preferred method), the CloudWatch console dashboard, or the AWS Command Line Interface (AWS CLI). 

**To view metrics using the AWS Glue console dashboard**

You can view summary or detailed graphs of metrics for a job, or detailed graphs for a job run. 

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Job run monitoring**.

1. In **Job runs**, choose **Actions** to stop a job that is currently running, view a job, or rewind a job bookmark.

1. Select a job, then choose **View run details** to view additional information about the job run.

**To view metrics using the CloudWatch console dashboard**

Metrics are grouped first by the service namespace, and then by the various dimension combinations within each namespace.

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Metrics**.

1. Choose the **Glue** namespace.

**To view metrics using the AWS CLI**
+ At a command prompt, use the following command.

  ```
  aws cloudwatch list-metrics --namespace Glue
  ```

AWS Glue reports metrics to CloudWatch every 30 seconds, and the CloudWatch metrics dashboards are configured to display them every minute. The AWS Glue metrics represent delta values from the previously reported values. Where appropriate, metrics dashboards aggregate (sum) the 30-second values to obtain a value for the entire last minute.
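For example, the following boto3 sketch builds a `GetMetricStatistics` request that sums those 30-second deltas into the same per-minute values the dashboard shows. The job name, run ID, metric name, and time window here are placeholders.

```python
from datetime import datetime, timedelta, timezone

def glue_metric_query(job_name, job_run_id, metric_name, hours=1):
    """Build a CloudWatch GetMetricStatistics request for an AWS Glue metric.

    Because AWS Glue reports delta values every 30 seconds, a Sum over a
    60-second period reproduces the per-minute values shown on the
    AWS Glue metrics dashboard.
    """
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "Glue",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": job_run_id},
            {"Name": "Type", "Value": "count"},
        ],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 60,           # one dashboard data point per minute
        "Statistics": ["Sum"],  # sum the two 30-second deltas per minute
    }

# Pass the request to boto3, for example:
#   import boto3
#   cw = boto3.client("cloudwatch")
#   resp = cw.get_metric_statistics(**glue_metric_query(
#       "my-job", "jr_abc123", "glue.driver.aggregate.bytesRead"))
```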

### AWS Glue metrics behavior for Spark jobs
<a name="metrics-overview-spark"></a>

 AWS Glue metrics are enabled at initialization of a `GlueContext` in a script and are generally updated only at the end of an Apache Spark task. They represent the aggregate values across all completed Spark tasks so far.

However, the Spark metrics that AWS Glue passes on to CloudWatch are generally absolute values representing the current state at the time they are reported. AWS Glue reports them to CloudWatch every 30 seconds, and the metrics dashboards generally show the average across the data points received in the last 1 minute.

All AWS Glue metric names are preceded by one of the following prefixes:
+ `glue.driver.` – Metrics whose names begin with this prefix either represent AWS Glue metrics that are aggregated from all executors at the Spark driver, or Spark metrics corresponding to the Spark driver.
+ `glue.`*executorId*`.` – The *executorId* is the number of a specific Spark executor. It corresponds with the executors listed in the logs.
+ `glue.ALL.` – Metrics whose names begin with this prefix aggregate values from all Spark executors.
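To make the naming convention concrete, the following small helper (hypothetical, not part of any AWS library) classifies a metric name by its prefix:

```python
def metric_scope(metric_name):
    """Return the scope an AWS Glue metric name refers to, based on its prefix."""
    prefix, _, rest = metric_name.partition(".")
    if prefix != "glue":
        raise ValueError("not an AWS Glue metric name")
    scope, _, _ = rest.partition(".")
    if scope == "driver":
        return "driver"
    if scope == "ALL":
        return "all executors"
    # Anything else is a numeric Spark executor ID, as listed in the logs.
    return f"executor {scope}"
```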

## AWS Glue metrics
<a name="awsglue-metrics"></a>

AWS Glue profiles and sends the following metrics to CloudWatch every 30 seconds, and the AWS Glue metrics dashboard reports them once a minute:


| Metric | Description | 
| --- | --- | 
|  `glue.driver.aggregate.bytesRead` |  The number of bytes read from all data sources by all completed Spark tasks running in all executors. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Bytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) This metric can be used the same way as the `glue.ALL.s3.filesystem.read_bytes` metric, with the difference that this metric is updated at the end of a Spark task and captures non-S3 data sources as well.  | 
|  `glue.driver.aggregate.elapsedTime` |  The ETL elapsed time in milliseconds (does not include the job bootstrap times). Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Milliseconds Can be used to determine how long it takes a job run to run on average. Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.aggregate.numCompletedStages` |  The number of completed stages in the job. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.aggregate.numCompletedTasks` |  The number of completed tasks in the job. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.aggregate.numFailedTasks` |  The number of failed tasks. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) The data can be used to set alarms for increased failures that might suggest abnormalities in data, cluster, or scripts.  | 
|  `glue.driver.aggregate.numKilledTasks` |  The number of tasks killed. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.aggregate.recordsRead` |  The number of records read from all data sources by all completed Spark tasks running in all executors. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) This metric can be used in a similar way to the `glue.ALL.s3.filesystem.read_bytes` metric, with the difference that this metric is updated at the end of a Spark task.  | 
|  `glue.driver.aggregate.shuffleBytesWritten` |  The number of bytes written by all executors to shuffle data between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes written for this purpose during the previous minute). Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Bytes Can be used to monitor: Data shuffle in jobs (large joins, groupBy, repartition, coalesce). Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.aggregate.shuffleLocalBytesRead` |  The number of bytes read by all executors to shuffle data between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes read for this purpose during the previous minute). Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Bytes Can be used to monitor: Data shuffle in jobs (large joins, groupBy, repartition, coalesce). Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.BlockManager.disk.diskSpaceUsed_MB` |  The number of megabytes of disk space used across all executors. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Average. This is a Spark metric, reported as an absolute value. Unit: Megabytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.ExecutorAllocationManager.executors.numberAllExecutors` |  The number of actively running job executors. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Average. This is a Spark metric, reported as an absolute value. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors` |  The number of maximum (actively running and pending) job executors needed to satisfy the current load. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Maximum. This is a Spark metric, reported as an absolute value. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.jvm.heap.usage`  `glue.`*executorId*`.jvm.heap.usage`  `glue.ALL.jvm.heap.usage`  |  The fraction of memory used by the JVM heap (scale: 0-1) for the driver, the executor identified by *executorId*, or ALL executors. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Average. This is a Spark metric, reported as an absolute value. Unit: Percentage Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.jvm.heap.used`  `glue.`*executorId*`.jvm.heap.used`  `glue.ALL.jvm.heap.used`  |  The number of memory bytes used by the JVM heap for the driver, the executor identified by *executorId*, or ALL executors. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Average. This is a Spark metric, reported as an absolute value. Unit: Bytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.s3.filesystem.read_bytes`  `glue.`*executorId*`.s3.filesystem.read_bytes`  `glue.ALL.s3.filesystem.read_bytes`  |  The number of bytes read from Amazon S3 by the driver, an executor identified by *executorId*, or ALL executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes read during the previous minute). Valid dimensions: `JobName`, `JobRunId`, and `Type` (gauge). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard a SUM statistic is used for aggregation. The area under the curve on the AWS Glue Metrics Dashboard can be used to visually compare bytes read by two different job runs. Unit: Bytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Resulting data can be used for: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.s3.filesystem.write_bytes`  `glue.`*executorId*`.s3.filesystem.write_bytes`  `glue.ALL.s3.filesystem.write_bytes`  |  The number of bytes written to Amazon S3 by the driver, an executor identified by *executorId*, or ALL executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes written during the previous minute). Valid dimensions: `JobName`, `JobRunId`, and `Type` (gauge). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard a SUM statistic is used for aggregation. The area under the curve on the AWS Glue Metrics Dashboard can be used to visually compare bytes written by two different job runs. Unit: Bytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.streaming.numRecords` |  The number of records that are received in a micro-batch. This metric is only available for AWS Glue streaming jobs with AWS Glue version 2.0 and above. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: Sum, Maximum, Minimum, Average, Percentile Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.streaming.batchProcessingTimeInMs` |  The time it takes to process the batches in milliseconds. This metric is only available for AWS Glue streaming jobs with AWS Glue version 2.0 and above. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: Sum, Maximum, Minimum, Average, Percentile Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.system.cpuSystemLoad`  `glue.`*executorId*`.system.cpuSystemLoad`  `glue.ALL.system.cpuSystemLoad`  |  The fraction of CPU system load used (scale: 0-1) by the driver, an executor identified by *executorId*, or ALL executors. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Average. This metric is reported as an absolute value. Unit: Percentage Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 

## Dimensions for AWS Glue Metrics
<a name="awsglue-metricdimensions"></a>

AWS Glue metrics use the AWS Glue namespace and provide metrics for the following dimensions:


| Dimension | Description | 
| --- | --- | 
|  `JobName`  |  This dimension filters for metrics of all job runs of a specific AWS Glue job.  | 
|  `JobRunId`  |  This dimension filters for metrics of a specific AWS Glue job run by a JobRun ID, or `ALL`.  | 
|  `Type`  |  This dimension filters for metrics by either `count` (an aggregate number) or `gauge` (a value at a point in time).  | 

For more information, see the [Amazon CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/).

# Setting up Amazon CloudWatch alarms on AWS Glue job profiles
<a name="monitor-profile-glue-job-cloudwatch-alarms"></a>

AWS Glue metrics are also available in Amazon CloudWatch. You can set up alarms on any AWS Glue metric for scheduled jobs. 

A few common scenarios for setting up alarms are as follows:
+ Jobs running out of memory (OOM): Set an alarm when the memory usage exceeds the normal average for either the driver or an executor for an AWS Glue job.
+ Straggling executors: Set an alarm when the number of executors falls below a certain threshold for a large duration of time in an AWS Glue job.
+ Data backlog or reprocessing: Compare the metrics from individual jobs in a workflow using a CloudWatch math expression. You can then trigger an alarm on the resulting expression value (such as the ratio of bytes written by a job and bytes read by a following job).
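As a sketch of an alarm for the failure scenario, the following builds a `PutMetricAlarm` request for boto3 that fires when a job run reports failed tasks. The job name, threshold, period, and SNS topic ARN are placeholders you would tune to your workload.

```python
def failed_tasks_alarm(job_name, threshold=1, sns_topic_arn=None):
    """Build a CloudWatch PutMetricAlarm request that fires when an
    AWS Glue job reports failed tasks."""
    alarm = {
        "AlarmName": f"{job_name}-failed-tasks",
        "Namespace": "Glue",
        "MetricName": "glue.driver.aggregate.numFailedTasks",
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": "ALL"},
            {"Name": "Type", "Value": "count"},
        ],
        "Statistic": "Sum",
        "Period": 300,  # evaluate 5-minute windows of the delta values
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no data while idle is not a failure
    }
    if sns_topic_arn:
        alarm["AlarmActions"] = [sns_topic_arn]
    return alarm

# To create the alarm:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**failed_tasks_alarm("my-job"))
```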

For detailed instructions on setting alarms, see [Create or Edit a CloudWatch Alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ConsoleAlarms.html) in the *[Amazon CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/)*. 

For monitoring and debugging scenarios using CloudWatch, see [Job monitoring and debugging](monitor-profile-glue-job-cloudwatch-metrics.md).

# Logging for AWS Glue jobs
<a name="monitor-continuous-logging"></a>

 In AWS Glue 5.0, all jobs have real-time logging capabilities. Additionally, you can specify custom configuration options to tailor the logging behavior. These options include setting the Amazon CloudWatch log group name, the Amazon CloudWatch log stream prefix (which will precede the AWS Glue job run ID and driver/executor ID), and the log conversion pattern for log messages. These configurations allow you to aggregate logs in custom Amazon CloudWatch log groups with different expiration policies. Furthermore, you can analyze the logs more effectively by using custom log stream prefixes and conversion patterns. This level of customization enables you to optimize log management and analysis according to your specific requirements. 

## Logging behavior in AWS Glue 5.0
<a name="monitor-logging-behavior-glue-50"></a>

 By default, system logs, Spark daemon logs, and user AWS Glue Logger logs are written to the `/aws-glue/jobs/error` log group in Amazon CloudWatch. On the other hand, user stdout (standard output) and stderr (standard error) logs are written to the `/aws-glue/jobs/output` log group by default. 

## Custom logging
<a name="monitor-logging-custom"></a>

 You can customize the default log group and log stream prefixes using the following job arguments: 
+  `--custom-logGroup-prefix`: Allows you to specify a custom prefix for the `/aws-glue/jobs/error` and `/aws-glue/jobs/output` log groups. If you provide a custom prefix, the log group names will be in the following format: 
  +  `/aws-glue/jobs/error` will be `<customer prefix>/error` 
  +  `/aws-glue/jobs/output ` will be `<customer prefix>/output` 
+  `--custom-logStream-prefix`: Allows you to specify a custom prefix for the log stream names within the log groups. If you provide a custom prefix, the log stream names will be in the following format: 
  +  `jobrunid-driver` will be `<customer log stream>-driver` 
  +  `jobrunid-executorNum` will be `<customer log stream>-executorNum` 
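To make these naming patterns concrete, the following illustrative helper (hypothetical, not part of any AWS library) computes the log group and driver log stream names that result from the two arguments:

```python
def custom_log_names(job_run_id, log_group_prefix=None, log_stream_prefix=None):
    """Compute the CloudWatch log group and driver log stream names that
    result from --custom-logGroup-prefix and --custom-logStream-prefix,
    following the patterns described above."""
    group_prefix = log_group_prefix or "/aws-glue/jobs"  # documented default
    stream_prefix = log_stream_prefix or job_run_id      # default is the run ID
    return {
        "error_group": f"{group_prefix}/error",
        "output_group": f"{group_prefix}/output",
        "driver_stream": f"{stream_prefix}-driver",
    }
```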

 Validation rules and limitations for custom prefixes: 
+  The entire log stream name must be between 1 and 512 characters long. 
+  The custom prefix itself is restricted to 400 characters. 
+  The custom prefix must match the regular expression pattern `[^:\$1]\$1` (special characters allowed are '\$1', '-', and '/'). 

## Logging application-specific messages using the custom script logger
<a name="monitor-logging-script"></a>

You can use the AWS Glue logger to log any application-specific messages in the script that are sent in real time to the driver log stream.

The following example shows a Python script.

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
logger.info("info message")
logger.warn("warn message")
logger.error("error message")
```

The following example shows a Scala script.

```
import com.amazonaws.services.glue.log.GlueLogger

object GlueApp {
  def main(sysArgs: Array[String]) {
    val logger = new GlueLogger
    logger.info("info message")
    logger.warn("warn message")
    logger.error("error message")
  }
}
```

## Enabling the progress bar to show job progress
<a name="monitor-logging-progress"></a>

AWS Glue provides a real-time progress bar under the `JOB_RUN_ID-progress-bar` log stream to check AWS Glue job run status. Currently it supports only jobs that initialize `glueContext`. If you run a pure Spark job without initializing `glueContext`, the AWS Glue progress bar does not appear.

The progress bar shows the following progress update every 5 seconds.

```
[Stage Number (Stage Name): > (numCompletedTasks + numActiveTasks) / totalNumOfTasksInThisStage]
```

## Security configuration with Amazon CloudWatch logging
<a name="monitor-security-config-logging"></a>

 When a security configuration is enabled for Amazon CloudWatch logs, AWS Glue creates log groups with specific naming patterns that incorporate the security configuration name. 

### Log group naming with security configuration
<a name="monitor-log-group-naming"></a>

 The default and custom log groups will be as follows: 
+  **Default error log group:** `/aws-glue/jobs/Security-Configuration-Name-role/glue-job-role/error` 
+  **Default output log group:** `/aws-glue/jobs/Security-Configuration-Name-role/glue-job-role/output` 
+  **Custom error log group (AWS Glue 5.0):** `custom-log-group-prefix/Security-Configuration-Name-role/glue-job-role/error` 
+  **Custom output log group (AWS Glue 5.0):** `custom-log-group-prefix/Security-Configuration-Name-role/glue-job-role/output` 

### Required IAM Permissions
<a name="monitor-logging-iam-permissions"></a>

 If you enable a security configuration with Amazon CloudWatch Logs, you must add the `logs:AssociateKmsKey` permission to your IAM role. If that permission is not included, continuous logging is disabled. 

 Also, to configure encryption for Amazon CloudWatch Logs, follow the instructions at [Encrypt Log Data in Amazon CloudWatch Logs Using AWS Key Management Service](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/encrypt-log-data-kms.html) in the *Amazon CloudWatch Logs User Guide*. 

### Additional Information
<a name="additional-info"></a>

 For more information on creating security configurations, see [Managing security configurations on the AWS Glue console](https://docs.aws.amazon.com/glue/latest/dg/console-security-configurations.html). 

**Topics**
+ [Logging behavior in AWS Glue 5.0](#monitor-logging-behavior-glue-50)
+ [Custom logging](#monitor-logging-custom)
+ [Logging application-specific messages using the custom script logger](#monitor-logging-script)
+ [Enabling the progress bar to show job progress](#monitor-logging-progress)
+ [Security configuration with Amazon CloudWatch logging](#monitor-security-config-logging)
+ [Enabling continuous logging for AWS Glue 4.0 and earlier jobs](monitor-continuous-logging-enable.md)
+ [Viewing logs for AWS Glue jobs](monitor-continuous-logging-view.md)

# Enabling continuous logging for AWS Glue 4.0 and earlier jobs
<a name="monitor-continuous-logging-enable"></a>

**Note**  
 In AWS Glue 4.0 and earlier versions, continuous logging was an available feature. However, with the introduction of AWS Glue 5.0, all jobs have real-time logging capability. For more details on the logging capabilities and configuration options in AWS Glue 5.0, see [Logging for AWS Glue jobs](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging.html). 

You can enable continuous logging using the AWS Glue console or through the AWS Command Line Interface (AWS CLI). 

You can enable continuous logging when you create a new job, edit an existing job, or enable it through the AWS CLI.

You can also specify custom configuration options such as the Amazon CloudWatch log group name, the CloudWatch log stream prefix (which precedes the AWS Glue job run ID and driver/executor ID), and the log conversion pattern for log messages. These configurations help you aggregate logs in custom CloudWatch log groups with different expiration policies, and analyze them further with custom log stream prefixes and conversion patterns. 

**Topics**
+ [Using the AWS Management Console](#monitor-continuous-logging-enable-console)
+ [Logging application-specific messages using the custom script logger](#monitor-continuous-logging-script)
+ [Enabling the progress bar to show job progress](#monitor-continuous-logging-progress)
+ [Security configuration with continuous logging](#monitor-logging-encrypt-log-data)

## Using the AWS Management Console
<a name="monitor-continuous-logging-enable-console"></a>

Follow these steps to use the console to enable continuous logging when creating or editing an AWS Glue job.

**To create a new AWS Glue job with continuous logging**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **ETL jobs**.

1. Choose **Visual ETL**.

1. In the **Job details** tab, expand the **Advanced properties** section.

1. Under **Continuous logging**, select **Enable logs in CloudWatch**.

**To enable continuous logging for an existing AWS Glue job**

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Jobs**.

1. Choose an existing job from the **Jobs** list.

1. Choose **Action**, **Edit job**.

1. In the **Job details** tab, expand the **Advanced properties** section.

1. Under **Continuous logging**, select **Enable logs in CloudWatch**.

### Using the AWS CLI
<a name="monitor-continuous-logging-cli"></a>

To enable continuous logging, pass the following special job parameters to an AWS Glue job in the same way that you pass other AWS Glue job parameters. For more information, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

```
'--enable-continuous-cloudwatch-log': 'true'
```

You can specify a custom Amazon CloudWatch log group name. If not specified, the default log group name is `/aws-glue/jobs/logs-v2`.

```
'--continuous-log-logGroup': 'custom_log_group_name'
```

You can specify a custom Amazon CloudWatch log stream prefix. If not specified, the default log stream prefix is the job run ID.

```
'--continuous-log-logStreamPrefix': 'custom_log_stream_prefix'
```

You can specify a custom continuous logging conversion pattern. If not specified, the default conversion pattern is `%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n`. Note that the conversion pattern only applies to driver logs and executor logs. It does not affect the AWS Glue progress bar.

```
'--continuous-log-conversionPattern': 'custom_log_conversion_pattern'
```
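Putting these parameters together, the following sketch assembles an argument map that you might pass as `Arguments` to `start_job_run`, or set as `DefaultArguments` on the job. The log group, stream prefix, and conversion pattern values shown in the usage comment are placeholders; omitted options fall back to the documented defaults at run time.

```python
def continuous_logging_args(log_group=None, stream_prefix=None, pattern=None):
    """Assemble the continuous-logging job arguments described above.

    Only explicitly set options are included; AWS Glue applies the
    documented defaults for anything omitted.
    """
    args = {"--enable-continuous-cloudwatch-log": "true"}
    if log_group:
        args["--continuous-log-logGroup"] = log_group
    if stream_prefix:
        args["--continuous-log-logStreamPrefix"] = stream_prefix
    if pattern:
        args["--continuous-log-conversionPattern"] = pattern
    return args

# For example, with boto3:
#   import boto3
#   boto3.client("glue").start_job_run(
#       JobName="my-job",
#       Arguments=continuous_logging_args(log_group="custom_log_group_name"))
```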

## Logging application-specific messages using the custom script logger
<a name="monitor-continuous-logging-script"></a>

You can use the AWS Glue logger to log any application-specific messages in the script that are sent in real time to the driver log stream.

The following example shows a Python script.

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
logger.info("info message")
logger.warn("warn message")
logger.error("error message")
```

The following example shows a Scala script.

```
import com.amazonaws.services.glue.log.GlueLogger

object GlueApp {
  def main(sysArgs: Array[String]) {
    val logger = new GlueLogger
    logger.info("info message")
    logger.warn("warn message")
    logger.error("error message")
  }
}
```

## Enabling the progress bar to show job progress
<a name="monitor-continuous-logging-progress"></a>

AWS Glue provides a real-time progress bar under the `JOB_RUN_ID-progress-bar` log stream to check AWS Glue job run status. Currently it supports only jobs that initialize `glueContext`. If you run a pure Spark job without initializing `glueContext`, the AWS Glue progress bar does not appear.

The progress bar shows the following progress update every 5 seconds.

```
Stage Number (Stage Name): > (numCompletedTasks + numActiveTasks) / totalNumOfTasksInThisStage]
```

## Security configuration with continuous logging
<a name="monitor-logging-encrypt-log-data"></a>

If a security configuration is enabled for CloudWatch logs, AWS Glue will create a log group named as follows for continuous logs:

```
<Log-Group-Name>-<Security-Configuration-Name>
```

The default and custom log groups are named as follows:
+ The default continuous log group is `/aws-glue/jobs/logs-v2-<Security-Configuration-Name>`
+ The custom continuous log group is `<custom-log-group-name>-<Security-Configuration-Name>`
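The naming rule above can be expressed as a one-line helper (a sketch for illustration only; this function is not part of any AWS SDK):

```python
def secured_log_group(base_group: str, security_config: str) -> str:
    """Append the security configuration name to a continuous log group name,
    following the documented <Log-Group-Name>-<Security-Configuration-Name> pattern."""
    return f"{base_group}-{security_config}"
```

For example, a custom log group `my-glue-logs` combined with a security configuration named `my-sec-cfg` yields the log group `my-glue-logs-my-sec-cfg`.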

If you enable a security configuration with CloudWatch Logs, you need to add the `logs:AssociateKmsKey` permission to your IAM role. If that permission is not included, continuous logging is disabled. To configure encryption for CloudWatch Logs, follow the instructions at [Encrypt Log Data in CloudWatch Logs Using AWS Key Management Service](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/encrypt-log-data-kms.html) in the *Amazon CloudWatch Logs User Guide*.

For more information on creating security configurations, see [Managing security configurations on the AWS Glue console](console-security-configurations.md).

**Note**  
You may incur additional charges when you enable logging and additional CloudWatch log events are created. For more information, see [Amazon CloudWatch pricing](https://aws.amazon.com/cloudwatch/pricing/).

# Viewing logs for AWS Glue jobs
<a name="monitor-continuous-logging-view"></a>

You can view real-time logs using the AWS Glue console or the Amazon CloudWatch console.

**To view real-time logs using the AWS Glue console dashboard**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Jobs**.

1. Add a new job, or select an existing job. Then choose **Action**, **Run job**.

   When you start running a job, you navigate to a page that contains information about the running job:
   + The **Logs** tab shows the older aggregated application logs.
   + The **Logs** tab shows a real-time progress bar when the job is running with `glueContext` initialized.
   + The **Logs** tab also contains the **Driver logs**, which capture real-time Apache Spark driver logs, as well as application logs that the script writes using the AWS Glue application logger while the job is running.

1. For older jobs, you can also view the real-time logs under the **Job History** view by choosing **Logs**. This action takes you to the CloudWatch console, which shows all Spark driver, executor, and progress bar log streams for that job run.

**To view real-time logs using the CloudWatch console dashboard**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Logs**.

1. Choose the **/aws-glue/jobs/logs-v2/** log group.

1. In the **Filter** box, paste the job run ID.

   You can view the driver logs, executor logs, and progress bar (if using the **Standard filter**).

# Monitoring with AWS Glue Observability metrics
<a name="monitor-observability"></a>

**Note**  
AWS Glue Observability metrics is available on AWS Glue 4.0 and later versions.

 Use AWS Glue Observability metrics to generate insights into what is happening inside your AWS Glue for Apache Spark jobs to improve triaging and analysis of issues. Observability metrics are visualized through Amazon CloudWatch dashboards and can be used to help perform root cause analysis for errors and for diagnosing performance bottlenecks. You can reduce the time spent debugging issues at scale so you can focus on resolving issues faster and more effectively. 

 AWS Glue Observability provides Amazon CloudWatch metrics categorized into the following four groups: 
+  **Reliability (error classes)** – easily identify the most common failure reasons in a given time range that you may want to address. 
+  **Performance (skewness)** – identify performance bottlenecks and apply tuning techniques. For example, when you experience degraded performance due to job skewness, you may want to enable Spark Adaptive Query Execution and fine-tune the skew join threshold. 
+  **Throughput (per-source/per-sink throughput)** – monitor trends of data reads and writes. You can also configure Amazon CloudWatch alarms for anomalies. 
+  **Resource utilization (workers, memory, and disk utilization)** – efficiently find jobs with low capacity utilization. You may want to enable AWS Glue auto scaling for those jobs. 

## Getting started with AWS Glue Observability metrics
<a name="monitor-observability-getting-started"></a>

**Note**  
 The new metrics are enabled by default in the AWS Glue Studio console. 

**To configure observability metrics in AWS Glue Studio:**

1. Log in to the AWS Glue console and choose **ETL jobs** from the console menu.

1. Choose a job by selecting the job name in the **Your jobs** section.

1. Choose the **Job details** tab.

1. Scroll to the bottom and choose **Advanced properties**, then **Job observability metrics**.  
![\[The screenshot shows the Job details tab Advanced properties. The Job observability metrics option is highlighted.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job-details-observability-metrics.png)

**To enable AWS Glue Observability metrics using AWS CLI:**
+  Add the following key-value pair to the `--default-arguments` map in the input JSON file: 

  ```
  --enable-observability-metrics, true
  ```
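As a sketch of where that key-value pair sits, the following builds a minimal job-update payload in Python and adds the flag to `DefaultArguments`. The role ARN, script location, and surrounding job structure are placeholders; only the observability flag itself comes from the documentation above.

```python
import json

# Hypothetical job-update payload -- everything except the observability flag
# is a placeholder for illustration.
job_update = {
    "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
    "Command": {"Name": "glueetl", "ScriptLocation": "s3://amzn-s3-demo-bucket/script.py"},
    "DefaultArguments": {"--job-language": "python"},
}

# Add the documented flag to the job's default arguments.
job_update["DefaultArguments"]["--enable-observability-metrics"] = "true"

print(json.dumps(job_update["DefaultArguments"], indent=2))
```

The resulting JSON could then be passed to `aws glue update-job` as the job-update input file, or the same map could be supplied through boto3.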

## Using AWS Glue observability
<a name="monitor-observability-cloudwatch"></a>

 Because AWS Glue observability metrics are provided through Amazon CloudWatch, you can use the Amazon CloudWatch console, AWS CLI, SDK, or API to query the observability metric data points. See [Using Glue Observability for monitoring resource utilization to reduce cost](https://aws.amazon.com/blogs/big-data/enhance-monitoring-and-debugging-for-aws-glue-jobs-using-new-job-observability-metrics/) for an example use case. 

### Using AWS Glue observability in the Amazon CloudWatch console
<a name="monitor-observability-cloudwatch-console"></a>

**To query and visualize metrics in the Amazon CloudWatch console:**

1.  Open the Amazon CloudWatch console and choose **All metrics**. 

1.  Under custom namespaces, choose **AWS Glue**. 

1.  Choose **Job Observability Metrics**, **Observability Metrics Per Source**, or **Observability Metrics Per Sink**. 

1. Search for the specific metric name, job name, and job run ID, and then select them.

1. Under the **Graphed metrics** tab, configure your preferred statistic, period, and other options.  
![\[The screenshot shows the Amazon CloudWatch console and metrics graph.\]](http://docs.aws.amazon.com/glue/latest/dg/images/cloudwatch-console-metrics.png)

**To query an Observability metric using AWS CLI:**

1.  Create a metric definition JSON file and replace `your-Glue-job-name` and `your-Glue-job-run-id` with your own values. 

   ```
   $ cat multiplequeries.json
   [
       {
           "Id": "avgWorkerUtil_0",
           "MetricStat": {
               "Metric": {
                   "Namespace": "Glue",
                   "MetricName": "glue.driver.workerUtilization",
                   "Dimensions": [
                       {
                           "Name": "JobName",
                           "Value": "<your-Glue-job-name-A>"
                       },
                       {
                           "Name": "JobRunId",
                           "Value": "<your-Glue-job-run-id-A>"
                       },
                       {
                           "Name": "Type",
                           "Value": "gauge"
                       },
                       {
                           "Name": "ObservabilityGroup",
                           "Value": "resource_utilization"
                       }
                   ]
               },
               "Period": 1800,
               "Stat": "Minimum",
               "Unit": "None"
           }
       },
       {
           "Id": "avgWorkerUtil_1",
           "MetricStat": {
               "Metric": {
                   "Namespace": "Glue",
                   "MetricName": "glue.driver.workerUtilization",
                   "Dimensions": [
                       {
                           "Name": "JobName",
                           "Value": "<your-Glue-job-name-B>"
                       },
                       {
                           "Name": "JobRunId",
                           "Value": "<your-Glue-job-run-id-B>"
                       },
                       {
                           "Name": "Type",
                           "Value": "gauge"
                       },
                       {
                           "Name": "ObservabilityGroup",
                           "Value": "resource_utilization"
                       }
                   ]
               },
               "Period": 1800,
               "Stat": "Minimum",
               "Unit": "None"
           }
       }
   ]
   ```

1.  Run the `get-metric-data` command: 

   ```
   $ aws cloudwatch get-metric-data --metric-data-queries file://multiplequeries.json \
        --start-time '2023-10-28T18:20' \
        --end-time '2023-10-28T19:10'  \
        --region us-east-1
   {
       "MetricDataResults": [
           {
               "Id": "avgWorkerUtil_0",
               "Label": "<your-label-for-A>",
               "Timestamps": [
                   "2023-10-28T18:20:00+00:00"
               ],
               "Values": [
                   0.06718750000000001
               ],
               "StatusCode": "Complete"
           },
           {
               "Id": "avgWorkerUtil_1",
               "Label": "<your-label-for-B>",
               "Timestamps": [
                   "2023-10-28T18:50:00+00:00"
               ],
               "Values": [
                   0.5959183673469387
               ],
               "StatusCode": "Complete"
           }
       ],
       "Messages": []
   }
   ```
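Rather than hand-editing the JSON file, you can generate the query list programmatically. The following sketch builds the same structure shown above for any set of job name and job run ID pairs (the helper name is illustrative, not an AWS API):

```python
def worker_util_queries(runs, period=1800, stat="Minimum"):
    """Build a CloudWatch metric-data-queries list for the
    glue.driver.workerUtilization observability metric.

    runs -- iterable of (job_name, job_run_id) pairs.
    """
    queries = []
    for i, (job_name, run_id) in enumerate(runs):
        queries.append({
            "Id": f"avgWorkerUtil_{i}",
            "MetricStat": {
                "Metric": {
                    "Namespace": "Glue",
                    "MetricName": "glue.driver.workerUtilization",
                    "Dimensions": [
                        {"Name": "JobName", "Value": job_name},
                        {"Name": "JobRunId", "Value": run_id},
                        {"Name": "Type", "Value": "gauge"},
                        {"Name": "ObservabilityGroup", "Value": "resource_utilization"},
                    ],
                },
                "Period": period,
                "Stat": stat,
                "Unit": "None",
            },
        })
    return queries
```

Serializing the returned list with `json.dump` produces a file you can pass via `--metric-data-queries file://...`, or you can pass the list directly to a boto3 CloudWatch client as the `MetricDataQueries` parameter of `get_metric_data`.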

## Observability metrics
<a name="monitor-observability-metrics-definitions"></a>

 AWS Glue Observability profiles and sends the following metrics to Amazon CloudWatch every 30 seconds. Some of these metrics are also visible on the AWS Glue Studio job run monitoring page. 


| Metric | Description | Category | 
| --- | --- | --- | 
| glue.driver.skewness.stage |  Metric Category: job\_performance The Spark stage execution skewness: this metric indicates how long the maximum task duration in a stage is relative to the median task duration in that stage. It captures execution skewness, which might be caused by input data skewness or by a transformation (for example, a skewed join). The values of this metric fall into the range [0, infinity), where 0 means that the ratio of the maximum to the median task execution time among all tasks in the stage is less than a certain stage skewness factor. The default stage skewness factor is `5`, and it can be overridden via the Spark configuration spark.metrics.conf.driver.source.glue.jobPerformance.skewnessFactor. A stage skewness value of 1 means the ratio is twice the stage skewness factor.  The value of stage skewness is updated every 30 seconds to reflect the current skewness. The value at the end of the stage reflects the final stage skewness. This stage-level metric is used to calculate the job-level metric `glue.driver.skewness.job`. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (job\_performance) Valid Statistics: Average, Maximum, Minimum, Percentile Unit: Count  | job\_performance | 
| glue.driver.skewness.job |  Metric Category: job\_performance  Job skewness is the maximum of the weighted skewness of all stages. The stage skewness (glue.driver.skewness.stage) is weighted by stage duration. This avoids the corner case in which a very skewed stage actually runs for a very short time relative to other stages (so its skewness is not significant for overall job performance and is not worth the effort to address).  This metric is updated upon completion of each stage, so the last value reflects the actual overall job skewness. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (job\_performance) Valid Statistics: Average, Maximum, Minimum, Percentile Unit: Count  | job\_performance | 
| glue.succeed.ALL |  Metric Category: error Total number of successful job runs, to complete the picture of failure categories Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (count), and ObservabilityGroup (error) Valid Statistics: SUM Unit: Count  | error | 
| glue.error.ALL |  Metric Category: error  Total number of job run errors, to complete the picture of failure categories Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (count), and ObservabilityGroup (error) Valid Statistics: SUM Unit: Count  | error | 
| glue.error.[error category] |  Metric Category: error  This is actually a set of metrics that are updated only when a job run fails. The error categorization helps with triaging and debugging. When a job run fails, the error causing the failure is categorized and the corresponding error category metric is set to 1. This helps you analyze failures over time, as well as across all jobs, to identify the most common failure categories and start addressing them. AWS Glue has 28 error categories, including the OUT\_OF\_MEMORY (driver and executor), PERMISSION, SYNTAX, and THROTTLING error categories. Error categories also include the COMPILATION, LAUNCH, and TIMEOUT error categories. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (count), and ObservabilityGroup (error) Valid Statistics: SUM Unit: Count  | error | 
| glue.driver.workerUtilization |  Metric Category: resource\_utilization  The percentage of allocated workers that are actually used. If utilization is low, AWS Glue auto scaling can help. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average, Maximum, Minimum, Percentile Unit: Percentage  | resource\_utilization | 
| glue.driver.memory.heap.[available \| used] |  Metric Category: resource\_utilization  The driver's available/used heap memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.driver.memory.heap.used.percentage |  Metric Category: resource\_utilization  The driver's used (%) heap memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.driver.memory.non-heap.[available \| used] |  Metric Category: resource\_utilization  The driver's available/used non-heap memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.driver.memory.non-heap.used.percentage |  Metric Category: resource\_utilization  The driver's used (%) non-heap memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.driver.memory.total.[available \| used] |  Metric Category: resource\_utilization  The driver's available/used total memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.driver.memory.total.used.percentage |  Metric Category: resource\_utilization  The driver's used (%) total memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.ALL.memory.heap.[available \| used] |  Metric Category: resource\_utilization  The executors' available/used heap memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.ALL.memory.heap.used.percentage |  Metric Category: resource\_utilization  The executors' used (%) heap memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.ALL.memory.non-heap.[available \| used] |  Metric Category: resource\_utilization  The executors' available/used non-heap memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.ALL.memory.non-heap.used.percentage |  Metric Category: resource\_utilization  The executors' used (%) non-heap memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.ALL.memory.total.[available \| used] |  Metric Category: resource\_utilization  The executors' available/used total memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.ALL.memory.total.used.percentage |  Metric Category: resource\_utilization  The executors' used (%) total memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.driver.disk.[available\_GB \| used\_GB] |  Metric Category: resource\_utilization  The driver's available/used disk space during the job run. This helps you understand disk usage trends, especially over time, which can help avoid potential failures, in addition to debugging failures related to insufficient disk space. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Gigabytes  | resource\_utilization | 
| glue.driver.disk.used.percentage |  Metric Category: resource\_utilization  The driver's used (%) disk space during the job run. This helps you understand disk usage trends, especially over time, which can help avoid potential failures, in addition to debugging failures related to insufficient disk space. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.ALL.disk.[available\_GB \| used\_GB] |  Metric Category: resource\_utilization  The executors' available/used disk space. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Gigabytes  | resource\_utilization | 
| glue.ALL.disk.used.percentage |  Metric Category: resource\_utilization  The executors' used (%) disk space. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.driver.bytesRead |  Metric Category: throughput  The number of bytes read per input source in this job run, as well as for ALL sources. This helps you understand the data volume and its changes over time, which helps address issues such as data skewness. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Source (the source data location) Valid Statistics: Average Unit: Bytes  | throughput | 
| glue.driver.[recordsRead \| filesRead]  |  Metric Category: throughput  The number of records/files read per input source in this job run, as well as for ALL sources. This helps you understand the data volume and its changes over time, which helps address issues such as data skewness. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Source (the source data location) Valid Statistics: Average Unit: Count  | throughput | 
| glue.driver.partitionsRead  |  Metric Category: throughput  The number of partitions read per Amazon S3 input source in this job run, as well as for ALL sources. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Source (the source data location) Valid Statistics: Average Unit: Count  | throughput | 
| glue.driver.bytesWritten |  Metric Category: throughput  The number of bytes written per output sink in this job run, as well as for ALL sinks. This helps you understand the data volume and how it evolves over time, which helps address issues such as processing skewness. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Sink (the sink data location) Valid Statistics: Average Unit: Bytes  | throughput | 
| glue.driver.[recordsWritten \| filesWritten] |  Metric Category: throughput  The number of records/files written per output sink in this job run, as well as for ALL sinks. This helps you understand the data volume and how it evolves over time, which helps address issues such as processing skewness. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Sink (the sink data location) Valid Statistics: Average Unit: Count  | throughput | 

## Error categories
<a name="monitor-observability-error-categories"></a>


| Error categories | Description | 
| --- | --- | 
| COMPILATION\_ERROR | Errors that arise during compilation of Scala code. | 
| CONNECTION\_ERROR | Errors that arise when connecting to a service, remote host, database service, and so on. | 
| DISK\_NO\_SPACE\_ERROR |  Errors that arise when there is no disk space left on the driver or an executor.  | 
| OUT\_OF\_MEMORY\_ERROR | Errors that arise when there is no memory left on the driver or an executor. | 
| IMPORT\_ERROR | Errors that arise when importing dependencies. | 
| INVALID\_ARGUMENT\_ERROR | Errors that arise when the input arguments are invalid or illegal. | 
| PERMISSION\_ERROR | Errors that arise when permission to a service, data, and so on, is lacking.  | 
| RESOURCE\_NOT\_FOUND\_ERROR |  Errors that arise when data, a location, and so on, does not exist.   | 
| QUERY\_ERROR | Errors that arise from Spark SQL query execution.  | 
| SYNTAX\_ERROR | Errors that arise when there is a syntax error in the script.  | 
| THROTTLING\_ERROR | Errors that arise when hitting a service concurrency limit or exceeding a service quota. | 
| DATA\_LAKE\_FRAMEWORK\_ERROR | Errors that arise from an AWS Glue natively supported data lake framework, such as Hudi or Iceberg. | 
| UNSUPPORTED\_OPERATION\_ERROR | Errors that arise when performing an unsupported operation. | 
| RESOURCES\_ALREADY\_EXISTS\_ERROR | Errors that arise when a resource to be created or added already exists. | 
| GLUE\_INTERNAL\_SERVICE\_ERROR | Errors that arise when there is an AWS Glue internal service issue.  | 
| GLUE\_OPERATION\_TIMEOUT\_ERROR | Errors that arise when an AWS Glue operation times out. | 
| GLUE\_VALIDATION\_ERROR | Errors that arise when a required value could not be validated for the AWS Glue job. | 
| GLUE\_JOB\_BOOKMARK\_VERSION\_MISMATCH\_ERROR | Errors that arise when the same job runs on the same source bucket and writes to the same or a different destination concurrently (concurrency > 1). | 
| LAUNCH\_ERROR | Errors that arise during the AWS Glue job launch phase. | 
| DYNAMODB\_ERROR | Generic errors that arise from the Amazon DynamoDB service. | 
| GLUE\_ERROR | Generic errors that arise from the AWS Glue service. | 
| LAKEFORMATION\_ERROR | Generic errors that arise from the AWS Lake Formation service. | 
| REDSHIFT\_ERROR | Generic errors that arise from the Amazon Redshift service. | 
| S3\_ERROR | Generic errors that arise from the Amazon S3 service. | 
| SYSTEM\_EXIT\_ERROR | Generic system exit error. | 
| TIMEOUT\_ERROR | Generic errors that arise when the job fails due to an operation timeout. | 
| UNCLASSIFIED\_SPARK\_ERROR | Generic errors that arise from Spark. | 
| UNCLASSIFIED\_ERROR | Default error category. | 

## Limitations
<a name="monitoring-observability-limitations"></a>

**Note**  
`glueContext` must be initialized to publish the metrics.

 In the Source dimension, the value is either an Amazon S3 path or a table name, depending on the source type. In addition, if the source is JDBC and the query option is used, the query string is set in the Source dimension. If the value is longer than 500 characters, it is trimmed to 500 characters. The following are limitations on the value: 
+ Non-ASCII characters are removed.
+ If the source name doesn't contain any ASCII characters, it is converted to `<non-ASCII input>`.
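These trimming rules can be illustrated with a short function. This mimics the documented behavior for illustration only; it is not the actual AWS Glue implementation.

```python
def normalize_source_value(value: str, limit: int = 500) -> str:
    """Mimic the documented Source-dimension rules: strip non-ASCII
    characters, fall back to a placeholder when nothing remains, and
    trim the result to the length limit."""
    ascii_only = "".join(ch for ch in value if ord(ch) < 128)
    if not ascii_only:
        return "<non-ASCII input>"
    return ascii_only[:limit]
```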

### Limitations and considerations for throughput metrics
<a name="monitoring-observability-considerations"></a>
+  DataFrame and DataFrame-based DynamicFrame (for example, JDBC, or reading Parquet on Amazon S3) are supported; however, RDD-based DynamicFrame (for example, reading CSV or JSON on Amazon S3) is not supported. Technically, all reads and writes visible on the Spark UI are supported. 
+  The `recordsRead` metric is emitted if the data source is a catalog table and the format is JSON, CSV, text, or Iceberg. 
+  The `glue.driver.throughput.recordsWritten`, `glue.driver.throughput.bytesWritten`, and `glue.driver.throughput.filesWritten` metrics are not available for JDBC and Iceberg tables. 
+  Metrics may be delayed. If the job finishes in about one minute, there may be no throughput metrics in Amazon CloudWatch Metrics. 

# Job monitoring and debugging
<a name="monitor-profile-glue-job-cloudwatch-metrics"></a>

You can collect metrics about AWS Glue jobs and visualize them on the AWS Glue and Amazon CloudWatch consoles to identify and fix issues. Profiling your AWS Glue jobs requires the following steps:

1.  Enable metrics: 

   1.  Enable the **Job metrics** option in the job definition. You can enable profiling in the AWS Glue console or as a parameter to the job. For more information, see [Defining job properties for Spark jobs](add-job.md#create-job) or [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md). 

   1.  Enable the **AWS Glue Observability metrics** option in the job definition. You can enable Observability in the AWS Glue console or as a parameter to the job. For more information, see [Monitoring with AWS Glue Observability metrics](monitor-observability.md). 

1. Confirm that the job script initializes a `GlueContext`. For example, the following script snippet initializes a `GlueContext` and shows where profiled code is placed in the script. This general format is used in the debugging scenarios that follow. 

   ```
   import sys
   from awsglue.transforms import *
   from awsglue.utils import getResolvedOptions
   from pyspark.context import SparkContext
   from awsglue.context import GlueContext
   from awsglue.job import Job
   import time
   
   ## @params: [JOB_NAME]
   args = getResolvedOptions(sys.argv, ['JOB_NAME'])
   
   sc = SparkContext()
   glueContext = GlueContext(sc)
   spark = glueContext.spark_session
   job = Job(glueContext)
   job.init(args['JOB_NAME'], args)
   
   ...
   ...
   code-to-profile
   ...
   ...
   
   
   job.commit()
   ```

1. Run the job.

1. Visualize the metrics:

   1. Visualize job metrics on the AWS Glue console and identify abnormal metrics for the driver or an executor.

   1. Check observability metrics in the Job run monitoring page, job run details page, or on Amazon CloudWatch. For more information, see [Monitoring with AWS Glue Observability metrics](monitor-observability.md).

1. Narrow down the root cause using the identified metric.

1. Optionally, confirm the root cause using the log stream of the identified driver or job executor.

 **Use cases for AWS Glue observability metrics** 
+  [Debugging OOM exceptions and job abnormalities](monitor-profile-debug-oom-abnormalities.md) 
+  [Debugging demanding stages and straggler tasks](monitor-profile-debug-straggler.md) 
+  [Monitoring the progress of multiple jobs](monitor-debug-multiple.md) 
+  [Monitoring for DPU capacity planning](monitor-debug-capacity.md) 
+  [ Using AWS Glue Observability for monitoring resource utilization to reduce cost ](https://aws.amazon.com/blogs/big-data/enhance-monitoring-and-debugging-for-aws-glue-jobs-using-new-job-observability-metrics) 

# Debugging OOM exceptions and job abnormalities
<a name="monitor-profile-debug-oom-abnormalities"></a>

You can debug out-of-memory (OOM) exceptions and job abnormalities in AWS Glue. The following sections describe scenarios for debugging out-of-memory exceptions of the Apache Spark driver or a Spark executor. 
+ [Debugging a driver OOM exception](#monitor-profile-debug-oom-driver)
+ [Debugging an executor OOM exception](#monitor-profile-debug-oom-executor)

## Debugging a driver OOM exception
<a name="monitor-profile-debug-oom-driver"></a>

In this scenario, a Spark job is reading a large number of small files from Amazon Simple Storage Service (Amazon S3). It converts the files to Apache Parquet format and then writes them out to Amazon S3. The Spark driver is running out of memory. The input Amazon S3 data has more than 1 million files in different Amazon S3 partitions. 

The profiled code is as follows:

```
data = spark.read.format("json").option("inferSchema", False).load("s3://input_path")
data.write.format("parquet").save(output_path)
```

### Visualize the profiled metrics on the AWS Glue console
<a name="monitor-debug-oom-visualize"></a>

The following graph shows the memory usage as a percentage for the driver and executors. This usage is plotted as one data point that is averaged over the values reported in the last minute. You can see in the memory profile of the job that the [driver memory](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.jvm.heap.usage) crosses the safe threshold of 50 percent usage quickly. On the other hand, the [average memory usage](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.jvm.heap.usage) across all executors is still less than 4 percent. This clearly shows abnormality with driver execution in this Spark job. 

![\[The memory usage in percentage for the driver and executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-memoryprofile.png)


The job run soon fails, and the following error appears in the **History** tab on the AWS Glue console: `Command Failed with Exit Code 1`. This error string means that the job failed due to a systemic error, which in this case is the driver running out of memory.

![\[The error message shown on the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-errorstring.png)


On the console, choose the **Error logs** link on the **History** tab to confirm the finding about driver OOM from the CloudWatch Logs. Search for "**Error**" in the job's error logs to confirm that it was indeed an OOM exception that failed the job:

```
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 12039"...
```

On the **History** tab for the job, choose **Logs**. You can find the following trace of driver execution in the CloudWatch Logs at the beginning of the job. The Spark driver tries to list all the files in all the directories, constructs an `InMemoryFileIndex`, and launches one task per file. This in turn results in the Spark driver having to maintain a large amount of state in memory to track all the tasks. It caches the complete list of a large number of files for the in-memory index, resulting in a driver OOM.
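A rough back-of-the-envelope estimate shows why this per-file state becomes a problem. The ~1 KB of bookkeeping per file below is an illustrative assumption, not a measured figure:

```python
def driver_state_estimate_gb(num_files, bytes_per_file_entry=1024):
    """Rough driver heap needed to track one task per input file,
    assuming an illustrative ~1 KB of index/task state per file."""
    return num_files * bytes_per_file_entry / 1024**3

# One million small files at ~1 KB of state each consumes on the order
# of a gigabyte of driver heap before any data is processed.
print(round(driver_state_estimate_gb(1_000_000), 2))  # → 0.95
```

Because this state grows linearly with the file count rather than the data volume, even a modest dataset split across millions of objects can exhaust the driver.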

### Fix the processing of multiple files using grouping
<a name="monitor-debug-oom-fix"></a>

You can fix the processing of the multiple files by using the *grouping* feature in AWS Glue. Grouping is automatically enabled when you use dynamic frames and when the input dataset has a large number of files (more than 50,000). Grouping allows you to coalesce multiple files into a group, and it allows a task to process the entire group instead of a single file. As a result, the Spark driver stores significantly less state in memory to track fewer tasks. For more information about manually enabling grouping for your dataset, see [Reading input files in larger groups](grouping-input-files.md).
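Grouping can also be tuned with an explicit target group size. The following is a hedged sketch of building the connection options; the 1 MB `groupSize` is an illustrative value, and option values are passed as strings:

```python
# Hypothetical helper that builds S3 connection options with grouping
# enabled. "groupSize" is a byte count expressed as a string; 1 MB here
# is illustrative, not a recommendation.
def grouped_s3_options(paths, group_size_bytes=1048576):
    return {
        "paths": paths,
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": str(group_size_bytes),
    }

options = grouped_s3_options(["s3://input_path"])
# glueContext.create_dynamic_frame_from_options("s3", options, format="json")
```

Larger group sizes mean fewer tasks and less driver state, at the cost of coarser parallelism.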

To check the memory profile of the AWS Glue job, profile the following code with grouping enabled:

```
df = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["s3://input_path"], "recurse":True, 'groupFiles': 'inPartition'}, format="json")
datasink = glueContext.write_dynamic_frame.from_options(frame = df, connection_type = "s3", connection_options = {"path": output_path}, format = "parquet", transformation_ctx = "datasink")
```

You can monitor the memory profile and the ETL data movement in the AWS Glue job profile.

The driver runs below the threshold of 50 percent memory usage over the entire duration of the AWS Glue job. The executors stream the data from Amazon S3, process it, and write it out to Amazon S3. As a result, they consume less than 5 percent memory at any point in time.

![\[The memory profile showing the issue is fixed.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-memoryprofile-fixed.png)


The data movement profile below shows the total number of Amazon S3 bytes that are [read](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.read_bytes) and [written](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.write_bytes) in the last minute by all executors as the job progresses. Both follow a similar pattern as the data is streamed across all the executors. The job finishes processing all one million files in less than three hours.

![\[The data movement profile showing the issue is fixed.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-etlmovement.png)


## Debugging an executor OOM exception
<a name="monitor-profile-debug-oom-executor"></a>

In this scenario, you can learn how to debug OOM exceptions that could occur in Apache Spark executors. The following code uses the Spark MySQL reader to read a large table of about 34 million rows into a Spark dataframe. It then writes it out to Amazon S3 in Parquet format. You can provide the connection properties and use the default Spark configurations to read the table.

```
val connectionProperties = new Properties()
connectionProperties.put("user", user)
connectionProperties.put("password", password)
connectionProperties.put("Driver", "com.mysql.jdbc.Driver")
val sparkSession = glueContext.sparkSession
val dfSpark = sparkSession.read.jdbc(url, tableName, connectionProperties)
dfSpark.write.format("parquet").save(output_path)
```

### Visualize the profiled metrics on the AWS Glue console
<a name="monitor-debug-oom-visualize-2"></a>

If the slope of the memory usage graph is positive and crosses 50 percent, and the job fails before the next metric is emitted, then memory exhaustion is a good candidate for the cause. The following graph shows that within a minute of execution, the [average memory usage](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.jvm.heap.usage) across all executors spikes up quickly above 50 percent. The usage reaches up to 92 percent, and the container running the executor is stopped by Apache Hadoop YARN. 

![\[The average memory usage across all executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-2-memoryprofile.png)


As the following graph shows, there is always a [single executor](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.ExecutorAllocationManager.executors.numberAllExecutors) running until the job fails. This is because a new executor is launched to replace the stopped executor. The JDBC data source reads are not parallelized by default because it would require partitioning the table on a column and opening multiple connections. As a result, only one executor reads in the complete table sequentially.

![\[The job execution shows a single executor running until the job fails.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-2-execution.png)
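Parallelizing the read requires the standard Spark JDBC partitioning options, so that each of several executors reads a slice of the table. The following is a sketch of the reader configuration; the `id` column name and the bounds are placeholders, assuming the table has a numeric column suitable for partitioning:

```python
# Standard Spark JDBC partitioning options: Spark opens numPartitions
# connections, each reading a slice of [lowerBound, upperBound) on the
# partition column. The column name and bounds below are placeholders.
def partitioned_jdbc_options(url, table, column, lower, upper, num_partitions):
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": column,
        "lowerBound": str(lower),
        "upperBound": str(upper),
        "numPartitions": str(num_partitions),
    }

opts = partitioned_jdbc_options(
    "jdbc:mysql://host/db", "big_table", "id", 0, 34_000_000, 8)
# In a Spark session this would be used as:
#   spark.read.format("jdbc").options(**opts).load()
```

Without these options, Spark falls back to the single-connection, single-executor read shown in the graph above.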


As the following graph shows, Spark tries to launch a new task four times before failing the job. You can see the [memory profile](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.jvm.heap.used) of three executors. Each executor quickly uses up all of its memory. The fourth executor also runs out of memory and the job fails, so its metric is not reported immediately.

![\[The memory profiles of the executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-2-exec-memprofile.png)


You can confirm from the error string on the AWS Glue console that the job failed due to OOM exceptions, as shown in the following image.

![\[The error message shown on the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-2-errorstring.png)


**Job output logs:** To further confirm your finding of an executor OOM exception, look at the CloudWatch Logs. When you search for **Error**, you find the four executors being stopped in roughly the same time windows as shown on the metrics dashboard. All are terminated by YARN as they exceed their memory limits.

Executor 1

```
18/06/13 16:54:29 WARN YarnAllocator: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:54:29 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:54:29 ERROR YarnClusterScheduler: Lost executor 1 on ip-10-1-2-175.ec2.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:54:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-10-1-2-175.ec2.internal, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
```

Executor 2

```
18/06/13 16:55:35 WARN YarnAllocator: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:55:35 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:55:35 ERROR YarnClusterScheduler: Lost executor 2 on ip-10-1-2-16.ec2.internal: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:55:35 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1, ip-10-1-2-16.ec2.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
```

Executor 3

```
18/06/13 16:56:37 WARN YarnAllocator: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:56:37 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:56:37 ERROR YarnClusterScheduler: Lost executor 3 on ip-10-1-2-189.ec2.internal: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:56:37 WARN TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2, ip-10-1-2-189.ec2.internal, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
```

Executor 4

```
18/06/13 16:57:18 WARN YarnAllocator: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:57:18 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:57:18 ERROR YarnClusterScheduler: Lost executor 4 on ip-10-1-2-96.ec2.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:57:18 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3, ip-10-1-2-96.ec2.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
```

### Fix the fetch size setting using AWS Glue dynamic frames
<a name="monitor-debug-oom-fix-2"></a>

The executor ran out of memory while reading the JDBC table because the default configuration for the Spark JDBC fetch size is zero. This means that the JDBC driver on the Spark executor tries to fetch all 34 million rows from the database at once and cache them, even though Spark streams through the rows one at a time. With Spark, you can avoid this scenario by setting the fetch size parameter to a non-zero value.
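In plain Spark (without dynamic frames), the equivalent fix is to set a non-zero `fetchsize` on the JDBC reader. The following is a hedged sketch; the value of 1,000 mirrors the dynamic-frame default, and note that some drivers, such as MySQL Connector/J, interpret the fetch size hint differently, so check your driver's documentation:

```python
def jdbc_read_options(url, table, fetch_size=1000):
    """JDBC reader options with a bounded fetch size, so the driver
    fetches rows in batches instead of caching the entire result set."""
    return {"url": url, "dbtable": table, "fetchsize": str(fetch_size)}

opts = jdbc_read_options("jdbc:mysql://host/db", "big_table")
# spark.read.format("jdbc").options(**opts).load()
```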

You can also fix this issue by using AWS Glue dynamic frames instead. By default, dynamic frames use a fetch size of 1,000 rows, which is typically a sufficient value. As a result, the executor does not use more than 7 percent of its total memory. The AWS Glue job finishes in less than two minutes with only a single executor. While using AWS Glue dynamic frames is the recommended approach, it is also possible to set the fetch size using the Apache Spark `fetchsize` property. See the [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#jdbc-to-other-databases).

```
val (url, database, tableName) = {
  ("jdbc_url", "db_name", "table_name")
}
val source = glueContext.getSource(format, sourceJson)
val df = source.getDynamicFrame
// Write the dynamic frame to Amazon S3 in Parquet format using the Scala API
glueContext.getSinkWithFormat(connectionType = "s3",
  options = JsonOptions(Map("path" -> output_path)),
  format = "parquet").writeDynamicFrame(df)
```

**Normal profiled metrics:** The [executor memory](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.jvm.heap.usage) with AWS Glue dynamic frames never exceeds the safe threshold, as shown in the following image. It streams in the rows from the database and caches only 1,000 rows in the JDBC driver at any point in time. An out of memory exception does not occur.

![\[AWS Glue console showing executor memory below the safe threshold.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-2-memoryprofile-fixed.png)


# Debugging demanding stages and straggler tasks
<a name="monitor-profile-debug-straggler"></a>

You can use AWS Glue job profiling to identify demanding stages and straggler tasks in your extract, transform, and load (ETL) jobs. A straggler task takes much longer than the rest of the tasks in a stage of an AWS Glue job. As a result, the stage takes longer to complete, which also delays the total execution time of the job.

## Coalescing small input files into larger output files
<a name="monitor-profile-debug-straggler-scenario-1"></a>

A straggler task can occur when there is a non-uniform distribution of work across the different tasks, or when data skew results in one task processing more data than the others.

You can profile the following code, a common pattern in Apache Spark, to coalesce a large number of small files into larger output files. For this example, the input dataset is 32 GB of gzip-compressed JSON files. The output dataset has roughly 190 GB of uncompressed JSON files. 

The profiled code is as follows:

```
datasource0 = spark.read.format("json").load("s3://input_path")
df = datasource0.coalesce(1)
df.write.format("json").save(output_path)
```

### Visualize the profiled metrics on the AWS Glue console
<a name="monitor-debug-straggler-visualize"></a>

You can profile your job to examine four different sets of metrics:
+ ETL data movement
+ Data shuffle across executors
+ Job execution
+ Memory profile

**ETL data movement**: In the **ETL Data Movement** profile, the bytes are [read](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.read_bytes) fairly quickly by all the executors in the first stage that completes within the first six minutes. However, the total job execution time is around one hour, mostly consisting of the data [writes](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.write_bytes).

![\[Graph showing the ETL Data Movement profile.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-1.png)


**Data shuffle across executors:** The number of bytes [read](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.shuffleLocalBytesRead) and [written](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.shuffleBytesWritten) during shuffling also shows a spike before Stage 2 ends, as indicated by the **Job Execution** and **Data Shuffle** metrics. After the data shuffles from all executors, the reads and writes proceed from executor number 3 only.

![\[The metrics for data shuffle across the executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-2.png)


**Job execution:** As shown in the graph below, all other executors are idle and are eventually relinquished by 10:09. At that point, the total number of executors decreases to only one. This clearly shows that executor number 3 is running the straggler task, which takes the longest to execute and contributes most of the job's execution time.

![\[The execution metrics for the active executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-3.png)


**Memory profile:** After the first two stages, only [executor number 3](monitoring-awsglue-with-cloudwatch-metrics.md#glue.executorId.jvm.heap.used) is actively consuming memory to process the data. The remaining executors are simply idle or have been relinquished shortly after the completion of the first two stages. 

![\[The metrics for the memory profile after the first two stages.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-4.png)


### Fix straggling executors using grouping
<a name="monitor-debug-straggler-fix"></a>

You can avoid straggling executors by using the *grouping* feature in AWS Glue. Use grouping to distribute the data uniformly across all the executors and coalesce files into larger files using all the available executors on the cluster. For more information, see [Reading input files in larger groups](grouping-input-files.md).

To check the ETL data movements in the AWS Glue job, profile the following code with grouping enabled:

```
df = glueContext.create_dynamic_frame_from_options("s3", {'paths': ["s3://input_path"], "recurse":True, 'groupFiles': 'inPartition'}, format="json")
datasink = glueContext.write_dynamic_frame.from_options(frame = df, connection_type = "s3", connection_options = {"path": output_path}, format = "json", transformation_ctx = "datasink4")
```

**ETL data movement:** The data writes are now streamed in parallel with the data reads throughout the job execution time. As a result, the job finishes within eight minutes—much faster than previously.

![\[The ETL data movements showing the issue is fixed.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-5.png)


**Data shuffle across executors:** As the input files are coalesced during the reads using the grouping feature, there is no costly data shuffle after the data reads.

![\[The data shuffle metrics showing the issue is fixed.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-6.png)


**Job execution:** The job execution metrics show that the total number of active executors running and processing data remains fairly constant. There is no single straggler in the job. All executors are active and are not relinquished until the completion of the job. Because there is no intermediate shuffle of data across the executors, there is only a single stage in the job.

![\[The metrics for the Job Execution widget showing that there are no stragglers in the job.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-7.png)


**Memory profile:** The metrics show the [active memory consumption](monitoring-awsglue-with-cloudwatch-metrics.md#glue.executorId.jvm.heap.used) across all executors—reconfirming that there is activity across all executors. As data is streamed in and written out in parallel, the total memory footprint of all executors is roughly uniform and well below the safe threshold for all executors.

![\[The memory profile metrics showing active memory consumption across all executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-8.png)


# Monitoring the progress of multiple jobs
<a name="monitor-debug-multiple"></a>

You can profile multiple AWS Glue jobs together and monitor the flow of data between them. This is a common workflow pattern, and requires monitoring for individual job progress, data processing backlog, data reprocessing, and job bookmarks.

**Topics**
+ [Profiled code](#monitor-debug-multiple-profile)
+ [Visualize the profiled metrics on the AWS Glue console](#monitor-debug-multiple-visualize)
+ [Fix the processing of files](#monitor-debug-multiple-fix)

## Profiled code
<a name="monitor-debug-multiple-profile"></a>

In this workflow, you have two jobs: an Input job and an Output job. The Input job is scheduled to run every 30 minutes using a periodic trigger. The Output job is scheduled to run after each successful run of the Input job. These scheduled jobs are controlled using job triggers.
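This trigger wiring can also be set up programmatically. The following is a sketch of the request body for a conditional trigger that starts the Output job after each successful run of the Input job; the trigger and job names are placeholders, and the field names follow the AWS Glue `CreateTrigger` API:

```python
# Build a CreateTrigger request body: a conditional trigger that fires
# the downstream job when the upstream job run reaches SUCCEEDED.
# The trigger and job names below are placeholders.
def conditional_trigger(name, upstream_job, downstream_job):
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            "Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": upstream_job,
                "State": "SUCCEEDED",
            }],
        },
        "Actions": [{"JobName": downstream_job}],
    }

# With AWS credentials configured:
#   import boto3
#   boto3.client("glue").create_trigger(
#       **conditional_trigger("run-output-after-input",
#                             "Input job", "Output job"))
```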

![\[Console screenshot showing the job triggers controlling the scheduling of the Input and Output jobs.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-1.png)


**Input job**: This job reads in data from an Amazon Simple Storage Service (Amazon S3) location, transforms it using `ApplyMapping`, and writes it to a staging Amazon S3 location. The following code is profiled code for the Input job:

```
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options = {"paths": ["s3://input_path"], "useS3ListImplementation":True,"recurse":True}, format="json")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [map_spec])
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": staging_path, "compression": "gzip"}, format = "json")
```

**Output job**: This job reads the output of the Input job from the staging location in Amazon S3, transforms it again, and writes it to a destination:

```
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options = {"paths": [staging_path], "useS3ListImplementation":True,"recurse":True}, format="json")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [map_spec])
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": output_path}, format = "json")
```

## Visualize the profiled metrics on the AWS Glue console
<a name="monitor-debug-multiple-visualize"></a>

The following dashboard superimposes the Amazon S3 bytes written metric from the Input job onto the Amazon S3 bytes read metric for the Output job on the same timeline. The timeline shows different job runs of the Input and Output jobs. The Input job (shown in red) starts every 30 minutes. The Output job (shown in brown) starts at the completion of the Input job, with a maximum concurrency of 1. 

![\[Graph showing the data read and written.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-4.png)


In this example, [job bookmarks](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html) are not enabled. No transformation contexts are used to enable job bookmarks in the script code. 

**Job History**: The Input and Output jobs have multiple runs, as shown on the **History** tab, starting from 12:00 PM.

The Input job on the AWS Glue console looks like this:

![\[Console screenshot showing the History tab of the Input job.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-2.png)


The following image shows the Output job:

![\[Console screenshot showing the History tab of the Output job.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-3.png)


**First job runs**: As shown in the preceding Data Bytes Read and Written graph, the first job runs of the Input and Output jobs between 12:00 and 12:30 show roughly the same area under the curves. Those areas represent the Amazon S3 bytes written by the Input job and the Amazon S3 bytes read by the Output job. This is also confirmed by the ratio of bytes written to bytes read, each summed over 30 minutes (the job trigger frequency for the Input job). The data point for the ratio for the Input job run that started at 12:00 PM is also 1.
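This byte-flow ratio can be computed directly from the two metrics over matching 30-minute windows; a minimal sketch:

```python
def flow_ratio(bytes_read_by_output, bytes_written_by_input):
    """Ratio of the Output job's S3 bytes read to the Input job's S3
    bytes written over the same window. A value near 1 means the Output
    job processed exactly one Input run's worth of data; a value above 1
    means it also worked through a backlog from earlier runs."""
    return bytes_read_by_output / bytes_written_by_input

print(flow_ratio(250, 100))  # → 2.5, e.g. an Output run reprocessing backlog
print(flow_ratio(100, 100))  # → 1.0, a healthy steady state
```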

The following graph shows the data flow ratio across all the job runs:

![\[Graph showing the data flow ratio: bytes written and bytes read.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-5.png)


**Second job runs**: In the second job run, there is a clear difference in the number of bytes read by the Output job compared to the number of bytes written by the Input job. (Compare the area under the curve across the two job runs for the Output job, or compare the areas in the second run of the Input and Output jobs.) The ratio of bytes read to bytes written shows that the Output job read about 2.5 times the data written by the Input job in the second span of 30 minutes, from 12:30 to 13:00. Because job bookmarks were not enabled, the Output job also reprocessed the output of the first job run of the Input job. A ratio above 1 shows that there is an additional backlog of data that was processed by the Output job.

**Third job runs**: The Input job is fairly consistent in terms of the number of bytes written (see the area under the red curves). However, the third job run of the Input job ran longer than expected (see the long tail of the red curve). As a result, the third job run of the Output job started late. The third job run processed only a fraction of the data accumulated in the staging location in the remaining 30 minutes between 13:00 and 13:30. The byte-flow ratio shows that it processed only 0.83 times the data written by the third job run of the Input job (see the ratio at 13:00).

**Overlap of Input and Output jobs**: The fourth job run of the Input job started at 13:30 as scheduled, before the third job run of the Output job finished. There is a partial overlap between these two job runs. However, the third job run of the Output job captures only the files that it listed in the staging location of Amazon S3 when it began around 13:17. This consists of all data output from the prior job runs of the Input job. The actual ratio at 13:30 is around 2.75: the third job run of the Output job processed about 2.75 times the data written by the fourth job run of the Input job from 13:30 to 14:00.

As these images show, the Output job is reprocessing data from the staging location from all prior job runs of the Input job. As a result, the fourth job run for the Output job is the longest and overlaps with the entire fifth job run of the Input job.

## Fix the processing of files
<a name="monitor-debug-multiple-fix"></a>

You should ensure that the Output job processes only the files that haven't been processed by previous job runs of the Output job. To do this, enable job bookmarks and set the transformation context in the Output job, as follows:

```
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options = {"paths": [staging_path], "useS3ListImplementation":True,"recurse":True}, format="json", transformation_ctx = "bookmark_ctx")
```

With job bookmarks enabled, the Output job doesn't reprocess the data in the staging location from all the previous job runs of the Input job. In the following image showing the data read and written, the area under the brown curve is fairly consistent and similar with the red curves. 

![\[Graph showing the data read and written as red and brown lines.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-6.png)


The ratios of byte flow also remain roughly close to 1 because there is no additional data processed.

![\[Graph showing the data flow ratio: bytes written and bytes read\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-7.png)


A job run for the Output job starts and captures the files in the staging location before the next Input job run starts putting more data into the staging location. As long as it continues to do this, it processes only the files captured from the previous Input job run, and the ratio stays close to 1.


Suppose that the Input job takes longer than expected, and as a result, the Output job captures files in the staging location from two Input job runs. The ratio is then higher than 1 for that Output job run. However, the following job runs of the Output job don't process any files that are already processed by the previous job runs of the Output job.

# Monitoring for DPU capacity planning
<a name="monitor-debug-capacity"></a>

You can use job metrics in AWS Glue to estimate the number of data processing units (DPUs) that can be used to scale out an AWS Glue job.

**Note**  
This page applies only to AWS Glue versions 0.9 and 1.0. Later versions of AWS Glue contain cost-saving features that introduce additional considerations during capacity planning. 

**Topics**
+ [Profiled code](#monitor-debug-capacity-profile)
+ [Visualize the profiled metrics on the AWS Glue console](#monitor-debug-capacity-visualize)
+ [Determine the optimal DPU capacity](#monitor-debug-capacity-fix)

## Profiled code
<a name="monitor-debug-capacity-profile"></a>

The following script reads an Amazon Simple Storage Service (Amazon S3) partition containing 428 gzipped JSON files. The script applies a mapping to change the field names, converts the data to Apache Parquet format, and writes it to Amazon S3. You provision the default of 10 DPUs and run this job. 

```
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options = {"paths": [input_path], "useS3ListImplementation":True,"recurse":True}, format="json")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [map_spec])
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": output_path}, format = "parquet")
```

## Visualize the profiled metrics on the AWS Glue console
<a name="monitor-debug-capacity-visualize"></a>

**Job run 1:** This job run shows how to determine whether the cluster has an under-provisioned number of DPUs. The job execution functionality in AWS Glue shows the total [number of actively running executors](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.ExecutorAllocationManager.executors.numberAllExecutors), the [number of completed stages](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.numCompletedStages), and the [number of maximum needed executors](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors).

The number of maximum needed executors is computed by adding the total number of running tasks and pending tasks, and dividing by the tasks per executor. This result is a measure of the total number of executors required to satisfy the current load. 

In contrast, the number of actively running executors measures how many executors are running active Apache Spark tasks. As the job progresses, the maximum needed executors can change and typically goes down towards the end of the job as the pending task queue diminishes.
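As a rough sketch, the maximum-needed-executors computation described above can be expressed in Python (the 4-tasks-per-executor value and the function name are illustrative, not an AWS Glue API):

```python
import math

def max_needed_executors(running_tasks: int, pending_tasks: int,
                         tasks_per_executor: int = 4) -> int:
    # Total outstanding tasks divided by per-executor task slots,
    # rounded up to whole executors.
    return math.ceil((running_tasks + pending_tasks) / tasks_per_executor)

# 428 input files, one task each, none running yet:
print(max_needed_executors(0, 428))  # 107
```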

The horizontal red line in the following graph shows the number of maximum allocated executors, which depends on the number of DPUs that you allocate for the job. In this case, you allocate 10 DPUs for the job run. One DPU is reserved for management. Nine DPUs run two executors each, and one executor is reserved for the Spark driver. The Spark driver runs inside the primary application. So, the number of maximum allocated executors is 2 × 9 - 1 = 17 executors.
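The executor-allocation arithmetic can be sketched as a small helper, assuming the two-executors-per-DPU configuration used in this example (the function itself is illustrative, not part of AWS Glue):

```python
def max_allocated_executors(dpus: int, executors_per_dpu: int = 2) -> int:
    # One DPU is reserved for management, and one executor slot
    # on the remaining DPUs is taken by the Spark driver.
    return (dpus - 1) * executors_per_dpu - 1

print(max_allocated_executors(10))  # 17
print(max_allocated_executors(55))  # 107
```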

![\[The job metrics showing active executors and maximum needed executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-capacity-1.png)


As the graph shows, the number of maximum needed executors starts at 107 at the beginning of the job, whereas the number of active executors remains 17. This is the same as the number of maximum allocated executors with 10 DPUs. The ratio between the maximum needed executors and maximum allocated executors (adding 1 to both for the Spark driver) gives you the under-provisioning factor: 108/18 = 6x. You can provision 6 (under-provisioning ratio) × 9 (current DPU capacity - 1) + 1 = 55 DPUs to scale out the job to run it with maximum parallelism and finish faster. 
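A minimal sketch of that scaling formula, using the values from this job run (names are illustrative):

```python
import math

def recommended_dpus(max_needed: int, max_allocated: int, current_dpus: int) -> int:
    # Under-provisioning factor: add 1 to both executor counts for the
    # Spark driver, then round the ratio up.
    factor = math.ceil((max_needed + 1) / (max_allocated + 1))
    # Scale the worker DPUs (current capacity minus the management DPU)
    # by that factor, then add the management DPU back.
    return factor * (current_dpus - 1) + 1

print(recommended_dpus(107, 17, 10))  # 55
```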

The AWS Glue console displays the detailed job metrics as a static line representing the original number of maximum allocated executors. The console computes the maximum allocated executors from the job definition for the metrics. By contrast, for detailed job run metrics, the console computes the maximum allocated executors from the job run configuration, specifically the DPUs allocated for the job run. To view metrics for an individual job run, select the job run and choose **View run metrics**.

![\[The job metrics showing ETL data movement.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-capacity-2.png)


Looking at the Amazon S3 bytes [read](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.read_bytes) and [written](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.write_bytes), notice that the job spends all six minutes streaming in data from Amazon S3 and writing it out in parallel. All the cores on the allocated DPUs are reading and writing to Amazon S3. The maximum number of needed executors (107) also corresponds to the number of files in the input Amazon S3 path (428): each executor can launch four Spark tasks to process four input files (gzipped JSON), and 428 / 4 = 107.

## Determine the optimal DPU capacity
<a name="monitor-debug-capacity-fix"></a>

Based on the results of the previous job run, you can increase the total number of allocated DPUs to 55, and see how the job performs. The job finishes in less than three minutes—half the time it required previously. The job scale-out is not linear in this case because it is a short running job. Jobs with long-lived tasks or a large number of tasks (a large number of max needed executors) benefit from a close-to-linear DPU scale-out performance speedup.

![\[Graph showing increasing the total number of allocated DPUs\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-capacity-3.png)


As the above image shows, the total number of active executors reaches the maximum allocated—107 executors. Similarly, the maximum needed executors is never above the maximum allocated executors. The maximum needed executors number is computed from the actively running and pending task counts, so it might be smaller than the number of active executors. This is because there can be executors that are partially or completely idle for a short period of time and are not yet decommissioned.

![\[Graph showing the total number of active executors reaching the maximum allocated.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-capacity-4.png)


This job run uses 6x more executors to read and write from Amazon S3 in parallel. As a result, this job run uses more Amazon S3 bandwidth for both reads and writes, and finishes faster. 

### Identify overprovisioned DPUs
<a name="monitor-debug-capacity-over"></a>

Next, you can determine whether scaling out the job to 100 DPUs (99 × 2 = 198 executors) helps it scale any further. As the following graph shows, the job still takes three minutes to finish. Similarly, the job does not scale out beyond 107 executors (the 55-DPU configuration), and the remaining 91 executors are overprovisioned and not used at all. This shows that increasing the number of DPUs might not always improve performance, as is evident from the maximum needed executors.
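Under the same assumptions (two executors per worker DPU), the count of unused executors in the 100-DPU run can be computed as follows (an illustrative helper, not an AWS API):

```python
def overprovisioned_executors(dpus: int, max_needed: int,
                              executors_per_dpu: int = 2) -> int:
    # Executors on worker DPUs (ignoring the management DPU) that
    # exceed what the workload can actually use.
    allocated = (dpus - 1) * executors_per_dpu
    return max(0, allocated - max_needed)

print(overprovisioned_executors(100, 107))  # 91
print(overprovisioned_executors(10, 107))   # 0 (under-provisioned instead)
```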

![\[Graph showing that job performance does not always increase by increasing the number of DPUs.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-capacity-5.png)


### Compare time differences
<a name="monitor-debug-capacity-time"></a>

The three job runs shown in the following table summarize the job execution times for 10 DPUs, 55 DPUs, and 100 DPUs. Using the estimates you established by monitoring the first job run, you can find a DPU capacity that improves the job execution time.


| Job ID | Number of DPUs | Execution time | 
| --- | --- | --- | 
| jr\_c894524c8ef5048a4d9... | 10 | 6 min. | 
| jr\_1a466cf2575e7ffe6856... | 55 | 3 min. | 
| jr\_34fa1ed4c6aa9ff0a814... | 100 | 3 min. | 

# Generative AI troubleshooting for Apache Spark in AWS Glue
<a name="troubleshoot-spark"></a>

 Generative AI Troubleshooting for Apache Spark jobs in AWS Glue is a capability that helps data engineers and data scientists diagnose and fix issues in their Spark applications. Using machine learning and generative AI technologies, this feature analyzes failed Spark jobs and provides a detailed root cause analysis along with actionable recommendations to resolve the issues. Generative AI troubleshooting for Apache Spark is available for jobs running on AWS Glue version 4.0 and later. 


|  | 
| --- |
|  Transform your Apache Spark troubleshooting with our AI-powered Troubleshooting Agent, now supporting all major deployment modes including AWS Glue, Amazon EMR on EC2, Amazon EMR Serverless, and Amazon SageMaker AI notebooks. This agent eliminates complex debugging processes by combining natural language interactions, real-time workload analysis, and smart code recommendations into a seamless experience. For implementation details, refer to [What is Apache Spark Troubleshooting Agent for Amazon EMR](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/spark-troubleshoot.html). View the second demonstration in [Using the Troubleshooting Agent](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/spark-troubleshooting-using-troubleshooting-agent.html) for AWS Glue troubleshooting examples.  | 

## How does Generative AI Troubleshooting for Apache Spark work?
<a name="troubleshoot-spark-how-it-works"></a>

 For your failed Spark jobs, Generative AI Troubleshooting analyzes the job metadata and the precise metrics and logs associated with the error signature of your job to generate a root cause analysis, and recommends specific solutions and best practices to help address job failures. 

## Setting up Generative AI Troubleshooting for Apache Spark for your jobs
<a name="w2aac37c11c12c33c13"></a>

### Configuring IAM permissions
<a name="troubleshoot-spark-iam-permissions"></a>

 Granting permissions to the APIs used by Spark Troubleshooting for your jobs in AWS Glue requires appropriate IAM permissions. You can obtain permissions by attaching the following custom AWS policy to your IAM identity (such as a user, role, or group). 

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartCompletion",
        "glue:GetCompletion"
      ],
      "Resource": [
        "arn:aws:glue:*:*:completion/*",
        "arn:aws:glue:*:*:job/*"
      ]
    }
  ]
}
```

------

**Note**  
 Two APIs are used in the IAM policy to enable this experience through the AWS Glue Studio console: `StartCompletion` and `GetCompletion`. 

### Assigning permissions
<a name="troubleshoot-spark-assigning-permissions"></a>

 To provide access, add permissions to your users, groups, or roles: 
+  For users and groups in IAM Identity Center: Create a permission set. Follow the instructions in [Create a permission set](https://docs.aws.amazon.com/singlesignon/latest/userguide/howtocreatepermissionset.html) in the IAM Identity Center User Guide. 
+  For users managed in IAM through an identity provider: Create a role for identity federation. Follow the instructions in [Creating a role for a third-party identity provider (federation)](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp.html) in the IAM User Guide. 
+  For IAM users: Create a role that your user can assume. Follow the instructions in [Creating a role for an IAM user](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user.html) in the IAM User Guide. 

## Running troubleshooting analysis from a failed job run
<a name="troubleshoot-spark-run-analysis"></a>

 You can access the troubleshooting feature through multiple paths in the AWS Glue console. Here's how to get started: 

### Option 1: From the Jobs List page
<a name="troubleshoot-spark-from-jobs-list"></a>

1.  Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1.  In the navigation pane, choose **ETL Jobs**. 

1.  Locate your failed job in the jobs list. 

1.  Select the **Runs** tab in the job details section. 

1.  Choose the failed job run you want to analyze. 

1.  Choose **Troubleshoot with AI** to start the analysis. 

1.  When the troubleshooting analysis is complete, you can view the root-cause analysis and recommendations in the **Troubleshooting analysis** tab at the bottom of the screen. 

![\[The GIF shows an end to end implementation of a failed run and the troubleshoot with AI feature running.\]](http://docs.aws.amazon.com/glue/latest/dg/images/troubleshoot_spark_option_1_jobs_list.gif)


### Option 2: Using the Job Run Monitoring page
<a name="troubleshoot-spark-job-run-monitoring-page"></a>

1.  Navigate to the **Job run monitoring** page. 

1.  Locate your failed job run. 

1.  Choose the **Actions** drop-down menu. 

1.  Choose **Troubleshoot with AI**. 

![\[The GIF shows an end to end implementation of a failed run and the troubleshoot with AI feature running.\]](http://docs.aws.amazon.com/glue/latest/dg/images/troubleshoot_spark_option_2_job_monitoring.gif)


### Option 3: From the Job Run Details page
<a name="troubleshoot-spark-job-run-details-page"></a>

1.  Navigate to your failed job run's details page by either clicking **View details** on a failed run from the **Runs** tab or selecting the job run from the **Job run monitoring** page. 

1.  In the job run details page, find the **Troubleshooting analysis** tab. 

## Supported troubleshooting categories
<a name="troubleshoot-spark-supported-troubleshooting-categories"></a>

 This service focuses on four primary categories of issues that data engineers and developers frequently encounter in their Spark applications: 
+  **Resource setup and access errors:** When running Spark applications in AWS Glue, resource setup and access errors are among the most common yet challenging issues to diagnose. These errors often occur when your Spark application attempts to interact with AWS resources but encounters permission issues, missing resources, or configuration problems. 
+  **Spark driver and executor memory issues:** Memory-related errors in Apache Spark jobs can be complex to diagnose and resolve. These errors often manifest when your data processing requirements exceed the available memory resources, either on the driver node or executor nodes. 
+  **Spark disk capacity issues:** Storage-related errors in AWS Glue Spark jobs often emerge during shuffle operations, data spilling, or when dealing with large-scale data transformations. These errors can be particularly tricky because they might not manifest until your job has been running for a while, potentially wasting valuable compute time and resources. 
+  **Query execution errors:** Query failures in Spark SQL and DataFrame operations can be difficult to troubleshoot because error messages may not clearly point to the root cause, and queries that work fine with small datasets can suddenly fail at scale. These errors become even more challenging when they occur deep within complex transformation pipelines, where the actual issue may stem from data quality problems in earlier stages rather than the query logic itself. 

**Note**  
 Before implementing any suggested changes in your production environment, review the suggested changes thoroughly. The service provides recommendations based on patterns and best practices, but your specific use case might require additional considerations. 

## Supported regions
<a name="troubleshoot-spark-supported-regions"></a>

Generative AI troubleshooting for Apache Spark is available in the following regions:
+ **Africa**: Cape Town (af-south-1)
+ **Asia Pacific**: Hong Kong (ap-east-1), Tokyo (ap-northeast-1), Seoul (ap-northeast-2), Osaka (ap-northeast-3), Mumbai (ap-south-1), Singapore (ap-southeast-1), Sydney (ap-southeast-2), and Jakarta (ap-southeast-3)
+ **Europe**: Frankfurt (eu-central-1), Stockholm (eu-north-1), Milan (eu-south-1), Ireland (eu-west-1), London (eu-west-2), and Paris (eu-west-3)
+ **Middle East**: Bahrain (me-south-1) and UAE (me-central-1)
+ **North America**: Canada (ca-central-1)
+ **South America**: São Paulo (sa-east-1)
+ **United States**: North Virginia (us-east-1), Ohio (us-east-2), North California (us-west-1), and Oregon (us-west-2)

# Using materialized views with AWS Glue
<a name="materialized-views"></a>

AWS Glue version 5.1 and later supports creating and managing Apache Iceberg materialized views in the AWS Glue Data Catalog. A materialized view is a managed table that stores the precomputed result of a SQL query in Apache Iceberg format and incrementally updates as the underlying source tables change. You can use materialized views to simplify data transformation pipelines and accelerate query performance for complex analytical workloads.

When you create a materialized view using Spark in AWS Glue, the view definition and metadata are stored in the AWS Glue Data Catalog. The precomputed results are stored as Apache Iceberg tables in Amazon S3 Tables buckets or Amazon S3 general purpose buckets within your account. The AWS Glue Data Catalog automatically monitors source tables and refreshes materialized views using managed compute infrastructure.

**Topics**
+ [How materialized views work with AWS Glue](#materialized-views-how-they-work)
+ [Prerequisites](#materialized-views-prerequisites)
+ [Configuring Spark to use materialized views](#materialized-views-configuring-spark)
+ [Creating materialized views](#materialized-views-creating)
+ [Querying materialized views](#materialized-views-querying)
+ [Refreshing materialized views](#materialized-views-refreshing)
+ [Managing materialized views](#materialized-views-managing)
+ [Permissions for materialized views](#materialized-views-permissions)
+ [Monitoring materialized view operations](#materialized-views-monitoring)
+ [Example: Complete workflow](#materialized-views-complete-workflow)
+ [Considerations and limitations](#materialized-views-considerations-limitations)

## How materialized views work with AWS Glue
<a name="materialized-views-how-they-work"></a>

Materialized views integrate with AWS Glue through Apache Spark's Iceberg support in AWS Glue jobs and AWS Glue Studio notebooks. When you configure your Spark session to use the AWS Glue Data Catalog, you can create materialized views using standard SQL syntax. The Spark optimizer can automatically rewrite queries to use materialized views when they provide better performance, eliminating the need to manually modify application code.

The AWS Glue Data Catalog handles all operational aspects of materialized view maintenance, including:
+ Detecting changes in source tables using Apache Iceberg's metadata layer
+ Scheduling and executing refresh operations using managed Spark compute
+ Determining whether to perform full or incremental refresh based on the data changes
+ Storing precomputed results in Apache Iceberg format for multi-engine access

You can query materialized views from AWS Glue using the same Spark SQL interfaces you use for regular tables. The precomputed data is also accessible from other services including Amazon Athena and Amazon Redshift.

## Prerequisites
<a name="materialized-views-prerequisites"></a>

To use materialized views with AWS Glue, you need:
+ An AWS account
+ AWS Glue version 5.1 or later
+ Source tables in Apache Iceberg format registered in the AWS Glue Data Catalog
+ AWS Lake Formation permissions configured for source tables and target databases
+ An S3 Tables bucket or S3 general purpose bucket registered with AWS Lake Formation for storing materialized view data
+ An IAM role with permissions to access AWS Glue Data Catalog and Amazon S3

## Configuring Spark to use materialized views
<a name="materialized-views-configuring-spark"></a>

To create and manage materialized views in AWS Glue, configure your Spark session with the required Iceberg extensions and catalog settings. The configuration method varies depending on whether you're using AWS Glue jobs or AWS Glue Studio notebooks.

### Configuring AWS Glue jobs
<a name="materialized-views-configuring-glue-jobs"></a>

When creating or updating an AWS Glue job, add the following configuration parameters as job parameters:

#### For S3 Tables buckets
<a name="materialized-views-s3-tables-buckets"></a>

```
import boto3

# Assumes AWS credentials and region are configured for boto3.
glue = boto3.client('glue')

job = glue.create_job(
    Name='materialized-view-job',
    Role='arn:aws:iam::111122223333:role/GlueServiceRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://amzn-s3-demo-bucket/scripts/mv-script.py',
        'PythonVersion': '3'
    },
    DefaultArguments={
        '--enable-glue-datacatalog': 'true',
        '--conf': 'spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions '
                  '--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog '
                  '--conf spark.sql.catalog.glue_catalog.type=glue '
                  '--conf spark.sql.catalog.glue_catalog.warehouse=s3://amzn-s3-demo-bucket/warehouse '
                  '--conf spark.sql.catalog.glue_catalog.glue.region=us-east-1 '
                  '--conf spark.sql.catalog.glue_catalog.glue.id=111122223333 '
                  '--conf spark.sql.catalog.glue_catalog.glue.account-id=111122223333 '
                  '--conf spark.sql.catalog.glue_catalog.glue.lakeformation-enabled=true '
                  '--conf spark.sql.catalog.s3t_catalog=org.apache.iceberg.spark.SparkCatalog '
                  '--conf spark.sql.catalog.s3t_catalog.type=glue '
                  '--conf spark.sql.catalog.s3t_catalog.glue.id=111122223333:s3tablescatalog/my-table-bucket '
                  '--conf spark.sql.catalog.s3t_catalog.glue.account-id=111122223333 '
                  '--conf spark.sql.catalog.s3t_catalog.glue.lakeformation-enabled=true '
                  '--conf spark.sql.catalog.s3t_catalog.warehouse=s3://amzn-s3-demo-bucket/mv-warehouse '
                  '--conf spark.sql.catalog.s3t_catalog.glue.region=us-east-1 '
                  '--conf spark.sql.defaultCatalog=s3t_catalog '
                  '--conf spark.sql.optimizer.answerQueriesWithMVs.enabled=true '
                  '--conf spark.sql.materializedViews.metadataCache.enabled=true'
    },
    GlueVersion='5.1'
)
```

#### For S3 general purpose buckets
<a name="materialized-views-s3-general-purpose-buckets"></a>

```
import boto3

# Assumes AWS credentials and region are configured for boto3.
glue = boto3.client('glue')

job = glue.create_job(
    Name='materialized-view-job',
    Role='arn:aws:iam::111122223333:role/GlueServiceRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://amzn-s3-demo-bucket/scripts/mv-script.py',
        'PythonVersion': '3'
    },
    DefaultArguments={
        '--enable-glue-datacatalog': 'true',
        '--conf': 'spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions '
                  '--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog '
                  '--conf spark.sql.catalog.glue_catalog.type=glue '
                  '--conf spark.sql.catalog.glue_catalog.warehouse=s3://amzn-s3-demo-bucket/warehouse '
                  '--conf spark.sql.catalog.glue_catalog.glue.region=us-east-1 '
                  '--conf spark.sql.catalog.glue_catalog.glue.id=111122223333 '
                  '--conf spark.sql.catalog.glue_catalog.glue.account-id=111122223333 '
                  '--conf spark.sql.catalog.glue_catalog.glue.lakeformation-enabled=true '
                  '--conf spark.sql.defaultCatalog=glue_catalog '
                  '--conf spark.sql.optimizer.answerQueriesWithMVs.enabled=true '
                  '--conf spark.sql.materializedViews.metadataCache.enabled=true'
    },
    GlueVersion='5.1'
)
```

### Configuring AWS Glue Studio notebooks
<a name="materialized-views-configuring-glue-studio-notebooks"></a>

In AWS Glue Studio notebooks, configure your Spark session using the %%configure magic command at the beginning of your notebook:

```
%%configure
{
    "conf": {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.glue_catalog.type": "glue",
        "spark.sql.catalog.glue_catalog.warehouse": "s3://amzn-s3-demo-bucket/warehouse",
        "spark.sql.catalog.glue_catalog.glue.region": "us-east-1",
        "spark.sql.catalog.glue_catalog.glue.id": "111122223333",
        "spark.sql.catalog.glue_catalog.glue.account-id": "111122223333",
        "spark.sql.catalog.glue_catalog.glue.lakeformation-enabled": "true",
        "spark.sql.defaultCatalog": "glue_catalog",
        "spark.sql.optimizer.answerQueriesWithMVs.enabled": "true",
        "spark.sql.materializedViews.metadataCache.enabled": "true"
    }
}
```

### Enabling incremental refresh
<a name="materialized-views-enabling-incremental-refresh"></a>

To enable incremental refresh optimization, add the following configuration properties to your job parameters or notebook configuration:

```
--conf spark.sql.optimizer.incrementalMVRefresh.enabled=true
--conf spark.sql.optimizer.incrementalMVRefresh.deltaThresholdCheckEnabled=false
```

### Configuration parameters
<a name="materialized-views-configuration-parameters"></a>

The following configuration parameters control materialized view behavior:
+ `spark.sql.extensions` – Enables Iceberg Spark session extensions required for materialized view support.
+ `spark.sql.optimizer.answerQueriesWithMVs.enabled` – Enables automatic query rewrite to use materialized views. Set to true to activate this optimization.
+ `spark.sql.materializedViews.metadataCache.enabled` – Enables caching of materialized view metadata for query optimization. Set to true to improve query rewrite performance.
+ `spark.sql.optimizer.incrementalMVRefresh.enabled` – Enables incremental refresh optimization. Set to true to process only changed data during refresh operations.
+ `spark.sql.optimizer.answerQueriesWithMVs.decimalAggregateCheckEnabled` – Controls validation of decimal aggregate operations in query rewrite. Set to false to disable certain decimal overflow checks.
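
Because the `--conf` job parameter is a single string that chains additional pairs with embedded `--conf` prefixes, assembling it by hand is error-prone. A hypothetical helper (not part of any AWS library) can build the string from a dict:

```python
def build_conf_argument(confs: dict) -> str:
    # AWS Glue expects one '--conf' job parameter whose value chains
    # further pairs: "k1=v1 --conf k2=v2 --conf k3=v3 ...".
    return " --conf ".join(f"{k}={v}" for k, v in confs.items())

arg = build_conf_argument({
    "spark.sql.optimizer.incrementalMVRefresh.enabled": "true",
    "spark.sql.materializedViews.metadataCache.enabled": "true",
})
print(arg)
```

The result can then be passed as the `'--conf'` value in `DefaultArguments`.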

## Creating materialized views
<a name="materialized-views-creating"></a>

You create materialized views using the CREATE MATERIALIZED VIEW SQL statement in AWS Glue jobs or notebooks. The view definition specifies the transformation logic as a SQL query that references one or more source tables.

### Creating a basic materialized view in AWS Glue jobs
<a name="materialized-views-creating-basic-glue-jobs"></a>

The following example demonstrates creating a materialized view in an AWS Glue job script. Use fully qualified table names (three-part naming convention) in the view definition:

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Create materialized view
spark.sql("""
    CREATE MATERIALIZED VIEW customer_orders
    AS 
    SELECT 
        customer_name, 
        COUNT(*) as order_count, 
        SUM(amount) as total_amount 
    FROM glue_catalog.sales.orders
    GROUP BY customer_name
""")
```

### Creating a materialized view with automatic refresh
<a name="materialized-views-creating-automatic-refresh"></a>

To configure automatic refresh, specify a refresh schedule when creating the view. Use fully qualified table names (three-part naming convention) in the view definition:

```
spark.sql("""
    CREATE MATERIALIZED VIEW customer_orders
    SCHEDULE REFRESH EVERY 1 HOUR
    AS 
    SELECT 
        customer_name, 
        COUNT(*) as order_count, 
        SUM(amount) as total_amount 
    FROM glue_catalog.sales.orders
    GROUP BY customer_name
""")
```

### Creating a materialized view with cross-catalog references
<a name="materialized-views-creating-cross-catalog"></a>

When your source tables are in a different catalog than your materialized view, use fully qualified table names with three-part naming convention in both view name and view definition:

```
spark.sql("""
    CREATE MATERIALIZED VIEW s3t_catalog.analytics.customer_summary
    AS 
    SELECT 
        customer_name, 
        COUNT(*) as order_count, 
        SUM(amount) as total_amount 
    FROM glue_catalog.sales.orders
    GROUP BY customer_name
""")
```

### Creating materialized views in AWS Glue Studio notebooks
<a name="materialized-views-creating-glue-studio-notebooks"></a>

In AWS Glue Studio notebooks, you can use the %%sql magic command to create materialized views. Use fully qualified table names (three-part naming convention) in the view definition:

```
%%sql
CREATE MATERIALIZED VIEW customer_orders
AS 
SELECT 
    customer_name, 
    COUNT(*) as order_count, 
    SUM(amount) as total_amount 
FROM glue_catalog.sales.orders
GROUP BY customer_name
```

## Querying materialized views
<a name="materialized-views-querying"></a>

After creating a materialized view, you can query it like any other table using standard SQL SELECT statements in your AWS Glue jobs or notebooks.

### Querying in AWS Glue jobs
<a name="materialized-views-querying-glue-jobs"></a>

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Query materialized view
result = spark.sql("SELECT * FROM customer_orders")
result.show()
```

### Querying in AWS Glue Studio notebooks
<a name="materialized-views-querying-glue-studio-notebooks"></a>

```
%%sql
SELECT * FROM customer_orders
```

### Automatic query rewrite
<a name="materialized-views-automatic-query-rewrite"></a>

When automatic query rewrite is enabled, the Spark optimizer analyzes your queries and automatically uses materialized views when they can improve performance. For example, if you execute the following query:

```
result = spark.sql("""
    SELECT 
        customer_name, 
        COUNT(*) as order_count, 
        SUM(amount) as total_amount 
    FROM orders
    GROUP BY customer_name
""")
```

The Spark optimizer automatically rewrites this query to use the `customer_orders` materialized view instead of processing the base `orders` table, provided the materialized view is current.

### Verifying automatic query rewrite
<a name="materialized-views-verifying-automatic-query-rewrite"></a>

To verify whether a query uses automatic query rewrite, use the EXPLAIN EXTENDED command:

```
spark.sql("""
    EXPLAIN EXTENDED
    SELECT customer_name, COUNT(*) as order_count, SUM(amount) as total_amount 
    FROM orders
    GROUP BY customer_name
""").show(truncate=False)
```

In the execution plan, look for the materialized view name in the BatchScan operation. If the plan shows `BatchScan glue_catalog.analytics.customer_orders` instead of `BatchScan glue_catalog.sales.orders`, the query has been automatically rewritten to use the materialized view.

Note that automatic query rewrite requires time for the Spark metadata cache to populate after creating a materialized view. This process typically completes within 30 seconds.
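
Because of that delay, a verification script can poll rather than check once. The following is a minimal, generic polling helper (illustrative; the commented usage assumes your own EXPLAIN check against a Spark session):

```python
import time

def wait_for(predicate, timeout_s: float = 60.0, interval_s: float = 5.0):
    """Poll predicate() until it returns a truthy value or the timeout
    elapses; return the last result either way."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = predicate()
        if result or time.monotonic() >= deadline:
            return result
        time.sleep(interval_s)

# Hypothetical usage: poll until the EXPLAIN plan references the view.
# rewritten = wait_for(lambda: "customer_orders" in
#     spark.sql("EXPLAIN EXTENDED SELECT ...").first()[0])
```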

## Refreshing materialized views
<a name="materialized-views-refreshing"></a>

You can refresh materialized views using two methods: full refresh or incremental refresh. Full refresh recomputes the entire materialized view from all base table data, while incremental refresh processes only the data that has changed since the last refresh.

### Manual full refresh in AWS Glue jobs
<a name="materialized-views-manual-full-refresh-glue-jobs"></a>

To perform a full refresh of a materialized view:

```
spark.sql("REFRESH MATERIALIZED VIEW customer_orders FULL")

# Verify updated results
result = spark.sql("SELECT * FROM customer_orders")
result.show()
```

### Manual incremental refresh in AWS Glue jobs
<a name="materialized-views-manual-incremental-refresh-glue-jobs"></a>

To perform an incremental refresh, ensure incremental refresh is enabled in your Spark session configuration, then execute:

```
spark.sql("REFRESH MATERIALIZED VIEW customer_orders")

# Verify updated results
result = spark.sql("SELECT * FROM customer_orders")
result.show()
```

The AWS Glue Data Catalog automatically determines whether incremental refresh is applicable based on the view definition and the amount of changed data. If incremental refresh is not possible, the operation falls back to full refresh.

### Refreshing in AWS Glue Studio notebooks
<a name="materialized-views-refreshing-glue-studio-notebooks"></a>

In notebooks, use the %%sql magic command:

```
%%sql
REFRESH MATERIALIZED VIEW customer_orders FULL
```

### Verifying incremental refresh execution
<a name="materialized-views-verifying-incremental-refresh"></a>

To confirm that incremental refresh executed successfully, enable debug logging in your AWS Glue job:

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext
import logging

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Enable debug logging
logger = logging.getLogger('org.apache.spark.sql')
logger.setLevel(logging.DEBUG)

# Execute refresh
spark.sql("REFRESH MATERIALIZED VIEW customer_orders")
```

Search for the following message in the AWS Glue job logs:

```
DEBUG RefreshMaterializedViewExec: Executed Incremental Refresh
```

## Managing materialized views
<a name="materialized-views-managing"></a>

AWS Glue provides SQL commands for managing the lifecycle of materialized views in your jobs and notebooks.

### Describing a materialized view
<a name="materialized-views-describing"></a>

To view metadata about a materialized view, including its definition, refresh status, and last refresh timestamp:

```
spark.sql("DESCRIBE EXTENDED customer_orders").show(truncate=False)
```

### Altering a materialized view
<a name="materialized-views-altering"></a>

To modify the refresh schedule of an existing materialized view:

```
spark.sql("""
    ALTER MATERIALIZED VIEW customer_orders 
    ADD SCHEDULE REFRESH EVERY 2 HOURS
""")
```

To remove automatic refresh:

```
spark.sql("""
    ALTER MATERIALIZED VIEW customer_orders 
    DROP SCHEDULE
""")
```

### Dropping a materialized view
<a name="materialized-views-dropping"></a>

To delete a materialized view:

```
spark.sql("DROP MATERIALIZED VIEW customer_orders")
```

This command removes the materialized view definition from the AWS Glue Data Catalog and deletes the underlying Iceberg table data from your S3 bucket.

### Listing materialized views
<a name="materialized-views-listing"></a>

To list all materialized views in a database:

```
spark.sql("SHOW VIEWS FROM analytics").show()
```

## Permissions for materialized views
<a name="materialized-views-permissions"></a>

To create and manage materialized views, you must configure AWS Lake Formation permissions. The IAM role creating the materialized view (the definer role) requires specific permissions on source tables and target databases.

### Required permissions for the definer role
<a name="materialized-views-required-permissions-definer-role"></a>

The definer role must have the following Lake Formation permissions:
+ On source tables – SELECT or ALL permissions without row, column, or cell filters
+ On the target database – CREATE_TABLE permission
+ On the AWS Glue Data Catalog – GetTable and CreateTable API permissions

When you create a materialized view, the definer role's ARN is stored in the view definition. The AWS Glue Data Catalog assumes this role when executing automatic refresh operations. If the definer role loses access to source tables, refresh operations will fail until permissions are restored.

### IAM permissions for AWS Glue jobs
<a name="materialized-views-iam-permissions-glue-jobs"></a>

Your AWS Glue job's IAM role requires the following permissions:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetCatalog",
                "glue:GetCatalogs",
                "glue:GetTable",
                "glue:GetTables",
                "glue:CreateTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "cloudwatch:PutMetricData"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:*:*:*:/aws-glue/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess"
            ],
            "Resource": "*"
        }
    ]
}
```

The role that creates or updates the materialized view must have the `iam:PassRole` permission on the role used for materialized view auto-refresh.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iam:PassRole"
      ],
      "Resource": [
        "arn:aws:iam::111122223333:role/materialized-view-role-name"
      ]
    }
  ]
}
```

To let AWS Glue automatically refresh the materialized view for you, the role must also have the following trust policy, which enables the service to assume the role.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "glue.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

If the materialized view is stored in an Amazon S3 Tables bucket, you also need to add the following permission to the role.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3tables:PutTableMaintenanceConfiguration"
      ],
      "Resource": "arn:aws:s3tables:*:111122223333:*"
    }
  ]
}
```

### Granting access to materialized views
<a name="materialized-views-granting-access"></a>

To grant other users access to query a materialized view, use AWS Lake Formation to grant SELECT permission on the materialized view table. Users can query the materialized view without requiring direct access to the underlying source tables.

For detailed information about configuring Lake Formation permissions, see Granting and revoking permissions on Data Catalog resources in the AWS Lake Formation Developer Guide.
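The grant can also be scripted. The following Python sketch (illustrative only, with placeholder names) assembles the request parameters for the Lake Formation `GrantPermissions` API call that grants `SELECT` on a materialized view table:

```python
# Hypothetical helper: builds parameters for the Lake Formation
# GrantPermissions API (boto3: lakeformation.grant_permissions).
# All identifiers below are placeholders, not values from this guide.
def build_grant_select(catalog_id: str, database: str, table: str,
                       principal_arn: str) -> dict:
    return {
        "CatalogId": catalog_id,
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "Table": {
                "CatalogId": catalog_id,
                "DatabaseName": database,
                "Name": table,
            }
        },
        "Permissions": ["SELECT"],
    }

params = build_grant_select(
    "111122223333", "analytics", "customer_orders",
    "arn:aws:iam::111122223333:role/analyst-role",
)
# With boto3, this would be passed as:
#   boto3.client("lakeformation").grant_permissions(**params)
```

Because the grant targets the materialized view table itself, the grantee never needs permissions on the underlying source tables.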

## Monitoring materialized view operations
<a name="materialized-views-monitoring"></a>

The AWS Glue Data Catalog publishes metrics and logs for materialized view refresh operations to Amazon CloudWatch. You can monitor refresh status, duration, and data volume processed through CloudWatch metrics.

### Viewing job logs
<a name="materialized-views-viewing-job-logs"></a>

To view logs for AWS Glue jobs that create or refresh materialized views:

1. Open the AWS Glue console.

1. Choose Jobs from the navigation pane.

1. Select your job and choose Runs.

1. Select a specific run and choose Logs to view CloudWatch logs.

### Setting up alarms
<a name="materialized-views-setting-up-alarms"></a>

To receive notifications when refresh operations fail or exceed expected duration, create CloudWatch alarms on materialized view metrics. You can also configure Amazon EventBridge rules to trigger automated responses to refresh events.
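As a sketch of that alarm setup, the following builds parameters for the CloudWatch `PutMetricAlarm` API. The namespace and metric name are assumptions, not documented values; verify them against the metrics your refresh operations actually publish before creating the alarm.

```python
# Sketch of a CloudWatch alarm definition for failed refresh operations.
def build_refresh_failure_alarm(view_name: str, sns_topic_arn: str) -> dict:
    return {
        "AlarmName": f"mv-refresh-failures-{view_name}",
        "Namespace": "AWS/Glue",                          # assumed namespace
        "MetricName": "MaterializedViewRefreshFailures",  # hypothetical metric name
        "Dimensions": [{"Name": "ViewName", "Value": view_name}],
        "Statistic": "Sum",
        "Period": 3600,  # one-hour periods match the minimum automatic refresh interval
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }

alarm = build_refresh_failure_alarm(
    "customer_orders", "arn:aws:sns:us-east-1:111122223333:mv-alerts")
# With boto3:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```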

## Example: Complete workflow
<a name="materialized-views-complete-workflow"></a>

The following example demonstrates a complete workflow for creating and using a materialized view in AWS Glue.

### Example AWS Glue job script
<a name="materialized-views-example-glue-job-script"></a>

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Create database and base table
spark.sql("CREATE DATABASE IF NOT EXISTS sales")
spark.sql("USE sales")

spark.sql("""
    CREATE TABLE IF NOT EXISTS orders (
        id INT,
        customer_name STRING,
        amount DECIMAL(10,2),
        order_date DATE
    )
""")

# Insert sample data
spark.sql("""
    INSERT INTO orders VALUES 
        (1, 'John Doe', 150.00, DATE('2024-01-15')),
        (2, 'Jane Smith', 200.50, DATE('2024-01-16')),
        (3, 'Bob Johnson', 75.25, DATE('2024-01-17'))
""")

# Create materialized view
spark.sql("""
    CREATE MATERIALIZED VIEW customer_summary
    AS 
    SELECT 
        customer_name, 
        COUNT(*) as order_count, 
        SUM(amount) as total_amount 
    FROM glue_catalog.sales.orders
    GROUP BY customer_name
""")

# Query the materialized view
print("Initial materialized view data:")
spark.sql("SELECT * FROM customer_summary").show()

# Insert additional data
spark.sql("""
    INSERT INTO orders VALUES 
        (4, 'Jane Smith', 350.00, DATE('2024-01-18')),
        (5, 'Bob Johnson', 100.25, DATE('2024-01-19'))
""")

# Refresh the materialized view
spark.sql("REFRESH MATERIALIZED VIEW customer_summary FULL")

# Query updated results
print("Updated materialized view data:")
spark.sql("SELECT * FROM customer_summary").show()

job.commit()
```

### Example AWS Glue Studio notebook
<a name="materialized-views-example-glue-studio-notebook"></a>

```
%%configure
{
    "conf": {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.glue_catalog.type": "glue",
        "spark.sql.catalog.glue_catalog.warehouse": "s3://amzn-s3-demo-bucket/warehouse",
        "spark.sql.catalog.glue_catalog.glue.region": "us-east-1",
        "spark.sql.catalog.glue_catalog.glue.id": "111122223333",
        "spark.sql.catalog.glue_catalog.glue.account-id": "111122223333",
        "spark.sql.catalog.glue_catalog.glue.lakeformation-enabled": "true",
        "spark.sql.defaultCatalog": "glue_catalog",
        "spark.sql.optimizer.answerQueriesWithMVs.enabled": "true",
        "spark.sql.materializedViews.metadataCache.enabled": "true"
    }
}
```

```
%%sql
CREATE DATABASE IF NOT EXISTS sales
```

```
%%sql
USE sales
```

```
%%sql
CREATE TABLE IF NOT EXISTS orders (
    id INT,
    customer_name STRING,
    amount DECIMAL(10,2),
    order_date DATE
)
```

```
%%sql
INSERT INTO orders VALUES 
    (1, 'John Doe', 150.00, DATE('2024-01-15')),
    (2, 'Jane Smith', 200.50, DATE('2024-01-16')),
    (3, 'Bob Johnson', 75.25, DATE('2024-01-17'))
```

```
%%sql
CREATE MATERIALIZED VIEW customer_summary
AS 
SELECT 
    customer_name, 
    COUNT(*) as order_count, 
    SUM(amount) as total_amount 
FROM glue_catalog.sales.orders
GROUP BY customer_name
```

```
%%sql
SELECT * FROM customer_summary
```

```
%%sql
INSERT INTO orders VALUES 
    (4, 'Jane Smith', 350.00, DATE('2024-01-18')),
    (5, 'Bob Johnson', 100.25, DATE('2024-01-19'))
```

```
%%sql
REFRESH MATERIALIZED VIEW customer_summary FULL
```

```
%%sql
SELECT * FROM customer_summary
```

## Considerations and limitations
<a name="materialized-views-considerations-limitations"></a>

Consider the following when using materialized views with AWS Glue:
+ Materialized views require AWS Glue version 5.1 or later.
+ Source tables must be Apache Iceberg tables registered in the AWS Glue Data Catalog. Apache Hive, Apache Hudi, and Linux Foundation Delta Lake tables are not supported at launch.
+ Source tables must reside in the same Region and account as the materialized view.
+ All source tables must be governed by AWS Lake Formation. IAM-only permissions and hybrid access are not supported.
+ Materialized views cannot reference AWS Glue Data Catalog views, multi-dialect views, or other materialized views as source tables.
+ The view definer role must have full read access (SELECT or ALL permission) on all source tables without row, column, or cell filters applied.
+ Materialized views are eventually consistent with source tables. During the refresh window, queries may return stale data. Execute manual refresh for immediate consistency.
+ The minimum automatic refresh interval is one hour.
+ Incremental refresh supports a restricted subset of SQL operations. The view definition must be a single SELECT-FROM-WHERE-GROUP BY-HAVING block and cannot contain set operations, subqueries, the DISTINCT keyword in SELECT or aggregate functions, window functions, or joins other than INNER JOIN.
+ Incremental refresh does not support user-defined functions or certain built-in functions. Only a subset of Spark SQL built-in functions are supported.
+ Query automatic rewrite only considers materialized views whose definitions belong to a restricted SQL subset similar to incremental refresh restrictions.
+ Identifiers containing special characters other than alphanumeric characters and underscores are not supported in CREATE MATERIALIZED VIEW queries. This applies to all identifier types including catalog/namespace/table names, column and struct field names, CTEs, and aliases.
+ Materialized view columns starting with the `__ivm` prefix are reserved for system use. Amazon reserves the right to modify or remove these columns in future releases.
+ The SORT BY, LIMIT, OFFSET, CLUSTER BY, and ORDER BY clauses are not supported in materialized view definitions.
+ Cross-Region and cross-account source tables are not supported.
+ Tables referenced in the view query must use the three-part naming convention (for example, `glue_catalog.my_db.my_table`) because automatic refresh does not use default catalog and database settings.
+ Full refresh operations override the entire table and make previous snapshots unavailable.
+ Non-deterministic functions such as `rand()` or `current_timestamp()` are not supported in materialized view definitions.
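When drafting view definitions, a quick local pre-check against the restrictions above can save a failed incremental refresh. The following sketch is a rough heuristic for your own SQL text, not the service's actual validation logic:

```python
import re

# Patterns for constructs the restrictions above rule out for incremental
# refresh. This is an illustrative pre-check only; the service performs its
# own, authoritative validation.
_DISALLOWED = [
    r"\bUNION\b", r"\bINTERSECT\b", r"\bEXCEPT\b",        # set operations
    r"\bDISTINCT\b",                                      # DISTINCT keyword
    r"\bOVER\s*\(",                                       # window functions
    r"\b(LEFT|RIGHT|FULL|CROSS)\s+(OUTER\s+)?JOIN\b",     # only INNER JOIN allowed
    r"\b(ORDER|SORT|CLUSTER)\s+BY\b", r"\bLIMIT\b", r"\bOFFSET\b",
    r"\brand\s*\(", r"\bcurrent_timestamp\b",             # non-deterministic
]

def incremental_refresh_blockers(view_sql: str) -> list:
    """Return the disallowed patterns found in a view definition string."""
    return [p for p in _DISALLOWED if re.search(p, view_sql, re.IGNORECASE)]

print(incremental_refresh_blockers(
    "SELECT customer_name, SUM(amount) FROM orders GROUP BY customer_name"))
# An empty list means no obvious blockers were found.
```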

# AWS Glue worker types
<a name="worker-types"></a>

## Overview
<a name="worker-types-overview"></a>

AWS Glue provides multiple worker types to accommodate different workload requirements, from small streaming jobs to large-scale, memory-intensive data processing tasks. This section provides comprehensive information about all available worker types, their specifications, and usage recommendations.

### Worker type categories
<a name="worker-type-categories"></a>

AWS Glue offers two main categories of worker types:
+ **G Worker Types**: General-purpose compute workers optimized for standard ETL workloads
+ **R Worker Types**: Memory-optimized workers designed for memory-intensive Spark applications

### Data Processing Units (DPUs)
<a name="data-processing-units"></a>

The resources available on AWS Glue workers are measured in DPUs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory.

**Memory-Optimized DPUs (M-DPUs)**: R type workers use M-DPUs, which provide double the memory allocation of a standard DPU for a given size. While a standard DPU provides 16 GB of memory, an M-DPU provides 32 GB of memory for memory-intensive Spark applications.

## Available worker types
<a name="available-worker-types"></a>

### G.1X
<a name="g1x-standard-worker"></a>
+ **DPU**: 1 DPU (4 vCPUs, 16 GB memory)
+ **Storage**: 94GB disk (approximately 44GB free)
+ **Use Case**: Data transforms, joins, and queries - scalable and cost-effective for most jobs

### G.2X
<a name="g2x-standard-worker"></a>
+ **DPU**: 2 DPU (8 vCPUs, 32 GB memory)
+ **Storage**: 138GB disk (approximately 78GB free)
+ **Use Case**: Data transforms, joins, and queries - scalable and cost-effective for most jobs

### G.4X
<a name="g4x-large-worker"></a>
+ **DPU**: 4 DPU (16 vCPUs, 64 GB memory)
+ **Storage**: 256GB disk (approximately 230GB free)
+ **Use Case**: Demanding transforms, aggregations, joins, and queries

### G.8X
<a name="g8x-extra-large-worker"></a>
+ **DPU**: 8 DPU (32 vCPUs, 128 GB memory)
+ **Storage**: 512GB disk (approximately 485GB free)
+ **Use Case**: Demanding transforms, aggregations, joins, and queries

### G.12X
<a name="g12x-very-large-worker"></a>
+ **DPU**: 12 DPU (48 vCPUs, 192 GB memory)
+ **Storage**: 768GB disk (approximately 741GB free)
+ **Use Case**: Very large and resource-intensive workloads requiring significant compute capacity

### G.16X
<a name="g16x-maximum-worker"></a>
+ **DPU**: 16 DPU (64 vCPUs, 256 GB memory)
+ **Storage**: 1024GB disk (approximately 996GB free)
+ **Use Case**: Largest and most resource-intensive workloads requiring maximum compute capacity

### R.1X - Memory-Optimized\*
<a name="r1x-memory-optimized-small"></a>
+ **DPU**: 1 M-DPU (4 vCPUs, 32 GB memory)
+ **Use Case**: Memory-intensive workloads with frequent out-of-memory errors or high memory-to-CPU ratio requirements

### R.2X - Memory-Optimized\*
<a name="r2x-memory-optimized-medium"></a>
+ **DPU**: 2 M-DPU (8 vCPUs, 64 GB memory)
+ **Use Case**: Memory-intensive workloads with frequent out-of-memory errors or high memory-to-CPU ratio requirements

### R.4X - Memory-Optimized\*
<a name="r4x-memory-optimized-large"></a>
+ **DPU**: 4 M-DPU (16 vCPUs, 128 GB memory)
+ **Use Case**: Large memory-intensive workloads with frequent out-of-memory errors or high memory-to-CPU ratio requirements

### R.8X - Memory-Optimized\*
<a name="r8x-memory-optimized-extra-large"></a>
+ **DPU**: 8 M-DPU (32 vCPUs, 256 GB memory)
+ **Use Case**: Very large memory-intensive workloads with frequent out-of-memory errors or high memory-to-CPU ratio requirements

**\*** You may encounter higher startup latency with these workers. To resolve the issue, try the following:
+ Wait a few minutes and then submit your job again.
+ Submit a new job with a reduced number of workers.
+ Submit a new job using a different worker type or size.

## Worker type specifications table
<a name="worker-type-specifications"></a>


**Worker Type Specifications**  

| Worker Type | DPU per Node | vCPU | Memory (GB) | Disk (GB) | Approximate Free Disk Space (GB) | Spark Executors per Node | 
| --- | --- | --- | --- | --- | --- | --- | 
| G.1X | 1 | 4 | 16 | 94 | 44 | 1 | 
| G.2X | 2 | 8 | 32 | 138 | 78 | 1 | 
| G.4X | 4 | 16 | 64 | 256 | 230 | 1 | 
| G.8X | 8 | 32 | 128 | 512 | 485 | 1 | 
| G.12X | 12 | 48 | 192 | 768 | 741 | 1 | 
| G.16X | 16 | 64 | 256 | 1024 | 996 | 1 | 
| R.1X | 1 | 4 | 32 | 94 | 44 | 1 | 
| R.2X | 2 | 8 | 64 | 138 | 78 | 1 | 
| R.4X | 4 | 16 | 128 | 256 | 230 | 1 | 
| R.8X | 8 | 32 | 256 | 512 | 485 | 1 | 

*Note*: R worker types provide double the memory of the corresponding G worker types for memory-intensive workloads.
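For capacity planning, the table can be turned into a small lookup helper. The following sketch transcribes the specifications above and totals the resources for a fleet of workers:

```python
# Specs transcribed from the table above:
# (DPU per node, vCPU, memory GB, approximate free disk GB).
WORKER_SPECS = {
    "G.1X":  (1, 4, 16, 44),
    "G.2X":  (2, 8, 32, 78),
    "G.4X":  (4, 16, 64, 230),
    "G.8X":  (8, 32, 128, 485),
    "G.12X": (12, 48, 192, 741),
    "G.16X": (16, 64, 256, 996),
    "R.1X":  (1, 4, 32, 44),
    "R.2X":  (2, 8, 64, 78),
    "R.4X":  (4, 16, 128, 230),
    "R.8X":  (8, 32, 256, 485),
}

def cluster_capacity(worker_type: str, num_workers: int) -> dict:
    """Total DPUs, vCPUs, memory, and free disk for a job's worker fleet."""
    dpu, vcpu, mem, disk = WORKER_SPECS[worker_type]
    return {
        "dpus": dpu * num_workers,
        "vcpus": vcpu * num_workers,
        "memory_gb": mem * num_workers,
        "free_disk_gb": disk * num_workers,
    }

# Ten G.2X workers: 20 DPUs, 80 vCPUs, 320 GB of memory.
print(cluster_capacity("G.2X", 10))
```

Note that DPU totals drive billing, so comparing `dpus` across candidate configurations is a quick first pass at cost.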

## Important considerations
<a name="important-considerations"></a>

### Startup latency
<a name="startup-latency"></a>

**Important**  
G.12X and G.16X worker types, as well as all R worker types (R.1X through R.8X), may encounter higher startup latency. To resolve the issue, try the following:  
+ Wait a few minutes and then submit your job again.
+ Submit a new job with a reduced number of workers.
+ Submit a new job using a different worker type or size.

## Choosing the right worker type
<a name="choosing-right-worker-type"></a>

### For standard ETL workloads
<a name="standard-etl-workloads"></a>
+ **G.1X or G.2X**: Most cost-effective for typical data transforms, joins, and queries
+ **G.4X or G.8X**: For more demanding workloads with larger datasets

### For large-scale workloads
<a name="large-scale-workloads"></a>
+ **G.12X**: Very large datasets requiring significant compute resources
+ **G.16X**: Maximum compute capacity for the most demanding workloads

### For memory-intensive workloads
<a name="memory-intensive-workloads"></a>
+ **R.1X or R.2X**: Small to medium memory-intensive jobs
+ **R.4X or R.8X**: Large memory-intensive workloads with frequent OOM errors

## Cost Optimization Considerations
<a name="cost-optimization-considerations"></a>
+ **Standard G workers**: Provide a balance of compute, memory and networking resources, and can be used for a variety of diverse workloads at lower cost
+ **R workers**: Specialized for memory-intensive tasks with fast performance for workloads that process large data sets in memory

## Best practices
<a name="best-practices"></a>

### Worker selection guidelines
<a name="worker-selection-guidelines"></a>

1. **Start with standard workers** (G.1X, G.2X) for most workloads

1. **Use R workers** when experiencing frequent out-of-memory errors or workloads with memory-intensive operations like caching, shuffling, and aggregating

1. **Consider G.12X/G.16X** for compute-intensive workloads requiring maximum resources

1. **Account for capacity constraints** when using new worker types in time-sensitive workflows

### Performance optimization
<a name="performance-optimization"></a>
+ Monitor CloudWatch metrics to understand resource utilization
+ Use appropriate worker counts based on data size and complexity
+ Consider data partitioning strategies to optimize worker efficiency

# Streaming ETL jobs in AWS Glue
<a name="add-job-streaming"></a>

You can create streaming extract, transform, and load (ETL) jobs that run continuously and consume data from streaming sources like Amazon Kinesis Data Streams, Apache Kafka, and Amazon Managed Streaming for Apache Kafka (Amazon MSK). The jobs cleanse and transform the data, and then load the results into Amazon S3 data lakes or JDBC data stores.

Additionally, you can produce data for Amazon Kinesis Data Streams streams. This feature is only available when writing AWS Glue scripts. For more information, see [Kinesis connections](aws-glue-programming-etl-connect-kinesis-home.md). 

By default, AWS Glue processes and writes out data in 100-second windows. This allows data to be processed efficiently and permits aggregations to be performed on data arriving later than expected. You can modify this window size to increase timeliness or aggregation accuracy. AWS Glue streaming jobs use checkpoints rather than job bookmarks to track the data that has been read.
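The windowing behavior can be pictured with a standalone sketch that buckets timestamps into fixed 100-second windows, the way a streaming job conceptually groups records for aggregation (this illustrates the semantics only; it is not AWS Glue code):

```python
from collections import defaultdict

def bucket_into_windows(event_times, window_seconds=100):
    """Group epoch-second timestamps into fixed-width processing windows.

    Returns {window_start: [timestamps]}. A wider window trades timeliness
    for the chance to aggregate records that arrive later than expected.
    """
    windows = defaultdict(list)
    for t in event_times:
        window_start = (t // window_seconds) * window_seconds
        windows[window_start].append(t)
    return dict(windows)

# Records at 5s, 95s, and 105s: the first two share the 0-100s window.
print(bucket_into_windows([5, 95, 105]))  # {0: [5, 95], 100: [105]}
```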

**Note**  
AWS Glue bills hourly for streaming ETL jobs while they are running.

This video discusses streaming ETL cost challenges and cost-saving features in AWS Glue.

[![AWS Videos](http://img.youtube.com/vi/6ggTFOtfUxU/0.jpg)](http://www.youtube.com/watch?v=6ggTFOtfUxU)


Creating a streaming ETL job involves the following steps:

1. For an Apache Kafka streaming source, create an AWS Glue connection to the Kafka source or the Amazon MSK cluster.

1. Manually create a Data Catalog table for the streaming source.

1. Create an ETL job for the streaming data source. Define streaming-specific job properties, and supply your own script or optionally modify the generated script.

For more information, see [Streaming ETL in AWS Glue](components-overview.md#streaming-etl-intro).

When creating a streaming ETL job for Amazon Kinesis Data Streams, you don't have to create an AWS Glue connection. However, if there is a connection attached to the AWS Glue streaming ETL job that has Kinesis Data Streams as a source, then a virtual private cloud (VPC) endpoint to Kinesis is required. For more information, see [Creating an interface endpoint](https://docs.aws.amazon.com/vpc/latest/userguide/vpce-interface.html#create-interface-endpoint) in the *Amazon VPC User Guide*. When specifying an Amazon Kinesis Data Streams stream in another account, you must set up the roles and policies to allow cross-account access. For more information, see [Example: Read From a Kinesis Stream in a Different Account](https://docs.aws.amazon.com/kinesisanalytics/latest/java/examples-cross.html).

AWS Glue streaming ETL jobs can auto-detect compressed data, transparently decompress the streaming data, perform the usual transformations on the input source, and load to the output store. 

AWS Glue supports auto-decompression for the following compression types given the input format:


| Compression type | Avro file | Avro datum | JSON | CSV | Grok | 
| --- | --- | --- | --- | --- | --- | 
| BZIP2 | Yes | Yes | Yes | Yes | Yes | 
| GZIP | No | Yes | Yes | Yes | Yes | 
| SNAPPY | Yes (raw Snappy) | Yes (framed Snappy) | Yes (framed Snappy) | Yes (framed Snappy) | Yes (framed Snappy) | 
| XZ | Yes | Yes | Yes | Yes | Yes | 
| ZSTD | Yes | No | No | No | No | 
| DEFLATE | Yes | Yes | Yes | Yes | Yes | 

**Topics**
+ [Creating an AWS Glue connection for an Apache Kafka data stream](#create-conn-streaming)
+ [Creating a Data Catalog table for a streaming source](#create-table-streaming)
+ [Notes and restrictions for Avro streaming sources](#streaming-avro-notes)
+ [Applying grok patterns to streaming sources](#create-table-streaming-grok)
+ [Defining job properties for a streaming ETL job](#create-job-streaming-properties)
+ [Streaming ETL notes and restrictions](#create-job-streaming-restrictions)

## Creating an AWS Glue connection for an Apache Kafka data stream
<a name="create-conn-streaming"></a>

To read from an Apache Kafka stream, you must create an AWS Glue connection. 

**To create an AWS Glue connection for a Kafka source (Console)**

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, under **Data catalog**, choose **Connections**.

1. Choose **Add connection**, and on the **Set up your connection’s properties** page, enter a connection name.
**Note**  
For more information about specifying connection properties, see [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-connections).

1. For **Connection type**, choose **Kafka**.

1. For **Kafka bootstrap servers URLs**, enter the host and port number for the bootstrap brokers for your Amazon MSK cluster or Apache Kafka cluster. Use only Transport Layer Security (TLS) endpoints for establishing the initial connection to the Kafka cluster. Plaintext endpoints are not supported.

   The following is an example list of hostname and port number pairs for an Amazon MSK cluster.

   ```
   myserver1.kafka.us-east-1.amazonaws.com:9094,myserver2.kafka.us-east-1.amazonaws.com:9094,
   myserver3.kafka.us-east-1.amazonaws.com:9094
   ```

   For more information about getting the bootstrap broker information, see [Getting the Bootstrap Brokers for an Amazon MSK Cluster](https://docs.aws.amazon.com/msk/latest/developerguide/msk-get-bootstrap-brokers.html) in the *Amazon Managed Streaming for Apache Kafka Developer Guide*. 

1. If you want a secure connection to the Kafka data source, select **Require SSL connection**, and for **Kafka private CA certificate location**, enter a valid Amazon S3 path to a custom SSL certificate.

   For an SSL connection to self-managed Kafka, the custom certificate is mandatory. It's optional for Amazon MSK.

   For more information about specifying a custom certificate for Kafka, see [AWS Glue SSL connection properties](connection-properties.md#connection-properties-SSL).

1. Use AWS Glue Studio or the AWS CLI to specify a Kafka client authentication method. To access AWS Glue Studio, select **AWS Glue** from the **ETL** menu in the left navigation pane.

   For more information about Kafka client authentication methods, see [AWS Glue Kafka connection properties for client authentication](#connection-properties-kafka-client-auth).

1. Optionally enter a description, and then choose **Next**.

1. For an Amazon MSK cluster, specify its virtual private cloud (VPC), subnet, and security group. The VPC information is optional for self-managed Kafka.

1. Choose **Next** to review all connection properties, and then choose **Finish**.

For more information about AWS Glue connections, see [Connecting to data](glue-connections.md).

### AWS Glue Kafka connection properties for client authentication
<a name="connection-properties-kafka-client-auth"></a>

**SASL/GSSAPI (Kerberos) authentication**  
Choosing this authentication method will allow you to specify Kerberos properties.

**Kerberos Keytab**  
Choose the location of the keytab file. A keytab stores long-term keys for one or more principals. For more information, see [MIT Kerberos Documentation: Keytab ](https://web.mit.edu/kerberos/krb5-latest/doc/basic/keytab_def.html). 

**Kerberos krb5.conf file**  
Choose the krb5.conf file. This contains the default realm (a logical network, similar to a domain, that defines a group of systems under the same KDC) and the location of the KDC server. For more information, see [MIT Kerberos Documentation: krb5.conf ](https://web.mit.edu/kerberos/krb5-1.12/doc/admin/conf_files/krb5_conf.html). 

**Kerberos principal and Kerberos service name**  
Enter the Kerberos principal and service name. For more information, see [MIT Kerberos Documentation: Kerberos principal ](https://web.mit.edu/kerberos/krb5-1.5/krb5-1.5.4/doc/krb5-user/What-is-a-Kerberos-Principal_003f.html). 

**SASL/SCRAM-SHA-512 authentication**  
 Choosing this authentication method will allow you to specify authentication credentials. 

**AWS Secrets Manager**  
Search for your token in the Search box by typing the name or ARN. 

**Provide username and password directly**  
Enter the username and password to use for authentication directly. 

**SSL client authentication**  
Choosing this authentication method allows you to select the location of the Kafka client keystore by browsing Amazon S3. Optionally, you can enter the Kafka client keystore password and Kafka client key password. 

**IAM authentication**  
This authentication method does not require any additional specifications and is only applicable when the Streaming source is MSK Kafka. 

**SASL/PLAIN authentication**  
Choosing this authentication method allows you to specify authentication credentials. 

## Creating a Data Catalog table for a streaming source
<a name="create-table-streaming"></a>

For a streaming source, you can manually create a Data Catalog table that specifies source data stream properties, including the data schema. This table is used as the data source for the streaming ETL job. 

If you don't know the schema of the data in the source data stream, you can create the table without a schema. Then when you create the streaming ETL job, you can turn on the AWS Glue schema detection function. AWS Glue determines the schema from the streaming data.

Use the [AWS Glue console](https://console.aws.amazon.com/glue/), the AWS Command Line Interface (AWS CLI), or the AWS Glue API to create the table. For information about creating a table manually with the AWS Glue console, see [Creating tables](tables-described.md).

**Note**  
You can't use the AWS Lake Formation console to create the table; you must use the AWS Glue console.

Also consider the following information for streaming sources in Avro format or for log data that you can apply Grok patterns to. 
+ [Notes and restrictions for Avro streaming sources](#streaming-avro-notes)
+ [Applying grok patterns to streaming sources](#create-table-streaming-grok)

**Topics**
+ [Kinesis data source](#kinesis-source)
+ [Kafka data source](#kafka-source)
+ [AWS Glue Schema Registry table source](#schema-registry-table)

### Kinesis data source
<a name="kinesis-source"></a>

When creating the table, set the following streaming ETL properties (console).

**Type of Source**  
**Kinesis**

**For a Kinesis source in the same account:**    
**Region**  
The AWS Region where the Amazon Kinesis Data Streams service resides. The Region and Kinesis stream name are together translated to a Stream ARN.  
Example: https://kinesis.us-east-1.amazonaws.com  
**Kinesis stream name**  
Stream name as described in [Creating a Stream](https://docs.aws.amazon.com/streams/latest/dev/kinesis-using-sdk-java-create-stream.html) in the *Amazon Kinesis Data Streams Developer Guide*.

**For a Kinesis source in another account, refer to [this example](https://docs.aws.amazon.com/kinesisanalytics/latest/java/examples-cross.html) to set up the roles and policies to allow cross-account access. Configure these settings:**    
**Stream ARN**  
The ARN of the Kinesis data stream that the consumer is registered with. For more information, see [Amazon Resource Names (ARNs) and AWS Service Namespaces](https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html) in the *AWS General Reference*.  
**Assumed Role ARN**  
The Amazon Resource Name (ARN) of the role to assume.  
**Session name (optional)**  
An identifier for the assumed role session.  
Use the role session name to uniquely identify a session when the same role is assumed by different principals or for different reasons. In cross-account scenarios, the role session name is visible to, and can be logged by the account that owns the role. The role session name is also used in the ARN of the assumed role principal. This means that subsequent cross-account API requests that use the temporary security credentials will expose the role session name to the external account in their AWS CloudTrail logs.

**To set streaming ETL properties for Amazon Kinesis Data Streams (AWS Glue API or AWS CLI)**
+ To set up streaming ETL properties for a Kinesis source in the same account, specify the `streamName` and `endpointUrl` parameters in the `StorageDescriptor` structure of the `CreateTable` API operation or the `create_table` CLI command.

  ```
  "StorageDescriptor": {
  	"Parameters": {
  		"typeOfData": "kinesis",
  		"streamName": "sample-stream",
  		"endpointUrl": "https://kinesis.us-east-1.amazonaws.com"
  	}
  	...
  }
  ```

  Or, specify the `streamARN`.  
**Example**  

  ```
  "StorageDescriptor": {
  	"Parameters": {
  		"typeOfData": "kinesis",
  		"streamARN": "arn:aws:kinesis:us-east-1:123456789:stream/sample-stream"
  	}
  	...
  }
  ```
+ To set up streaming ETL properties for a Kinesis source in another account, specify the `streamARN`, `awsSTSRoleARN`, and `awsSTSSessionName` (optional) parameters in the `StorageDescriptor` structure of the `CreateTable` API operation or the `create_table` CLI command.

  ```
  "StorageDescriptor": {
  	"Parameters": {
  		"typeOfData": "kinesis",
  		"streamARN": "arn:aws:kinesis:us-east-1:123456789:stream/sample-stream",
  		"awsSTSRoleARN": "arn:aws:iam::123456789:role/sample-assume-role-arn",
  		"awsSTSSessionName": "optional-session"
  	}
  	...
  }
  ```
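As a sketch, the same cross-account table definition can be built programmatically. The `TableInput` shape below follows the AWS Glue `CreateTable` API; the database and table names are placeholders, and the actual API call is shown commented out because it requires AWS credentials.

```python
import json

# Hypothetical table definition mirroring the cross-account example above.
# The table and database names are placeholders.
table_input = {
    "Name": "sample_kinesis_table",
    "TableType": "EXTERNAL_TABLE",
    "StorageDescriptor": {
        "Parameters": {
            "typeOfData": "kinesis",
            "streamARN": "arn:aws:kinesis:us-east-1:123456789:stream/sample-stream",
            "awsSTSRoleARN": "arn:aws:iam::123456789:role/sample-assume-role-arn",
            "awsSTSSessionName": "optional-session",
        },
    },
}

# With AWS credentials configured, the table would be created like this:
# import boto3
# glue = boto3.client("glue", region_name="us-east-1")
# glue.create_table(DatabaseName="sample_db", TableInput=table_input)

print(json.dumps(table_input["StorageDescriptor"]["Parameters"], indent=2))
```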

### Kafka data source
<a name="kafka-source"></a>

When creating the table, set the following streaming ETL properties (console).

**Type of Source**  
**Kafka**

**For a Kafka source:**    
**Topic name**  
Topic name as specified in Kafka.  
**Connection**  
An AWS Glue connection that references a Kafka source, as described in [Creating an AWS Glue connection for an Apache Kafka data stream](#create-conn-streaming).

### AWS Glue Schema Registry table source
<a name="schema-registry-table"></a>

To use AWS Glue Schema Registry for streaming jobs, follow the instructions at [Use case: AWS Glue Data Catalog](schema-registry-integrations.md#schema-registry-integrations-aws-glue-data-catalog) to create or update a Schema Registry table.

Currently, AWS Glue Streaming supports only the AWS Glue Schema Registry Avro format, with schema inference set to `false`.

## Notes and restrictions for Avro streaming sources
<a name="streaming-avro-notes"></a>

The following notes and restrictions apply for streaming sources in the Avro format:
+ When schema detection is turned on, the Avro schema must be included in the payload. When turned off, the payload should contain only data.
+ Some Avro data types are not supported in dynamic frames. You can't specify these data types when defining the schema with the **Define a schema** page in the create table wizard in the AWS Glue console. During schema detection, unsupported types in the Avro schema are converted to supported types as follows:
  + `EnumType => StringType`
  + `FixedType => BinaryType`
  + `UnionType => StructType`
+ If you define the table schema using the **Define a schema** page in the console, the implied root element type for the schema is `record`. If you want a root element type other than `record`, for example `array` or `map`, you can't specify the schema using the **Define a schema** page. Instead you must skip that page and specify the schema either as a table property or within the ETL script.
  + To specify the schema in the table properties, complete the create table wizard, edit the table details, and add a new key-value pair under **Table properties**. Use the key `avroSchema`, and enter a schema JSON object for the value, as shown in the following screenshot.  
![\[Under the Table properties heading, there are two columns of text fields. The left-hand column heading is Key, and the right-hand column heading is Value. The key/value pair in the first row is classification/avro. The key/value pair in the second row is avroSchema/{"type":"array","items":"string"}.\]](http://docs.aws.amazon.com/glue/latest/dg/images/table_properties_avro.png)
  + To specify the schema in the ETL script, modify the `datasource0` assignment statement and add the `avroSchema` key to the `additional_options` argument, as shown in the following Python and Scala examples.

------
#### [ Python ]

    ```
    SCHEMA_STRING = '{"type":"array","items":"string"}'
    datasource0 = glueContext.create_data_frame.from_catalog(database = "database", table_name = "table_name", transformation_ctx = "datasource0", additional_options = {"startingPosition": "TRIM_HORIZON", "inferSchema": "false", "avroSchema": SCHEMA_STRING})
    ```

------
#### [ Scala ]

    ```
    val SCHEMA_STRING = """{"type":"array","items":"string"}"""
    val datasource0 = glueContext.getCatalogSource(database = "database", tableName = "table_name", redshiftTmpDir = "", transformationContext = "datasource0", additionalOptions = JsonOptions(s"""{"startingPosition": "TRIM_HORIZON", "inferSchema": "false", "avroSchema":"$SCHEMA_STRING"}""")).getDataFrame()
    ```

------
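Because `avroSchema` is passed as a JSON string, a quick standalone check that it parses and has a non-`record` root can catch typos before the job runs. This snippet only reuses the schema string and options from the examples above; it does not touch the AWS Glue libraries.

```python
import json

# Schema string from the examples above; the root type is "array",
# which is why it can't be entered on the Define a schema page.
SCHEMA_STRING = '{"type":"array","items":"string"}'

additional_options = {
    "startingPosition": "TRIM_HORIZON",
    "inferSchema": "false",
    "avroSchema": SCHEMA_STRING,
}

schema = json.loads(additional_options["avroSchema"])
print(schema["type"])  # → array
```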

## Applying grok patterns to streaming sources
<a name="create-table-streaming-grok"></a>

You can create a streaming ETL job for a log data source and use Grok patterns to convert the logs to structured data. The ETL job then processes the data as a structured data source. You specify the Grok patterns to apply when you create the Data Catalog table for the streaming source.

For information about Grok patterns and custom pattern string values, see [Writing grok custom classifiers](custom-classifier.md#custom-classifier-grok).

**To add grok patterns to the Data Catalog table (console)**
+ Use the create table wizard, and create the table with the parameters specified in [Creating a Data Catalog table for a streaming source](#create-table-streaming). Specify the data format as Grok, fill in the **Grok pattern** field, and optionally add custom patterns under **Custom patterns (optional)**.  
![\[*\]](http://docs.aws.amazon.com/glue/latest/dg/images/grok-data-format-create-table.png)

  Press **Enter** after each custom pattern.

**To add grok patterns to the Data Catalog table (AWS Glue API or AWS CLI)**
+ Add the `grokPattern` parameter and optionally the `grokCustomPatterns` parameter to the `CreateTable` API operation or the `create_table` CLI command.

  ```
   "Parameters": {
  ...
      "grokPattern": "string",
      "grokCustomPatterns": "string",
  ...
  },
  ```

  Express `grokCustomPatterns` as a string and use "\n" as the separator between patterns.

  The following is an example of specifying these parameters.  
**Example**  

  ```
  "parameters": {
  ...
      "grokPattern": "%{USERNAME:username} %{DIGIT:digit:int}",
      "grokCustomPatterns": "digit \d",
  ...
  }
  ```
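To see how a custom pattern plugs into a Grok pattern, the expansion can be illustrated with plain regular expressions. This is only a sketch of the substitution step, not the AWS Glue matching engine: the `USERNAME` regex is an assumed common Grok default, and the custom pattern name is written in the same case as its definition.

```python
import re

# Common Grok default for USERNAME (an assumption here) plus the custom
# "digit" pattern from the example above.
patterns = {
    "USERNAME": r"[a-zA-Z0-9._-]+",
    "digit": r"\d",
}

def expand(grok_pattern):
    """Rewrite %{NAME:field} (or %{NAME:field:type}) as named regex groups."""
    def repl(m):
        name, field = m.group(1), m.group(2)
        return f"(?P<{field}>{patterns[name]})"
    return re.sub(r"%\{(\w+):(\w+)(?::\w+)?\}", repl, grok_pattern)

regex = expand("%{USERNAME:username} %{digit:digit}")
match = re.match(regex, "jdoe 7")
print(match.groupdict())  # → {'username': 'jdoe', 'digit': '7'}
```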

## Defining job properties for a streaming ETL job
<a name="create-job-streaming-properties"></a>

When you define a streaming ETL job in the AWS Glue console, provide the following streams-specific properties. For descriptions of additional job properties, see [Defining job properties for Spark jobs](add-job.md#create-job). 

**IAM role**  
Specify the AWS Identity and Access Management (IAM) role that is used for authorization to resources that are used to run the job, access streaming sources, and access target data stores.  
For access to Amazon Kinesis Data Streams, attach the `AmazonKinesisFullAccess` AWS managed policy to the role, or attach a similar IAM policy that permits more fine-grained access. For sample policies, see [Controlling Access to Amazon Kinesis Data Streams Resources Using IAM](https://docs.aws.amazon.com/streams/latest/dev/controlling-access.html).  
For more information about permissions for running jobs in AWS Glue, see [Identity and access management for AWS Glue](security-iam.md).

**Type**  
Choose **Spark streaming**.

**AWS Glue version**  
The AWS Glue version determines the versions of Apache Spark, and Python or Scala, that are available to the job. Choose a selection that specifies the version of Python or Scala available to the job. AWS Glue Version 2.0 with Python 3 support is the default for streaming ETL jobs.

**Maintenance window**  
Specifies a window where a streaming job can be restarted. See [Maintenance windows for AWS Glue Streaming](glue-streaming-maintenance.md).

**Job timeout**  
Optionally enter a duration in minutes. The default value is blank.  
+ Streaming jobs must have a timeout value of less than 7 days (10,080 minutes).
+ If you leave the value blank and have not set up a maintenance window, the job is restarted after 7 days. If you have set up a maintenance window, the job is restarted during the maintenance window after 7 days.

**Data source**  
Specify the table that you created in [Creating a Data Catalog table for a streaming source](#create-table-streaming).

**Data target**  
Do one of the following:  
+ Choose **Create tables in your data target** and specify the following data target properties.  
**Data store**  
Choose Amazon S3 or JDBC.  
**Format**  
Choose any format. All are supported for streaming.
+ Choose **Use tables in the data catalog and update your data target**, and choose a table for a JDBC data store.

**Output schema definition**  
Do one of the following:  
+ Choose **Automatically detect schema of each record** to turn on schema detection. AWS Glue determines the schema from the streaming data.
+ Choose **Specify output schema for all records** to use the Apply Mapping transform to define the output schema.

**Script**  
Optionally supply your own script or modify the generated script to perform operations that the Apache Spark Structured Streaming engine supports. For information on the available operations, see [Operations on streaming DataFrames/Datasets](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#operations-on-streaming-dataframesdatasets).

## Streaming ETL notes and restrictions
<a name="create-job-streaming-restrictions"></a>

Keep in mind the following notes and restrictions:
+ Auto-decompression for AWS Glue streaming ETL jobs is only available for the supported compression types. Also note the following:
  + Framed Snappy refers to the official [framing format](https://github.com/google/snappy/blob/main/framing_format.txt) for Snappy.
  + Deflate is supported in Glue version 3.0, not Glue version 2.0.
+ When using schema detection, you cannot perform joins of streaming data.
+ AWS Glue streaming ETL jobs do not support the Union data type for AWS Glue Schema Registry with Avro format.
+ Your ETL script can use AWS Glue's built-in transforms and the transforms native to Apache Spark Structured Streaming. For more information, see [Operations on streaming DataFrames/Datasets](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#operations-on-streaming-dataframesdatasets) on the Apache Spark website or [AWS Glue PySpark transforms reference](aws-glue-programming-python-transforms.md).
+ AWS Glue streaming ETL jobs use checkpoints to keep track of the data that has been read. Therefore, a stopped and restarted job picks up where it left off in the stream. If you want to reprocess data, you can delete the checkpoint folder referenced in the script.
+ Job bookmarks aren't supported.
+ To use the enhanced fan-out feature of Kinesis Data Streams in your job, see [Using enhanced fan-out in Kinesis streaming jobs](aws-glue-programming-etl-connect-kinesis-efo.md).
+ If you use a Data Catalog table created from the AWS Glue Schema Registry, do the following to reflect a new schema version when it becomes available:

  1. Stop the jobs associated with the table.

  1. Update the schema for the Data Catalog table.

  1. Restart the jobs associated with the table.
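A sketch of those three steps with `boto3`; the Glue client is passed in as a parameter so the function can be exercised without AWS, and all job, database, and table names are placeholders.

```python
def refresh_schema(glue, job_name, database, table_input):
    """Stop the job's active runs, update the table schema, then restart.

    `glue` is a boto3 Glue client (or a test stub); names are placeholders.
    """
    # 1. Stop the jobs associated with the table.
    runs = glue.get_job_runs(JobName=job_name)["JobRuns"]
    active = [r["Id"] for r in runs if r["JobRunState"] == "RUNNING"]
    if active:
        glue.batch_stop_job_run(JobName=job_name, JobRunIds=active)

    # 2. Update the schema for the Data Catalog table.
    glue.update_table(DatabaseName=database, TableInput=table_input)

    # 3. Restart the job associated with the table.
    return glue.start_job_run(JobName=job_name)
```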

# Record matching with AWS Lake Formation FindMatches
<a name="machine-learning"></a>

**Note**  
Record matching is currently unavailable in the following Regions in the AWS Glue console: Middle East (UAE), Europe (Spain), Asia Pacific (Jakarta), and Europe (Zurich).

AWS Lake Formation provides machine learning capabilities to create custom transforms to cleanse your data. There is currently one available transform named FindMatches. The FindMatches transform enables you to identify duplicate or matching records in your dataset, even when the records do not have a common unique identifier and no fields match exactly. This does not require writing any code or knowing how machine learning works. FindMatches can be useful in many different problems, such as: 
+ **Matching customers**: Linking customer records across different customer databases, even when many customer fields do not match exactly across the databases (for example, different name spellings, address differences, or missing or inaccurate data).
+ **Matching products**: Matching products in your catalog against other product sources, such as product catalog against a competitor's catalog, where entries are structured differently.
+ **Improving fraud detection**: Identifying duplicate customer accounts, or determining when a newly created account is (or might be) a match for a previously known fraudulent user.
+ **Other matching problems**: Matching addresses, movies, parts lists, and so on. In general, if a human being could look at your database rows and determine that they were a match, there is a very good chance that the FindMatches transform can help you.

You can create these transforms when you create a job. The transform that you create is based on a source data store schema and example data from the source dataset that you label (we call this process "teaching" a transform). The records that you label must be present in the source dataset. In this process, AWS Glue generates a file that you label and upload back, and the transform learns from those labels. After you teach your transform, you can call it from your Spark-based AWS Glue job (PySpark or Scala Spark) and use it in other scripts with a compatible source data store.

 After the transform is created, it is stored in AWS Glue. On the AWS Glue console, you can manage the transforms that you create. In the navigation pane under **Data Integration and ETL**, **Data classification tools > Record Matching**, you can edit and continue to teach your machine learning transform. For more information about managing transforms on the console, see [Working with machine learning transforms](console-machine-learning-transforms.md). 

**Note**  
AWS Glue version 2.0 FindMatches jobs use the Amazon S3 bucket `aws-glue-temp-<accountID>-<region>` to store temporary files while the transform is processing data. You can delete this data after the run has completed, either manually or by setting an Amazon S3 Lifecycle rule.

## Types of machine learning transforms
<a name="machine-learning-transforms"></a>

You can create machine learning transforms to cleanse your data. You can call these transforms from your ETL script. Your data passes from transform to transform in a data structure called a *DynamicFrame*, which is an extension to an Apache Spark SQL `DataFrame`. The `DynamicFrame` contains your data, and you reference its schema to process your data.

The following types of machine learning transforms are available:

*Find matches*  
Finds duplicate records in the source data. You teach this machine learning transform by labeling example datasets to indicate which rows match. The machine learning transform learns which rows should be matches the more you teach it with example labeled data. Depending on how you configure the transform, the output is one of the following:  
+ A copy of the input table plus a `match_id` column filled in with values that indicate matching sets of records. The `match_id` column is an arbitrary identifier. Any records that have the same `match_id` have been identified as matching each other. Records with different `match_id` values do not match.
+ A copy of the input table with duplicate rows removed. If multiple duplicates are found, then the record with the lowest primary key is kept.

*Find incremental matches*  
The Find matches transform can also be configured to find matches across the existing and incremental frames and return as output a column containing a unique ID per match group.   
For more information, see [Finding incremental matches](machine-learning-incremental-matches.md).

### Using the FindMatches transform
<a name="machine-learning-find-matches"></a>

You can use the `FindMatches` transform to find duplicate records in the source data. A labeling file is generated or provided to help teach the transform.

**Note**  
Currently, `FindMatches` transforms that use a custom encryption key aren't supported in the following Regions:  
Asia Pacific (Osaka) - `ap-northeast-3`

 To get started with the FindMatches transform, you can follow the steps below. For a more advanced and detailed example, see the **AWS Big Data Blog** post [Harmonize data using AWS Glue and AWS Lake Formation FindMatches ML to build a customer 360 view](https://aws.amazon.com/blogs/big-data/harmonize-data-using-aws-glue-and-aws-lake-formation-findmatches-ml-to-build-a-customer-360-view/). 

#### Getting started using the Find Matches transform
<a name="machine-learning-find-mathes-workflow"></a>

Follow these steps to get started with the `FindMatches` transform:

1. Create a table in the AWS Glue Data Catalog for the source data that is to be cleaned. For information about how to create a crawler, see [Working with Crawlers on the AWS Glue Console](https://docs.aws.amazon.com/glue/latest/dg/console-crawlers.html).

   If your source data is a text-based file such as a comma-separated values (CSV) file, consider the following: 
   + Keep your input record CSV file and labeling file in separate folders. Otherwise, the AWS Glue crawler might consider them as multiple parts of the same table and create tables in the Data Catalog incorrectly. 
   + Unless your CSV file includes ASCII characters only, ensure that UTF-8 without BOM (byte order mark) encoding is used for the CSV files. Microsoft Excel often adds a BOM in the beginning of UTF-8 CSV files. To remove it, open the CSV file in a text editor, and resave the file as **UTF-8 without BOM**. 

1. On the AWS Glue console, create a job, and choose the **Find matches** transform type.
**Important**  
The data source table that you choose for the job can't have more than 100 columns.

1. Tell AWS Glue to generate a labeling file by choosing **Generate labeling file**. AWS Glue takes the first pass at grouping similar records for each `labeling_set_id` so that you can review those groupings. You label matches in the `label` column.
   + If you already have a labeling file, that is, an example of records that indicate matching rows, upload the file to Amazon Simple Storage Service (Amazon S3). For information about the format of the labeling file, see [Labeling file format](#machine-learning-labeling-file). Proceed to step 4.

1. Download the labeling file and label the file as described in the [Labeling](#machine-learning-labeling) section.

1. Upload the corrected labeled file. AWS Glue runs tasks to teach the transform how to find matches.

   On the **Machine learning transforms** list page, choose the **History** tab. This page indicates when AWS Glue performs the following tasks:
   + **Import labels**
   + **Export labels**
   + **Generate labels**
   + **Estimate quality**

1. To create a better transform, you can iteratively download, label, and upload the labeled file. In the initial runs, many more records might be matched incorrectly, but AWS Glue learns as you continue to teach it by verifying the labeling file.

1. Evaluate and tune your transform by evaluating performance and results of finding matches. For more information, see [Tuning machine learning transforms in AWS Glue](add-job-machine-learning-transform-tuning.md).

#### Labeling
<a name="machine-learning-labeling"></a>

When `FindMatches` generates a labeling file, records are selected from your source table. Based on previous training, `FindMatches` identifies the most valuable records to learn from.

The act of *labeling* is editing a labeling file (we suggest using a spreadsheet such as Microsoft Excel) and adding identifiers, or labels, into the `label` column that identifies matching and nonmatching records. It is important to have a clear and consistent definition of a match in your source data. `FindMatches` learns from which records you designate as matches (or not) and uses your decisions to learn how to find duplicate records.

When a labeling file is generated by `FindMatches`, approximately 100 records are generated. These 100 records are typically divided into 10 *labeling sets*, where each labeling set is identified by a unique `labeling_set_id` generated by `FindMatches`. Each labeling set should be viewed as a separate labeling task independent of the other labeling sets. Your task is to identify matching and non-matching records within each labeling set.

##### Tips for editing labeling files in a spreadsheet
<a name="machine-learning-labeling-tips"></a>

When editing the labeling file in a spreadsheet application, consider the following:
+ The file might not open with column fields fully expanded. You might need to expand the `labeling_set_id` and `label` columns to see content in those cells.
+ If the primary key column is a number, such as a `long` data type, the spreadsheet might interpret it as a number and change the value. This key value must be treated as text. To correct this problem, format all the cells in the primary key column as **Text data**.

#### Labeling file format
<a name="machine-learning-labeling-file"></a>

The labeling file that is generated by AWS Glue to teach your `FindMatches` transform uses the following format. If you generate your own file for AWS Glue, it must follow this format as well:
+ It is a comma-separated values (CSV) file. 
+ It must be encoded in `UTF-8`. If you edit the file using Microsoft Windows, it might be encoded with `cp1252`.
+ It must be in an Amazon S3 location to pass it to AWS Glue.
+ Use a moderate number of rows for each labeling task. 10–20 rows per task are recommended, although 2–30 rows per task are acceptable. Tasks larger than 50 rows are not recommended and may cause poor results or system failure.
+ If you have already-labeled data consisting of pairs of records labeled as a "match" or a "no-match", you can use it. Such labeled pairs can be represented as labeling sets of size 2: label both records with, for instance, the letter "A" if they match, but label one as "A" and one as "B" if they do not match.
**Note**  
 Because it has additional columns, the labeling file has a different schema from a file that contains your source data. Place the labeling file in a different folder from any transform input CSV file so that the AWS Glue crawler does not consider it when it creates tables in the Data Catalog. Otherwise, the tables created by the AWS Glue crawler might not correctly represent your data. 
+ The first two columns (`labeling_set_id`, `label`) are required by AWS Glue. The remaining columns must match the schema of the data that is to be processed.
+ For each `labeling_set_id`, you identify all matching records by using the same label. A label is a unique string placed in the `label` column. We recommend using labels that contain simple characters, such as A, B, C, and so on. Labels are case sensitive and are entered in the `label` column.
+ Rows that contain the same `labeling_set_id` and the same label are understood to be labeled as a match.
+ Rows that contain the same `labeling_set_id` and a different label are understood to be labeled as *not* a match.
+ Rows that contain a different `labeling_set_id` are understood to be conveying no information for or against matching.

  The following is an example of labeling the data:    
<a name="table-labeling-data"></a>[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/machine-learning.html)
+ In the above example, we identify John/Johnny/Jon Doe as being a match, and we teach the system that these records do not match Jane Smith. Separately, we teach the system that Richard and Rich Jones are the same person, but that these records are not a match to Sarah Jones/Jones-Walker and Richie Jones Jr.
+ As you can see, the scope of the labels is limited to the `labeling_set_id`. So labels do not cross `labeling_set_id` boundaries. For example, a label "A" in `labeling_set_id` 1 does not have any relation to label "A" in `labeling_set_id` 2.
+ If a record does not have any matches within a labeling set, then assign it a unique label. For instance, Jane Smith does not match any record in labeling set ABC123, so it is the only record in that labeling set with the label of B.
+ The labeling set "GHI678" shows that a labeling set can consist of just two records which are given the same label to show that they match. Similarly, "XYZABC" shows two records given different labels to show that they do not match.
+ Note that sometimes a labeling set may contain no matches (that is, you give every record in the labeling set a different label) or a labeling set might all be "the same" (you gave them all the same label). This is okay as long as your labeling sets collectively contain examples of records that are and are not "the same" by your criteria.
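Putting these rules together, a minimal labeling file can be produced with the standard `csv` module. The name columns here are placeholders standing in for your source schema; only `labeling_set_id` and `label` are fixed by the format.

```python
import csv
import io

# labeling_set_id and label come first; the remaining columns mirror the
# source schema (first_name/last_name are placeholders).
rows = [
    ("ABC123", "A", "John",   "Doe"),    # same set, same label -> match
    ("ABC123", "A", "Johnny", "Doe"),
    ("ABC123", "B", "Jane",   "Smith"),  # same set, different label -> no match
    ("DEF345", "A", "Rich",   "Jones"),  # new set: its "A" is unrelated to ABC123's "A"
]

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(["labeling_set_id", "label", "first_name", "last_name"])
writer.writerows(rows)
labeling_csv = buf.getvalue()
print(labeling_csv)
```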

**Important**  
Confirm that the IAM role that you pass to AWS Glue has access to the Amazon S3 bucket that contains the labeling file. By convention, AWS Glue policies grant permission to Amazon S3 buckets or folders whose names are prefixed with **aws-glue-**. If your labeling files are in a different location, add permission to that location in the IAM role.

# Tuning machine learning transforms in AWS Glue
<a name="add-job-machine-learning-transform-tuning"></a>

You can tune your machine learning transforms in AWS Glue to improve the results of your data-cleansing jobs to meet your objectives. To improve your transform, you can teach it by generating a labeling set, adding labels, and then repeating these steps several times until you get your desired results. You can also tune by changing some machine learning parameters. 

For more information about machine learning transforms, see [Record matching with AWS Lake Formation FindMatches](machine-learning.md).

**Topics**
+ [Machine learning measurements](machine-learning-terminology.md)
+ [Deciding between precision and recall](machine-learning-precision-recall-tradeoff.md)
+ [Deciding between accuracy and cost](machine-learning-accuracy-cost-tradeoff.md)
+ [Estimating the quality of matches using match confidence scores](match-scoring.md)
+ [Teaching the Find Matches transform](machine-learning-teaching.md)

# Machine learning measurements
<a name="machine-learning-terminology"></a>

To understand the measurements that are used to tune your machine learning transform, you should be familiar with the following terminology:

**True positive (TP)**  
A match in the data that the transform correctly found, sometimes called a *hit*.

**True negative (TN)**  
A nonmatch in the data that the transform correctly rejected.

**False positive (FP)**  
A nonmatch in the data that the transform incorrectly classified as a match, sometimes called a *false alarm*.

**False negative (FN)**  
A match in the data that the transform didn't find, sometimes called a *miss*.

For more information about the terminology that is used in machine learning, see [Confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) in Wikipedia.

To tune your machine learning transforms, you can change the value of the following measurements in the **Advanced properties** of the transform.
+ **Precision** measures how well the transform finds true positives among the total number of records that it identifies as positive (true positives and false positives). For more information, see [Precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) in Wikipedia.
+ **Recall** measures how well the transform finds true positives from the total records in the source data. For more information, see [Precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) in Wikipedia.
+ **Accuracy** measures how well the transform finds true positives and true negatives. Increasing accuracy requires more machine resources and cost, but it also results in increased recall. For more information, see [Accuracy and precision](https://en.wikipedia.org/wiki/Accuracy_and_precision#In_information_systems) in Wikipedia.
+ **Cost** measures how many compute resources (and thus money) are consumed to run the transform.
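For reference, the first three measurements above derive directly from the four confusion-matrix counts:

```python
def precision_recall_accuracy(tp, fp, tn, fn):
    """Compute precision, recall, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)                  # TP among everything flagged as a match
    recall = tp / (tp + fn)                     # TP among all true matches
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # correct calls among all records
    return precision, recall, accuracy

# 90 hits, 10 false alarms, 880 correct rejections, 20 misses
p, r, a = precision_recall_accuracy(tp=90, fp=10, tn=880, fn=20)
print(round(p, 2), round(r, 2), round(a, 2))  # → 0.9 0.82 0.97
```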

# Deciding between precision and recall
<a name="machine-learning-precision-recall-tradeoff"></a>

Each `FindMatches` transform contains a `precision-recall` parameter. You use this parameter to specify one of the following:
+ If you are more concerned about the transform falsely reporting that two records match when they actually don't match, then you should emphasize *precision*. 
+ If you are more concerned about the transform failing to detect records that really do match, then you should emphasize *recall*.

You can make this trade-off on the AWS Glue console or by using the AWS Glue machine learning API operations.

**When to favor precision**  
Favor precision if you are more concerned about the risk that `FindMatches` results in a pair of records matching when they don't actually match. To favor precision, choose a *higher* precision-recall trade-off value. With a higher value, the `FindMatches` transform requires more evidence to decide that a pair of records should be matched. The transform is tuned to bias toward saying that records do not match.

For example, suppose that you're using `FindMatches` to detect duplicate items in a video catalog, and you provide a higher precision-recall value to the transform. If your transform incorrectly detects that *Star Wars: A New Hope* is the same as *Star Wars: The Empire Strikes Back*, a customer who wants *A New Hope* might be shown *The Empire Strikes Back*. This would be a poor customer experience. 

However, if the transform fails to detect that *Star Wars: A New Hope* and *Star Wars: Episode IV—A New Hope* are the same item, the customer might be confused at first but might eventually recognize them as the same. It would be a mistake, but not as bad as the previous scenario.

**When to favor recall**  
Favor recall if you are more concerned about the risk that the `FindMatches` transform results might fail to detect a pair of records that actually do match. To favor recall, choose a *lower* precision-recall trade-off value. With a lower value, the `FindMatches` transform requires less evidence to decide that a pair of records should be matched. The transform is tuned to bias toward saying that records do match.

For example, this might be a priority for a security organization. Suppose that you are matching customers against a list of known defrauders, and it is important to determine whether a customer is a defrauder. You are using `FindMatches` to match the defrauder list against the customer list. Every time `FindMatches` detects a match between the two lists, a human auditor is assigned to verify that the person is, in fact, a defrauder. Your organization might prefer to choose recall over precision. In other words, you would rather have the auditors manually review and reject some cases when the customer is not a defrauder than fail to identify that a customer is, in fact, on the defrauder list.

**How to favor both precision and recall**  
The best way to improve both precision and recall is to label more data. As you label more data, the overall accuracy of the `FindMatches` transform improves, thus improving both precision and recall. Nevertheless, even with the most accurate transform, there is always a gray area where you need to experiment with favoring precision or recall, or choose a value in the middle. 

# Deciding between accuracy and cost
<a name="machine-learning-accuracy-cost-tradeoff"></a>

Each `FindMatches` transform contains an `accuracy-cost` parameter. You can use this parameter to specify one of the following:
+ If you are more concerned with the transform accurately reporting that two records match, then you should emphasize *accuracy*.
+ If you are more concerned about the cost or speed of running the transform, then you should emphasize *lower cost*.

You can make this trade-off on the AWS Glue console or by using the AWS Glue machine learning API operations.

**When to favor accuracy**  
Favor accuracy if you are more concerned about the risk that the `FindMatches` results will miss matches. To favor accuracy, choose a *higher* accuracy-cost trade-off value. With a higher value, the `FindMatches` transform requires more time to do a more thorough search for correctly matching records. Note that this parameter doesn't make it less likely to falsely call a nonmatching record pair a match. The transform is tuned to bias towards spending more time finding matches.

**When to favor cost**  
Favor cost if you are more concerned about the cost of running the `FindMatches` transform and less about how many matches are found. To favor cost, choose a *lower* accuracy-cost trade-off value. With a lower value, the `FindMatches` transform requires fewer resources to run. The transform is tuned to bias towards finding fewer matches. If the results are acceptable when favoring lower cost, use this setting.

**How to favor both accuracy and lower cost**  
It takes more machine time to examine more pairs of records to determine whether they might be matches. If you want to reduce cost without reducing quality, here are some steps you can take: 
+ Eliminate records in your data source that you aren't concerned about matching.
+ Eliminate columns from your data source that you are sure aren't useful for making a match/no-match decision. A good way of deciding this is to eliminate columns that you don't think affect your own decision about whether a set of records is “the same.”
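
As an illustration of the second point, a small pre-processing sketch that drops columns known to be unhelpful for matching; the records and column names here are invented for this example:

```python
# Hypothetical catalog records; "internal_note" and "last_viewed" don't
# help decide whether two records describe the same item.
records = [
    {"id": 1, "title": "Star Wars: A New Hope", "year": "1977",
     "internal_note": "promo", "last_viewed": "2021-01-01"},
    {"id": 2, "title": "Star Wars: Episode IV - A New Hope", "year": "1977",
     "internal_note": "", "last_viewed": "2021-03-12"},
]

IRRELEVANT = {"internal_note", "last_viewed"}

def strip_irrelevant(record):
    """Keep only columns that could inform a match/no-match decision."""
    return {k: v for k, v in record.items() if k not in IRRELEVANT}

slimmed = [strip_irrelevant(r) for r in records]
print(slimmed[0])  # {'id': 1, 'title': 'Star Wars: A New Hope', 'year': '1977'}
```

Fewer columns means fewer comparisons per candidate pair, which lowers the cost of each run.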

# Estimating the quality of matches using match confidence scores
<a name="match-scoring"></a>

Match confidence scores estimate the quality of the matches found by `FindMatches`. A match confidence score is between 0 and 1, where a higher score means higher similarity. Examining match confidence scores lets you distinguish between clusters of matches in which the system is highly confident (which you might decide to merge), clusters about which the system is uncertain (which you might decide to have reviewed by a human), and clusters that the system deems unlikely (which you might decide to reject).

You may want to adjust your training data in situations where you see a high match confidence score but determine that the records are not matches, or where you see a low score but determine that the records are, in fact, matches.

Confidence scores are particularly useful for large industrial datasets, where it is infeasible to review every `FindMatches` decision.

Match confidence scores are available in AWS Glue version 2.0 or later.
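
As a sketch of how you might act on these scores downstream, the following buckets proposed matches into merge, human-review, and reject groups. The thresholds, rows, and the `triage` helper are illustrative assumptions, not AWS Glue defaults:

```python
from collections import Counter

# Illustrative FindMatches-style output rows (invented values).
matches = [
    {"primary_id": 3281355037663, "match_id": 85899345947, "match_confidence_score": 0.98},
    {"primary_id": 309237680432,  "match_id": 85899345928, "match_confidence_score": 0.83},
    {"primary_id": 2164663519676, "match_id": 85899345930, "match_confidence_score": 0.70},
    {"primary_id": 214748380804,  "match_id": 85899345999, "match_confidence_score": 0.42},
]

def triage(score, merge_at=0.9, review_at=0.6):
    """Route a proposed match based on its confidence score."""
    if score >= merge_at:
        return "merge"          # highly confident: merge automatically
    if score >= review_at:
        return "human_review"   # uncertain: send to a human reviewer
    return "reject"             # unlikely: reject the proposed match

decisions = Counter(triage(m["match_confidence_score"]) for m in matches)
print(dict(decisions))  # {'merge': 1, 'human_review': 2, 'reject': 1}
```

In practice you would pick the two thresholds by inspecting scored clusters like the examples below.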

## Generating match confidence scores
<a name="specifying-match-scoring"></a>

You can generate match confidence scores by setting the Boolean parameter `computeMatchConfidenceScores` to `true` when calling the `FindMatches` or `FindIncrementalMatches` API.

AWS Glue adds a new column, `match_confidence_score`, to the output.

## Match scoring examples
<a name="match-scoring-examples"></a>

For example, consider the following matched records:

**Score >= 0.9**  
Summary of matched records:

```
  primary_id  |   match_id  | match_confidence_score

3281355037663    85899345947   0.9823658302132061
1546188247619    85899345947   0.9823658302132061
```

Details:

![\[Details of matched records with a match confidence score of at least 0.9.\]](http://docs.aws.amazon.com/glue/latest/dg/images/match_score1.png)


From this example, we can see that the two records are very similar and share `display_position`, `primary_name`, and `street name`. 

**Score >= 0.8 and score < 0.9**  
Summary of matched records:

```
  primary_id  |   match_id  | match_confidence_score

309237680432     85899345928   0.8309852373674638
3590592666790    85899345928   0.8309852373674638
343597390617     85899345928   0.8309852373674638
249108124906     85899345928   0.8309852373674638
463856477937     85899345928   0.8309852373674638
```

Details:

![\[Details of matched records with match confidence scores between 0.8 and 0.9.\]](http://docs.aws.amazon.com/glue/latest/dg/images/match_score2.png)


From this example, we can see that these records share the same `primary_name` and `country`.

**Score >= 0.6 and score < 0.7**  
Summary of matched records:

```
  primary_id  |   match_id  | match_confidence_score

2164663519676    85899345930   0.6971099896480333
 317827595278    85899345930   0.6971099896480333
 472446424341    85899345930   0.6971099896480333
3118146262932    85899345930   0.6971099896480333
 214748380804    85899345930   0.6971099896480333
```

Details:

![\[Details of matched records with match confidence scores between 0.6 and 0.7.\]](http://docs.aws.amazon.com/glue/latest/dg/images/match_score3.png)


From this example, we can see that these records share only the same `primary_name`.

For more information, see:
+ [Step 5: Add and run a job with your machine learning transform](machine-learning-transform-tutorial.md#ml-transform-tutorial-add-job)
+ PySpark: [FindMatches class](aws-glue-api-crawler-pyspark-transforms-findmatches.md)
+ PySpark: [FindIncrementalMatches class](aws-glue-api-crawler-pyspark-transforms-findincrementalmatches.md)
+ Scala: [FindMatches class](glue-etl-scala-apis-glue-ml-findmatches.md)
+ Scala: [FindIncrementalMatches class](glue-etl-scala-apis-glue-ml-findincrementalmatches.md)

# Teaching the Find Matches transform
<a name="machine-learning-teaching"></a>

Each `FindMatches` transform must be taught what should be considered a match and what should not be considered a match. You teach your transform by adding labels to a file and uploading your choices to AWS Glue. 

You can orchestrate this labeling on the AWS Glue console or by using the AWS Glue machine learning API operations.

**How many times should I add labels? How many labels do I need?**  
The answers to these questions are mostly up to you. You must evaluate whether `FindMatches` is delivering the level of accuracy that you need and whether you think the extra labeling effort is worth it for you. The best way to decide this is to look at the “Precision,” “Recall,” and “Area under the precision recall curve” metrics that you can generate when you choose **Estimate quality** on the AWS Glue console. After you label more sets of tasks, rerun these metrics and verify whether they have improved. If, after labeling a few sets of tasks, you don't see improvement on the metric that you are focusing on, the transform quality might have reached a plateau.

**Why are both true positive and true negative labels needed?**  
The `FindMatches` transform needs both positive and negative examples to learn what you think is a match. If you are labeling `FindMatches`-generated training data (for example, using the **I do not have labels** option), `FindMatches` tries to generate a set of “label set ids” for you. Within each task, you give the same “label” to some records and different “labels” to other records. In other words, the tasks generally are not either all the same or all different (but it's okay if a particular task is all “the same” or all “not the same”).

If you are teaching your `FindMatches` transform using the **Upload labels from S3** option, try to include both examples of matching and nonmatching records. It's acceptable to have only one type. These labels help you build a more accurate `FindMatches` transform, but you still need to label some records that you generate using the **Generate labeling file** option.

**How can I enforce that the transform matches exactly as I taught it?**  
The `FindMatches` transform learns from the labels that you provide, so it might generate record pairs that don't respect the provided labels. To enforce that the `FindMatches` transform respects your labels, select **EnforceProvidedLabels** in **FindMatchesParameters**.

**What techniques can you use when an ML transform identifies items as matches that are not true matches?**  
You can use the following techniques:
+ Increase the `precisionRecallTradeoff` to a higher value. This eventually results in finding fewer matches, but it should also break up your big cluster when it reaches a high enough value. 
+ Take the output rows corresponding to the incorrect results and reformat them as a labeling set (removing the `match_id` column and adding a `labeling_set_id` and `label` column). If necessary, break up (subdivide) into multiple labeling sets to ensure that the labeler can keep each labeling set in mind while assigning labels. Then, correctly label the matching sets, upload the label file, and append it to your existing labels. This might teach your transform enough about what it is looking for to understand the pattern. 
+ (Advanced) Finally, look at that data to see if there is a pattern that you can detect that the system is not noticing. Preprocess that data using standard AWS Glue functions to *normalize* the data. Highlight what you want the algorithm to learn from by separating data that you know to be differently important into their own columns. Or construct combined columns from columns whose data you know to be related. 
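
The second technique can be sketched as follows. The rows and the `to_labeling_rows` helper are hypothetical; the column handling follows the description above (remove `match_id`, add `labeling_set_id` and `label`):

```python
# Output rows a reviewer judged to be an incorrect cluster (invented data).
bad_cluster = [
    {"match_id": 85899345930, "id": "a1", "title": "Star Wars: A New Hope"},
    {"match_id": 85899345930, "id": "b2", "title": "Star Trek: First Contact"},
]

def to_labeling_rows(rows, labeling_set_id, labels):
    """Drop match_id and add labeling_set_id plus a reviewer-provided label.

    Within a labeling set, records given the same label are "the same";
    records given different labels are "not the same".
    """
    out = []
    for row, label in zip(rows, labels):
        labeled = {k: v for k, v in row.items() if k != "match_id"}
        labeled["labeling_set_id"] = labeling_set_id
        labeled["label"] = label
        out.append(labeled)
    return out

# These two movies are NOT the same item, so they get different labels.
labeling_set = to_labeling_rows(bad_cluster, labeling_set_id=1, labels=["A", "B"])
```

You would then write rows like these to a CSV labeling file and upload it with the append option.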

# Working with machine learning transforms
<a name="console-machine-learning-transforms"></a>

You can use AWS Glue to create custom machine learning transforms that can be used to cleanse your data. You can use these transforms when you create a job on the AWS Glue console. 

For information about how to create a machine learning transform, see [Record matching with AWS Lake Formation FindMatches](machine-learning.md).

**Topics**
+ [Transform properties](#console-machine-learning-properties)
+ [Adding and editing machine learning transforms](#console-machine-learning-transforms-actions)
+ [Viewing transform details](#console-machine-learning-transforms-details)
+ [Teach transforms using labels](#console-machine-learning-transforms-teaching-transforms)

## Transform properties
<a name="console-machine-learning-properties"></a>

To view an existing machine learning transform, sign in to the AWS Management Console, and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). In the navigation pane under **Data Integration and ETL**, choose **Data classification tools > Record Matching**.

Each transform has the following properties:

**Transform name**  
The unique name you gave the transform when you created it.

**ID**  
A unique identifier of the transform. 

**Label count**  
The number of labels in the labeling file that was provided to help teach the transform. 

**Status**  
Indicates whether the transform is **Ready** or **Needs training**. To run a machine learning transform successfully in a job, it must be **Ready**. 

**Created**  
The date the transform was created.

**Modified**  
The date the transform was last updated.

**Description**  
The description supplied for the transform, if one was provided.

**AWS Glue version**  
The version of AWS Glue used.

**Run ID**  
A unique identifier created by AWS Glue for each run of a task on this transform. 

**Task type**  
The type of machine learning transform; for example, **Find matching records**.

**Status**  
Indicates the status of the task run. Possible statuses include:  
+ Starting
+ Running
+ Stopping
+ Stopped
+ Succeeded
+ Failed
+ Timeout

**Error**  
If the status is Failed, an error message is displayed describing the reason for the failure.

## Adding and editing machine learning transforms
<a name="console-machine-learning-transforms-actions"></a>

 You can view, delete, set up and teach, or tune a transform on the AWS Glue console. Select the check box next to the transform in the list, choose **Action**, and then choose the action that you want to take. 

### Creating a new ML transform
<a name="w2aac37c11c24c23c11b5"></a>

 To add a new machine learning transform, choose **Create transform**. Follow the instructions in the **Add job** wizard. For more information, see [Record matching with AWS Lake Formation FindMatches](machine-learning.md). 

#### Step 1. Set transform properties.
<a name="w2aac37c11c24c23c11b5b7"></a>

1. Enter the name and description (optional).

1. Optionally, set security configuration. See [Using data encryption with machine learning transforms](#ml_transform_sec_config). 

1. Optionally, set Task execution settings. Task execution settings allow you to customize how the task is run. Select the Worker type, number of workers, task timeout (in minutes), the number of retries, and the AWS Glue version.

1. Optionally, set Tags. Tags are labels that you can assign to an AWS resource. Each tag consists of a key and an optional value. Tags can be used to search and filter your resource or track your AWS costs.

#### Step 2. Choose table and primary key.
<a name="w2aac37c11c24c23c11b5b9"></a>

1. Choose the AWS Glue Catalog database and table.

1. Choose a primary key from the selected table. The primary key column typically contains a unique identifier for every record in the data source. 

#### Step 3. Select tuning options.
<a name="w2aac37c11c24c23c11b5c11"></a>

1.  For **Recall vs. precision**, choose the tuning value to tune the transform to favor recall or precision. By default, **Balanced** is selected, but you can choose to favor recall or favor precision, or choose **Custom** and enter a value between 0.0 and 1.0 (inclusive). 

1.  For **Lower cost vs. accuracy**, choose the tuning value to favor lower cost or accuracy, or choose **Custom** and enter a value between 0.0 and 1.0 (inclusive). 

1.  For **Match enforcement**, choose **Force output to match labels** if you want to teach the ML transform by forcing the output to match the labels used. 

#### Step 4. Review and create.
<a name="w2aac37c11c24c23c11b5c13"></a>

1.  Review the options for steps 1 – 3. 

1.  Choose **Edit** for any step that needs to be modified. Choose **Create transform** to complete the create transform wizard. 

### Using data encryption with machine learning transforms
<a name="ml_transform_sec_config"></a>

When adding a machine learning transform to AWS Glue, you can optionally specify a security configuration that is associated with the data source or data target. If the Amazon S3 bucket used to store the data is encrypted with a security configuration, specify the same security configuration when creating the transform.

You can also choose to use server-side encryption with AWS KMS (SSE-KMS) to encrypt the model and labels to prevent unauthorized persons from inspecting them. If you choose this option, you're prompted to choose the AWS KMS key by name, or you can choose **Enter a key ARN**. If you choose to enter the ARN for the KMS key, a second field appears where you can enter the KMS key ARN.

**Note**  
Currently, ML transforms that use a custom encryption key aren't supported in the following Regions:  
Asia Pacific (Osaka) - `ap-northeast-3`

## Viewing transform details
<a name="console-machine-learning-transforms-details"></a>

### Viewing transform properties
<a name="console-machine-learning-transforms-details"></a>

The **Transform properties** page includes attributes of your transform. It shows you the details about the transform definition, including the following:
+ **Transform name** shows the name of the transform.
+ **Type** lists the type of the transform.
+ **Status** displays whether the transform is ready to be used in a script or job.
+ **Force output to match labels** displays whether the transform forces the output to match the labels provided by the user.
+ **Spark version** is related to the AWS Glue version you chose in the **Task run properties** when adding the transform. AWS Glue 1.0 and Spark 2.4 are recommended for most customers. For more information, see [AWS Glue Versions](https://docs.aws.amazon.com/glue/latest/dg/release-notes.html#release-notes-versions).

### History, Estimate quality and Tags tabs
<a name="w2aac37c11c24c23c13b5"></a>

 Transform details include the information that you defined when you created the transform. To view the details of a transform, select the transform in the **Machine learning transforms** list, and review the information on the following tabs: 
+ History
+ Estimate quality
+ Tags

#### History
<a name="console-machine-learning-transforms-history"></a>

The **History** tab shows your transform task run history. Several types of tasks are run to teach a transform. For each task, the run metrics include the following:
+ **Run ID** is an identifier created by AWS Glue for each run of this task.
+ **Task type** shows the type of task run.
+ **Status** shows the status of each task run, with the most recent run listed at the top.
+ **Error** shows the details of an error message if the run was not successful.
+ **Start time** shows the date and time (local time) that the task started.
+ **End time** shows the date and time (local time) that the task ended.
+ **Logs** links to the logs written to `stdout` for this job run.

  The **Logs** link takes you to Amazon CloudWatch Logs. There you can view the details about the tables that were created in the AWS Glue Data Catalog and any errors that were encountered. You can manage your log retention period on the CloudWatch console. The default log retention is `Never Expire`. For more information about how to change the retention period, see [Change Log Data Retention in CloudWatch Logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Working-with-log-groups-and-streams.html#SettingLogRetention) in the *Amazon CloudWatch Logs User Guide*.
+ **Label file** shows a link to Amazon S3 for a generated labeling file.

#### Estimate quality
<a name="console-machine-learning-transforms-metrics"></a>

 The **Estimate quality** tab shows the metrics that you use to measure the quality of the transform. Estimates are calculated by comparing the transform match predictions using a subset of your labeled data against the labels you have provided. These estimates are approximate. You can invoke an **Estimate quality** task run from this tab.

The **Estimate quality** tab shows the metrics from the last **Estimate quality** run including the following properties:
+ **Area under the Precision-Recall curve** is a single number estimating the upper bound of the overall quality of the transform. It is independent of the choice made for the precision-recall parameter. Higher values indicate that you have a more attractive precision-recall tradeoff. 
+ **Precision** estimates how often the transform is correct when it predicts a match.
+ **Recall upper limit** estimates how often, for records that actually match, the transform predicts the match.
+ **F1** estimates the transform's accuracy between 0 and 1, where 1 is the best accuracy. For more information, see [F1 score](https://en.wikipedia.org/wiki/F1_score) in Wikipedia.
+ The **Column importance** table shows the column names and importance score for each column. Column importance helps you understand how columns contribute to your model by identifying which columns in your records are used the most to do the matching. This data may prompt you to add to or change your label set to raise or lower the importance of columns.

  The Importance column provides a numerical score for each column, as a decimal not greater than 1.0.
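
For example, you might rank the reported importance scores and flag columns that barely contribute, either as candidates to drop or as hints that your label set under-represents them. The scores and column names below are invented for this sketch:

```python
# Hypothetical column-importance scores like those shown on the
# Estimate quality tab (values are invented for illustration).
importance = {
    "primary_name": 0.83,
    "street_name": 0.41,
    "country": 0.12,
    "row_checksum": 0.01,
}

# Rank columns by how much the model relies on them.
ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)

# Flag columns contributing almost nothing (threshold is illustrative).
low_signal = [name for name, score in ranked if score < 0.05]

print(ranked[0][0])  # primary_name
print(low_signal)    # ['row_checksum']
```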

For information about understanding quality estimates versus true quality, see [Quality estimates versus end-to-end (true) quality](#console-machine-learning-quality-estimates-true-quality).

For more information about tuning your transform, see [Tuning machine learning transforms in AWS Glue](add-job-machine-learning-transform-tuning.md).

#### Quality estimates versus end-to-end (true) quality
<a name="console-machine-learning-quality-estimates-true-quality"></a>

AWS Glue estimates the quality of your transform by presenting the internal machine-learned model with a number of pairs of records that you provided matching labels for but that the model has not seen before. These quality estimates are a function of the quality of the machine-learned model (which is influenced by the number of records that you label to “teach” the transform). The end-to-end, or *true* recall (which is not automatically calculated by the `ML transform`) is also influenced by the `ML transform` filtering mechanism that proposes a wide variety of possible matches to the machine-learned model. 

You can tune this filtering method primarily by specifying the **Lower Cost-Accuracy** tuning value. As the tuning value gets closer to favoring **Accuracy**, the system does a more thorough and expensive search for pairs of records that might be matches. More pairs of records are fed to your machine-learned model, and your `ML transform`'s end-to-end or true recall gets closer to the estimated recall metric. Note, however, that changes in the end-to-end quality of your matches caused by changes in the cost/accuracy tradeoff are typically not reflected in the quality estimate.

#### Tags
<a name="w2aac37c11c24c23c13b5c13"></a>

 Tags are labels that you can assign to an AWS resource. Each tag consists of a key and an optional value. Tags can be used to search and filter your resource or track your AWS costs. 

## Teach transforms using labels
<a name="console-machine-learning-transforms-teaching-transforms"></a>

 You can teach your ML transform using labels (examples) by choosing **Teach transform** from the ML transform details page. When you teach your machine learning algorithm by providing examples (called labels), you can choose existing labels to use, or create a labeling file. 

![\[The screenshot shows a wizard screen for Teach the transform using labels.\]](http://docs.aws.amazon.com/glue/latest/dg/images/machine-learning-teach-transform.png)

+  **Labeling** – If you have labels, choose **I have labels**. If you do not have labels, you can still continue with the next step in generating a labeling file. 
+  **Generate labeling file** – AWS Glue extracts records from your source data and suggests potential matching records. You choose the Amazon S3 bucket to store the generated label file. Choose **Generate labeling file** to start the process. When done, choose **Download labeling file**. The downloaded file has a label column that you can fill in. 
+  **Upload labels from Amazon S3** – Choose the completed labeling file from the Amazon S3 bucket where the label file is stored. Then, choose to either append the labels to your existing labels or to overwrite your existing labels. Choose **Upload labeling file from Amazon S3**. 

# Tutorial: Creating a machine learning transform with AWS Glue
<a name="machine-learning-transform-tutorial"></a>

This tutorial guides you through the actions to create and manage a machine learning (ML) transform using AWS Glue. Before using this tutorial, you should be familiar with using the AWS Glue console to add crawlers and jobs and edit scripts. You should also be familiar with finding and downloading files on the Amazon Simple Storage Service (Amazon S3) console.

In this example, you create a `FindMatches` transform to find matching records, teach it how to identify matching and nonmatching records, and use it in an AWS Glue job. The AWS Glue job writes a new Amazon S3 file with an additional column named `match_id`. 

The source data used by this tutorial is a file named `dblp_acm_records.csv`. This file is a modified version of academic publications (DBLP and ACM) available from the original [DBLP ACM dataset](https://doi.org/10.3886/E100843V2). The `dblp_acm_records.csv` file is a comma-separated values (CSV) file in UTF-8 format with no byte-order mark (BOM). 

A second file, `dblp_acm_labels.csv`, is an example labeling file that contains both matching and nonmatching records used to teach the transform as part of the tutorial. 

**Topics**
+ [Step 1: Crawl the source data](#ml-transform-tutorial-crawler)
+ [Step 2: Add a machine learning transform](#ml-transform-tutorial-create)
+ [Step 3: Teach your machine learning transform](#ml-transform-tutorial-teach)
+ [Step 4: Estimate the quality of your machine learning transform](#ml-transform-tutorial-estimate-quality)
+ [Step 5: Add and run a job with your machine learning transform](#ml-transform-tutorial-add-job)
+ [Step 6: Verify output data from Amazon S3](#ml-transform-tutorial-data-output)

## Step 1: Crawl the source data
<a name="ml-transform-tutorial-crawler"></a>

First, crawl the source Amazon S3 CSV file to create a corresponding metadata table in the Data Catalog.

**Important**  
To direct the crawler to create a table for only the CSV file, store the CSV source data in a different Amazon S3 folder from other files.

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Crawlers**, **Add crawler**. 

1. Follow the wizard to create and run a crawler named `demo-crawl-dblp-acm` with output to database `demo-db-dblp-acm`. When running the wizard, create the database `demo-db-dblp-acm` if it doesn't already exist. Choose an Amazon S3 include path to sample data in the current AWS Region. For example, for `us-east-1`, the Amazon S3 include path to the source file is `s3://ml-transforms-public-datasets-us-east-1/dblp-acm/records/dblp_acm_records.csv`. 

   If successful, the crawler creates the table `dblp_acm_records_csv` with the following columns: id, title, authors, venue, year, and source.

## Step 2: Add a machine learning transform
<a name="ml-transform-tutorial-create"></a>

Next, add a machine learning transform that is based on the schema of your data source table created by the crawler named `demo-crawl-dblp-acm`.

1. On the AWS Glue console, in the navigation pane under **Data Integration and ETL**, choose **Data classification tools > Record Matching**, then **Add transform**. Follow the wizard to create a `Find matches` transform with the following properties. 

   1. For **Transform name**, enter **demo-xform-dblp-acm**. This is the name of the transform that is used to find matches in the source data.

   1. For **IAM role**, choose an IAM role that has permission to the Amazon S3 source data, labeling file, and AWS Glue API operations. For more information, see [Create an IAM Role for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html) in the *AWS Glue Developer Guide*.

   1. For **Data source**, choose the table named **dblp\_acm\_records\_csv** in database **demo-db-dblp-acm**.

   1. For **Primary key**, choose the primary key column for the table, **id**.

1. In the wizard, choose **Finish** and return to the **ML transforms** list.

## Step 3: Teach your machine learning transform
<a name="ml-transform-tutorial-teach"></a>

Next, you teach your machine learning transform using the tutorial sample labeling file.

You can't use a machine learning transform in an extract, transform, and load (ETL) job until its status is **Ready for use**. To get your transform ready, you must teach it how to identify matching and nonmatching records by providing examples of matching and nonmatching records. To teach your transform, you can **Generate a labeling file**, add labels, and then **Upload labeling file**. In this tutorial, you can use the example labeling file named `dblp_acm_labels.csv`. For more information about the labeling process, see [Labeling](machine-learning.md#machine-learning-labeling).

1. On the AWS Glue console, in the navigation pane, choose **Record Matching**.

1. Choose the `demo-xform-dblp-acm` transform, and then choose **Action**, **Teach**. Follow the wizard to teach your `Find matches` transform. 

1. On the transform properties page, choose **I have labels**. Choose an Amazon S3 path to the sample labeling file in the current AWS Region. For example, for `us-east-1`, upload the provided labeling file from the Amazon S3 path `s3://ml-transforms-public-datasets-us-east-1/dblp-acm/labels/dblp_acm_labels.csv` with the option to **overwrite** existing labels. The labeling file must be located in Amazon S3 in the same Region as the AWS Glue console.

   When you upload a labeling file, a task is started in AWS Glue to add or overwrite the labels used to teach the transform how to process the data source.

1. On the final page of the wizard, choose **Finish**, and return to the **ML transforms** list.

## Step 4: Estimate the quality of your machine learning transform
<a name="ml-transform-tutorial-estimate-quality"></a>

Next, you can estimate the quality of your machine learning transform. The quality depends on how much labeling you have done. For more information about estimating quality, see [Estimate quality](console-machine-learning-transforms.md#console-machine-learning-transforms-metrics).

1. On the AWS Glue console, in the navigation pane under **Data Integration and ETL**, choose **Data classification tools > Record Matching**. 

1. Choose the `demo-xform-dblp-acm` transform, and choose the **Estimate quality** tab. This tab displays the current quality estimates, if available, for the transform. 

1. Choose **Estimate quality** to start a task to estimate the quality of the transform. The accuracy of the quality estimate is based on the labeling of the source data.

1. Navigate to the **History** tab. In this pane, task runs are listed for the transform, including the **Estimating quality** task. For more details about the run, choose **Logs**. Check that the run status is **Succeeded** when it finishes.

## Step 5: Add and run a job with your machine learning transform
<a name="ml-transform-tutorial-add-job"></a>

In this step, you use your machine learning transform to add and run a job in AWS Glue. When the transform `demo-xform-dblp-acm` is **Ready for use**, you can use it in an ETL job.

1. On the AWS Glue console, in the navigation pane, choose **Jobs**.

1. Choose **Add job**, and follow the steps in the wizard to create an ETL Spark job with a generated script. Choose the following property values for your transform:

   1. For **Name**, choose the example job in this tutorial, **demo-etl-dblp-acm**.

   1. For **IAM role**, choose an IAM role with permission to the Amazon S3 source data, labeling file, and AWS Glue API operations. For more information, see [Create an IAM Role for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html) in the *AWS Glue Developer Guide*.

   1. For **ETL language**, choose **Scala**. This is the programming language in the ETL script.

   1. For **Script file name**, choose **demo-etl-dblp-acm**. This is the file name of the Scala script (same as the job name).

   1. For **Data source**, choose **dblp\_acm\_records\_csv**. The data source you choose must match the machine learning transform data source schema.

   1. For **Transform type**, choose **Find matching records** to create a job using a machine learning transform.

   1. Clear **Remove duplicate records**. You don't want to remove duplicate records because the output records written have an additional `match_id` field added. 

   1. For **Transform**, choose **demo-xform-dblp-acm**, the machine learning transform used by the job.

   1. For **Create tables in your data target**, choose to create tables with the following properties:
      + **Data store type** — **Amazon S3**
      + **Format** — **CSV**
      + **Compression type** — **None**
      + **Target path** — The Amazon S3 path where the output of the job is written (in the current console AWS Region)

1. Choose **Save job and edit script** to display the script editor page.

1. Edit the script to add a statement that writes the job output to the **Target path** as a single partition file. Add this statement immediately after the statement that runs the `FindMatches` transform. The statement is similar to the following.

   ```
   val single_partition = findmatches1.repartition(1) 
   ```

   You must modify the `.writeDynamicFrame(findmatches1)` statement to write the output as `.writeDynamicFrame(single_partition)`. 

1. After you edit the script, choose **Save**. The modified script looks similar to the following code, but customized for your environment.

   ```
   import com.amazonaws.services.glue.GlueContext
   import com.amazonaws.services.glue.errors.CallSite
   import com.amazonaws.services.glue.ml.FindMatches
   import com.amazonaws.services.glue.util.GlueArgParser
   import com.amazonaws.services.glue.util.Job
   import com.amazonaws.services.glue.util.JsonOptions
   import org.apache.spark.SparkContext
   import scala.collection.JavaConverters._
   
   object GlueApp {
     def main(sysArgs: Array[String]) {
       val spark: SparkContext = new SparkContext()
       val glueContext: GlueContext = new GlueContext(spark)
       // @params: [JOB_NAME]
       val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
       Job.init(args("JOB_NAME"), glueContext, args.asJava)
       // @type: DataSource
       // @args: [database = "demo-db-dblp-acm", table_name = "dblp_acm_records_csv", transformation_ctx = "datasource0"]
       // @return: datasource0
       // @inputs: []
       val datasource0 = glueContext.getCatalogSource(database = "demo-db-dblp-acm", tableName = "dblp_acm_records_csv", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
       // @type: FindMatches
       // @args: [transformId = "tfm-123456789012", emitFusion = false, survivorComparisonField = "<primary_id>", transformation_ctx = "findmatches1"]
       // @return: findmatches1
       // @inputs: [frame = datasource0]
       val findmatches1 = FindMatches.apply(frame = datasource0, transformId = "tfm-123456789012", transformationContext = "findmatches1", computeMatchConfidenceScores = true)
     
     
       // Repartition the previous DynamicFrame into a single partition. 
       val single_partition = findmatches1.repartition(1)    
    
       
       // @type: DataSink
       // @args: [connection_type = "s3", connection_options = {"path": "s3://aws-glue-ml-transforms-data/sal"}, format = "csv", transformation_ctx = "datasink2"]
       // @return: datasink2
       // @inputs: [frame = findmatches1]
       val datasink2 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "s3://aws-glue-ml-transforms-data/sal"}"""), transformationContext = "datasink2", format = "csv").writeDynamicFrame(single_partition)
       Job.commit()
     }
   }
   ```

1. Choose **Run job** to start the job run. Check the status of the job in the jobs list. When the job finishes, a new **Run ID** row of type **ETL job** is added on the **History** tab of the ML transform.

1. Navigate to the **Jobs**, **History** tab. In this pane, job runs are listed. For more details about the run, choose **Logs**. Check that the run status is **Succeeded** when it finishes.

## Step 6: Verify output data from Amazon S3
<a name="ml-transform-tutorial-data-output"></a>

In this step, you check the output of the job run in the Amazon S3 bucket that you chose when you added the job. You can download the output file to your local machine and verify that matching records were identified.

1. Open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

1. Download the target output file of the job `demo-etl-dblp-acm`. Open the file in a spreadsheet application (you might need to add a `.csv` file extension for the file to open properly).

   The following image shows an excerpt of the output in Microsoft Excel.  
![\[Excel spreadsheet showing the output of the transform.\]](http://docs.aws.amazon.com/glue/latest/dg/images/demo_output_dblp_acm.png)

   The data source and target file both have 4,911 records. However, the `Find matches` transform adds another column named `match_id` to identify matching records in the output. Rows with the same `match_id` are considered matching records. The `match_confidence_score` is a number between 0 and 1 that provides an estimate of the quality of matches found by `Find matches`.

1. Sort the output file by `match_id` to easily see which records are matches. Compare the values in the other columns to see if you agree with the results of the `Find matches` transform. If you don't, you can continue to teach the transform by adding more labels. 

   You can also sort the file by another field, such as `title`, to see if records with similar titles have the same `match_id`. 
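The sorting and grouping described above can also be done programmatically. The following is a minimal sketch using hypothetical records; the real output file contains the full source schema plus the `match_id` and `match_confidence_score` columns that the transform adds.

```python
import csv
import io
from itertools import groupby

# Hypothetical excerpt of the job output (not real transform results).
output_csv = """id,title,match_id,match_confidence_score
conf/vldb/1,Query Optimization,7,0.94
journals/acm/9,On Query Optimization,7,0.91
conf/sigmod/4,Data Cleaning,12,0.88
"""

rows = list(csv.DictReader(io.StringIO(output_csv)))

# Sort by match_id so that matching records appear next to each other.
rows.sort(key=lambda r: int(r["match_id"]))

# Rows sharing a match_id form one cluster of matching records.
for match_id, group in groupby(rows, key=lambda r: r["match_id"]):
    titles = [r["title"] for r in group]
    print(match_id, titles)
```

Records that share a `match_id` are the ones the transform considers matches, so comparing titles within each printed group is a quick way to spot-check the results.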

# Finding incremental matches
<a name="machine-learning-incremental-matches"></a>

The Find matches feature allows you to identify duplicate or matching records in your dataset, even when the records don’t have a common unique identifier and no fields match exactly. The initial release of the Find matches transform identified matching records within a single dataset. When you added new data to the dataset, you had to merge it with the existing clean dataset and rerun matching against the complete merged dataset.

The incremental matching feature makes it simpler to match incremental records against existing matched datasets. Suppose that you want to match prospect data with existing customer datasets. The incremental match capability gives you the flexibility to match hundreds of thousands of new prospects with an existing database of prospects and customers by merging the results into a single database or table. By matching only between the new and existing datasets, the find incremental matches optimization reduces computation time, which also reduces cost.
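To see why matching only new-against-existing records saves work, consider a naive upper bound on the number of record pairs compared. The dataset sizes below are hypothetical, and the actual transform prunes the comparison space aggressively, so these counts are illustrative only:

```python
# Hypothetical dataset sizes.
existing = 1_000_000  # records in the already-matched dataset
new = 100_000         # incremental records

# Naive all-pairs comparison counts (upper bounds only; FindMatches
# does not actually compare every pair of records).
full_rematch_pairs = (existing + new) * (existing + new - 1) // 2
incremental_pairs = existing * new  # new records vs. existing records only

print(f"{incremental_pairs / full_rematch_pairs:.2%} of the full re-match space")
```

Even in this rough model, the incremental match touches a small fraction of the pairs that a full re-match of the merged dataset would.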

The usage of incremental matching is similar to Find matches as described in [Tutorial: Creating a machine learning transform with AWS Glue](machine-learning-transform-tutorial.md). This topic identifies only the differences with incremental matching.

For more information, see the blog post on [Incremental data matching](https://aws.amazon.com/blogs/big-data/incremental-data-matching-using-aws-lake-formation/).

## Running an incremental matching job
<a name="machine-learning-incremental-matches-add"></a>

For the following procedure, suppose the following: 
+ You have crawled the existing dataset into the table *first\_records*. The *first\_records* dataset must be a matched dataset, or the output of the matching job.
+ You have created and trained a Find matches transform with AWS Glue version 2.0. This is the only version of AWS Glue that supports incremental matches.
+ The ETL language is Scala. Note that Python is also supported.
+ The model already generated is called `demo-xform`.

1. Crawl the incremental dataset to the table *second\_records*.

1. On the AWS Glue console, in the navigation pane, choose **Jobs**.

1. Choose **Add job**, and follow the steps in the wizard to create an ETL Spark job with a generated script. Choose the following property values for your transform:

   1. For **Name**, choose **demo-etl**.

   1. For **IAM role**, choose an IAM role with permission to the Amazon S3 source data, labeling file, and [AWS Glue API operations](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html).

   1. For **ETL language**, choose **Scala**.

   1. For **Script file name**, choose **demo-etl**. This is the file name of the Scala script.

   1. For **Data source**, choose **first\_records**. The data source you choose must match the machine learning transform data source schema.

   1. For **Transform type**, choose **Find matching records** to create a job using a machine learning transform.

   1. Select the incremental matching option, and for **Data Source** select the table named **second\_records**.

   1. For **Transform**, choose **demo-xform**, the machine learning transform used by the job.

   1. Choose **Create tables in your data target** or **Use tables in the data catalog and update your data target**.

1. Choose **Save job and edit script** to display the script editor page.

1. Choose **Run job** to start the job run.

# Using FindMatches in a visual job
<a name="find-matches-visual-job"></a>

To use the **FindMatches** transform in AWS Glue Studio, you can use the **Custom Transform** node that invokes the FindMatches API. For more information on how to use a custom transform, see [Creating a custom transformation](https://docs.aws.amazon.com/glue/latest/ug/transforms-custom.html).

**Note**  
 Currently, the FindMatches API only works with `Glue 2.0`. In order to run a job with the Custom transform that invokes the FindMatches API, ensure the AWS Glue version is `Glue 2.0` in the **Job details** tab. If the version of AWS Glue is not `Glue 2.0`, the job will fail at runtime with the following error message: “cannot import name 'FindMatches' from 'awsglueml.transforms'”. 

## Prerequisites
<a name="adding-find-matches-to-a-visual-job-prerequisites"></a>
+  In order to use the **Find Matches** transform, open the AWS Glue Studio console at [https://console.aws.amazon.com/gluestudio/](https://console.aws.amazon.com/gluestudio/). 
+  Create a machine learning transform. When created, a transformId is generated. You will need this ID for the steps below. For more information on how to create a machine learning transform, see [Adding and editing machine learning transforms](https://docs.aws.amazon.com/glue/latest/dg/console-machine-learning-transforms.html#console-machine-learning-transforms-actions). 

## Adding a FindMatches transform
<a name="adding-find-matches-to-a-visual-job"></a>

**To add a FindMatches transform:**

1.  In the AWS Glue Studio job editor, open the Resource panel by choosing the cross symbol in the upper left-hand corner of the visual job graph, and then choose a data source from the **Data** tab. This is the data source you want to check for matches.  
![\[The screenshot shows a cross symbol inside a circle. When you click on this in the visual job editor, the resource panel opens.\]](http://docs.aws.amazon.com/glue/latest/dg/images/resource-panel-blank-canvas.png)

1.  Choose the data source node, then open the Resource panel by choosing the cross symbol in the upper left-hand corner of the visual job graph and search for 'custom transform'. Choose the **Custom Transform** node to add it to the graph. The **Custom Transform** node should be linked to the data source node. If it is not, choose the **Custom Transform** node, choose the **Node properties** tab, and then under **Node parents**, choose the data source. 

1.  Choose the **Custom Transform** node in the visual graph, then choose the **Node properties** tab and name the custom transform. Renaming the transform makes it easier to identify in the visual graph. 

1.  Choose the **Transform** tab, where you can edit the code block. This is where the code to invoke the FindMatches API can be added.   
![\[The screenshot shows the code block in the Transform tab when the Custom Transform node is selected.\]](http://docs.aws.amazon.com/glue/latest/dg/images/custom-transform-code-block.png)

    The code block contains pre-populated code to get you started. Overwrite the pre-populated code with the following template, replacing the **transformId** placeholder with your transform ID. 

   ```
   def MyTransform (glueContext, dfc) -> DynamicFrameCollection:
       dynf = dfc.select(list(dfc.keys())[0])
       from awsglueml.transforms import FindMatches
       findmatches = FindMatches.apply(frame = dynf, transformId = "<your id>")
       return(DynamicFrameCollection({"FindMatches": findmatches}, glueContext))
   ```

1.  Choose the **Custom Transform** node in the visual graph, then open the Resource panel by choosing the cross symbol in the upper left-hand corner of the visual job graph and search for 'Select From Collection'. Add the **Select From Collection** node to the graph. There is no need to change the default selection, because there is only one DynamicFrame in the collection. 

1.  You can continue adding transformations, or store the result, which is now enriched with the additional Find matches columns. If you want to reference those new columns in downstream transforms, you need to add them to the transform output schema. The easiest way to do that is to choose the **Data preview** tab and then, on the schema tab, choose **Use datapreview schema**. 

1.  To customize FindMatches, you can add additional parameters to pass to the 'apply' method. See [FindMatches class](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-findmatches.html). 

## Adding an incremental FindMatches transform
<a name="find-matches-incrementally-visual-job"></a>

 In the case of incremental matches, the process is the same as [Adding a FindMatches transform](#adding-find-matches-to-a-visual-job), with the following differences: 
+  Instead of a parent node for the custom transform, you need two parent nodes. 
+  The first parent node should be the dataset. 
+  The second parent node should be the incremental dataset. 

   Replace the `transformId` placeholder with your transform ID in the template code block: 

  ```
  def MyTransform (glueContext, dfc) -> DynamicFrameCollection:
      dfs = list(dfc.values())
      dynf = dfs[0]
      inc_dynf = dfs[1]
      from awsglueml.transforms import FindIncrementalMatches
      findmatches = FindIncrementalMatches.apply(existingFrame = dynf, incrementalFrame = inc_dynf,
                                      transformId = "<your id>")
      return(DynamicFrameCollection({"FindMatches": findmatches}, glueContext))
  ```
+  For optional parameters, see [FindIncrementalMatches class](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-findincrementalmatches.html). 

# Migrate Apache Spark programs to AWS Glue
<a name="glue-author-migrate-apache-spark"></a>

Apache Spark is an open-source platform for distributed computing workloads performed on large datasets. AWS Glue leverages Spark's capabilities to provide an optimized experience for ETL. You can migrate Spark programs to AWS Glue to take advantage of our features. AWS Glue provides the same performance enhancements you would expect from Apache Spark on Amazon EMR.

## Run Spark code
<a name="glue-author-migrate-apache-spark-run"></a>

Native Spark code can be run in an AWS Glue environment out of the box. Scripts are often developed by iteratively changing a piece of code, a workflow suited to an interactive session. However, existing code is better suited to run in an AWS Glue job, which allows you to schedule runs and consistently get logs and metrics for each script run. You can upload and edit an existing script through the console. 

1. Acquire the source code for your script. For this example, use the [Binarizer Example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/binarizer_example.py) script from the Apache Spark repository. 

1. In the AWS Glue console, expand the left-side navigation pane and select **ETL** > **Jobs**.

   In the **Create job** panel, select **Spark script editor**. An **Options** section will appear. Under **Options**, select **Upload and edit an existing script**.

   A **File upload** section will appear. Under **File upload**, click **Choose file**. Your system file chooser will appear. Navigate to the location where you saved `binarizer_example.py`, select it and confirm your selection.

   A **Create** button will appear on the header for the **Create job** panel. Click it.  
![\[The AWS Glue Studio Jobs page with Spark script editor pane selected.\]](http://docs.aws.amazon.com/glue/latest/dg/images/migrate-apache-spark-01-upload-job.png)

1. Your browser will navigate to the script editor. On the header, click the **Job details** tab. Set the **Name** and **IAM Role**. For guidance around AWS Glue IAM roles, consult [Setting up IAM permissions for AWS Glue](set-up-iam.md).

   Optionally, set **Requested number of workers** to `2` and **Number of retries** to `1`. These options are valuable when running production jobs, but turning them down will streamline your experience while testing out a feature.

   In the title bar, click **Save**, then **Run**.  
![\[The job details page with options set as instructed.\]](http://docs.aws.amazon.com/glue/latest/dg/images/migrate-apache-spark-02-job-details.png)

1. Navigate to the **Runs** tab. You will see a panel corresponding to your job run. Wait a few minutes and the page should automatically refresh to show **Succeeded** under **Run status**.  
![\[The job runs page with a successful job run.\]](http://docs.aws.amazon.com/glue/latest/dg/images/migrate-apache-spark-03-job-runs.png)

1. You will want to examine your output to confirm that the Spark script ran as intended. This Apache Spark sample script should write a string to the output stream. You can find that by navigating to **Output logs** under **CloudWatch logs** in the panel for the successful job run. Note the job run ID, a generated ID under the **Id** label beginning with `jr_`.

   This will open the CloudWatch console, set to visualize the contents of the default AWS Glue log group `/aws-glue/jobs/output`, filtered to the contents of the log streams for the job run ID. Each worker will have generated a log stream, shown as rows under **Log streams**. One worker should have run the requested code. You will need to open all the log streams to identify the correct worker. Once you find the right worker, you should see the output of the script, as seen in the following image:   
![\[The CloudWatch console page with the Spark program output.\]](http://docs.aws.amazon.com/glue/latest/dg/images/migrate-apache-spark-04-log-output.png)

## Common procedures needed for migrating Spark programs
<a name="glue-author-migrate-apache-spark-migrate"></a>

### Assess Spark version support
<a name="glue-author-migrate-apache-spark-migrate-versions"></a>

 AWS Glue release versions define the version of Apache Spark and Python available to the AWS Glue job. You can find our AWS Glue versions and what they support at [AWS Glue versions](release-notes.md#release-notes-versions). You may need to update your Spark program to be compatible with a newer version of Spark in order to access certain AWS Glue features.

### Include third-party libraries
<a name="glue-author-migrate-apache-spark-third-party-libraries"></a>

Many existing Spark programs will have dependencies, both on private and public artifacts. AWS Glue supports JAR-style dependencies for Scala jobs, as well as wheel and pure-Python source dependencies for Python jobs.

**Python** - For information about Python dependencies, see [Using Python libraries with AWS Glue](aws-glue-programming-python-libraries.md).

Common Python dependencies are provided in the AWS Glue environment, including the commonly requested [Pandas](https://pandas.pydata.org/) library. These dependencies are included in AWS Glue version 2.0+. For more information about provided modules, see [Python modules already provided in AWS Glue](aws-glue-programming-python-libraries.md#glue-modules-provided). If you need to supply a job with a different version of a dependency that is included by default, you can use `--additional-python-modules`. For information about job arguments, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

You can supply additional Python dependencies with the `--extra-py-files` job argument. If you are migrating a job from a Spark program, this parameter is a good option because it is functionally equivalent to the `--py-files` flag in PySpark, and is subject to the same limitations. For more information about the `--extra-py-files` parameter, see [Including Python files with PySpark native features](aws-glue-programming-python-libraries.md#extra-py-files-support).

For new jobs, you can manage Python dependencies with the `--additional-python-modules` job argument. Using this argument allows for a more thorough dependency management experience. This parameter supports Wheel style dependencies, including those with native code bindings compatible with Amazon Linux 2.
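As a sketch, both dependency parameters are ordinary job arguments and can be supplied as `DefaultArguments` when you create or update the job, or as `Arguments` on an individual run. The bucket path and module pins below are hypothetical:

```python
import json

# Hypothetical job arguments: pin module versions with
# --additional-python-modules and ship an extra Python file with
# --extra-py-files. Passed as DefaultArguments on the job, or as
# Arguments on an individual job run.
default_arguments = {
    "--additional-python-modules": "pandas==1.5.3,pyarrow",
    "--extra-py-files": "s3://my-bucket/libs/helpers.py",
}

# DefaultArguments is a string-to-string map, serialized as JSON in
# API calls.
print(json.dumps(default_arguments, indent=2))
```

Note that the value of `--additional-python-modules` is a single comma-delimited string of pip-style requirements, not a list.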

**Scala**

You can supply additional Scala dependencies with the `--extra-jars` job argument. Dependencies must be hosted in Amazon S3, and the argument value should be a comma-delimited list of Amazon S3 paths with no spaces. You may find it easier to manage your configuration by rebundling your dependencies before hosting and configuring them. AWS Glue JAR dependencies contain Java bytecode, which can be generated from any JVM language. You can use other JVM languages, such as Java, to write custom dependencies.
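Because the argument value is a single comma-delimited string, it can help to build it programmatically when you have several JARs. A small sketch with hypothetical S3 paths:

```python
# Hypothetical JAR locations in Amazon S3. The --extra-jars value must
# be a comma-delimited list of S3 paths with no spaces.
jars = [
    "s3://my-bucket/deps/my-lib.jar",
    "s3://my-bucket/deps/my-other-lib.jar",
]
extra_jars_value = ",".join(jars)
print(extra_jars_value)
```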

### Manage data source credentials
<a name="glue-author-migrate-apache-spark-credential-management"></a>

Existing Spark programs may come with complex or custom configuration to pull data from their datasources. Common datasource auth flows are supported by AWS Glue connections. For more information about AWS Glue connections, see [Connecting to data](glue-connections.md).

AWS Glue connections facilitate connecting your Job to a variety of types of data stores in two primary ways: through method calls to our libraries and setting the **Additional network connection** in the AWS console. You may also call the AWS SDK from within your job to retrieve information from a connection. 

 **Method calls** – AWS Glue connections are tightly integrated with the AWS Glue Data Catalog, a service that allows you to curate information about your datasets, and the methods available to interact with AWS Glue connections reflect that. If you have an existing auth configuration you would like to reuse, for JDBC connections, you can access your AWS Glue connection configuration through the `extract_jdbc_conf` method on the `GlueContext`. For more information, see [extract\_jdbc\_conf](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-extract_jdbc_conf). 

**Console configuration** – AWS Glue jobs use associated AWS Glue connections to configure connections to Amazon VPC subnets. If you directly manage your security materials, you may need to provide a `NETWORK` type **Additional network connection** in the AWS console to configure routing. For more information about the AWS Glue connection API, see [Connections API](aws-glue-api-catalog-connections.md).

If your Spark program has a custom or uncommon auth flow, you may need to manage your security materials in a hands-on fashion. If AWS Glue connections do not seem like a good fit, you can securely host security materials in Secrets Manager and access them through boto3 or the AWS SDK, which are provided in the job.

### Configure Apache Spark
<a name="glue-author-migrate-apache-spark-spark-configuration"></a>

Complex migrations often alter Spark configuration to accommodate their workloads. Modern versions of Apache Spark allow runtime configuration to be set with the `SparkSession`. AWS Glue 3.0+ jobs are provided a `SparkSession`, which can be modified to set runtime configuration. For more information, see [Apache Spark Configuration](https://spark.apache.org/docs/latest/configuration.html). Tuning Spark is complex, and AWS Glue does not guarantee support for setting all Spark configuration. If your migration requires substantial Spark-level configuration, contact support.

### Set custom configuration
<a name="glue-author-migrate-apache-spark-custom-configuration"></a>

Migrated Spark programs may be designed to take custom configuration. AWS Glue allows configuration to be set on the job and job run level, through the job arguments. For information about job arguments, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md). You can access job arguments within the context of a job through our libraries. AWS Glue provides a utility function to provide a consistent view between arguments set on the job and arguments set on the job run. See [Accessing parameters using `getResolvedOptions`](aws-glue-api-crawler-pyspark-extensions-get-resolved-options.md) in Python and [AWS Glue Scala GlueArgParser APIs](glue-etl-scala-apis-glue-util-glueargparser.md) in Scala.
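As an illustration of the pattern (not the actual AWS Glue implementation), `getResolvedOptions` behaves roughly like resolving `--name value` pairs from the argument vector. The stand-in below mimics that behavior so the flow can be shown without an AWS Glue environment; the `target_path` parameter name is hypothetical:

```python
import argparse

def get_resolved_options(argv, option_names):
    # Minimal stand-in for awsglue.utils.getResolvedOptions: resolve
    # "--name value" pairs for the requested option names, ignoring
    # any other arguments in the vector.
    parser = argparse.ArgumentParser()
    for name in option_names:
        parser.add_argument(f"--{name}", required=True)
    parsed, _ = parser.parse_known_args(argv)
    return vars(parsed)

# In a real job, AWS Glue supplies sys.argv with the job arguments,
# merged from DefaultArguments and the run's Arguments.
args = get_resolved_options(
    ["--JOB_NAME", "demo-job", "--target_path", "s3://my-bucket/out/"],
    ["JOB_NAME", "target_path"],
)
print(args["target_path"])
```

In a real script, you would call `getResolvedOptions(sys.argv, [...])` and read the resulting dictionary the same way.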

### Migrate Java code
<a name="glue-author-migrate-apache-spark-java-code"></a>

As explained in [Include third-party libraries](#glue-author-migrate-apache-spark-third-party-libraries), your dependencies can contain classes generated by JVM languages, such as Java or Scala. Your dependencies can include a `main` method. You can use a `main` method in a dependency as the entrypoint for a AWS Glue Scala job. This allows you to write your `main` method in Java, or reuse a `main` method packaged to your own library standards. 

To use a `main` method from a dependency, perform the following: clear the contents of the editing pane providing the default `GlueApp` object, then provide the fully qualified name of a class in a dependency as a job argument with the key `--class`. You should then be able to trigger a job run.

You cannot configure the order or structure of the arguments AWS Glue passes to the `main` method. If your existing code needs to read configuration set in AWS Glue, this will likely cause incompatibility with prior code. If you use `getResolvedOptions`, you will also not have a good place to call this method. Consider invoking your dependency directly from a main method generated by AWS Glue. The following AWS Glue ETL script shows an example of this.

```
import com.amazonaws.services.glue.util.GlueArgParser

object GlueApp {
  def main(sysArgs: Array[String]) {
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    
    // Invoke static method from JAR. Pass some sample arguments as a String[], one defined inline and one taken from the job arguments, using getResolvedOptions
    com.mycompany.myproject.MyClass.myStaticPublicMethod(Array("string parameter1", args("JOB_NAME")))
    
    // Alternatively, invoke a non-static public method.
    (new com.mycompany.myproject.MyClass).someMethod()
  }
}
```