

# Working with jobs in AWS Glue
<a name="author-glue-job"></a>

The following sections provide information on Spark, Ray, and Python shell jobs in AWS Glue.

**Topics**
+ [AWS Glue versions](release-notes.md)
+ [Working with Spark jobs in AWS Glue](etl-jobs-section.md)
+ [Working with Ray jobs in AWS Glue](ray-jobs-section.md)
+ [Configuring job properties for Python shell jobs in AWS Glue](add-job-python.md)
+ [Monitoring AWS Glue](monitor-glue.md)

# AWS Glue versions
<a name="release-notes"></a>

You can configure the AWS Glue version parameter when you add or update a job. The AWS Glue version determines the versions of Apache Spark and Python that AWS Glue supports. The Python version indicates the version that's supported for jobs of type Spark. The following table lists the available AWS Glue versions, the corresponding Spark and Python versions, and other changes in functionality.

You can use the [Generative AI upgrades for Apache Spark](upgrade-analysis.md) feature to upgrade your AWS Glue ETL jobs from older AWS Glue versions (2.0 and later) to the latest AWS Glue version.

## AWS Glue versions
<a name="release-notes-versions"></a>

<a name="table-glue-versions"></a>[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/release-notes.html)

**Note**  
The following AWS Glue versions support these Python shell versions:  
Python shell 3.6 is supported in AWS Glue version 1.0.
Python shell 3.9 is supported in AWS Glue version 3.0.
Additionally, development endpoints are supported only in AWS Glue versions 0.9 and 1.0.

# AWS Glue version support policy
<a name="glue-version-support-policy"></a>

AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. An *AWS Glue job* contains the business logic that performs the data integration work in AWS Glue. There are three types of jobs in AWS Glue: *Spark (batch and streaming)*, *Ray*, and *Python shell*. When you define your job, you specify the AWS Glue version, which configures the versions of the underlying Spark, Ray, or Python runtime environment. For example, an AWS Glue version 5.0 Spark job supports Spark 3.5.4 and Python 3.11.

## Support policy
<a name="glue-version-support-policy-milestones"></a>

AWS Glue versions are built around a combination of operating system, programming language, and software libraries that are subject to maintenance and security updates. AWS Glue's version support policy is to end support for a version when any major component of the version reaches the end of community long-term support (LTS) and security updates are no longer available. AWS Glue's version support policy includes the following statuses: 

**End of Support (EOS) -** When an AWS Glue version reaches EOS:
+ AWS Glue will no longer apply security patches or other updates to EOS versions.
+ AWS Glue jobs on EOS versions are not eligible for technical support.
+ AWS Glue may not honor SLAs when jobs are run on EOS versions.

**End of Life (EOL) -** When an AWS Glue version reaches EOL:
+ You can no longer create new AWS Glue jobs or interactive sessions on EOL versions.
+ You can no longer start job runs on these AWS Glue versions.
+ AWS Glue will stop existing job runs and interactive sessions on EOL versions.
+ EOL versions will be removed from AWS Glue SDKs and APIs.

The following AWS Glue versions have reached end of support and will no longer be available after the end of life date. Changes to a version’s support status start at midnight (Pacific time zone) on the specified date.


| **Type** | **Version** | **End of support** | **End of life** | 
| --- | --- | --- | --- | 
| Spark | Glue version 0.9 (Spark 2.2, Scala 2, Python 2) | 6/1/2022 | 4/1/2026 | 
| Spark | Glue version 1.0 (Spark 2.4, Python 2) | 6/1/2022 | 4/1/2026 | 
| Spark | Glue version 1.0 (Spark 2.4, Scala 2, Python 3) | 9/30/2022 | 4/1/2026 | 
| Spark | Glue version 2.0 (Spark 2.4, Python 3) | 1/31/2024 | 4/1/2026 | 
| Python shell | Python 2 (AWS Glue Version 1.0) | 6/1/2022 | 4/1/2026 | 
| Python shell | PythonShell 3.6 (Glue version 1.0) | 3/31/2026 | NA | 
| Development endpoint | Zeppelin notebook | 9/30/2022 | NA | 

**Note**  
Creating new AWS Glue Python shell 3.6 jobs will not be allowed after end of support is reached on March 31, 2026, but you can continue to update and run existing jobs. However, jobs running on discontinued versions are not eligible for technical support. AWS Glue will not apply security patches or other updates to discontinued versions, and will not honor SLAs when jobs are run on them. 

AWS strongly recommends that you migrate your jobs to supported versions.

For information on migrating your Spark jobs to the latest AWS Glue version, see [Migrating AWS Glue jobs to AWS Glue version 5.1](https://docs.aws.amazon.com/glue/latest/dg/migrating-version-51.html). 

For migrating your Python shell jobs to the latest AWS Glue version:
+ In the console, choose `Python 3 (Glue Version 4.0)`.
+ In the [CreateJob](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateJob.html)/[UpdateJob](https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateJob.html) API, set the `GlueVersion` parameter to `2.0`, and the `PythonVersion` to `3` under the `Command` parameter. The `GlueVersion` configuration does not affect the behavior of Python shell jobs, so there is no advantage to incrementing `GlueVersion`.
+ You need to make your job script compatible with Python 3.
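As a sketch, the API-based migration above amounts to an `UpdateJob` request with `GlueVersion` set to `2.0` and `PythonVersion` set to `3`. The job name and script location below are placeholders; the boto3 call itself is shown only as a comment.

```python
def build_pythonshell_update(job_name: str) -> dict:
    """Build keyword arguments for glue.update_job() per the guidance above."""
    return {
        "JobName": job_name,
        "JobUpdate": {
            "GlueVersion": "2.0",
            "Command": {
                "Name": "pythonshell",
                "PythonVersion": "3",
                # Keep your existing script location; this path is a placeholder.
                "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/job.py",
            },
        },
    }

args = build_pythonshell_update("my-python-shell-job")
# With boto3, this would be passed as:
#   boto3.client("glue").update_job(**args)
```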

# Migrating AWS Glue for Spark jobs to AWS Glue version 5.1
<a name="migrating-version-51"></a>

This topic describes the changes between AWS Glue versions 0.9, 1.0, 2.0, 3.0, 4.0, and 5.0 to allow you to migrate your Spark applications and ETL jobs to AWS Glue 5.1. It also describes the features in AWS Glue 5.1 and the advantages of using it. 

To use this feature with your AWS Glue ETL jobs, choose **5.1** for the `Glue version` when creating your jobs.

**Topics**
+ [New features](#migrating-version-51-features)
+ [Actions to migrate to AWS Glue 5.1](#migrating-version-51-actions)
+ [Migration checklist](#migrating-version-51-checklist)
+ [Migrating from AWS Glue 5.0 to AWS Glue 5.1](#migrating-version-51-from-50)
+ [Migrating from older AWS Glue versions to AWS Glue 5.1](#migrating-older-versions-to-51)
+ [Connector and JDBC driver migration for AWS Glue 5.1](#migrating-version-51-connector-driver-migration)

## New features
<a name="migrating-version-51-features"></a>

This section describes new features and advantages of AWS Glue version 5.1.
+ Apache Spark update from 3.5.4 in AWS Glue 5.0 to 3.5.6 in AWS Glue 5.1.
+ Open Table Formats (OTF) updated to Hudi 1.0.2, Iceberg 1.10.0, and Delta Lake 3.3.2.
+ **Iceberg Materialized Views** - Create and manage Iceberg Materialized Views (MV). For more information, see the [blog post](https://aws.amazon.com/blogs/big-data/introducing-apache-iceberg-materialized-views-in-aws-glue-data-catalog/).
+ **Iceberg format version 3.0** - Extends data types and existing metadata structures to add new capabilities. For more information, see the [Iceberg Table Spec](https://iceberg.apache.org/spec/). 
+ **Hudi Full Table Access** - Full Table Access (FTA) control for Apache Hudi in Apache Spark based on your policies defined in AWS Lake Formation. This feature enables read and write operations from your AWS Glue ETL jobs on AWS Lake Formation registered tables when the job role has full table access.
+ **Spark native fine-grained access control (FGAC) support using AWS Lake Formation** - DDL/DML operations (like CREATE, ALTER, DELETE, DROP) with fine-grained access control for Apache Hive, Apache Iceberg, and Delta Lake tables registered in AWS Lake Formation.
+ **Audit context for Spark jobs** - Audit context for AWS Glue ETL jobs will be available for AWS Glue and AWS Lake Formation API calls in the AWS CloudTrail logs.

**Known Issues and Limitations**  
Note the following known issues and limitations:
+ Limited support for the view SQL clause for creation of materialized views, query rewrite, and incremental refresh. More details can be found in the [Iceberg Materialized Views feature documentation page](https://docs.aws.amazon.com/lake-formation/latest/dg/materialized-views.html#materialized-views-considerations-limitations).
+ **Hudi FTA writes** require using HoodieCredentialedHadoopStorage for credential vending during job execution. Set the following configuration when running Hudi jobs:

  `hoodie.storage.class=org.apache.spark.sql.hudi.storage.HoodieCredentialedHadoopStorage` 
+ Hudi FTA write support works only with the default Hudi configurations. Custom or non-default Hudi settings may not be fully supported and could result in unexpected behavior. Clustering for Hudi Merge-On-Read (MOR) tables is also not supported under FTA write mode.
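As an illustrative sketch, the Hudi FTA storage configuration noted above can be rendered as a spark-submit style `--conf` argument. Only the configuration key and value come from the documentation above; the helper function is an assumption for illustration, not part of any AWS Glue or Hudi API.

```python
# The one configuration required for Hudi FTA writes, per the note above.
HUDI_FTA_CONF = {
    "hoodie.storage.class":
        "org.apache.spark.sql.hudi.storage.HoodieCredentialedHadoopStorage",
}

def to_conf_args(conf: dict) -> list:
    """Render a conf dict as spark-submit style --conf arguments (sketch)."""
    return [f"--conf {key}={value}" for key, value in conf.items()]
```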

**Breaking changes**  
Note the following breaking changes:
+ The S3A file system has replaced EMRFS as the default S3 connector. For information on how to migrate, see [Migrating from AWS Glue 5.0 to AWS Glue 5.1](#migrating-version-51-from-50). 

## Actions to migrate to AWS Glue 5.1
<a name="migrating-version-51-actions"></a>

For existing jobs, change the `Glue version` from the previous version to `Glue 5.1` in the job configuration.
+ In AWS Glue Studio, choose `Glue 5.1 - Supports Spark 3.5.6, Scala 2, Python 3` in `Glue version`.
+ In the API, choose **5.1** in the `GlueVersion` parameter in the [https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-UpdateJob](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-UpdateJob) API operation.

For new jobs, choose `Glue 5.1` when you create a job.
+ In the console, choose `Spark 3.5.6, Python 3 (Glue Version 5.1)` or `Spark 3.5.6, Scala 2 (Glue Version 5.1)` in `Glue version`.
+ In AWS Glue Studio, choose `Glue 5.1 - Supports Spark 3.5.6, Scala 2, Python 3` in `Glue version`.
+ In the API, choose **5.1** in the `GlueVersion` parameter in the [https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-CreateJob](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-CreateJob) API operation.
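The `CreateJob` path above can be sketched as a minimal request payload for a Glue 5.1 Spark job. The job name, role ARN, and script location are placeholders, and the boto3 call is shown only as a comment.

```python
def build_glue51_job(name: str, role_arn: str, script_location: str) -> dict:
    """Build keyword arguments for glue.create_job() targeting Glue 5.1."""
    return {
        "Name": name,
        "Role": role_arn,
        "GlueVersion": "5.1",
        "Command": {
            "Name": "glueetl",  # Spark ETL job type
            "PythonVersion": "3",
            "ScriptLocation": script_location,
        },
    }

req = build_glue51_job(
    "nightly-etl",
    "arn:aws:iam::123456789012:role/GlueJobRole",
    "s3://amzn-s3-demo-bucket/scripts/etl.py",
)
# boto3.client("glue").create_job(**req)
```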

To view Spark event logs of AWS Glue 5.1 coming from AWS Glue 2.0 or earlier, [launch an upgraded Spark history server for AWS Glue 5.1 using CloudFormation or Docker](https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html).

## Migration checklist
<a name="migrating-version-51-checklist"></a>

Review this checklist for migration:
+ [Python] Update boto references from 1.34 to 1.40.

## Migrating from AWS Glue 5.0 to AWS Glue 5.1
<a name="migrating-version-51-from-50"></a>

All existing job parameters and major features that exist in AWS Glue 5.0 will exist in AWS Glue 5.1. Note the following changes when migrating:
+ In AWS Glue 5.1, the S3A file system has replaced EMRFS as the default S3 connector. If neither `spark.hadoop.fs.s3a.endpoint` nor `spark.hadoop.fs.s3a.endpoint.region` is set, the default region used by S3A is `us-east-2`. This can cause issues, such as S3 upload timeout errors, especially for VPC jobs. To mitigate these issues, set the `spark.hadoop.fs.s3a.endpoint.region` Spark configuration when using the S3A file system in AWS Glue 5.1.
+ To continue using EMRFS instead of S3A, set the following Spark configurations:

  ```
      --conf spark.hadoop.fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem
      --conf spark.hadoop.fs.s3n.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem
      --conf spark.hadoop.fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3.EMRFSDelegate
  ```
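As a rough sketch, these three settings can be packed into the job's `--conf` default argument, using the chained `key=value --conf key=value` form that AWS Glue accepts for multiple Spark configurations. The helper below is illustrative; only the configuration keys and values come from the fence above.

```python
# EMRFS fallback settings from the migration note above.
EMRFS_CONFS = {
    "spark.hadoop.fs.s3.impl": "com.amazon.ws.emr.hadoop.fs.EmrFileSystem",
    "spark.hadoop.fs.s3n.impl": "com.amazon.ws.emr.hadoop.fs.EmrFileSystem",
    "spark.hadoop.fs.AbstractFileSystem.s3.impl":
        "org.apache.hadoop.fs.s3.EMRFSDelegate",
}

def pack_conf(confs: dict) -> str:
    """Chain key=value pairs into a single --conf job-parameter value."""
    return " --conf ".join(f"{k}={v}" for k, v in confs.items())

# The packed string goes under the job's --conf default argument.
default_args = {"--conf": pack_conf(EMRFS_CONFS)}
```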

Refer to the Spark migration documentation:
+ [Migration Guide: Spark Core](https://spark.apache.org/docs/3.5.6/core-migration-guide.html)
+ [Migration Guide: SQL, Datasets and DataFrame](https://spark.apache.org/docs/3.5.6/sql-migration-guide.html)
+ [Migration Guide: Structured Streaming](https://spark.apache.org/docs/3.5.6/ss-migration-guide.html)
+ [Upgrading PySpark](https://spark.apache.org/docs/3.5.6/api/python/migration_guide/pyspark_upgrade.html)

## Migrating from older AWS Glue versions to AWS Glue 5.1
<a name="migrating-older-versions-to-51"></a>
+ For migration steps related to AWS Glue 4.0 to AWS Glue 5.0, see [Migrating from AWS Glue 4.0 to AWS Glue 5.0](https://docs.aws.amazon.com/glue/latest/dg/migrating-version-50.html#migrating-version-50-from-40).
+ For migration steps related to AWS Glue 3.0 to AWS Glue 5.0, see [Migrating from AWS Glue 3.0 to AWS Glue 5.0](https://docs.aws.amazon.com/glue/latest/dg/migrating-version-50.html#migrating-version-50-from-30).
+ For migration steps related to AWS Glue 2.0 to AWS Glue 5.0 and a list of migration differences between AWS Glue version 2.0 and 4.0, see [Migrating from AWS Glue 2.0 to AWS Glue 5.0](https://docs.aws.amazon.com/glue/latest/dg/migrating-version-50.html#migrating-version-50-from-20). 

## Connector and JDBC driver migration for AWS Glue 5.1
<a name="migrating-version-51-connector-driver-migration"></a>

For the versions of JDBC and data lake connectors that were upgraded, see:
+ [Appendix B: JDBC driver upgrades](#migrating-version-51-appendix-jdbc-driver)
+ [Appendix C: Connector upgrades](#migrating-version-51-appendix-connector)
+ [Appendix D: Open table format upgrades](#migrating-version-51-appendix-open-table-formats)

The following changes apply to the OTF version upgrades identified in [Appendix D: Open table format upgrades](#migrating-version-51-appendix-open-table-formats) for AWS Glue 5.1.

**Apache Hudi**  
Note the following changes:
+ Support FTA read and write access on Lake Formation registered tables.

**Apache Iceberg**  
Note the following changes:
+ Support Iceberg format version 3. The following features are supported:
  + Multi-argument transforms for partitioning and sorting.
  + Row lineage tracking.
  + Deletion vectors. For more information, see the [blog post](https://aws.amazon.com/blogs/big-data/unlock-the-power-of-apache-iceberg-v3-deletion-vectors-on-amazon-emr/).
  + Table encryption keys.
  + Default value support for columns.
+ Support Spark-native FGAC writes on AWS Lake Formation registered tables.
+ Athena SQL compatibility - Athena cannot read Iceberg V3 tables created by EMR Spark, and fails with the error: `GENERIC_INTERNAL_ERROR: Cannot read unsupported version 3`.

**Delta Lake**  
Note the following changes:
+ Support FTA read and write access on Lake Formation registered tables.

### Appendix A: Notable dependency upgrades
<a name="migrating-version-51-appendix-dependencies"></a>

The following are dependency upgrades:


| Dependency | Version in AWS Glue 5.1 | Version in AWS Glue 5.0 | Version in AWS Glue 4.0 | Version in AWS Glue 3.0 | Version in AWS Glue 2.0 | Version in AWS Glue 1.0 | 
| --- | --- | --- | --- | --- | --- | --- | 
| Java | 17 | 17 | 8 | 8 | 8 | 8 | 
| Spark | 3.5.6 | 3.5.4 | 3.3.0-amzn-1 | 3.1.1-amzn-0 | 2.4.3 | 2.4.3 | 
| Hadoop | 3.4.1 | 3.4.1 | 3.3.3-amzn-0 | 3.2.1-amzn-3 | 2.8.5-amzn-5 | 2.8.5-amzn-1 | 
| Scala | 2.12.18 | 2.12.18 | 2.12 | 2.12 | 2.11 | 2.11 | 
| Jackson | 2.15.2 | 2.15.2 | 2.12 | 2.12 | 2.11 | 2.11 | 
| Hive | 2.3.9-amzn-4 | 2.3.9-amzn-4 | 2.3.9-amzn-2 | 2.3.7-amzn-4 | 1.2 | 1.2 | 
| EMRFS | 2.73.0 | 2.69.0 | 2.54.0 | 2.46.0 | 2.38.0 | 2.30.0 | 
| Json4s | 3.7.0-M11 | 3.7.0-M11 | 3.7.0-M11 | 3.6.6 | 3.5.x | 3.5.x | 
| Arrow | 12.0.1 | 12.0.1 | 7.0.0 | 2.0.0 | 0.10.0 | 0.10.0 | 
| AWS Glue Data Catalog client | 4.9.0 | 4.5.0 | 3.7.0 | 3.0.0 | 1.10.0 | N/A | 
| AWS SDK for Java | 2.35.5 | 2.29.52 | 1.12 | 1.12 |  |  | 
| Python | 3.11 | 3.11 | 3.10 | 3.7 | 2.7 & 3.6 | 2.7 & 3.6 | 
| Boto | 1.40.61 | 1.34.131 | 1.26 | 1.18 | 1.12 | N/A | 
| EMR DynamoDB connector | 5.7.0 | 5.6.0 | 4.16.0 |  |  |  | 

### Appendix B: JDBC driver upgrades
<a name="migrating-version-51-appendix-jdbc-driver"></a>

The following are JDBC driver upgrades:


| Driver | JDBC driver version in AWS Glue 5.1 | JDBC driver version in AWS Glue 5.0 | JDBC driver version in AWS Glue 4.0 | JDBC driver version in AWS Glue 3.0 | JDBC driver version in past AWS Glue versions | 
| --- | --- | --- | --- | --- | --- | 
| MySQL | 8.0.33 | 8.0.33 | 8.0.23 | 8.0.23 | 5.1 | 
| Microsoft SQL Server | 10.2.0 | 10.2.0 | 9.4.0 | 7.0.0 | 6.1.0 | 
| Oracle Databases | 23.3.0.23.09 | 23.3.0.23.09 | 21.7 | 21.1 | 11.2 | 
| PostgreSQL | 42.7.3 | 42.7.3 | 42.3.6 | 42.2.18 | 42.1.0 | 
| Amazon Redshift |  redshift-jdbc42-2.1.0.29  |  redshift-jdbc42-2.1.0.29  |  redshift-jdbc42-2.1.0.16  |  redshift-jdbc41-1.2.12.1017   |  redshift-jdbc41-1.2.12.1017   | 
| SAP Hana | 2.20.17 | 2.20.17 | 2.17.12 |  |  | 
| Teradata | 20.00.00.33 | 20.00.00.33 | 20.00.00.06 |  |  | 

### Appendix C: Connector upgrades
<a name="migrating-version-51-appendix-connector"></a>

The following are connector upgrades:


| Driver | Connector version in AWS Glue 5.1 | Connector version in AWS Glue 5.0 | Connector version in AWS Glue 4.0 | Connector version in AWS Glue 3.0 | 
| --- | --- | --- | --- | --- | 
| EMR DynamoDB connector | 5.7.0 | 5.6.0 | 4.16.0 |  | 
| Amazon Redshift | 6.4.2 | 6.4.0 | 6.1.3 |  | 
| OpenSearch | 1.2.0 | 1.2.0 | 1.0.1 |  | 
| MongoDB | 10.3.0 | 10.3.0 | 10.0.4 | 3.0.0 | 
| Snowflake | 3.1.1 | 3.0.0 | 2.12.0 |  | 
| Google BigQuery | 0.32.2 | 0.32.2 | 0.32.2 |  | 
| AzureCosmos | 4.33.0 | 4.33.0 | 4.22.0 |  | 
| AzureSQL | 1.3.0 | 1.3.0 | 1.3.0 |  | 
| Vertica | 3.3.5 | 3.3.5 | 3.3.5 |  | 

### Appendix D: Open table format upgrades
<a name="migrating-version-51-appendix-open-table-formats"></a>

The following are open table format upgrades:


| OTF | Connector version in AWS Glue 5.1 | Connector version in AWS Glue 5.0 | Connector version in AWS Glue 4.0 | Connector version in AWS Glue 3.0 | 
| --- | --- | --- | --- | --- | 
| Hudi | 1.0.2 | 0.15.0 | 0.12.1 | 0.10.1 | 
| Delta Lake | 3.3.2 | 3.3.0 | 2.1.0 | 1.0.0 | 
| Iceberg | 1.10.0 | 1.7.1 | 1.0.0 | 0.13.1 | 

# Migrating AWS Glue for Spark jobs to AWS Glue version 5.0
<a name="migrating-version-50"></a>

This topic describes the changes between AWS Glue versions 0.9, 1.0, 2.0, 3.0, and 4.0 to allow you to migrate your Spark applications and ETL jobs to AWS Glue 5.0. It also describes the features in AWS Glue 5.0 and the advantages of using it. 

To use this feature with your AWS Glue ETL jobs, choose **5.0** for the `Glue version` when creating your jobs.

**Topics**
+ [New features](#migrating-version-50-features)
+ [Actions to migrate to AWS Glue 5.0](#migrating-version-50-actions)
+ [Migration checklist](#migrating-version-50-checklist)
+ [AWS Glue 5.0 features](#migrating-version-50-features)
+ [Migrating from AWS Glue 4.0 to AWS Glue 5.0](#migrating-version-50-from-40)
+ [Migrating from AWS Glue 3.0 to AWS Glue 5.0](#migrating-version-50-from-30)
+ [Migrating from AWS Glue 2.0 to AWS Glue 5.0](#migrating-version-50-from-20)
+ [Logging behavior changes in AWS Glue 5.0](#enable-continous-logging-changes-glue-50)
+ [Connector and JDBC driver migration for AWS Glue 5.0](#migrating-version-50-connector-driver-migration)

## New features
<a name="migrating-version-50-features"></a>

This section describes new features and advantages of AWS Glue version 5.0.
+ Apache Spark update from 3.3.0 in AWS Glue 4.0 to 3.5.4 in AWS Glue 5.0. See [Major enhancements from Spark 3.3.0 to Spark 3.5.4](#migrating-version-50-features-spark). 
+ Spark-native fine-grained access control (FGAC) using Lake Formation. This includes FGAC for Iceberg, Delta and Hudi tables. For more information, see [ Using AWS Glue with AWS Lake Formation for fine-grained access control](https://docs.aws.amazon.com/glue/latest/dg/security-lf-enable.html). 

  Note the following considerations or limitations for Spark-native FGAC:
  + Currently, data writes are not supported.
  + Writing into Iceberg through `GlueContext` using Lake Formation requires use of IAM access control instead.

  For a complete list of limitations and considerations when using Spark-native FGAC, see [Considerations and limitations](security-lf-enable-considerations.md).
+ Support for Amazon S3 Access Grants as a scalable access control solution to your Amazon S3 data from AWS Glue. For more information, see [Using Amazon S3 Access Grants with AWS Glue](security-s3-access-grants.md).
+ Open Table Formats (OTF) updated to Hudi 0.15.0, Iceberg 1.7.1, and Delta Lake 3.3.0.
+ Amazon SageMaker Unified Studio support.
+ Amazon SageMaker Lakehouse and data abstraction integration. For more information, see [Querying metastore data catalogs from AWS Glue ETL](#migrating-version-50-features-metastore).
+ Support to install additional Python libraries using `requirements.txt`. For more information, see [Installing additional Python libraries in AWS Glue 5.0 or above using requirements.txt](aws-glue-programming-python-libraries.md#addl-python-modules-requirements-txt).
+ AWS Glue 5.0 supports data lineage in Amazon DataZone. You can configure AWS Glue to automatically collect lineage information during Spark job runs and send the lineage events to be visualized in Amazon DataZone. For more information, see [Data lineage in Amazon DataZone](https://docs.aws.amazon.com/datazone/latest/userguide/datazone-data-lineage.html).

  To configure this on the AWS Glue console, turn on **Generate lineage events**, and enter your Amazon DataZone domain ID on the **Job details** tab.  
![\[The screenshot shows turning on Amazon DataZone data lineage for AWS Glue.\]](http://docs.aws.amazon.com/glue/latest/dg/images/glue-50-data-lineage.png)

  Alternatively, you can provide the following job parameter (provide your DataZone domain ID):
  + Key: `--conf`
  + Value:

    ```
    spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener
    --conf spark.openlineage.transport.type=amazon_datazone_api
    --conf spark.openlineage.transport.domainId=<your-domain-ID>
    ```
+ Connector and JDBC driver updates. For more information, see [Appendix B: JDBC driver upgrades](#migrating-version-50-appendix-jdbc-driver) and [Appendix C: Connector upgrades](#migrating-version-50-appendix-connector).
+ Java update from 8 to 17.
+ Increased storage for AWS Glue `G.1X` and `G.2X` workers, with disk space increasing to 94 GB and 138 GB, respectively. Additionally, new worker types `G.12X`, `G.16X`, and memory-optimized `R.1X`, `R.2X`, `R.4X`, and `R.8X` are available in AWS Glue 4.0 and later versions. For more information, see [Jobs](aws-glue-api-jobs-job.md). 
+ **Support for AWS SDK for Java, version 2** - AWS Glue 5.0 jobs can use the AWS SDK for Java versions [1.12.569](https://github.com/aws/aws-sdk-java/tree/1.12.569) or [2.28.8](https://github.com/aws/aws-sdk-java-v2/tree/2.28.8) if the job supports v2. The AWS SDK for Java 2.x is a major rewrite of the version 1.x code base. It's built on Java 8+ and adds several frequently requested features, including support for non-blocking I/O and the ability to plug in a different HTTP implementation at runtime. For more information, including a migration guide from SDK for Java v1 to v2, see the [AWS SDK for Java, version 2](https://docs.aws.amazon.com/sdk-for-java) guide.

**Breaking changes**  
Note the following breaking changes:
+ In AWS Glue 5.0, when using the S3A file system, if neither `fs.s3a.endpoint` nor `fs.s3a.endpoint.region` is set, the default region used by S3A is `us-east-2`. This can cause issues, such as S3 upload timeout errors, especially for VPC jobs. To mitigate these issues, set the `fs.s3a.endpoint.region` Spark configuration when using the S3A file system in AWS Glue 5.0. 
+ Lake Formation Fine-grained Access Control (FGAC)
  + AWS Glue 5.0 only supports the new Spark-native FGAC using Spark DataFrames. It does not support FGAC using AWS Glue DynamicFrames.
    + Use of FGAC in 5.0 requires migrating from AWS Glue DynamicFrames to Spark DataFrames.
    + If you don't need FGAC, it is not necessary to migrate to Spark DataFrames; GlueContext features, like job bookmarks and pushdown predicates, will continue to work.
  + Jobs with Spark-native FGAC require a minimum of 4 workers: one user driver, one system driver, one system executor, and one standby user executor.
  + For more information, see [ Using AWS Glue with AWS Lake Formation for fine-grained access control](https://docs.aws.amazon.com/glue/latest/dg/security-lf-enable.html). 
+ Lake Formation Full Table Access (FTA)
  + AWS Glue 5.0 supports FTA with Spark-native DataFrames (new) and GlueContext DynamicFrames (legacy, with limitations).
  + Spark-native FTA
    + If your 4.0 script uses GlueContext, migrate to native Spark DataFrames.
    + This feature is limited to Hive and Iceberg tables.
    + For more info on configuring a 5.0 job to use Spark native FTA, see 
  + GlueContext DynamicFrame FTA
    + No code change is necessary.
    + This feature is limited to non-OTF tables; it will not work with Iceberg, Delta Lake, or Hudi.
+ [Vectorized SIMD CSV reader](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-csv-home.html#aws-glue-programming-etl-format-simd-csv-reader) is not supported.
+ [Continuous logging](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html) to output log group is not supported. Use `error` log group instead.
+ The AWS Glue job run insights `job-insights-rule-driver` log stream has been deprecated. The `job-insights-rca-driver` log stream is now located in the error log group.
+ Athena-based custom/marketplace connectors are not supported.
+ Adobe Marketo Engage, Facebook Ads, Google Ads, Google Analytics 4, Google Sheets, Hubspot, Instagram Ads, Intercom, Jira Cloud, Oracle NetSuite, Salesforce, Salesforce Marketing Cloud, Salesforce Marketing Cloud Account Engagement, SAP OData, ServiceNow, Slack, Snapchat Ads, Stripe, Zendesk and Zoho CRM connectors are not supported.
+ Custom log4j properties are not supported in AWS Glue 5.0.

### Major enhancements from Spark 3.3.0 to Spark 3.5.4
<a name="migrating-version-50-features-spark"></a>

Note the following enhancements:
+ Python client for Spark Connect ([SPARK-39375](https://issues.apache.org/jira/browse/SPARK-39375)).
+ Implement support for DEFAULT values for columns in tables ([SPARK-38334](https://issues.apache.org/jira/browse/SPARK-38334)).
+ Support "Lateral Column Alias References" ([SPARK-27561](https://issues.apache.org/jira/browse/SPARK-27561)).
+ Harden SQLSTATE usage for error classes ([SPARK-41994](https://issues.apache.org/jira/browse/SPARK-41994)).
+ Enable Bloom filter Joins by default ([SPARK-38841](https://issues.apache.org/jira/browse/SPARK-38841)).
+ Better Spark UI scalability and driver stability for large applications ([SPARK-41053](https://issues.apache.org/jira/browse/SPARK-41053)).
+ Async Progress Tracking in Structured Streaming ([SPARK-39591](https://issues.apache.org/jira/browse/SPARK-39591)).
+ Python arbitrary stateful processing in structured streaming ([SPARK-40434](https://issues.apache.org/jira/browse/SPARK-40434)).
+ Pandas API coverage improvements ([SPARK-42882](https://issues.apache.org/jira/browse/SPARK-42882)) and NumPy input support in PySpark ([SPARK-39405](https://issues.apache.org/jira/browse/SPARK-39405)).
+ Provide a memory profiler for PySpark user-defined functions ([SPARK-40281](https://issues.apache.org/jira/browse/SPARK-40281)).
+ Implement PyTorch distributor ([SPARK-41589](https://issues.apache.org/jira/browse/SPARK-41589)).
+ Publish SBOM artifacts ([SPARK-41893](https://issues.apache.org/jira/browse/SPARK-41893)).
+ Support IPv6-only environment ([SPARK-39457](https://issues.apache.org/jira/browse/SPARK-39457)).
+ Customized K8s scheduler (Apache YuniKorn and Volcano) GA ([SPARK-42802](https://issues.apache.org/jira/browse/SPARK-42802)).
+ Scala and Go client support in Spark Connect ([SPARK-42554](https://issues.apache.org/jira/browse/SPARK-42554)) and ([SPARK-43351](https://issues.apache.org/jira/browse/SPARK-43351)).
+ PyTorch-based distributed ML Support for Spark Connect ([SPARK-42471](https://issues.apache.org/jira/browse/SPARK-42471)).
+ Structured streaming support for Spark Connect in Python and Scala ([SPARK-42938](https://issues.apache.org/jira/browse/SPARK-42938)).
+ Pandas API support for the Python Spark Connect Client ([SPARK-42497](https://issues.apache.org/jira/browse/SPARK-42497)).
+ Introduce Arrow Python UDFs ([SPARK-40307](https://issues.apache.org/jira/browse/SPARK-40307)).
+ Support Python user-defined table functions ([SPARK-43798](https://issues.apache.org/jira/browse/SPARK-43798)).
+ Migrate PySpark errors onto error classes ([SPARK-42986](https://issues.apache.org/jira/browse/SPARK-42986)).
+ PySpark test framework ([SPARK-44042](https://issues.apache.org/jira/browse/SPARK-44042)).
+ Add support for Datasketches HllSketch ([SPARK-16484](https://issues.apache.org/jira/browse/SPARK-16484)).
+ Built-in SQL function improvement ([SPARK-41231](https://issues.apache.org/jira/browse/SPARK-41231)).
+ IDENTIFIER clause ([SPARK-43205](https://issues.apache.org/jira/browse/SPARK-43205)).
+ Add SQL functions into Scala, Python and R API ([SPARK-43907](https://issues.apache.org/jira/browse/SPARK-43907)).
+ Add named argument support for SQL functions ([SPARK-43922](https://issues.apache.org/jira/browse/SPARK-43922)).
+ Avoid unnecessary task rerun on decommissioned executor lost if shuffle data migrated ([SPARK-41469](https://issues.apache.org/jira/browse/SPARK-41469)).
+ Distributed ML <> spark connect ([SPARK-42471](https://issues.apache.org/jira/browse/SPARK-42471)).
+ DeepSpeed distributor ([SPARK-44264](https://issues.apache.org/jira/browse/SPARK-44264)).
+ Implement changelog checkpointing for RocksDB state store ([SPARK-43421](https://issues.apache.org/jira/browse/SPARK-43421)).
+ Introduce watermark propagation among operators ([SPARK-42376](https://issues.apache.org/jira/browse/SPARK-42376)).
+ Introduce dropDuplicatesWithinWatermark ([SPARK-42931](https://issues.apache.org/jira/browse/SPARK-42931)).
+ RocksDB state store provider memory management enhancements ([SPARK-43311](https://issues.apache.org/jira/browse/SPARK-43311)).

## Actions to migrate to AWS Glue 5.0
<a name="migrating-version-50-actions"></a>

For existing jobs, change the `Glue version` from the previous version to `Glue 5.0` in the job configuration.
+ In AWS Glue Studio, choose `Glue 5.0 - Supports Spark 3.5.4, Scala 2, Python 3` in `Glue version`.
+ In the API, choose **5.0** in the `GlueVersion` parameter in the [https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-UpdateJob](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-UpdateJob) API operation.

For new jobs, choose `Glue 5.0` when you create a job.
+ In the console, choose `Spark 3.5.4, Python 3 (Glue Version 5.0)` or `Spark 3.5.4, Scala 2 (Glue Version 5.0)` in `Glue version`.
+ In AWS Glue Studio, choose `Glue 5.0 - Supports Spark 3.5.4, Scala 2, Python 3` in `Glue version`.
+ In the API, choose **5.0** in the `GlueVersion` parameter in the [https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-CreateJob](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-CreateJob) API operation.
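Before updating jobs in bulk, it can help to identify which jobs still run on versions older than 5.0. The sketch below operates on plain job dicts of the shape returned by the Glue `GetJobs` API; in practice the list would come from paginating `glue.get_jobs()`, which is omitted here.

```python
def needs_migration(job: dict, target: str = "5.0") -> bool:
    """Return True if a job dict reports a GlueVersion older than target."""
    version = job.get("GlueVersion", "0.9")  # very old jobs may omit the field

    def as_tuple(v: str) -> tuple:
        return tuple(int(part) for part in v.split("."))

    return as_tuple(version) < as_tuple(target)

# Illustrative job list; real entries would come from glue.get_jobs().
jobs = [
    {"Name": "legacy-etl", "GlueVersion": "2.0"},
    {"Name": "current-etl", "GlueVersion": "5.0"},
]
to_migrate = [job["Name"] for job in jobs if needs_migration(job)]
```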

To view Spark event logs of AWS Glue 5.0 coming from AWS Glue 2.0 or earlier, [launch an upgraded Spark history server for AWS Glue 5.0 using CloudFormation or Docker](https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html).

## Migration checklist
<a name="migrating-version-50-checklist"></a>

Review this checklist for migration:
+ Java 17 updates
+ [Scala] Upgrade AWS SDK calls from v1 to v2
+ Python 3.10 to 3.11 migration
+ [Python] Update boto references from 1.26 to 1.34

## AWS Glue 5.0 features
<a name="migrating-version-50-features"></a>

This section describes AWS Glue 5.0 features in more detail.

### Querying metastore data catalogs from AWS Glue ETL
<a name="migrating-version-50-features-metastore"></a>

You can register your AWS Glue job to access the AWS Glue Data Catalog, which makes tables and other metastore resources available to disparate consumers. The Data Catalog supports a multi-catalog hierarchy, which unifies all your data across Amazon S3 data lakes. It also provides both a Hive metastore API and an open-source Apache Iceberg API for accessing the data. These features are available to AWS Glue and other data-oriented services like Amazon EMR, Amazon Athena and Amazon Redshift.

When you create resources in the Data Catalog, you can access them from any SQL engine that supports the Apache Iceberg REST API. AWS Lake Formation manages permissions. After configuration, you can use AWS Glue's capabilities to query disparate data sources through these metastore resources with familiar applications, such as Apache Spark and Trino.

#### How metadata resources are organized
<a name="migrating-version-50-features-metastore-organized"></a>

Data is organized in a logical hierarchy of catalogs, databases and tables, using the AWS Glue Data Catalog:
+ Catalog – A logical container that holds objects from a data store, such as schemas or tables.
+ Database – Organizes data objects such as tables and views in a catalog.
+ Tables and views – Data objects in a database that provide an abstraction layer with an understandable schema. They make it easy to access underlying data, which could be in various formats and in various locations.
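
In Spark SQL, these three levels combine into a dotted identifier. The following minimal sketch (all names are hypothetical) shows how a fully qualified reference is formed:

```python
# Sketch: composing a fully qualified identifier for the Data Catalog
# hierarchy described above. All names here are hypothetical examples.
def qualified_name(catalog, database, table):
    """Join catalog, database, and table into a dotted Spark SQL identifier."""
    return ".".join([catalog, database, table])

fq = qualified_name("my_catalog", "sales_db", "orders")
query = f"SELECT * FROM {fq} LIMIT 10"
# In a Glue 5.0 job (not executed here): spark.sql(query).show()
```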

## Migrating from AWS Glue 4.0 to AWS Glue 5.0
<a name="migrating-version-50-from-40"></a>

All existing job parameters and major features that exist in AWS Glue 4.0 will exist in AWS Glue 5.0, except machine learning transforms.

The following new parameters were added:
+ `--enable-lakeformation-fine-grained-access`: Enables the fine-grained access control (FGAC) feature in AWS Lake Formation tables.

Refer to the Spark migration documentation:
+ [Migration Guide: Spark Core](https://spark.apache.org/docs/3.5.6/core-migration-guide.html)
+ [Migration Guide: SQL, Datasets and DataFrame](https://spark.apache.org/docs/3.5.6/sql-migration-guide.html)
+ [Migration Guide: Structured Streaming](https://spark.apache.org/docs/3.5.6/ss-migration-guide.html)
+ [Upgrading PySpark](https://spark.apache.org/docs/3.5.6/api/python/migration_guide/pyspark_upgrade.html)

## Migrating from AWS Glue 3.0 to AWS Glue 5.0
<a name="migrating-version-50-from-30"></a>

**Note**  
For migration steps related to AWS Glue 4.0, see [Migrating from AWS Glue 3.0 to AWS Glue 4.0](migrating-version-40.md#migrating-version-40-from-30).

All existing job parameters and major features that exist in AWS Glue 3.0 will exist in AWS Glue 5.0, except machine learning transforms.

## Migrating from AWS Glue 2.0 to AWS Glue 5.0
<a name="migrating-version-50-from-20"></a>

**Note**  
For migration steps related to AWS Glue 4.0 and a list of migration differences between AWS Glue version 3.0 and 4.0, see [Migrating from AWS Glue 3.0 to AWS Glue 4.0](migrating-version-40.md#migrating-version-40-from-30).

Also note the following migration differences between AWS Glue versions 2.0 and 5.0:
+ All existing job parameters and major features that exist in AWS Glue 2.0 will exist in AWS Glue 5.0, except machine learning transforms.
+ Several Spark changes alone may require revision of your scripts to ensure that removed features are not being referenced. For example, Spark 3.1.1 and later does not enable Scala-untyped UDFs, but Spark 2.4 does allow them.
+ Python 2.7 is not supported.
+ Any extra JAR files supplied in existing AWS Glue 2.0 jobs may bring in conflicting dependencies because there were upgrades in several dependencies. You can avoid classpath conflicts with the `--user-jars-first` job parameter.
+ Changes to the behavior of loading/saving of timestamps from/to parquet files. For more details, see [Upgrading from Spark SQL 3.0 to 3.1](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-30-to-31).
+ Spark task parallelism differs because of changes to the driver/executor configuration. You can adjust task parallelism by passing the `--executor-cores` job argument.

## Logging behavior changes in AWS Glue 5.0
<a name="enable-continous-logging-changes-glue-50"></a>

The following are changes in logging behavior in AWS Glue 5.0. For more information, see [Logging for AWS Glue jobs](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging.html).
+ All logs (system logs, Spark daemon logs, user logs, and Glue Logger logs) are now written to the `/aws-glue/jobs/error` log group by default.
+ The `/aws-glue/jobs/logs-v2` log group used for continuous logging in previous versions is no longer used.
+ You can no longer rename or customize the log group or log stream names using the removed continuous logging arguments. Instead, use the new job arguments in AWS Glue 5.0.

### New job arguments introduced in AWS Glue 5.0
<a name="enable-continous-logging-new-arguments-glue-50"></a>
+ `--custom-logGroup-prefix`: Allows you to specify a custom prefix for the `/aws-glue/jobs/error` and `/aws-glue/jobs/output` log groups.
+ `--custom-logStream-prefix`: Allows you to specify a custom prefix for the log stream names within the log groups.

   Validation rules and limitations for custom prefixes include: 
  +  The entire log stream name must be between 1 and 512 characters. 
  +  The custom prefix for log stream names is limited to 400 characters. 
  + Allowed characters in prefixes include alphanumeric characters, underscores (`_`), hyphens (`-`), and forward slashes (`/`).
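
The validation rules above can be illustrated with a small checker. This is an illustrative sketch of the documented limits, not the service's actual validation logic:

```python
# Sketch: validating a custom log stream prefix against the rules above.
import re

# Alphanumeric characters, underscores, hyphens, and forward slashes.
_ALLOWED = re.compile(r"^[A-Za-z0-9_/-]+$")

def validate_log_stream_prefix(prefix, stream_suffix=""):
    """Return a list of rule violations (an empty list means the prefix passes)."""
    problems = []
    if len(prefix) > 400:
        problems.append("prefix exceeds 400 characters")
    if not _ALLOWED.match(prefix):
        problems.append("prefix contains characters outside [A-Za-z0-9_/-]")
    full_name = prefix + stream_suffix
    if not 1 <= len(full_name) <= 512:
        problems.append("full log stream name must be 1-512 characters")
    return problems

print(validate_log_stream_prefix("team-a/etl_jobs"))  # []
print(validate_log_stream_prefix("bad prefix!"))
```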

### Deprecated continuous logging arguments in AWS Glue 5.0
<a name="enabling-continuous-logging-deprecated-arguments"></a>

The following job arguments for continuous logging are deprecated in AWS Glue 5.0:
+ `--enable-continuous-cloudwatch-log`
+ `--continuous-log-logGroup`
+ `--continuous-log-logStreamPrefix`
+ `--continuous-log-conversionPattern`
+ `--enable-continuous-log-filter`

## Connector and JDBC driver migration for AWS Glue 5.0
<a name="migrating-version-50-connector-driver-migration"></a>

For the versions of JDBC and data lake connectors that were upgraded, see:
+ [Appendix B: JDBC driver upgrades](#migrating-version-50-appendix-jdbc-driver)
+ [Appendix C: Connector upgrades](#migrating-version-50-appendix-connector)
+ [Appendix D: Open table format upgrades](#migrating-version-50-appendix-open-table-formats)

The following changes apply to the connector or driver versions identified in the appendices for Glue 5.0.

**Amazon Redshift**  
Note the following changes:
+ Adds support for three-part table names to allow the connector to query Redshift data sharing tables.
+ Corrects the mapping of Spark `ShortType` to use Redshift `SMALLINT` instead of `INTEGER` to better match the expected data size.
+ Adds support for custom cluster names (CNAME) for Amazon Redshift Serverless.

**Apache Hudi**  
Note the following changes:
+ Support for record-level indexes.
+ Support for auto generation of record keys. You no longer have to specify the record key field.

**Apache Iceberg**  
Note the following changes:
+ Support for fine-grained access control with AWS Lake Formation.
+ Support for branching and tagging, which are named references to snapshots with their own independent lifecycles.
+ Added a changelog view procedure, which generates a view containing the changes made to a table over a specified period or between specific snapshots.

**Delta Lake**  
Note the following changes:
+ Support for Delta Universal Format (UniForm), which enables seamless access through Apache Iceberg and Apache Hudi.
+ Support for Deletion Vectors, which implement a Merge-on-Read paradigm.

**AzureCosmos**  
Note the following changes:
+ Added hierarchical partition key support.
+ Added the option to use a custom schema with `StringType` (raw JSON) for a nested property.
+ Added the config option `spark.cosmos.auth.aad.clientCertPemBase64` to allow using service principal name (SPN) authentication with a certificate instead of a client secret.

For more information, see the [Azure Cosmos DB Spark connector change log](https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/cosmos/azure-cosmos-spark_3-2_2-12/CHANGELOG.md).

**Microsoft SQL Server**  
Note the following changes:
+ TLS encryption is enabled by default.
+ When `encrypt` is set to `false` but the server requires encryption, the certificate is validated based on the `trustServerCertificate` connection setting.
+ The `aadSecurePrincipalId` and `aadSecurePrincipalSecret` connection properties are deprecated.
+ The `getAADSecretPrincipalId` API is removed.
+ Added CNAME resolution when a realm is specified.

**MongoDB**  
Note the following changes:
+ Support for micro-batch mode with Spark Structured Streaming.
+ Support for BSON data types.
+ Added support for reading multiple collections when using micro-batch or continuous streaming modes.
  + If the name of a collection used in your `collection` configuration option contains a comma, the Spark Connector treats it as two different collections. To avoid this, you must escape the comma by preceding it with a backslash (`\`).
  + If the name of a collection used in your `collection` configuration option is `*`, the Spark Connector interprets it as a specification to scan all collections. To avoid this, you must escape the asterisk by preceding it with a backslash (`\`).
  + If the name of a collection used in your `collection` configuration option contains a backslash (`\`), the Spark Connector treats the backslash as an escape character, which might change how it interprets the value. To avoid this, you must escape the backslash by preceding it with another backslash (`\\`).
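
The escaping rules above can be applied programmatically. The following sketch is illustrative only:

```python
# Sketch: escaping a MongoDB collection name for the Spark connector's
# `collection` configuration option, following the rules above.
def escape_collection_name(name):
    """Escape backslashes first, then commas and asterisks."""
    return (
        name.replace("\\", "\\\\")
            .replace(",", "\\,")
            .replace("*", "\\*")
    )

print(escape_collection_name("orders,2024"))  # orders\,2024
print(escape_collection_name("*"))            # \*
```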

For more information, see the [MongoDB connector for Spark release notes](https://www.mongodb.com/docs/spark-connector/current/release-notes/).

**Snowflake**  
Note the following changes:
+ Introduced a new `trim_space` parameter that you can use to trim values of `StringType` columns automatically when saving to a Snowflake table. Default: `false`.
+ Disabled the `abort_detached_query` parameter at the session level by default.
+ Removed the requirement of the `SFUSER` parameter when using OAUTH.
+ Removed the Advanced Query Pushdown feature. Alternatives to the feature are available. For example, instead of loading data from Snowflake tables, users can directly load data from Snowflake SQL queries.

For more information, see the [Snowflake Connector for Spark release notes](https://docs.snowflake.com/en/release-notes/clients-drivers/spark-connector-2024).
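
As an example of replacing the removed pushdown feature, you can hand the connector a SQL query instead of a table name so that filtering and aggregation still run inside Snowflake. The connection values below are hypothetical; the option names (`sfURL`, `query`, and so on) follow the Snowflake Spark connector:

```python
# Sketch: reading from a Snowflake SQL query instead of a table.
# Connection values are hypothetical placeholders.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "ETL_WH",
    # Run filtering/aggregation inside Snowflake via a query instead of
    # a `dbtable` option:
    "query": "SELECT region, SUM(amount) AS total FROM sales GROUP BY region",
}
# In a Glue job (not executed here):
# df = spark.read.format("snowflake").options(**sf_options).load()
```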

### Appendix A: Notable dependency upgrades
<a name="migrating-version-50-appendix-dependencies"></a>

The following are dependency upgrades:


| Dependency | Version in AWS Glue 5.0 | Version in AWS Glue 4.0 | Version in AWS Glue 3.0 | Version in AWS Glue 2.0 | Version in AWS Glue 1.0 | 
| --- | --- | --- | --- | --- | --- | 
| Java | 17 | 8 | 8 | 8 | 8 | 
| Spark | 3.5.4 | 3.3.0-amzn-1 | 3.1.1-amzn-0 | 2.4.3 | 2.4.3 | 
| Hadoop | 3.4.1 | 3.3.3-amzn-0 | 3.2.1-amzn-3 | 2.8.5-amzn-5 | 2.8.5-amzn-1 | 
| Scala | 2.12.18 | 2.12 | 2.12 | 2.11 | 2.11 | 
| Jackson | 2.15.2 | 2.12 | 2.12 | 2.11 | 2.11 | 
| Hive | 2.3.9-amzn-4 | 2.3.9-amzn-2 | 2.3.7-amzn-4 | 1.2 | 1.2 | 
| EMRFS | 2.69.0 | 2.54.0 | 2.46.0 | 2.38.0 | 2.30.0 | 
| Json4s | 3.7.0-M11 | 3.7.0-M11 | 3.6.6 | 3.5.x | 3.5.x | 
| Arrow | 12.0.1 | 7.0.0 | 2.0.0 | 0.10.0 | 0.10.0 | 
| AWS Glue Data Catalog client | 4.5.0 | 3.7.0 | 3.0.0 | 1.10.0 | N/A | 
| AWS SDK for Java | 2.29.52 | 1.12 | 1.12 |  |  | 
| Python | 3.11 | 3.10 | 3.7 | 2.7 & 3.6 | 2.7 & 3.6 | 
| Boto | 1.34.131 | 1.26 | 1.18 | 1.12 | N/A | 
| EMR DynamoDB connector | 5.6.0 | 4.16.0 |  |  |  | 

### Appendix B: JDBC driver upgrades
<a name="migrating-version-50-appendix-jdbc-driver"></a>

The following are JDBC driver upgrades:


| Driver | JDBC driver version in AWS Glue 5.0 | JDBC driver version in AWS Glue 4.0 | JDBC driver version in AWS Glue 3.0 | JDBC driver version in past AWS Glue versions | 
| --- | --- | --- | --- | --- | 
| MySQL | 8.0.33 | 8.0.23 | 8.0.23 | 5.1 | 
| Microsoft SQL Server | 10.2.0 | 9.4.0 | 7.0.0 | 6.1.0 | 
| Oracle Databases | 23.3.0.23.09 | 21.7 | 21.1 | 11.2 | 
| PostgreSQL | 42.7.3 | 42.3.6 | 42.2.18 | 42.1.0 | 
| Amazon Redshift |  redshift-jdbc42-2.1.0.29  |  redshift-jdbc42-2.1.0.16  |  redshift-jdbc41-1.2.12.1017   |  redshift-jdbc41-1.2.12.1017   | 
| SAP Hana | 2.20.17 | 2.17.12 |  |  | 
| Teradata | 20.00.00.33 | 20.00.00.06 |  |  | 

### Appendix C: Connector upgrades
<a name="migrating-version-50-appendix-connector"></a>

The following are connector upgrades:


| Driver | Connector version in AWS Glue 5.0 | Connector version in AWS Glue 4.0 | Connector version in AWS Glue 3.0 | 
| --- | --- | --- | --- | 
| EMR DynamoDB connector | 5.6.0 | 4.16.0 |  | 
| Amazon Redshift | 6.4.0 | 6.1.3 |  | 
| OpenSearch | 1.2.0 | 1.0.1 |  | 
| MongoDB | 10.3.0 | 10.0.4 | 3.0.0 | 
| Snowflake | 3.0.0 | 2.12.0 |  | 
| Google BigQuery | 0.32.2 | 0.32.2 |  | 
| AzureCosmos | 4.33.0 | 4.22.0 |  | 
| AzureSQL | 1.3.0 | 1.3.0 |  | 
| Vertica | 3.3.5 | 3.3.5 |  | 

### Appendix D: Open table format upgrades
<a name="migrating-version-50-appendix-open-table-formats"></a>

The following are open table format upgrades:


| OTF | Connector version in AWS Glue 5.0 | Connector version in AWS Glue 4.0 | Connector version in AWS Glue 3.0 | 
| --- | --- | --- | --- | 
| Hudi | 0.15.0 | 0.12.1 | 0.10.1 | 
| Delta Lake | 3.3.0 | 2.1.0 | 1.0.0 | 
| Iceberg | 1.7.1 | 1.0.0 | 0.13.1 | 

# Migrating AWS Glue for Spark jobs to AWS Glue version 4.0
<a name="migrating-version-40"></a>

This topic describes the changes between AWS Glue versions 0.9, 1.0, 2.0, and 3.0 to allow you to migrate your Spark applications and ETL jobs to AWS Glue 4.0. It also describes the features in AWS Glue 4.0 and the advantages of using it. 

To use this feature with your AWS Glue ETL jobs, choose **4.0** for the `Glue version` when creating your jobs.

**Topics**
+ [New features supported](#migrating-version-40-features)
+ [Actions to migrate to AWS Glue 4.0](#migrating-version-40-actions)
+ [Migration checklist](#migrating-version-40-checklist)
+ [Migrating from AWS Glue 3.0 to AWS Glue 4.0](#migrating-version-40-from-30)
+ [Migrating from AWS Glue 2.0 to AWS Glue 4.0](#migrating-version-40-from-20)
+ [Migrating from AWS Glue 1.0 to AWS Glue 4.0](#migrating-version-40-from-10)
+ [Migrating from AWS Glue 0.9 to AWS Glue 4.0](#migrating-version-40-from-09)
+ [Connector and JDBC driver migration for AWS Glue 4.0](#migrating-version-40-connector-driver-migration)
+ [Appendix A: Notable dependency upgrades](#migrating-version-40-appendix-dependencies)
+ [Appendix B: JDBC driver upgrades](#migrating-version-40-appendix-jdbc-driver)
+ [Appendix C: Connector upgrades](#migrating-version-40-appendix-connector)

## New features supported
<a name="migrating-version-40-features"></a>

This section describes new features and advantages of AWS Glue version 4.0.
+ It is based on Apache Spark 3.3.0, but includes optimizations from AWS Glue and Amazon EMR, such as adaptive query execution, vectorized readers, and optimized shuffles and partition coalescing.
+ Upgraded JDBC drivers for all AWS Glue native sources, including MySQL, Microsoft SQL Server, Oracle, PostgreSQL, and MongoDB, as well as upgraded Spark libraries and dependencies brought in by Spark 3.3.0.
+ Updated with a new Amazon Redshift connector and JDBC driver.
+ Optimized Amazon S3 access with upgraded EMR File System (EMRFS) and enabled Amazon S3-optimized output committers, by default.
+ Optimized Data Catalog access with partition indexes, pushdown predicates, partition listing, and an upgraded Hive metastore client.
+ Integration with Lake Formation for governed catalog tables with cell-level filtering and data lake transactions.
+ Reduced startup latency to improve overall job completion times and interactivity.
+ Spark jobs are billed in 1-second increments with a 10x lower minimum billing duration—from a 10-minute minimum to a 1-minute minimum.
+ Native support for open-data lake frameworks with Apache Hudi, Delta Lake, and Apache Iceberg.
+ Native support for the Amazon S3-based Cloud Shuffle Storage Plugin (an Apache Spark plugin) to use Amazon S3 for shuffling and elastic storage capacity.
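
As an example of the last bullet, the S3-based shuffle storage is typically enabled through job arguments. The following sketch uses parameter names as described in the AWS Glue Spark shuffle plugin documentation; the bucket path is a hypothetical example, so verify the names against your AWS Glue version:

```python
# Sketch: job arguments enabling the Amazon S3-based shuffle storage plugin.
# Parameter names follow the AWS Glue Spark shuffle plugin documentation;
# the bucket path is a hypothetical example.
shuffle_args = {
    "--write-shuffle-files-to-s3": "true",
    "--conf": "spark.shuffle.glue.s3ShuffleBucket=s3://amzn-s3-demo-bucket/shuffle/",
}
# Merge these into a job's DefaultArguments when creating or updating it.
print(sorted(shuffle_args))
```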

**Major enhancements from Spark 3.1.1 to Spark 3.3.0**  
Note the following enhancements:
+ Row-level runtime filtering ([SPARK-32268](https://issues.apache.org/jira/browse/SPARK-32268)).
+ ANSI enhancements ([SPARK-38860](https://issues.apache.org/jira/browse/SPARK-38860)).
+ Error message improvements ([SPARK-38781](https://issues.apache.org/jira/browse/SPARK-38781)).
+ Support complex types for Parquet vectorized reader ([SPARK-34863](https://issues.apache.org/jira/browse/SPARK-34863)).
+ Hidden file metadata support for Spark SQL ([SPARK-37273](https://issues.apache.org/jira/browse/SPARK-37273)).
+ Provide a profiler for Python/Pandas UDFs ([SPARK-37443](https://issues.apache.org/jira/browse/SPARK-37443)).
+ Introduce Trigger.AvailableNow for running streaming queries like Trigger.Once in multiple batches ([SPARK-36533](https://issues.apache.org/jira/browse/SPARK-36533)).
+ More comprehensive Datasource V2 pushdown capabilities ([SPARK-38788](https://issues.apache.org/jira/browse/SPARK-38788)).
+ Migrating from log4j 1 to log4j 2 ([SPARK-37814](https://issues.apache.org/jira/browse/SPARK-37814)).

**Other notable changes**  
Note the following changes:
+ Breaking changes
  + Drop references to Python 3.6 support in docs and Python/docs ([SPARK-36977](https://issues.apache.org/jira/browse/SPARK-36977)).
  + Remove named tuple hack by replacing built-in pickle to cloudpickle ([SPARK-32079](https://issues.apache.org/jira/browse/SPARK-32079)).
  + Bump minimum pandas version to 1.0.5 ([SPARK-37465](https://issues.apache.org/jira/browse/SPARK-37465)).

## Actions to migrate to AWS Glue 4.0
<a name="migrating-version-40-actions"></a>

For existing jobs, change the `Glue version` from the previous version to `Glue 4.0` in the job configuration.
+ In AWS Glue Studio, choose `Glue 4.0 - Supports Spark 3.3, Scala 2, Python 3` in `Glue version`.
+ In the API, choose **4.0** in the `GlueVersion` parameter in the [UpdateJob](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-UpdateJob) API operation.

For new jobs, choose `Glue 4.0` when you create a job.
+ In the console, choose `Spark 3.3, Python 3 (Glue Version 4.0)` or `Spark 3.3, Scala 2 (Glue Version 4.0)` in `Glue version`.
+ In AWS Glue Studio, choose `Glue 4.0 - Supports Spark 3.3, Scala 2, Python 3` in `Glue version`.
+ In the API, choose **4.0** in the `GlueVersion` parameter in the [CreateJob](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html#aws-glue-api-jobs-job-CreateJob) API operation.

To view Spark event logs of AWS Glue 4.0 coming from AWS Glue 2.0 or earlier, [launch an upgraded Spark history server for AWS Glue 4.0 using CloudFormation or Docker](https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html).

## Migration checklist
<a name="migrating-version-40-checklist"></a>
+ Do your job's external Python libraries depend on Python 2.7/3.6?
  + Update the dependent libraries from Python 2.7/3.6 to Python 3.10 as Spark 3.3.0 completely removed Python 2.7 and 3.6 support.

## Migrating from AWS Glue 3.0 to AWS Glue 4.0
<a name="migrating-version-40-from-30"></a>

Note the following changes when migrating:
+ All existing job parameters and major features that exist in AWS Glue 3.0 will exist in AWS Glue 4.0.
+ AWS Glue 3.0 uses Amazon EMR-optimized Spark 3.1.1, and AWS Glue 4.0 uses Amazon EMR-optimized Spark 3.3.0.

  Several Spark changes alone might require revision of your scripts to ensure that removed features are not being referenced.
+ AWS Glue 4.0 also features an update to EMRFS and Hadoop. For the specific version, see [Appendix A: Notable dependency upgrades](#migrating-version-40-appendix-dependencies).
+ The AWS SDK provided in ETL jobs is now upgraded from 1.11 to 1.12.
+ All Python jobs use Python version 3.10. Previously, AWS Glue 3.0 used Python 3.7.

  As a result, some Python modules provided out-of-the-box by AWS Glue are upgraded.
+ Log4j has been upgraded to Log4j2.
  + For information on the Log4j2 migration path, see the [Log4j documentation](https://logging.apache.org/log4j/2.x/manual/migration.html#Log4j2API).
  + You must rename any custom log4j.properties file to log4j2.properties, with the appropriate Log4j2 properties.
+ For migrating certain connectors, see [Connector and JDBC driver migration for AWS Glue 4.0](#migrating-version-40-connector-driver-migration).
+ The AWS Encryption SDK is upgraded from 1.x to 2.x. AWS Glue jobs using AWS Glue security configurations and jobs dependent on the AWS Encryption SDK dependency provided in runtime are affected. See the instructions for AWS Glue job migration.

  You can safely upgrade an AWS Glue 2.0/3.0 job to an AWS Glue 4.0 job because AWS Glue 2.0/3.0 already contains the AWS Encryption SDK bridge version.
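
For the Log4j2 rename noted above, a minimal log4j2.properties sketch might look like the following; the logger name, level, and pattern are illustrative examples to adapt:

```properties
# Minimal log4j2.properties sketch (logger names and levels are examples;
# adapt to your job's packages).
rootLogger.level = info
rootLogger.appenderRef.stdout.ref = STDOUT

appender.console.type = Console
appender.console.name = STDOUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yyyy-MM-dd HH:mm:ss} %p %c{1}: %m%n

logger.app.name = com.example.etl
logger.app.level = debug
```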

Refer to the Spark migration documentation:
+ [Upgrading from Spark SQL 3.1 to 3.2](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-31-to-32)
+ [Upgrading from Spark SQL 3.2 to 3.3](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-32-to-33)

## Migrating from AWS Glue 2.0 to AWS Glue 4.0
<a name="migrating-version-40-from-20"></a>

Note the following changes when migrating:

**Note**  
For migration steps related to AWS Glue 3.0, see [Migrating from AWS Glue 3.0 to AWS Glue 4.0](#migrating-version-40-from-30).
+ All existing job parameters and major features that exist in AWS Glue 2.0 will exist in AWS Glue 4.0.
+ The EMRFS S3-optimized committer for writing Parquet data into Amazon S3 is enabled by default since AWS Glue 3.0. However, you can still disable it by setting `--enable-s3-parquet-optimized-committer` to `false`.
+ AWS Glue 2.0 uses open-source Spark 2.4 and AWS Glue 4.0 uses Amazon EMR-optimized Spark 3.3.0.
  + Several Spark changes alone may require revision of your scripts to ensure that removed features are not being referenced.
  + For example, Spark 3.3.0 does not enable Scala-untyped UDFs, but Spark 2.4 does allow them.
+ The AWS SDK provided in ETL jobs is now upgraded from 1.11 to 1.12.
+ AWS Glue 4.0 also features an update to EMRFS, updated JDBC drivers, and inclusions of additional optimizations onto Spark itself provided by AWS Glue.
+ Scala is updated to 2.12 from 2.11, and Scala 2.12 is not backward compatible with Scala 2.11.
+ Python 3.10 is the default version used for Python scripts, as AWS Glue 2.0 was only using Python 3.7 and 2.7.
  + Python 2.7 is not supported with Spark 3.3.0. Any job requesting Python 2 in the job configuration will fail with an IllegalArgumentException.
  + A new mechanism of installing additional Python modules is available since AWS Glue 2.0.
+ Several dependency updates, highlighted in [Appendix A: Notable dependency upgrades](#migrating-version-40-appendix-dependencies).
+ Any extra JAR files supplied in existing AWS Glue 2.0 jobs might bring in conflicting dependencies because there were upgrades in several dependencies in 4.0 from 2.0. You can avoid classpath conflicts in AWS Glue 4.0 with the `--user-jars-first` AWS Glue job parameter.
+ AWS Glue 4.0 uses Spark 3.3. Starting with Spark 3.1, there was a change in the behavior of loading/saving of timestamps from/to parquet files. For more details, see [Upgrading from Spark SQL 3.0 to 3.1](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-30-to-31).

  We recommend setting the following parameters when reading/writing Parquet data that contains timestamp columns. Setting these parameters can resolve the calendar incompatibility issue that occurs during the Spark 2 to Spark 3 upgrade, for both the AWS Glue DynamicFrame and the Spark DataFrame. Use the CORRECTED option to read the datetime values as they are, and the LEGACY option to rebase the datetime values with regard to the calendar difference during reading.

  ```
  - Key: --conf
  - Value: spark.sql.legacy.parquet.int96RebaseModeInRead=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=[CORRECTED|LEGACY]
  ```
+ For migrating certain connectors, see [Connector and JDBC driver migration for AWS Glue 4.0](#migrating-version-40-connector-driver-migration).
+ The AWS Encryption SDK is upgraded from 1.x to 2.x. AWS Glue jobs using AWS Glue security configurations and jobs dependent on the AWS Encryption SDK dependency provided in runtime are affected. See these instructions for AWS Glue job migration:
  + You can safely upgrade an AWS Glue 2.0 job to an AWS Glue 4.0 job because AWS Glue 2.0 already contains the AWS Encryption SDK bridge version.

Refer to the Spark migration documentation:
+ [Upgrading from Spark SQL 2.4 to 3.0](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30)
+ [Upgrading from Spark SQL 3.1 to 3.2](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-31-to-32)
+ [Upgrading from Spark SQL 3.2 to 3.3](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-32-to-33)
+ [Changes in Datetime behavior to be expected since Spark 3.0.](https://issues.apache.org/jira/browse/SPARK-31408)
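
The rebase settings listed above can be assembled into the `--conf` job-argument value programmatically. The following sketch uses `CORRECTED` as an example mode; choose `CORRECTED` or `LEGACY` to match how your Parquet data was written:

```python
# Sketch: building the --conf job-argument value for the timestamp rebase
# settings described above. CORRECTED is used here as an example mode.
REBASE_KEYS = [
    "spark.sql.legacy.parquet.int96RebaseModeInRead",
    "spark.sql.legacy.parquet.int96RebaseModeInWrite",
    "spark.sql.legacy.parquet.datetimeRebaseModeInRead",
]

def rebase_conf_value(mode="CORRECTED"):
    """Join the rebase settings into one value for the --conf job argument."""
    pairs = [f"{key}={mode}" for key in REBASE_KEYS]
    # Subsequent settings are chained with ` --conf `, as in the listing above.
    return " --conf ".join(pairs)

print(rebase_conf_value())
```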

## Migrating from AWS Glue 1.0 to AWS Glue 4.0
<a name="migrating-version-40-from-10"></a>

Note the following changes when migrating:
+ AWS Glue 1.0 uses open-source Spark 2.4 and AWS Glue 4.0 uses Amazon EMR-optimized Spark 3.3.0.
  + Several Spark changes alone may require revision of your scripts to ensure that removed features are not being referenced.
  + For example, Spark 3.3.0 does not enable Scala-untyped UDFs, but Spark 2.4 does allow them.
+ All jobs in AWS Glue 4.0 run with significantly improved startup times. Spark jobs are billed in 1-second increments, with a 10x lower minimum billing duration, because startup latency goes from 10 minutes maximum to 1 minute maximum.
+ Logging behavior has changed significantly in AWS Glue 4.0 because Spark 3.3.0 requires Log4j2.
+ Several dependency updates, highlighted in the appendix.
+ Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backward compatible with Scala 2.11.
+ Python 3.10 is also the default version used for Python scripts, as AWS Glue 1.0 was using Python 2.7 and 3.6.

  Python 2.7 is not supported with Spark 3.3.0. Any job requesting Python 2 in the job configuration will fail with an `IllegalArgumentException`.
+ A new mechanism of installing additional Python modules through pip is available since AWS Glue 2.0. For more information, see [Installing additional Python modules with pip in AWS Glue 2.0+](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html#addl-python-modules-support).
+ AWS Glue 4.0 does not run on Apache YARN, so YARN settings do not apply.
+ AWS Glue 4.0 does not have a Hadoop Distributed File System (HDFS).
+ Any extra JAR files supplied in existing AWS Glue 1.0 jobs might bring in conflicting dependencies because there were upgrades in several dependencies in 4.0 from 1.0. To avoid this problem, the `--user-jars-first` AWS Glue job parameter is enabled by default in AWS Glue 4.0.
+ AWS Glue 4.0 supports auto scaling. Therefore, the ExecutorAllocationManager metric will be available when auto scaling is enabled.
+ In AWS Glue version 4.0 jobs, you specify the number of workers and worker type, but do not specify a `maxCapacity`.
+ AWS Glue 4.0 does not yet support machine learning transforms.
+ For migrating certain connectors, see [Connector and JDBC driver migration for AWS Glue 4.0](#migrating-version-40-connector-driver-migration).
+ The AWS Encryption SDK is upgraded from 1.x to 2.x. AWS Glue jobs using AWS Glue security configurations and jobs dependent on the AWS Encryption SDK dependency provided in runtime are affected. See these instructions for AWS Glue job migration.
  + You cannot migrate an AWS Glue 0.9/1.0 job to an AWS Glue 4.0 job directly. This is because when upgrading directly to version 2.x or later and enabling all new features immediately, the AWS Encryption SDK won't be able to decrypt the ciphertext encrypted under earlier versions of the AWS Encryption SDK. 
  + To safely upgrade, we first recommend that you migrate to an AWS Glue 2.0/3.0 job that contains the AWS Encryption SDK bridge version. Run the job once to utilize the AWS Encryption SDK bridge version.
  + Upon completion, you can safely migrate the AWS Glue 2.0/3.0 job to AWS Glue 4.0.

Refer to the Spark migration documentation:
+ [Upgrading from Spark SQL 2.4 to 3.0](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30)
+ [Upgrading from Spark SQL 3.0 to 3.1](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-30-to-31)
+ [Upgrading from Spark SQL 3.1 to 3.2](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-31-to-32)
+ [Upgrading from Spark SQL 3.2 to 3.3](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-32-to-33)
+ [Changes in Datetime behavior to be expected since Spark 3.0.](https://issues.apache.org/jira/browse/SPARK-31408)

## Migrating from AWS Glue 0.9 to AWS Glue 4.0
<a name="migrating-version-40-from-09"></a>

Note the following changes when migrating:
+ AWS Glue 0.9 uses open-source Spark 2.2.1 and AWS Glue 4.0 uses Amazon EMR-optimized Spark 3.3.0.
  + Several Spark changes alone might require revision of your scripts to ensure that removed features are not being referenced.
  + For example, Spark 3.3.0 does not enable Scala-untyped UDFs, but Spark 2.2 does allow them.
+ All jobs in AWS Glue 4.0 run with significantly improved startup times. Spark jobs are billed in 1-second increments, with a 10x lower minimum billing duration, because startup latency goes from 10 minutes maximum to 1 minute maximum.
+ Logging behavior has changed significantly in AWS Glue 4.0 because Spark 3.3.0 requires Log4j2. For details, see [Upgrading from Core 3.2 to 3.3](https://spark.apache.org/docs/latest/core-migration-guide.html#upgrading-from-core-32-to-33).
+ Several dependency updates, highlighted in the appendix.
+ Scala is also updated to 2.12 from 2.11, and Scala 2.12 is not backward compatible with Scala 2.11.
+ Python 3.10 is also the default version used for Python scripts, as AWS Glue 0.9 was only using Python 2.
  + Python 2.7 is not supported with Spark 3.3.0. Any job requesting Python 2 in the job configuration will fail with an IllegalArgumentException.
  + A new mechanism of installing additional Python modules through pip is available.
+ AWS Glue 4.0 does not run on Apache YARN, so YARN settings do not apply.
+ AWS Glue 4.0 does not have a Hadoop Distributed File System (HDFS).
+ Any extra JAR files supplied in existing AWS Glue 0.9 jobs might bring in conflicting dependencies because several dependencies were upgraded in AWS Glue 4.0 from those in 0.9. You can avoid classpath conflicts in AWS Glue 4.0 with the `--user-jars-first` AWS Glue job parameter.
+ AWS Glue 4.0 supports auto scaling. Therefore, the ExecutorAllocationManager metric will be available when auto scaling is enabled.
+ In AWS Glue version 4.0 jobs, you specify the number of workers and worker type, but do not specify a `maxCapacity`.
+ AWS Glue 4.0 does not yet support machine learning transforms.
+ For migrating certain connectors, see [Connector and JDBC driver migration for AWS Glue 4.0](#migrating-version-40-connector-driver-migration).
+ The AWS Encryption SDK is upgraded from 1.x to 2.x. AWS Glue jobs that use AWS Glue security configurations, and jobs that depend on the AWS Encryption SDK dependency provided in the runtime, are affected. To migrate your AWS Glue jobs, follow these instructions:
  + You cannot migrate an AWS Glue 0.9/1.0 job to an AWS Glue 4.0 job directly. This is because when upgrading directly to version 2.x or later and enabling all new features immediately, the AWS Encryption SDK won't be able to decrypt the ciphertext encrypted under earlier versions of the AWS Encryption SDK. 
  + To safely upgrade, we first recommend that you migrate to an AWS Glue 2.0/3.0 job that contains the AWS Encryption SDK bridge version. Run the job once to utilize the AWS Encryption SDK bridge version.
  + Upon completion, you can safely migrate the AWS Glue 2.0/3.0 job to AWS Glue 4.0.
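
The pip-based module installation and the `--user-jars-first` parameter mentioned above can both be expressed as job default arguments. The following is a minimal sketch; the module names and versions are illustrative, not required.

```python
# Sketch of AWS Glue 4.0 job default arguments covering two migration points:
# pip-installed Python modules and user-JARs-first classpath ordering.
default_arguments = {
    # Install extra Python modules through pip at job start (Glue 2.0 and later)
    "--additional-python-modules": "pyarrow==7.0.0,scikit-learn==1.3.2",
    # Load user-supplied JARs ahead of Glue's own to avoid classpath conflicts
    "--user-jars-first": "true",
}
```

These values would be set under **Job parameters** in the console, or as the job's default arguments when creating or updating it through the API.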

Refer to the Spark migration documentation:
+ [Upgrading from Spark SQL 2.2 to 2.3](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-22-to-23)
+ [Upgrading from Spark SQL 2.3 to 2.4](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-23-to-24)
+ [Upgrading from Spark SQL 2.4 to 3.0](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30)
+ [Upgrading from Spark SQL 3.0 to 3.1](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-30-to-31)
+ [Upgrading from Spark SQL 3.1 to 3.2](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-31-to-32)
+ [Upgrading from Spark SQL 3.2 to 3.3](https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-32-to-33)
+ [Changes in Datetime behavior to be expected since Spark 3.0](https://issues.apache.org/jira/browse/SPARK-31408).

## Connector and JDBC driver migration for AWS Glue 4.0
<a name="migrating-version-40-connector-driver-migration"></a>

For the versions of JDBC and data lake connectors that were upgraded, see:
+ [Appendix B: JDBC driver upgrades](#migrating-version-40-appendix-jdbc-driver)
+ [Appendix C: Connector upgrades](#migrating-version-40-appendix-connector)

### Hudi
<a name="migrating-version-40-connector-driver-migration-hudi"></a>
+ Spark SQL support improvements:
  + The `Call Procedure` command adds support for the upgrade, downgrade, bootstrap, clean, and repair operations. `Create/Drop/Show/Refresh Index` syntax is available in Spark SQL.
  + The performance gap between using a Spark DataSource and using Spark SQL has been closed. DataSource writes previously used to be faster than SQL writes.
  + All built-in key generators implement more performant Spark-specific API operations.
  + Replaced UDF transformation in the bulk `insert` operation with RDD transformations to cut down on costs of using SerDe.
  + Spark SQL with Hudi requires a `primaryKey` to be specified by `tblproperties` or options in the SQL statement. For update and delete operations, the `preCombineField` is required as well.
+ Any Hudi table created before version 0.10.0 without a `primaryKey` needs to be recreated with a `primaryKey` field since version 0.10.0.
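
As a sketch of the `primaryKey` and `preCombineField` requirements above (the table name and columns are hypothetical), Hudi DDL on AWS Glue 4.0 might look like the following:

```python
# Hypothetical table: Spark SQL DDL for a Hudi table on AWS Glue 4.0.
# primaryKey is mandatory since Hudi 0.10.0; preCombineField is needed
# for UPDATE and DELETE operations.
create_hudi_table_sql = """
CREATE TABLE IF NOT EXISTS products_hudi (
    product_id STRING,
    product_name STRING,
    updated_at TIMESTAMP
)
USING hudi
TBLPROPERTIES (
    primaryKey = 'product_id',
    preCombineField = 'updated_at'
)
"""
# In a Glue 4.0 job this would run as: spark.sql(create_hudi_table_sql)
```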

### PostgreSQL
<a name="migrating-version-40-connector-driver-migration-postgresql"></a>
+ Several vulnerabilities (CVEs) were addressed.
+ Java 8 is natively supported.
+ If the job uses arrays of arrays (with the exception of byte arrays), that scenario can be treated as multidimensional arrays.

### MongoDB
<a name="migrating-version-40-connector-driver-migration-mongodb"></a>
+ The current MongoDB connector supports Spark version 3.1 or later and MongoDB version 4.0 or later.
+ Due to the connector upgrade, a few property names changed. For example, the URI property name changed to `connection.uri`. For more information on the current options, see the [MongoDB Spark Connector blog](https://www.mongodb.com/docs/spark-connector/current/configuration/).
+ Using MongoDB 4.0 hosted by Amazon DocumentDB has some functional differences. For more information, see these topics:
  + [Functional Differences: Amazon DocumentDB and MongoDB](https://docs.aws.amazon.com/documentdb/latest/developerguide/functional-differences.html)
  +  [Supported MongoDB APIs, Operations, and Data Types](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html).
+ The "partitioner" option is restricted to `ShardedPartitioner`, `PaginateIntoPartitionsPartitioner`, and `SinglePartitionPartitioner`. It cannot use default `SamplePartitioner` and `PaginateBySizePartitioner` for Amazon DocumentDB because the stage operator does not support the MongoDB API. For more information, see [Supported MongoDB APIs, Operations, and Data Types](https://docs.aws.amazon.com/documentdb/latest/developerguide/mongo-apis.html).
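
The option rename noted above can be sketched as follows; the host, database, and collection values are hypothetical.

```python
# Sketch of the option rename when moving to the AWS Glue 4.0 MongoDB connector.
glue3_options = {
    "uri": "mongodb://example-host:27017",
    "database": "demo",
    "collection": "products",
}
glue4_options = {
    "connection.uri": "mongodb://example-host:27017",  # "uri" was renamed
    "database": "demo",
    "collection": "products",
    # For Amazon DocumentDB, only ShardedPartitioner,
    # PaginateIntoPartitionsPartitioner, and SinglePartitionPartitioner apply.
    "partitioner": "ShardedPartitioner",
}
```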

### Delta Lake
<a name="migrating-version-40-connector-driver-migration-delta"></a>
+ Delta Lake now supports [time travel in SQL](https://docs.delta.io/2.1.0/delta-batch.html#query-an-older-snapshot-of-a-table-time-travel) to query older data easily. With this update, time travel is now available both in Spark SQL and through the DataFrame API, including support for the `TIMESTAMP AS OF` clause in SQL.
+ Spark 3.3 introduces [Trigger.AvailableNow](https://issues.apache.org/jira/browse/SPARK-36533) for running streaming queries as an equivalent to `Trigger.Once` for batch queries. This support is also available when using Delta tables as a streaming source.
+ Support for SHOW COLUMNS to return the list of columns in a table.
+ Support for [DESCRIBE DETAIL](https://docs.delta.io/2.1.0/delta-utility.html#retrieve-delta-table-details) in the Scala and Python DeltaTable API. It retrieves detailed information about a Delta table using either the DeltaTable API or Spark SQL.
+ Support for returning operation metrics from SQL [Delete](https://github.com/delta-io/delta/pull/1328), [Merge](https://github.com/delta-io/delta/pull/1327), and [Update](https://github.com/delta-io/delta/pull/1331) commands. Previously these SQL commands returned an empty DataFrame; now they return a DataFrame with useful metrics about the operation performed.
+ Optimize performance improvements:
  + Set the configuration option `spark.databricks.delta.optimize.repartition.enabled=true` to use `repartition(1)` instead of `coalesce(1)` in the Optimize command for better performance when compacting many small files.
  + [Improved performance](https://github.com/delta-io/delta/pull/1315) by using a queue-based approach to parallelize compaction jobs.
+ Other notable changes:
  + [Support for using variables](https://github.com/delta-io/delta/issues/1267) in the VACUUM and OPTIMIZE SQL commands.
  + Improvements for CONVERT TO DELTA with catalog tables including:
    + [Autofill the partition schema](https://github.com/delta-io/delta/commit/18d4d12ed06f973006501f6c39c8785db51e2b1f) from the catalog when it's not provided.
    + [Use partition information](https://github.com/delta-io/delta/commit/ebff29904f3ababb889897343f8f8f7a010a1f71) from the catalog to find the data files to commit instead of doing a full directory scan. Instead of committing all data files in the table directory, only data files under the directories of active partitions will be committed.
  + [Support for Change Data Feed (CDF) batch reads](https://github.com/delta-io/delta/issues/1349) on column mapping enabled tables when DROP COLUMN and RENAME COLUMN have not been used. For more information, see the [Delta Lake documentation](https://docs.delta.io/2.1.0/delta-change-data-feed.html#known-limitations).
  + [Improve Update command performance](https://github.com/delta-io/delta/pull/1202) by enabling schema pruning in the first pass.
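
The SQL time travel support described above can be sketched with queries like the following; the table name is hypothetical.

```python
# Sketch (hypothetical table name): Delta Lake 2.1 time travel, now usable
# from Spark SQL as well as through the DataFrame API.
timestamp_query = "SELECT * FROM products_delta TIMESTAMP AS OF '2023-01-01 00:00:00'"
version_query = "SELECT * FROM products_delta VERSION AS OF 12"
# In a Glue 4.0 job: spark.sql(timestamp_query)
```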

### Apache Iceberg
<a name="migrating-version-40-connector-driver-migration-iceberg"></a>
+ Added several [performance improvements](https://iceberg.apache.org/releases/#performance-improvements) for scan planning and Spark queries.
+ Added a common REST catalog client that uses change-based commits to resolve commit conflicts on the service side.
+ `AS OF` syntax for SQL time travel queries is supported.
+ Added merge-on-read support for MERGE and UPDATE queries.
+ Added support to rewrite partitions using Z-order.
+ Added a spec and implementation for Puffin, a format for large stats and index blobs, like [Theta sketches](https://datasketches.apache.org/docs/Theta/InverseEstimate.html) or bloom filters.
+ Added new interfaces for consuming data incrementally (both append and changelog scans).
+ Added support for bulk operations and ranged reads to FileIO interfaces.
+ Added more metadata tables to show delete files in the metadata tree.
+ The drop table behavior changed. In Iceberg 0.13.1, running `DROP TABLE` removes the table from the catalog and deletes the table contents as well. In Iceberg 1.0.0, `DROP TABLE` only removes the table from the catalog. To delete the table contents use `DROP TABLE PURGE`.
+ Parquet vectorized reads are enabled by default in Iceberg 1.0.0. If you want to disable vectorized reads, set `read.parquet.vectorization.enabled` to `false`.
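
The two Iceberg 1.0.0 behavior changes above can be sketched with the following statements; the catalog and table names are hypothetical.

```python
# Sketch (hypothetical catalog/table names): Iceberg 1.0.0 drop-table
# semantics and disabling vectorized Parquet reads.
drop_metadata_only = "DROP TABLE glue_catalog.db.products"        # removes catalog entry only
drop_with_contents = "DROP TABLE glue_catalog.db.products PURGE"  # also deletes table contents
disable_vectorized_reads = (
    "ALTER TABLE glue_catalog.db.products "
    "SET TBLPROPERTIES ('read.parquet.vectorization.enabled'='false')"
)
```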

### Oracle
<a name="migrating-version-40-connector-driver-migration-oracle"></a>

Changes are minor.

### MySQL
<a name="migrating-version-40-connector-driver-migration-mysql"></a>

Changes are minor.

### Amazon Redshift
<a name="migrating-version-40-connector-driver-migration-redshift"></a>

AWS Glue 4.0 features a new Amazon Redshift connector with a new JDBC driver. For information about the enhancements and how to migrate from previous AWS Glue versions, see [Redshift connections](aws-glue-programming-etl-connect-redshift-home.md).

## Appendix A: Notable dependency upgrades
<a name="migrating-version-40-appendix-dependencies"></a>

The following are dependency upgrades:


| Dependency | Version in AWS Glue 4.0 | Version in AWS Glue 3.0 | Version in AWS Glue 2.0 | Version in AWS Glue 1.0 | 
| --- | --- | --- | --- | --- | 
| Spark | 3.3.0-amzn-1 | 3.1.1-amzn-0 | 2.4.3 | 2.4.3 | 
| Hadoop | 3.3.3-amzn-0 | 3.2.1-amzn-3 | 2.8.5-amzn-5 | 2.8.5-amzn-1 | 
| Scala | 2.12 | 2.12 | 2.11 | 2.11 | 
| Jackson | 2.13.3 | 2.10.x | 2.7.x | 2.7.x | 
| Hive | 2.3.9-amzn-2 | 2.3.7-amzn-4 | 1.2 | 1.2 | 
| EMRFS | 2.54.0 | 2.46.0 | 2.38.0 | 2.30.0 | 
| Json4s | 3.7.0-M11 | 3.6.6 | 3.5.x | 3.5.x | 
| Arrow | 7.0.0 | 2.0.0 | 0.10.0 | 0.10.0 | 
| AWS Glue Data Catalog client | 3.7.0 | 3.0.0 | 1.10.0 | N/A | 
| Python | 3.10 | 3.7 | 2.7 & 3.6 | 2.7 & 3.6 | 
| Boto | 1.26 | 1.18 | 1.12 | N/A | 

## Appendix B: JDBC driver upgrades
<a name="migrating-version-40-appendix-jdbc-driver"></a>

The following are JDBC driver upgrades:


| Driver | JDBC driver version in past AWS Glue versions | JDBC driver version in AWS Glue 3.0 | JDBC driver version in AWS Glue 4.0 | 
| --- | --- | --- | --- | 
| MySQL | 5.1 | 8.0.23 | 8.0.23 | 
| Microsoft SQL Server | 6.1.0 | 7.0.0 | 9.4.0 | 
| Oracle Databases | 11.2 | 21.1 | 21.7 | 
| PostgreSQL | 42.1.0 | 42.2.18 | 42.3.6 | 
| MongoDB | 2.0.0 | 4.0.0 | 4.7.2 | 
| Amazon Redshift | redshift-jdbc41-1.2.12.1017 | redshift-jdbc41-1.2.12.1017 | redshift-jdbc42-2.1.0.16 | 

## Appendix C: Connector upgrades
<a name="migrating-version-40-appendix-connector"></a>

The following are connector upgrades:


| Driver | Connector version in AWS Glue 3.0 | Connector version in AWS Glue 4.0 | 
| --- | --- | --- | 
| MongoDB | 3.0.0 | 10.0.4 | 
| Hudi | 0.10.1 | 0.12.1 | 
| Delta Lake | 1.0.0 | 2.1.0 | 
| Iceberg | 0.13.1 | 1.0.0 | 
| DynamoDB | 1.11 | 1.12 | 

# Generative AI upgrades for Apache Spark in AWS Glue
<a name="upgrade-analysis"></a>

 Spark Upgrades in AWS Glue enables data engineers and developers to upgrade and migrate their existing AWS Glue Spark jobs to the latest Spark releases using generative AI. Data engineers can use it to scan their AWS Glue Spark jobs, generate upgrade plans, execute plans, and validate outputs. It reduces the time and cost of Spark upgrades by automating the undifferentiated work of identifying and updating Spark scripts, configurations, dependencies, methods, and features. 

![\[The GIF shows an end to end implementation of a sample upgrade analysis workflow.\]](http://docs.aws.amazon.com/glue/latest/dg/images/demo_lumos.gif)


## How it works
<a name="upgrade-analysis-how-it-works"></a>

 When you use upgrade analysis, AWS Glue identifies differences between versions and configurations in your job's code to generate an upgrade plan. The upgrade plan details all code changes and required migration steps. Next, AWS Glue builds and runs the upgraded application in a validation environment and generates a list of code changes for you to migrate your job. You can view the updated script along with a summary that details the proposed changes. After running your own tests, accept the changes, and the AWS Glue job is updated automatically to the latest version with the new script. 

 The upgrade analysis process can take some time to complete, depending on the complexity of the job and the workload. The results of the upgrade analysis will be stored in the specified Amazon S3 path, which can be reviewed to understand the upgrade and any potential compatibility issues. After reviewing the upgrade analysis results, you can decide whether to proceed with the actual upgrade or make any necessary changes to the job before upgrading. 

## Prerequisites
<a name="upgrade-analysis-prerequisites"></a>

 The following prerequisites are required to use generative AI to upgrade jobs in AWS Glue: 
+  AWS Glue 2.0 PySpark jobs – only AWS Glue 2.0 jobs can be upgraded to AWS Glue 5.0. 
+  IAM permissions are required to start the analysis, review the results and upgrade your job. For more information, see the examples in the [Permissions](#auto-upgrade-permissions) section below. 
+  If using AWS KMS to encrypt analysis artifacts, then additional AWS KMS permissions are needed. For more information, see the examples in the [AWS KMS policy](#auto-upgrade-kms-policy) section below. 

### Permissions
<a name="auto-upgrade-permissions"></a>

#### To start a new upgrade analysis, you need the following permissions:
<a name="collapsible-section-1"></a>

1.  Update the IAM policy of the caller with the following permission: 

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "glue:StartJobUpgradeAnalysis",
                   "glue:StartJobRun",
                   "glue:GetJobRun",
                   "glue:GetJob",
                   "glue:BatchStopJobRun"
               ],
               "Resource": [
                   "arn:aws:glue:us-east-1:111122223333:job/jobName"
               ]
           },
           {
               "Effect": "Allow",
               "Action": [
                   "s3:GetObject"
               ],
               "Resource": [
                   "arn:aws:s3:::amzn-s3-demo-bucket/script-location/*"
               ]
           },
           {
               "Effect": "Allow",
               "Action": [
                   "s3:PutObject",
                   "s3:GetObject"
               ],
               "Resource": [
                   "arn:aws:s3:::amzn-s3-demo-bucket/results/*"
               ]
           },
           {
               "Effect": "Allow",
               "Action": [
                   "kms:Decrypt",
                   "kms:GenerateDataKey"
               ],
               "Resource": "arn:aws:kms:us-east-1:111122223333:key/key-id"
           }
       ]
   }
   ```

------

1.  Update the execution role of the job you are upgrading to include the following inline policy: 

   ```
   {
     "Effect": "Allow",
     "Action": ["s3:GetObject"],
     "Resource": [
       "ARN of the Amazon S3 path provided on API",
       "ARN of the Amazon S3 path provided on API/*"
     ]
   }
   ```

    For example, if you are using the Amazon S3 path `s3://amzn-s3-demo-bucket/upgraded-result`, then the policy will be: 

   ```
   {
     "Effect": "Allow",
     "Action": ["s3:GetObject"],
     "Resource": [
       "arn:aws:s3:::amzn-s3-demo-bucket/upgraded-result",
       "arn:aws:s3:::amzn-s3-demo-bucket/upgraded-result/*"
     ]
   }
   ```

#### To retrieve the details of an analysis, you need the following permissions:
<a name="collapsible-section-2"></a>

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:GetJobUpgradeAnalysis"
      ],
      "Resource": [
        "arn:aws:glue:us-east-1:111122223333:job/jobName"
      ]
    }
  ]
}
```

------

#### To stop an in-progress analysis, you need the following permissions:
<a name="collapsible-section-3"></a>

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:StopJobUpgradeAnalysis",
        "glue:BatchStopJobRun"
      ],
      "Resource": [
        "arn:aws:glue:us-east-1:111122223333:job/jobName"
      ]
    }
  ]
}
```

------

#### To list all the analyses submitted on a specific job, you need the following permissions:
<a name="collapsible-section-4"></a>

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:ListJobUpgradeAnalyses"
      ],
      "Resource": [
        "arn:aws:glue:us-east-1:111122223333:job/jobName"
      ]
    }
  ]
}
```

------

#### To accept the changes from an analysis and upgrade a job, you need the following permissions:
<a name="collapsible-section-5"></a>

### AWS KMS policy
<a name="auto-upgrade-kms-policy"></a>

 To pass your own custom AWS KMS key when starting an analysis, configure the following permissions on the AWS KMS key. 

#### Configuring the result artifact encryption using an AWS KMS key:
<a name="w2aac37b7c20c13c13b5b5"></a>

 This policy ensures that you have both the encryption and decryption permissions on the AWS KMS key. 

```
{
    "Effect": "Allow",
    "Principal":{
        "AWS": "<IAM Customer caller ARN>"
    },
    "Action": [
      "kms:Decrypt",
      "kms:GenerateDataKey"
    ],
    "Resource": "<key-arn-passed-on-start-api>"
}
```

## Running an upgrade analysis and applying the upgrade script
<a name="auto-upgrade-procedure"></a>

 You can run an upgrade analysis on a job that you select from the **Jobs** view; the analysis generates an upgrade plan. 

1.  From **Jobs**, select an AWS Glue 2.0 job, then choose **Run upgrade analysis** from the **Actions** menu.   
![\[The screenshot shows the Upgrade analysis with AI from the action menu.\]](http://docs.aws.amazon.com/glue/latest/dg/images/upgrade-analysis-run-action-menu.png)

1.  In the modal, select a path to store your generated upgrade plan under **Result path**. This must be an Amazon S3 bucket that you can access and write to.   
![\[The screenshot shows the upgrade analysis configuration options.\]](http://docs.aws.amazon.com/glue/latest/dg/images/upgrade-analysis-configuration-options.png)

1.  Configure additional options, if needed: 
   +  **Run configuration** – optional: The run configuration is an optional setting that lets you customize various aspects of the validation runs performed during the upgrade analysis. It is used to execute the upgraded script and lets you select compute environment properties (worker type, number of workers, and so on). Note that you should use non-production developer accounts to run the validations on sample datasets before reviewing the changes, accepting them, and applying them to production environments. The run configuration includes the following customizable parameters: 
     + Worker type: You can specify the type of worker to be used for the validation runs, allowing you to choose appropriate compute resources based on your requirements.
     + Number of workers: You can define the number of workers to be provisioned for the validation runs, enabling you to scale resources according to your workload needs.
     + Job timeout (in minutes): This parameter allows you to set a time limit for the validation runs, ensuring that the jobs terminate after a specified duration to prevent excessive resource consumption.
     + Security configuration: You can configure security settings, such as encryption and access control, to ensure the protection of your data and resources during the validation runs.
     + Additional job parameters: If needed, you can add new job parameters to further customize the execution environment for the validation runs.

      By leveraging the run configuration, you can tailor the validation runs to suit your specific requirements. For example, you can configure the validation runs to use a smaller dataset, which allows the analysis to complete more quickly and optimizes costs. This approach ensures that the upgrade analysis is performed efficiently while minimizing resource utilization and associated costs during the validation phase. 
   +  **Encryption configuration** – optional: 
     + **Enable upgrade artifacts encryption**: Enable at-rest encryption when writing data to the result path. If you don't want to encrypt your upgrade artifacts, leave this option unchecked.

1.  Choose **Run** to start the upgrade analysis. While the analysis is running, you can view the results on the **Upgrade analysis** tab. The analysis details window will show you information about the analysis as well as links to the upgrade artifacts. 
   +  **Result path** – this is where the results summary and upgrade script are stored. 
   +  **Upgraded script in Amazon S3** – the location of the upgrade script in Amazon S3. You can view the script prior to applying the upgrade. 
   +  **Upgrade summary in Amazon S3** – the location of the upgrade summary in Amazon S3. You can view the upgrade summary prior to applying the upgrade. 

1.  When the upgrade analysis is completed successfully, you can apply the upgrade script to automatically upgrade your job by choosing **Apply upgraded script**. 

    Once applied, the AWS Glue version will be updated to 4.0. You can view the new script in the **Script** tab.   
![\[The screenshot shows the completed upgrade analysis. The button for Apply upgraded script is visible.\]](http://docs.aws.amazon.com/glue/latest/dg/images/upgrade-analysis-analysis-details-preview.png)

## Understanding your upgrade summary
<a name="auto-upgrade-analysis-summary"></a>

 This example demonstrates the process of upgrading an AWS Glue job from version 2.0 to version 4.0. The sample job reads product data from an Amazon S3 bucket, applies several transformations to the data using Spark SQL, and then saves the transformed results back to an Amazon S3 bucket. 

### Original code (AWS Glue 2.0) - before upgrade
<a name="w2aac37b7c20c21b5b1"></a>

```
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
from awsglue.job import Job
import json
from pyspark.sql.types import StructType

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

gdc_database = "s3://aws-glue-scripts-us-east-1-gamma/demo-database/"
schema_location = (
    "s3://aws-glue-scripts-us-east-1-gamma/DataFiles/"
)

products_schema_string = spark.read.text(
    f"{schema_location}schemas/products_schema"
).first()[0]

product_schema = StructType.fromJson(json.loads(products_schema_string))

products_source_df = (
    spark.read.option("header", "true")
    .schema(product_schema)
    .option(
        "path",
        f"{gdc_database}products/",
    )
    .csv(f"{gdc_database}products/")
)

products_source_df.show()
products_temp_view_name = "spark_upgrade_demo_product_view"
products_source_df.createOrReplaceTempView(products_temp_view_name)

query = f"select {products_temp_view_name}.*, format_string('%0$s-%0$s', category, subcategory) as unique_category from {products_temp_view_name}"
products_with_combination_df = spark.sql(query)
products_with_combination_df.show()

products_with_combination_df.createOrReplaceTempView(products_temp_view_name)
product_df_attribution = spark.sql(
    f"""
SELECT *,
unbase64(split(product_name, ' ')[0]) as product_name_decoded,
unbase64(split(unique_category, '-')[1]) as subcategory_decoded
FROM {products_temp_view_name}
"""
)
product_df_attribution.show()


product_df_attribution.write.mode("overwrite").option("header", "true").option(
    "path", f"{gdc_database}spark_upgrade_demo_product_agg/"
).saveAsTable("spark_upgrade_demo_product_agg", external=True)

spark_upgrade_demo_product_agg_table_df = spark.sql(
    f"SHOW TABLE EXTENDED in default like 'spark_upgrade_demo_product_agg'"
)
spark_upgrade_demo_product_agg_table_df.show()
job.commit()
```

### New code (Glue 4.0) - after upgrade
<a name="upgrade-analysis-example-new-code-glue-4"></a>

```
from awsglue.transforms import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
from awsglue.job import Job
import json
from pyspark.sql.types import StructType

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
# change 1
spark.conf.set("spark.sql.adaptive.enabled", "false")
# change 2
spark.conf.set("spark.sql.legacy.pathOptionBehavior.enabled", "true")
job = Job(glueContext)

gdc_database = "s3://aws-glue-scripts-us-east-1-gamma/demo-database/"
schema_location = (
    "s3://aws-glue-scripts-us-east-1-gamma/DataFiles/"
)

products_schema_string = spark.read.text(
    f"{schema_location}schemas/products_schema"
).first()[0]

product_schema = StructType.fromJson(json.loads(products_schema_string))

products_source_df = (
    spark.read.option("header", "true")
    .schema(product_schema)
    .option(
        "path",
        f"{gdc_database}products/",
    )
    .csv(f"{gdc_database}products/")
)

products_source_df.show()
products_temp_view_name = "spark_upgrade_demo_product_view"
products_source_df.createOrReplaceTempView(products_temp_view_name)

# change 3
query = f"select {products_temp_view_name}.*, format_string('%1$s-%1$s', category, subcategory) as unique_category from {products_temp_view_name}"
products_with_combination_df = spark.sql(query)
products_with_combination_df.show()

products_with_combination_df.createOrReplaceTempView(products_temp_view_name)
# change 4
product_df_attribution = spark.sql(
    f"""
SELECT *,
try_to_binary(split(product_name, ' ')[0], 'base64') as product_name_decoded,
try_to_binary(split(unique_category, '-')[1], 'base64') as subcategory_decoded
FROM {products_temp_view_name}
"""
)
product_df_attribution.show()


product_df_attribution.write.mode("overwrite").option("header", "true").option(
    "path", f"{gdc_database}spark_upgrade_demo_product_agg/"
).saveAsTable("spark_upgrade_demo_product_agg", external=True)

spark_upgrade_demo_product_agg_table_df = spark.sql(
    f"SHOW TABLE EXTENDED in default like 'spark_upgrade_demo_product_agg'"
)
spark_upgrade_demo_product_agg_table_df.show()
job.commit()
```

### Explanation of analysis summary
<a name="upgrade-analysis-explanation-summary"></a>

![\[The screenshot shows the Upgrade analysis summary.\]](http://docs.aws.amazon.com/glue/latest/dg/images/upgrade-analysis-explanation-summary.png)


 Based on the summary, AWS Glue proposes four changes in order to successfully upgrade the script from AWS Glue 2.0 to AWS Glue 4.0: 

1.  **Spark SQL configuration (spark.sql.adaptive.enabled)**: This change restores the previous application behavior because Spark SQL adaptive query execution is a new feature that is enabled starting with Spark 3.2. You can inspect this configuration change and enable or disable it as you prefer. 

1.  **DataFrame API change**: The path option cannot coexist with other DataFrameReader operations like `load()`. To retain the previous behavior, AWS Glue updated the script to add a new SQL configuration **(spark.sql.legacy.pathOptionBehavior.enabled)**. 

1.  **Spark SQL API change**: The behavior of `strfmt` in `format_string(strfmt, obj, ...)` has been updated to disallow `0$` as the first argument. To ensure compatibility, AWS Glue has modified the script to use `1$` as the first argument instead. 

1.  **Spark SQL API change**: The `unbase64` function does not allow malformed string inputs. To retain the previous behavior, AWS Glue updated the script to use the `try_to_binary` function. 
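
Change 4 can be illustrated outside Spark with a pure-Python analogue (this is an analogy, not Spark code): `try_to_binary(..., 'base64')` returns NULL for malformed input instead of raising an error, matching the tolerance that `unbase64` no longer provides.

```python
import base64
import binascii

def try_to_binary_base64(s):
    # Pure-Python analogue of Spark's try_to_binary(s, 'base64'):
    # return None on malformed input instead of raising.
    try:
        return base64.b64decode(s, validate=True)
    except binascii.Error:
        return None

try_to_binary_base64("SGVsbG8=")    # valid input decodes to b'Hello'
try_to_binary_base64("not-base64")  # malformed input yields None
```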

## Stopping an upgrade analysis in progress
<a name="auto-upgrade-stopping-analysis"></a>

 You can stop an upgrade analysis while it is in progress. 

1.  Choose the **Upgrade Analysis** tab. 

1.  Select the job that is running, then choose **Stop**. This will stop the analysis. You can then run another upgrade analysis on the same job.   
![\[The screenshot shows the Upgrade analysis tab with a job selected. The job is still running.\]](http://docs.aws.amazon.com/glue/latest/dg/images/upgrade-analysis-tab.png)

## Considerations
<a name="upgrade-analysis-considerations"></a>

 As you begin using Spark Upgrades, there are several important aspects to consider for optimal usage of the service. 
+  **Service Scope and Limitations**: The current release focuses on PySpark code upgrades from AWS Glue version 2.0 to version 5.0. At this time, the service handles PySpark code that doesn't rely on additional library dependencies. You can run automated upgrades for up to 10 jobs concurrently in an AWS account, allowing you to efficiently upgrade multiple jobs while maintaining system stability. 
  +  Only PySpark jobs are supported. 
  +  Upgrade analysis will time out after 24 hours. 
  +  Only one active upgrade analysis can be run at a time for a given job. At the account level, up to 10 active upgrade analyses can be run at the same time. 
+  **Optimizing Costs During Upgrade Process**: Since Spark Upgrades uses generative AI to validate the upgrade plan through multiple iterations, with each iteration running as an AWS Glue job in your account, it's essential to optimize the validation job run configurations for cost efficiency. To achieve this, we recommend specifying a Run Configuration when starting an Upgrade Analysis as follows: 
  +  Use non-production developer accounts and select sample mock datasets that represent your production data but are smaller in size for validation with Spark Upgrades. 
  +  Use right-sized compute resources, such as G.1X workers, and select an appropriate number of workers for processing your sample data. 
  +  Enable AWS Glue job auto-scaling when applicable to automatically adjust resources based on workload. 

   For example, if your production job processes terabytes of data with 20 G.2X workers, you might configure the upgrade job to process a few gigabytes of representative data with 2 G.2X workers and auto-scaling enabled for validation. 
+  **Best Practices**: We strongly recommend starting your upgrade journey with non-production jobs. This approach allows you to familiarize yourself with the upgrade workflow, and understand how the service handles different types of Spark code patterns. 
+  **Alarms and notifications**: When utilizing the Generative AI upgrades feature on a job, ensure that alarms/notifications for failed job runs are turned off. During the upgrade process, there may be up to 10 failed job runs in your account before the upgraded artifacts are provided. 
+  **Anomaly detection rules**: Turn off any anomaly detection rules on the Job that is being upgraded as well, as the data written to output folders during intermediate job runs might not be in the expected format while the upgrade validation is in progress. 
+  **Use upgrade analysis with idempotent jobs**: Use upgrade analysis with idempotent jobs to ensure each subsequent validation job run attempt is similar to the previous one, and doesn't run into issues. Idempotent jobs are jobs that can be run multiple times with the same input data, and they will produce the same output each time. When using the Generative AI upgrades for Apache Spark in AWS Glue, the service will run multiple iterations of your job as part of the validation process. During each iteration, it will make changes to your Spark code and configurations to validate the upgrade plan. If your Spark job is not idempotent, running it multiple times with the same input data could lead to issues. 
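The idempotency requirement can be illustrated with a minimal sketch in plain Python (not AWS Glue APIs): a write with overwrite semantics leaves the same state no matter how many validation iterations run, while an append-style write does not.

```python
def overwrite_run(store, records):
    # Idempotent: repeated runs with the same input leave the same state.
    store["output"] = list(records)
    return store

def append_run(store, records):
    # Not idempotent: each run grows the output.
    store.setdefault("output", []).extend(records)
    return store

store_a, store_b = {}, {}
for _ in range(3):                      # simulate three validation iterations
    overwrite_run(store_a, [1, 2, 3])
    append_run(store_b, [1, 2, 3])

# store_a["output"] is still [1, 2, 3]; store_b["output"] has grown to 9 items
```

A Spark job that overwrites its output partitions behaves like the first function; one that appends to its target behaves like the second and can produce inconsistent results across validation runs.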

## Supported regions
<a name="upgrade-analysis-supported-regions"></a>

Generative AI upgrades for Apache Spark is available in the following regions:
+ **Asia Pacific**: Tokyo (ap-northeast-1), Seoul (ap-northeast-2), Mumbai (ap-south-1), Singapore (ap-southeast-1), and Sydney (ap-southeast-2)
+ **North America**: Canada (ca-central-1)
+ **Europe**: Frankfurt (eu-central-1), Stockholm (eu-north-1), Ireland (eu-west-1), London (eu-west-2), and Paris (eu-west-3)
+ **South America**: São Paulo (sa-east-1)
+ **United States**: North Virginia (us-east-1), Ohio (us-east-2), and Oregon (us-west-2)

## Cross-region inference in Spark Upgrades
<a name="w2aac37b7c20c37"></a>

 Spark Upgrades is powered by Amazon Bedrock and leverages cross-region inference (CRIS). With CRIS, Spark Upgrades will automatically select the optimal region within your geography (as described in more detail [here](https://docs.aws.amazon.com/bedrock/latest/userguide/cross-region-inference.html)) to process your inference request, maximizing available compute resources and model availability, and providing the best customer experience. There's no additional cost for using cross-region inference. 

 Cross-region inference requests are kept within the AWS Regions that are part of the geography where the data originally resides. For example, a request made within the US is kept within the AWS Regions in the US. Although the data remains stored only in the primary region, when using cross-region inference, your input prompts and output results may move outside of your primary region. All data will be transmitted encrypted across Amazon's secure network. 

# Working with Spark jobs in AWS Glue
<a name="etl-jobs-section"></a>

Provides information on AWS Glue for Spark ETL jobs.

**Topics**
+ [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md)
+ [AWS Glue Spark and PySpark jobs](spark_and_pyspark.md)
+ [AWS Glue worker types](worker-types.md)
+ [Streaming ETL jobs in AWS Glue](add-job-streaming.md)
+ [Record matching with AWS Lake Formation FindMatches](machine-learning.md)
+ [Migrate Apache Spark programs to AWS Glue](glue-author-migrate-apache-spark.md)

# Using job parameters in AWS Glue jobs
<a name="aws-glue-programming-etl-glue-arguments"></a>

When creating an AWS Glue job, you set some standard fields, such as `Role` and `WorkerType`. You can provide additional configuration information through the `Argument` fields (**Job Parameters** in the console). In these fields, you can provide AWS Glue jobs with the arguments (parameters) listed in this topic. 

 For more information about the AWS Glue Job API, see [Jobs](aws-glue-api-jobs-job.md). 

**Note**  
 Job arguments have a maximum size limit of 260 KB. A validation check raises an error if the argument size is greater than 260 KB. 
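As a rough pre-flight check before submitting a job, you could measure the serialized size of your argument map against that limit. This is a sketch only; the service's exact size accounting may differ.

```python
import json

MAX_ARGS_KB = 260  # documented job-argument size limit

def arguments_size_kb(arguments: dict) -> float:
    """Approximate size of a job-argument map in KB (sketch only)."""
    return len(json.dumps(arguments).encode("utf-8")) / 1024

args = {"--scriptLocation": "s3://my_glue/libraries/test_lib.py"}
assert arguments_size_kb(args) < MAX_ARGS_KB
```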



## Setting job parameters
<a name="w2aac37c11b8c11"></a>

You can configure a job through the console on the **Job details** tab, under the **Job Parameters** heading. You can also configure a job through the AWS CLI by setting `DefaultArguments` or `NonOverridableArguments` on a job, or setting `Arguments` on a job run. Arguments set on the job will be passed in every time the job is run, while arguments set on the job run will only be passed in for that individual run. 

For example, the following is the syntax for running a job using `--arguments` to set a job parameter.

```
$ aws glue start-job-run --job-name "CSV to CSV" --arguments='--scriptLocation="s3://my_glue/libraries/test_lib.py"'
```

## Accessing job parameters
<a name="w2aac37c11b8c13"></a>

When writing AWS Glue scripts, you may want to access job parameter values to alter the behavior of your own code. We provide helper methods to do so in our libraries. These methods resolve job run parameter values that override job parameter values. When resolving parameters set in multiple places, job `NonOverridableArguments` will override job run `Arguments`, which will override job `DefaultArguments`.
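That precedence can be sketched as a simple dictionary merge, lowest precedence first (illustrative only, not the AWS Glue implementation):

```python
def resolve_job_parameters(default_args, run_args, non_overridable_args):
    # Lowest to highest precedence: job DefaultArguments,
    # then job-run Arguments, then job NonOverridableArguments.
    resolved = dict(default_args)
    resolved.update(run_args)
    resolved.update(non_overridable_args)
    return resolved

params = resolve_job_parameters(
    default_args={"--job-language": "python", "--TempDir": "s3://bucket/tmp/"},
    run_args={"--TempDir": "s3://bucket/run-tmp/"},
    non_overridable_args={"--enable-metrics": "true"},
)
# The job-run value for --TempDir wins over the job default.
```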

**In Python:**

In Python jobs, we provide a function named `getResolvedOptions`. For more information, see [Accessing parameters using `getResolvedOptions`](aws-glue-api-crawler-pyspark-extensions-get-resolved-options.md). Job parameters are available in the `sys.argv` variable.

**In Scala:**

In Scala jobs, we provide an object named `GlueArgParser`. For more information, see [AWS Glue Scala GlueArgParser APIs](glue-etl-scala-apis-glue-util-glueargparser.md). Job parameters are available in the `sysArgs` variable.

## Job parameter reference
<a name="job-parameter-reference"></a>

**AWS Glue recognizes the following argument names that you can use to set up the script environment for your jobs and job runs:**

**`--additional-python-modules`**  
 A comma delimited list representing a set of Python packages to be installed. You can install packages from PyPI or provide a custom distribution. A PyPI package entry will be in the format `package==version`, with the PyPI name and version of your target package. A custom distribution entry is the S3 path to the distribution.  
Entries use Python version matching to match package and version. This means you will need to use two equals signs, such as `==`. There are other version matching operators, for more information see [PEP 440](https://peps.python.org/pep-0440/#version-matching).   
To pass module installation options to `pip3`, use the [--python-modules-installer-option](#python-modules-installer-option) parameter.

**`--auto-scale-within-microbatch`**  
The default value is true. This parameter can only be used for AWS Glue streaming jobs, which process the streaming data in a series of micro-batches, and auto scaling must be enabled. When this value is set to false, AWS Glue computes the exponential moving average of batch duration for completed micro-batches and compares it with the window size to determine whether to scale the number of executors up or down. Scaling happens only when a micro-batch is completed. When this value is set to true, AWS Glue scales up during a micro-batch when the number of Spark tasks remains the same for 30 seconds, or when the current batch processing time is greater than the window size. The number of executors drops if an executor has been idle for more than 60 seconds, or if the exponential moving average of batch duration is low. 

**`--class`**  
The Scala class that serves as the entry point for your Scala script. This applies only if your `--job-language` is set to `scala`.

**`--continuous-log-conversionPattern`**  
Specifies a custom conversion log pattern for a job enabled for continuous logging. The conversion pattern applies only to driver logs and executor logs. It does not affect the AWS Glue progress bar.

**`--continuous-log-logGroup`**  
Specifies a custom Amazon CloudWatch log group name for a job enabled for continuous logging.

**`--continuous-log-logStreamPrefix`**  
 Specifies a custom CloudWatch log stream prefix for a job enabled for continuous logging.

**`--customer-driver-env-vars` and `--customer-executor-env-vars`**  
These parameters set operating-system environment variables on each worker (driver or executor) respectively. You can use these parameters when building platforms and custom frameworks on top of AWS Glue, to let your users write jobs on top of them. These two parameters let you set different environment variables on the driver and executors without having to inject the same logic into the job script itself.   
**Example usage**  
The following is an example of using these parameters:

```
"--customer-driver-env-vars", "CUSTOMER_KEY1=VAL1,CUSTOMER_KEY2=\"val2,val2 val2\"",
"--customer-executor-env-vars", "CUSTOMER_KEY3=VAL3,KEY4=VAL4"
```
Setting these in the job run argument is equivalent to running the following commands:  
In the driver:  
+ export CUSTOMER_KEY1=VAL1
+ export CUSTOMER_KEY2="val2,val2 val2"
In the executor:  
+ export CUSTOMER_KEY3=VAL3
Then, in the job script itself, you can retrieve the environment variables using `os.environ.get("CUSTOMER_KEY1")` or `System.getenv("CUSTOMER_KEY1")`.   
**Enforced syntax**  
Observe the following standards when defining environment variables:
+ Each key must have the `CUSTOMER_` prefix.

  For example: for `"CUSTOMER_KEY3=VAL3,KEY4=VAL4"`, `KEY4=VAL4` will be ignored and not set.
+ Each key and value pair must be delineated with a single comma.

  For example: `"CUSTOMER_KEY3=VAL3,CUSTOMER_KEY4=VAL4"`
+ If the "value" has spaces or commas, then it must be defined within quotations.

  For example: `CUSTOMER_KEY2=\"val2,val2 val2\"`
This syntax closely models the standards of setting bash environment variables.
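The enforced syntax above can be modeled with a short parser sketch (illustrative only, not the service's implementation): split on commas outside double quotes, then drop any key without the required prefix.

```python
import re

# One KEY=VALUE pair: the value is either a double-quoted string (which may
# contain commas and spaces) or a run of characters up to the next comma.
_PAIR = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)=("[^"]*"|[^,]*)(?:,|$)')

def parse_customer_env_vars(spec):
    env = {}
    for key, value in _PAIR.findall(spec):
        if not key.startswith("CUSTOMER_"):
            continue  # keys without the required prefix are ignored
        env[key] = value.strip('"')
    return env

parse_customer_env_vars('CUSTOMER_KEY1=VAL1,CUSTOMER_KEY2="val2,val2 val2"')
# → {'CUSTOMER_KEY1': 'VAL1', 'CUSTOMER_KEY2': 'val2,val2 val2'}
```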

**`--datalake-formats` **  
Supported in AWS Glue 3.0 and later versions.  
Specifies the data lake framework to use. AWS Glue adds the required JAR files for the frameworks that you specify into the `classpath`. For more information, see [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md).  
You can specify one or more of the following values, separated by a comma:  
+ `hudi`
+ `delta`
+ `iceberg`
For example, pass the following argument to specify all three frameworks.  

```
'--datalake-formats': 'hudi,delta,iceberg'
```

**`--disable-proxy-v2`**  
 Disable the service proxy to allow AWS service calls to Amazon S3, CloudWatch, and AWS Glue originating from your script through your VPC. For more information, see [ Configuring AWS calls to go through your VPC ](https://docs.aws.amazon.com/glue/latest/dg/connection-VPC-disable-proxy.html). To disable the service proxy, set the value of this parameter to `true`.

**`--enable-auto-scaling`**  
Turns on auto scaling and per-worker billing when you set the value to `true`.

**`--enable-continuous-cloudwatch-log`**  
Enables real-time continuous logging for AWS Glue jobs. You can view real-time Apache Spark job logs in CloudWatch.

**`--enable-continuous-log-filter`**  
Specifies a standard filter (`true`) or no filter (`false`) when you create or edit a job enabled for continuous logging. Choosing the standard filter prunes out non-useful Apache Spark driver/executor and Apache Hadoop YARN heartbeat log messages. Choosing no filter gives you all the log messages.

**`--enable-glue-datacatalog`**  
Enables you to use the AWS Glue Data Catalog as an Apache Spark Hive metastore. To enable this feature, set the value to `true`.

**`--enable-job-insights`**  
Enables additional error analysis monitoring with AWS Glue job run insights. For details, see [Monitoring with AWS Glue job run insights](monitor-job-insights.md). By default, the value is set to `true` and job run insights are enabled.  
This option is available for AWS Glue version 2.0 and 3.0.

**`--enable-lakeformation-fine-grained-access`**  
Enables fine-grained access control for AWS Glue jobs. For more information, see [Using AWS Glue with AWS Lake Formation for fine-grained access control](security-lf-enable.md).

**`--enable-metrics`**  
Enables the collection of metrics for job profiling for this job run. These metrics are available on the AWS Glue console and the Amazon CloudWatch console. The value of this parameter is not relevant. To enable this feature, you can provide this parameter with any value, but `true` is recommended for clarity. To disable this feature, remove this parameter from your job configuration.

**`--enable-observability-metrics`**  
 Enables a set of observability metrics to generate insights into what is happening inside each job run, on the Job Runs Monitoring page in the AWS Glue console and the Amazon CloudWatch console. To enable this feature, set the value of this parameter to `true`. To disable this feature, set it to `false` or remove this parameter from your job configuration. 

**`--enable-rename-algorithm-v2`**  
Sets the EMRFS rename algorithm version to version 2. When a Spark job uses dynamic partition overwrite mode, there is a possibility that a duplicate partition is created. For instance, you can end up with a duplicate partition such as `s3://bucket/table/location/p1=1/p1=1`. Here, P1 is the partition that is being overwritten. Rename algorithm version 2 fixes this issue.  
This option is only available on AWS Glue version 1.0.

**`--enable-s3-parquet-optimized-committer`**  
Enables the EMRFS S3-optimized committer for writing Parquet data into Amazon S3. You can supply the parameter/value pair via the AWS Glue console when creating or updating an AWS Glue job. Setting the value to **true** enables the committer. By default, the flag is turned on in AWS Glue 3.0 and off in AWS Glue 2.0.  
For more information, see [Using the EMRFS S3-optimized Committer](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3-optimized-committer.html).

**`--enable-spark-ui`**  
When set to `true`, turns on the feature to use the Spark UI to monitor and debug AWS Glue ETL jobs.

**`--executor-cores`**  
The number of Spark tasks that can run in parallel. This option is supported on AWS Glue 3.0 or later. The value should not exceed 2x the number of vCPUs on the worker type, which is 8 on `G.1X`, 16 on `G.2X`, 32 on `G.4X`, 64 on `G.8X`, 96 on `G.12X`, 128 on `G.16X`, 8 on `R.1X`, 16 on `R.2X`, 32 on `R.4X`, and 64 on `R.8X`. Exercise caution when updating this configuration, because increased task parallelism adds memory and disk pressure and can throttle your source and target systems (for example, by opening more concurrent connections on Amazon RDS), which could impact job performance.
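The per-worker-type limits for `--executor-cores` all follow from the 2x-vCPU rule and the vCPU counts of each worker type (1 DPU = 4 vCPUs), which can be expressed directly:

```python
# vCPUs per worker type (1 DPU = 4 vCPUs; R types mirror the G-type counts).
VCPUS = {
    "G.1X": 4, "G.2X": 8, "G.4X": 16, "G.8X": 32, "G.12X": 48, "G.16X": 64,
    "R.1X": 4, "R.2X": 8, "R.4X": 16, "R.8X": 32,
}

def max_executor_cores(worker_type: str) -> int:
    """Upper bound for --executor-cores: 2x the worker type's vCPU count."""
    return 2 * VCPUS[worker_type]

max_executor_cores("G.1X")   # → 8
max_executor_cores("G.12X")  # → 96
```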

**`--extra-files`**  
The Amazon S3 paths to additional files, such as configuration files that AWS Glue copies to the working directory of your script on the driver node before running it. Multiple values must be complete paths separated by a comma (`,`). The value can be individual files or directory locations. This option is not supported for Python Shell job types.

**`--extra-jars`**  
The Amazon S3 paths to additional files that AWS Glue copies to the driver and executors. AWS Glue also adds these files to the Java classpath before executing your script. Multiple values must be complete paths separated by a comma (`,`). The file name extension does not need to be `.jar`.

**`--extra-py-files`**  
The Amazon S3 paths to additional Python modules that AWS Glue adds to the Python path on the driver node before running your script. Multiple values must be complete paths separated by a comma (`,`). Only individual files are supported, not a directory path.

**`--job-bookmark-option`**  
Controls the behavior of a job bookmark. The following option values can be set.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html)
For example, to enable a job bookmark, pass the following argument.  

```
'--job-bookmark-option': 'job-bookmark-enable'
```

**`--job-language`**  
The script programming language. This value must be either `scala` or `python`. If this parameter is not present, the default is `python`.

**`--python-modules-installer-option`**  
A plaintext string that defines options to be passed to `pip3` when installing modules with [--additional-python-modules](#additional-python-modules). Provide options as you would in the command line, separated by spaces and prefixed by dashes. For more information about usage, see [Installing additional Python modules with pip in AWS Glue 2.0 or later](aws-glue-programming-python-libraries.md#addl-python-modules-support).  
This option is not supported for AWS Glue jobs when you use Python 3.9.

**`--scriptLocation`**  
The Amazon Simple Storage Service (Amazon S3) location where your ETL script is located (in the form `s3://path/to/my/script.py`). This parameter overrides a script location set in the `JobCommand` object.

**`--spark-event-logs-path`**  
Specifies an Amazon S3 path to a bucket that can be used as a temporary directory for storing Spark UI events. When you use the Spark UI monitoring feature, AWS Glue flushes the Spark event logs to this path every 30 seconds.

**`--TempDir`**  
Specifies an Amazon S3 path to a bucket that can be used as a temporary directory for the job.  
For example, to set a temporary directory, pass the following argument.  

```
'--TempDir': 's3-path-to-directory'
```
AWS Glue creates a temporary bucket for jobs if a bucket doesn't already exist in a Region. This bucket might permit public access. You can either modify the bucket in Amazon S3 to set the public access block, or delete the bucket later after all jobs in that Region have completed.

**`--use-postgres-driver`**  
When you set this value to `true`, AWS Glue prioritizes the Postgres JDBC driver in the class path to avoid a conflict with the Amazon Redshift JDBC driver. This option is only available in AWS Glue version 2.0.

**`--user-jars-first`**  
When you set this value to `true`, AWS Glue prioritizes your extra JAR files in the classpath. This option is only available in AWS Glue version 2.0 or later.

**`--conf`**  
Controls Spark config parameters. It is for advanced use cases.

**`--encryption-type`**  
Legacy parameter. The corresponding behavior should be configured using security configurations. For more information about security configurations, see [Encrypting data written by AWS Glue](encryption-security-configuration.md).

AWS Glue uses the following arguments internally and you should never use them:
+ `--debug` — Internal to AWS Glue. Do not set.
+ `--mode` — Internal to AWS Glue. Do not set.
+ `--JOB_NAME` — Internal to AWS Glue. Do not set.
+ `--endpoint` — Internal to AWS Glue. Do not set.



## Bootstrapping the environment with `sitecustomize`
<a name="w2aac37c11b8c17"></a>

 AWS Glue supports bootstrapping an environment with Python's `site` module using `sitecustomize` to perform site-specific customizations. Bootstrapping your own initialization functions is recommended for advanced use cases only and is supported on a best-effort basis on AWS Glue 4.0. 

 The environment variable prefix, `GLUE_CUSTOMER`, is reserved for customer use. 
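A minimal `sitecustomize.py` sketch: Python's `site` module imports this module automatically at interpreter startup when it is found on the module search path, so it can seed defaults before your script runs. How you place the module on the path (for example, via `--extra-py-files`) and the variable name below are illustrative assumptions, not documented requirements.

```python
# sitecustomize.py -- imported automatically by Python's `site` module at
# interpreter startup when found on the module search path.
import os

# Hypothetical example: seed a default only if the job configuration did not
# already set one. The GLUE_CUSTOMER prefix is reserved for customer use.
os.environ.setdefault("GLUE_CUSTOMER_LOG_LEVEL", "INFO")
```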

# AWS Glue Spark and PySpark jobs
<a name="spark_and_pyspark"></a>

AWS Glue supports Spark and PySpark jobs. A Spark job is run in an Apache Spark environment managed by AWS Glue. It processes data in batches. A streaming ETL job is similar to a Spark job, except that it performs ETL on data streams. It uses the Apache Spark Structured Streaming framework. Some Spark job features are not available to streaming ETL jobs.

The following sections provide information on AWS Glue Spark and PySpark jobs.

**Topics**
+ [Configuring job properties for Spark jobs in AWS Glue](add-job.md)
+ [Editing Spark scripts in the AWS Glue console](edit-script-spark.md)
+ [Jobs (legacy)](console-edit-script.md)
+ [Tracking processed data using job bookmarks](monitor-continuations.md)
+ [Storing Spark shuffle data](monitor-spark-shuffle-manager.md)
+ [Monitoring AWS Glue Spark jobs](monitor-spark.md)
+ [Generative AI troubleshooting for Apache Spark in AWS Glue](troubleshoot-spark.md)
+ [Using materialized views with AWS Glue](materialized-views.md)

# Configuring job properties for Spark jobs in AWS Glue
<a name="add-job"></a>

When you define your job on the AWS Glue console, you provide values for properties to control the AWS Glue runtime environment. 

## Defining job properties for Spark jobs
<a name="create-job"></a>

The following list describes the properties of a Spark job. For the properties of a Python shell job, see [Defining job properties for Python shell jobs](add-job-python.md#create-job-python-properties). For properties of a streaming ETL job, see [Defining job properties for a streaming ETL job](add-job-streaming.md#create-job-streaming-properties).

The properties are listed in the order in which they appear on the **Add job** wizard on AWS Glue console.

**Name**  
Provide a UTF-8 string with a maximum length of 255 characters. 

**Description**  
Provide an optional description of up to 2048 characters. 

**IAM Role**  
Specify the IAM role that is used for authorization to resources used to run the job and access data stores. For more information about permissions for running jobs in AWS Glue, see [Identity and access management for AWS Glue](security-iam.md).

**Type**  
The type of ETL job. This is set automatically based on the type of data sources you select.  
+ **Spark** runs an Apache Spark ETL script with the job command `glueetl`.
+ **Spark Streaming** runs an Apache Spark streaming ETL script with the job command `gluestreaming`. For more information, see [Streaming ETL jobs in AWS Glue](add-job-streaming.md).
+ **Python shell** runs a Python script with the job command `pythonshell`. For more information, see [Configuring job properties for Python shell jobs in AWS Glue](add-job-python.md).

**AWS Glue version**  
AWS Glue version determines the versions of Apache Spark and Python that are available to the job, as specified in the following table.      
<a name="table-glue-versions"></a>[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/add-job.html)
Jobs that are created without specifying an AWS Glue version default to AWS Glue 5.1.

**Language**  
The code in the ETL script defines your job's logic. The script can be coded in Python or Scala. You can choose whether the script that the job runs is generated by AWS Glue or provided by you. You provide the script name and location in Amazon Simple Storage Service (Amazon S3). Confirm that there isn't a file with the same name as the script directory in the path. To learn more about writing scripts, see [AWS Glue programming guide](edit-script.md).

**Worker type**  
The following worker types are available:  
The resources available on AWS Glue workers are measured in DPUs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory.  
+ **G.025X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 0.25 DPU (2 vCPUs, 4 GB of memory) with 84GB disk (approximately 34GB free). We recommend this worker type for low volume streaming jobs. This worker type is only available for AWS Glue version 3.0 or later streaming jobs.
+ **G.1X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 1 DPU (4 vCPUs, 16 GB of memory) with 94GB disk (approximately 44GB free). We recommend this worker type for workloads such as data transforms, joins, and queries, offering a scalable and cost-effective way to run most jobs.
+ **G.2X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 2 DPU (8 vCPUs, 32 GB of memory) with 138GB disk (approximately 78GB free). We recommend this worker type for workloads such as data transforms, joins, and queries, offering a scalable and cost-effective way to run most jobs.
+ **G.4X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 4 DPU (16 vCPUs, 64 GB of memory) with 256GB disk (approximately 230GB free). We recommend this worker type for jobs whose workloads contain your most demanding transforms, aggregations, joins, and queries. 
+ **G.8X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 8 DPU (32 vCPUs, 128 GB of memory) with 512GB disk (approximately 485GB free). We recommend this worker type for jobs whose workloads contain your most demanding transforms, aggregations, joins, and queries.
+ **G.12X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 12 DPU (48 vCPUs, 192 GB of memory) with 768GB disk (approximately 741GB free). We recommend this worker type for jobs with very large and resource-intensive workloads that require significant compute capacity. 
+ **G.16X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 16 DPU (64 vCPUs, 256 GB of memory) with 1024GB disk (approximately 996GB free). We recommend this worker type for jobs with the largest and most resource-intensive workloads that require maximum compute capacity. 
+ **R.1X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 1 DPU with memory-optimized configuration. We recommend this worker type for memory-intensive workloads that frequently encounter out-of-memory errors or require high memory-to-CPU ratios. 
+ **R.2X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 2 DPU with memory-optimized configuration. We recommend this worker type for memory-intensive workloads that frequently encounter out-of-memory errors or require high memory-to-CPU ratios. 
+ **R.4X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 4 DPU with memory-optimized configuration. We recommend this worker type for large memory-intensive workloads that frequently encounter out-of-memory errors or require high memory-to-CPU ratios. 
+ **R.8X** – When you choose this type, you also provide a value for **Number of workers**. Each worker maps to 8 DPU with memory-optimized configuration. We recommend this worker type for very large memory-intensive workloads that frequently encounter out-of-memory errors or require high memory-to-CPU ratios. 
**Worker Type Specifications**  
The following table provides detailed specifications for all available G worker types:    
**G Worker Type Specifications**    
<a name="table-worker-specifications"></a>[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/add-job.html)
**Important:** G.12X and G.16X worker types, as well as all R worker types (R.1X through R.8X), have higher startup latency.  
You are charged an hourly rate based on the number of DPUs used to run your ETL jobs. For more information, see the [AWS Glue pricing page](https://aws.amazon.com/glue/pricing/).  
For AWS Glue version 1.0 or earlier jobs, when you configure a job using the console and specify a **Worker type** of **Standard**, the **Maximum capacity** is set and the **Number of workers** becomes the value of **Maximum capacity** - 1. If you use the AWS Command Line Interface (AWS CLI) or AWS SDK, you can specify the **Max capacity** parameter, or you can specify both **Worker type** and the **Number of workers**.  
For AWS Glue version 2.0 or later jobs, you cannot specify a **Maximum capacity**. Instead, you should specify a **Worker type** and the **Number of workers**.  
**G.4X** and **G.8X** worker types are available only for AWS Glue version 3.0 or later Spark ETL jobs in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Spain), Europe (Stockholm), and South America (São Paulo).  
**G.12X**, **G.16X**, and **R.1X** through **R.8X** worker types are available only for AWS Glue version 4.0 or later Spark ETL jobs in the following AWS Regions: US East (N. Virginia), US West (Oregon), US East (Ohio), Europe (Ireland), and Europe (Frankfurt). Additional regions will be supported in future releases.

**Requested number of workers**  
For most worker types, you must specify the number of workers that are allocated when the job runs. 

**Job bookmark**  
Specify how AWS Glue processes state information when the job runs. You can have it remember previously processed data, update state information, or ignore state information. For more information, see [Tracking processed data using job bookmarks](monitor-continuations.md).

**Job run queuing**  
Specifies whether job runs are queued to run later when they cannot run immediately due to service quotas.  
When checked, job run queuing is enabled for the job runs. If not populated, the job runs will not be considered for queueing.  
If this setting does not match the value set in the job run, then the value from the job run field will be used.

**Flex execution**  
When you configure a job using AWS Glue Studio or the API, you may specify a standard or flexible job execution class. Your jobs may have varying degrees of priority and time sensitivity. The standard execution class is ideal for time-sensitive workloads that require fast job startup and dedicated resources.  
The flexible execution class is appropriate for non-urgent jobs such as pre-production jobs, testing, and one-time data loads. Flexible job runs are supported for jobs using AWS Glue version 3.0 or later and `G.1X` or `G.2X` worker types. The new worker types (`G.12X`, `G.16X`, and `R.1X` through `R.8X`) do not support flexible execution.  

[![AWS Videos](http://img.youtube.com/vi/FnHCoTuDLXU/0.jpg)](http://www.youtube.com/watch?v=FnHCoTuDLXU)

Flex job runs are billed based on the number of workers running at any point in time. Workers may be added to or removed from a running flexible job run. Instead of billing as a simple calculation of `Max Capacity` \* `Execution Time`, each worker contributes for the time it ran during the job run. The bill is the sum of (`Number of DPUs per worker` \* `time each worker ran`).  
For more information, see the help panel in AWS Glue Studio, or [Jobs](aws-glue-api-jobs-job.md) and [Job runs](aws-glue-api-jobs-runs.md).
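The per-worker billing rule for flexible runs can be written out as a short sketch (illustrative arithmetic, not a billing API):

```python
def flex_dpu_hours(worker_hours, dpus_per_worker):
    """Flex billing sketch: sum of (DPUs per worker * time each worker ran),
    instead of max capacity * total execution time."""
    return sum(dpus_per_worker * hours for hours in worker_hours)

# Example: eight G.1X workers (1 DPU each) run a full hour, and two more
# join for only the final half hour.
flex_dpu_hours([1.0] * 8 + [0.5] * 2, dpus_per_worker=1)  # → 9.0
```

A fixed-capacity calculation for the same run would bill 10 DPUs for the full hour (10.0 DPU-hours); the flex sum bills only the time each worker actually ran.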

**Number of retries**  
Specify the number of times, from 0 to 10, that AWS Glue should automatically restart the job if it fails. Jobs that reach the timeout limit are not restarted.

**Job timeout**  
Sets the maximum execution time in minutes. The maximum is 7 days, or 10,080 minutes; jobs that exceed this limit throw an exception.  
When the value is left blank, the timeout defaults to 2,880 minutes.  
Any existing AWS Glue jobs that had a timeout value greater than 7 days are defaulted to 7 days. For instance, if you specified a timeout of 20 days for a batch job, it will be stopped on the 7th day.  
**Best practices for job timeouts**  
Jobs are billed based on execution time. To avoid unexpected charges, configure timeout values appropriate for the expected execution time of your job. 

**Advanced Properties**    
**Script filename**  
A unique script name for your job. Cannot be named **Untitled job**.  
**Script path**  
The Amazon S3 location of the script. The path must be in the form `s3://bucket/prefix/path/`. It must end with a slash (`/`) and not include any files.  
**Job metrics**  
Turn on or turn off the creation of Amazon CloudWatch metrics when this job runs. To see profiling data, you must enable this option. For more information about how to turn on and visualize metrics, see [Job monitoring and debugging](monitor-profile-glue-job-cloudwatch-metrics.md).   
**Job observability metrics**  
Turn on the creation of additional observability CloudWatch metrics when this job runs. For more information, see [Monitoring with AWS Glue Observability metrics](monitor-observability.md).  
**Continuous logging**  
Turn on continuous logging to Amazon CloudWatch. If this option is not enabled, logs are available only after the job completes. For more information, see [Logging for AWS Glue jobs](monitor-continuous-logging.md).  
**Spark UI**  
Turn on the use of Spark UI for monitoring this job. For more information, see [Enabling the Apache Spark web UI for AWS Glue jobs](monitor-spark-ui-jobs.md).   
**Spark UI logs path**  
The path to write logs when Spark UI is enabled.  
**Spark UI logging and monitoring configuration**  
Choose one of the following options:  
+ *Standard*: write logs using the AWS Glue job run ID as the filename. Turn on Spark UI monitoring in the AWS Glue console.
+ *Legacy*: write logs using 'spark-application-{timestamp}' as the filename. Do not turn on Spark UI monitoring.
+ *Standard and legacy*: write logs to both the standard and legacy locations. Turn on Spark UI monitoring in the AWS Glue console.  
**Maximum concurrency**  
Sets the maximum number of concurrent runs that are allowed for this job. The default is 1. An error is returned when this threshold is reached. The maximum value you can specify is controlled by a service limit. For example, if a previous run of a job is still running when a new instance is started, you might want to return an error to prevent two instances of the same job from running concurrently.   
**Temporary path**  
Provide the location of a working directory in Amazon S3 where temporary intermediate results are written when AWS Glue runs the script. Confirm that there isn't a file with the same name as the temporary directory in the path. This directory is used when AWS Glue reads and writes to Amazon Redshift and by certain AWS Glue transforms.  
AWS Glue creates a temporary bucket for jobs if a bucket doesn't already exist in a region. This bucket might permit public access. You can either modify the bucket in Amazon S3 to set the public access block, or delete the bucket later after all jobs in that region have completed.  
**Delay notification threshold (minutes)**  
Sets the threshold (in minutes) before a delay notification is sent. You can set this threshold to send notifications when a `RUNNING`, `STARTING`, or `STOPPING` job run takes more than an expected number of minutes.  
**Security configuration**  
Choose a security configuration from the list. A security configuration specifies how the data at the Amazon S3 target is encrypted: no encryption, server-side encryption with AWS KMS-managed keys (SSE-KMS), or Amazon S3-managed encryption keys (SSE-S3).  
**Server-side encryption**  
If you select this option, when the ETL job writes to Amazon S3, the data is encrypted at rest using SSE-S3 encryption. Both your Amazon S3 data target and any data that is written to an Amazon S3 temporary directory is encrypted. This option is passed as a job parameter. For more information, see [Protecting Data Using Server-Side Encryption with Amazon S3-Managed Encryption Keys (SSE-S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingServerSideEncryption.html) in the *Amazon Simple Storage Service User Guide*.  
This option is ignored if a security configuration is specified.  
**Use Glue data catalog as the Hive metastore**  
Select to use the AWS Glue Data Catalog as the Hive metastore. The IAM role used for the job must have the `glue:CreateDatabase` permission. A database called “default” is created in the Data Catalog if it does not exist.

**Connections**  
Choose a VPC configuration to access Amazon S3 data sources located in your virtual private cloud (VPC). You can create and manage network connections in AWS Glue. For more information, see [Connecting to data](glue-connections.md). 

**Libraries**    
**Python library path, Dependent JARs path, and Referenced files path**  
Specify these options if your script requires them. You can define the comma-separated Amazon S3 paths for these options when you define the job. You can override these paths when you run the job. For more information, see [Providing your own custom scripts](console-custom-created.md).  
**Job parameters**  
A set of key-value pairs that are passed as named parameters to the script. These are default values that are used when the script is run, but you can override them in triggers or when you run the job. You must prefix the key name with `--`; for example: `--myKey`. You pass job parameters as a map when using the AWS Command Line Interface.  
For examples, see Python parameters in [Passing and accessing Python parameters in AWS Glue](aws-glue-programming-python-calling.md#aws-glue-programming-python-calling-parameters).  
**Tags**  
Tag your job with a **Tag key** and an optional **Tag value**. After tag keys are created, they are read-only. Use tags on some resources to help you organize and identify them. For more information, see [AWS tags in AWS Glue](monitor-tags.md). 
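The **Job parameters** map format described above can be sketched in Python. The key `--myKey` is illustrative, not a built-in Glue parameter, and the job name in the comment is hypothetical.

```python
import json

# Default job parameters are a map whose keys carry the "--" prefix.
arguments = {"--myKey": "value1"}

# Equivalent AWS CLI call:
#   aws glue start-job-run --job-name my-job-name \
#       --arguments '{"--myKey": "value1"}'
payload = json.dumps(arguments)
print(payload)
```

Inside the script, `awsglue.utils.getResolvedOptions(sys.argv, ['myKey'])` returns the value, with the key given without its `--` prefix.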

## Restrictions for jobs that access Lake Formation managed tables
<a name="lf-table-restrictions"></a>

Keep in mind the following notes and restrictions when creating jobs that read from or write to tables managed by AWS Lake Formation: 
+ The following features are not supported in jobs that access tables with cell-level filters:
  + [Job bookmarks](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html) and [bounded execution](https://docs.aws.amazon.com/glue/latest/dg/bounded-execution.html)
  + [Push-down predicates](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html#aws-glue-programming-etl-partitions-pushdowns)
  + [Server-side catalog partition predicates](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html#aws-glue-programming-etl-partitions-cat-predicates)
  + [enableUpdateCatalog](https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html)

# Editing Spark scripts in the AWS Glue console
<a name="edit-script-spark"></a>

A script contains the code that extracts data from sources, transforms it, and loads it into targets. AWS Glue runs a script when it starts a job.

AWS Glue ETL scripts can be coded in Python or Scala. Python scripts use a language that is an extension of the PySpark Python dialect for extract, transform, and load (ETL) jobs. The script contains *extended constructs* to deal with ETL transformations. When you automatically generate the source code logic for your job, a script is created. You can edit this script, or you can provide your own script to process your ETL work.

 For information about defining and editing scripts in AWS Glue, see [AWS Glue programming guide](edit-script.md).

## Additional libraries or files
<a name="w2aac37c11c12c13b9"></a>

If your script requires additional libraries or files, you can specify them as follows:

**Python library path**  
Comma-separated Amazon Simple Storage Service (Amazon S3) paths to Python libraries that are required by the script.  
Only pure Python libraries can be used. Libraries that rely on C extensions, such as the pandas Python Data Analysis Library, are not yet supported.

**Dependent jars path**  
Comma-separated Amazon S3 paths to JAR files that are required by the script.  
Currently, only pure Java or Scala (2.11) libraries can be used.

**Referenced files path**  
Comma-separated Amazon S3 paths to additional files (for example, configuration files) that are required by the script.

# Jobs (legacy)
<a name="console-edit-script"></a>

A script contains the code that performs extract, transform, and load (ETL) work. You can provide your own script, or AWS Glue can generate a script with guidance from you. For information about creating your own scripts, see [Providing your own custom scripts](console-custom-created.md).

You can edit a script in the AWS Glue console. When you edit a script, you can add sources, targets, and transforms.

**To edit a script**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). Then choose the **Jobs** tab.

1. Choose a job in the list, and then choose **Action**, **Edit script** to open the script editor.

   You can also access the script editor from the job details page. Choose the **Script** tab, and then choose **Edit script**.

   

## Script editor
<a name="console-edit-script-editor"></a>

The AWS Glue script editor lets you insert, modify, and delete sources, targets, and transforms in your script. The script editor displays both the script and a diagram to help you visualize the flow of data.

To create a diagram for the script, choose **Generate diagram**. AWS Glue uses annotation lines in the script beginning with `##` to render the diagram. To correctly represent your script in the diagram, you must keep the parameters in the annotations and the parameters in the Apache Spark code in sync.

The script editor lets you add code templates wherever your cursor is positioned in the script. At the top of the editor, choose from the following options:
+ To add a source table to the script, choose **Source**.
+ To add a target table to the script, choose **Target**.
+ To add a target location to the script, choose **Target location**.
+ To add a transform to the script, choose **Transform**. For information about the functions that are called in your script, see [Program AWS Glue ETL scripts in PySpark](aws-glue-programming-python.md).
+ To add a Spigot transform to the script, choose **Spigot**.

In the inserted code, modify the `parameters` in both the annotations and Apache Spark code. For example, if you add a **Spigot** transform, verify that the `path` is replaced in both the `@args` annotation line and the `output` code line.

The **Logs** tab shows the logs that are associated with your job as it runs. The most recent 1,000 lines are displayed.

The **Schema** tab shows the schema of the selected sources and targets, when available in the Data Catalog. 

# Tracking processed data using job bookmarks
<a name="monitor-continuations"></a>

AWS Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a *job bookmark*. Job bookmarks help AWS Glue maintain state information and prevent the reprocessing of old data. With job bookmarks, you can process new data when rerunning on a scheduled interval. A job bookmark is composed of the states for various elements of jobs, such as sources, transformations, and targets. For example, your ETL job might read new partitions in an Amazon S3 file. AWS Glue tracks which partitions the job has processed successfully to prevent duplicate processing and duplicate data in the job's target data store.

Job bookmarks are implemented for JDBC data sources, the Relationalize transform, and some Amazon Simple Storage Service (Amazon S3) sources. The following table lists the Amazon S3 source formats that AWS Glue supports for job bookmarks.


| AWS Glue version | Amazon S3 source formats | 
| --- | --- | 
| Version 0.9 | JSON, CSV, Apache Avro, XML | 
| Version 1.0 and later | JSON, CSV, Apache Avro, XML, Parquet, ORC | 

For information about AWS Glue versions, see [Defining job properties for Spark jobs](add-job.md#create-job).

The job bookmarks feature has additional functionalities when accessed through AWS Glue scripts. When browsing your generated script, you may see transformation contexts, which are related to this feature. For more information, see [Using job bookmarks](programming-etl-connect-bookmarks.md).

**Topics**
+ [Using job bookmarks in AWS Glue](#monitor-continuations-implement)
+ [Operational details of the job bookmarks feature](#monitor-continuations-script)

## Using job bookmarks in AWS Glue
<a name="monitor-continuations-implement"></a>

The job bookmark option is passed as a parameter when the job is started. The following table describes the options for setting job bookmarks on the AWS Glue console.



| Job bookmark | Description | 
| --- | --- | 
| Enable | Causes the job to update the state after a run to keep track of previously processed data. If your job has a source with job bookmark support, it will keep track of processed data, and when a job runs, it processes new data since the last checkpoint. | 
| Disable | Job bookmarks are not used, and the job always processes the entire dataset. You are responsible for managing the output from previous job runs. This is the default. | 
| Pause |  Process incremental data since the last successful run, or the data in the range identified by the following sub-options, without updating the state of the last bookmark. You are responsible for managing the output from previous job runs. The two sub-options are: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html) The job bookmark state is not updated when this option is specified. The sub-options are optional; however, when used, both sub-options must be provided.  | 

For details about the parameters passed to a job on the command line, and specifically for job bookmarks, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

For Amazon S3 input sources, AWS Glue job bookmarks check the last modified time of the objects to verify which objects need to be reprocessed. If your input source data has been modified since your last job run, the files are reprocessed when you run the job again.

For JDBC sources, the following rules apply:
+ For each table, AWS Glue uses one or more columns as bookmark keys to determine new and processed data. The bookmark keys combine to form a single compound key.
+ AWS Glue by default uses the primary key as the bookmark key, provided that it is sequentially increasing or decreasing (with no gaps).
+ You can specify the columns to use as bookmark keys in your AWS Glue script. For more information about using Job bookmarks in AWS Glue scripts, see [Using job bookmarks](programming-etl-connect-bookmarks.md).
+ AWS Glue doesn't support using columns with case-sensitive names as job bookmark keys.
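A sketch of overriding the bookmark key for a JDBC source follows. The database, table, and column names are hypothetical; `jobBookmarkKeys` and `jobBookmarkKeysSortOrder` are the documented additional options for this, and the Glue call itself is shown as a comment because it requires the `awsglue` runtime.

```python
# Choose a column that is sequentially increasing with no gaps and that
# does not have a case-sensitive name.
additional_options = {
    "jobBookmarkKeys": ["order_id"],
    "jobBookmarkKeysSortOrder": "asc",
}

# In a Glue script:
# frame = glueContext.create_dynamic_frame.from_catalog(
#     database="sales_db",
#     table_name="orders",
#     transformation_ctx="orders_src",
#     additional_options=additional_options,
# )
print(additional_options["jobBookmarkKeys"])
```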

You can rewind your job bookmarks for your AWS Glue Spark ETL jobs to any previous job run. You can support data backfilling scenarios better by rewinding your job bookmarks to any previous job run, resulting in the subsequent job run reprocessing data only from the bookmarked job run.

If you intend to reprocess all the data using the same job, reset the job bookmark. To reset the job bookmark state, use the AWS Glue console, the [ResetJobBookmark action (Python: reset_job_bookmark)](aws-glue-api-jobs-runs.md#aws-glue-api-jobs-runs-ResetJobBookmark) API operation, or the AWS CLI. For example, enter the following command using the AWS CLI:

```
aws glue reset-job-bookmark --job-name my-job-name
```

When you rewind or reset a bookmark, AWS Glue does not clean the target files because there could be multiple targets and targets are not tracked with job bookmarks. Only source files are tracked with job bookmarks. You can create different output targets when rewinding and reprocessing the source files to avoid duplicate data in your output.

AWS Glue keeps track of job bookmarks by job. If you delete a job, the job bookmark is deleted.

In some cases, you might have enabled AWS Glue job bookmarks but your ETL job is reprocessing data that was already processed in an earlier run. For information about resolving common causes of this error, see [Troubleshooting Glue common setup errors](glue-troubleshooting-errors.md).

## Operational details of the job bookmarks feature
<a name="monitor-continuations-script"></a>

This section describes more of the operational details of using job bookmarks.

Job bookmarks store the states for a job. Each instance of the state is keyed by a job name and a version number. When a script invokes `job.init`, it retrieves its state and always gets the latest version. Within a state, there are multiple state elements, which are specific to each source, transformation, and sink instance in the script. These state elements are identified by a transformation context that is attached to the corresponding element (source, transformation, or sink) in the script. The state elements are saved atomically when `job.commit` is invoked from the user script. The script gets the job name and the control option for the job bookmarks from the arguments.

The state elements in the job bookmark are source, transformation, or sink-specific data. For example, suppose that you want to read incremental data from an Amazon S3 location that is being constantly written to by an upstream job or process. In this case, the script must determine what has been processed so far. The job bookmark implementation for the Amazon S3 source saves information so that when the job runs again, it can filter only the new objects using the saved information and recompute the state for the next run of the job. A timestamp is used to filter the new files.

In addition to the state elements, job bookmarks have a *run number*, an *attempt number*, and a *version number*. The run number tracks the run of the job, and the attempt number records the attempts for a job run. The job run number is a monotonically increasing number that is incremented for every successful run. The attempt number tracks the attempts for each run, and is only incremented when there is a run after a failed attempt. The version number increases monotonically and tracks the updates to a job bookmark.

In the AWS Glue service database, the bookmark states for all the transformations are stored together as key-value pairs:

```
{
  "job_name" : ...,
  "run_id": ...,
  "run_number": ...,
  "attempt_number": ...,
  "states": {
    "transformation_ctx1" : {
      bookmark_state1
    },
    "transformation_ctx2" : {
      bookmark_state2
    }
  }
}
```

**Best practices**  
The following are best practices for using job bookmarks.
+ *Do not change the data source property with the bookmark enabled*. For example, there is a datasource0 pointing to an Amazon S3 input path A, and the job has been reading from a source which has been running for several rounds with the bookmark enabled. If you change the input path of datasource0 to Amazon S3 path B without changing the `transformation_ctx`, the AWS Glue job will use the old bookmark state stored. That will result in missing or skipping files in the input path B as AWS Glue would assume that those files had been processed in previous runs. 
+ *Use a catalog table with bookmarks for better partition management*. Bookmarks work for data sources both from the Data Catalog and from options. However, it's difficult to remove or add new partitions with the from-options approach. Using a catalog table with crawlers can provide better automation to track the newly added [partitions](https://docs.aws.amazon.com/glue/latest/dg/tables-described.html#tables-partition) and give you the flexibility to select particular partitions with a [pushdown predicate](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html).
+ *Use the [AWS Glue Amazon S3 file lister](https://aws.amazon.com/premiumsupport/knowledge-center/glue-oom-java-heap-space-error/) for large datasets*. A bookmark lists all files under each input partition and does the filtering, so if there are too many files under a single partition the bookmark can run into a driver out-of-memory (OOM) error. Use the AWS Glue Amazon S3 file lister to avoid listing all files in memory at once.

# Storing Spark shuffle data
<a name="monitor-spark-shuffle-manager"></a>

Shuffling is an important step in a Spark job whenever data is rearranged between partitions. This is required because wide transformations such as `join`, `groupByKey`, `reduceByKey`, and `repartition` require information from other partitions to complete processing. Spark gathers the required data from each partition and combines it into a new partition. During a shuffle, data is written to disk and transferred across the network. As a result, the shuffle operation is bound by local disk capacity. Spark throws a `No space left on device` or `MetadataFetchFailedException` error when there is not enough disk space left on the executor and there is no recovery.

**Note**  
The AWS Glue Spark shuffle plugin with Amazon S3 is supported only for AWS Glue ETL jobs. 

**Solution**  
With AWS Glue, you can now use Amazon S3 to store Spark shuffle data. Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. This solution disaggregates compute and storage for your Spark jobs, and gives complete elasticity and low-cost shuffle storage, allowing you to run your most shuffle-intensive workloads reliably.

![\[Spark workflow showing Map and Reduce stages using Amazon S3 for shuffle data storage.\]](http://docs.aws.amazon.com/glue/latest/dg/images/gs-s3-shuffle-diagram.png)


We are introducing a new Cloud Shuffle Storage Plugin for Apache Spark to use Amazon S3. You can turn on Amazon S3 shuffling to run your AWS Glue jobs reliably without failures if they are known to be bound by the local disk capacity for large shuffle operations. In some cases, shuffling to Amazon S3 is marginally slower than local disk (or EBS) if you have a large number of small partitions or shuffle files written out to Amazon S3.

## Prerequisites for using Cloud Shuffle Storage Plugin
<a name="monitor-spark-shuffle-manager-prereqs"></a>

To use the Cloud Shuffle Storage Plugin with AWS Glue ETL jobs, you need the following: 
+ An Amazon S3 bucket located in the same region as your job run, for storing the intermediate shuffle and spilled data. The Amazon S3 prefix of shuffle storage can be specified with `--conf spark.shuffle.glue.s3ShuffleBucket=s3://shuffle-bucket/prefix/`, as in the following example:

  ```
  --conf spark.shuffle.glue.s3ShuffleBucket=s3://glue-shuffle-123456789-us-east-1/glue-shuffle-data/
  ```
+ Set Amazon S3 storage lifecycle policies on the *prefix* (such as `glue-shuffle-data`), because the shuffle manager does not clean up the files after the job is done. The intermediate shuffle and spilled data should be deleted after a job finishes, so you can set a short lifecycle policy on the prefix. Instructions for setting up an Amazon S3 lifecycle policy are available at [Setting lifecycle configuration on a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html) in the *Amazon Simple Storage Service User Guide*.
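A hedged sketch of such a lifecycle rule follows, expressed as the request payload for the S3 `PutBucketLifecycleConfiguration` API. The bucket and prefix names are placeholders, and the boto3 call itself is commented out because it requires AWS credentials.

```python
# Expire shuffle objects under the prefix one day after creation.
lifecycle = {
    "Rules": [
        {
            "ID": "expire-glue-shuffle-data",
            "Filter": {"Prefix": "glue-shuffle-data/"},
            "Status": "Enabled",
            "Expiration": {"Days": 1},
        }
    ]
}

# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="glue-shuffle-123456789-us-east-1",
#     LifecycleConfiguration=lifecycle,
# )
```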

## Using AWS Glue Spark shuffle manager from the AWS console
<a name="monitor-spark-shuffle-manager-using-console"></a>

To set up the AWS Glue Spark shuffle manager using the AWS Glue console or AWS Glue Studio when configuring a job, choose the **--write-shuffle-files-to-s3** job parameter to turn on Amazon S3 shuffling for the job.

![\[Job parameters interface showing --write-shuffle-files- parameter and option to add more.\]](http://docs.aws.amazon.com/glue/latest/dg/images/gs-s3-shuffle.png)


## Using AWS Glue Spark shuffle plugin
<a name="monitor-spark-shuffle-manager-using"></a>

The following job parameters turn on and tune the AWS Glue shuffle manager. These parameters are flags, so any values provided are not considered.
+ `--write-shuffle-files-to-s3` — The main flag, which enables the AWS Glue Spark shuffle manager to use Amazon S3 buckets for writing and reading shuffle data. When the flag is not specified, the shuffle manager is not used.
+ `--write-shuffle-spills-to-s3` — (Supported only on AWS Glue version 2.0). An optional flag that allows you to offload spill files to Amazon S3 buckets, which provides additional resiliency to your Spark job. This is only required for large workloads that spill a lot of data to disk. When the flag is not specified, no intermediate spill files are written.
+ `--conf spark.shuffle.glue.s3ShuffleBucket=s3://<shuffle-bucket>` — Another optional flag that specifies the Amazon S3 bucket where you write the shuffle files. The default is `--TempDir`/shuffle-data. AWS Glue 3.0 and later support writing shuffle files to multiple buckets by specifying buckets with a comma delimiter, as in `--conf spark.shuffle.glue.s3ShuffleBucket=s3://shuffle-bucket-1/prefix,s3://shuffle-bucket-2/prefix/`. Using multiple buckets improves performance. 
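As an example of the parameters above, the following sketch shows how they might be assembled into a job's default arguments. The bucket name is a placeholder; because the parameters are treated as flags, the `"true"` value is a convention rather than a requirement.

```python
# Job parameters that turn on S3 shuffling for a Glue job.
default_arguments = {
    "--write-shuffle-files-to-s3": "true",
    "--conf": "spark.shuffle.glue.s3ShuffleBucket=s3://my-shuffle-bucket/prefix/",
}
print(default_arguments["--write-shuffle-files-to-s3"])
```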

You need to provide security configuration settings to enable encryption at rest for the shuffle data. For more information about security configurations, see [Setting up encryption in AWS Glue](set-up-encryption.md). AWS Glue supports all other shuffle-related configurations provided by Spark.

**Software binaries for the Cloud Shuffle Storage plugin**  
You can also download the software binaries of the Cloud Shuffle Storage Plugin for Apache Spark under the Apache 2.0 license and run it on any Spark environment. The new plugin comes with out-of-the box support for Amazon S3, and can also be easily configured to use other forms of cloud storage such as [Google Cloud Storage and Microsoft Azure Blob Storage](https://github.com/aws-samples/aws-glue-samples/blob/master/docs/cloud-shuffle-plugin/README.md). For more information, see [Cloud Shuffle Storage Plugin for Apache Spark](https://docs.aws.amazon.com/glue/latest/dg/cloud-shuffle-storage-plugin.html).

**Notes and limitations**  
The following are notes or limitations for the AWS Glue shuffle manager:
+  AWS Glue shuffle manager doesn't automatically delete the (temporary) shuffle data files stored in your Amazon S3 bucket after a job is completed. To ensure data protection, follow the instructions in [Prerequisites for using Cloud Shuffle Storage Plugin](#monitor-spark-shuffle-manager-prereqs) before enabling the Cloud Shuffle Storage Plugin. 
+ You can use this feature if your data is skewed.

# Cloud Shuffle Storage Plugin for Apache Spark
<a name="cloud-shuffle-storage-plugin"></a>

The Cloud Shuffle Storage Plugin is an Apache Spark plugin compatible with the [`ShuffleDataIO` API](https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/shuffle/api/ShuffleDataIO.java) which allows storing shuffle data on cloud storage systems (such as Amazon S3). It helps you to supplement or replace local disk storage capacity for large shuffle operations, commonly triggered by transformations such as `join`, `reduceByKey`, `groupByKey` and `repartition` in your Spark applications, thereby reducing common failures or price/performance dislocation of your serverless data analytics jobs and pipelines.

**AWS Glue**  
AWS Glue versions 3.0 and 4.0 come with the plugin pre-installed and ready to enable shuffling to Amazon S3 without any extra steps. For more information, see [AWS Glue Spark shuffle plugin with Amazon S3](https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-shuffle-manager.html) to enable the feature for your Spark applications.

**Other Spark environments**  
The plugin requires the following Spark configurations to be set on other Spark environments:
+ `--conf spark.shuffle.sort.io.plugin.class=com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin`: This informs Spark to use this plugin for Shuffle IO.
+ `--conf spark.shuffle.storage.path=s3://bucket-name/shuffle-file-dir`: The path where your shuffle files will be stored.

**Note**  
The plugin overwrites one Spark core class. As a result, the plugin jar needs to be loaded before Spark jars. You can do this using `userClassPathFirst` in on-prem YARN environments if the plugin is used outside AWS Glue.

## Bundling the plugin with your Spark applications
<a name="cloud-shuffle-storage-plugin-bundling"></a>

You can bundle the plugin with your Spark applications and Spark distributions (versions 3.1 and above) by adding the plugin dependency in your Maven `pom.xml` while developing your Spark applications locally. For more information on the plugin and Spark versions, see [Plugin versions](#cloud-shuffle-storage-plugin-versions).

```
<repositories>
   ...
    <repository>
        <id>aws-glue-etl-artifacts</id>
        <url>https://aws-glue-etl-artifacts.s3.amazonaws.com/release/</url>
    </repository>
</repositories>
...
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>chopper-plugin</artifactId>
    <version>3.1-amzn-LATEST</version>
</dependency>
```

You can alternatively download the binaries from AWS Glue Maven artifacts directly and include them in your Spark application as follows.

```
#!/bin/bash
sudo wget -v https://aws-glue-etl-artifacts.s3.amazonaws.com/release/com/amazonaws/chopper-plugin/3.1-amzn-LATEST/chopper-plugin-3.1-amzn-LATEST.jar -P /usr/lib/spark/jars/
```

Example spark-submit

```
spark-submit --deploy-mode cluster \
--conf spark.shuffle.storage.s3.path=s3://<ShuffleBucket>/<shuffle-dir> \
--conf spark.driver.extraClassPath=<Path to plugin jar> \
--conf spark.executor.extraClassPath=<Path to plugin jar> \
--class <your test class name> s3://<ShuffleBucket>/<Your application jar>

## Optional configurations
<a name="cloud-shuffle-storage-plugin-optional"></a>

These are optional configuration values that control Amazon S3 shuffle behavior. 
+ `spark.shuffle.storage.s3.enableServerSideEncryption`: Enable/disable S3 SSE for shuffle and spill files. Default value is `true`.
+ `spark.shuffle.storage.s3.serverSideEncryption.algorithm`: The SSE algorithm to be used. Default value is `AES256`.
+ `spark.shuffle.storage.s3.serverSideEncryption.kms.key`: The KMS key ARN when SSE aws:kms is enabled.

Along with these configurations, you may need to set configurations such as `spark.hadoop.fs.s3.enableServerSideEncryption` and **other environment-specific configurations** to ensure appropriate encryption is applied for your use case.

## Plugin versions
<a name="cloud-shuffle-storage-plugin-versions"></a>

This plugin is supported for the Spark versions associated with each AWS Glue version. The following table shows the AWS Glue version, Spark version and associated plugin version with Amazon S3 location for the plugin's software binary.


| AWS Glue version | Spark version | Plugin version | Amazon S3 location | 
| --- | --- | --- | --- | 
| 3.0 | 3.1 | 3.1-amzn-LATEST |  s3://aws-glue-etl-artifacts/release/com/amazonaws/chopper-plugin/3.1-amzn-0/chopper-plugin-3.1-amzn-LATEST.jar  | 
| 4.0 | 3.3 | 3.3-amzn-LATEST |  s3://aws-glue-etl-artifacts/release/com/amazonaws/chopper-plugin/3.3-amzn-0/chopper-plugin-3.3-amzn-LATEST.jar  | 

## License
<a name="cloud-shuffle-storage-plugin-binary-license"></a>

The software binary for this plugin is licensed under the Apache-2.0 License.

# Monitoring AWS Glue Spark jobs
<a name="monitor-spark"></a>

**Topics**
+ [Spark Metrics available in AWS Glue Studio](#console-jobs-details-metrics-spark)
+ [Monitoring jobs using the Apache Spark web UI](monitor-spark-ui.md)
+ [Monitoring with AWS Glue job run insights](monitor-job-insights.md)
+ [Monitoring with Amazon CloudWatch](monitor-cloudwatch.md)
+ [Job monitoring and debugging](monitor-profile-glue-job-cloudwatch-metrics.md)

## Spark Metrics available in AWS Glue Studio
<a name="console-jobs-details-metrics-spark"></a>

The **Metrics** tab shows metrics collected when a job runs and profiling is turned on. The following graphs are shown for Spark jobs: 
+ ETL Data Movement
+ Memory Profile: Driver and Executors

Choose **View additional metrics** to show the following graphs:
+ ETL Data Movement
+ Memory Profile: Driver and Executors
+ Data Shuffle Across Executors
+ CPU Load: Driver and Executors
+ Job Execution: Active Executors, Completed Stages & Maximum Needed Executors

Data for these graphs is pushed to CloudWatch metrics if the job is configured to collect metrics. For more information about how to turn on metrics and interpret the graphs, see [Job monitoring and debugging](monitor-profile-glue-job-cloudwatch-metrics.md). 
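
As an illustration, a request for the ETL Data Movement read metric could be built for the CloudWatch `GetMetricStatistics` API as follows. The `Glue` namespace and the `JobName`/`JobRunId`/`Type` dimensions are assumptions about how AWS Glue publishes job metrics; verify them against the metrics emitted by your own job runs.

```
from datetime import datetime, timedelta, timezone

# Sketch of a CloudWatch query for one of the graphed metrics. The "Glue"
# namespace and the JobName/JobRunId/Type dimensions are assumptions; check
# them against the metrics your own job runs publish.
def data_movement_query(job_name, hours=3):
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "Glue",
        "MetricName": "glue.ALL.s3.filesystem.read_bytes",
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": "ALL"},
            {"Name": "Type", "Value": "gauge"},
        ],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 300,          # 5-minute datapoints
        "Statistics": ["Sum"],
    }

params = data_movement_query("my-etl-job")
# With boto3 installed and credentials configured, you would run:
# boto3.client("cloudwatch").get_metric_statistics(**params)
print(params["MetricName"])
```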

**Example ETL data movement graph**  
The ETL Data Movement graph shows the following metrics:  
+ The number of bytes read from Amazon S3 by all executors—[`glue.ALL.s3.filesystem.read_bytes`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.read_bytes)
+ The number of bytes written to Amazon S3 by all executors—[`glue.ALL.s3.filesystem.write_bytes`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.write_bytes)

![\[The graph for ETL Data Movement in the Metrics tab of the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job_detailed_etl.png)


**Example Memory profile graph**  
The Memory Profile graph shows the following metrics:  
+ The fraction of JVM heap memory used (scale: 0–1) by the driver, an executor identified by *executorId*, or all executors—
  + [`glue.driver.jvm.heap.usage`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.jvm.heap.usage)
  + [`glue.executorId.jvm.heap.usage`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.executorId.jvm.heap.usage)
  + [`glue.ALL.jvm.heap.usage`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.jvm.heap.usage)

![\[The graph for Memory Profile in the Metrics tab of the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job_detailed_mem.png)


**Example Data shuffle across executors graph**  
The Data Shuffle Across Executors graph shows the following metrics:  
+ The number of bytes read by all executors to shuffle data between them—[`glue.driver.aggregate.shuffleLocalBytesRead`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.shuffleLocalBytesRead)
+ The number of bytes written by all executors to shuffle data between them—[`glue.driver.aggregate.shuffleBytesWritten`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.shuffleBytesWritten)

![\[The graph for Data Shuffle Across Executors in the Metrics tab of the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job_detailed_data.png)


**Example CPU load graph**  
The CPU Load graph shows the following metrics:  
+ The fraction of CPU system load used (scale: 0–1) by the driver, an executor identified by *executorId*, or all executors—
  + [`glue.driver.system.cpuSystemLoad`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.system.cpuSystemLoad)
  + [`glue.executorId.system.cpuSystemLoad`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.executorId.system.cpuSystemLoad)
  + [`glue.ALL.system.cpuSystemLoad`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.system.cpuSystemLoad)

![\[The graph for CPU Load in the Metrics tab of the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job_detailed_cpu.png)


**Example Job execution graph**  
The Job Execution graph shows the following metrics:  
+ The number of actively running executors—[`glue.driver.ExecutorAllocationManager.executors.numberAllExecutors`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.ExecutorAllocationManager.executors.numberAllExecutors)
+ The number of completed stages—[`glue.driver.aggregate.numCompletedStages`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.numCompletedStages)
+ The number of maximum needed executors—[`glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors`](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors)

![\[The graph for Job Execution in the Metrics tab of the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job_detailed_exec.png)


# Monitoring jobs using the Apache Spark web UI
<a name="monitor-spark-ui"></a>

You can use the Apache Spark web UI to monitor and debug AWS Glue ETL jobs running on the AWS Glue job system, and also Spark applications running on AWS Glue development endpoints. The Spark UI enables you to check the following for each job:
+ The event timeline of each Spark stage
+ A directed acyclic graph (DAG) of the job
+ Physical and logical plans for SparkSQL queries
+ The underlying Spark environmental variables for each job

For more information about using the Spark Web UI, see [Web UI](https://spark.apache.org/docs/3.3.0/web-ui.html) in the Spark documentation. For guidance on how to interpret Spark UI results to improve the performance of your job, see [Best practices for performance tuning AWS Glue for Apache Spark jobs](https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-aws-glue-for-apache-spark/introduction.html) in AWS Prescriptive Guidance.

 You can see the Spark UI in the AWS Glue console. This is available when an AWS Glue job runs on AWS Glue 3.0 or later versions with logs generated in the Standard (rather than legacy) format, which is the default for newer jobs. If you have log files greater than 0.5 GB, you can enable rolling log support for job runs on AWS Glue 4.0 or later versions to simplify log archiving, analysis, and troubleshooting.

You can enable the Spark UI by using the AWS Glue console or the AWS Command Line Interface (AWS CLI). When you enable the Spark UI, AWS Glue ETL jobs and Spark applications on AWS Glue development endpoints can back up Spark event logs to a location that you specify in Amazon Simple Storage Service (Amazon S3). You can use the backed-up event logs in Amazon S3 with the Spark UI, both in real time as the job is running and after the job is complete. As long as the logs remain in Amazon S3, the Spark UI in the AWS Glue console can display them. 

## Permissions
<a name="monitor-spark-ui-limitations-permissions"></a>

 To use the Spark UI in the AWS Glue console, you can use `UseGlueStudio` or add all the individual service APIs. All of the APIs are needed to use the Spark UI fully; however, for fine-grained access, users can enable specific Spark UI features by adding the corresponding service APIs to their IAM permissions. 

 `RequestLogParsing` is the most critical as it performs the parsing of logs. The remaining APIs are for reading the respective parsed data. For example, `GetStages` provides access to the data about all stages of a Spark job. 

 The list of Spark UI service APIs mapped to `UseGlueStudio` is shown in the following sample policy, which provides access to Spark UI features only. When using a Spark UI service API, use the namespace `glue:<ServiceAPI>`. To add more permissions, such as Amazon S3 and IAM, see [ Creating Custom IAM Policies for AWS Glue Studio. ](https://docs.aws.amazon.com/glue/latest/dg/getting-started-min-privs.html#getting-started-all-gs-privs.html) 

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowGlueStudioSparkUI",
      "Effect": "Allow",
      "Action": [
        "glue:RequestLogParsing",
        "glue:GetLogParsingStatus",
        "glue:GetEnvironment",
        "glue:GetJobs",
        "glue:GetJob",
        "glue:GetStage",
        "glue:GetStages",
        "glue:GetStageFiles",
        "glue:BatchGetStageFiles",
        "glue:GetStageAttempt",
        "glue:GetStageAttemptTaskList",
        "glue:GetStageAttemptTaskSummary",
        "glue:GetExecutors",
        "glue:GetExecutorsThreads",
        "glue:GetStorage",
        "glue:GetStorageUnit",
        "glue:GetQueries",
        "glue:GetQuery"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```

------

## Limitations
<a name="monitor-spark-ui-limitations"></a>
+ Spark UI in the AWS Glue console is not available for job runs that occurred before 20 Nov 2023 because they are in the legacy log format.
+  Spark UI in the AWS Glue console supports rolling logs for AWS Glue 4.0, such as those generated by default in streaming jobs. The maximum sum of all generated rolled log event files is 2 GB. For AWS Glue jobs without rolled log support, the maximum log event file size supported for SparkUI is 0.5 GB. 
+  Serverless Spark UI is not available for Spark event logs stored in an Amazon S3 bucket that can only be accessed by your VPC. 

## Example: Apache Spark web UI
<a name="monitor-spark-ui-limitations-example"></a>

This example shows you how to use the Spark UI to understand your job performance. Screenshots show the Spark web UI as provided by a self-managed Spark history server. Spark UI in the AWS Glue console provides similar views. For more information about using the Spark web UI, see [Web UI](https://spark.apache.org/docs/3.3.0/web-ui.html) in the Spark documentation.

The following is an example of a Spark application that reads from two data sources, performs a join transform, and writes the result out to Amazon S3 in Parquet format.

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
 
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
 
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
 
job = Job(glueContext)
job.init(args['JOB_NAME'])
 
df_persons = spark.read.json("s3://awsglue-datasets/examples/us-legislators/all/persons.json")
df_memberships = spark.read.json("s3://awsglue-datasets/examples/us-legislators/all/memberships.json")
 
df_joined = df_persons.join(df_memberships, df_persons.id == df_memberships.person_id, 'fullouter')
df_joined.write.parquet("s3://aws-glue-demo-sparkui/output/")
 
job.commit()
```

The following DAG visualization shows the different stages in this Spark job.

![\[Screenshot of Spark UI showing 2 completed stages for job 0.\]](http://docs.aws.amazon.com/glue/latest/dg/images/spark-ui1.png)


The following event timeline for a job shows the start, execution, and termination of different Spark executors.

![\[Screenshot of Spark UI showing the completed, failed, and active stages of different Spark executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/spark-ui2.png)


The following screen shows the details of the SparkSQL query plans:
+ Parsed logical plan
+ Analyzed logical plan
+ Optimized logical plan
+ Physical plan for execution

![\[SparkSQL query plans: parsed, analyzed, and optimized logical plan and physical plans for execution.\]](http://docs.aws.amazon.com/glue/latest/dg/images/spark-ui3.png)


**Topics**
+ [Permissions](#monitor-spark-ui-limitations-permissions)
+ [Limitations](#monitor-spark-ui-limitations)
+ [Example: Apache Spark web UI](#monitor-spark-ui-limitations-example)
+ [Enabling the Apache Spark web UI for AWS Glue jobs](monitor-spark-ui-jobs.md)
+ [Launching the Spark history server](monitor-spark-ui-history.md)

# Enabling the Apache Spark web UI for AWS Glue jobs
<a name="monitor-spark-ui-jobs"></a>

You can use the Apache Spark web UI to monitor and debug AWS Glue ETL jobs running on the AWS Glue job system. You can configure the Spark UI using the AWS Glue console or the AWS Command Line Interface (AWS CLI).

Every 30 seconds, AWS Glue backs up the Spark event logs to the Amazon S3 path that you specify.

**Topics**
+ [Configuring the Spark UI (console)](#monitor-spark-ui-jobs-console)
+ [Configuring the Spark UI (AWS CLI)](#monitor-spark-ui-jobs-cli)
+ [Configuring the Spark UI for sessions using Notebooks](#monitor-spark-ui-sessions)
+ [Enable rolling logs](#monitor-spark-ui-rolling-logs)

## Configuring the Spark UI (console)
<a name="monitor-spark-ui-jobs-console"></a>

Follow these steps to configure the Spark UI by using the AWS Management Console. When creating an AWS Glue job, Spark UI is enabled by default.

**To turn on the Spark UI when you create or edit a job**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Jobs**.

1. Choose **Add job**, or select an existing one.

1. In **Job details**, open the **Advanced properties**.

1. Under the **Spark UI** tab, choose **Write Spark UI logs to Amazon S3**.

1. Specify an Amazon S3 path for storing the Spark event logs for the job. Note that if you use a security configuration in the job, the encryption also applies to the Spark UI log file. For more information, see [Encrypting data written by AWS Glue](encryption-security-configuration.md).

1. Under **Spark UI logging and monitoring configuration**:
   + Select **Standard** if you are generating logs to view in the AWS Glue console.
   + Select **Legacy** if you are generating logs to view on a Spark history server.
   + You can also choose to generate both.

## Configuring the Spark UI (AWS CLI)
<a name="monitor-spark-ui-jobs-cli"></a>

To generate logs for viewing with the Spark UI in the AWS Glue console, use the AWS CLI to pass the following job parameters to AWS Glue jobs. For more information, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

```
'--enable-spark-ui': 'true',
'--spark-event-logs-path': 's3://s3-event-log-path'
```

To generate logs in the legacy location as well, set the `--enable-spark-ui-legacy-path` parameter to `"true"`. If you do not want to generate logs in both formats, remove the `--enable-spark-ui-legacy-path` parameter.
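
The same two parameters can also be supplied through the AWS SDK when you create or update the job. The following sketch builds the `DefaultArguments` map; the job name, role ARN, and script location shown in the comment are placeholders:

```
# Sketch: the same two job parameters supplied via the AWS SDK. The job name,
# role ARN, and script location in the comment below are placeholders.
def spark_ui_arguments(event_logs_path):
    return {
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": event_logs_path,
    }

default_args = spark_ui_arguments("s3://s3-event-log-path")

# With boto3 installed and credentials configured, you would pass the map as
# DefaultArguments when creating the job:
# boto3.client("glue").create_job(
#     Name="my-job",
#     Role="arn:aws:iam::111122223333:role/GlueJobRole",
#     Command={"Name": "glueetl", "ScriptLocation": "s3://my-bucket/script.py"},
#     DefaultArguments=default_args,
# )
print(default_args)
```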

## Configuring the Spark UI for sessions using Notebooks
<a name="monitor-spark-ui-sessions"></a>

**Warning**  
AWS Glue interactive sessions do not currently support Spark UI in the console. Configure a Spark history server.

 If you use AWS Glue notebooks, set up SparkUI config before starting the session. To do this, use the `%%configure` cell magic: 

```
%%configure
{ "--enable-spark-ui": "true", "--spark-event-logs-path": "s3://path" }
```

## Enable rolling logs
<a name="monitor-spark-ui-rolling-logs"></a>

 Enabling SparkUI and rolling log event files for AWS Glue jobs provides several benefits: 
+  Rolling Log Event Files – With rolling log event files enabled, AWS Glue generates separate log files for each step of the job execution, making it easier to identify and troubleshoot issues specific to a particular stage or transformation. 
+  Better Log Management – Rolling log event files help in managing log files more efficiently. Instead of having a single, potentially large log file, the logs are split into smaller, more manageable files based on the job execution stages. This can simplify log archiving, analysis, and troubleshooting. 
+  Improved Fault Tolerance – If a AWS Glue job fails or is interrupted, the rolling log event files can provide valuable information about the last successful stage, making it easier to resume the job from that point rather than starting from scratch. 
+  Cost Optimization – By enabling rolling log event files, you can save on storage costs associated with log files. Instead of storing a single, potentially large log file, you store smaller, more manageable log files, which can be more cost-effective, especially for long-running or complex jobs. 

 In a new environment, users can explicitly enable rolling logs through: 

```
'--conf': 'spark.eventLog.rolling.enabled=true'
```

or

```
'--conf': 'spark.eventLog.rolling.enabled=true --conf spark.eventLog.rolling.maxFileSize=128m'
```

 When rolling logs are enabled, `spark.eventLog.rolling.maxFileSize` specifies the maximum size of an event log file before it rolls over. This optional parameter defaults to 128 MB; the minimum is 10 MB. 

 The maximum sum of all generated rolled log event files is 2 GB. For AWS Glue jobs without rolling log support, the maximum log event file size supported for SparkUI is 0.5 GB. 
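
As a quick back-of-the-envelope check, the 2 GB cap and the `maxFileSize` setting together bound how many rolled files a run can produce:

```
# Back-of-the-envelope check (illustrative): how many rolled event log files
# fit under the 2 GB total cap for a given spark.eventLog.rolling.maxFileSize.
def max_rolled_files(max_file_size_mb=128, total_cap_mb=2048):
    if max_file_size_mb < 10:
        raise ValueError("spark.eventLog.rolling.maxFileSize minimum is 10 MB")
    return total_cap_mb // max_file_size_mb

print(max_rolled_files())  # default 128 MB files -> 16 files under the 2 GB cap
```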

You can turn off rolling logs for a streaming job by passing an additional configuration. Note that very large log files may be costly to maintain.

To turn off rolling logs, provide the following configuration:

```
'--enable-spark-ui': 'true',
'--spark-event-logs-path': 's3://s3-event-log-path',
'--conf': 'spark.eventLog.rolling.enabled=false'
```

# Launching the Spark history server
<a name="monitor-spark-ui-history"></a>

You can use a Spark history server to visualize Spark logs on your own infrastructure. You can see the same visualizations in the AWS Glue console for AWS Glue job runs on AWS Glue 4.0 or later versions with logs generated in the Standard (rather than legacy) format. For more information, see [Monitoring jobs using the Apache Spark web UI](monitor-spark-ui.md).

You can launch the Spark history server using an AWS CloudFormation template that hosts the server on an EC2 instance, or launch it locally using Docker.

**Topics**
+ [Launching the Spark history server and viewing the Spark UI using AWS CloudFormation](#monitor-spark-ui-history-cfn)
+ [Launching the Spark history server and viewing the Spark UI using Docker](#monitor-spark-ui-history-local)

## Launching the Spark history server and viewing the Spark UI using AWS CloudFormation
<a name="monitor-spark-ui-history-cfn"></a>

You can use an AWS CloudFormation template to start the Apache Spark history server and view the Spark web UI. These templates are samples that you should modify to meet your requirements.

**To start the Spark history server and view the Spark UI using CloudFormation**

1. Choose one of the **Launch Stack** buttons in the following table. This launches the stack on the CloudFormation console.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html)

1. On the **Specify template** page, choose **Next**.

1. On the **Specify stack details** page, enter the **Stack name**. Enter additional information under **Parameters**.

   1. 

**Spark UI configuration**

      Provide the following information:
      + **IP address range** — The IP address range that can be used to view the Spark UI. If you want to restrict access from a specific IP address range, you should use a custom value. 
      + **History server port** — The port for the Spark UI. You can use the default value.
      + **Event log directory** — Choose the location where Spark event logs are stored from the AWS Glue job or development endpoints. You must use **s3a://** for the event logs path scheme.
      + **Spark package location** — You can use the default value.
      + **Keystore path** — SSL/TLS keystore path for HTTPS. If you want to use a custom keystore file, you can specify the S3 path `s3://path_to_your_keystore_file` here. If you leave this parameter empty, a self-signed certificate based keystore is generated and used.
      + **Keystore password** — Enter a SSL/TLS keystore password for HTTPS.

   1. 

**EC2 instance configuration**

      Provide the following information:
      + **Instance type** — The type of Amazon EC2 instance that hosts the Spark history server. Because this template launches an Amazon EC2 instance in your account, Amazon EC2 costs are billed to your account separately.
      + **Latest AMI ID** — The AMI ID of Amazon Linux 2 for the Spark history server instance. You can use the default value.
      + **VPC ID** — The virtual private cloud (VPC) ID for the Spark history server instance. You can use any of the VPCs available in your account. Using a default VPC with a [default Network ACL](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-network-acls.html#default-network-acl) is not recommended. For more information, see [Default VPC and Default Subnets](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html) and [Creating a VPC](https://docs.aws.amazon.com/vpc/latest/userguide/working-with-vpcs.html#Create-VPC) in the *Amazon VPC User Guide*.
      + **Subnet ID** — The subnet ID for the Spark history server instance. You can use any of the subnets in your VPC, but your client must have network access to the subnet. If you want to access the server over the internet, use a public subnet that has an internet gateway in its route table.

   1. Choose **Next**.

1. On the **Configure stack options** page, choose **Next** to use the current user credentials for determining how CloudFormation can create, modify, or delete resources in the stack. You can also specify a role in the **Permissions** section to use instead of the current user permissions, and then choose **Next**.

1. On the **Review** page, review the template. 

   Select **I acknowledge that CloudFormation might create IAM resources**, and then choose **Create stack**.

1. Wait for the stack to be created.

1. Open the **Outputs** tab.

   1. Copy the URL of **SparkUiPublicUrl** if you are using a public subnet.

   1. Copy the URL of **SparkUiPrivateUrl** if you are using a private subnet.

1. Open a web browser, and paste in the URL. This lets you access the server using HTTPS on the specified port. Your browser may not recognize the server's certificate, in which case you have to override its protection and proceed anyway. 

## Launching the Spark history server and viewing the Spark UI using Docker
<a name="monitor-spark-ui-history-local"></a>

If you prefer local access (rather than hosting the Apache Spark history server on an EC2 instance), you can also use Docker to start the Apache Spark history server and view the Spark UI locally. This Dockerfile is a sample that you should modify to meet your requirements. 

 **Prerequisites** 

For information about how to install Docker on your laptop, see the [Docker Engine community](https://docs.docker.com/install/).

**To start the Spark history server and view the Spark UI locally using Docker**

1. Download files from GitHub.

   Download the Dockerfile and `pom.xml` from [AWS Glue code samples](https://github.com/aws-samples/aws-glue-samples/tree/master/utilities/Spark_UI/).

1. Determine if you want to use your user credentials or federated user credentials to access AWS.
   + To use the current user credentials for accessing AWS, get the values to use for `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` in the `docker run` command. For more information, see [Managing access keys for IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html) in the *IAM User Guide*.
   + To use SAML 2.0 federated users for accessing AWS, get the values for `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`. For more information, see [Requesting temporary security credentials](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html) in the *IAM User Guide*.

1. Determine the location of your event log directory, to use in the `docker run` command.

1. Build the Docker image from the files in the local directory, using the name `glue/sparkui` and the tag `latest`.

   ```
   $ docker build -t glue/sparkui:latest . 
   ```

1. Create and start the docker container.

   In the following commands, use the values obtained previously in steps 2 and 3.

   1. To create the docker container using your user credentials, use a command similar to the following:

      ```
      docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_eventlog
       -Dspark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY_ID -Dspark.hadoop.fs.s3a.secret.key=AWS_SECRET_ACCESS_KEY" \
       -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
      ```

   1. To create the docker container using temporary credentials, use `org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider` as the provider, and provide the credential values obtained in step 2. For more information, see [Using Session Credentials with TemporaryAWSCredentialsProvider](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Using_Session_Credentials_with_TemporaryAWSCredentialsProvider) in the *Hadoop: Integration with Amazon Web Services* documentation.

      ```
      docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_eventlog
       -Dspark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY_ID -Dspark.hadoop.fs.s3a.secret.key=AWS_SECRET_ACCESS_KEY
       -Dspark.hadoop.fs.s3a.session.token=AWS_SESSION_TOKEN
       -Dspark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider" \
       -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
      ```
**Note**  
These configuration parameters come from the [Hadoop-AWS Module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html). You may need to add specific configuration based on your use case. For example, users in isolated regions need to configure `spark.hadoop.fs.s3a.endpoint`.

1. Open `http://localhost:18080` in your browser to view the Spark UI locally.

# Monitoring with AWS Glue job run insights
<a name="monitor-job-insights"></a>

AWS Glue job run insights is a feature in AWS Glue that simplifies job debugging and optimization for your AWS Glue jobs. AWS Glue provides the [Spark UI](https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui.html) and [CloudWatch logs and metrics](https://docs.aws.amazon.com/glue/latest/dg/monitor-cloudwatch.html) for monitoring your AWS Glue jobs. With this feature, you get the following information about your AWS Glue job's execution:
+ Line number of your AWS Glue job script that had a failure.
+ Spark action that executed last in the Spark query plan just before the failure of your job.
+ Spark exception events related to the failure presented in a time-ordered log stream.
+ Root cause analysis and recommended action (such as tuning your script) to fix the issue.
+ Common Spark events (log messages relating to a Spark action) with a recommended action that addresses the root cause.

All these insights are available to you using two new log streams in the CloudWatch logs for your AWS Glue jobs.

## Requirements
<a name="monitor-job-insights-requirements"></a>

The AWS Glue job run insights feature is available for AWS Glue versions 2.0 and above. You can follow the [migration guide](https://docs.aws.amazon.com/glue/latest/dg/migrating-version-30.html) for your existing jobs to upgrade them from older AWS Glue versions.

## Enabling job run insights for an AWS Glue ETL job
<a name="monitor-job-insights-enable"></a>

You can enable job run insights through AWS Glue Studio or the CLI.

### AWS Glue Studio
<a name="monitor-job-insights-enable-studio"></a>

When creating a job via AWS Glue Studio, you can enable or disable job run insights under the **Job Details** tab. Check that the **Generate job insights** box is selected.

![\[Enabling job run insights in AWS Glue Studio.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-job-run-insights-1.png)


### Command line
<a name="monitor-job-insights-enable-cli"></a>

If creating a job via the CLI, you can start a job run with a single new [job parameter](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html): `--enable-job-insights = true`.

By default, the job run insights log streams are created under the same default log group used by [AWS Glue continuous logging](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging.html), that is, `/aws-glue/jobs/logs-v2/`. You can set up a custom log group name, log filters, and log group configurations using the same set of arguments as continuous logging. For more information, see [Enabling Continuous Logging for AWS Glue Jobs](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html).
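
For example, a `DefaultArguments` map that enables job run insights and redirects the streams to a custom log group might look like the following sketch; the log group name is a placeholder:

```
# Illustrative DefaultArguments map; the log group name is a placeholder.
insights_args = {
    "--enable-job-insights": "true",
    # Optional: reuse the continuous-logging arguments to redirect the
    # insight streams away from the default /aws-glue/jobs/logs-v2/ group.
    "--enable-continuous-cloudwatch-log": "true",
    "--continuous-log-logGroup": "/custom/glue-job-insights",
}
print(insights_args)
```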

## Accessing the job run insights log streams in CloudWatch
<a name="monitor-job-insights-access"></a>

With the job run insights feature enabled, two log streams may be created when a job run fails. When a job finishes successfully, neither stream is generated.

1. *Exception analysis log stream*: `<job-run-id>-job-insights-rca-driver`. This stream provides the following:
   + Line number of your AWS Glue job script that caused the failure.
   + Spark action that executed last in the Spark query plan (DAG).
   + Concise time-ordered events from the Spark driver and executors that are related to the exception. You can find details such as complete error messages, the failed Spark task and its executors ID that help you to focus on the specific executor's log stream for a deeper investigation if needed.

1. *Rule-based insights log stream*: `<job-run-id>-job-insights-rule-driver`. This stream provides the following:
   + Root cause analysis and recommendations on how to fix the errors (such as using a specific job parameter to optimize the performance).
   + Relevant Spark events serving as the basis for root cause analysis and a recommended action.

**Note**  
The first stream will exist only if any exception Spark events are available for a failed job run, and the second stream will exist only if any insights are available for the failed job run. For example, if your job finishes successfully, neither of the streams will be generated; if your job fails but there isn't a service-defined rule that can match with your failure scenario, then only the first stream will be generated.

If the job is created from AWS Glue Studio, the links to the above streams are also available under the job run details tab (Job run insights) as "Concise and consolidated error logs" and "Error analysis and guidance".
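
The stream names follow the pattern described above, so you can also locate them programmatically. The following is a minimal sketch, assuming the default continuous-logging log group; the job run ID is a placeholder:

```
# Sketch: derive the two insight stream names for a failed job run and show
# where they would be fetched from. The run ID is a placeholder; the log
# group is the continuous-logging default mentioned earlier.
def insight_stream_names(job_run_id):
    return {
        "exception_analysis": f"{job_run_id}-job-insights-rca-driver",
        "rule_based": f"{job_run_id}-job-insights-rule-driver",
    }

streams = insight_stream_names("jr_0123456789abcdef")
# With boto3 installed and credentials configured:
# boto3.client("logs").get_log_events(
#     logGroupName="/aws-glue/jobs/logs-v2/",
#     logStreamName=streams["exception_analysis"],
# )
print(streams)
```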

![\[The Job Run Details page containing links to the log streams.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-job-run-insights-2.png)


## Example for AWS Glue job run insights
<a name="monitor-job-insights-example"></a>

In this section we present an example of how the job run insights feature can help you resolve an issue in your failed job. In this example, a user forgot to import the required module (tensorflow) in an AWS Glue job to analyze and build a machine learning model on their data.

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.types import *
from pyspark.sql.functions import udf,col

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

data_set_1 = [1, 2, 3, 4]
data_set_2 = [5, 6, 7, 8]

scoresDf = spark.createDataFrame(data_set_1, IntegerType())

def data_multiplier_func(factor, data_vector):
    import tensorflow as tf
    with tf.compat.v1.Session() as sess:
        x1 = tf.constant(factor)
        x2 = tf.constant(data_vector)
        result = tf.multiply(x1, x2)
        return sess.run(result).tolist()

data_multiplier_udf = udf(lambda x:data_multiplier_func(x, data_set_2), ArrayType(IntegerType(),False))
factoredDf = scoresDf.withColumn("final_value", data_multiplier_udf(col("value")))
print(factoredDf.collect())
```

Without the job run insights feature, when the job fails you see only this message thrown by Spark:

`An error occurred while calling o111.collectToPython. Traceback (most recent call last):`

The message is ambiguous and limits your debugging experience. In this case, this feature provides you with additional insights in two CloudWatch log streams:

1. The `job-insights-rca-driver` log stream:
   + *Exception events*: This log stream provides you the Spark exception events related to the failure collected from the Spark driver and different distributed workers. These events help you understand the time-ordered propagation of the exception as faulty code executes across Spark tasks, executors, and stages distributed across the AWS Glue workers.
   + *Line numbers*: This log stream identifies line 21, which made the call to import the missing Python module that caused the failure; it also identifies line 24, the call to Spark Action `collect()`, as the last executed line in your script.  
![\[The job-insights-rca-driver log stream.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-job-run-insights-3.png)

1. The `job-insights-rule-driver` log stream:
   + *Root cause and recommendation*: In addition to the line number and last executed line number for the fault in your script, this log stream shows the root cause analysis and recommendation for you to follow the AWS Glue doc and set up the necessary job parameters in order to use an additional Python module in your AWS Glue job. 
   + *Basis event*: This log stream also shows the Spark exception event that was evaluated with the service-defined rule to infer the root cause and provide a recommendation.  
![\[The job-insights-rule-driver log stream.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-job-run-insights-4.png)
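The recommendation in this example is to supply the missing module through a job parameter. A minimal sketch of the arguments you might pass, using the `--additional-python-modules` job parameter (the tensorflow version pin is a hypothetical illustration):

```python
# Sketch: job arguments that make an extra Python module available to a job.
# "--additional-python-modules" is the Glue job parameter for adding modules;
# the tensorflow version pin below is only an illustrative assumption.
def build_module_args(modules):
    """Build the job-argument dict for extra Python modules."""
    return {"--additional-python-modules": ",".join(modules)}

args = build_module_args(["tensorflow==2.12.0"])
# These arguments would be supplied as DefaultArguments when creating the
# job, or as Arguments when starting a job run (for example, via boto3).
```

With the module argument in place, the `import tensorflow` call inside the UDF in the example script can succeed on the workers.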

# Monitoring with Amazon CloudWatch
<a name="monitor-cloudwatch"></a>

You can monitor AWS Glue using Amazon CloudWatch, which collects and processes raw data from AWS Glue into readable, near-real-time metrics. These statistics are recorded for a period of two weeks so that you can access historical information for a better perspective on how your web application or service is performing. By default, AWS Glue metrics data is sent to CloudWatch automatically. For more information, see [What Is Amazon CloudWatch?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/WhatIsCloudWatch.html) in the *Amazon CloudWatch User Guide*, and [AWS Glue metrics](monitoring-awsglue-with-cloudwatch-metrics.md#awsglue-metrics).

 **Continuous logging** 

AWS Glue also supports real-time continuous logging for AWS Glue jobs. When continuous logging is enabled for a job, you can view the real-time logs on the AWS Glue console or the CloudWatch console dashboard. For more information, see [Logging for AWS Glue jobs](monitor-continuous-logging.md).

 **Observability metrics** 

 When **Job observability metrics** is enabled, additional Amazon CloudWatch metrics are generated when the job is run. Use AWS Glue Observability metrics to generate insights into what is happening inside your AWS Glue jobs to improve triaging and analysis of issues. 

**Topics**
+ [Monitoring AWS Glue using Amazon CloudWatch metrics](monitoring-awsglue-with-cloudwatch-metrics.md)
+ [Setting up Amazon CloudWatch alarms on AWS Glue job profiles](monitor-profile-glue-job-cloudwatch-alarms.md)
+ [Logging for AWS Glue jobs](monitor-continuous-logging.md)
+ [Monitoring with AWS Glue Observability metrics](monitor-observability.md)

# Monitoring AWS Glue using Amazon CloudWatch metrics
<a name="monitoring-awsglue-with-cloudwatch-metrics"></a>

You can profile and monitor AWS Glue operations using AWS Glue job profiler. It collects and processes raw data from AWS Glue jobs into readable, near real-time metrics stored in Amazon CloudWatch. These statistics are retained and aggregated in CloudWatch so that you can access historical information for a better perspective on how your application is performing.

**Note**  
 You may incur additional charges when you enable job metrics and CloudWatch custom metrics are created. For more information, see [ Amazon CloudWatch pricing ](https://aws.amazon.com/cloudwatch/pricing/). 

## AWS Glue metrics overview
<a name="metrics-overview"></a>

When you interact with AWS Glue, it sends metrics to CloudWatch. You can view these metrics using the AWS Glue console (the preferred method), the CloudWatch console dashboard, or the AWS Command Line Interface (AWS CLI). 

**To view metrics using the AWS Glue console dashboard**

You can view summary or detailed graphs of metrics for a job, or detailed graphs for a job run. 

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Job run monitoring**.

1. In **Job runs**, choose **Actions** to stop a job that is currently running, view a job, or rewind a job bookmark.

1. Select a job, then choose **View run details** to view additional information about the job run.

**To view metrics using the CloudWatch console dashboard**

Metrics are grouped first by the service namespace, and then by the various dimension combinations within each namespace.

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Metrics**.

1. Choose the **Glue** namespace.

**To view metrics using the AWS CLI**
+ At a command prompt, use the following command.

  ```
  aws cloudwatch list-metrics --namespace Glue
  ```

AWS Glue reports metrics to CloudWatch every 30 seconds, and the CloudWatch metrics dashboards are configured to display them every minute. The AWS Glue metrics represent delta values from the previously reported values. Where appropriate, metrics dashboards aggregate (sum) the 30-second values to obtain a value for the entire last minute.
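The delta-and-sum behavior can be sketched as follows: each 30-second report carries the change since the last report, so summing the datapoints within a minute recovers the per-minute value (the sample numbers are illustrative only):

```python
# Sketch: AWS Glue emits delta values every 30 seconds; the metrics
# dashboards sum the datapoints within each minute. Values are illustrative.
deltas = [1024, 2048, 512, 4096]  # bytes read, one datapoint per 30 seconds

def per_minute_totals(datapoints, points_per_minute=2):
    """Sum consecutive 30-second delta datapoints into per-minute values."""
    return [
        sum(datapoints[i:i + points_per_minute])
        for i in range(0, len(datapoints), points_per_minute)
    ]

print(per_minute_totals(deltas))  # one total per minute of datapoints
```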

### AWS Glue metrics behavior for Spark jobs
<a name="metrics-overview-spark"></a>

 AWS Glue metrics are enabled at initialization of a `GlueContext` in a script and are generally updated only at the end of an Apache Spark task. They represent the aggregate values across all completed Spark tasks so far.

However, the Spark metrics that AWS Glue passes on to CloudWatch are generally absolute values representing the current state at the time they are reported. AWS Glue reports them to CloudWatch every 30 seconds, and the metrics dashboards generally show the average across the data points received in the last 1 minute.

AWS Glue metric names are all preceded by one of the following prefixes:
+ `glue.driver.` – Metrics whose names begin with this prefix either represent AWS Glue metrics that are aggregated from all executors at the Spark driver, or Spark metrics corresponding to the Spark driver.
+ `glue.`*executorId*`.` – The *executorId* is the number of a specific Spark executor. It corresponds with the executors listed in the logs.
+ `glue.ALL.` – Metrics whose names begin with this prefix aggregate values from all Spark executors.
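A small helper makes the naming scheme concrete (a sketch; the metric suffix and executor IDs are just illustrative values):

```python
# Sketch: build the three documented name variants for a Glue Spark metric.
# The suffix and executor IDs passed in are illustrative examples.
def metric_names(suffix, executor_ids):
    """Return the driver, ALL, and per-executor variants of a metric name."""
    names = [f"glue.driver.{suffix}", f"glue.ALL.{suffix}"]
    names += [f"glue.{eid}.{suffix}" for eid in executor_ids]
    return names

print(metric_names("jvm.heap.usage", [1, 2]))
```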

## AWS Glue metrics
<a name="awsglue-metrics"></a>

AWS Glue profiles and sends the following metrics to CloudWatch every 30 seconds, and the AWS Glue Metrics Dashboard reports them once a minute:


| Metric | Description | 
| --- | --- | 
|  `glue.driver.aggregate.bytesRead` |  The number of bytes read from all data sources by all completed Spark tasks running in all executors. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation.  Unit: Bytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) This metric can be used the same way as the `glue.ALL.s3.filesystem.read_bytes` metric, with the difference that this metric is updated at the end of a Spark task and captures non-S3 data sources as well.  | 
|  `glue.driver.aggregate.elapsedTime` |  The ETL elapsed time in milliseconds (does not include the job bootstrap times). Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Milliseconds Can be used to determine how long it takes a job run to run on average. Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.aggregate.numCompletedStages` |  The number of completed stages in the job. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.aggregate.numCompletedTasks` |  The number of completed tasks in the job. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.aggregate.numFailedTasks` |  The number of failed tasks. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) The data can be used to set alarms for increased failures that might suggest abnormalities in data, cluster or scripts.  | 
|  `glue.driver.aggregate.numKilledTasks` |  The number of tasks killed. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.aggregate.recordsRead` |  The number of records read from all data sources by all completed Spark tasks running in all executors. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) This metric can be used in a similar way to the `glue.ALL.s3.filesystem.read_bytes` metric, with the difference that this metric is updated at the end of a Spark task.  | 
|   `glue.driver.aggregate.shuffleBytesWritten` |  The number of bytes written by all executors to shuffle data between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes written for this purpose during the previous minute). Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Bytes Can be used to monitor: Data shuffle in jobs (large joins, groupBy, repartition, coalesce). Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.aggregate.shuffleLocalBytesRead` |  The number of bytes read by all executors to shuffle data between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes read for this purpose during the previous minute). Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Bytes Can be used to monitor: Data shuffle in jobs (large joins, groupBy, repartition, coalesce). Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.BlockManager.disk.diskSpaceUsed_MB` |  The number of megabytes of disk space used across all executors. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Average. This is a Spark metric, reported as an absolute value. Unit: Megabytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.ExecutorAllocationManager.executors.numberAllExecutors` |  The number of actively running job executors. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Average. This is a Spark metric, reported as an absolute value. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors` |  The number of maximum (actively running and pending) job executors needed to satisfy the current load. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Maximum. This is a Spark metric, reported as an absolute value. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.jvm.heap.usage`  `glue.`*executorId*`.jvm.heap.usage`  `glue.ALL.jvm.heap.usage`  |  The fraction of memory used by the JVM heap for this driver (scale: 0-1) for driver, executor identified by executorId, or ALL executors. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Average. This is a Spark metric, reported as an absolute value. Unit: Percentage Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.jvm.heap.used`  `glue.`*executorId*`.jvm.heap.used`  `glue.ALL.jvm.heap.used`  |  The number of memory bytes used by the JVM heap for the driver, the executor identified by *executorId*, or ALL executors. Valid dimensions: `JobName` (the name of the AWS Glue Job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Average. This is a Spark metric, reported as an absolute value. Unit: Bytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.s3.filesystem.read_bytes`  `glue.`*executorId*`.s3.filesystem.read_bytes`  `glue.ALL.s3.filesystem.read_bytes`  |  The number of bytes read from Amazon S3 by the driver, an executor identified by *executorId*, or ALL executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes read during the previous minute). Valid dimensions: `JobName`, `JobRunId`, and `Type` (gauge). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard a SUM statistic is used for aggregation. The area under the curve on the AWS Glue Metrics Dashboard can be used to visually compare bytes read by two different job runs. Unit: Bytes. Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Resulting data can be used for: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.s3.filesystem.write_bytes`  `glue.`*executorId*`.s3.filesystem.write_bytes`  `glue.ALL.s3.filesystem.write_bytes`  |  The number of bytes written to Amazon S3 by the driver, an executor identified by *executorId*, or ALL executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes written during the previous minute). Valid dimensions: `JobName`, `JobRunId`, and `Type` (gauge). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard a SUM statistic is used for aggregation. The area under the curve on the AWS Glue Metrics Dashboard can be used to visually compare bytes written by two different job runs. Unit: Bytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.streaming.numRecords` |  The number of records that are received in a micro-batch. This metric is only available for AWS Glue streaming jobs with AWS Glue version 2.0 and above. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: Sum, Maximum, Minimum, Average, Percentile Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.streaming.batchProcessingTimeInMs` |  The time it takes to process the batches in milliseconds. This metric is only available for AWS Glue streaming jobs with AWS Glue version 2.0 and above. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: Sum, Maximum, Minimum, Average, Percentile Unit: Milliseconds Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.system.cpuSystemLoad`  `glue.`*executorId*`.system.cpuSystemLoad`  `glue.ALL.system.cpuSystemLoad`  |  The fraction of CPU system load used (scale: 0-1) by the driver, an executor identified by *executorId*, or ALL executors. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Average. This metric is reported as an absolute value. Unit: Percentage Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 

## Dimensions for AWS Glue metrics
<a name="awsglue-metricdimensions"></a>

AWS Glue metrics use the AWS Glue namespace and provide metrics for the following dimensions:


| Dimension | Description | 
| --- | --- | 
|  `JobName`  |  This dimension filters for metrics of all job runs of a specific AWS Glue job.  | 
|  `JobRunId`  |  This dimension filters for metrics of a specific AWS Glue job run by a JobRun ID, or `ALL`.  | 
|  `Type`  |  This dimension filters for metrics by either `count` (an aggregate number) or `gauge` (a value at a point in time).  | 

For more information, see the [Amazon CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/).
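As a sketch, the dimensions above can be combined into a CloudWatch statistics query shaped like the following (the job name is a hypothetical placeholder; the request dict matches the parameters of the CloudWatch `GetMetricStatistics` API):

```python
from datetime import datetime, timedelta, timezone

# Sketch: a request for one Glue metric over the last hour, filtered by the
# JobName, JobRunId, and Type dimensions. "my-etl-job" is hypothetical.
end = datetime.now(timezone.utc)
request = {
    "Namespace": "Glue",
    "MetricName": "glue.driver.aggregate.numFailedTasks",
    "Dimensions": [
        {"Name": "JobName", "Value": "my-etl-job"},
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "count"},
    ],
    "StartTime": end - timedelta(hours=1),
    "EndTime": end,
    "Period": 60,          # one datapoint per minute
    "Statistics": ["Sum"]  # SUM aggregation for a delta-valued metric
}
# A boto3 CloudWatch client would accept this as:
#   boto3.client("cloudwatch").get_metric_statistics(**request)
```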

# Setting up Amazon CloudWatch alarms on AWS Glue job profiles
<a name="monitor-profile-glue-job-cloudwatch-alarms"></a>

AWS Glue metrics are also available in Amazon CloudWatch. You can set up alarms on any AWS Glue metric for scheduled jobs. 

A few common scenarios for setting up alarms are as follows:
+ Jobs running out of memory (OOM): Set an alarm when the memory usage exceeds the normal average for either the driver or an executor for an AWS Glue job.
+ Straggling executors: Set an alarm when the number of executors falls below a certain threshold for a large duration of time in an AWS Glue job.
+ Data backlog or reprocessing: Compare the metrics from individual jobs in a workflow using a CloudWatch math expression. You can then trigger an alarm on the resulting expression value (such as the ratio of bytes written by a job and bytes read by a following job).
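For instance, the failed-task scenario above could be expressed as a `PutMetricAlarm` request (a sketch only; the alarm name, job name, threshold, and SNS topic ARN are hypothetical placeholders):

```python
# Sketch: alarm when a scheduled Glue job reports any failed tasks.
# The names, threshold, and SNS topic ARN are hypothetical placeholders.
alarm = {
    "AlarmName": "glue-my-etl-job-failed-tasks",
    "Namespace": "Glue",
    "MetricName": "glue.driver.aggregate.numFailedTasks",
    "Dimensions": [
        {"Name": "JobName", "Value": "my-etl-job"},
        {"Name": "JobRunId", "Value": "ALL"},
        {"Name": "Type", "Value": "count"},
    ],
    "Statistic": "Sum",
    "Period": 300,
    "EvaluationPeriods": 1,
    "Threshold": 1.0,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:glue-alerts"],
}
# A boto3 CloudWatch client would accept this as:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm)
```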

For detailed instructions on setting alarms, see [Create or Edit a CloudWatch Alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ConsoleAlarms.html) in the *Amazon CloudWatch User Guide*. 

For monitoring and debugging scenarios using CloudWatch, see [Job monitoring and debugging](monitor-profile-glue-job-cloudwatch-metrics.md).

# Logging for AWS Glue jobs
<a name="monitor-continuous-logging"></a>

 In AWS Glue 5.0, all jobs have real-time logging capabilities. Additionally, you can specify custom configuration options to tailor the logging behavior. These options include setting the Amazon CloudWatch log group name, the Amazon CloudWatch log stream prefix (which will precede the AWS Glue job run ID and driver/executor ID), and the log conversion pattern for log messages. These configurations allow you to aggregate logs in custom Amazon CloudWatch log groups with different expiration policies. Furthermore, you can analyze the logs more effectively by using custom log stream prefixes and conversion patterns. This level of customization enables you to optimize log management and analysis according to your specific requirements. 

## Logging behavior in AWS Glue 5.0
<a name="monitor-logging-behavior-glue-50"></a>

 By default, system logs, Spark daemon logs, and user AWS Glue Logger logs are written to the `/aws-glue/jobs/error` log group in Amazon CloudWatch. On the other hand, user stdout (standard output) and stderr (standard error) logs are written to the `/aws-glue/jobs/output` log group by default. 

## Custom logging
<a name="monitor-logging-custom"></a>

 You can customize the default log group and log stream prefixes using the following job arguments: 
+  `--custom-logGroup-prefix`: Allows you to specify a custom prefix for the `/aws-glue/jobs/error` and `/aws-glue/jobs/output` log groups. If you provide a custom prefix, the log group names will be in the following format: 
  +  `/aws-glue/jobs/error` will be `<customer prefix>/error` 
  +  `/aws-glue/jobs/output ` will be `<customer prefix>/output` 
+  `--custom-logStream-prefix`: Allows you to specify a custom prefix for the log stream names within the log groups. If you provide a custom prefix, the log stream names will be in the following format: 
  +  `jobrunid-driver` will be `<customer log stream>-driver` 
  +  `jobrunid-executorNum` will be `<customer log stream>-executorNum` 
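The resulting names can be sketched as plain string formatting (the prefix values shown are hypothetical):

```python
# Sketch: how the custom prefix arguments above rewrite the default names.
# "myteam/glue" and "billing-run" are hypothetical prefix values.
job_args = {
    "--custom-logGroup-prefix": "myteam/glue",
    "--custom-logStream-prefix": "billing-run",
}

log_groups = [
    f"{job_args['--custom-logGroup-prefix']}/error",
    f"{job_args['--custom-logGroup-prefix']}/output",
]
driver_stream = f"{job_args['--custom-logStream-prefix']}-driver"

print(log_groups, driver_stream)
```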

 Validation rules and limitations for custom prefixes: 
+  The entire log stream name must be between 1 and 512 characters long. 
+  The custom prefix itself is restricted to 400 characters. 
+  The custom prefix must match the regular expression pattern `[^:\$1]\$1` (special characters allowed are '\$1', '-', and '/'). 

## Logging application-specific messages using the custom script logger
<a name="monitor-logging-script"></a>

You can use the AWS Glue logger to log any application-specific messages in the script that are sent in real time to the driver log stream.

The following example shows a Python script.

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
logger.info("info message")
logger.warn("warn message")
logger.error("error message")
```

The following example shows a Scala script.

```
import com.amazonaws.services.glue.log.GlueLogger

object GlueApp {
  def main(sysArgs: Array[String]) {
    val logger = new GlueLogger
    logger.info("info message")
    logger.warn("warn message")
    logger.error("error message")
  }
}
```

## Enabling the progress bar to show job progress
<a name="monitor-logging-progress"></a>

AWS Glue provides a real-time progress bar under the `JOB_RUN_ID-progress-bar` log stream to check AWS Glue job run status. Currently it supports only jobs that initialize `glueContext`. If you run a pure Spark job without initializing `glueContext`, the AWS Glue progress bar does not appear.

The progress bar shows the following progress update every 5 seconds.

```
Stage Number (Stage Name): > [(numCompletedTasks + numActiveTasks) / totalNumOfTasksInThisStage]
```

## Security configuration with Amazon CloudWatch logging
<a name="monitor-security-config-logging"></a>

 When a security configuration is enabled for Amazon CloudWatch logs, AWS Glue creates log groups with specific naming patterns that incorporate the security configuration name. 

### Log group naming with security configuration
<a name="monitor-log-group-naming"></a>

 The default and custom log groups will be as follows: 
+  **Default error log group:** `/aws-glue/jobs/Security-Configuration-Name-role/glue-job-role/error` 
+  **Default output log group:** `/aws-glue/jobs/Security-Configuration-Name-role/glue-job-role/output` 
+  **Custom error log group (AWS Glue 5.0):** `custom-log-group-prefix/Security-Configuration-Name-role/glue-job-role/error` 
+  **Custom output log group (AWS Glue 5.0):** `custom-log-group-prefix/Security-Configuration-Name-role/glue-job-role/output` 

### Required IAM permissions
<a name="monitor-logging-iam-permissions"></a>

 You need to add the `logs:AssociateKmsKey` permission to your IAM role permissions, if you enable a security configuration with Amazon CloudWatch Logs. If that permission is not included, continuous logging will be disabled. 
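A minimal sketch of the extra policy statement for the job role follows (expressed here as a Python dict for illustration; the log-group ARN is a hypothetical placeholder):

```python
# Sketch: IAM policy statement granting logs:AssociateKmsKey so that
# continuous logging works with an encrypted security configuration.
# The resource ARN below is a hypothetical placeholder.
statement = {
    "Effect": "Allow",
    "Action": ["logs:AssociateKmsKey"],
    "Resource": ["arn:aws:logs:us-east-1:123456789012:log-group:/aws-glue/*"],
}
```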

 Also, to configure encryption for Amazon CloudWatch Logs, follow the instructions at [ Encrypt Log Data in Amazon CloudWatch Logs Using AWS Key Management Service](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/encrypt-log-data-kms.html) in the *Amazon CloudWatch Logs User Guide*. 

### Additional information
<a name="additional-info"></a>

 For more information on creating security configurations, see [Managing security configurations on the AWS Glue console](https://docs.aws.amazon.com/glue/latest/dg/console-security-configurations.html). 

**Topics**
+ [Logging behavior in AWS Glue 5.0](#monitor-logging-behavior-glue-50)
+ [Custom logging](#monitor-logging-custom)
+ [Logging application-specific messages using the custom script logger](#monitor-logging-script)
+ [Enabling the progress bar to show job progress](#monitor-logging-progress)
+ [Security configuration with Amazon CloudWatch logging](#monitor-security-config-logging)
+ [Enabling continuous logging for AWS Glue 4.0 and earlier jobs](monitor-continuous-logging-enable.md)
+ [Viewing logs for AWS Glue jobs](monitor-continuous-logging-view.md)

# Enabling continuous logging for AWS Glue 4.0 and earlier jobs
<a name="monitor-continuous-logging-enable"></a>

**Note**  
 In AWS Glue 4.0 and earlier versions, continuous logging was an available feature. However, with the introduction of AWS Glue 5.0, all jobs have real-time logging capability. For more details on the logging capabilities and configuration options in AWS Glue 5.0, see [Logging for AWS Glue jobs](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging.html). 

You can enable continuous logging using the AWS Glue console or through the AWS Command Line Interface (AWS CLI). 

You can enable continuous logging when you create a new job, edit an existing job, or enable it through the AWS CLI.

You can also specify custom configuration options such as the Amazon CloudWatch log group name, the CloudWatch log stream prefix (which precedes the AWS Glue job run ID and driver/executor ID), and the log conversion pattern for log messages. These configurations help you aggregate logs in custom CloudWatch log groups with different expiration policies, and analyze them further with custom log stream prefixes and conversion patterns. 

**Topics**
+ [Using the AWS Management Console](#monitor-continuous-logging-enable-console)
+ [Logging application-specific messages using the custom script logger](#monitor-continuous-logging-script)
+ [Enabling the progress bar to show job progress](#monitor-continuous-logging-progress)
+ [Security configuration with continuous logging](#monitor-logging-encrypt-log-data)

## Using the AWS Management Console
<a name="monitor-continuous-logging-enable-console"></a>

Follow these steps to use the console to enable continuous logging when creating or editing an AWS Glue job.

**To create a new AWS Glue job with continuous logging**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **ETL jobs**.

1. Choose **Visual ETL**.

1. In the **Job details** tab, expand the **Advanced properties** section.

1. Under **Continuous logging**, select **Enable logs in CloudWatch**.

**To enable continuous logging for an existing AWS Glue job**

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Jobs**.

1. Choose an existing job from the **Jobs** list.

1. Choose **Action**, **Edit job**.

1. In the **Job details** tab, expand the **Advanced properties** section.

1. Under **Continuous logging**, select **Enable logs in CloudWatch**.

### Using the AWS CLI
<a name="monitor-continuous-logging-cli"></a>

To enable continuous logging, you pass job parameters to an AWS Glue job. Pass the following special job parameters in the same way as other AWS Glue job parameters. For more information, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

```
'--enable-continuous-cloudwatch-log': 'true'
```

You can specify a custom Amazon CloudWatch log group name. If not specified, the default log group name is `/aws-glue/jobs/logs-v2`.

```
'--continuous-log-logGroup': 'custom_log_group_name'
```

You can specify a custom Amazon CloudWatch log stream prefix. If not specified, the default log stream prefix is the job run ID.

```
'--continuous-log-logStreamPrefix': 'custom_log_stream_prefix'
```

You can specify a custom continuous logging conversion pattern. If not specified, the default conversion pattern is `%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n`. Note that the conversion pattern only applies to driver logs and executor logs. It does not affect the AWS Glue progress bar.

```
'--continuous-log-conversionPattern': 'custom_log_conversion_pattern'
```

## Logging application-specific messages using the custom script logger
<a name="monitor-continuous-logging-script"></a>

You can use the AWS Glue logger to log any application-specific messages in the script that are sent in real time to the driver log stream.

The following example shows a Python script.

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
logger.info("info message")
logger.warn("warn message")
logger.error("error message")
```

The following example shows a Scala script.

```
import com.amazonaws.services.glue.log.GlueLogger

object GlueApp {
  def main(sysArgs: Array[String]) {
    val logger = new GlueLogger
    logger.info("info message")
    logger.warn("warn message")
    logger.error("error message")
  }
}
```

## Enabling the progress bar to show job progress
<a name="monitor-continuous-logging-progress"></a>

AWS Glue provides a real-time progress bar under the `JOB_RUN_ID-progress-bar` log stream to check AWS Glue job run status. Currently it supports only jobs that initialize `glueContext`. If you run a pure Spark job without initializing `glueContext`, the AWS Glue progress bar does not appear.

The progress bar shows the following progress update every 5 seconds.

```
Stage Number (Stage Name): [(numCompletedTasks + numActiveTasks) / totalNumOfTasksInThisStage]
```
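
The exact rendering is not a documented contract, but as an illustration, a Spark-style progress line such as `[Stage 3:=> (2 + 1) / 10]` can be parsed to estimate percent completion. `parse_progress` is a hypothetical helper, not part of the AWS Glue API.

```python
import re

def parse_progress(line):
    """Parse a Spark-style progress line, e.g. '[Stage 3:=> (2 + 1) / 10]'.

    Returns (stage, completed, active, total), or None if the line does
    not look like a progress update. Illustrative only; the exact log
    format is not a documented contract.
    """
    m = re.search(r"\[Stage (\d+):[^(]*\((\d+) \+ (\d+)\) / (\d+)\]", line)
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())

stage, completed, active, total = parse_progress("[Stage 3:=> (2 + 1) / 10]")
percent = 100.0 * (completed + active) / total  # completed plus in-flight tasks
```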

## Security configuration with continuous logging
<a name="monitor-logging-encrypt-log-data"></a>

If a security configuration is enabled for CloudWatch logs, AWS Glue will create a log group named as follows for continuous logs:

```
<Log-Group-Name>-<Security-Configuration-Name>
```

The default and custom log groups will be as follows:
+ The default continuous log group will be `/aws-glue/jobs/logs-v2-<Security-Configuration-Name>`
+ The custom continuous log group will be `<custom-log-group-name>-<Security-Configuration-Name>`

If you enable a security configuration with CloudWatch Logs, you need to add the `logs:AssociateKmsKey` permission to your IAM role. If that permission is not included, continuous logging is disabled. Also, to configure encryption for the CloudWatch Logs, follow the instructions at [Encrypt Log Data in CloudWatch Logs Using AWS Key Management Service](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/encrypt-log-data-kms.html) in the *Amazon CloudWatch Logs User Guide*.
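
As a sketch, a minimal IAM policy statement granting that permission might look like the following. The Region, account ID, and log group ARN are placeholders; scope `Resource` to your own log groups.

```json
{
    "Effect": "Allow",
    "Action": "logs:AssociateKmsKey",
    "Resource": "arn:aws:logs:us-east-1:111122223333:log-group:/aws-glue/*"
}
```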

For more information on creating security configurations, see [Managing security configurations on the AWS Glue console](console-security-configurations.md).

**Note**  
 You may incur additional charges when you enable logging and additional CloudWatch log events are created. For more information, see [ Amazon CloudWatch pricing ](https://aws.amazon.com/cloudwatch/pricing/). 

# Viewing logs for AWS Glue jobs
<a name="monitor-continuous-logging-view"></a>

You can view real-time logs using the AWS Glue console or the Amazon CloudWatch console.

**To view real-time logs using the AWS Glue console dashboard**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Jobs**.

1. Add a new job, or choose an existing job from the list. Then choose **Action**, **Run job**.

   When you start running a job, you navigate to a page that contains information about the running job:
   + The **Logs** tab shows the older aggregated application logs.
   + The **Logs** tab shows a real-time progress bar when the job is running with `glueContext` initialized.
   + The **Logs** tab also contains the **Driver logs**, which capture real-time Apache Spark driver logs, and application logs from the script logged using the AWS Glue application logger when the job is running.

1. For older jobs, you can also view the real-time logs under the **Job History** view by choosing **Logs**. This action takes you to the CloudWatch console that shows all Spark driver, executor, and progress bar log streams for that job run.

**To view real-time logs using the CloudWatch console dashboard**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Logs**.

1. Choose the **/aws-glue/jobs/logs-v2/** log group.

1. In the **Filter** box, paste the job run ID.

   You can view the driver logs, executor logs, and progress bar (if using the **Standard filter**).

# Monitoring with AWS Glue Observability metrics
<a name="monitor-observability"></a>

**Note**  
AWS Glue Observability metrics are available in AWS Glue 4.0 and later versions.

 Use AWS Glue Observability metrics to generate insights into what is happening inside your AWS Glue for Apache Spark jobs to improve triaging and analysis of issues. Observability metrics are visualized through Amazon CloudWatch dashboards and can be used to help perform root cause analysis for errors and for diagnosing performance bottlenecks. You can reduce the time spent debugging issues at scale so you can focus on resolving issues faster and more effectively. 

 AWS Glue Observability provides Amazon CloudWatch metrics categorized into the following four groups: 
+  **Reliability (error classes)** – easily identify the most common failure reasons in a given time range that you may want to address. 
+  **Performance (skewness)** – identify performance bottlenecks and apply tuning techniques. For example, when you experience degraded performance due to job skewness, you may want to enable Spark Adaptive Query Execution and fine-tune the skew join threshold. 
+  **Throughput (per-source/per-sink throughput)** – monitor trends of data reads and writes. You can also configure Amazon CloudWatch alarms for anomalies. 
+  **Resource utilization (workers, memory, and disk utilization)** – efficiently find jobs with low capacity utilization. You may want to enable AWS Glue auto scaling for those jobs. 

## Getting started with AWS Glue Observability metrics
<a name="monitor-observability-getting-started"></a>

**Note**  
 The new metrics are enabled by default in the AWS Glue Studio console. 

**To configure observability metrics in AWS Glue Studio:**

1. Log in to the AWS Glue console and choose **ETL jobs** from the console menu.

1. Choose the job name in the **Your jobs** section.

1. Choose the **Job details** tab.

1. Scroll to the bottom and choose **Advanced properties**, then **Job observability metrics**.  
![\[The screenshot shows the Job details tab Advanced properties. The Job observability metrics option is highlighted.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job-details-observability-metrics.png)

**To enable AWS Glue Observability metrics using AWS CLI:**
+  Add to the `--default-arguments` map the following key-value in the input JSON file: 

  ```
  --enable-observability-metrics, true
  ```
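
For context, a sketch of where this key-value pair sits in a `create-job` (or `update-job`) input JSON file. The job name, role, and script location below are placeholders.

```json
{
    "Name": "my-glue-job",
    "Role": "arn:aws:iam::111122223333:role/MyGlueJobRole",
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": "s3://amzn-s3-demo-bucket/script.py"
    },
    "DefaultArguments": {
        "--enable-observability-metrics": "true"
    }
}
```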

## Using AWS Glue observability
<a name="monitor-observability-cloudwatch"></a>

 Because AWS Glue observability metrics are provided through Amazon CloudWatch, you can use the Amazon CloudWatch console, AWS CLI, SDK, or API to query the observability metric data points. See [ Using Glue Observability for monitoring resource utilization to reduce cost ](https://aws.amazon.com/blogs/big-data/enhance-monitoring-and-debugging-for-aws-glue-jobs-using-new-job-observability-metrics/) for an example use case for AWS Glue observability metrics. 

### Using AWS Glue observability in the Amazon CloudWatch console
<a name="monitor-observability-cloudwatch-console"></a>

**To query and visualize metrics in the Amazon CloudWatch console:**

1.  Open the Amazon CloudWatch console and choose **All metrics**. 

1.  Under custom namespaces, choose **AWS Glue**. 

1.  Choose **Job Observability Metrics**, **Observability Metrics Per Source**, or **Observability Metrics Per Sink**. 

1. Search for the specific metric name, job name, and job run ID, and select them.

1. Under the **Graphed metrics** tab, configure your preferred statistic, period, and other options.  
![\[The screenshot shows the Amazon CloudWatch console and metrics graph.\]](http://docs.aws.amazon.com/glue/latest/dg/images/cloudwatch-console-metrics.png)

**To query an Observability metric using AWS CLI:**

1.  Create a metric definition JSON file, and replace `your-Glue-job-name` and `your-Glue-job-run-id` with your own values. 

   ```
   $ cat multiplequeries.json
   [
       {
           "Id": "avgWorkerUtil_0",
           "MetricStat": {
               "Metric": {
                   "Namespace": "Glue",
                   "MetricName": "glue.driver.workerUtilization",
                   "Dimensions": [
                       {
                           "Name": "JobName",
                           "Value": "<your-Glue-job-name-A>"
                       },
                       {
                           "Name": "JobRunId",
                           "Value": "<your-Glue-job-run-id-A>"
                       },
                       {
                           "Name": "Type",
                           "Value": "gauge"
                       },
                       {
                           "Name": "ObservabilityGroup",
                           "Value": "resource_utilization"
                       }
                   ]
               },
               "Period": 1800,
               "Stat": "Minimum",
               "Unit": "None"
           }
       },
       {
           "Id": "avgWorkerUtil_1",
           "MetricStat": {
               "Metric": {
                   "Namespace": "Glue",
                   "MetricName": "glue.driver.workerUtilization",
                   "Dimensions": [
                       {
                           "Name": "JobName",
                           "Value": "<your-Glue-job-name-B>"
                       },
                       {
                           "Name": "JobRunId",
                           "Value": "<your-Glue-job-run-id-B>"
                       },
                       {
                           "Name": "Type",
                           "Value": "gauge"
                       },
                       {
                           "Name": "ObservabilityGroup",
                           "Value": "resource_utilization"
                       }
                   ]
               },
               "Period": 1800,
               "Stat": "Minimum",
               "Unit": "None"
           }
       }
   ]
   ```

1.  Run the `get-metric-data` command: 

   ```
   $ aws cloudwatch get-metric-data --metric-data-queries file://multiplequeries.json \
        --start-time '2023-10-28T18:20' \
        --end-time '2023-10-28T19:10'  \
        --region us-east-1
   {
       "MetricDataResults": [
           {
               "Id": "avgWorkerUtil_0",
               "Label": "<your-label-for-A>",
               "Timestamps": [
                   "2023-10-28T18:20:00+00:00"
               ],
               "Values": [
                   0.06718750000000001
               ],
               "StatusCode": "Complete"
           },
           {
               "Id": "avgWorkerUtil_1",
               "Label": "<your-label-for-B>",
               "Timestamps": [
                   "2023-10-28T18:50:00+00:00"
               ],
               "Values": [
                   0.5959183673469387
               ],
               "StatusCode": "Complete"
           }
       ],
       "Messages": []
   }
   ```
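
The same two-query request can be assembled programmatically. The following sketch builds the `MetricDataQueries` structure (the job names and run IDs are placeholders) that you could then pass to the CloudWatch `GetMetricData` API, for example through boto3.

```python
def worker_utilization_query(query_id, job_name, job_run_id):
    """Build one GetMetricData query for the glue.driver.workerUtilization
    observability metric. job_name and job_run_id are placeholders that
    you supply."""
    return {
        "Id": query_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "Glue",
                "MetricName": "glue.driver.workerUtilization",
                "Dimensions": [
                    {"Name": "JobName", "Value": job_name},
                    {"Name": "JobRunId", "Value": job_run_id},
                    {"Name": "Type", "Value": "gauge"},
                    {"Name": "ObservabilityGroup", "Value": "resource_utilization"},
                ],
            },
            "Period": 1800,
            "Stat": "Minimum",
        },
    }

queries = [
    worker_utilization_query("avgWorkerUtil_0", "job-A", "run-A"),
    worker_utilization_query("avgWorkerUtil_1", "job-B", "run-B"),
]
# The list can then be passed to CloudWatch, for example:
# boto3.client("cloudwatch").get_metric_data(
#     MetricDataQueries=queries, StartTime=start, EndTime=end)
```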

## Observability metrics
<a name="monitor-observability-metrics-definitions"></a>

 AWS Glue Observability profiles and sends the following metrics to Amazon CloudWatch every 30 seconds. Some of these metrics are visible on the AWS Glue Studio job run monitoring page. 


| Metric | Description | Category | 
| --- | --- | --- | 
| glue.driver.skewness.stage |  Metric Category: job\_performance  The Spark stage execution skewness: this metric is an indicator of how long the maximum task duration in a certain stage is compared to the median task duration in that stage. It captures execution skewness, which might be caused by input data skewness or by a transformation (e.g., a skewed join). The values of this metric fall into the range [0, infinity), where 0 means that the ratio of the maximum to the median task execution time, among all tasks in the stage, is less than a certain stage skewness factor. The default stage skewness factor is `5`, and it can be overridden via the Spark configuration spark.metrics.conf.driver.source.glue.jobPerformance.skewnessFactor. A stage skewness value of 1 means the ratio is twice the stage skewness factor.  The value of stage skewness is updated every 30 seconds to reflect the current skewness. The value at the end of the stage reflects the final stage skewness. This stage-level metric is used to calculate the job-level metric `glue.driver.skewness.job`. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (job\_performance) Valid Statistics: Average, Maximum, Minimum, Percentile Unit: Count  | job\_performance | 
| glue.driver.skewness.job |  Metric Category: job\_performance  Job skewness is the maximum of the weighted skewness of all stages. The stage skewness (glue.driver.skewness.stage) is weighted by stage duration. This avoids the corner case in which a very skewed stage actually runs for a very short time relative to other stages (so its skewness is not significant for the overall job performance and is not worth the effort of addressing).  This metric is updated upon completion of each stage, so the last value reflects the actual overall job skewness. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (job\_performance) Valid Statistics: Average, Maximum, Minimum, Percentile Unit: Count  | job\_performance | 
| glue.succeed.ALL |  Metric Category: error  Total number of successful job runs, completing the picture of failure categories. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (count), and ObservabilityGroup (error) Valid Statistics: SUM Unit: Count  | error | 
| glue.error.ALL |  Metric Category: error  Total number of job run errors, completing the picture of failure categories. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (count), and ObservabilityGroup (error) Valid Statistics: SUM Unit: Count  | error | 
| glue.error.[error category] |  Metric Category: error  This is actually a set of metrics that are updated only when a job run fails. The error categorization helps with triaging and debugging. When a job run fails, the error causing the failure is categorized and the corresponding error category metric is set to 1. This helps you perform failure analysis over time, as well as error analysis across all jobs, to identify the most common failure categories and start addressing them. AWS Glue has 28 error categories, including the OUT\_OF\_MEMORY (driver and executor), PERMISSION, SYNTAX, and THROTTLING error categories. Error categories also include the COMPILATION, LAUNCH, and TIMEOUT error categories. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (count), and ObservabilityGroup (error) Valid Statistics: SUM Unit: Count  | error | 
| glue.driver.workerUtilization |  Metric Category: resource\_utilization  The percentage of allocated workers that are actually used. If utilization is low, auto scaling can help. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average, Maximum, Minimum, Percentile Unit: Percentage  | resource\_utilization | 
| glue.driver.memory.heap.[available \| used] |  Metric Category: resource\_utilization  The driver's available/used heap memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.driver.memory.heap.used.percentage |  Metric Category: resource\_utilization  The driver's used (%) heap memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.driver.memory.non-heap.[available \| used] |  Metric Category: resource\_utilization  The driver's available/used non-heap memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.driver.memory.non-heap.used.percentage |  Metric Category: resource\_utilization  The driver's used (%) non-heap memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.driver.memory.total.[available \| used] |  Metric Category: resource\_utilization  The driver's available/used total memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.driver.memory.total.used.percentage |  Metric Category: resource\_utilization  The driver's used (%) total memory during the job run. This helps you understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.ALL.memory.heap.[available \| used] |  Metric Category: resource\_utilization  The executors' available/used heap memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.ALL.memory.heap.used.percentage |  Metric Category: resource\_utilization  The executors' used (%) heap memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.ALL.memory.non-heap.[available \| used] |  Metric Category: resource\_utilization  The executors' available/used non-heap memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.ALL.memory.non-heap.used.percentage |  Metric Category: resource\_utilization  The executors' used (%) non-heap memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.ALL.memory.total.[available \| used] |  Metric Category: resource\_utilization  The executors' available/used total memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.ALL.memory.total.used.percentage |  Metric Category: resource\_utilization  The executors' used (%) total memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.driver.disk.[available\_GB \| used\_GB] |  Metric Category: resource\_utilization  The driver's available/used disk space during the job run. This helps you understand disk usage trends, especially over time, which can help avoid potential failures, in addition to debugging failures related to insufficient disk space. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Gigabytes  | resource\_utilization | 
| glue.driver.disk.used.percentage |  Metric Category: resource\_utilization  The driver's used (%) disk space during the job run. This helps you understand disk usage trends, especially over time, which can help avoid potential failures, in addition to debugging failures related to insufficient disk space. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.ALL.disk.[available\_GB \| used\_GB] |  Metric Category: resource\_utilization  The executors' available/used disk space. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Gigabytes  | resource\_utilization | 
| glue.ALL.disk.used.percentage |  Metric Category: resource\_utilization  The executors' used (%) disk space. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.driver.bytesRead |  Metric Category: throughput  The number of bytes read per input source in this job run, as well as for ALL sources. This helps you understand the data volume and its changes over time, which helps address issues such as data skewness. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Source (the source data location) Valid Statistics: Average Unit: Bytes  | throughput | 
| glue.driver.[recordsRead \| filesRead]  |  Metric Category: throughput  The number of records/files read per input source in this job run, as well as for ALL sources. This helps you understand the data volume and its changes over time, which helps address issues such as data skewness. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Source (the source data location) Valid Statistics: Average Unit: Count  | throughput | 
| glue.driver.partitionsRead  |  Metric Category: throughput  The number of partitions read per Amazon S3 input source in this job run, as well as for ALL sources. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Source (the source data location) Valid Statistics: Average Unit: Count  | throughput | 
| glue.driver.bytesWritten |  Metric Category: throughput  The number of bytes written per output sink in this job run, as well as for ALL sinks. This helps you understand the data volume and how it evolves over time, which helps address issues such as processing skewness. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Sink (the sink data location) Valid Statistics: Average Unit: Bytes  | throughput | 
| glue.driver.[recordsWritten \| filesWritten] |  Metric Category: throughput  The number of records/files written per output sink in this job run, as well as for ALL sinks. This helps you understand the data volume and how it evolves over time, which helps address issues such as processing skewness. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Sink (the sink data location) Valid Statistics: Average Unit: Count  | throughput | 

## Error categories
<a name="monitor-observability-error-categories"></a>


| Error categories | Description | 
| --- | --- | 
| COMPILATION\_ERROR | Errors that arise during compilation of Scala code. | 
| CONNECTION\_ERROR | Errors that arise when connecting to a service, remote host, database, and so on. | 
| DISK\_NO\_SPACE\_ERROR |  Errors that arise when there is no disk space left on the driver or an executor.  | 
| OUT\_OF\_MEMORY\_ERROR | Errors that arise when there is no memory left on the driver or an executor. | 
| IMPORT\_ERROR | Errors that arise when importing dependencies. | 
| INVALID\_ARGUMENT\_ERROR | Errors that arise when the input arguments are invalid or illegal. | 
| PERMISSION\_ERROR | Errors that arise when permission to a service, data, and so on is missing.  | 
| RESOURCE\_NOT\_FOUND\_ERROR |  Errors that arise when data, a location, and so on does not exist.   | 
| QUERY\_ERROR | Errors that arise from Spark SQL query execution.  | 
| SYNTAX\_ERROR | Errors that arise when there is a syntax error in the script.  | 
| THROTTLING\_ERROR | Errors that arise when hitting a service concurrency limit or exceeding a service quota. | 
| DATA\_LAKE\_FRAMEWORK\_ERROR | Errors that arise from an AWS Glue natively supported data lake framework such as Hudi or Iceberg. | 
| UNSUPPORTED\_OPERATION\_ERROR | Errors that arise when performing an unsupported operation. | 
| RESOURCES\_ALREADY\_EXISTS\_ERROR | Errors that arise when a resource to be created or added already exists. | 
| GLUE\_INTERNAL\_SERVICE\_ERROR | Errors that arise when there is an AWS Glue internal service issue.  | 
| GLUE\_OPERATION\_TIMEOUT\_ERROR | Errors that arise when an AWS Glue operation times out. | 
| GLUE\_VALIDATION\_ERROR | Errors that arise when a required value could not be validated for an AWS Glue job. | 
| GLUE\_JOB\_BOOKMARK\_VERSION\_MISMATCH\_ERROR | Errors that arise when the same job runs on the same source bucket and writes to the same or a different destination concurrently (concurrency > 1). | 
| LAUNCH\_ERROR | Errors that arise during the AWS Glue job launch phase. | 
| DYNAMODB\_ERROR | Generic errors that arise from the Amazon DynamoDB service. | 
| GLUE\_ERROR | Generic errors that arise from the AWS Glue service. | 
| LAKEFORMATION\_ERROR | Generic errors that arise from the AWS Lake Formation service. | 
| REDSHIFT\_ERROR | Generic errors that arise from the Amazon Redshift service. | 
| S3\_ERROR | Generic errors that arise from the Amazon S3 service. | 
| SYSTEM\_EXIT\_ERROR | Generic system exit errors. | 
| TIMEOUT\_ERROR | Generic errors that arise when the job fails with an operation timeout. | 
| UNCLASSIFIED\_SPARK\_ERROR | Generic errors that arise from Spark. | 
| UNCLASSIFIED\_ERROR | Default error category. | 

## Limitations
<a name="monitoring-observability-limitations"></a>

**Note**  
`glueContext` must be initialized to publish the metrics.

 In the Source dimension, the value is either an Amazon S3 path or a table name, depending on the source type. In addition, if the source is JDBC and the query option is used, the query string is set in the Source dimension. If the value is longer than 500 characters, it is trimmed to 500 characters. The following limitations apply to the value: 
+ Non-ASCII characters are removed.
+ If the source name doesn't contain any ASCII characters, it is converted to `<non-ASCII input>`.
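
To make these rules concrete, the following sketch mimics the documented trimming and ASCII-stripping behavior. `normalize_source_dimension` is an illustrative helper, not service code.

```python
def normalize_source_dimension(value, limit=500):
    """Illustrate the documented Source-dimension rules: non-ASCII
    characters are removed, the value is trimmed to 500 characters,
    and an all-non-ASCII name becomes '<non-ASCII input>'.
    This mimics the behavior described above; it is not the service code.
    """
    ascii_only = "".join(ch for ch in value if ord(ch) < 128)
    if not ascii_only:
        return "<non-ASCII input>"
    return ascii_only[:limit]
```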

### Limitations and considerations for throughput metrics
<a name="monitoring-observability-considerations"></a>
+  DataFrame and DataFrame-based DynamicFrame reads and writes (for example, JDBC, or reading Parquet on Amazon S3) are supported; RDD-based DynamicFrame reads and writes (for example, reading CSV or JSON on Amazon S3) are not supported. Technically, all reads and writes visible on the Spark UI are supported. 
+  The `recordsRead` metric is emitted if the data source is a catalog table and the format is JSON, CSV, text, or Iceberg. 
+  The `glue.driver.throughput.recordsWritten`, `glue.driver.throughput.bytesWritten`, and `glue.driver.throughput.filesWritten` metrics are not available for JDBC and Iceberg tables. 
+  Metrics may be delayed. If the job finishes in about one minute, there may be no throughput metrics in Amazon CloudWatch Metrics. 

# Job monitoring and debugging
<a name="monitor-profile-glue-job-cloudwatch-metrics"></a>

You can collect metrics about AWS Glue jobs and visualize them on the AWS Glue and Amazon CloudWatch consoles to identify and fix issues. Profiling your AWS Glue jobs requires the following steps:

1.  Enable metrics: 

   1.  Enable the **Job metrics** option in the job definition. You can enable profiling in the AWS Glue console or as a parameter to the job. For more information, see [Defining job properties for Spark jobs](add-job.md#create-job) or [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md). 

   1.  Enable the **AWS Glue Observability metrics** option in the job definition. You can enable observability in the AWS Glue console or as a parameter to the job. For more information, see [Monitoring with AWS Glue Observability metrics](monitor-observability.md). 

1. Confirm that the job script initializes a `GlueContext`. For example, the following script snippet initializes a `GlueContext` and shows where profiled code is placed in the script. This general format is used in the debugging scenarios that follow. 

   ```
   import sys
   from awsglue.transforms import *
   from awsglue.utils import getResolvedOptions
   from pyspark.context import SparkContext
   from awsglue.context import GlueContext
   from awsglue.job import Job
   import time
   
   ## @params: [JOB_NAME]
   args = getResolvedOptions(sys.argv, ['JOB_NAME'])
   
   sc = SparkContext()
   glueContext = GlueContext(sc)
   spark = glueContext.spark_session
   job = Job(glueContext)
   job.init(args['JOB_NAME'], args)
   
   ...
   ...
   code-to-profile
   ...
   ...
   
   
   job.commit()
   ```

1. Run the job.

1. Visualize the metrics:

   1. Visualize job metrics on the AWS Glue console and identify abnormal metrics for the driver or an executor.

   1. Check observability metrics in the Job run monitoring page, job run details page, or on Amazon CloudWatch. For more information, see [Monitoring with AWS Glue Observability metrics](monitor-observability.md).

1. Narrow down the root cause using the identified metric.

1. Optionally, confirm the root cause using the log stream of the identified driver or job executor.
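
The metrics options in step 1 can also be set as job parameters rather than console toggles. A minimal sketch, assuming the standard `--enable-metrics` and `--enable-observability-metrics` special parameters; the `boto3` usage in the comment is illustrative:

```python
# Job parameters corresponding to steps 1a and 1b above.
default_arguments = {
    "--enable-metrics": "true",                # step 1a: job metrics
    "--enable-observability-metrics": "true",  # step 1b: observability metrics
}

# Illustrative: these would be passed as DefaultArguments when creating or
# updating the job, for example with boto3:
#   boto3.client("glue").update_job(
#       JobName="my-job",
#       JobUpdate={..., "DefaultArguments": default_arguments})
print(sorted(default_arguments))
```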

 **Use cases for AWS Glue observability metrics** 
+  [Debugging OOM exceptions and job abnormalities](monitor-profile-debug-oom-abnormalities.md) 
+  [Debugging demanding stages and straggler tasks](monitor-profile-debug-straggler.md) 
+  [Monitoring the progress of multiple jobs](monitor-debug-multiple.md) 
+  [Monitoring for DPU capacity planning](monitor-debug-capacity.md) 
+  [ Using AWS Glue Observability for monitoring resource utilization to reduce cost ](https://aws.amazon.com/blogs/big-data/enhance-monitoring-and-debugging-for-aws-glue-jobs-using-new-job-observability-metrics) 

# Debugging OOM exceptions and job abnormalities
<a name="monitor-profile-debug-oom-abnormalities"></a>

You can debug out-of-memory (OOM) exceptions and job abnormalities in AWS Glue. The following sections describe scenarios for debugging out-of-memory exceptions of the Apache Spark driver or a Spark executor. 
+ [Debugging a driver OOM exception](#monitor-profile-debug-oom-driver)
+ [Debugging an executor OOM exception](#monitor-profile-debug-oom-executor)

## Debugging a driver OOM exception
<a name="monitor-profile-debug-oom-driver"></a>

In this scenario, a Spark job is reading a large number of small files from Amazon Simple Storage Service (Amazon S3). It converts the files to Apache Parquet format and then writes them out to Amazon S3. The Spark driver is running out of memory. The input Amazon S3 data has more than 1 million files in different Amazon S3 partitions. 

The profiled code is as follows:

```
data = spark.read.format("json").option("inferSchema", False).load("s3://input_path")
data.write.format("parquet").save(output_path)
```

### Visualize the profiled metrics on the AWS Glue console
<a name="monitor-debug-oom-visualize"></a>

The following graph shows the memory usage as a percentage for the driver and executors. This usage is plotted as one data point that is averaged over the values reported in the last minute. You can see in the memory profile of the job that the [driver memory](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.jvm.heap.usage) crosses the safe threshold of 50 percent usage quickly. On the other hand, the [average memory usage](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.jvm.heap.usage) across all executors is still less than 4 percent. This clearly shows abnormality with driver execution in this Spark job. 

![\[The memory usage in percentage for the driver and executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-memoryprofile.png)


The job run soon fails, and the following error appears in the **History** tab on the AWS Glue console: `Command Failed with Exit Code 1`. This error string means that the job failed due to a systemic error, which in this case is the driver running out of memory.

![\[The error message shown on the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-errorstring.png)


On the console, choose the **Error logs** link on the **History** tab to confirm the finding about driver OOM from the CloudWatch Logs. Search for "**Error**" in the job's error logs to confirm that it was indeed an OOM exception that failed the job:

```
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 12039"...
```

On the **History** tab for the job, choose **Logs**. You can find the following trace of driver execution in the CloudWatch Logs at the beginning of the job. The Spark driver tries to list all the files in all the directories, constructs an `InMemoryFileIndex`, and launches one task per file. This in turn results in the Spark driver having to maintain a large amount of state in memory to track all the tasks. It caches the complete list of a large number of files for the in-memory index, resulting in a driver OOM.

### Fix the processing of multiple files using grouping
<a name="monitor-debug-oom-fix"></a>

You can fix the processing of the multiple files by using the *grouping* feature in AWS Glue. Grouping is automatically enabled when you use dynamic frames and when the input dataset has a large number of files (more than 50,000). Grouping allows you to coalesce multiple files together into a group, and it allows a task to process the entire group instead of a single file. As a result, the Spark driver stores significantly less state in memory to track fewer tasks. For more information about manually enabling grouping for your dataset, see [Reading input files in larger groups](grouping-input-files.md).
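
To see why grouping shrinks driver state, consider a rough task-count model. The file counts, sizes, and fixed group size below are hypothetical, and AWS Glue's actual grouping heuristics may differ:

```python
import math

def tasks_with_grouping(num_files: int, avg_file_bytes: int, group_bytes: int) -> int:
    """Rough model: each task processes one group of files whose total
    size is about group_bytes, instead of one task per file."""
    files_per_group = max(1, group_bytes // avg_file_bytes)
    return math.ceil(num_files / files_per_group)

# 1 million small files of ~50 KB grouped into ~1 MB groups: the driver
# tracks ~50,000 tasks instead of 1,000,000.
print(tasks_with_grouping(1_000_000, 50_000, 1_000_000))
```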

To check the memory profile of the AWS Glue job, profile the following code with grouping enabled:

```
df = glueContext.create_dynamic_frame_from_options(
    "s3",
    {"paths": ["s3://input_path"], "recurse": True, "groupFiles": "inPartition"},
    format="json")
datasink = glueContext.write_dynamic_frame.from_options(
    frame=df,
    connection_type="s3",
    connection_options={"path": output_path},
    format="parquet",
    transformation_ctx="datasink")
```

You can monitor the memory profile and the ETL data movement in the AWS Glue job profile.

The driver runs below the threshold of 50 percent memory usage over the entire duration of the AWS Glue job. The executors stream the data from Amazon S3, process it, and write it out to Amazon S3. As a result, they consume less than 5 percent memory at any point in time.

![\[The memory profile showing the issue is fixed.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-memoryprofile-fixed.png)


The data movement profile below shows the total number of Amazon S3 bytes that are [read](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.read_bytes) and [written](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.write_bytes) in the last minute by all executors as the job progresses. Both follow a similar pattern as the data is streamed across all the executors. The job finishes processing all one million files in less than three hours.

![\[The data movement profile showing the issue is fixed.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-etlmovement.png)


## Debugging an executor OOM exception
<a name="monitor-profile-debug-oom-executor"></a>

In this scenario, you can learn how to debug OOM exceptions that could occur in Apache Spark executors. The following code uses the Spark MySQL reader to read a large table of about 34 million rows into a Spark dataframe. It then writes it out to Amazon S3 in Parquet format. You can provide the connection properties and use the default Spark configurations to read the table.

```
val connectionProperties = new Properties()
connectionProperties.put("user", user)
connectionProperties.put("password", password)
connectionProperties.put("Driver", "com.mysql.jdbc.Driver")
val sparkSession = glueContext.sparkSession
val dfSpark = sparkSession.read.jdbc(url, tableName, connectionProperties)
dfSpark.write.format("parquet").save(output_path)
```

### Visualize the profiled metrics on the AWS Glue console
<a name="monitor-debug-oom-visualize-2"></a>

If the slope of the memory usage graph is positive and crosses 50 percent, and the job fails before the next metric is emitted, then memory exhaustion is a good candidate for the cause. The following graph shows that within a minute of execution, the [average memory usage](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.jvm.heap.usage) across all executors spikes quickly above 50 percent. The usage reaches up to 92 percent, and the container running the executor is stopped by Apache Hadoop YARN. 

![\[The average memory usage across all executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-2-memoryprofile.png)


As the following graph shows, there is always a [single executor](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.ExecutorAllocationManager.executors.numberAllExecutors) running until the job fails. This is because a new executor is launched to replace the stopped executor. The JDBC data source reads are not parallelized by default because it would require partitioning the table on a column and opening multiple connections. As a result, only one executor reads in the complete table sequentially.

![\[The job execution shows a single executor running until the job fails.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-2-execution.png)


As the following graph shows, Spark tries to launch a new task four times before failing the job. You can see the [memory profile](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.jvm.heap.used) of three executors. Each executor quickly uses up all of its memory. The fourth executor runs out of memory, and the job fails. As a result, its metric is not reported immediately.

![\[The memory profiles of the executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-2-exec-memprofile.png)


You can confirm from the error string on the AWS Glue console that the job failed due to OOM exceptions, as shown in the following image.

![\[The error message shown on the AWS Glue console.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-2-errorstring.png)


**Job output logs:** To further confirm your finding of an executor OOM exception, look at the CloudWatch Logs. When you search for **Error**, you find the four executors being stopped in roughly the same time windows as shown on the metrics dashboard. All are terminated by YARN as they exceed their memory limits.

Executor 1

```
18/06/13 16:54:29 WARN YarnAllocator: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:54:29 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:54:29 ERROR YarnClusterScheduler: Lost executor 1 on ip-10-1-2-175.ec2.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:54:29 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, ip-10-1-2-175.ec2.internal, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
```

Executor 2

```
18/06/13 16:55:35 WARN YarnAllocator: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:55:35 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:55:35 ERROR YarnClusterScheduler: Lost executor 2 on ip-10-1-2-16.ec2.internal: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:55:35 WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1, ip-10-1-2-16.ec2.internal, executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
```

Executor 3

```
18/06/13 16:56:37 WARN YarnAllocator: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:56:37 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:56:37 ERROR YarnClusterScheduler: Lost executor 3 on ip-10-1-2-189.ec2.internal: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:56:37 WARN TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2, ip-10-1-2-189.ec2.internal, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.8 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
```

Executor 4

```
18/06/13 16:57:18 WARN YarnAllocator: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:57:18 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:57:18 ERROR YarnClusterScheduler: Lost executor 4 on ip-10-1-2-96.ec2.internal: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/06/13 16:57:18 WARN TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3, ip-10-1-2-96.ec2.internal, executor 4): ExecutorLostFailure (executor 4 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
```

### Fix the fetch size setting using AWS Glue dynamic frames
<a name="monitor-debug-oom-fix-2"></a>

The executor ran out of memory while reading the JDBC table because the default configuration for the Spark JDBC fetch size is zero. This means that the JDBC driver on the Spark executor tries to fetch the 34 million rows from the database together and cache them, even though Spark streams through the rows one at a time. With Spark, you can avoid this scenario by setting the fetch size parameter to a non-zero default value.
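
The effect described above can be captured in a rough model. This is illustrative only, not Spark's or the JDBC driver's actual buffering logic:

```python
def rows_buffered(total_rows: int, fetchsize: int) -> int:
    """Rough model: a fetch size of zero makes the JDBC driver buffer
    every row at once; a non-zero fetch size bounds the buffer."""
    return total_rows if fetchsize == 0 else min(total_rows, fetchsize)

print(rows_buffered(34_000_000, 0))     # entire 34M-row table buffered in memory
print(rows_buffered(34_000_000, 1000))  # buffer bounded at 1,000 rows
```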

You can also fix this issue by using AWS Glue dynamic frames instead. By default, dynamic frames use a fetch size of 1,000 rows, which is typically a sufficient value. As a result, the executor does not take more than 7 percent of its total memory. The AWS Glue job finishes in less than two minutes with only a single executor. While using AWS Glue dynamic frames is the recommended approach, it is also possible to set the fetch size using the Apache Spark `fetchsize` property. See the [Spark SQL, DataFrames and Datasets Guide](https://spark.apache.org/docs/2.2.0/sql-programming-guide.html#jdbc-to-other-databases).

```
val (url, database, tableName) = ("jdbc_url", "db_name", "table_name")
// sourceJson holds the JDBC connection options (url, dbtable, user, password)
val source = glueContext.getSource("mysql", sourceJson)
val df = source.getDynamicFrame()
// Write the dynamic frame to Amazon S3 in Parquet format
glueContext.getSinkWithFormat(
    connectionType = "s3",
    options = JsonOptions(Map("path" -> output_path)),
    format = "parquet"
  ).writeDynamicFrame(df)
```

**Normal profiled metrics:** The [executor memory](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.jvm.heap.usage) with AWS Glue dynamic frames never exceeds the safe threshold, as shown in the following image. It streams in rows from the database and caches only 1,000 rows in the JDBC driver at any point in time. An out-of-memory exception does not occur.

![\[AWS Glue console showing executor memory below the safe threshold.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-oom-2-memoryprofile-fixed.png)


# Debugging demanding stages and straggler tasks
<a name="monitor-profile-debug-straggler"></a>

You can use AWS Glue job profiling to identify demanding stages and straggler tasks in your extract, transform, and load (ETL) jobs. A straggler task takes much longer than the rest of the tasks in a stage of an AWS Glue job. As a result, the stage takes longer to complete, which also delays the total execution time of the job.
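
The effect of a straggler can be illustrated with toy numbers: a stage finishes only when its slowest task does, so one slow task dominates the stage duration.

```python
# Hypothetical task durations (minutes): four normal tasks, one straggler.
task_minutes = [2, 2, 2, 2, 55]

# The stage completes only when the last task does.
stage_minutes = max(task_minutes)
print(stage_minutes)  # the straggler alone sets the stage duration
```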

## Coalescing small input files into larger output files
<a name="monitor-profile-debug-straggler-scenario-1"></a>

A straggler task can occur when work is distributed non-uniformly across tasks, or when data skew causes one task to process more data than the others.

You can profile the following code—a common pattern in Apache Spark—to coalesce a large number of small files into larger output files. For this example, the input dataset is 32 GB of JSON Gzip compressed files. The output dataset has roughly 190 GB of uncompressed JSON files. 

The profiled code is as follows:

```
datasource0 = spark.read.format("json").load("s3://input_path")
df = datasource0.coalesce(1)
df.write.format("json").save(output_path)
```

### Visualize the profiled metrics on the AWS Glue console
<a name="monitor-debug-straggler-visualize"></a>

You can profile your job to examine four different sets of metrics:
+ ETL data movement
+ Data shuffle across executors
+ Job execution
+ Memory profile

**ETL data movement**: In the **ETL Data Movement** profile, the bytes are [read](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.read_bytes) fairly quickly by all the executors in the first stage that completes within the first six minutes. However, the total job execution time is around one hour, mostly consisting of the data [writes](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.write_bytes).

![\[Graph showing the ETL Data Movement profile.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-1.png)


**Data shuffle across executors:** The number of bytes [read](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.shuffleLocalBytesRead) and [written](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.shuffleBytesWritten) during shuffling also shows a spike before Stage 2 ends, as indicated by the **Job Execution** and **Data Shuffle** metrics. After the data shuffles from all executors, the reads and writes proceed from executor number 3 only.

![\[The metrics for data shuffle across the executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-2.png)


**Job execution:** As shown in the following graph, all other executors are idle and are eventually relinquished by 10:09. At that point, the total number of executors decreases to only one. This clearly shows that executor number 3 is running the straggler task, which takes the longest execution time and contributes most of the job execution time.

![\[The execution metrics for the active executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-3.png)


**Memory profile:** After the first two stages, only [executor number 3](monitoring-awsglue-with-cloudwatch-metrics.md#glue.executorId.jvm.heap.used) is actively consuming memory to process the data. The remaining executors are simply idle or have been relinquished shortly after the completion of the first two stages. 

![\[The metrics for the memory profile after the first two stages.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-4.png)


### Fix straggling executors using grouping
<a name="monitor-debug-straggler-fix"></a>

You can avoid straggling executors by using the *grouping* feature in AWS Glue. Use grouping to distribute the data uniformly across all the executors and coalesce files into larger files using all the available executors on the cluster. For more information, see [Reading input files in larger groups](grouping-input-files.md).

To check the ETL data movements in the AWS Glue job, profile the following code with grouping enabled:

```
df = glueContext.create_dynamic_frame_from_options(
    "s3",
    {"paths": ["s3://input_path"], "recurse": True, "groupFiles": "inPartition"},
    format="json")
datasink = glueContext.write_dynamic_frame.from_options(
    frame=df,
    connection_type="s3",
    connection_options={"path": output_path},
    format="json",
    transformation_ctx="datasink4")
```

**ETL data movement:** The data writes are now streamed in parallel with the data reads throughout the job execution time. As a result, the job finishes within eight minutes—much faster than previously.

![\[The ETL data movements showing the issue is fixed.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-5.png)


**Data shuffle across executors:** As the input files are coalesced during the reads using the grouping feature, there is no costly data shuffle after the data reads.

![\[The data shuffle metrics showing the issue is fixed.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-6.png)


**Job execution:** The job execution metrics show that the total number of active executors running and processing data remains fairly constant. There is no single straggler in the job. All executors are active and are not relinquished until the completion of the job. Because there is no intermediate shuffle of data across the executors, there is only a single stage in the job.

![\[The metrics for the Job Execution widget showing that there are no stragglers in the job.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-7.png)


**Memory profile:** The metrics show the [active memory consumption](monitoring-awsglue-with-cloudwatch-metrics.md#glue.executorId.jvm.heap.used) across all executors—reconfirming that there is activity across all executors. As data is streamed in and written out in parallel, the total memory footprint of all executors is roughly uniform and well below the safe threshold for all executors.

![\[The memory profile metrics showing active memory consumption across all executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-straggler-8.png)


# Monitoring the progress of multiple jobs
<a name="monitor-debug-multiple"></a>

You can profile multiple AWS Glue jobs together and monitor the flow of data between them. This is a common workflow pattern, and requires monitoring for individual job progress, data processing backlog, data reprocessing, and job bookmarks.

**Topics**
+ [Profiled code](#monitor-debug-multiple-profile)
+ [Visualize the profiled metrics on the AWS Glue console](#monitor-debug-multiple-visualize)
+ [Fix the processing of files](#monitor-debug-multiple-fix)

## Profiled code
<a name="monitor-debug-multiple-profile"></a>

In this workflow, you have two jobs: an Input job and an Output job. The Input job is scheduled to run every 30 minutes using a periodic trigger. The Output job is scheduled to run after each successful run of the Input job. These scheduled jobs are controlled using job triggers.

![\[Console screenshot showing the job triggers controlling the scheduling of the Input and Output jobs.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-1.png)


**Input job**: This job reads in data from an Amazon Simple Storage Service (Amazon S3) location, transforms it using `ApplyMapping`, and writes it to a staging Amazon S3 location. The following code is profiled code for the Input job:

```
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options = {"paths": ["s3://input_path"], "useS3ListImplementation":True,"recurse":True}, format="json")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [map_spec])
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": staging_path, "compression": "gzip"}, format = "json")
```

**Output job**: This job reads the output of the Input job from the staging location in Amazon S3, transforms it again, and writes it to a destination:

```
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options = {"paths": [staging_path], "useS3ListImplementation":True,"recurse":True}, format="json")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [map_spec])
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": output_path}, format = "json")
```

## Visualize the profiled metrics on the AWS Glue console
<a name="monitor-debug-multiple-visualize"></a>

The following dashboard superimposes the Amazon S3 bytes written metric from the Input job onto the Amazon S3 bytes read metric for the Output job on the same timeline. The timeline shows different job runs of the Input and Output jobs. The Input job (shown in red) starts every 30 minutes. The Output job (shown in brown) starts at the completion of the Input job, with a Max Concurrency of 1. 

![\[Graph showing the data read and written.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-4.png)


In this example, [job bookmarks](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html) are not enabled. No transformation contexts are used to enable job bookmarks in the script code. 

**Job History**: The Input and Output jobs have multiple runs, as shown on the **History** tab, starting from 12:00 PM.

The Input job on the AWS Glue console looks like this:

![\[Console screenshot showing the History tab of the Input job.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-2.png)


The following image shows the Output job:

![\[Console screenshot showing the History tab of the Output job.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-3.png)


**First job runs**: As shown in the following Data Bytes Read and Written graph, the first job runs of the Input and Output jobs between 12:00 and 12:30 show roughly the same area under the curves. Those areas represent the Amazon S3 bytes written by the Input job and the Amazon S3 bytes read by the Output job. This is also confirmed by the ratio of Amazon S3 bytes read to bytes written (summed over 30 minutes, the job trigger frequency for the Input job). The data point for the ratio for the Input job run that started at 12:00 PM is also 1.

The following graph shows the data flow ratio across all the job runs:

![\[Graph showing the data flow ratio: bytes written and bytes read.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-5.png)


**Second job runs**: In the second job run, there is a clear difference in the number of bytes read by the Output job compared to the number of bytes written by the Input job. (Compare the area under the curve across the two job runs for the Output job, or compare the areas in the second run of the Input and Output jobs.) The ratio of the bytes read and written shows that the Output Job read about 2.5x the data written by the Input job in the second span of 30 minutes from 12:30 to 13:00. This is because the Output Job reprocessed the output of the first job run of the Input job because job bookmarks were not enabled. A ratio above 1 shows that there is an additional backlog of data that was processed by the Output job.

**Third job runs**: The Input job is fairly consistent in terms of the number of bytes written (see the area under the red curves). However, the third job run of the Input job ran longer than expected (see the long tail of the red curve). As a result, the third job run of the Output job started late. The third job run processed only a fraction of the data accumulated in the staging location in the remaining 30 minutes between 13:00 and 13:30. The ratio of the bytes flow shows that it only processed 0.83 of data written by the third job run of the Input job (see the ratio at 13:00).

**Overlap of Input and Output jobs**: The fourth job run of the Input job started at 13:30 as per the schedule, before the third job run of the Output job finished. There is a partial overlap between these two job runs. However, the third job run of the Output job captures only the files that it listed in the staging location of Amazon S3 when it began around 13:17. This consists of all data output from the first job runs of the Input job. The actual ratio at 13:30 is around 2.75. The third job run of the Output job processed about 2.75x of data written by the fourth job run of the Input job from 13:30 to 14:00.

As these images show, the Output job is reprocessing data from the staging location from all prior job runs of the Input job. As a result, the fourth job run for the Output job is the longest and overlaps with the entire fifth job run of the Input job.
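
The ratio tracked across these job runs is simply the Amazon S3 bytes the Output job reads divided by the bytes the Input job writes in the same window. A tiny sketch with hypothetical byte counts:

```python
def flow_ratio(bytes_read_by_output: float, bytes_written_by_input: float) -> float:
    """Data flow ratio for a 30-minute window; a value above 1 indicates
    the Output job is also processing a backlog from earlier runs."""
    return bytes_read_by_output / bytes_written_by_input

print(flow_ratio(250, 100))  # 2.5: reprocessing a backlog, as in the second run
print(flow_ratio(100, 100))  # 1.0: only new data is processed
```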

## Fix the processing of files
<a name="monitor-debug-multiple-fix"></a>

You should ensure that the Output job processes only the files that haven't been processed by previous job runs of the Output job. To do this, enable job bookmarks and set the transformation context in the Output job, as follows:

```
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options = {"paths": [staging_path], "useS3ListImplementation":True,"recurse":True}, format="json", transformation_ctx = "bookmark_ctx")
```

With job bookmarks enabled, the Output job doesn't reprocess the data in the staging location from all the previous job runs of the Input job. In the following image showing the data read and written, the area under the brown curve is fairly consistent and similar to that under the red curves. 

![\[Graph showing the data read and written as red and brown lines.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-6.png)


The ratios of byte flow also remain roughly close to 1 because there is no additional data processed.

![\[Graph showing the data flow ratio: bytes written and bytes read\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-multiple-7.png)


A job run for the Output job starts and captures the files in the staging location before the next Input job run starts putting more data into the staging location. As long as it continues to do this, it processes only the files captured from the previous Input job run, and the ratio stays close to 1.


Suppose that the Input job takes longer than expected, and as a result, the Output job captures files in the staging location from two Input job runs. The ratio is then higher than 1 for that Output job run. However, the following job runs of the Output job don't process any files that are already processed by the previous job runs of the Output job.

# Monitoring for DPU capacity planning
<a name="monitor-debug-capacity"></a>

You can use job metrics in AWS Glue to estimate the number of data processing units (DPUs) that can be used to scale out an AWS Glue job.

**Note**  
This page is only applicable to AWS Glue versions 0.9 and 1.0. Later versions of AWS Glue contain cost-saving features that introduce additional considerations when capacity planning. 

**Topics**
+ [Profiled code](#monitor-debug-capacity-profile)
+ [Visualize the profiled metrics on the AWS Glue console](#monitor-debug-capacity-visualize)
+ [Determine the optimal DPU capacity](#monitor-debug-capacity-fix)

## Profiled code
<a name="monitor-debug-capacity-profile"></a>

The following script reads an Amazon Simple Storage Service (Amazon S3) partition containing 428 gzipped JSON files. The script applies a mapping to change the field names, and converts and writes them to Amazon S3 in Apache Parquet format. You provision 10 DPUs as per the default and run this job. 

```
datasource0 = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options = {"paths": [input_path], "useS3ListImplementation":True,"recurse":True}, format="json")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = map_spec)
datasink2 = glueContext.write_dynamic_frame.from_options(frame = applymapping1, connection_type = "s3", connection_options = {"path": output_path}, format = "parquet")
```

## Visualize the profiled metrics on the AWS Glue console
<a name="monitor-debug-capacity-visualize"></a>

**Job run 1:** This job run shows how to determine whether the cluster has fewer DPUs than the job needs. The job execution functionality in AWS Glue shows the total [number of actively running executors](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.ExecutorAllocationManager.executors.numberAllExecutors), the [number of completed stages](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.aggregate.numCompletedStages), and the [number of maximum needed executors](monitoring-awsglue-with-cloudwatch-metrics.md#glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors).

The number of maximum needed executors is computed by adding the total number of running tasks and pending tasks, and dividing by the tasks per executor. This result is a measure of the total number of executors required to satisfy the current load. 

In contrast, the number of actively running executors measures how many executors are running active Apache Spark tasks. As the job progresses, the maximum needed executors can change and typically goes down towards the end of the job as the pending task queue diminishes.
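This computation can be sketched in a few lines of Python. This is an illustrative approximation, not AWS Glue's internal implementation; in practice the running and pending task counts come from the Spark scheduler:

```python
import math

def max_needed_executors(running_tasks, pending_tasks, tasks_per_executor):
    """Estimate the executors required to satisfy the current task load."""
    return math.ceil((running_tasks + pending_tasks) / tasks_per_executor)

# 428 input files, four tasks per executor, all tasks still pending
# at the start of the job:
print(max_needed_executors(running_tasks=0, pending_tasks=428, tasks_per_executor=4))  # 107
```

This matches the value of 107 maximum needed executors observed at the start of the profiled job run below.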

The horizontal red line in the following graph shows the number of maximum allocated executors, which depends on the number of DPUs that you allocate for the job. In this case, you allocate 10 DPUs for the job run. One DPU is reserved for management. Nine DPUs run two executors each and one executor is reserved for the Spark driver. The Spark driver runs inside the primary application. So, the number of maximum allocated executors is 2 \* 9 - 1 = 17 executors.

![\[The job metrics showing active executors and maximum needed executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-capacity-1.png)


As the graph shows, the number of maximum needed executors starts at 107 at the beginning of the job, whereas the number of active executors remains 17. This is the same as the number of maximum allocated executors with 10 DPUs. The ratio between the maximum needed executors and maximum allocated executors (adding 1 to both for the Spark driver) gives you the under-provisioning factor: 108/18 = 6x. You can provision 6 (under-provisioning ratio) \* 9 (current DPU capacity - 1) \+ 1 DPU = 55 DPUs to scale out the job to run it with maximum parallelism and finish faster. 
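The executor and DPU arithmetic in this example can be captured in a short Python sketch. It hard-codes the two-executors-per-DPU layout described above, which applies to this AWS Glue version 0.9/1.0 Standard configuration; other worker types allocate executors differently:

```python
import math

def max_allocated_executors(dpus):
    """One DPU is reserved for management; each remaining DPU runs two
    executors, and one executor is reserved for the Spark driver."""
    return 2 * (dpus - 1) - 1

def recommended_dpus(max_needed, current_dpus):
    """Under-provisioning factor (driver added to both counts) times the
    current worker DPUs, plus the management DPU."""
    factor = math.ceil((max_needed + 1) / (max_allocated_executors(current_dpus) + 1))
    return factor * (current_dpus - 1) + 1

print(max_allocated_executors(10))   # 17 executors with 10 DPUs
print(recommended_dpus(107, 10))     # 55 DPUs for maximum parallelism
print(max_allocated_executors(55))   # 107 executors with 55 DPUs
```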

The AWS Glue console displays the detailed job metrics as a static line representing the original number of maximum allocated executors. For the job metrics, the console computes the maximum allocated executors from the job definition. By contrast, for detailed job run metrics, the console computes the maximum allocated executors from the job run configuration, specifically the DPUs allocated for the job run. To view metrics for an individual job run, select the job run and choose **View run metrics**.

![\[The job metrics showing ETL data movement.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-capacity-2.png)


Looking at the Amazon S3 bytes [read](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.read_bytes) and [written](monitoring-awsglue-with-cloudwatch-metrics.md#glue.ALL.s3.filesystem.write_bytes), notice that the job spends all six minutes streaming in data from Amazon S3 and writing it out in parallel. All the cores on the allocated DPUs are reading and writing to Amazon S3. The maximum number of needed executors being 107 also matches the number of files in the input Amazon S3 path—428. Each executor can launch four Spark tasks to process four input files (JSON gzipped).

## Determine the optimal DPU capacity
<a name="monitor-debug-capacity-fix"></a>

Based on the results of the previous job run, you can increase the total number of allocated DPUs to 55, and see how the job performs. The job finishes in less than three minutes—half the time it required previously. The job scale-out is not linear in this case because it is a short running job. Jobs with long-lived tasks or a large number of tasks (a large number of max needed executors) benefit from a close-to-linear DPU scale-out performance speedup.

![\[Graph showing increasing the total number of allocated DPUs\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-capacity-3.png)


As the above image shows, the total number of active executors reaches the maximum allocated—107 executors. Similarly, the maximum needed executors is never above the maximum allocated executors. The maximum needed executors number is computed from the actively running and pending task counts, so it might be smaller than the number of active executors. This is because there can be executors that are partially or completely idle for a short period of time and are not yet decommissioned.

![\[Graph showing the total number of active executors reaching the maximum allocated.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-capacity-4.png)


This job run uses 6x more executors to read and write from Amazon S3 in parallel. As a result, this job run uses more Amazon S3 bandwidth for both reads and writes, and finishes faster. 

### Identify overprovisioned DPUs
<a name="monitor-debug-capacity-over"></a>

Next, you can determine whether scaling out the job to 100 DPUs (99 \* 2 = 198 executors) helps the job scale any further. As the following graph shows, the job still takes three minutes to finish. Similarly, the job does not scale out beyond 107 executors (the 55 DPU configuration), and the remaining 91 executors are overprovisioned and not used at all. This shows that increasing the number of DPUs might not always improve performance, as is evident from the maximum needed executors.

![\[Graph showing that job performance does not always increase by increasing the number of DPUs.\]](http://docs.aws.amazon.com/glue/latest/dg/images/monitor-debug-capacity-5.png)


### Compare time differences
<a name="monitor-debug-capacity-time"></a>

The three job runs shown in the following table summarize the job execution times for 10 DPUs, 55 DPUs, and 100 DPUs. You can find the DPU capacity to improve the job execution time using the estimates you established by monitoring the first job run.


| Job ID | Number of DPUs | Execution time | 
| --- | --- | --- | 
| jr\_c894524c8ef5048a4d9... | 10 | 6 min. | 
| jr\_1a466cf2575e7ffe6856... | 55 | 3 min. | 
| jr\_34fa1ed4c6aa9ff0a814... | 100 | 3 min. | 
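One way to weigh these runs against each other is DPU-hours (DPUs multiplied by execution time), a rough proxy for cost. The figures below use the table's numbers and are illustrative rather than billing-exact:

```python
# (DPUs, minutes) for the three profiled job runs above
runs = [(10, 6), (55, 3), (100, 3)]

for dpus, minutes in runs:
    print(f"{dpus} DPUs x {minutes} min = {dpus * minutes / 60:.2f} DPU-hours")
# 10 DPUs x 6 min = 1.00 DPU-hours
# 55 DPUs x 3 min = 2.75 DPU-hours
# 100 DPUs x 3 min = 5.00 DPU-hours
```

Moving from 10 to 55 DPUs halves the runtime for roughly 2.75x the DPU-hours, while moving to 100 DPUs adds cost without reducing runtime at all.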

# Generative AI troubleshooting for Apache Spark in AWS Glue
<a name="troubleshoot-spark"></a>

 Generative AI Troubleshooting for Apache Spark jobs in AWS Glue is a new capability that helps data engineers and scientists diagnose and fix issues in their Spark applications with ease. Utilizing machine learning and generative AI technologies, this feature analyzes issues in Spark jobs and provides detailed root cause analysis along with actionable recommendations to resolve those issues. The generative AI troubleshooting for Apache Spark is available for jobs running on AWS Glue version 4.0 and above. 


|  | 
| --- |
|  Transform your Apache Spark troubleshooting with our AI-powered Troubleshooting Agent, now supporting all major deployment modes including AWS Glue, Amazon EMR-EC2, Amazon EMR-Serverless and Amazon SageMaker AI Notebooks. This powerful agent eliminates complex debugging processes by combining natural language interactions, real-time workload analysis, and smart code recommendations into a seamless experience. For implementation details, refer to [What is Apache Spark Troubleshooting Agent for Amazon EMR](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/spark-troubleshoot.html). View the second demonstration in [Using the Troubleshooting Agent](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/spark-troubleshooting-using-troubleshooting-agent.html) for AWS Glue troubleshooting examples.  | 

## How does Generative AI Troubleshooting for Apache Spark work?
<a name="troubleshoot-spark-how-it-works"></a>

 For your failed Spark jobs, Generative AI Troubleshooting analyzes the job metadata and the precise metrics and logs associated with the error signature of your job to generate a root cause analysis, and recommends specific solutions and best practices to help address job failures. 

## Setting up Generative AI Troubleshooting for Apache Spark for your jobs
<a name="w2aac37c11c12c33c13"></a>

### Configuring IAM permissions
<a name="troubleshoot-spark-iam-permissions"></a>

 Using Generative AI Troubleshooting for Apache Spark with your AWS Glue jobs requires IAM permissions for the APIs that the feature calls. You can grant these permissions by attaching the following custom AWS policy to your IAM identity (such as a user, role, or group). 

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartCompletion",
        "glue:GetCompletion"
      ],
      "Resource": [
        "arn:aws:glue:*:*:completion/*",
        "arn:aws:glue:*:*:job/*"
      ]
    }
  ]
}
```

------

**Note**  
 The following two APIs are used in the IAM policy for enabling this experience through the AWS Glue Studio Console: `StartCompletion` and `GetCompletion`. 

### Assigning permissions
<a name="troubleshoot-spark-assigning-permissions"></a>

 To provide access, add permissions to your users, groups, or roles: 
+  For users and groups in IAM Identity Center: Create a permission set. Follow the instructions in [Create a permission set](https://docs.aws.amazon.com/singlesignon/latest/userguide/howtocreatepermissionset.html) in the IAM Identity Center User Guide. 
+  For users managed in IAM through an identity provider: Create a role for identity federation. Follow the instructions in [Creating a role for a third-party identity provider (federation)](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp.html) in the IAM User Guide. 
+  For IAM users: Create a role that your user can assume. Follow the instructions in [Creating a role for an IAM user](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user.html) in the IAM User Guide. 

## Running troubleshooting analysis from a failed job run
<a name="troubleshoot-spark-run-analysis"></a>

 You can access the troubleshooting feature through multiple paths in the AWS Glue console. Here's how to get started: 

### Option 1: From the Jobs List page
<a name="troubleshoot-spark-from-jobs-list"></a>

1.  Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1.  In the navigation pane, choose **ETL Jobs**. 

1.  Locate your failed job in the jobs list. 

1.  Select the **Runs** tab in the job details section. 

1.  Choose the failed job run that you want to analyze. 

1.  Choose **Troubleshoot with AI** to start the analysis. 

1.  When the troubleshooting analysis is complete, you can view the root-cause analysis and recommendations in the **Troubleshooting analysis** tab at the bottom of the screen. 

![\[The GIF shows an end to end implementation of a failed run and the troubleshoot with AI feature running.\]](http://docs.aws.amazon.com/glue/latest/dg/images/troubleshoot_spark_option_1_jobs_list.gif)


### Option 2: Using the Job Run Monitoring page
<a name="troubleshoot-spark-job-run-monitoring-page"></a>

1.  Navigate to the **Job run monitoring** page. 

1.  Locate your failed job run. 

1.  Choose the **Actions** drop-down menu. 

1.  Choose **Troubleshoot with AI**. 

![\[The GIF shows an end to end implementation of a failed run and the troubleshoot with AI feature running.\]](http://docs.aws.amazon.com/glue/latest/dg/images/troubleshoot_spark_option_2_job_monitoring.gif)


### Option 3: From the Job Run Details page
<a name="troubleshoot-spark-job-run-details-page"></a>

1.  Navigate to your failed job run's details page by either clicking **View details** on a failed run from the **Runs** tab or selecting the job run from the **Job run monitoring** page. 

1.  In the job run details page, find the **Troubleshooting analysis** tab. 

## Supported troubleshooting categories
<a name="troubleshoot-spark-supported-troubleshooting-categories"></a>

 This service focuses on four primary categories of issues that data engineers and developers frequently encounter in their Spark applications: 
+  **Resource setup and access errors:** When running Spark applications in AWS Glue, resource setup and access errors are among the most common yet challenging issues to diagnose. These errors often occur when your Spark application attempts to interact with AWS resources but encounters permission issues, missing resources, or configuration problems. 
+  **Spark driver and executor memory issues:** Memory-related errors in Apache Spark jobs can be complex to diagnose and resolve. These errors often manifest when your data processing requirements exceed the available memory resources, either on the driver node or executor nodes. 
+  **Spark disk capacity issues:** Storage-related errors in AWS Glue Spark jobs often emerge during shuffle operations, data spilling, or when dealing with large-scale data transformations. These errors can be particularly tricky because they might not manifest until your job has been running for a while, potentially wasting valuable compute time and resources. 
+  **Query execution errors:** Query failures in Spark SQL and DataFrame operations can be difficult to troubleshoot because error messages may not clearly point to the root cause, and queries that work fine with small datasets can suddenly fail at scale. These errors become even more challenging when they occur deep within complex transformation pipelines, where the actual issue may stem from data quality problems in earlier stages rather than the query logic itself. 

**Note**  
 Before implementing any suggested changes in your production environment, review the suggested changes thoroughly. The service provides recommendations based on patterns and best practices, but your specific use case might require additional considerations. 

## Supported regions
<a name="troubleshoot-spark-supported-regions"></a>

Generative AI troubleshooting for Apache Spark is available in the following regions:
+ **Africa**: Cape Town (af-south-1)
+ **Asia Pacific**: Hong Kong (ap-east-1), Tokyo (ap-northeast-1), Seoul (ap-northeast-2), Osaka (ap-northeast-3), Mumbai (ap-south-1), Singapore (ap-southeast-1), Sydney (ap-southeast-2), and Jakarta (ap-southeast-3)
+ **Europe**: Frankfurt (eu-central-1), Stockholm (eu-north-1), Milan (eu-south-1), Ireland (eu-west-1), London (eu-west-2), and Paris (eu-west-3)
+ **Middle East**: Bahrain (me-south-1) and UAE (me-central-1)
+ **North America**: Canada (ca-central-1)
+ **South America**: São Paulo (sa-east-1)
+ **United States**: North Virginia (us-east-1), Ohio (us-east-2), North California (us-west-1), and Oregon (us-west-2)

# Using materialized views with AWS Glue
<a name="materialized-views"></a>

AWS Glue version 5.1 and later supports creating and managing Apache Iceberg materialized views in the AWS Glue Data Catalog. A materialized view is a managed table that stores the precomputed result of a SQL query in Apache Iceberg format and incrementally updates as the underlying source tables change. You can use materialized views to simplify data transformation pipelines and accelerate query performance for complex analytical workloads.

When you create a materialized view using Spark in AWS Glue, the view definition and metadata are stored in the AWS Glue Data Catalog. The precomputed results are stored as Apache Iceberg tables in Amazon S3 Tables buckets or Amazon S3 general purpose buckets within your account. The AWS Glue Data Catalog automatically monitors source tables and refreshes materialized views using managed compute infrastructure.

**Topics**
+ [How materialized views work with AWS Glue](#materialized-views-how-they-work)
+ [Prerequisites](#materialized-views-prerequisites)
+ [Configuring Spark to use materialized views](#materialized-views-configuring-spark)
+ [Creating materialized views](#materialized-views-creating)
+ [Querying materialized views](#materialized-views-querying)
+ [Refreshing materialized views](#materialized-views-refreshing)
+ [Managing materialized views](#materialized-views-managing)
+ [Permissions for materialized views](#materialized-views-permissions)
+ [Monitoring materialized view operations](#materialized-views-monitoring)
+ [Example: Complete workflow](#materialized-views-complete-workflow)
+ [Considerations and limitations](#materialized-views-considerations-limitations)

## How materialized views work with AWS Glue
<a name="materialized-views-how-they-work"></a>

Materialized views integrate with AWS Glue through Apache Spark's Iceberg support in AWS Glue jobs and AWS Glue Studio notebooks. When you configure your Spark session to use the AWS Glue Data Catalog, you can create materialized views using standard SQL syntax. The Spark optimizer can automatically rewrite queries to use materialized views when they provide better performance, eliminating the need to manually modify application code.

The AWS Glue Data Catalog handles all operational aspects of materialized view maintenance, including:
+ Detecting changes in source tables using Apache Iceberg's metadata layer
+ Scheduling and executing refresh operations using managed Spark compute
+ Determining whether to perform full or incremental refresh based on the data changes
+ Storing precomputed results in Apache Iceberg format for multi-engine access

You can query materialized views from AWS Glue using the same Spark SQL interfaces you use for regular tables. The precomputed data is also accessible from other services including Amazon Athena and Amazon Redshift.

## Prerequisites
<a name="materialized-views-prerequisites"></a>

To use materialized views with AWS Glue, you need:
+ An AWS account
+ AWS Glue version 5.1 or later
+ Source tables in Apache Iceberg format registered in the AWS Glue Data Catalog
+ AWS Lake Formation permissions configured for source tables and target databases
+ An S3 Tables bucket or S3 general purpose bucket registered with AWS Lake Formation for storing materialized view data
+ An IAM role with permissions to access AWS Glue Data Catalog and Amazon S3

## Configuring Spark to use materialized views
<a name="materialized-views-configuring-spark"></a>

To create and manage materialized views in AWS Glue, configure your Spark session with the required Iceberg extensions and catalog settings. The configuration method varies depending on whether you're using AWS Glue jobs or AWS Glue Studio notebooks.

### Configuring AWS Glue jobs
<a name="materialized-views-configuring-glue-jobs"></a>

When creating or updating an AWS Glue job, add the following configuration parameters as job parameters:

#### For S3 Tables buckets
<a name="materialized-views-s3-tables-buckets"></a>

```
job = glue.create_job(
    Name='materialized-view-job',
    Role='arn:aws:iam::111122223333:role/GlueServiceRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://amzn-s3-demo-bucket/scripts/mv-script.py',
        'PythonVersion': '3'
    },
    DefaultArguments={
        '--enable-glue-datacatalog': 'true',
        '--conf': 'spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions '
                  '--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog '
                  '--conf spark.sql.catalog.glue_catalog.type=glue '
                  '--conf spark.sql.catalog.glue_catalog.warehouse=s3://amzn-s3-demo-bucket/warehouse '
                  '--conf spark.sql.catalog.glue_catalog.glue.region=us-east-1 '
                  '--conf spark.sql.catalog.glue_catalog.glue.id=111122223333 '
                  '--conf spark.sql.catalog.glue_catalog.glue.account-id=111122223333 '
                  '--conf spark.sql.catalog.glue_catalog.glue.lakeformation-enabled=true '
                  '--conf spark.sql.catalog.s3t_catalog=org.apache.iceberg.spark.SparkCatalog '
                  '--conf spark.sql.catalog.s3t_catalog.type=glue '
                  '--conf spark.sql.catalog.s3t_catalog.glue.id=111122223333:s3tablescatalog/my-table-bucket '
                  '--conf spark.sql.catalog.s3t_catalog.glue.account-id=111122223333 '
                  '--conf spark.sql.catalog.s3t_catalog.glue.lakeformation-enabled=true '
                  '--conf spark.sql.catalog.s3t_catalog.warehouse=s3://amzn-s3-demo-bucket/mv-warehouse '
                  '--conf spark.sql.catalog.s3t_catalog.glue.region=us-east-1 '
                  '--conf spark.sql.defaultCatalog=s3t_catalog '
                  '--conf spark.sql.optimizer.answerQueriesWithMVs.enabled=true '
                  '--conf spark.sql.materializedViews.metadataCache.enabled=true'
    },
    GlueVersion='5.1'
)
```

#### For S3 general purpose buckets
<a name="materialized-views-s3-general-purpose-buckets"></a>

```
job = glue.create_job(
    Name='materialized-view-job',
    Role='arn:aws:iam::111122223333:role/GlueServiceRole',
    Command={
        'Name': 'glueetl',
        'ScriptLocation': 's3://amzn-s3-demo-bucket/scripts/mv-script.py',
        'PythonVersion': '3'
    },
    DefaultArguments={
        '--enable-glue-datacatalog': 'true',
        '--conf': 'spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions '
                  '--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog '
                  '--conf spark.sql.catalog.glue_catalog.type=glue '
                  '--conf spark.sql.catalog.glue_catalog.warehouse=s3://amzn-s3-demo-bucket/warehouse '
                  '--conf spark.sql.catalog.glue_catalog.glue.region=us-east-1 '
                  '--conf spark.sql.catalog.glue_catalog.glue.id=111122223333 '
                  '--conf spark.sql.catalog.glue_catalog.glue.account-id=111122223333 '
                  '--conf spark.sql.catalog.glue_catalog.glue.lakeformation-enabled=true '
                  '--conf spark.sql.defaultCatalog=glue_catalog '
                  '--conf spark.sql.optimizer.answerQueriesWithMVs.enabled=true '
                  '--conf spark.sql.materializedViews.metadataCache.enabled=true'
    },
    GlueVersion='5.1'
)
```
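Because the entire `--conf` value is a single space-joined string passed through `DefaultArguments`, building it from a dictionary can reduce the quoting and concatenation mistakes that long string literals invite. The helper below is a convenience sketch, not part of any AWS SDK:

```python
def build_conf_arg(confs):
    """Join Spark conf key/value pairs into the single string AWS Glue
    expects for the '--conf' default argument. The first pair is passed
    bare; each subsequent pair is prefixed with '--conf'."""
    items = [f"{key}={value}" for key, value in confs.items()]
    return items[0] + "".join(f" --conf {item}" for item in items[1:])

conf_arg = build_conf_arg({
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.defaultCatalog": "glue_catalog",
})
# Pass as: DefaultArguments={'--enable-glue-datacatalog': 'true', '--conf': conf_arg}
```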

### Configuring AWS Glue Studio notebooks
<a name="materialized-views-configuring-glue-studio-notebooks"></a>

In AWS Glue Studio notebooks, configure your Spark session using the %%configure magic command at the beginning of your notebook:

```
%%configure
{
    "conf": {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.glue_catalog.type": "glue",
        "spark.sql.catalog.glue_catalog.warehouse": "s3://amzn-s3-demo-bucket/warehouse",
        "spark.sql.catalog.glue_catalog.glue.region": "us-east-1",
        "spark.sql.catalog.glue_catalog.glue.id": "111122223333",
        "spark.sql.catalog.glue_catalog.glue.account-id": "111122223333",
        "spark.sql.catalog.glue_catalog.glue.lakeformation-enabled": "true",
        "spark.sql.defaultCatalog": "glue_catalog",
        "spark.sql.optimizer.answerQueriesWithMVs.enabled": "true",
        "spark.sql.materializedViews.metadataCache.enabled": "true"
    }
}
```

### Enabling incremental refresh
<a name="materialized-views-enabling-incremental-refresh"></a>

To enable incremental refresh optimization, add the following configuration properties to your job parameters or notebook configuration:

```
--conf spark.sql.optimizer.incrementalMVRefresh.enabled=true
--conf spark.sql.optimizer.incrementalMVRefresh.deltaThresholdCheckEnabled=false
```

### Configuration parameters
<a name="materialized-views-configuration-parameters"></a>

The following configuration parameters control materialized view behavior:
+ `spark.sql.extensions` – Enables Iceberg Spark session extensions required for materialized view support.
+ `spark.sql.optimizer.answerQueriesWithMVs.enabled` – Enables automatic query rewrite to use materialized views. Set to true to activate this optimization.
+ `spark.sql.materializedViews.metadataCache.enabled` – Enables caching of materialized view metadata for query optimization. Set to true to improve query rewrite performance.
+ `spark.sql.optimizer.incrementalMVRefresh.enabled` – Enables incremental refresh optimization. Set to true to process only changed data during refresh operations.
+ `spark.sql.optimizer.answerQueriesWithMVs.decimalAggregateCheckEnabled` – Controls validation of decimal aggregate operations in query rewrite. Set to false to disable certain decimal overflow checks.

## Creating materialized views
<a name="materialized-views-creating"></a>

You create materialized views using the CREATE MATERIALIZED VIEW SQL statement in AWS Glue jobs or notebooks. The view definition specifies the transformation logic as a SQL query that references one or more source tables.

### Creating a basic materialized view in AWS Glue jobs
<a name="materialized-views-creating-basic-glue-jobs"></a>

The following example demonstrates creating a materialized view in an AWS Glue job script. Use fully qualified table names with the three-part naming convention in the view definition:

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Create materialized view
spark.sql("""
    CREATE MATERIALIZED VIEW customer_orders
    AS 
    SELECT 
        customer_name, 
        COUNT(*) as order_count, 
        SUM(amount) as total_amount 
    FROM glue_catalog.sales.orders
    GROUP BY customer_name
""")
```

### Creating a materialized view with automatic refresh
<a name="materialized-views-creating-automatic-refresh"></a>

To configure automatic refresh, specify a refresh schedule when creating the view. Use fully qualified table names with the three-part naming convention in the view definition:

```
spark.sql("""
    CREATE MATERIALIZED VIEW customer_orders
    SCHEDULE REFRESH EVERY 1 HOUR
    AS 
    SELECT 
        customer_name, 
        COUNT(*) as order_count, 
        SUM(amount) as total_amount 
    FROM glue_catalog.sales.orders
    GROUP BY customer_name
""")
```

### Creating a materialized view with cross-catalog references
<a name="materialized-views-creating-cross-catalog"></a>

When your source tables are in a different catalog than your materialized view, use fully qualified table names with three-part naming convention in both view name and view definition:

```
spark.sql("""
    CREATE MATERIALIZED VIEW s3t_catalog.analytics.customer_summary
    AS 
    SELECT 
        customer_name, 
        COUNT(*) as order_count, 
        SUM(amount) as total_amount 
    FROM glue_catalog.sales.orders
    GROUP BY customer_name
""")
```

### Creating materialized views in AWS Glue Studio notebooks
<a name="materialized-views-creating-glue-studio-notebooks"></a>

In AWS Glue Studio notebooks, you can use the %%sql magic command to create materialized views. Use fully qualified table names with the three-part naming convention in the view definition:

```
%%sql
CREATE MATERIALIZED VIEW customer_orders
AS 
SELECT 
    customer_name, 
    COUNT(*) as order_count, 
    SUM(amount) as total_amount 
FROM glue_catalog.sales.orders
GROUP BY customer_name
```

## Querying materialized views
<a name="materialized-views-querying"></a>

After creating a materialized view, you can query it like any other table using standard SQL SELECT statements in your AWS Glue jobs or notebooks.

### Querying in AWS Glue jobs
<a name="materialized-views-querying-glue-jobs"></a>

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Query materialized view
result = spark.sql("SELECT * FROM customer_orders")
result.show()
```

### Querying in AWS Glue Studio notebooks
<a name="materialized-views-querying-glue-studio-notebooks"></a>

```
%%sql
SELECT * FROM customer_orders
```

### Automatic query rewrite
<a name="materialized-views-automatic-query-rewrite"></a>

When automatic query rewrite is enabled, the Spark optimizer analyzes your queries and automatically uses materialized views when they can improve performance. For example, if you execute the following query:

```
result = spark.sql("""
    SELECT 
        customer_name, 
        COUNT(*) as order_count, 
        SUM(amount) as total_amount 
    FROM orders
    GROUP BY customer_name
""")
```

The Spark optimizer automatically rewrites this query to use the customer\_orders materialized view instead of processing the base orders table, provided the materialized view is current.

### Verifying automatic query rewrite
<a name="materialized-views-verifying-automatic-query-rewrite"></a>

To verify whether a query uses automatic query rewrite, use the EXPLAIN EXTENDED command:

```
spark.sql("""
    EXPLAIN EXTENDED
    SELECT customer_name, COUNT(*) as order_count, SUM(amount) as total_amount 
    FROM orders
    GROUP BY customer_name
""").show(truncate=False)
```

In the execution plan, look for the materialized view name in the BatchScan operation. If the plan shows BatchScan glue_catalog.analytics.customer_orders instead of BatchScan glue_catalog.sales.orders, the query has been automatically rewritten to use the materialized view.

Note that automatic query rewrite requires time for the Spark metadata cache to populate after creating a materialized view. This process typically completes within 30 seconds.
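If you want to script this wait, you can poll the query plan until the view name appears. The following is a minimal sketch, not part of the AWS Glue API: it assumes a `spark` session configured as shown earlier in this section, and it matches on plan text, which can vary between Spark versions.

```python
import time

def plan_uses_view(plan_text, view_name):
    """Heuristic check: the plan scans the materialized view rather than the base table."""
    return "BatchScan" in plan_text and view_name in plan_text

def wait_for_rewrite(spark, query, view_name, timeout_s=60):
    """Poll EXPLAIN EXTENDED output until the rewrite appears or the timeout elapses."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        plan = spark.sql("EXPLAIN EXTENDED " + query).collect()[0][0]
        if plan_uses_view(plan, view_name):
            return True
        time.sleep(5)  # the metadata cache typically populates within 30 seconds
    return False
```

You would call `wait_for_rewrite(spark, "SELECT customer_name, COUNT(*) FROM orders GROUP BY customer_name", "customer_orders")` after creating the view, before relying on the rewrite in a production run.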

## Refreshing materialized views
<a name="materialized-views-refreshing"></a>

You can refresh materialized views using two methods: full refresh or incremental refresh. Full refresh recomputes the entire materialized view from all base table data, while incremental refresh processes only the data that has changed since the last refresh.

### Manual full refresh in AWS Glue jobs
<a name="materialized-views-manual-full-refresh-glue-jobs"></a>

To perform a full refresh of a materialized view:

```
spark.sql("REFRESH MATERIALIZED VIEW customer_orders FULL")

# Verify updated results
result = spark.sql("SELECT * FROM customer_orders")
result.show()
```

### Manual incremental refresh in AWS Glue jobs
<a name="materialized-views-manual-incremental-refresh-glue-jobs"></a>

To perform an incremental refresh, ensure incremental refresh is enabled in your Spark session configuration, then execute:

```
spark.sql("REFRESH MATERIALIZED VIEW customer_orders")

# Verify updated results
result = spark.sql("SELECT * FROM customer_orders")
result.show()
```

The AWS Glue Data Catalog automatically determines whether incremental refresh is applicable based on the view definition and the amount of changed data. If incremental refresh is not possible, the operation falls back to full refresh.

### Refreshing in AWS Glue Studio notebooks
<a name="materialized-views-refreshing-glue-studio-notebooks"></a>

In notebooks, use the %%sql magic command:

```
%%sql
REFRESH MATERIALIZED VIEW customer_orders FULL
```

### Verifying incremental refresh execution
<a name="materialized-views-verifying-incremental-refresh"></a>

To confirm that incremental refresh executed successfully, enable debug logging in your AWS Glue job:

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext
import logging

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Enable debug logging
logger = logging.getLogger('org.apache.spark.sql')
logger.setLevel(logging.DEBUG)

# Execute refresh
spark.sql("REFRESH MATERIALIZED VIEW customer_orders")
```

Search for the following message in the AWS Glue job logs:

```
DEBUG RefreshMaterializedViewExec: Executed Incremental Refresh
```

## Managing materialized views
<a name="materialized-views-managing"></a>

AWS Glue provides SQL commands for managing the lifecycle of materialized views in your jobs and notebooks.

### Describing a materialized view
<a name="materialized-views-describing"></a>

To view metadata about a materialized view, including its definition, refresh status, and last refresh timestamp:

```
spark.sql("DESCRIBE EXTENDED customer_orders").show(truncate=False)
```

### Altering a materialized view
<a name="materialized-views-altering"></a>

To modify the refresh schedule of an existing materialized view:

```
spark.sql("""
    ALTER MATERIALIZED VIEW customer_orders 
    ADD SCHEDULE REFRESH EVERY 2 HOURS
""")
```

To remove automatic refresh:

```
spark.sql("""
    ALTER MATERIALIZED VIEW customer_orders 
    DROP SCHEDULE
""")
```

### Dropping a materialized view
<a name="materialized-views-dropping"></a>

To delete a materialized view:

```
spark.sql("DROP MATERIALIZED VIEW customer_orders")
```

This command removes the materialized view definition from the AWS Glue Data Catalog and deletes the underlying Iceberg table data from your S3 bucket.

### Listing materialized views
<a name="materialized-views-listing"></a>

To list all materialized views in a database:

```
spark.sql("SHOW VIEWS FROM analytics").show()
```

## Permissions for materialized views
<a name="materialized-views-permissions"></a>

To create and manage materialized views, you must configure AWS Lake Formation permissions. The IAM role creating the materialized view (the definer role) requires specific permissions on source tables and target databases.

### Required permissions for the definer role
<a name="materialized-views-required-permissions-definer-role"></a>

The definer role must have the following Lake Formation permissions:
+ On source tables – SELECT or ALL permissions without row, column, or cell filters
+ On the target database – CREATE_TABLE permission
+ On the AWS Glue Data Catalog – GetTable and CreateTable API permissions

When you create a materialized view, the definer role's ARN is stored in the view definition. The AWS Glue Data Catalog assumes this role when executing automatic refresh operations. If the definer role loses access to source tables, refresh operations will fail until permissions are restored.

### IAM permissions for AWS Glue jobs
<a name="materialized-views-iam-permissions-glue-jobs"></a>

Your AWS Glue job's IAM role requires the following permissions:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetCatalog",
                "glue:GetCatalogs",
                "glue:GetTable",
                "glue:GetTables",
                "glue:CreateTable",
                "glue:UpdateTable",
                "glue:DeleteTable",
                "glue:GetDatabase",
                "glue:GetDatabases",
                "cloudwatch:PutMetricData"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:*:*:*:/aws-glue/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess"
            ],
            "Resource": "*"
        }
    ]
}
```

The IAM principal that configures automatic refresh for the materialized view must have the iam:PassRole permission on the refresh role.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iam:PassRole"
      ],
      "Resource": [
        "arn:aws:iam::111122223333:role/materialized-view-role-name"
      ]
    }
  ]
}
```

To let AWS Glue automatically refresh the materialized view for you, the role must also have the following trust policy so that the service can assume it.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "glue.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

If the materialized view is stored in an Amazon S3 Tables bucket, you must also add the following permission to the role.

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3tables:PutTableMaintenanceConfiguration"
      ],
      "Resource": "arn:aws:s3tables:*:111122223333:*"
    }
  ]
}
```

### Granting access to materialized views
<a name="materialized-views-granting-access"></a>

To grant other users access to query a materialized view, use AWS Lake Formation to grant SELECT permission on the materialized view table. Users can query the materialized view without requiring direct access to the underlying source tables.

For detailed information about configuring Lake Formation permissions, see Granting and revoking permissions on Data Catalog resources in the AWS Lake Formation Developer Guide.
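As a sketch of what such a grant looks like from code, the helper below builds the arguments for the Lake Formation `grant_permissions` API. The principal ARN, database, and view names are hypothetical placeholders.

```python
def build_mv_grant(principal_arn, database, view_name):
    """Build kwargs for lakeformation.grant_permissions granting SELECT on the view's table."""
    return {
        "Principal": {"DataLakePrincipalArn": principal_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": view_name}},
        "Permissions": ["SELECT"],
    }

# To apply the grant (requires boto3 and AWS credentials):
# import boto3
# boto3.client("lakeformation").grant_permissions(
#     **build_mv_grant("arn:aws:iam::111122223333:role/analyst-role",
#                      "analytics", "customer_orders")
# )
```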

## Monitoring materialized view operations
<a name="materialized-views-monitoring"></a>

The AWS Glue Data Catalog publishes metrics and logs for materialized view refresh operations to Amazon CloudWatch. You can monitor refresh status, duration, and data volume processed through CloudWatch metrics.

### Viewing job logs
<a name="materialized-views-viewing-job-logs"></a>

To view logs for AWS Glue jobs that create or refresh materialized views:

1. Open the AWS Glue console.

1. Choose Jobs from the navigation pane.

1. Select your job and choose Runs.

1. Select a specific run and choose Logs to view CloudWatch logs.

### Setting up alarms
<a name="materialized-views-setting-up-alarms"></a>

To receive notifications when refresh operations fail or exceed expected duration, create CloudWatch alarms on materialized view metrics. You can also configure Amazon EventBridge rules to trigger automated responses to refresh events.
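The helper below sketches the arguments for a CloudWatch `put_metric_alarm` call. The namespace and metric name used here are placeholders, not documented metric names; check the metrics the Data Catalog actually publishes in your account before creating the alarm.

```python
def build_refresh_alarm(view_name, sns_topic_arn,
                        namespace="AWS/Glue",
                        metric_name="MaterializedViewRefreshFailures"):
    """Build kwargs for cloudwatch.put_metric_alarm.

    The default namespace and metric_name are assumptions for illustration;
    substitute the names you observe in the CloudWatch console.
    """
    return {
        "AlarmName": f"{view_name}-refresh-failures",
        "Namespace": namespace,
        "MetricName": metric_name,
        "Statistic": "Sum",
        "Period": 3600,              # one evaluation window per hour
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }

# To create the alarm (requires boto3 and AWS credentials):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **build_refresh_alarm("customer_orders",
#                           "arn:aws:sns:us-east-1:111122223333:alerts")
# )
```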

## Example: Complete workflow
<a name="materialized-views-complete-workflow"></a>

The following example demonstrates a complete workflow for creating and using a materialized view in AWS Glue.

### Example AWS Glue job script
<a name="materialized-views-example-glue-job-script"></a>

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Create database and base table
spark.sql("CREATE DATABASE IF NOT EXISTS sales")
spark.sql("USE sales")

spark.sql("""
    CREATE TABLE IF NOT EXISTS orders (
        id INT,
        customer_name STRING,
        amount DECIMAL(10,2),
        order_date DATE
    )
""")

# Insert sample data
spark.sql("""
    INSERT INTO orders VALUES 
        (1, 'John Doe', 150.00, DATE('2024-01-15')),
        (2, 'Jane Smith', 200.50, DATE('2024-01-16')),
        (3, 'Bob Johnson', 75.25, DATE('2024-01-17'))
""")

# Create materialized view
spark.sql("""
    CREATE MATERIALIZED VIEW customer_summary
    AS 
    SELECT 
        customer_name, 
        COUNT(*) as order_count, 
        SUM(amount) as total_amount 
    FROM glue_catalog.sales.orders
    GROUP BY customer_name
""")

# Query the materialized view
print("Initial materialized view data:")
spark.sql("SELECT * FROM customer_summary").show()

# Insert additional data
spark.sql("""
    INSERT INTO orders VALUES 
        (4, 'Jane Smith', 350.00, DATE('2024-01-18')),
        (5, 'Bob Johnson', 100.25, DATE('2024-01-19'))
""")

# Refresh the materialized view
spark.sql("REFRESH MATERIALIZED VIEW customer_summary FULL")

# Query updated results
print("Updated materialized view data:")
spark.sql("SELECT * FROM customer_summary").show()

job.commit()
```

### Example AWS Glue Studio notebook
<a name="materialized-views-example-glue-studio-notebook"></a>

```
%%configure
{
    "conf": {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
        "spark.sql.catalog.glue_catalog.type": "glue",
        "spark.sql.catalog.glue_catalog.warehouse": "s3://amzn-s3-demo-bucket/warehouse",
        "spark.sql.catalog.glue_catalog.glue.region": "us-east-1",
        "spark.sql.catalog.glue_catalog.glue.id": "111122223333",
        "spark.sql.catalog.glue_catalog.glue.account-id": "111122223333",
        "spark.sql.catalog.glue_catalog.glue.lakeformation-enabled": "true",
        "spark.sql.defaultCatalog": "glue_catalog",
        "spark.sql.optimizer.answerQueriesWithMVs.enabled": "true",
        "spark.sql.materializedViews.metadataCache.enabled": "true"
    }
}
```

```
%%sql
CREATE DATABASE IF NOT EXISTS sales
```

```
%%sql
USE sales
```

```
%%sql
CREATE TABLE IF NOT EXISTS orders (
    id INT,
    customer_name STRING,
    amount DECIMAL(10,2),
    order_date DATE
)
```

```
%%sql
INSERT INTO orders VALUES 
    (1, 'John Doe', 150.00, DATE('2024-01-15')),
    (2, 'Jane Smith', 200.50, DATE('2024-01-16')),
    (3, 'Bob Johnson', 75.25, DATE('2024-01-17'))
```

```
%%sql
CREATE MATERIALIZED VIEW customer_summary
AS 
SELECT 
    customer_name, 
    COUNT(*) as order_count, 
    SUM(amount) as total_amount 
FROM glue_catalog.sales.orders
GROUP BY customer_name
```

```
%%sql
SELECT * FROM customer_summary
```

```
%%sql
INSERT INTO orders VALUES 
    (4, 'Jane Smith', 350.00, DATE('2024-01-18')),
    (5, 'Bob Johnson', 100.25, DATE('2024-01-19'))
```

```
%%sql
REFRESH MATERIALIZED VIEW customer_summary FULL
```

```
%%sql
SELECT * FROM customer_summary
```

## Considerations and limitations
<a name="materialized-views-considerations-limitations"></a>

Consider the following when using materialized views with AWS Glue:
+ Materialized views require AWS Glue version 5.1 or later.
+ Source tables must be Apache Iceberg tables registered in the AWS Glue Data Catalog. Apache Hive, Apache Hudi, and Linux Foundation Delta Lake tables are not supported at launch.
+ Source tables must reside in the same Region and account as the materialized view.
+ All source tables must be governed by AWS Lake Formation. IAM-only permissions and hybrid access are not supported.
+ Materialized views cannot reference AWS Glue Data Catalog views, multi-dialect views, or other materialized views as source tables.
+ The view definer role must have full read access (SELECT or ALL permission) on all source tables without row, column, or cell filters applied.
+ Materialized views are eventually consistent with source tables. During the refresh window, queries may return stale data. Execute manual refresh for immediate consistency.
+ The minimum automatic refresh interval is one hour.
+ Incremental refresh supports a restricted subset of SQL operations. The view definition must be a single SELECT-FROM-WHERE-GROUP BY-HAVING block and cannot contain set operations, subqueries, the DISTINCT keyword in SELECT or aggregate functions, window functions, or joins other than INNER JOIN.
+ Incremental refresh does not support user-defined functions or certain built-in functions. Only a subset of Spark SQL built-in functions are supported.
+ Query automatic rewrite only considers materialized views whose definitions belong to a restricted SQL subset similar to incremental refresh restrictions.
+ Identifiers containing special characters other than alphanumeric characters and underscores are not supported in CREATE MATERIALIZED VIEW queries. This applies to all identifier types including catalog/namespace/table names, column and struct field names, CTEs, and aliases.
+ Materialized view columns starting with the \_\_ivm prefix are reserved for system use. Amazon reserves the right to modify or remove these columns in future releases.
+ The SORT BY, LIMIT, OFFSET, CLUSTER BY, and ORDER BY clauses are not supported in materialized view definitions.
+ Cross-Region and cross-account source tables are not supported.
+ Tables referenced in the view query must use the three-part naming convention (for example, glue_catalog.my_db.my_table) because automatic refresh does not use default catalog and database settings.
+ Full refresh operations override the entire table and make previous snapshots unavailable.
+ Non-deterministic functions such as rand() or current_timestamp() are not supported in materialized view definitions.
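The SQL restrictions above can be pre-checked before you rely on incremental refresh. The sketch below only scans a view definition for obviously disallowed keywords; it is not a SQL parser and can miss violations (or false-positive on keywords inside string literals), so treat it as a rough first filter.

```python
# Keywords that clearly fall outside the incremental-refresh SQL subset.
_DISALLOWED = [
    "DISTINCT", "UNION", "INTERSECT", "EXCEPT",            # DISTINCT and set operations
    "LEFT JOIN", "RIGHT JOIN", "FULL JOIN", "CROSS JOIN",  # only INNER JOIN is allowed
    " OVER ", " OVER(",                                    # window functions
    "ORDER BY", "SORT BY", "CLUSTER BY", " LIMIT ", " OFFSET ",
]

def may_support_incremental_refresh(view_sql):
    """Heuristically flag view definitions that clearly violate the restricted subset."""
    sql = " ".join(view_sql.upper().split())  # normalize case and whitespace
    return not any(token in sql for token in _DISALLOWED)
```

A definition that passes this check can still be rejected (for example, because of unsupported built-in functions), in which case the refresh falls back to a full refresh as described earlier.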

# AWS Glue worker types
<a name="worker-types"></a>

## Overview
<a name="worker-types-overview"></a>

AWS Glue provides multiple worker types to accommodate different workload requirements, from small streaming jobs to large-scale, memory-intensive data processing tasks. This section provides comprehensive information about all available worker types, their specifications, and usage recommendations.

### Worker type categories
<a name="worker-type-categories"></a>

AWS Glue offers two main categories of worker types:
+ **G Worker Types**: General-purpose compute workers optimized for standard ETL workloads
+ **R Worker Types**: Memory-optimized workers designed for memory-intensive Spark applications

### Data Processing Units (DPUs)
<a name="data-processing-units"></a>

The resources available on AWS Glue workers are measured in DPUs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory.

**Memory-Optimized DPUs (M-DPUs)**: R type workers use M-DPUs, which provide double the memory allocation of standard DPUs for a given size. While a standard DPU provides 16 GB of memory, an M-DPU in R type workers provides 32 GB of memory, optimized for memory-intensive Spark applications.
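This DPU-to-resource mapping can be expressed directly in code. The sketch below covers only the G and R worker types listed in this section; it simply applies the 4 vCPUs per DPU and 16 GB (or 32 GB for M-DPUs) per DPU rules.

```python
def worker_resources(worker_type):
    """Return (vCPUs, memory in GB) for a worker type string such as "G.4X" or "R.2X"."""
    family, size = worker_type.split(".")
    dpus = int(size.rstrip("X"))
    gb_per_dpu = 32 if family == "R" else 16  # R workers use memory-optimized M-DPUs
    return 4 * dpus, gb_per_dpu * dpus
```

For example, `worker_resources("R.4X")` yields 16 vCPUs and 128 GB of memory, matching the specifications table below.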

## Available worker types
<a name="available-worker-types"></a>

### G.1X
<a name="g1x-standard-worker"></a>
+ **DPU**: 1 DPU (4 vCPUs, 16 GB memory)
+ **Storage**: 94 GB disk (approximately 44 GB free)
+ **Use Case**: Data transforms, joins, and queries - scalable and cost-effective for most jobs

### G.2X
<a name="g2x-standard-worker"></a>
+ **DPU**: 2 DPU (8 vCPUs, 32 GB memory)
+ **Storage**: 138 GB disk (approximately 78 GB free)
+ **Use Case**: Data transforms, joins, and queries - scalable and cost-effective for most jobs

### G.4X
<a name="g4x-large-worker"></a>
+ **DPU**: 4 DPU (16 vCPUs, 64 GB memory)
+ **Storage**: 256 GB disk (approximately 230 GB free)
+ **Use Case**: Demanding transforms, aggregations, joins, and queries

### G.8X
<a name="g8x-extra-large-worker"></a>
+ **DPU**: 8 DPU (32 vCPUs, 128 GB memory)
+ **Storage**: 512 GB disk (approximately 485 GB free)
+ **Use Case**: Demanding transforms, aggregations, joins, and queries

### G.12X
<a name="g12x-very-large-worker"></a>
+ **DPU**: 12 DPU (48 vCPUs, 192 GB memory)
+ **Storage**: 768 GB disk (approximately 741 GB free)
+ **Use Case**: Very large and resource-intensive workloads requiring significant compute capacity

### G.16X
<a name="g16x-maximum-worker"></a>
+ **DPU**: 16 DPU (64 vCPUs, 256 GB memory)
+ **Storage**: 1024 GB disk (approximately 996 GB free)
+ **Use Case**: Largest and most resource-intensive workloads requiring maximum compute capacity

### R.1X - Memory-Optimized
<a name="r1x-memory-optimized-small"></a>
+ **DPU**: 1 M-DPU (4 vCPUs, 32 GB memory)
+ **Use Case**: Memory-intensive workloads with frequent out-of-memory errors or high memory-to-CPU ratio requirements

### R.2X - Memory-Optimized
<a name="r2x-memory-optimized-medium"></a>
+ **DPU**: 2 M-DPU (8 vCPUs, 64 GB memory)
+ **Use Case**: Memory-intensive workloads with frequent out-of-memory errors or high memory-to-CPU ratio requirements

### R.4X - Memory-Optimized
<a name="r4x-memory-optimized-large"></a>
+ **DPU**: 4 M-DPU (16 vCPUs, 128 GB memory)
+ **Use Case**: Large memory-intensive workloads with frequent out-of-memory errors or high memory-to-CPU ratio requirements

### R.8X - Memory-Optimized
<a name="r8x-memory-optimized-extra-large"></a>
+ **DPU**: 8 M-DPU (32 vCPUs, 256 GB memory)
+ **Use Case**: Very large memory-intensive workloads with frequent out-of-memory errors or high memory-to-CPU ratio requirements

**Note**  
You may encounter higher startup latency with these workers. To resolve the issue, try the following:
+ Wait a few minutes and then submit your job again.
+ Submit a new job with a reduced number of workers.
+ Submit a new job using a different worker type or size.

## Worker type specifications table
<a name="worker-type-specifications"></a>


**Worker Type Specifications**  

| Worker Type | DPU per Node | vCPU | Memory (GB) | Disk (GB) | Approximate Free Disk Space (GB) | Spark Executors per Node | 
| --- | --- | --- | --- | --- | --- | --- | 
| G.1X | 1 | 4 | 16 | 94 | 44 | 1 | 
| G.2X | 2 | 8 | 32 | 138 | 78 | 1 | 
| G.4X | 4 | 16 | 64 | 256 | 230 | 1 | 
| G.8X | 8 | 32 | 128 | 512 | 485 | 1 | 
| G.12X | 12 | 48 | 192 | 768 | 741 | 1 | 
| G.16X | 16 | 64 | 256 | 1024 | 996 | 1 | 
| R.1X | 1 | 4 | 32 | 94 | 44 | 1 | 
| R.2X | 2 | 8 | 64 | 138 | 78 | 1 | 
| R.4X | 4 | 16 | 128 | 256 | 230 | 1 | 
| R.8X | 8 | 32 | 256 | 512 | 485 | 1 | 

*Note*: R worker types have memory-optimized configurations with specifications optimized for memory-intensive workloads.

## Important considerations
<a name="important-considerations"></a>

### Startup latency
<a name="startup-latency"></a>

**Important**  
G.12X and G.16X worker types, as well as all R worker types (R.1X through R.8X), may encounter higher startup latency. To resolve the issue, try the following:  
Wait a few minutes and then submit your job again.
Submit a new job with a reduced number of workers.
Submit a new job using a different worker type and size.

## Choosing the right worker type
<a name="choosing-right-worker-type"></a>

### For standard ETL workloads
<a name="standard-etl-workloads"></a>
+ **G.1X or G.2X**: Most cost-effective for typical data transforms, joins, and queries
+ **G.4X or G.8X**: For more demanding workloads with larger datasets

### For large-scale workloads
<a name="large-scale-workloads"></a>
+ **G.12X**: Very large datasets requiring significant compute resources
+ **G.16X**: Maximum compute capacity for the most demanding workloads

### For memory-intensive workloads
<a name="memory-intensive-workloads"></a>
+ **R.1X or R.2X**: Small to medium memory-intensive jobs
+ **R.4X or R.8X**: Large memory-intensive workloads with frequent OOM errors

## Cost optimization considerations
<a name="cost-optimization-considerations"></a>
+ **Standard G workers**: Provide a balance of compute, memory and networking resources, and can be used for a variety of diverse workloads at lower cost
+ **R workers**: Specialized for memory-intensive tasks with fast performance for workloads that process large data sets in memory

## Best practices
<a name="best-practices"></a>

### Worker selection guidelines
<a name="worker-selection-guidelines"></a>

1. **Start with standard workers** (G.1X, G.2X) for most workloads

1. **Use R workers** when experiencing frequent out-of-memory errors or workloads with memory-intensive operations like caching, shuffling, and aggregating

1. **Consider G.12X/G.16X** for compute-intensive workloads requiring maximum resources

1. **Account for capacity constraints** when using new worker types in time-sensitive workflows

### Performance optimization
<a name="performance-optimization"></a>
+ Monitor CloudWatch metrics to understand resource utilization
+ Use appropriate worker counts based on data size and complexity
+ Consider data partitioning strategies to optimize worker efficiency

# Streaming ETL jobs in AWS Glue
<a name="add-job-streaming"></a>

You can create streaming extract, transform, and load (ETL) jobs that run continuously and consume data from streaming sources like Amazon Kinesis Data Streams, Apache Kafka, and Amazon Managed Streaming for Apache Kafka (Amazon MSK). The jobs cleanse and transform the data, and then load the results into Amazon S3 data lakes or JDBC data stores.

Additionally, you can produce data for Amazon Kinesis Data Streams streams. This feature is only available when writing AWS Glue scripts. For more information, see [Kinesis connections](aws-glue-programming-etl-connect-kinesis-home.md). 

By default, AWS Glue processes and writes out data in 100-second windows. This allows data to be processed efficiently and permits aggregations to be performed on data arriving later than expected. You can modify this window size to increase timeliness or aggregation accuracy. AWS Glue streaming jobs use checkpoints rather than job bookmarks to track the data that has been read.
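In a job script, the window size is set through the options passed to `GlueContext.forEachBatch`. The sketch below builds those options as a plain dict; the checkpoint path is a placeholder, and the `forEachBatch` call is shown commented because it requires a running Glue streaming job.

```python
def streaming_batch_options(window_seconds, checkpoint_path):
    """Options for GlueContext.forEachBatch; AWS Glue defaults to 100-second windows."""
    return {
        "windowSize": f"{window_seconds} seconds",
        "checkpointLocation": checkpoint_path,
    }

# In a Glue streaming job, pass these options to forEachBatch:
# glueContext.forEachBatch(
#     frame=streaming_df,                # DataFrame read from the streaming source
#     batch_function=process_batch,      # your per-window transform and sink logic
#     options=streaming_batch_options(
#         100, "s3://amzn-s3-demo-bucket/checkpoints/my-streaming-job/"),
# )
```

Lowering `window_seconds` improves timeliness; raising it lets more late-arriving records land in the same aggregation window.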

**Note**  
AWS Glue bills hourly for streaming ETL jobs while they are running.

This video discusses streaming ETL cost challenges and cost-saving features in AWS Glue.

[![AWS Videos](http://img.youtube.com/vi/6ggTFOtfUxU/0.jpg)](http://www.youtube.com/watch?v=6ggTFOtfUxU)


Creating a streaming ETL job involves the following steps:

1. For an Apache Kafka streaming source, create an AWS Glue connection to the Kafka source or the Amazon MSK cluster.

1. Manually create a Data Catalog table for the streaming source.

1. Create an ETL job for the streaming data source. Define streaming-specific job properties, and supply your own script or optionally modify the generated script.

For more information, see [Streaming ETL in AWS Glue](components-overview.md#streaming-etl-intro).

When creating a streaming ETL job for Amazon Kinesis Data Streams, you don't have to create an AWS Glue connection. However, if there is a connection attached to the AWS Glue streaming ETL job that has Kinesis Data Streams as a source, then a virtual private cloud (VPC) endpoint to Kinesis is required. For more information, see [Creating an interface endpoint](https://docs.aws.amazon.com/vpc/latest/userguide/vpce-interface.html#create-interface-endpoint) in the *Amazon VPC User Guide*. When specifying an Amazon Kinesis Data Streams stream in another account, you must set up the roles and policies to allow cross-account access. For more information, see [Example: Read From a Kinesis Stream in a Different Account](https://docs.aws.amazon.com/kinesisanalytics/latest/java/examples-cross.html).

AWS Glue streaming ETL jobs can auto-detect compressed data, transparently decompress the streaming data, perform the usual transformations on the input source, and load to the output store. 

AWS Glue supports auto-decompression for the following compression types given the input format:


| Compression type | Avro file | Avro datum | JSON | CSV | Grok | 
| --- | --- | --- | --- | --- | --- | 
| BZIP2 | Yes | Yes | Yes | Yes | Yes | 
| GZIP | No | Yes | Yes | Yes | Yes | 
| SNAPPY | Yes (raw Snappy) | Yes (framed Snappy) | Yes (framed Snappy) | Yes (framed Snappy) | Yes (framed Snappy) | 
| XZ | Yes | Yes | Yes | Yes | Yes | 
| ZSTD | Yes | No | No | No | No | 
| DEFLATE | Yes | Yes | Yes | Yes | Yes | 

**Topics**
+ [Creating an AWS Glue connection for an Apache Kafka data stream](#create-conn-streaming)
+ [Creating a Data Catalog table for a streaming source](#create-table-streaming)
+ [Notes and restrictions for Avro streaming sources](#streaming-avro-notes)
+ [Applying grok patterns to streaming sources](#create-table-streaming-grok)
+ [Defining job properties for a streaming ETL job](#create-job-streaming-properties)
+ [Streaming ETL notes and restrictions](#create-job-streaming-restrictions)

## Creating an AWS Glue connection for an Apache Kafka data stream
<a name="create-conn-streaming"></a>

To read from an Apache Kafka stream, you must create an AWS Glue connection. 

**To create an AWS Glue connection for a Kafka source (Console)**

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, under **Data catalog**, choose **Connections**.

1. Choose **Add connection**, and on the **Set up your connection’s properties** page, enter a connection name.
**Note**  
For more information about specifying connection properties, see [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-connections).

1. For **Connection type**, choose **Kafka**.

1. For **Kafka bootstrap servers URLs**, enter the host and port number for the bootstrap brokers for your Amazon MSK cluster or Apache Kafka cluster. Use only Transport Layer Security (TLS) endpoints for establishing the initial connection to the Kafka cluster. Plaintext endpoints are not supported.

   The following is an example list of hostname and port number pairs for an Amazon MSK cluster.

   ```
   myserver1.kafka.us-east-1.amazonaws.com:9094,myserver2.kafka.us-east-1.amazonaws.com:9094,
   myserver3.kafka.us-east-1.amazonaws.com:9094
   ```

   For more information about getting the bootstrap broker information, see [Getting the Bootstrap Brokers for an Amazon MSK Cluster](https://docs.aws.amazon.com/msk/latest/developerguide/msk-get-bootstrap-brokers.html) in the *Amazon Managed Streaming for Apache Kafka Developer Guide*. 

1. If you want a secure connection to the Kafka data source, select **Require SSL connection**, and for **Kafka private CA certificate location**, enter a valid Amazon S3 path to a custom SSL certificate.

   For an SSL connection to self-managed Kafka, the custom certificate is mandatory. It's optional for Amazon MSK.

   For more information about specifying a custom certificate for Kafka, see [AWS Glue SSL connection properties](connection-properties.md#connection-properties-SSL).

1. Use AWS Glue Studio or the AWS CLI to specify a Kafka client authentication method. To access AWS Glue Studio, select **AWS Glue** from the **ETL** menu in the left navigation pane.

   For more information about Kafka client authentication methods, see [AWS Glue Kafka connection properties for client authentication](#connection-properties-kafka-client-auth).

1. Optionally enter a description, and then choose **Next**.

1. For an Amazon MSK cluster, specify its virtual private cloud (VPC), subnet, and security group. The VPC information is optional for self-managed Kafka.

1. Choose **Next** to review all connection properties, and then choose **Finish**.

For more information about AWS Glue connections, see [Connecting to data](glue-connections.md).
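The same connection can be created programmatically with the AWS Glue `CreateConnection` API. The sketch below builds the `ConnectionInput` structure; the connection and bucket names are placeholders, and the property keys shown are the ones commonly used for Kafka connections, so verify them against a connection created through the console.

```python
def build_kafka_connection(name, bootstrap_servers, custom_cert_s3_path=None):
    """Build the ConnectionInput for glue.create_connection (TLS endpoints only)."""
    props = {
        "KAFKA_BOOTSTRAP_SERVERS": ",".join(bootstrap_servers),
        "KAFKA_SSL_ENABLED": "true",  # plaintext endpoints are not supported
    }
    if custom_cert_s3_path:  # mandatory for self-managed Kafka, optional for Amazon MSK
        props["KAFKA_CUSTOM_CERT"] = custom_cert_s3_path
    return {"Name": name, "ConnectionType": "KAFKA", "ConnectionProperties": props}

# To create the connection (requires boto3 and AWS credentials):
# import boto3
# boto3.client("glue").create_connection(
#     ConnectionInput=build_kafka_connection(
#         "msk-connection",
#         ["myserver1.kafka.us-east-1.amazonaws.com:9094"],
#     )
# )
```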

### AWS Glue Kafka connection properties for client authentication
<a name="connection-properties-kafka-client-auth"></a>

**SASL/GSSAPI (Kerberos) authentication**  
Choosing this authentication method will allow you to specify Kerberos properties.

**Kerberos Keytab**  
Choose the location of the keytab file. A keytab stores long-term keys for one or more principals. For more information, see [MIT Kerberos Documentation: Keytab ](https://web.mit.edu/kerberos/krb5-latest/doc/basic/keytab_def.html). 

**Kerberos krb5.conf file**  
Choose the krb5.conf file. This contains the default realm (a logical network, similar to a domain, that defines a group of systems under the same KDC) and the location of the KDC server. For more information, see [MIT Kerberos Documentation: krb5.conf ](https://web.mit.edu/kerberos/krb5-1.12/doc/admin/conf_files/krb5_conf.html). 

**Kerberos principal and Kerberos service name**  
Enter the Kerberos principal and service name. For more information, see [MIT Kerberos Documentation: Kerberos principal ](https://web.mit.edu/kerberos/krb5-1.5/krb5-1.5.4/doc/krb5-user/What-is-a-Kerberos-Principal_003f.html). 

**SASL/SCRAM-SHA-512 authentication**  
 Choosing this authentication method will allow you to specify authentication credentials. 

**AWS Secrets Manager**  
Search for your token in the Search box by typing the name or ARN. 

**Provide username and password directly**  
Enter the username and password directly instead of referencing a secret stored in AWS Secrets Manager. 

**SSL client authentication**  
Choosing this authentication method allows you to select the location of the Kafka client keystore by browsing Amazon S3. Optionally, you can enter the Kafka client keystore password and Kafka client key password. 

**IAM authentication**  
This authentication method does not require any additional specifications and is only applicable when the streaming source is Amazon MSK. 

**SASL/PLAIN authentication**  
Choosing this authentication method allows you to specify authentication credentials. 

## Creating a Data Catalog table for a streaming source
<a name="create-table-streaming"></a>

For a streaming source, you can manually create a Data Catalog table that specifies source data stream properties, including the data schema. This table is used as the data source for the streaming ETL job. 

If you don't know the schema of the data in the source data stream, you can create the table without a schema. Then when you create the streaming ETL job, you can turn on the AWS Glue schema detection function. AWS Glue determines the schema from the streaming data.

Use the [AWS Glue console](https://console.aws.amazon.com/glue/), the AWS Command Line Interface (AWS CLI), or the AWS Glue API to create the table. For information about creating a table manually with the AWS Glue console, see [Creating tables](tables-described.md).

**Note**  
You can't use the AWS Lake Formation console to create the table; you must use the AWS Glue console.

Also consider the following information for streaming sources in Avro format or for log data that you can apply Grok patterns to. 
+ [Notes and restrictions for Avro streaming sources](#streaming-avro-notes)
+ [Applying grok patterns to streaming sources](#create-table-streaming-grok)

**Topics**
+ [Kinesis data source](#kinesis-source)
+ [Kafka data source](#kafka-source)
+ [AWS Glue Schema Registry table source](#schema-registry-table)

### Kinesis data source
<a name="kinesis-source"></a>

When creating the table, set the following streaming ETL properties (console).

**Type of Source**  
**Kinesis**

**For a Kinesis source in the same account:**    
**Region**  
The AWS Region where the Amazon Kinesis Data Streams service resides. The Region and Kinesis stream name are together translated to a Stream ARN.  
Example: https://kinesis.us-east-1.amazonaws.com  
**Kinesis stream name**  
Stream name as described in [Creating a Stream](https://docs.aws.amazon.com/streams/latest/dev/kinesis-using-sdk-java-create-stream.html) in the *Amazon Kinesis Data Streams Developer Guide*.

**For a Kinesis source in another account, refer to [this example](https://docs.aws.amazon.com/kinesisanalytics/latest/java/examples-cross.html) to set up the roles and policies to allow cross-account access. Configure these settings:**    
**Stream ARN**  
The ARN of the Kinesis data stream that the consumer is registered with. For more information, see [Amazon Resource Names (ARNs) and AWS Service Namespaces](https://docs.aws.amazon.com/general/latest/gr/aws-arns-and-namespaces.html) in the *AWS General Reference*.  
**Assumed Role ARN**  
The Amazon Resource Name (ARN) of the role to assume.  
**Session name (optional)**  
An identifier for the assumed role session.  
Use the role session name to uniquely identify a session when the same role is assumed by different principals or for different reasons. In cross-account scenarios, the role session name is visible to, and can be logged by the account that owns the role. The role session name is also used in the ARN of the assumed role principal. This means that subsequent cross-account API requests that use the temporary security credentials will expose the role session name to the external account in their AWS CloudTrail logs.
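
Cross-account access additionally requires that the role in the stream-owning account trust the account that runs the job. A minimal sketch of such a trust policy (the account ID is a placeholder):

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
            "Action": "sts:AssumeRole"
        }
    ]
}
```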

**To set streaming ETL properties for Amazon Kinesis Data Streams (AWS Glue API or AWS CLI)**
+ To set up streaming ETL properties for a Kinesis source in the same account, specify the `streamName` and `endpointUrl` parameters in the `StorageDescriptor` structure of the `CreateTable` API operation or the `create_table` CLI command.

  ```
  "StorageDescriptor": {
  	"Parameters": {
  		"typeOfData": "kinesis",
  		"streamName": "sample-stream",
  		"endpointUrl": "https://kinesis.us-east-1.amazonaws.com"
  	}
  	...
  }
  ```

  Or, specify the `streamARN`.  
**Example**  

  ```
  "StorageDescriptor": {
  	"Parameters": {
  		"typeOfData": "kinesis",
  		"streamARN": "arn:aws:kinesis:us-east-1:123456789:stream/sample-stream"
  	}
  	...
  }
  ```
+ To set up streaming ETL properties for a Kinesis source in another account, specify the `streamARN`, `awsSTSRoleARN` and `awsSTSSessionName` (optional) parameters in the `StorageDescriptor` structure in the `CreateTable` API operation or the `create_table` CLI command.

  ```
  "StorageDescriptor": {
  	"Parameters": {
  		"typeOfData": "kinesis",
  		"streamARN": "arn:aws:kinesis:us-east-1:123456789:stream/sample-stream",
  		"awsSTSRoleARN": "arn:aws:iam::123456789:role/sample-assume-role-arn",
  		"awsSTSSessionName": "optional-session"
  	}
  	...
  }
  ```
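
These `StorageDescriptor` parameters can also be assembled programmatically. The following is a minimal sketch using boto3; the database, table, and ARN values are placeholders, and `kinesis_table_input` is a helper invented here for illustration:

```
def kinesis_table_input(name, stream_arn, role_arn=None, session_name=None):
    """Build a TableInput for a Kinesis streaming source (illustrative helper)."""
    params = {"typeOfData": "kinesis", "streamARN": stream_arn}
    if role_arn:
        # Cross-account access: the job assumes this role to read the stream.
        params["awsSTSRoleARN"] = role_arn
    if session_name:
        params["awsSTSSessionName"] = session_name
    return {"Name": name, "StorageDescriptor": {"Parameters": params}}

def register_stream_table(database, table_input, region="us-east-1"):
    import boto3  # requires the AWS SDK for Python
    glue = boto3.client("glue", region_name=region)
    glue.create_table(DatabaseName=database, TableInput=table_input)
```

For example, `register_stream_table("streaming_db", kinesis_table_input("sample_stream_table", "arn:aws:kinesis:us-east-1:123456789:stream/sample-stream"))` would create the same-account table shown earlier.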

### Kafka data source
<a name="kafka-source"></a>

When creating the table, set the following streaming ETL properties (console).

**Type of Source**  
 **Kafka**

**For a Kafka source:**    
**Topic name**  
Topic name as specified in Kafka.  
**Connection**  
An AWS Glue connection that references a Kafka source, as described in [Creating an AWS Glue connection for an Apache Kafka data stream](#create-conn-streaming).
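
By analogy with the Kinesis parameters earlier, a Kafka source table carries its properties in `StorageDescriptor.Parameters`. A hedged sketch of assembling them; the `topicName` and `connectionName` key names mirror the console fields above and should be treated as assumptions:

```
def kafka_table_parameters(topic, connection):
    # Sketch: StorageDescriptor parameters for a Kafka streaming source.
    # Key names mirror the console fields above; treat them as assumptions
    # rather than a definitive API contract.
    return {
        "typeOfData": "kafka",
        "topicName": topic,
        "connectionName": connection,
    }
```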

### AWS Glue Schema Registry table source
<a name="schema-registry-table"></a>

To use AWS Glue Schema Registry for streaming jobs, follow the instructions at [Use case: AWS Glue Data Catalog](schema-registry-integrations.md#schema-registry-integrations-aws-glue-data-catalog) to create or update a Schema Registry table.

Currently, AWS Glue Streaming supports only Glue Schema Registry Avro format with schema inference set to `false`.

## Notes and restrictions for Avro streaming sources
<a name="streaming-avro-notes"></a>

The following notes and restrictions apply for streaming sources in the Avro format:
+ When schema detection is turned on, the Avro schema must be included in the payload. When turned off, the payload should contain only data.
+ Some Avro data types are not supported in dynamic frames. You can't specify these data types when defining the schema with the **Define a schema** page in the create table wizard in the AWS Glue console. During schema detection, unsupported types in the Avro schema are converted to supported types as follows:
  + `EnumType => StringType`
  + `FixedType => BinaryType`
  + `UnionType => StructType`
+ If you define the table schema using the **Define a schema** page in the console, the implied root element type for the schema is `record`. If you want a root element type other than `record`, for example `array` or `map`, you can't specify the schema using the **Define a schema** page. Instead you must skip that page and specify the schema either as a table property or within the ETL script.
  + To specify the schema in the table properties, complete the create table wizard, edit the table details, and add a new key-value pair under **Table properties**. Use the key `avroSchema`, and enter a schema JSON object for the value, as shown in the following screenshot.  
![\[Under the Table properties heading, there are two columns of text fields. The left-hand column heading is Key, and the right-hand column heading is Value. The key/value pair in the first row is classification/avro. The key/value pair in the second row is avroSchema/{"type":"array","items":"string"}.\]](http://docs.aws.amazon.com/glue/latest/dg/images/table_properties_avro.png)
  + To specify the schema in the ETL script, modify the `datasource0` assignment statement and add the `avroSchema` key to the `additional_options` argument, as shown in the following Python and Scala examples.

------
#### [ Python ]

    ```
    SCHEMA_STRING = '{"type":"array","items":"string"}'
    datasource0 = glueContext.create_data_frame.from_catalog(database = "database", table_name = "table_name", transformation_ctx = "datasource0", additional_options = {"startingPosition": "TRIM_HORIZON", "inferSchema": "false", "avroSchema": SCHEMA_STRING})
    ```

------
#### [ Scala ]

    ```
    val SCHEMA_STRING = """{"type":"array","items":"string"}"""
    val datasource0 = glueContext.getCatalogSource(database = "database", tableName = "table_name", redshiftTmpDir = "", transformationContext = "datasource0", additionalOptions = JsonOptions(s"""{"startingPosition": "TRIM_HORIZON", "inferSchema": "false", "avroSchema":"$SCHEMA_STRING"}""")).getDataFrame()
    ```

------

## Applying grok patterns to streaming sources
<a name="create-table-streaming-grok"></a>

You can create a streaming ETL job for a log data source and use Grok patterns to convert the logs to structured data. The ETL job then processes the data as a structured data source. You specify the Grok patterns to apply when you create the Data Catalog table for the streaming source.

For information about Grok patterns and custom pattern string values, see [Writing grok custom classifiers](custom-classifier.md#custom-classifier-grok).

**To add grok patterns to the Data Catalog table (console)**
+ Use the create table wizard, and create the table with the parameters specified in [Creating a Data Catalog table for a streaming source](#create-table-streaming). Specify the data format as Grok, fill in the **Grok pattern** field, and optionally add custom patterns under **Custom patterns (optional)**.  
![\[*\]](http://docs.aws.amazon.com/glue/latest/dg/images/grok-data-format-create-table.png)

  Press **Enter** after each custom pattern.

**To add grok patterns to the Data Catalog table (AWS Glue API or AWS CLI)**
+ Add the `GrokPattern` parameter and optionally the `CustomPatterns` parameter to the `CreateTable` API operation or the `create_table` CLI command.

  ```
   "Parameters": {
  ...
      "grokPattern": "string",
      "grokCustomPatterns": "string",
  ...
  },
  ```

  Express `grokCustomPatterns` as a string and use `"\n"` as the separator between patterns.

  The following is an example of specifying these parameters.  
**Example**  

  ```
  "parameters": {
  ...
      "grokPattern": "%{USERNAME:username} %{DIGIT:digit:int}",
      "grokCustomPatterns": "digit \d",
  ...
  }
  ```
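
A sketch of assembling these parameters with two custom patterns, assuming the newline (`"\n"`) separator; the second pattern is hypothetical:

```
# Join multiple custom grok patterns with "\n", the separator between
# patterns in the grokCustomPatterns table parameter.
custom_patterns = "\n".join([
    r"digit \d",
    r"postal_code \d{5}",   # hypothetical second pattern
])

parameters = {
    "grokPattern": "%{USERNAME:username} %{DIGIT:digit:int}",
    "grokCustomPatterns": custom_patterns,
}
```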

## Defining job properties for a streaming ETL job
<a name="create-job-streaming-properties"></a>

When you define a streaming ETL job in the AWS Glue console, provide the following streams-specific properties. For descriptions of additional job properties, see [Defining job properties for Spark jobs](add-job.md#create-job). 

**IAM role**  
Specify the AWS Identity and Access Management (IAM) role that is used for authorization to resources that are used to run the job, access streaming sources, and access target data stores.  
For access to Amazon Kinesis Data Streams, attach the `AmazonKinesisFullAccess` AWS managed policy to the role, or attach a similar IAM policy that permits more fine-grained access. For sample policies, see [Controlling Access to Amazon Kinesis Data Streams Resources Using IAM](https://docs.aws.amazon.com/streams/latest/dev/controlling-access.html).  
For more information about permissions for running jobs in AWS Glue, see [Identity and access management for AWS Glue](security-iam.md).

**Type**  
Choose **Spark streaming**.

**AWS Glue version**  
The AWS Glue version determines the versions of Apache Spark, and Python or Scala, that are available to the job. Choose a selection that specifies the version of Python or Scala available to the job. AWS Glue Version 2.0 with Python 3 support is the default for streaming ETL jobs.

**Maintenance window**  
Specifies a window where a streaming job can be restarted. See [Maintenance windows for AWS Glue Streaming](glue-streaming-maintenance.md).

**Job timeout**  
Optionally enter a duration in minutes. The default value is blank.  
+ Streaming jobs must have a timeout value less than 7 days or 10080 minutes.
+ When the value is left blank, the job will be restarted after 7 days, if you have not set up a maintenance window. If you have set up a maintenance window, the job will be restarted during the maintenance window after 7 days.

**Data source**  
Specify the table that you created in [Creating a Data Catalog table for a streaming source](#create-table-streaming).

**Data target**  
Do one of the following:  
+ Choose **Create tables in your data target** and specify the following data target properties.  
**Data store**  
Choose Amazon S3 or JDBC.  
**Format**  
Choose any format. All are supported for streaming.
+ Choose **Use tables in the data catalog and update your data target**, and choose a table for a JDBC data store.

**Output schema definition**  
Do one of the following:  
+ Choose **Automatically detect schema of each record** to turn on schema detection. AWS Glue determines the schema from the streaming data.
+ Choose **Specify output schema for all records** to use the Apply Mapping transform to define the output schema.

**Script**  
Optionally supply your own script or modify the generated script to perform operations that the Apache Spark Structured Streaming engine supports. For information on the available operations, see [Operations on streaming DataFrames/Datasets](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#operations-on-streaming-dataframesdatasets).

## Streaming ETL notes and restrictions
<a name="create-job-streaming-restrictions"></a>

Keep in mind the following notes and restrictions:
+ Auto-decompression for AWS Glue streaming ETL jobs is only available for the supported compression types. Also note the following:
  + Framed Snappy refers to the official [framing format](https://github.com/google/snappy/blob/main/framing_format.txt) for Snappy.
  + Deflate is supported in Glue version 3.0, not Glue version 2.0.
+ When using schema detection, you cannot perform joins of streaming data.
+ AWS Glue streaming ETL jobs do not support the Union data type for AWS Glue Schema Registry with Avro format.
+ Your ETL script can use AWS Glue's built-in transforms and the transforms native to Apache Spark Structured Streaming. For more information, see [Operations on streaming DataFrames/Datasets](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#operations-on-streaming-dataframesdatasets) on the Apache Spark website or [AWS Glue PySpark transforms reference](aws-glue-programming-python-transforms.md).
+ AWS Glue streaming ETL jobs use checkpoints to keep track of the data that has been read. Therefore, a stopped and restarted job picks up where it left off in the stream. If you want to reprocess data, you can delete the checkpoint folder referenced in the script.
+ Job bookmarks aren't supported.
+ To use enhanced fan-out feature of Kinesis Data Streams in your job, consult [Using enhanced fan-out in Kinesis streaming jobs](aws-glue-programming-etl-connect-kinesis-efo.md).
+ If you use a Data Catalog table created from AWS Glue Schema Registry, when a new schema version becomes available, to reflect the new schema, you need to do the following:

  1. Stop the jobs associated with the table.

  1. Update the schema for the Data Catalog table.

  1. Restart the jobs associated with the table.

# Record matching with AWS Lake Formation FindMatches
<a name="machine-learning"></a>

**Note**  
Record matching is currently unavailable in the following Regions in the AWS Glue console: Middle East (UAE), Europe (Spain), Asia Pacific (Jakarta), and Europe (Zurich).

AWS Lake Formation provides machine learning capabilities to create custom transforms to cleanse your data. There is currently one available transform named FindMatches. The FindMatches transform enables you to identify duplicate or matching records in your dataset, even when the records have no common unique identifier and no fields match exactly. FindMatches doesn't require you to write code or know how machine learning works. FindMatches can be useful for many problems, such as: 
+ **Matching customers**: Linking customer records across different customer databases, even when many customer fields do not match exactly across the databases (for example, different name spellings, address differences, or missing or inaccurate data).
+ **Matching products**: Matching products in your catalog against other product sources, such as product catalog against a competitor's catalog, where entries are structured differently.
+ **Improving fraud detection**: Identifying duplicate customer accounts, determining when a newly created account is (or might be) a match for a previously known fraudulent user.
+ **Other matching problems**: Matching addresses, movies, parts lists, and so on. In general, if a person could look at your database rows and determine that they were a match, there is a very good chance that the FindMatches transform can help you.

 You can create these transforms when you create a job. The transform that you create is based on a source data store schema and example data from the source dataset that you label (we call this process "teaching" a transform). The records that you label must be present in the source dataset. During teaching, AWS Glue generates a labeling file; you add labels to it and upload it back, and the transform learns from your labels. After you teach your transform, you can call it from your Spark-based AWS Glue job (PySpark or Scala Spark) and use it in other scripts with a compatible source data store. 

 After the transform is created, it is stored in AWS Glue. On the AWS Glue console, you can manage the transforms that you create. In the navigation pane, under **Data Integration and ETL**, choose **Data classification tools > Record Matching** to edit and continue to teach your machine learning transform. For more information about managing transforms on the console, see [Working with machine learning transforms](console-machine-learning-transforms.md). 

**Note**  
AWS Glue version 2.0 FindMatches jobs use the Amazon S3 bucket `aws-glue-temp-<accountID>-<region>` to store temporary files while the transform is processing data. You can delete this data after the run has completed, either manually or by setting an Amazon S3 Lifecycle rule.

## Types of machine learning transforms
<a name="machine-learning-transforms"></a>

You can create machine learning transforms to cleanse your data. You can call these transforms from your ETL script. Your data passes from transform to transform in a data structure called a *DynamicFrame*, which is an extension to an Apache Spark SQL `DataFrame`. The `DynamicFrame` contains your data, and you reference its schema to process your data.

The following types of machine learning transforms are available:

*Find matches*  
Finds duplicate records in the source data. You teach this machine learning transform by labeling example datasets to indicate which rows match. The more labeled example data you provide, the better the transform learns which rows should match. Depending on how you configure the transform, the output is one of the following:  
+ A copy of the input table plus a `match_id` column filled in with values that indicate matching sets of records. The `match_id` column is an arbitrary identifier. Any records that have the same `match_id` have been identified as matching each other. Records with different `match_id` values do not match.
+ A copy of the input table with duplicate rows removed. If multiple duplicates are found, then the record with the lowest primary key is kept.

*Find incremental matches*  
The Find matches transform can also be configured to find matches across the existing and incremental frames and return as output a column containing a unique ID per match group.   
For more information, see [Finding incremental matches](machine-learning-incremental-matches.md)
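
The two Find matches output modes described above are related by a small post-processing step: collapsing each match set to the record with the lowest primary key. A hedged sketch; the `primary_key` and `match_id` field names here are illustrative:

```
def dedup_by_match_id(records):
    """Keep one record per match_id: the one with the lowest primary key.

    `records` is a list of dicts with 'primary_key' and 'match_id' fields
    (hypothetical column names, for illustration only).
    """
    best = {}
    for rec in records:
        mid = rec["match_id"]
        if mid not in best or rec["primary_key"] < best[mid]["primary_key"]:
            best[mid] = rec
    return sorted(best.values(), key=lambda r: r["primary_key"])
```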

### Using the FindMatches transform
<a name="machine-learning-find-matches"></a>

You can use the `FindMatches` transform to find duplicate records in the source data. A labeling file is generated or provided to help teach the transform.

**Note**  
Currently, `FindMatches` transforms that use a custom encryption key aren't supported in the following Regions:  
Asia Pacific (Osaka) - `ap-northeast-3`

 To get started with the FindMatches transform, you can follow the steps below. For a more advanced and detailed example, see the **AWS Big Data Blog**: [Harmonize data using AWS Glue and AWS Lake Formation FindMatches ML to build a customer 360 view](https://aws.amazon.com/blogs/big-data/harmonize-data-using-aws-glue-and-aws-lake-formation-findmatches-ml-to-build-a-customer-360-view/). 

#### Getting started using the Find Matches transform
<a name="machine-learning-find-mathes-workflow"></a>

Follow these steps to get started with the `FindMatches` transform:

1. Create a table in the AWS Glue Data Catalog for the source data that is to be cleaned. For information about how to create a crawler, see [Working with Crawlers on the AWS Glue Console](https://docs.aws.amazon.com/glue/latest/dg/console-crawlers.html).

   If your source data is a text-based file such as a comma-separated values (CSV) file, consider the following: 
   + Keep your input record CSV file and labeling file in separate folders. Otherwise, the AWS Glue crawler might consider them as multiple parts of the same table and create tables in the Data Catalog incorrectly. 
   + Unless your CSV file includes ASCII characters only, ensure that UTF-8 without BOM (byte order mark) encoding is used for the CSV files. Microsoft Excel often adds a BOM in the beginning of UTF-8 CSV files. To remove it, open the CSV file in a text editor, and resave the file as **UTF-8 without BOM**. 

1. On the AWS Glue console, create a job, and choose the **Find matches** transform type.
**Important**  
The data source table that you choose for the job can't have more than 100 columns.

1. Tell AWS Glue to generate a labeling file by choosing **Generate labeling file**. AWS Glue takes the first pass at grouping similar records for each `labeling_set_id` so that you can review those groupings. You label matches in the `label` column.
   + If you already have a labeling file, that is, an example of records that indicate matching rows, upload the file to Amazon Simple Storage Service (Amazon S3). For information about the format of the labeling file, see [Labeling file format](#machine-learning-labeling-file). Proceed to step 4.

1. Download the labeling file and label the file as described in the [Labeling](#machine-learning-labeling) section.

1. Upload the corrected labeled file. AWS Glue runs tasks to teach the transform how to find matches.

   On the **Machine learning transforms** list page, choose the **History** tab. This page indicates when AWS Glue performs the following tasks:
   + **Import labels**
   + **Export labels**
   + **Generate labels**
   + **Estimate quality**

1. To create a better transform, you can iteratively download, label, and upload the labeled file. In the initial runs, many records might be mismatched. But AWS Glue learns as you continue to teach it by verifying the labeling file.

1. Evaluate and tune your transform by evaluating performance and results of finding matches. For more information, see [Tuning machine learning transforms in AWS Glue](add-job-machine-learning-transform-tuning.md).
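
Step 1 notes that Microsoft Excel often prepends a byte order mark (BOM) to UTF-8 CSV files. Besides resaving the file in a text editor, you can strip the BOM programmatically; a minimal sketch using Python's `utf-8-sig` codec, which consumes a leading BOM when reading:

```
def strip_bom(path_in, path_out):
    # utf-8-sig decodes a leading BOM transparently if one is present;
    # writing back as plain utf-8 produces a BOM-free file.
    with open(path_in, encoding="utf-8-sig") as src, \
         open(path_out, "w", encoding="utf-8") as dst:
        dst.write(src.read())
```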

#### Labeling
<a name="machine-learning-labeling"></a>

When `FindMatches` generates a labeling file, records are selected from your source table. Based on previous training, `FindMatches` identifies the most valuable records to learn from.

The act of *labeling* is editing a labeling file (we suggest using a spreadsheet such as Microsoft Excel) and adding identifiers, or labels, into the `label` column that identifies matching and nonmatching records. It is important to have a clear and consistent definition of a match in your source data. `FindMatches` learns from which records you designate as matches (or not) and uses your decisions to learn how to find duplicate records.

When a labeling file is generated by `FindMatches`, approximately 100 records are generated. These 100 records are typically divided into 10 *labeling sets*, where each labeling set is identified by a unique `labeling_set_id` generated by `FindMatches`. Each labeling set should be viewed as a separate labeling task independent of the other labeling sets. Your task is to identify matching and non-matching records within each labeling set.

##### Tips for editing labeling files in a spreadsheet
<a name="machine-learning-labeling-tips"></a>

When editing the labeling file in a spreadsheet application, consider the following:
+ The file might not open with column fields fully expanded. You might need to expand the `labeling_set_id` and `label` columns to see content in those cells.
+ If the primary key column is a number, such as a `long` data type, the spreadsheet might interpret it as a number and change the value. This key value must be treated as text. To correct this problem, format all the cells in the primary key column as **Text data**.

#### Labeling file format
<a name="machine-learning-labeling-file"></a>

The labeling file that is generated by AWS Glue to teach your `FindMatches` transform uses the following format. If you generate your own file for AWS Glue, it must follow this format as well:
+ It is a comma-separated values (CSV) file. 
+ It must be encoded in `UTF-8`. If you edit the file using Microsoft Windows, it might be encoded with `cp1252`.
+ It must be in an Amazon S3 location to pass it to AWS Glue.
+ Use a moderate number of rows for each labeling task. We recommend 10–20 rows per task, although 2–30 rows per task are acceptable. Tasks larger than 50 rows are not recommended and might cause poor results or system failure.
+ If you have already-labeled data consisting of pairs of records labeled as a "match" or a "no-match", you can use it. These labeled pairs can be represented as labeling sets of size 2. In this case, label both records with, for instance, the letter "A" if they match, but label one as "A" and one as "B" if they do not match.
**Note**  
 Because it has additional columns, the labeling file has a different schema from a file that contains your source data. Place the labeling file in a different folder from any transform input CSV file so that the AWS Glue crawler does not consider it when it creates tables in the Data Catalog. Otherwise, the tables created by the AWS Glue crawler might not correctly represent your data. 
+ The first two columns (`labeling_set_id`, `label`) are required by AWS Glue. The remaining columns must match the schema of the data that is to be processed.
+ For each `labeling_set_id`, you identify all matching records by using the same label. A label is a unique string placed in the `label` column. We recommend using labels that contain simple characters, such as A, B, C, and so on. Labels are case sensitive and are entered in the `label` column.
+ Rows that contain the same `labeling_set_id` and the same label are understood to be labeled as a match.
+ Rows that contain the same `labeling_set_id` and a different label are understood to be labeled as *not* a match.
+ Rows that contain a different `labeling_set_id` are understood to be conveying no information for or against matching.

  The following is an example of labeling the data:    
<a name="table-labeling-data"></a>[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/machine-learning.html)
+ In the above example we identify John/Johnny/Jon Doe as being a match and we teach the system that these records do not match Jane Smith. Separately, we teach the system that Richard and Rich Jones are the same person, but that these records are not a match to Sarah Jones/Jones-Walker and Richie Jones Jr.
+ As you can see, the scope of the labels is limited to the `labeling_set_id`. So labels do not cross `labeling_set_id` boundaries. For example, a label "A" in `labeling_set_id` 1 does not have any relation to label "A" in `labeling_set_id` 2.
+ If a record does not have any matches within a labeling set, then assign it a unique label. For instance, Jane Smith does not match any record in labeling set ABC123, so it is the only record in that labeling set with the label of B.
+ The labeling set "GHI678" shows that a labeling set can consist of just two records which are given the same label to show that they match. Similarly, "XYZABC" shows two records given different labels to show that they do not match.
+ Note that sometimes a labeling set may contain no matches (that is, you give every record in the labeling set a different label) or a labeling set might all be "the same" (you gave them all the same label). This is okay as long as your labeling sets collectively contain examples of records that are and are not "the same" by your criteria.
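
The linked table is not reproduced here, but a labeling file consistent with the points above might look like the following; the names and the `DEF456` set ID are hypothetical:

```
labeling_set_id,label,first_name,last_name
ABC123,A,John,Doe
ABC123,A,Johnny,Doe
ABC123,A,Jon,Doe
ABC123,B,Jane,Smith
DEF456,A,Richard,Jones
DEF456,A,Rich,Jones
DEF456,B,Sarah,Jones-Walker
DEF456,C,Richie,Jones Jr.
GHI678,A,Robert,Miller
GHI678,A,Bob,Miller
XYZABC,A,Maria,Garcia
XYZABC,B,Mary,Gardner
```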

**Important**  
Confirm that the IAM role that you pass to AWS Glue has access to the Amazon S3 bucket that contains the labeling file. By convention, AWS Glue policies grant permission to Amazon S3 buckets or folders whose names are prefixed with **aws-glue-**. If your labeling files are in a different location, add permission to that location in the IAM role.

# Tuning machine learning transforms in AWS Glue
<a name="add-job-machine-learning-transform-tuning"></a>

You can tune your machine learning transforms in AWS Glue to improve the results of your data-cleansing jobs to meet your objectives. To improve your transform, you can teach it by generating a labeling set, adding labels, and then repeating these steps several times until you get your desired results. You can also tune by changing some machine learning parameters. 

For more information about machine learning transforms, see [Record matching with AWS Lake Formation FindMatches](machine-learning.md).

**Topics**
+ [Machine learning measurements](machine-learning-terminology.md)
+ [Deciding between precision and recall](machine-learning-precision-recall-tradeoff.md)
+ [Deciding Between accuracy and cost](machine-learning-accuracy-cost-tradeoff.md)
+ [Estimating the quality of matches using match confidence scores](match-scoring.md)
+ [Teaching the Find Matches transform](machine-learning-teaching.md)

# Machine learning measurements
<a name="machine-learning-terminology"></a>

To understand the measurements that are used to tune your machine learning transform, you should be familiar with the following terminology:

**True positive (TP)**  
A match in the data that the transform correctly found, sometimes called a *hit*.

**True negative (TN)**  
A nonmatch in the data that the transform correctly rejected.

**False positive (FP)**  
A nonmatch in the data that the transform incorrectly classified as a match, sometimes called a *false alarm*.

**False negative (FN)**  
A match in the data that the transform didn't find, sometimes called a *miss*.

For more information about the terminology that is used in machine learning, see [Confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) in Wikipedia.

To tune your machine learning transforms, you can change the value of the following measurements in the **Advanced properties** of the transform.
+ **Precision** measures how well the transform finds true positives among the total number of records that it identifies as positive (true positives and false positives). For more information, see [Precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) in Wikipedia.
+ **Recall** measures how well the transform finds true positives from the total records in the source data. For more information, see [Precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) in Wikipedia.
+ **Accuracy** measures how well the transform finds true positives and true negatives. Increasing accuracy requires more machine resources and cost. But it also results in increased recall. For more information, see [Accuracy and precision](https://en.wikipedia.org/wiki/Accuracy_and_precision#In_information_systems) in Wikipedia.
+ **Cost** measures how many compute resources (and thus money) are consumed to run the transform.
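
The first three measurements can be expressed directly in terms of the confusion-matrix counts defined above. The following sketch (plain Python, with hypothetical counts) shows the standard formulas:

```python
def precision(tp: int, fp: int) -> float:
    # Of all record pairs the transform called a match, what fraction really matched?
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Of all true matches in the source data, what fraction did the transform find?
    return tp / (tp + fn)

def f1_score(tp: int, fp: int, fn: int) -> float:
    # Harmonic mean of precision and recall; a common single-number summary.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical counts from an evaluation run.
tp, fp, fn = 90, 10, 30
print(precision(tp, fp))  # 0.9
print(recall(tp, fn))     # 0.75
```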

# Deciding between precision and recall
<a name="machine-learning-precision-recall-tradeoff"></a>

Each `FindMatches` transform contains a `precision-recall` parameter. You use this parameter to specify one of the following:
+ If you are more concerned about the transform falsely reporting that two records match when they actually don't match, then you should emphasize *precision*. 
+ If you are more concerned about the transform failing to detect records that really do match, then you should emphasize *recall*.

You can make this trade-off on the AWS Glue console or by using the AWS Glue machine learning API operations.
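
For example, with the AWS SDK for Python (Boto3), the trade-off travels as the `PrecisionRecallTradeoff` field of `FindMatchesParameters` in an `UpdateMLTransform` request. The sketch below only builds the request payload; the transform ID and tuning value are placeholders:

```python
def find_matches_update(transform_id: str, precision_recall: float) -> dict:
    """Build an UpdateMLTransform request body; precision_recall is in [0.0, 1.0]."""
    if not 0.0 <= precision_recall <= 1.0:
        raise ValueError("precision_recall must be between 0.0 and 1.0")
    return {
        "TransformId": transform_id,
        "Parameters": {
            "TransformType": "FIND_MATCHES",
            "FindMatchesParameters": {
                # Higher values favor precision; lower values favor recall.
                "PrecisionRecallTradeoff": precision_recall,
            },
        },
    }

# To apply it (requires AWS credentials and a real transform ID):
#   import boto3
#   boto3.client("glue").update_ml_transform(**find_matches_update("tfm-...", 0.9))
```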

**When to favor precision**  
Favor precision if you are more concerned about the risk that `FindMatches` results in a pair of records matching when they don't actually match. To favor precision, choose a *higher* precision-recall trade-off value. With a higher value, the `FindMatches` transform requires more evidence to decide that a pair of records should be matched. The transform is tuned to bias toward saying that records do not match.

For example, suppose that you're using `FindMatches` to detect duplicate items in a video catalog, and you provide a higher precision-recall value to the transform. If your transform incorrectly detects that *Star Wars: A New Hope* is the same as *Star Wars: The Empire Strikes Back*, a customer who wants *A New Hope* might be shown *The Empire Strikes Back*. This would be a poor customer experience. 

However, if the transform fails to detect that *Star Wars: A New Hope* and *Star Wars: Episode IV—A New Hope* are the same item, the customer might be confused at first but might eventually recognize them as the same. It would be a mistake, but not as bad as the previous scenario.

**When to favor recall**  
Favor recall if you are more concerned about the risk that the `FindMatches` transform results might fail to detect a pair of records that actually do match. To favor recall, choose a *lower* precision-recall trade-off value. With a lower value, the `FindMatches` transform requires less evidence to decide that a pair of records should be matched. The transform is tuned to bias toward saying that records do match.

For example, this might be a priority for a security organization. Suppose that you are matching customers against a list of known defrauders, and it is important to determine whether a customer is a defrauder. You are using `FindMatches` to match the defrauder list against the customer list. Every time `FindMatches` detects a match between the two lists, a human auditor is assigned to verify that the person is, in fact, a defrauder. Your organization might prefer to choose recall over precision. In other words, you would rather have the auditors manually review and reject some cases when the customer is not a defrauder than fail to identify that a customer is, in fact, on the defrauder list.

**How to favor both precision and recall**  
The best way to improve both precision and recall is to label more data. As you label more data, the overall accuracy of the `FindMatches` transform improves, thus improving both precision and recall. Nevertheless, even with the most accurate transform, there is always a gray area where you need to experiment with favoring precision or recall, or choose a value in the middle. 

# Deciding between accuracy and cost
<a name="machine-learning-accuracy-cost-tradeoff"></a>

Each `FindMatches` transform contains an `accuracy-cost` parameter. You can use this parameter to specify one of the following:
+ If you are more concerned with the transform accurately reporting that two records match, then you should emphasize *accuracy*.
+ If you are more concerned about the cost or speed of running the transform, then you should emphasize *lower cost*.

You can make this trade-off on the AWS Glue console or by using the AWS Glue machine learning API operations.
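
With the AWS Glue `UpdateMLTransform` API, this setting is the `AccuracyCostTradeoff` field of `FindMatchesParameters`. A minimal payload sketch (the tuning value here is illustrative):

```python
# Higher AccuracyCostTradeoff values favor accuracy; lower values favor lower cost.
parameters = {
    "TransformType": "FIND_MATCHES",
    "FindMatchesParameters": {
        "AccuracyCostTradeoff": 0.75,  # illustrative value between 0.0 and 1.0
    },
}
print(parameters["FindMatchesParameters"]["AccuracyCostTradeoff"])  # 0.75
```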

**When to favor accuracy**  
Favor accuracy if you are more concerned about the risk that the `FindMatches` results won't contain matches. To favor accuracy, choose a *higher* accuracy-cost trade-off value. With a higher value, the `FindMatches` transform requires more time to do a more thorough search for correctly matching records. Note that this parameter doesn't make it less likely to falsely call a nonmatching record pair a match. The transform is tuned to bias toward spending more time finding matches.

**When to favor cost**  
Favor cost if you are more concerned about the cost of running the `FindMatches` transform and less about how many matches are found. To favor cost, choose a *lower* accuracy-cost trade-off value. With a lower value, the `FindMatches` transform requires fewer resources to run. The transform is tuned to bias toward finding fewer matches. If the results are acceptable when favoring lower cost, use this setting.

**How to favor both accuracy and lower cost**  
It takes more machine time to examine more pairs of records to determine whether they might be matches. If you want to reduce cost without reducing quality, here are some steps you can take: 
+ Eliminate records in your data source that you aren't concerned about matching.
+ Eliminate columns from your data source that you are sure aren't useful for making a match/no-match decision. A good way of deciding this is to eliminate columns that you don't think affect your own decision about whether a set of records is “the same.”
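
As a sketch of the column-elimination step (plain Python over hypothetical records; in a Glue job the equivalent would be a projection on the DynamicFrame), keep only the columns you believe drive the match decision:

```python
# Hypothetical source records; only some columns matter for match/no-match decisions.
records = [
    {"id": 1, "primary_name": "Star Wars: A New Hope", "country": "US", "ingest_ts": "2024-01-01"},
    {"id": 2, "primary_name": "Star Wars: Episode IV - A New Hope", "country": "US", "ingest_ts": "2024-01-02"},
]

# Columns you believe influence the match decision; everything else is dropped.
MATCH_COLUMNS = {"id", "primary_name", "country"}

def project(record: dict) -> dict:
    return {k: v for k, v in record.items() if k in MATCH_COLUMNS}

trimmed = [project(r) for r in records]
print(trimmed[0])  # {'id': 1, 'primary_name': 'Star Wars: A New Hope', 'country': 'US'}
```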

# Estimating the quality of matches using match confidence scores
<a name="match-scoring"></a>

Match confidence scores estimate the quality of the matches that `FindMatches` finds. A match confidence score is between 0 and 1, where a higher score means higher similarity. Examining match confidence scores lets you distinguish between clusters of matches in which the system is highly confident (which you may decide to merge), clusters about which the system is uncertain (which you may decide to have reviewed by a human), and clusters that the system deems unlikely (which you may decide to reject).
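
A minimal sketch of this triage, with hypothetical thresholds (0.9 and 0.6 here; choose cutoffs appropriate to your data):

```python
def triage(match_confidence_score: float, merge_at: float = 0.9, review_at: float = 0.6) -> str:
    # High-confidence clusters can be merged automatically, mid-range clusters
    # routed to a human reviewer, and low-confidence clusters rejected.
    if match_confidence_score >= merge_at:
        return "merge"
    if match_confidence_score >= review_at:
        return "review"
    return "reject"

print(triage(0.98))  # merge
print(triage(0.72))  # review
print(triage(0.31))  # reject
```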

You may want to adjust your training data in situations where you see a high match confidence score but determine that the records don't actually match, or where you see a low score but determine that the records do, in fact, match.

Confidence scores are particularly useful for large industrial datasets, where it is infeasible to review every `FindMatches` decision.

Match confidence scores are available in AWS Glue version 2.0 or later.

## Generating match confidence scores
<a name="specifying-match-scoring"></a>

You can generate match confidence scores by setting the Boolean parameter `computeMatchConfidenceScores` to `True` when calling the `FindMatches` or `FindIncrementalMatches` API.

AWS Glue adds a new column `match_confidence_score` to the output.

## Match scoring examples
<a name="match-scoring-examples"></a>

For example, consider the following matched records:

**Score >= 0.9**  
Summary of matched records:

```
  primary_id  |   match_id  | match_confidence_score

3281355037663    85899345947   0.9823658302132061
1546188247619    85899345947   0.9823658302132061
```

Details:

![\[Details of the matched records with a score of 0.9 or higher.\]](http://docs.aws.amazon.com/glue/latest/dg/images/match_score1.png)


From this example, we can see that the two records are very similar and share `display_position`, `primary_name`, and `street_name`. 

**Score >= 0.8 and score < 0.9**  
Summary of matched records:

```
  primary_id  |   match_id  | match_confidence_score

309237680432     85899345928   0.8309852373674638
3590592666790    85899345928   0.8309852373674638
343597390617     85899345928   0.8309852373674638
249108124906     85899345928   0.8309852373674638
463856477937     85899345928   0.8309852373674638
```

Details:

![\[Details of the matched records with a score between 0.8 and 0.9.\]](http://docs.aws.amazon.com/glue/latest/dg/images/match_score2.png)


From this example, we can see that these records share the same `primary_name` and `country`.

**Score >= 0.6 and score < 0.7**  
Summary of matched records:

```
  primary_id  |   match_id  | match_confidence_score

2164663519676    85899345930   0.6971099896480333
 317827595278    85899345930   0.6971099896480333
 472446424341    85899345930   0.6971099896480333
3118146262932    85899345930   0.6971099896480333
 214748380804    85899345930   0.6971099896480333
```

Details:

![\[Details of the matched records with a score between 0.6 and 0.7.\]](http://docs.aws.amazon.com/glue/latest/dg/images/match_score3.png)


From this example, we can see that these records share only the same `primary_name`.
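
In each of these examples, rows that share a `match_id` belong to one matched cluster with a single confidence score. A small sketch that regroups the output rows (values taken from the first example above) into clusters:

```python
from collections import defaultdict

# (primary_id, match_id, match_confidence_score) rows from the first example above.
rows = [
    (3281355037663, 85899345947, 0.9823658302132061),
    (1546188247619, 85899345947, 0.9823658302132061),
]

clusters = defaultdict(list)
for primary_id, match_id, score in rows:
    clusters[match_id].append(primary_id)

print(dict(clusters))  # {85899345947: [3281355037663, 1546188247619]}
```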

For more information, see:
+ [Step 5: Add and run a job with your machine learning transform](machine-learning-transform-tutorial.md#ml-transform-tutorial-add-job)
+ PySpark: [FindMatches class](aws-glue-api-crawler-pyspark-transforms-findmatches.md)
+ PySpark: [FindIncrementalMatches class](aws-glue-api-crawler-pyspark-transforms-findincrementalmatches.md)
+ Scala: [FindMatches class](glue-etl-scala-apis-glue-ml-findmatches.md)
+ Scala: [FindIncrementalMatches class](glue-etl-scala-apis-glue-ml-findincrementalmatches.md)

# Teaching the Find Matches transform
<a name="machine-learning-teaching"></a>

Each `FindMatches` transform must be taught what should be considered a match and what should not be considered a match. You teach your transform by adding labels to a file and uploading your choices to AWS Glue. 

You can orchestrate this labeling on the AWS Glue console or by using the AWS Glue machine learning API operations.

**How many times should I add labels? How many labels do I need?**  
The answers to these questions are mostly up to you. You must evaluate whether `FindMatches` is delivering the level of accuracy that you need and whether you think the extra labeling effort is worth it for you. The best way to decide this is to look at the “Precision,” “Recall,” and “Area under the precision recall curve” metrics that you can generate when you choose **Estimate quality** on the AWS Glue console. After you label more sets of tasks, rerun these metrics and verify whether they have improved. If, after labeling a few sets of tasks, you don't see improvement on the metric that you are focusing on, the transform quality might have reached a plateau.

**Why are both true positive and true negative labels needed?**  
The `FindMatches` transform needs both positive and negative examples to learn what you think is a match. If you are labeling `FindMatches`-generated training data (for example, using the **I do not have labels** option), `FindMatches` tries to generate a set of “label set ids” for you. Within each task, you give the same “label” to some records and different “labels” to other records. In other words, the tasks generally are not either all the same or all different (but it's okay if a particular task is all “the same” or all “not the same”).

If you are teaching your `FindMatches` transform using the **Upload labels from S3** option, try to include both examples of matching and nonmatching records. It's acceptable to have only one type. These labels help you build a more accurate `FindMatches` transform, but you still need to label some records that you generate using the **Generate labeling file** option.

**How can I enforce that the transform matches exactly as I taught it?**  
The `FindMatches` transform learns from the labels that you provide, so it might generate record pairs that don't respect the provided labels. To enforce that the `FindMatches` transform respects your labels, select **EnforceProvidedLabels** in **FindMatchesParameters**.

**What techniques can you use when an ML transform identifies items as matches that are not true matches?**  
You can use the following techniques:
+ Increase the `precisionRecallTradeoff` to a higher value. This eventually results in finding fewer matches, but it should also break up your big cluster when it reaches a high enough value. 
+ Take the output rows corresponding to the incorrect results and reformat them as a labeling set (removing the `match_id` column and adding `labeling_set_id` and `label` columns). If necessary, break them up (subdivide) into multiple labeling sets to ensure that the labeler can keep each labeling set in mind while assigning labels. Then, correctly label the matching sets, upload the label file, and append it to your existing labels. This might teach your transform enough about what it is looking for to understand the pattern. 
+ (Advanced) Finally, look at that data to see if there is a pattern that you can detect that the system is not noticing. Preprocess that data using standard AWS Glue functions to *normalize* the data. Highlight what you want the algorithm to learn from by separating data that you know to be differently important into their own columns. Or construct combined columns from columns whose data you know to be related. 
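
The reformatting described in the second technique can be sketched with the standard `csv` module. The output rows and column names here are hypothetical; one `labeling_set_id` is assigned per incorrect cluster, and the `label` column is left empty for the labeler to fill in:

```python
import csv
import io

# Hypothetical FindMatches output rows that a reviewer judged incorrect.
bad_rows = [
    {"match_id": "85899345947", "id": "3281355037663", "title": "A New Hope"},
    {"match_id": "85899345947", "id": "1546188247619", "title": "Empire Strikes Back"},
]

buffer = io.StringIO()
fields = ["labeling_set_id", "label", "id", "title"]  # match_id removed, label columns added
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
for row in bad_rows:
    writer.writerow({
        "labeling_set_id": row["match_id"],  # one labeling set per incorrect cluster
        "label": "",                         # left blank for the labeler to fill in
        "id": row["id"],
        "title": row["title"],
    })

print(buffer.getvalue())
```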

# Working with machine learning transforms
<a name="console-machine-learning-transforms"></a>

You can use AWS Glue to create custom machine learning transforms that can be used to cleanse your data. You can use these transforms when you create a job on the AWS Glue console. 

For information about how to create a machine learning transform, see [Record matching with AWS Lake Formation FindMatches](machine-learning.md).

**Topics**
+ [Transform properties](#console-machine-learning-properties)
+ [Adding and editing machine learning transforms](#console-machine-learning-transforms-actions)
+ [Viewing transform details](#console-machine-learning-transforms-details)
+ [Teach transforms using labels](#console-machine-learning-transforms-teaching-transforms)

## Transform properties
<a name="console-machine-learning-properties"></a>

To view an existing machine learning transform, sign in to the AWS Management Console, and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). In the navigation pane under **Data Integration and ETL**, choose **Data classification tools > Record Matching**.

Each transform has the following properties:

**Transform name**  
The unique name you gave the transform when you created it.

**ID**  
A unique identifier of the transform. 

**Label count**  
The number of labels in the labeling file that was provided to help teach the transform. 

**Status**  
Indicates whether the transform is **Ready** or **Needs training**. To run a machine learning transform successfully in a job, it must be **Ready**. 

**Created**  
The date the transform was created.

**Modified**  
The date the transform was last updated.

**Description**  
The description supplied for the transform, if one was provided.

**AWS Glue version**  
The version of AWS Glue used.

**Run ID**  
A unique identifier that AWS Glue assigns to each run of the task.

**Task type**  
The type of machine learning transform; for example, **Find matching records**.

**Status**  
Indicates the status of the task run. Possible statuses include:  
+ Starting
+ Running
+ Stopping
+ Stopped
+ Succeeded
+ Failed
+ Timeout

**Error**  
If the status is Failed, an error message is displayed describing the reason for the failure.

## Adding and editing machine learning transforms
<a name="console-machine-learning-transforms-actions"></a>

 You can view, delete, set up and teach, or tune a transform on the AWS Glue console. Select the check box next to the transform in the list, choose **Action**, and then choose the action that you want to take. 

### Creating a new ML transform
<a name="w2aac37c11c24c23c11b5"></a>

 To add a new machine learning transform, choose **Create transform**. Follow the instructions in the **Add job** wizard. For more information, see [Record matching with AWS Lake Formation FindMatches](machine-learning.md). 

#### Step 1. Set transform properties.
<a name="w2aac37c11c24c23c11b5b7"></a>

1. Enter the name and description (optional).

1. Optionally, set security configuration. See [Using data encryption with machine learning transforms](#ml_transform_sec_config). 

1. Optionally, set Task execution settings. Task execution settings allow you to customize how the task is run. Select the Worker type, number of workers, task timeout (in minutes), the number of retries, and the AWS Glue version.

1. Optionally, set Tags. Tags are labels that you can assign to an AWS resource. Each tag consists of a key and an optional value. Tags can be used to search and filter your resource or track your AWS costs.

#### Step 2. Choose table and primary key.
<a name="w2aac37c11c24c23c11b5b9"></a>

1. Choose the AWS Glue Catalog database and table.

1. Choose a primary key from the selected table. The primary key column typically contains a unique identifier for every record in the data source. 

#### Step 3. Select tuning options.
<a name="w2aac37c11c24c23c11b5c11"></a>

1.  For **Recall vs. precision**, choose the tuning value to tune the transform to favor recall or precision. By default, **Balanced** is selected, but you can choose to favor recall or favor precision, or choose **Custom** and enter a value between 0.0 and 1.0 (inclusive). 

1.  For **Lower cost vs. accuracy**, choose the tuning value to favor lower cost or accuracy, or choose **Custom** and enter a value between 0.0 and 1.0 (inclusive). 

1.  For **Match enforcement**, choose **Force output to match labels** if you want to teach the ML transform by forcing the output to match the labels used. 

#### Step 4. Review and create.
<a name="w2aac37c11c24c23c11b5c13"></a>

1.  Review the options for steps 1 – 3. 

1.  Choose **Edit** for any step that needs to be modified. Choose **Create transform** to complete the create transform wizard. 

### Using data encryption with machine learning transforms
<a name="ml_transform_sec_config"></a>

When adding a machine learning transform to AWS Glue, you can optionally specify a security configuration that is associated with the data source or data target. If the Amazon S3 bucket used to store the data is encrypted with a security configuration, specify the same security configuration when creating the transform.

You can also choose to use server-side encryption with AWS KMS (SSE-KMS) to encrypt the model and labels to prevent unauthorized persons from inspecting it. If you choose this option, you're prompted to choose the AWS KMS key by name, or you can choose **Enter a key ARN**. If you choose to enter the ARN for the KMS key, a second field appears where you can enter the KMS key ARN.

**Note**  
Currently, ML transforms that use a custom encryption key aren't supported in the following Regions:  
Asia Pacific (Osaka) - `ap-northeast-3`

## Viewing transform details
<a name="console-machine-learning-transforms-details"></a>

### Viewing transform properties
<a name="console-machine-learning-transforms-details"></a>

The **Transform properties** page includes attributes of your transform. It shows you the details about the transform definition, including the following:
+ **Transform name** shows the name of the transform.
+ **Type** lists the type of the transform.
+ **Status** displays whether the transform is ready to be used in a script or job.
+ **Force output to match labels** displays whether the transform forces the output to match the labels provided by the user.
+ **Spark version** is related to the AWS Glue version you chose in the **Task run properties** when adding the transform. AWS Glue 1.0 and Spark 2.4 are recommended for most customers. For more information, see [AWS Glue Versions](https://docs.aws.amazon.com/glue/latest/dg/release-notes.html#release-notes-versions).

### History, Estimate quality and Tags tabs
<a name="w2aac37c11c24c23c13b5"></a>

 Transform details include the information that you defined when you created the transform. To view the details of a transform, select the transform in the **Machine learning transforms** list, and review the information on the following tabs: 
+ History
+ Estimate quality
+ Tags

#### History
<a name="console-machine-learning-transforms-history"></a>

The **History** tab shows your transform task run history. Several types of tasks are run to teach a transform. For each task, the run metrics include the following:
+ **Run ID** is an identifier created by AWS Glue for each run of this task.
+ **Task type** shows the type of task run.
+ **Status** shows the success of each task listed with the most recent run at the top.
+ **Error** shows the details of an error message if the run was not successful.
+ **Start time** shows the date and time (local time) that the task started.
+ **End time** shows the date and time (local time) that the task ended.
+ **Logs** links to the logs written to `stdout` for this job run.

  The **Logs** link takes you to Amazon CloudWatch Logs. There you can view the details about the tables that were created in the AWS Glue Data Catalog and any errors that were encountered. You can manage your log retention period on the CloudWatch console. The default log retention is `Never Expire`. For more information about how to change the retention period, see [Change Log Data Retention in CloudWatch Logs](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Working-with-log-groups-and-streams.html#SettingLogRetention) in the *Amazon CloudWatch Logs User Guide*.
+ **Label file** shows a link to Amazon S3 for a generated labeling file.

#### Estimate quality
<a name="console-machine-learning-transforms-metrics"></a>

 The **Estimate quality** tab shows the metrics that you use to measure the quality of the transform. Estimates are calculated by comparing the transform match predictions using a subset of your labeled data against the labels you have provided. These estimates are approximate. You can invoke an **Estimate quality** task run from this tab.

The **Estimate quality** tab shows the metrics from the last **Estimate quality** run including the following properties:
+ **Area under the Precision-Recall curve** is a single number estimating the upper bound of the overall quality of the transform. It is independent of the choice made for the precision-recall parameter. Higher values indicate that you have a more attractive precision-recall tradeoff. 
+ **Precision** estimates how often the transform is correct when it predicts a match.
+ **Recall upper limit** estimates how often, for an actual match, the transform predicts the match.
+ **F1** estimates the transform's accuracy between 0 and 1, where 1 is the best accuracy. For more information, see [F1 score](https://en.wikipedia.org/wiki/F1_score) in Wikipedia.
+ The **Column importance** table shows the column names and importance score for each column. Column importance helps you understand how columns contribute to your model by identifying which columns in your records are being used the most to do the matching. This data may prompt you to add to or change your label set to raise or lower the importance of columns.

  The Importance column provides a numerical score for each column, as a decimal not greater than 1.0.

For information about understanding quality estimates versus true quality, see [Quality estimates versus end-to-end (true) quality](#console-machine-learning-quality-estimates-true-quality).

For more information about tuning your transform, see [Tuning machine learning transforms in AWS Glue](add-job-machine-learning-transform-tuning.md).

#### Quality estimates versus end-to-end (true) quality
<a name="console-machine-learning-quality-estimates-true-quality"></a>

AWS Glue estimates the quality of your transform by presenting the internal machine-learned model with a number of pairs of records that you provided matching labels for but that the model has not seen before. These quality estimates are a function of the quality of the machine-learned model (which is influenced by the number of records that you label to “teach” the transform). The end-to-end, or *true* recall (which is not automatically calculated by the `ML transform`) is also influenced by the `ML transform` filtering mechanism that proposes a wide variety of possible matches to the machine-learned model. 

You can tune this filtering method primarily by specifying the **Lower Cost-Accuracy** tuning value. As the tuning value moves closer to favoring **Accuracy**, the system does a more thorough and expensive search for pairs of records that might be matches. More pairs of records are fed to your machine-learned model, and your `ML transform`'s end-to-end or true recall gets closer to the estimated recall metric. Consequently, changes in the end-to-end quality of your matches caused by changes in the cost/accuracy tradeoff will typically not be reflected in the quality estimate.

#### Tags
<a name="w2aac37c11c24c23c13b5c13"></a>

 Tags are labels that you can assign to an AWS resource. Each tag consists of a key and an optional value. Tags can be used to search and filter your resource or track your AWS costs. 

## Teach transforms using labels
<a name="console-machine-learning-transforms-teaching-transforms"></a>

 You can teach your ML transform using labels (examples) by choosing **Teach transform** from the ML transform details page. When you teach your machine learning algorithm by providing examples (called labels), you can choose existing labels to use, or create a labeling file. 

![\[The screenshot shows a wizard screen for Teach the transform using labels.\]](http://docs.aws.amazon.com/glue/latest/dg/images/machine-learning-teach-transform.png)

+  **Labeling** – If you have labels, choose **I have labels**. If you do not have labels, you can still continue with the next step in generating a labeling file. 
+  **Generate labeling file** – AWS Glue extracts records from your source data and suggests potential matching records. You choose the Amazon S3 bucket to store the generated label file. Choose **Generate labeling file** to start the process. When done, choose **Download labeling file**. The downloaded file has a column for labels where you can fill in the labels. 
+  **Upload labels from Amazon S3** – Choose the completed labeling file from the Amazon S3 bucket where the label file is stored. Then, choose to either append the labels to your existing labels or to overwrite your existing labels. Choose **Upload labeling file from Amazon S3**. 

# Tutorial: Creating a machine learning transform with AWS Glue
<a name="machine-learning-transform-tutorial"></a>

This tutorial guides you through the actions to create and manage a machine learning (ML) transform using AWS Glue. Before using this tutorial, you should be familiar with using the AWS Glue console to add crawlers and jobs and edit scripts. You should also be familiar with finding and downloading files on the Amazon Simple Storage Service (Amazon S3) console.

In this example, you create a `FindMatches` transform to find matching records, teach it how to identify matching and nonmatching records, and use it in an AWS Glue job. The AWS Glue job writes a new Amazon S3 file with an additional column named `match_id`. 

The source data used by this tutorial is a file named `dblp_acm_records.csv`. This file is a modified version of academic publications (DBLP and ACM) available from the original [DBLP ACM dataset](https://doi.org/10.3886/E100843V2). The `dblp_acm_records.csv` file is a comma-separated values (CSV) file in UTF-8 format with no byte-order mark (BOM). 

A second file, `dblp_acm_labels.csv`, is an example labeling file that contains both matching and nonmatching records used to teach the transform as part of the tutorial. 

**Topics**
+ [Step 1: Crawl the source data](#ml-transform-tutorial-crawler)
+ [Step 2: Add a machine learning transform](#ml-transform-tutorial-create)
+ [Step 3: Teach your machine learning transform](#ml-transform-tutorial-teach)
+ [Step 4: Estimate the quality of your machine learning transform](#ml-transform-tutorial-estimate-quality)
+ [Step 5: Add and run a job with your machine learning transform](#ml-transform-tutorial-add-job)
+ [Step 6: Verify output data from Amazon S3](#ml-transform-tutorial-data-output)

## Step 1: Crawl the source data
<a name="ml-transform-tutorial-crawler"></a>

First, crawl the source Amazon S3 CSV file to create a corresponding metadata table in the Data Catalog.

**Important**  
To direct the crawler to create a table for only the CSV file, store the CSV source data in a different Amazon S3 folder from other files.

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Crawlers**, **Add crawler**. 

1. Follow the wizard to create and run a crawler named `demo-crawl-dblp-acm` with output to database `demo-db-dblp-acm`. When running the wizard, create the database `demo-db-dblp-acm` if it doesn't already exist. Choose an Amazon S3 include path to sample data in the current AWS Region. For example, for `us-east-1`, the Amazon S3 include path to the source file is `s3://ml-transforms-public-datasets-us-east-1/dblp-acm/records/dblp_acm_records.csv`. 

   If successful, the crawler creates the table `dblp_acm_records_csv` with the following columns: id, title, authors, venue, year, and source.

## Step 2: Add a machine learning transform
<a name="ml-transform-tutorial-create"></a>

Next, add a machine learning transform that is based on the schema of your data source table created by the crawler named `demo-crawl-dblp-acm`.

1. On the AWS Glue console, in the navigation pane under **Data Integration and ETL**, choose **Data classification tools > Record Matching**, then **Add transform**. Follow the wizard to create a `Find matches` transform with the following properties. 

   1. For **Transform name**, enter **demo-xform-dblp-acm**. This is the name of the transform that is used to find matches in the source data.

   1. For **IAM role**, choose an IAM role that has permission to the Amazon S3 source data, labeling file, and AWS Glue API operations. For more information, see [Create an IAM Role for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html) in the *AWS Glue Developer Guide*.

   1. For **Data source**, choose the table named **dblp_acm_records_csv** in database **demo-db-dblp-acm**.

   1. For **Primary key**, choose the primary key column for the table, **id**.

1. In the wizard, choose **Finish** and return to the **ML transforms** list.

## Step 3: Teach your machine learning transform
<a name="ml-transform-tutorial-teach"></a>

Next, you teach your machine learning transform using the tutorial sample labeling file.

You can't use a machine learning transform in an extract, transform, and load (ETL) job until its status is **Ready for use**. To get your transform ready, you must teach it how to identify matching and nonmatching records by providing examples of matching and nonmatching records. To teach your transform, you can **Generate a label file**, add labels, and then **Upload label file**. In this tutorial, you can use the example labeling file named `dblp_acm_labels.csv`. For more information about the labeling process, see [Labeling](machine-learning.md#machine-learning-labeling).

1. On the AWS Glue console, in the navigation pane, choose **Record Matching**.

1. Choose the `demo-xform-dblp-acm` transform, and then choose **Action**, **Teach**. Follow the wizard to teach your `Find matches` transform. 

1. On the transform properties page, choose **I have labels**. Choose an Amazon S3 path to the sample labeling file in the current AWS Region. For example, for `us-east-1`, upload the provided labeling file from the Amazon S3 path `s3://ml-transforms-public-datasets-us-east-1/dblp-acm/labels/dblp_acm_labels.csv` with the option to **overwrite** existing labels. The labeling file must be located in Amazon S3 in the same Region as the AWS Glue console.

   When you upload a labeling file, a task is started in AWS Glue to add or overwrite the labels used to teach the transform how to process the data source.

1. On the final page of the wizard, choose **Finish**, and return to the **ML transforms** list.

## Step 4: Estimate the quality of your machine learning transform
<a name="ml-transform-tutorial-estimate-quality"></a>

Next, you can estimate the quality of your machine learning transform. The quality depends on how much labeling you have done. For more information about estimating quality, see [Estimate quality](console-machine-learning-transforms.md#console-machine-learning-transforms-metrics).

1. On the AWS Glue console, in the navigation pane under **Data Integration and ETL**, choose **Data classification tools > Record Matching**. 

1. Choose the `demo-xform-dblp-acm` transform, and choose the **Estimate quality** tab. This tab displays the current quality estimates, if available, for the transform. 

1. Choose **Estimate quality** to start a task to estimate the quality of the transform. The accuracy of the quality estimate is based on the labeling of the source data.

1. Navigate to the **History** tab. In this pane, task runs are listed for the transform, including the **Estimating quality** task. For more details about the run, choose **Logs**. Check that the run status is **Succeeded** when it finishes.

## Step 5: Add and run a job with your machine learning transform
<a name="ml-transform-tutorial-add-job"></a>

In this step, you use your machine learning transform to add and run a job in AWS Glue. When the transform `demo-xform-dblp-acm` is **Ready for use**, you can use it in an ETL job.

1. On the AWS Glue console, in the navigation pane, choose **Jobs**.

1. Choose **Add job**, and follow the steps in the wizard to create an ETL Spark job with a generated script. Choose the following property values for your transform:

   1. For **Name**, choose the example job in this tutorial, **demo-etl-dblp-acm**.

   1. For **IAM role**, choose an IAM role with permission to the Amazon S3 source data, labeling file, and AWS Glue API operations. For more information, see [Create an IAM Role for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html) in the *AWS Glue Developer Guide*.

   1. For **ETL language**, choose **Scala**. This is the programming language in the ETL script.

   1. For **Script file name**, choose **demo-etl-dblp-acm**. This is the file name of the Scala script (same as the job name).

   1. For **Data source**, choose **dblp\_acm\_records\_csv**. The data source you choose must match the machine learning transform data source schema.

   1. For **Transform type**, choose **Find matching records** to create a job using a machine learning transform.

   1. Clear **Remove duplicate records**. Don't remove duplicate records, because the output records have an additional `match_id` field added. 

   1. For **Transform**, choose **demo-xform-dblp-acm**, the machine learning transform used by the job.

   1. For **Create tables in your data target**, choose to create tables with the following properties:
      + **Data store type** — **Amazon S3**
      + **Format** — **CSV**
      + **Compression type** — **None**
      + **Target path** — The Amazon S3 path where the output of the job is written (in the current console AWS Region)

1. Choose **Save job and edit script** to display the script editor page.

1. Edit the script to add a statement that writes the job output to the **Target path** as a single partition file. Add this statement immediately following the statement that runs the `FindMatches` transform. The statement is similar to the following.

   ```
   val single_partition = findmatches1.repartition(1) 
   ```

   You must modify the `.writeDynamicFrame(findmatches1)` statement to write the output as `.writeDynamicFrame(single_partition)`. 

1. After you edit the script, choose **Save**. The modified script looks similar to the following code, but customized for your environment.

   ```
   import com.amazonaws.services.glue.GlueContext
   import com.amazonaws.services.glue.errors.CallSite
   import com.amazonaws.services.glue.ml.FindMatches
   import com.amazonaws.services.glue.util.GlueArgParser
   import com.amazonaws.services.glue.util.Job
   import com.amazonaws.services.glue.util.JsonOptions
   import org.apache.spark.SparkContext
   import scala.collection.JavaConverters._
   
   object GlueApp {
     def main(sysArgs: Array[String]) {
       val spark: SparkContext = new SparkContext()
       val glueContext: GlueContext = new GlueContext(spark)
       // @params: [JOB_NAME]
       val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
       Job.init(args("JOB_NAME"), glueContext, args.asJava)
       // @type: DataSource
       // @args: [database = "demo-db-dblp-acm", table_name = "dblp_acm_records_csv", transformation_ctx = "datasource0"]
       // @return: datasource0
       // @inputs: []
       val datasource0 = glueContext.getCatalogSource(database = "demo-db-dblp-acm", tableName = "dblp_acm_records_csv", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
       // @type: FindMatches
       // @args: [transformId = "tfm-123456789012", emitFusion = false, survivorComparisonField = "<primary_id>", transformation_ctx = "findmatches1"]
       // @return: findmatches1
       // @inputs: [frame = datasource0]
       val findmatches1 = FindMatches.apply(frame = datasource0, transformId = "tfm-123456789012", transformationContext = "findmatches1", computeMatchConfidenceScores = true)
   
       // Repartition the previous DynamicFrame into a single partition.
       val single_partition = findmatches1.repartition(1)
   
       // @type: DataSink
       // @args: [connection_type = "s3", connection_options = {"path": "s3://aws-glue-ml-transforms-data/sal"}, format = "csv", transformation_ctx = "datasink2"]
       // @return: datasink2
       // @inputs: [frame = single_partition]
       val datasink2 = glueContext.getSinkWithFormat(connectionType = "s3", options = JsonOptions("""{"path": "s3://aws-glue-ml-transforms-data/sal"}"""), transformationContext = "datasink2", format = "csv").writeDynamicFrame(single_partition)
       Job.commit()
     }
   }
   ```

1. Choose **Run job** to start the job run. Check the status of the job in the jobs list. When the job finishes, on the **ML transform** **History** tab, a new **Run ID** row of type **ETL job** is added.

1. Navigate to the **Jobs**, **History** tab. In this pane, job runs are listed. For more details about the run, choose **Logs**. Check that the run status is **Succeeded** when it finishes.

## Step 6: Verify output data from Amazon S3
<a name="ml-transform-tutorial-data-output"></a>

In this step, you check the output of the job run in the Amazon S3 bucket that you chose when you added the job. You can download the output file to your local machine and verify that matching records were identified.

1. Open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

1. Download the target output file of the job `demo-etl-dblp-acm`. Open the file in a spreadsheet application (you might need to add a file extension `.csv` for the file to properly open).

   The following image shows an excerpt of the output in Microsoft Excel.  
![\[Excel spreadsheet showing the output of the transform.\]](http://docs.aws.amazon.com/glue/latest/dg/images/demo_output_dblp_acm.png)

   The data source and target file both have 4,911 records. However, the `Find matches` transform adds another column named `match_id` to identify matching records in the output. Rows with the same `match_id` are considered matching records. The `match_confidence_score` is a number between 0 and 1 that provides an estimate of the quality of matches found by `Find matches`.

1. Sort the output file by `match_id` to easily see which records are matches. Compare the values in the other columns to see if you agree with the results of the `Find matches` transform. If you don't, you can continue to teach the transform by adding more labels. 

   You can also sort the file by another field, such as `title`, to see if records with similar titles have the same `match_id`. 
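
The kind of inspection described above can also be scripted. The following stdlib-only Python sketch groups output rows by `match_id` to list the clusters of matching records; the column names follow the tutorial, but the sample rows are hypothetical stand-ins for the real job output:

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt of the job output. The match_id and
# match_confidence_score columns are added by the Find matches transform.
sample_csv = """id,title,match_id,match_confidence_score
rec1,Efficient Query Processing,0,0.97
rec2,Efficient query processing,0,0.97
rec3,Streaming Joins,1,0.88
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Group record IDs by match_id: rows that share a match_id are
# considered matching records.
clusters = defaultdict(list)
for row in rows:
    clusters[row["match_id"]].append(row["id"])

print(dict(clusters))  # → {'0': ['rec1', 'rec2'], '1': ['rec3']}
```

Grouping this way makes it easy to spot clusters whose `match_confidence_score` is low; those are good candidates for additional labeling.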

# Finding incremental matches
<a name="machine-learning-incremental-matches"></a>

The Find matches feature allows you to identify duplicate or matching records in your dataset, even when the records don’t have a common unique identifier and no fields match exactly. The initial release of the Find matches transform identified matching records within a single dataset. When you added new data to the dataset, you had to merge it with the existing clean dataset and rerun matching against the complete merged dataset.

The incremental matching feature makes it simpler to match incremental records against existing matched datasets. Suppose that you want to match prospects data with existing customer datasets. The incremental match capability provides you the flexibility to match hundreds of thousands of new prospects with an existing database of prospects and customers by merging the results into a single database or table. By matching only between the new and existing datasets, the find incremental matches optimization reduces computation time, which also reduces cost.
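
The cost argument can be made concrete with rough arithmetic. The following is an illustrative back-of-the-envelope sketch only, not the service's actual cost model (Find matches does not literally compare every pair of records), but it shows why restricting comparisons to new-versus-existing records shrinks the work:

```python
# Hypothetical dataset sizes for illustration.
existing = 1_000_000   # records in the already-matched dataset
new = 100_000          # incremental records to match

# Candidate pairs if you re-match the full merged dataset.
full_rematch_pairs = (existing + new) * (existing + new - 1) // 2

# Candidate pairs if you only match new records against existing ones.
incremental_pairs = existing * new

print(incremental_pairs / full_rematch_pairs)  # roughly 0.17
```

With these example sizes, the incremental approach considers about one sixth of the candidate pairs that a full re-match would.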

The usage of incremental matching is similar to Find matches as described in [Tutorial: Creating a machine learning transform with AWS Glue](machine-learning-transform-tutorial.md). This topic identifies only the differences with incremental matching.

For more information, see the blog post on [Incremental data matching](https://aws.amazon.com/blogs/big-data/incremental-data-matching-using-aws-lake-formation/).

## Running an incremental matching job
<a name="machine-learning-incremental-matches-add"></a>

For the following procedure, suppose the following: 
+ You have crawled the existing dataset into the table *first\_records*. The *first\_records* dataset must be a matched dataset, or the output of the matching job.
+ You have created and trained a Find matches transform with AWS Glue version 2.0. This is the only version of AWS Glue that supports incremental matches.
+ The ETL language is Scala. Note that Python is also supported.
+ The model already generated is called `demo-xform`.

1. Crawl the incremental dataset to the table *second\_records*.

1. On the AWS Glue console, in the navigation pane, choose **Jobs**.

1. Choose **Add job**, and follow the steps in the wizard to create an ETL Spark job with a generated script. Choose the following property values for your transform:

   1. For **Name**, choose **demo-etl**.

   1. For **IAM role**, choose an IAM role with permission to the Amazon S3 source data, labeling file, and [AWS Glue API operations](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html).

   1. For **ETL language**, choose **Scala**.

   1. For **Script file name**, choose **demo-etl**. This is the file name of the Scala script.

   1. For **Data source**, choose **first\_records**. The data source you choose must match the machine learning transform data source schema.

   1. For **Transform type**, choose **Find matching records** to create a job using a machine learning transform.

   1. Select the incremental matching option, and for **Data Source**, select the table named **second\_records**.

   1. For **Transform**, choose **demo-xform**, the machine learning transform used by the job.

   1. Choose **Create tables in your data target** or **Use tables in the data catalog and update your data target**.

1. Choose **Save job and edit script** to display the script editor page.

1. Choose **Run job** to start the job run.

# Using FindMatches in a visual job
<a name="find-matches-visual-job"></a>

 To use the **FindMatches** transform in AWS Glue Studio, use the **Custom Transform** node that invokes the FindMatches API. For more information about how to use a custom transform, see [Creating a custom transformation](https://docs.aws.amazon.com/glue/latest/ug/transforms-custom.html). 

**Note**  
 Currently, the FindMatches API only works with `Glue 2.0`. In order to run a job with the Custom transform that invokes the FindMatches API, ensure the AWS Glue version is `Glue 2.0` in the **Job details** tab. If the version of AWS Glue is not `Glue 2.0`, the job will fail at runtime with the following error message: “cannot import name 'FindMatches' from 'awsglueml.transforms'”. 

## Prerequisites
<a name="adding-find-matches-to-a-visual-job-prerequisites"></a>
+  In order to use the **Find Matches** transform, open the AWS Glue Studio console at [https://console.aws.amazon.com/gluestudio/](https://console.aws.amazon.com/gluestudio/). 
+  Create a machine learning transform. When created, a transformId is generated. You will need this ID for the steps below. For more information on how to create a machine learning transform, see [Adding and editing machine learning transforms](https://docs.aws.amazon.com/glue/latest/dg/console-machine-learning-transforms.html#console-machine-learning-transforms-actions). 

## Adding a FindMatches transform
<a name="adding-find-matches-to-a-visual-job"></a>

**To add a FindMatches transform:**

1.  In the AWS Glue Studio job editor, open the Resource panel by choosing the cross symbol in the upper left-hand corner of the visual job graph, and then choose a data source from the **Data** tab. This is the data source you want to check for matches.   
![\[The screenshot shows a cross symbol inside a circle. When you click on this in the visual job editor, the resource panel opens.\]](http://docs.aws.amazon.com/glue/latest/dg/images/resource-panel-blank-canvas.png)

1.  Choose the data source node, then open the Resource panel by choosing the cross symbol in the upper left-hand corner of the visual job graph and search for 'custom transform'. Choose the **Custom Transform** node to add it to the graph. The **Custom Transform** node is linked to the data source node. If it is not, click the **Custom Transform** node, choose the **Node properties** tab, and then under **Node parents**, choose the data source. 

1.  Click the **Custom Transform** node in the visual graph, then choose the **Node properties** tab and name the custom transform. Renaming the transform makes it easy to identify in the visual graph. 

1.  Choose the **Transform** tab, where you can edit the code block. This is where the code to invoke the FindMatches API can be added.   
![\[The screenshot shows the code block in the Transform tab when the Custom Transform node is selected.\]](http://docs.aws.amazon.com/glue/latest/dg/images/custom-transform-code-block.png)

    The code block contains pre-populated code to get you started. Overwrite the pre-populated code with the template below. The template has a placeholder for the **transformId**, which you replace with the ID of your machine learning transform. 

   ```
   def MyTransform (glueContext, dfc) -> DynamicFrameCollection:
       dynf = dfc.select(list(dfc.keys())[0])
       from awsglueml.transforms import FindMatches
       findmatches = FindMatches.apply(frame = dynf, transformId = "<your id>")
       return(DynamicFrameCollection({"FindMatches": findmatches}, glueContext))
   ```

1.  Click the **Custom Transform** node in the visual graph, then open the Resource panel by clicking on the cross symbol in the upper left-hand corner of the visual job graph and search for 'Select From Collection'. There is no need to change the default selection since there is only one DynamicFrame in the collection. 

1.  You can continue adding transformations or store the result, which is now enriched with the additional FindMatches columns. If you want to reference those new columns in downstream transforms, you need to add them to the transform output schema. The easiest way to do that is to choose the **Data preview** tab, and then on the schema tab, choose **Use datapreview schema**. 

1.  To customize FindMatches, you can add additional parameters to pass to the 'apply' method. See [FindMatches class](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-findmatches.html). 

## Adding an incremental FindMatches transform
<a name="find-matches-incrementally-visual-job"></a>

 In the case of incremental matches, the process is the same as adding a FindMatches transform, with the following differences: 
+  Instead of a parent node for the custom transform, you need two parent nodes. 
+  The first parent node should be the dataset. 
+  The second parent node should be the incremental dataset. 

   Replace the `transformId` with your `transformId` in the template code block: 

  ```
  def MyTransform (glueContext, dfc) -> DynamicFrameCollection:
      dfs = list(dfc.values())
      dynf = dfs[0]
      inc_dynf = dfs[1]
      from awsglueml.transforms import FindIncrementalMatches
      findmatches = FindIncrementalMatches.apply(existingFrame = dynf, incrementalFrame = inc_dynf,
                                      transformId = "<your id>")
      return(DynamicFrameCollection({"FindMatches": findmatches}, glueContext))
  ```
+  For optional parameters, see [FindIncrementalMatches class](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-findincrementalmatches.html). 

# Migrate Apache Spark programs to AWS Glue
<a name="glue-author-migrate-apache-spark"></a>

Apache Spark is an open-source platform for distributed computing workloads performed on large datasets. AWS Glue leverages Spark's capabilities to provide an optimized experience for ETL. You can migrate Spark programs to AWS Glue to take advantage of our features. AWS Glue provides the same performance enhancements you would expect from Apache Spark on Amazon EMR.

## Run Spark code
<a name="glue-author-migrate-apache-spark-run"></a>

Native Spark code can be run in an AWS Glue environment out of the box. Scripts are often developed by iteratively changing a piece of code, a workflow suited for an interactive session. However, existing code is better suited to run in an AWS Glue job, which allows you to schedule and consistently collect logs and metrics for each script run. You can upload and edit an existing script through the console. 

1. Acquire the source for your script. For this example, use the [Binarizer Example](https://github.com/apache/spark/blob/master/examples/src/main/python/ml/binarizer_example.py) script from the Apache Spark repository. 

1. In the AWS Glue console, expand the left-side navigation pane and select **ETL** > **Jobs**. 

   In the **Create job** panel, select **Spark script editor**. An **Options** section will appear. Under **Options**, select **Upload and edit an existing script**.

   A **File upload** section will appear. Under **File upload**, click **Choose file**. Your system file chooser will appear. Navigate to the location where you saved `binarizer_example.py`, select it and confirm your selection.

   A **Create** button will appear on the header for the **Create job** panel. Click it.  
![\[The AWS Glue Studio Jobs page with Spark script editor pane selected.\]](http://docs.aws.amazon.com/glue/latest/dg/images/migrate-apache-spark-01-upload-job.png)

1. Your browser will navigate to the script editor. On the header, click the **Job details** tab. Set the **Name** and **IAM Role**. For guidance around AWS Glue IAM roles, consult [Setting up IAM permissions for AWS Glue](set-up-iam.md).

   Optionally - set **Requested number of workers** to `2` and **Number of retries** to `1`. These options are valuable when running production jobs, but turning them down will streamline your experience while testing out a feature.

   In the title bar, click **Save**, then **Run**  
![\[The job details page with options set as instructed.\]](http://docs.aws.amazon.com/glue/latest/dg/images/migrate-apache-spark-02-job-details.png)

1. Navigate to the **Runs** tab. You will see a panel corresponding to your job run. Wait a few minutes and the page should automatically refresh to show **Succeeded** under **Run status**.  
![\[The job runs page with a successful job run.\]](http://docs.aws.amazon.com/glue/latest/dg/images/migrate-apache-spark-03-job-runs.png)

1. You will want to examine your output to confirm that the Spark script ran as intended. This Apache Spark sample script should write a string to the output stream. You can find that by navigating to **Output logs** under **CloudWatch logs** in the panel for the successful job run. Note the job run ID, a generated ID under the **Id** label beginning with `jr_`.

   This will open the CloudWatch console, set to visualize the contents of the default AWS Glue log group `/aws-glue/jobs/output`, filtered to the contents of the log streams for the job run ID. Each worker will have generated a log stream, shown as rows under **Log streams**. One worker should have run the requested code. You will need to open all the log streams to identify the correct worker. After you find the right worker, you should see the output of the script, as seen in the following image:   
![\[The CloudWatch console page with the Spark program output.\]](http://docs.aws.amazon.com/glue/latest/dg/images/migrate-apache-spark-04-log-output.png)

## Common procedures needed for migrating Spark programs
<a name="glue-author-migrate-apache-spark-migrate"></a>

### Assess Spark version support
<a name="glue-author-migrate-apache-spark-migrate-versions"></a>

 AWS Glue release versions define the version of Apache Spark and Python available to the AWS Glue job. You can find our AWS Glue versions and what they support at [AWS Glue versions](release-notes.md#release-notes-versions). You may need to update your Spark program to be compatible with a newer version of Spark in order to access certain AWS Glue features.

### Include third-party libraries
<a name="glue-author-migrate-apache-spark-third-party-libraries"></a>

Many existing Spark programs will have dependencies, both on private and public artifacts. AWS Glue supports JAR-style dependencies for Scala jobs, as well as Wheel and pure-Python source dependencies for Python jobs.

**Python** - For information about Python dependencies, see [Using Python libraries with AWS Glue](aws-glue-programming-python-libraries.md)

Common Python dependencies are provided in the AWS Glue environment, including the commonly requested [Pandas](https://pandas.pydata.org/) library. These dependencies are included in AWS Glue version 2.0 and later. For more information about provided modules, see [Python modules already provided in AWS Glue](aws-glue-programming-python-libraries.md#glue-modules-provided). If you need to supply a job with a different version of a dependency that is included by default, you can use `--additional-python-modules`. For information about job arguments, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

You can supply additional Python dependencies with the `--extra-py-files` job argument. If you are migrating a job from a Spark program, this parameter is a good option because it is functionally equivalent to the `--py-files` flag in PySpark, and is subject to the same limitations. For more information about the `--extra-py-files` parameter, see [Including Python files with PySpark native features](aws-glue-programming-python-libraries.md#extra-py-files-support)

For new jobs, you can manage Python dependencies with the `--additional-python-modules` job argument. Using this argument allows for a more thorough dependency management experience. This parameter supports Wheel style dependencies, including those with native code bindings compatible with Amazon Linux 2.
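
As a sketch, a job's default arguments might include an entry like the following (the module names, versions, and bucket path here are illustrative, not prescribed values):

```json
{
    "--additional-python-modules": "scikit-learn==1.3.2,s3://amzn-s3-demo-bucket/wheels/my_module-0.1-py3-none-any.whl"
}
```

The value is a comma-delimited list that can mix PyPI package specifiers and Amazon S3 paths to Wheel files.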

**Scala**

You can supply additional Scala dependencies with the `--extra-jars` job argument. Dependencies must be hosted in Amazon S3, and the argument value should be a comma-delimited list of Amazon S3 paths with no spaces. You may find it easier to manage your configuration by rebundling your dependencies before hosting and configuring them. AWS Glue JAR dependencies contain Java bytecode, which can be generated from any JVM language. You can use other JVM languages, such as Java, to write custom dependencies.
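
As a sketch, the argument value looks like the following (the bucket and JAR names here are hypothetical):

```json
{
    "--extra-jars": "s3://amzn-s3-demo-bucket/jars/my-lib.jar,s3://amzn-s3-demo-bucket/jars/other-lib.jar"
}
```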

### Manage data source credentials
<a name="glue-author-migrate-apache-spark-credential-management"></a>

Existing Spark programs may come with complex or custom configuration to pull data from their datasources. Common datasource auth flows are supported by AWS Glue connections. For more information about AWS Glue connections, see [Connecting to data](glue-connections.md).

AWS Glue connections facilitate connecting your Job to a variety of types of data stores in two primary ways: through method calls to our libraries and setting the **Additional network connection** in the AWS console. You may also call the AWS SDK from within your job to retrieve information from a connection. 

 **Method calls** – AWS Glue connections are tightly integrated with the AWS Glue Data Catalog, a service that allows you to curate information about your datasets, and the methods available to interact with AWS Glue connections reflect that. If you have an existing auth configuration you would like to reuse, for JDBC connections, you can access your AWS Glue connection configuration through the `extract_jdbc_conf` method on the `GlueContext`. For more information, see [extract\_jdbc\_conf](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-extract_jdbc_conf) 

**Console configuration** – AWS Glue Jobs use associated AWS Glue connections to configure connections to Amazon VPC subnets. If you directly manage your security materials, you may need to provide a `NETWORK` type **Additional network connection** in the AWS console to configure routing. For more information about the AWS Glue connection API, see [Connections API](aws-glue-api-catalog-connections.md)

If your Spark program has a custom or uncommon auth flow, you may need to manage your security materials in a hands-on fashion. If AWS Glue connections do not seem like a good fit, you can securely host security materials in Secrets Manager and access them through boto3 or the AWS SDK, which are provided in the job.

### Configure Apache Spark
<a name="glue-author-migrate-apache-spark-spark-configuration"></a>

Complex migrations often alter Spark configuration to accommodate their workloads. Modern versions of Apache Spark allow runtime configuration to be set with the `SparkSession`. AWS Glue 3.0 and later jobs are provided a `SparkSession`, which can be modified to set runtime configuration. For more information, see [Apache Spark Configuration](https://spark.apache.org/docs/latest/configuration.html). Tuning Spark is complex, and AWS Glue does not guarantee support for setting all Spark configuration. If your migration requires substantial Spark-level configuration, contact support.

### Set custom configuration
<a name="glue-author-migrate-apache-spark-custom-configuration"></a>

Migrated Spark programs may be designed to take custom configuration. AWS Glue allows configuration to be set on the job and job run level, through the job arguments. For information about job arguments, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md). You can access job arguments within the context of a job through our libraries. AWS Glue provides a utility function to provide a consistent view between arguments set on the job and arguments set on the job run. See [Accessing parameters using `getResolvedOptions`](aws-glue-api-crawler-pyspark-extensions-get-resolved-options.md) in Python and [AWS Glue Scala GlueArgParser APIs](glue-etl-scala-apis-glue-util-glueargparser.md) in Scala.
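
To illustrate the behavior, here is a minimal stdlib-only sketch of how a `getResolvedOptions`-style helper resolves named job arguments from the command line. This is illustrative only; the real helper lives in `awsglue.utils` and performs more validation than this:

```python
# Illustrative stand-in for getResolvedOptions: pick out "--NAME value"
# pairs for the requested option names from an argv-style list.
def resolve_options(argv, option_names):
    parsed = {}
    for name in option_names:
        flag = "--" + name
        if flag in argv:
            # The value is the token immediately following the flag.
            parsed[name] = argv[argv.index(flag) + 1]
    return parsed

# Job arguments roughly as AWS Glue passes them to a script.
argv = ["script.py", "--JOB_NAME", "demo-etl-dblp-acm", "--stage", "test"]
print(resolve_options(argv, ["JOB_NAME", "stage"]))
# → {'JOB_NAME': 'demo-etl-dblp-acm', 'stage': 'test'}
```

The real helper merges arguments set on the job definition with arguments set on the job run, which is what gives you a consistent view across both.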

### Migrate Java code
<a name="glue-author-migrate-apache-spark-java-code"></a>

As explained in [Include third-party libraries](#glue-author-migrate-apache-spark-third-party-libraries), your dependencies can contain classes generated by JVM languages, such as Java or Scala. Your dependencies can include a `main` method. You can use a `main` method in a dependency as the entrypoint for a AWS Glue Scala job. This allows you to write your `main` method in Java, or reuse a `main` method packaged to your own library standards. 

To use a `main` method from a dependency, do the following: clear the contents of the editing pane that provides the default `GlueApp` object, then provide the fully qualified name of a class in a dependency as a job argument with the key `--class`. You should then be able to trigger a job run.
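
As a sketch, the job argument might look like the following (the class name reuses the hypothetical example names from the script that follows):

```json
{
    "--class": "com.mycompany.myproject.MyClass"
}
```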

You cannot configure the order or structure of the arguments AWS Glue passes to the `main` method. If your existing code needs to read configuration set in AWS Glue, this will likely cause incompatibility with prior code. If you use `getResolvedOptions`, you will also not have a good place to call this method. Consider invoking your dependency directly from a main method generated by AWS Glue. The following AWS Glue ETL script shows an example of this.

```
import com.amazonaws.services.glue.util.GlueArgParser

object GlueApp {
  def main(sysArgs: Array[String]) {
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    
    // Invoke static method from JAR. Pass some sample arguments as a String[], one defined inline and one taken from the job arguments, using getResolvedOptions
    com.mycompany.myproject.MyClass.myStaticPublicMethod(Array("string parameter1", args("JOB_NAME")))
    
    // Alternatively, invoke a non-static public method.
    (new com.mycompany.myproject.MyClass).someMethod()
  }
}
```

# Working with Ray jobs in AWS Glue
<a name="ray-jobs-section"></a>

**Important**  
AWS Glue for Ray will no longer be open to new customers starting April 30, 2026. If you would like to use AWS Glue for Ray, sign up prior to that date. Existing customers can continue to use the service as normal. For capabilities similar to AWS Glue for Ray, explore Amazon EKS. For more information, see [AWS Glue for Ray end of support](https://docs.aws.amazon.com/glue/latest/dg/awsglue-ray-jobs-availability-change.html).

This section provides information about using AWS Glue for Ray jobs. For more information about writing AWS Glue for Ray scripts, consult the [Programming Ray scripts](aws-glue-programming-ray.md) section.

**Topics**
+ [Getting started with AWS Glue for Ray](#author-job-ray-using)
+ [Supported Ray runtime environments](#author-job-ray-runtimes)
+ [Accounting for workers in Ray jobs](#author-job-ray-worker-accounting)
+ [Using job parameters in Ray jobs](author-job-ray-job-parameters.md)
+ [Monitoring Ray jobs with metrics](author-job-ray-monitor.md)

## Getting started with AWS Glue for Ray
<a name="author-job-ray-using"></a>

To work with AWS Glue for Ray, you use the same AWS Glue jobs and interactive sessions that you use with AWS Glue for Spark. AWS Glue jobs are designed for running the same script on a recurring cadence, while interactive sessions are designed to let you run snippets of code sequentially against the same provisioned resources. 

AWS Glue ETL and Ray are different underneath, so in your script, you have access to different tools, features, and configuration. As a new computation framework managed by AWS Glue, Ray has a different architecture and uses different vocabulary to describe what it does. For more information, see [Architecture Whitepapers](https://docs.ray.io/en/latest/ray-contribute/whitepaper.html) in the Ray documentation. 

**Note**  
AWS Glue for Ray is available in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Tokyo), and Europe (Ireland).

### Ray jobs in the AWS Glue Studio console
<a name="author-job-ray-using-console"></a>

On the **Jobs** page in the AWS Glue Studio console, you can select a new option when you're creating a job in AWS Glue Studio—**Ray script editor**. Choose this option to create a Ray job in the console. For more information about jobs and how they're used, see [Building visual ETL jobs](author-job-glue.md).

![\[The Jobs page in AWS Glue Studio with the Ray script editor option selected.\]](http://docs.aws.amazon.com/glue/latest/dg/images/ray_job_setup.png)


### Ray jobs in the AWS CLI and SDK
<a name="author-job-ray-using-cli"></a>

Ray jobs in the AWS CLI use the same SDK actions and parameters as other jobs. AWS Glue for Ray introduces new values for certain parameters. For more information about the Jobs API, see [Jobs](aws-glue-api-jobs-job.md).

## Supported Ray runtime environments
<a name="author-job-ray-runtimes"></a>

In Spark jobs, `GlueVersion` determines the versions of Apache Spark and Python available in an AWS Glue for Spark job. The Python version indicates the version that is supported for jobs of type Spark. This is not how Ray runtime environments are configured.

For Ray jobs, you should set `GlueVersion` to `4.0` or greater. However, the versions of Ray, Python, and additional libraries that are available in your Ray job are determined by the `Runtime` field in the job definition.

The `Ray2.4` runtime environment will be available for a minimum of 6 months after release. As Ray rapidly evolves, you will be able to incorporate Ray updates and improvements through future runtime environment releases.

Valid values: `Ray2.4`


| Runtime value | Ray and Python versions | 
| --- | --- | 
| Ray2.4 (for AWS Glue 4.0+) |  Ray 2.4.0, Python 3.9  | 

**Additional information**
+ For release notes that accompany AWS Glue on Ray releases, see [AWS Glue versions](release-notes.md#release-notes-versions).
+ For Python libraries that are provided in a runtime environment, see [Modules provided with Ray jobs](edit-script-ray-env-dependencies.md#edit-script-ray-modules-provided).

## Accounting for workers in Ray jobs
<a name="author-job-ray-worker-accounting"></a>

AWS Glue runs Ray jobs on new Graviton-based EC2 worker types, which are available only for Ray jobs. To provision these workers appropriately for the workloads that Ray is designed for, we provide a different ratio of compute resources to memory resources than most workers have. To account for these resources, we use the memory-optimized data processing unit (M-DPU) rather than the standard data processing unit (DPU).
+ One M-DPU corresponds to 4 vCPUs and 32 GB of memory.
+ One DPU corresponds to 4 vCPUs and 16 GB of memory. DPUs are used to account for resources in AWS Glue with Spark jobs and corresponding workers.

Ray jobs currently have access to one worker type, `Z.2X`. The `Z.2X` worker maps to 2 M-DPUs (8 vCPUs, 64 GB of memory) and has 128 GB of disk space. A `Z.2X` machine provides 8 Ray workers (one per vCPU).
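
As a quick sketch of this accounting, the totals for a given worker count follow directly from the `Z.2X` figures above (a minimal illustration using only the constants stated in this section, not an AWS API):

```python
# Resource accounting for the Z.2X Ray worker type, per the figures above:
# 1 M-DPU = 4 vCPUs and 32 GB of memory; Z.2X = 2 M-DPUs, 8 Ray workers.
M_DPU_VCPUS = 4
M_DPU_MEMORY_GB = 32
Z2X_M_DPUS = 2

def z2x_cluster_resources(number_of_workers):
    """Return total M-DPUs, vCPUs, memory, and Ray worker replicas
    for a Ray job provisioned with the given number of Z.2X nodes."""
    m_dpus = number_of_workers * Z2X_M_DPUS
    vcpus = m_dpus * M_DPU_VCPUS
    return {
        "m_dpus": m_dpus,
        "vcpus": vcpus,
        "memory_gb": m_dpus * M_DPU_MEMORY_GB,
        "ray_workers": vcpus,  # one Ray worker replica per vCPU
    }

print(z2x_cluster_resources(5))
```

For example, a job with `--number-of-workers` set to 5 is billed as 10 M-DPUs and provides 40 Ray worker replicas.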

The number of M-DPUs that you can use concurrently in an account is subject to a service quota. For more information about your AWS Glue account limits, see [AWS Glue endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/glue.html).

You specify the number of worker nodes that are available to a Ray job with `--number-of-workers (NumberOfWorkers)` in the job definition. For more information about Ray values in the Jobs API, see [Jobs](aws-glue-api-jobs-job.md).

You can further specify a minimum number of workers that a Ray job must allocate with the `--min-workers` job parameter. For more information about job parameters, see [Reference](author-job-ray-job-parameters.md#author-job-ray-parameters-reference). 

# Using job parameters in Ray jobs
<a name="author-job-ray-job-parameters"></a>

**Important**  
AWS Glue for Ray will no longer be open to new customers starting April 30, 2026. If you would like to use AWS Glue for Ray, sign up prior to that date. Existing customers can continue to use the service as normal. For capabilities similar to AWS Glue for Ray, explore Amazon EKS. For more information, see [AWS Glue for Ray end of support](https://docs.aws.amazon.com/glue/latest/dg/awsglue-ray-jobs-availability-change.html).

You set arguments for AWS Glue Ray jobs the same way you set arguments for AWS Glue for Spark jobs. For more information about the AWS Glue API, see [Jobs](aws-glue-api-jobs-job.md). You can configure AWS Glue Ray jobs with different arguments, which are listed in this reference. You can also provide your own arguments. 

You can configure a job through the console, on the **Job details** tab, under the **Job Parameters** heading. You can also configure a job through the AWS CLI by setting `DefaultArguments` on a job, or setting `Arguments` on a job run. Default arguments and job parameters stay with the job through multiple runs. 

For example, the following is the syntax for running a job using `--arguments` to set a special parameter.

```
$ aws glue start-job-run --job-name "CSV to CSV" --arguments='--scriptLocation="s3://my_glue/libraries/test_lib.py",--test-environment="true"'
```

After you set the arguments, you can access job parameters from within your Ray job through environment variables. This gives you a way to configure your job for each run. The name of the environment variable will be the job argument name without the `--` prefix. 

For instance, in the previous example, the variable names would be `scriptLocation` and `test-environment`. You would then retrieve the argument through methods available in the standard library: `test_environment = os.environ.get('test-environment')`. For more information about accessing environment variables with Python, see [os module](https://docs.python.org/3/library/os.html) in the Python documentation.
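
Continuing that example, a Ray script could read those two arguments from its environment like this (a minimal sketch; the fallback default values are illustrative):

```python
import os

# Job arguments are exposed as environment variables named after the
# argument without the leading "--" prefix.
script_location = os.environ.get("scriptLocation", "")
test_environment = os.environ.get("test-environment", "false")

if test_environment == "true":
    print("Running in test mode")
```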

## Configure how Ray jobs generate logs
<a name="author-job-ray-logging-configuration"></a>

By default, Ray jobs generate logs and metrics that are sent to CloudWatch and Amazon S3. You can use the `--logging_configuration` parameter to alter how logs are generated; currently, you can use it to stop Ray jobs from generating various types of logs. This parameter takes a JSON object whose keys correspond to the logs and behaviors that you want to alter. It supports the following keys:
+ `CLOUDWATCH_METRICS` – Configures CloudWatch metrics series that can be used to visualize job health. For more information about metrics, see [Monitoring Ray jobs with metrics](author-job-ray-monitor.md).
+ `CLOUDWATCH_LOGS` – Configures CloudWatch logs that provide Ray application-level details about the status of the job run. For more information about logs, see [Troubleshooting AWS Glue for Ray errors from logs](troubleshooting-ray.md).
+ `S3` – Configures what AWS Glue writes to Amazon S3: primarily the same information as the CloudWatch logs, but written as files rather than log streams.

To disable a Ray logging behavior, provide the value `{\"IS_ENABLED\": \"False\"}`. For example, to disable CloudWatch metrics and CloudWatch logs, provide the following configuration:

```
"--logging_configuration": "{\"CLOUDWATCH_METRICS\": {\"IS_ENABLED\": \"False\"}, \"CLOUDWATCH_LOGS\": {\"IS_ENABLED\": \"False\"}}"
```
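
Rather than escaping the JSON by hand, you can build the parameter value programmatically; for example, with Python's standard `json` module (a sketch of constructing the value only; how you attach it to `DefaultArguments` depends on your deployment tooling):

```python
import json

# Disable CloudWatch metrics and CloudWatch logs for a Ray job.
logging_configuration = {
    "CLOUDWATCH_METRICS": {"IS_ENABLED": "False"},
    "CLOUDWATCH_LOGS": {"IS_ENABLED": "False"},
}

# json.dumps produces the string value that --logging_configuration expects.
default_arguments = {"--logging_configuration": json.dumps(logging_configuration)}
print(default_arguments["--logging_configuration"])
```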

## Reference
<a name="author-job-ray-parameters-reference"></a>

 Ray jobs recognize the following argument names that you can use to set up the script environment for your Ray jobs and job runs:
+ `--logging_configuration` – Used to stop the generation of various logs created by Ray jobs. These logs are generated by default on all Ray jobs. Format: String-escaped JSON object. For more information, see [Configure how Ray jobs generate logs](#author-job-ray-logging-configuration).
+ `--min-workers` – The minimum number of worker nodes that are allocated to a Ray job. A worker node can run multiple replicas, one per virtual CPU. Format: integer. Minimum: 0. Maximum: value specified in `--number-of-workers (NumberOfWorkers)` on the job definition. For more information about accounting for worker nodes, see [Accounting for workers in Ray jobs](ray-jobs-section.md#author-job-ray-worker-accounting).
+ `--object_spilling_config` – AWS Glue for Ray supports using Amazon S3 as a way of extending the space available to Ray's object store. To enable this behavior, you can provide Ray an *object spilling* JSON config object with this parameter. For more information about Ray object spilling configuration, see [Object Spilling](https://docs.ray.io/en/latest/ray-core/objects/object-spilling.html) in the Ray documentation. Format: JSON object.

  AWS Glue for Ray supports spilling either to disk or to Amazon S3, but not both at the same time. You can provide multiple locations for spilling, as long as they respect this limitation. When spilling to Amazon S3, you also need to add IAM permissions to your job for that bucket.

  When providing a JSON object as configuration with the CLI, you must provide it as a string, with the JSON object string-escaped. For example, a string value for spilling to one Amazon S3 path would look like: `"{\"type\": \"smart_open\", \"params\": {\"uri\":\"s3path\"}}"`. In AWS Glue Studio, provide this parameter as a JSON object with no extra formatting. 
+ `--object_store_memory_head` – The memory allocated to the Plasma object store on the Ray head node. This instance runs cluster management services, as well as worker replicas. The value represents a percentage of free memory on the instance after a warm start. You use this parameter to tune memory intensive workloads—defaults are acceptable for most use cases. Format: positive integer. Minimum: 1. Maximum: 100.

  For more information about Plasma, see [The Plasma In-Memory Object Store](https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html) in the Ray documentation.
+ `--object_store_memory_worker` – The memory allocated to the Plasma object store on the Ray worker nodes. These instances only run worker replicas. The value represents a percentage of free memory on the instance after a warm start. This parameter is used to tune memory intensive workloads—defaults are acceptable for most use cases. Format: positive integer. Minimum: 1. Maximum: 100.

  For more information about Plasma, see [The Plasma In-Memory Object Store](https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html) in the Ray documentation.
+ `--pip-install` – A set of Python packages to be installed. You can install packages from PyPI using this argument. Format: comma-delimited list.

  A PyPI package entry takes the format `package==version`, with the PyPI name and version of your target package. Entries use Python version matching, such as `==`, not a single equals sign (`=`). There are other version-matching operators; for more information, see [PEP 440](https://peps.python.org/pep-0440/#version-matching) on the Python website. You can also provide custom modules with `--s3-py-modules`. 
+ `--s3-py-modules` – A set of Amazon S3 paths that host Python module distributions. Format: comma-delimited list.

  You can use this to distribute your own modules to your Ray job. You can also provide modules from PyPI with `--pip-install`. Unlike with AWS Glue ETL, custom modules are not set up through pip, but are passed to Ray for distribution. For more information, see [Additional Python modules for Ray jobs](edit-script-ray-env-dependencies.md#edit-script-ray-python-libraries-additional).
+ `--working-dir` – A path to a .zip file hosted in Amazon S3 that contains files to be distributed to all nodes running your Ray job. Format: string. For more information, see [Providing files to your Ray job](edit-script-ray-env-dependencies.md#edit-script-ray-working-directory).
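
For instance, the string-escaped value shown for `--object_spilling_config` above can be generated with Python's `json` module rather than escaped by hand (a sketch; `s3path` is the placeholder from that example):

```python
import json

# Ray object spilling configuration for a single Amazon S3 location.
spilling_config = {"type": "smart_open", "params": {"uri": "s3path"}}

# One json.dumps call yields the JSON object; a second call adds the
# string escaping needed when the value is embedded in another JSON
# document, such as a CLI --default-arguments payload.
as_string = json.dumps(spilling_config)
escaped = json.dumps(as_string)
print(as_string)
print(escaped)
```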

# Monitoring Ray jobs with metrics
<a name="author-job-ray-monitor"></a>

**Important**  
AWS Glue for Ray will no longer be open to new customers starting April 30, 2026. If you would like to use AWS Glue for Ray, sign up prior to that date. Existing customers can continue to use the service as normal. For capabilities similar to AWS Glue for Ray, explore Amazon EKS. For more information, see [AWS Glue for Ray end of support](https://docs.aws.amazon.com/glue/latest/dg/awsglue-ray-jobs-availability-change.html).

You can monitor Ray jobs using AWS Glue Studio and Amazon CloudWatch. CloudWatch collects and processes raw metrics from AWS Glue with Ray, which makes them available for analysis. These metrics are visualized in the AWS Glue Studio console, so you can monitor your job as it runs.

For a general overview of how to monitor AWS Glue, see [Monitoring AWS Glue using Amazon CloudWatch metrics](monitoring-awsglue-with-cloudwatch-metrics.md). For a general overview of how to use CloudWatch metrics that are published by AWS Glue, see [Monitoring with Amazon CloudWatch](monitor-cloudwatch.md).

## Monitoring Ray jobs in the AWS Glue console
<a name="author-job-ray-monitor-console"></a>

On the details page for a job run, below the **Run details** section, you can view pre-built aggregated graphs that visualize your available job metrics. AWS Glue Studio sends job metrics to CloudWatch for every job run. With these, you can build a profile of your cluster and tasks, as well as access detailed information about each node.

For more information about available metrics graphs, see [Viewing Amazon CloudWatch metrics for a Ray job run](view-job-runs.md#monitoring-job-run-metrics-ray).

## Overview of Ray jobs metrics in CloudWatch
<a name="author-job-ray-monitor-cw"></a>

We publish Ray metrics when detailed monitoring is enabled in CloudWatch. Metrics are published to the `Glue/Ray` CloudWatch namespace.
+ **Instance metrics**

  We publish metrics about the CPU, memory and disk utilization of instances assigned to a job. These metrics are identified by features such as `ExecutorId`, `ExecutorType` and `host`. These metrics are a subset of the standard Linux CloudWatch agent metrics. You can find information about metric names and features in the CloudWatch documentation. For more information, see [Metrics collected by the CloudWatch agent](https://docs.aws.amazon.com//AmazonCloudWatch/latest/monitoring/metrics-collected-by-CloudWatch-agent.html).
+ **Ray cluster metrics**

  We forward metrics from the Ray processes that run your script to this namespace, then provide those most critical for you. The metrics that are available might differ by Ray version. For more information about which Ray version your job is running, see [AWS Glue versions](release-notes.md). 

  Ray collects metrics at the instance level. It also provides metrics for tasks and the cluster. For more information about Ray's underlying metric strategy, see [Metrics](https://docs.ray.io/en/latest/ray-observability/ray-metrics.html#system-metrics) in the Ray documentation.

**Note**  
 We don't publish Ray metrics to the `Glue/Job Metrics/` namespace, which is only used for AWS Glue ETL jobs.

# Configuring job properties for Python shell jobs in AWS Glue
<a name="add-job-python"></a>

 You can use a Python shell job to run Python scripts as a shell in AWS Glue. With a Python shell job, you can run scripts that are compatible with Python 3.6 or Python 3.9. 

**Note**  
 Support for Python shell v3.6 will end on March 1, 2026. To migrate your workloads, see [Migrate from AWS Glue Python shell jobs](https://docs.aws.amazon.com/glue/latest/dg/pyshell-migration.html). If you wish to continue with Python shell v3.9, see [Migrating from Python shell 3.6 to Python shell 3.9](#migrating-version-pyshell36-to-pyshell39). 

**Topics**
+ [Limitations](#python-shell-limitations)
+ [Execution environment](#python-shell-execution-environment)
+ [Defining job properties for Python shell jobs](#create-job-python-properties)
+ [Supported libraries for Python shell jobs](#python-shell-supported-library)
+ [Providing your own Python library](#create-python-extra-library)
+ [Use AWS CloudFormation with Python shell jobs in AWS Glue](#python-shell-jobs-cloudformation)
+ [Migrating from Python shell 3.6 to Python shell 3.9](#migrating-version-pyshell36-to-pyshell39)
+ [Migrate from AWS Glue Python shell jobs](pyshell-migration.md)

## Limitations
<a name="python-shell-limitations"></a>

Note the following limitations of Python Shell jobs:
+  You can't use job bookmarks with Python shell jobs. 
+ You can't package any Python libraries as `.egg` files in Python 3.9+. Instead, use `.whl`.
+ The `--extra-files` option cannot be used, because of a limitation on temporary copies of S3 data.

## Execution environment
<a name="python-shell-execution-environment"></a>

Python shell jobs run in a managed execution environment that provides access to local storage for temporary data processing:

**Local temporary storage**  
The `/tmp` directory is available for temporary storage during job execution. This directory provides approximately 14 GiB of free space that you can use for:  
+ Temporary file processing
+ Intermediate data storage
+ Caching small datasets
The `/tmp` directory is ephemeral and is cleaned up after job completion. Do not use it for persistent storage of important data.
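
A short sketch of using this local storage from a job script, with Python's standard `tempfile` module (the file name and contents are illustrative):

```python
import os
import tempfile

# Write intermediate data to the job's local temporary storage.
# tempfile uses the platform default location (/tmp on the job host).
with tempfile.NamedTemporaryFile(mode="w", suffix=".csv", delete=False) as f:
    f.write("id,value\n1,42\n")
    scratch_path = f.name

# Read the data back later in the same run, then clean up explicitly.
with open(scratch_path) as f:
    data = f.read()
os.remove(scratch_path)
print(data)
```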

## Defining job properties for Python shell jobs
<a name="create-job-python-properties"></a>

These sections describe defining job properties in AWS Glue Studio, or using the AWS CLI.

### AWS Glue Studio
<a name="create-job-python-properties-studio"></a>

When you define your Python shell job in AWS Glue Studio, you provide some of the following properties: 

**IAM role**  
Specify the AWS Identity and Access Management (IAM) role that is used for authorization to resources that are used to run the job and access data stores. For more information about permissions for running jobs in AWS Glue, see [Identity and access management for AWS Glue](security-iam.md).

**Type**  
Choose **Python shell** to run a Python script with the job command named `pythonshell`.

**Python version**  
Choose the Python version. The default is Python 3.9. Valid versions are Python 3.6 and Python 3.9.

**Load common analytics libraries (Recommended)**  
Choose this option to include common libraries for Python 3.9 in the Python shell.  
If your libraries are either custom or they conflict with the pre-installed ones, you can choose not to install common libraries. However, you can install additional libraries besides the common libraries.  
When you select this option, the `library-set` option is set to `analytics`. When you de-select this option, the `library-set` option is set to `none`. 

**Script filename and Script path**  
The code in the script defines your job's procedural logic. You provide the script name and location in Amazon Simple Storage Service (Amazon S3). Confirm that there isn't a file with the same name as the script directory in the path. To learn more about using scripts, see [AWS Glue programming guide](edit-script.md).

**Script**  
The code in the script defines your job's procedural logic. You can code the script in Python 3.6 or Python 3.9. You can edit a script in AWS Glue Studio.

**Data processing units**  
The maximum number of AWS Glue data processing units (DPUs) that can be allocated when this job runs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. For more information, see [AWS Glue pricing](https://aws.amazon.com/glue/pricing/).   
You can set the value to 0.0625 or 1. The default is 0.0625. In either case, the local disk for the instance is 20 GB.

### CLI
<a name="create-job-python-properties-cli"></a>

 You can also create a **Python shell** job using the AWS CLI, as in the following example. 

```
 aws glue create-job --name python-job-cli --role Glue_DefaultRole 
     --command '{"Name" :  "pythonshell", "PythonVersion": "3.9", "ScriptLocation" : "s3://amzn-s3-demo-bucket/scriptname.py"}'  
     --max-capacity 0.0625
```
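
The same definition can be expressed from Python. The following is a minimal sketch of the request you would pass to boto3's `create_job` (the role, bucket, and script names are the placeholders from the CLI example; credentials and Region configuration are assumed to be in place):

```python
# Job definition mirroring the CLI example above; in a configured session,
# pass it to boto3.client("glue").create_job(**job_definition).
job_definition = {
    "Name": "python-job-cli",
    "Role": "Glue_DefaultRole",
    "Command": {
        "Name": "pythonshell",
        "PythonVersion": "3.9",
        "ScriptLocation": "s3://amzn-s3-demo-bucket/scriptname.py",
    },
    "MaxCapacity": 0.0625,
}
print(job_definition["Name"])
```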

**Note**  
 You don't need to specify the version of AWS Glue because the `--glue-version` parameter doesn't apply to AWS Glue shell jobs. Any version that you specify is ignored. 

 Jobs that you create with the AWS CLI default to Python 3. Valid Python versions are 3 (corresponding to 3.6) and 3.9. To specify Python 3.6, add this key-value pair to the `--command` parameter: `"PythonVersion":"3"` 

 To specify Python 3.9, add this key-value pair to the `--command` parameter: `"PythonVersion":"3.9"` 

 To set the maximum capacity used by a Python shell job, provide the `--max-capacity` parameter. For Python shell jobs, the `--allocated-capacity` parameter can't be used. 

## Supported libraries for Python shell jobs
<a name="python-shell-supported-library"></a>

 In Python shell jobs that use Python 3.9, you can choose a pre-packaged library set that suits your needs. Use the `library-set` option to choose the library set. Valid values are `analytics` and `none`. 

The environment for running a Python shell job supports the following libraries: 


| Library | Python 3.6 | Python 3.9 (analytics) | Python 3.9 (none) | 
| --- | --- | --- | --- | 
| avro |  | 1.11.0 |  | 
| awscli | 1.16.242 | 1.23.5 | 1.23.5 | 
| awswrangler |  | 2.15.1 |  | 
| botocore | 1.12.232 | 1.24.21 | 1.23.5 | 
| boto3 | 1.9.203 | 1.21.21 |  | 
| elasticsearch |  | 8.2.0 |  | 
| numpy | 1.16.2 | 1.22.3 |  | 
| pandas | 0.24.2 | 1.4.2 |  | 
| psycopg2 |  | 2.9.3 |  | 
| pyathena |  | 2.5.3 |  | 
| PyGreSQL | 5.0.6 |  |  | 
| PyMySQL |  | 1.0.2 |  | 
| pyodbc |  | 4.0.32 |  | 
| pyorc |  | 0.6.0 |  | 
| redshift-connector |  | 2.0.907 |  | 
| requests | 2.22.0 | 2.27.1 |  | 
| scikit-learn | 0.20.3 | 1.0.2 |  | 
| scipy | 1.2.1 | 1.8.0 |  | 
| SQLAlchemy |  | 1.4.36 |  | 
| s3fs |  | 2022.3.0 |  | 

You can use the `NumPy` library in a Python shell job for scientific computing. For more information, see [NumPy](http://www.numpy.org). The following example shows a NumPy script that can be used in a Python shell job. The script prints "Hello world" and the results of several mathematical calculations.

```
import numpy as np

print("Hello world")

a = np.array([20, 30, 40, 50])
print(a)

b = np.arange(4)
print(b)

c = a - b
print(c)

d = b**2
print(d)
```

## Providing your own Python library
<a name="create-python-extra-library"></a>

### Using PIP
<a name="create-python-extra-library-pip"></a>

Python shell using Python 3.9 lets you provide additional Python modules or different versions at the job level. You can use the `--additional-python-modules` option with a list of comma-separated Python modules to add a new module or change the version of an existing module. You cannot provide custom Python modules hosted on Amazon S3 with this parameter when using Python shell jobs.

For example, to update or to add a new `scikit-learn` module, use the following key and value: `"--additional-python-modules", "scikit-learn==0.21.3"`.

AWS Glue uses the Python Package Installer (pip3) to install the additional modules. You can pass additional pip3 options inside the `--additional-python-modules` value. For example, `"scikit-learn==0.21.3 -i https://pypi.python.org/simple/"`. Any incompatibilities or limitations from pip3 apply.
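
For example, the corresponding `DefaultArguments` entry could be assembled like this (a sketch; `s3fs` is taken from the supported-library table above, and the pinned versions are illustrative):

```python
# --additional-python-modules takes a comma-delimited list of
# package==version entries, installed with pip3 at job start.
modules = ["scikit-learn==0.21.3", "s3fs==2022.3.0"]
default_arguments = {"--additional-python-modules": ",".join(modules)}
print(default_arguments["--additional-python-modules"])
```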

**Note**  
To avoid incompatibilities in the future, we recommend that you use libraries built for Python 3.9.

### Using an Egg or Whl file
<a name="create-python-extra-library-egg-whl"></a>

You might already have one or more Python libraries packaged as an `.egg` or a `.whl` file. If so, you can specify them to your job using the AWS Command Line Interface (AWS CLI) under the "`--extra-py-files`" flag, as in the following example.

```
aws glue create-job --name python-redshift-test-cli --role role --command '{"Name" :  "pythonshell", "ScriptLocation" : "s3://MyBucket/python/library/redshift_test.py"}' 
     --connections Connections=connection-name --default-arguments '{"--extra-py-files" : ["s3://amzn-s3-demo-bucket/EGG-FILE", "s3://amzn-s3-demo-bucket/WHEEL-FILE"]}'
```

If you aren't sure how to create an `.egg` or a `.whl` file from a Python library, use the following steps. This example is applicable on macOS, Linux, and Windows Subsystem for Linux (WSL).

**To create a Python .egg or .whl file**

1. Create an Amazon Redshift cluster in a virtual private cloud (VPC), and add some data to a table.

1. Create an AWS Glue connection for the VPC-SecurityGroup-Subnet combination that you used to create the cluster. Test that the connection is successful.

1. Create a directory named `redshift_example`, and create a file named `setup.py`. Paste the following code into `setup.py`.

   ```
   from setuptools import setup
   
   setup(
       name="redshift_module",
       version="0.1",
       packages=['redshift_module']
   )
   ```

1. In the `redshift_example` directory, create a `redshift_module` directory. In the `redshift_module` directory, create the files `__init__.py` and `pygresql_redshift_common.py`.

1. Leave the `__init__.py` file empty. In `pygresql_redshift_common.py`, paste the following code. Replace *port*, *db\_name*, *user*, and *password\_for\_user* with details specific to your Amazon Redshift cluster. Replace *table\_name* with the name of the table in Amazon Redshift.

   ```
   import pg
   
   
   def get_connection(host):
       rs_conn_string = "host=%s port=%s dbname=%s user=%s password=%s" % (
           host, port, db_name, user, password_for_user)
   
       rs_conn = pg.connect(dbname=rs_conn_string)
       rs_conn.query("set statement_timeout = 1200000")
       return rs_conn
   
   
   def query(con):
       statement = "Select * from table_name;"
       res = con.query(statement)
       return res
   ```

1. If you're not already there, change to the `redshift_example` directory.

1. Do one of the following:
   + To create an `.egg` file, run the following command.

     ```
     python setup.py bdist_egg
     ```
   + To create a `.whl` file, run the following command.

     ```
     python setup.py bdist_wheel
     ```

1. Install the dependencies that are required for the preceding command.

1. The command creates a file in the `dist` directory:
   + If you created an egg file, it's named `redshift_module-0.1-py2.7.egg`.
   + If you created a wheel file, it's named `redshift_module-0.1-py2.7-none-any.whl`.

   Upload this file to Amazon S3.

   In this example, the uploaded file path is either *s3://amzn-s3-demo-bucket/EGG-FILE* or *s3://amzn-s3-demo-bucket/WHEEL-FILE*. 

1. Create a Python file to be used as a script for the AWS Glue job, and add the following code to the file.

   ```
   from redshift_module import pygresql_redshift_common as rs_common
   
   con1 = rs_common.get_connection(redshift_endpoint)
   res = rs_common.query(con1)
   
   print "Rows in the table cities are: "
   
   print res
   ```

1. Upload the preceding file to Amazon S3. In this example, the uploaded file path is *s3://amzn-s3-demo-bucket/scriptname.py*. 

1. Create a Python shell job using this script. On the AWS Glue console, on the **Job properties** page, specify the path to the `.egg/.whl` file in the **Python library path** box. If you have multiple `.egg/.whl` files and Python files, provide a comma-separated list in this box. 

   When modifying or renaming `.egg` files, the file names must use the default names generated by the "python setup.py bdist\_egg" command or must adhere to the Python module naming conventions. For more information, see the [Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/). 

   Using the AWS CLI, create a job with a command, as in the following example.

   ```
   aws glue create-job --name python-redshift-test-cli --role Role --command '{"Name" :  "pythonshell", "ScriptLocation" : "s3://amzn-s3-demo-bucket/scriptname.py"}' 
        --connections Connections="connection-name" --default-arguments '{"--extra-py-files" : ["s3://amzn-s3-demo-bucket/EGG-FILE", "s3://amzn-s3-demo-bucket/WHEEL-FILE"]}'
   ```

   When the job runs, the script prints the rows created in the *table\_name* table in the Amazon Redshift cluster.

## Use AWS CloudFormation with Python shell jobs in AWS Glue
<a name="python-shell-jobs-cloudformation"></a>

 You can use AWS CloudFormation with Python shell jobs in AWS Glue. The following is an example: 

```
AWSTemplateFormatVersion: 2010-09-09
Resources:
  Python39Job:
    Type: 'AWS::Glue::Job'
    Properties:
      Command:
        Name: pythonshell
        PythonVersion: '3.9'
        ScriptLocation: 's3://bucket/location'
      MaxRetries: 0
      Name: python-39-job
      Role: RoleName
```

 The Amazon CloudWatch Logs group for Python shell jobs output is `/aws-glue/python-jobs/output`. For errors, see the log group `/aws-glue/python-jobs/error`. 

## Migrating from Python shell 3.6 to Python shell 3.9
<a name="migrating-version-pyshell36-to-pyshell39"></a>

 To migrate your Python shell jobs to the latest AWS Glue version: 

1.  In the AWS Glue console ([https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/)), choose your existing Python shell job. 

1.  In the **Job** details tab, set the Python version to `Python 3.9` and choose **Save**. 

1.  Ensure that your job script is compatible with Python 3.9 and that it runs successfully. 

# Migrate from AWS Glue Python shell jobs
<a name="pyshell-migration"></a>

 AWS launched AWS Glue Python shell jobs in 2018 to give customers an easy way to run Python scripts for small-to-medium sized ETL jobs and to trigger SQL queries. However, there are now more modern and flexible options for the workloads that currently run on Python shell jobs. This topic explains how to migrate your workloads from AWS Glue Python shell jobs to one of these alternatives so that you can take advantage of the newer capabilities that are available. 

## Migrating workload to AWS Glue Spark jobs
<a name="pyshell-migration-to-glue-spark-jobs"></a>

 [AWS Glue Spark and PySpark jobs](https://docs.aws.amazon.com/glue/latest/dg/spark_and_pyspark.html) allow you to run your workloads in a distributed fashion. Since both AWS Glue Python shell jobs and AWS Glue Spark jobs run on the same platform, it's easy to migrate, and you can continue using existing AWS Glue features that you use with Python shell jobs, such as AWS Glue workflows, AWS Glue triggers, AWS Glue's Amazon EventBridge integration, PIP-based package installation, and so on. 

 However, AWS Glue Spark jobs are designed to run Spark workloads, and the minimum number of workers is 2. If you migrate from Python shell jobs without modifying your scripts, only one worker will actually be used and the other workers will remain idle, which increases your costs. 

 To make the migration cost-efficient, rewrite your Python job script to use Spark's capabilities and distribute the workload across multiple workers. If your Python script is pandas-based, it's easy to migrate using the Pandas API on Spark. Learn more about this in [the AWS Big Data Blog: Dive deep into AWS Glue 4.0 for Apache Spark](https://aws.amazon.com/blogs/big-data/dive-deep-into-aws-glue-4-0-for-apache-spark/). 

## Migrating workload to AWS Lambda
<a name="pyshell-migration-to-aws-lambda"></a>

 AWS Lambda is a serverless computing service that lets you run code without provisioning or managing servers. AWS Lambda offers lower startup latency and more flexible options for compute capacity, which many workloads can benefit from. For managing extra Python libraries, AWS Glue Python shell jobs use PIP-based installation. For AWS Lambda, you instead choose one of the following options: a .zip archive, a container image, or Lambda layers. 

 On the other hand, AWS Lambda's maximum timeout is 900 seconds (15 minutes). If the job duration of your existing AWS Glue Python Shell job workload is more than that, or if your workload has a spiky pattern that may cause longer job durations, then we recommend exploring other options instead of AWS Lambda. 

## Migrating workload to Amazon ECS/Fargate
<a name="pyshell-migration-to-ecs-aws-fargate"></a>

 Amazon Elastic Container Service (Amazon ECS) is a fully managed service that simplifies the deployment, management, and scaling of containerized applications. AWS Fargate is a serverless compute engine for containerized workloads running on Amazon ECS and Amazon Elastic Kubernetes Service (Amazon EKS). There's no maximum timeout on Amazon ECS and Fargate, so this is a good option for long-running jobs. Since you have full control over your container image, you can bring your Python script and extra Python libraries into the container and use them. However, you need to containerize your Python script to use this approach. 

## Migrating workload to Amazon Managed Workflows for Apache Airflow Python Operator
<a name="pyshell-migration-to-amazon-mwaa-python-operator"></a>

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a managed orchestration service for Apache Airflow that makes it easier to set up and operate end-to-end data pipelines in the cloud at scale. If you already have an Amazon MWAA environment, it is straightforward to use the Python operator instead of AWS Glue Python shell jobs. The Python operator runs Python code inside an Airflow workflow. However, if you don't have an existing Amazon MWAA environment, we recommend exploring other options.

## Migrating workload to Amazon SageMaker AI training jobs
<a name="pyshell-migration-to-amazon-sagemaker-ai"></a>

Amazon SageMaker AI Training is a fully managed machine learning (ML) service offered by Amazon SageMaker AI that helps you efficiently train a wide range of ML models at scale. The core of Amazon SageMaker AI training jobs is the containerization of ML workloads and the capability of managing AWS compute resources. If you prefer a serverless environment with no maximum timeout, Amazon SageMaker AI training jobs could be a good fit for you. However, the startup latency tends to be longer than for AWS Glue Python shell jobs. For jobs that are latency-sensitive, we recommend exploring other options.

# Monitoring AWS Glue
<a name="monitor-glue"></a>

Monitoring is an important part of maintaining the reliability, availability, and performance of AWS Glue and your other AWS solutions. You can use the following automated monitoring tools to watch AWS Glue, report when something is wrong, and take action automatically when appropriate:
+ **Amazon CloudWatch Events** delivers a near real-time stream of system events that describe changes in AWS resources. CloudWatch Events enables automated event-driven computing. You can write rules that watch for certain events and trigger automated actions in other AWS services when these events occur. For more information, see the [Amazon CloudWatch Events User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/).
+ **Amazon CloudWatch Logs** enables you to monitor, store, and access your log files from Amazon EC2 instances, AWS CloudTrail, and other sources. CloudWatch Logs can monitor information in the log files and notify you when certain thresholds are met. You can also archive your log data in highly durable storage. For more information, see the [Amazon CloudWatch Logs User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/).
+ **AWS CloudTrail** captures API calls and related events made by or on behalf of your AWS account and delivers the log files to an Amazon S3 bucket that you specify. You can identify which users and accounts call AWS, the source IP address from which the calls are made, and when the calls occur. For more information, see the [AWS CloudTrail User Guide](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/).

Additionally, you have access to the following insights in the AWS Glue console to help you debug and profile jobs:
+ Spark jobs – you can see a visualization of selected CloudWatch metrics series, and newer jobs have access to the Spark UI. For more information, see [Monitoring AWS Glue Spark jobs](monitor-spark.md).
+ Ray jobs – you can see a visualization of selected CloudWatch metrics series. For more information, see [Monitoring Ray jobs with metrics](author-job-ray-monitor.md).

**Topics**
+ [AWS tags in AWS Glue](monitor-tags.md)
+ [Automating AWS Glue with EventBridge](automating-awsglue-with-cloudwatch-events.md)
+ [Monitoring AWS Glue resources](monitor-resource-metrics.md)
+ [Logging AWS Glue API calls with AWS CloudTrail](monitor-cloudtrail.md)

# AWS tags in AWS Glue
<a name="monitor-tags"></a>

To help you manage your AWS Glue resources, you can optionally assign your own tags to some AWS Glue resource types. A tag is a label that you assign to an AWS resource. Each tag consists of a key and an optional value, both of which you define. You can use tags in AWS Glue to organize and identify your resources. Tags can be used to create cost accounting reports and restrict access to resources. If you're using AWS Identity and Access Management, you can control which users in your AWS account have permission to create, edit, or delete tags. In addition to the permissions to call the tag-related APIs, you also need the `glue:GetConnection` permission to call tagging APIs on connections, and the `glue:GetDatabase` permission to call tagging APIs on databases. For more information, see [ABAC with AWS Glue](security_iam_service-with-iam.md#security_iam_service-with-iam-tags). 

In AWS Glue, you can tag the following resources:
+ Connection
+ Database
+ Crawler
+ Interactive session
+ Development endpoint
+ Job
+ Trigger
+ Workflow
+ Blueprint
+ Machine learning transform
+ Data quality ruleset
+ Stream schemas
+ Stream schema registries

**Note**  
As a best practice, to allow tagging of these AWS Glue resources, always include the `glue:TagResource` action in your policies.

Consider the following when using tags with AWS Glue.
+ A maximum of 50 tags are supported per entity.
+ In AWS Glue, you specify tags as a list of key-value pairs in the format `{"string": "string" ...}`.
+ When you create a tag on an object, the tag key is required, and the tag value is optional.
+ The tag key and tag value are case sensitive.
+ The tag key and the tag value must not contain the prefix *aws*. No operations are allowed on such tags.
+ The maximum tag key length is 128 Unicode characters in UTF-8. The tag key must not be empty or null.
+ The maximum tag value length is 256 Unicode characters in UTF-8. The tag value may be empty or null.
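
The constraints above can be illustrated with a small client-side validation helper. This is a hypothetical sketch, not part of any AWS SDK, and it checks only some of the documented constraints:

```python
# A hypothetical client-side helper (not part of any AWS SDK) that
# checks some of the documented tag constraints before calling a
# tagging API: the 50-tag limit, key/value lengths, and the reserved
# "aws" prefix on keys.
def validate_tags(tags):
    errors = []
    if len(tags) > 50:
        errors.append("a maximum of 50 tags are supported per entity")
    for key, value in tags.items():
        if not key or len(key) > 128:
            errors.append("tag key must be 1-128 characters: %r" % key)
        if key.startswith("aws"):
            errors.append("tag key must not use the reserved aws prefix: %r" % key)
        if value is not None and len(value) > 256:
            errors.append("tag value too long for key %r" % key)
    return errors

print(validate_tags({"env": "dev-test"}))    # → []
print(validate_tags({"aws:category": "x"}))  # one error: reserved prefix
```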

## Tagging support for AWS Glue connections
<a name="tag-connections"></a>

You can restrict permission for the `CreateConnection`, `UpdateConnection`, `GetConnection`, and `DeleteConnection` actions based on the resource tag. This enables you to implement least-privilege access control for AWS Glue jobs with JDBC data sources that need to fetch JDBC connection information from the Data Catalog.

**Example usage**  
Create an AWS Glue connection with the tag key `connection-category` and the tag value `dev-test`.

Specify the tag condition for the `GetConnection` action in the IAM policy.

```
{
   "Effect": "Allow",
   "Action": [
     "glue:GetConnection"
     ],
   "Resource": "*",
   "Condition": {
     "ForAnyValue:StringEquals": {
       "aws:ResourceTag/tagKey": "dev-test"
     }
   }
 }
```

## Examples
<a name="TagExamples"></a>

The following examples create a job with assigned tags. 

**AWS CLI**

```
aws glue create-job --name job-test-tags --role MyJobRole --command Name=glueetl,ScriptLocation=s3://aws-glue-scripts//prod-job1 \
--tags key1=value1,key2=value2
```

**CloudFormation JSON**

```
{
  "Description": "AWS Glue Job Test Tags",
  "Resources": {
    "MyJobRole": {
      "Type": "AWS::IAM::Role",
      "Properties": {
        "AssumeRolePolicyDocument": {
          "Version": "2012-10-17",		 	 	 
          "Statement": [
            {
              "Effect": "Allow",
              "Principal": {
                "Service": [
                  "glue.amazonaws.com"
                ]
              },
              "Action": [
                "sts:AssumeRole"
              ]
            }
          ]
        },
        "Path": "/",
        "Policies": [
          {
            "PolicyName": "root",
            "PolicyDocument": {
              "Version": "2012-10-17",		 	 	 
              "Statement": [
                {
                  "Effect": "Allow",
                  "Action": "*",
                  "Resource": "*"
                }
              ]
            }
          }
        ]
      }
    },
    "MyJob": {
      "Type": "AWS::Glue::Job",
      "Properties": {
        "Command": {
          "Name": "glueetl",
          "ScriptLocation": "s3://aws-glue-scripts//prod-job1"
        },
        "DefaultArguments": {
          "--job-bookmark-option": "job-bookmark-enable"
        },
        "ExecutionProperty": {
          "MaxConcurrentRuns": 2
        },
        "MaxRetries": 0,
        "Name": "cf-job1",
        "Role": {
           "Ref": "MyJobRole",
           "Tags": {
                "key1": "value1", 
                "key2": "value2"
           } 
        }
      }
    }
  }
}
```

**CloudFormation YAML**

```
Description: AWS Glue Job Test Tags
Resources:
  MyJobRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service:
                - glue.amazonaws.com
            Action:
              - sts:AssumeRole
      Path: "/"
      Policies:
        - PolicyName: root
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action: "*"
                Resource: "*"
  MyJob:
    Type: AWS::Glue::Job
    Properties:
      Command:
        Name: glueetl
        ScriptLocation: s3://aws-glue-scripts//prod-job1
      DefaultArguments:
        "--job-bookmark-option": job-bookmark-enable
      ExecutionProperty:
        MaxConcurrentRuns: 2
      MaxRetries: 0
      Name: cf-job1
      Role:
        Ref: MyJobRole
      Tags:
        key1: value1
        key2: value2
```

For more information, see [AWS Tagging Strategies](https://aws.amazon.com/answers/account-management/aws-tagging-strategies/). 

For information about how to control access using tags, see [ABAC with AWS Glue](security_iam_service-with-iam.md#security_iam_service-with-iam-tags).

# Automating AWS Glue with EventBridge
<a name="automating-awsglue-with-cloudwatch-events"></a>

You can use Amazon EventBridge to automate your AWS services and respond automatically to system events such as application availability issues or resource changes. Events from AWS services are delivered to EventBridge in near real time. You can write simple rules to indicate which events are of interest to you, and what automated actions to take when an event matches a rule. The actions that can be automatically triggered include the following:
+ Invoking an AWS Lambda function
+ Invoking Amazon EC2 Run Command
+ Relaying the event to Amazon Kinesis Data Streams
+ Activating an AWS Step Functions state machine
+ Notifying an Amazon SNS topic or an Amazon SQS queue

Some examples of using EventBridge with AWS Glue include the following:
+ Activating a Lambda function when an ETL job succeeds
+ Notifying an Amazon SNS topic when an ETL job fails
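
For example, the SNS scenario can be driven by an EventBridge rule whose event pattern matches failed job runs. A minimal sketch of such a pattern:

```json
{
  "source": ["aws.glue"],
  "detail-type": ["Glue Job State Change"],
  "detail": {
    "state": ["FAILED"]
  }
}
```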

The following EventBridge events are generated by AWS Glue.
+ Events for `"detail-type":"Glue Job State Change"` are generated for `SUCCEEDED`, `FAILED`, `TIMEOUT`, and `STOPPED`.
+ Events for `"detail-type":"Glue Job Run Status"` are generated for `RUNNING`, `STARTING`, and `STOPPING` job runs when they exceed the job delay notification threshold. You must set the job delay notification threshold property to receive these events.

  Only one event is generated per job run status when the job delay notification threshold is exceeded.
+ Events for `"detail-type":"Glue Crawler State Change"` are generated for `Started`, `Succeeded`, and `Failed`.
+ Events for `"detail-type":"Glue Scheduled Crawler Invocation Failure"` are generated when the scheduled crawler fails to start. In the details of the notification:
  + `customerId` contains the account ID of the customer.
  + `crawlerName` contains the name of the crawler that failed to start.
  + `errorMessage` contains the exception message of the invocation failure.
+ Events for `"detail-type":"Glue Auto Statistics Invocation Failure"` are generated when the auto-managed column statistics task run fails to start. In the details of the notification:
  + `catalogId` contains the ID associated with a catalog.
  + `databaseName` contains the name of the affected database.
  + `tableName` contains the name of the affected table.
  + `errorMessage` contains the exception message of the invocation failure.
+ Events for `"detail-type":"Glue Scheduled Statistics Invocation Failure"` are generated when the (cron) scheduled column statistics task run fails to start. In the details of the notification:
  + `catalogId` contains the ID associated with a catalog.
  + `databaseName` contains the name of the affected database.
  + `tableName` contains the name of the affected table.
  + `errorMessage` contains the exception message of the invocation failure.
+ Events for `"detail-type":"Glue Statistics Task Started"` are generated when the column statistics task run starts.
+ Events for `"detail-type":"Glue Statistics Task Succeeded"` are generated when the column statistics task run succeeds.
+ Events for `"detail-type":"Glue Statistics Task Failed"` are generated when the column statistics task run fails.
+ Events for `"detail-type":"Glue Data Catalog Database State Change"` are generated for `CreateDatabase`, `DeleteDatabase`, `CreateTable`, `DeleteTable` and `BatchDeleteTable`. For example, if a table is created or deleted, a notification is sent to EventBridge. Note that you cannot write a program that depends on the order or existence of notification events, as they might be out of sequence or missing. Events are emitted on a best effort basis. In the details of the notification:
  + The `typeOfChange` contains the name of the API operation.
  + The `databaseName` contains the name of the affected database.
  + The `changedTables` contains up to 100 names of affected tables per notification. When table names are long, multiple notifications might be created.
+ Events for `"detail-type":"Glue Data Catalog Table State Change"` are generated for `UpdateTable`, `CreatePartition`, `BatchCreatePartition`, `UpdatePartition`, `DeletePartition`, `BatchUpdatePartition` and `BatchDeletePartition`. For example, if a table or partition is updated, a notification is sent to EventBridge. Note that you cannot write a program that depends on the order or existence of notification events, as they might be out of sequence or missing. Events are emitted on a best effort basis. In the details of the notification:
  + The `typeOfChange` contains the name of the API operation.
  + The `databaseName` contains the name of the database that contains the affected resources.
  + The `tableName` contains the name of the affected table.
  + The `changedPartitions` specifies up to 100 affected partitions in one notification. When partition names are long, multiple notifications might be created.

    For example, if there are two partition keys, `Year` and `Month`, then `"2018,01", "2018,02"` indicates that the partition where `"Year=2018" and "Month=01"` and the partition where `"Year=2018" and "Month=02"` were modified.

    ```
    {
        "version":"0",
        "id":"abcdef00-1234-5678-9abc-def012345678",
        "detail-type":"Glue Data Catalog Table State Change",
        "source":"aws.glue",
        "account":"123456789012",
        "time":"2017-09-07T18:57:21Z",
        "region":"us-west-2",
        "resources":["arn:aws:glue:us-west-2:123456789012:database/default/foo"],
        "detail":{
            "changedPartitions": [
                "2018,01",
                "2018,02"
            ],
            "databaseName": "default",
            "tableName": "foo",
            "typeOfChange": "BatchCreatePartition"
            }
    }
    ```
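
A consumer of these events can recover the partition values by pairing each `changedPartitions` entry with the table's partition keys in order. A minimal sketch (the helper function is hypothetical, not part of the AWS SDK):

```python
# A hypothetical helper (not part of the AWS SDK): pair each
# "changedPartitions" entry with the table's partition keys in order.
partition_keys = ["Year", "Month"]
changed_partitions = ["2018,01", "2018,02"]

def to_partition_spec(keys, value_string):
    # "2018,01" with keys ["Year", "Month"] -> {"Year": "2018", "Month": "01"}
    return dict(zip(keys, value_string.split(",")))

specs = [to_partition_spec(partition_keys, p) for p in changed_partitions]
print(specs)
```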

For more information, see the [Amazon CloudWatch Events User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/). For events specific to AWS Glue, see [AWS Glue Events](https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#glue-event-types).

# Monitoring AWS Glue resources
<a name="monitor-resource-metrics"></a>

AWS Glue has service limits to protect customers from unexpected excessive provisioning and from malicious actions intended to increase your bill. The limits also protect the service. By signing in to the Service Quotas console, you can view your current resource limits and request an increase (where appropriate).

AWS Glue allows you to view the service's resource usage as a percentage in Amazon CloudWatch and to configure CloudWatch alarms on it to monitor usage. Amazon CloudWatch provides monitoring for AWS resources and the customer applications running on the Amazon infrastructure. The metrics are free of charge to you. The following metrics are supported:
+ Number of workflows per account
+ Number of triggers per account
+ Number of jobs per account
+ Number of concurrent job runs per account
+ Number of blueprints per account
+ Number of interactive sessions per account

## Configuring and using resource metrics
<a name="monitor-resource-metrics"></a>

To use this feature, go to the Amazon CloudWatch console to view the metrics and configure alarms. The metrics are under the AWS/Glue namespace and represent the actual resource usage count divided by the resource quota, as a percentage. The CloudWatch metrics are delivered to your account at no cost. For example, if you have created 10 workflows and your service quota allows a maximum of 200 workflows, then your usage is 10/200 = 5%, and in the graph you will see a data point of 5 percent. To be more specific:

```
Namespace: AWS/Glue
Metric name: ResourceUsage
Type: Resource
Resource: Workflow (or Trigger, Job, JobRun, Blueprint, InteractiveSession)
Service: Glue
Class: None
```

![\[Resource metrics\]](http://docs.aws.amazon.com/glue/latest/dg/images/resource_monitoring_1.png)


To create an alarm on a metric in the CloudWatch console:

1. Once you locate the metric, go to **Graphed metrics**.

1. Click **Create alarm** under **Actions**.

1. Configure the alarm as needed.

We emit metrics whenever your resource usage changes (an increase or a decrease). If your resource usage doesn't change, we emit metrics hourly so that you have a continuous CloudWatch graph. To avoid missing data points, we don't recommend configuring a period of less than 1 hour.

You can also configure alarms using AWS CloudFormation, as in the following example. In this example, when workflow resource usage exceeds 80%, the alarm sends a message to the existing SNS topic, which you can subscribe to in order to get notifications.

```
{
	"Type": "AWS::CloudWatch::Alarm",
	"Properties": {
		"AlarmName": "WorkflowUsageAlarm",
		"ActionsEnabled": true,
		"OKActions": [],
		"AlarmActions": [
			"arn:aws:sns:af-south-1:085425700061:Default_CloudWatch_Alarms_Topic"
		],
		"InsufficientDataActions": [],
		"MetricName": "ResourceUsage",
		"Namespace": "AWS/Glue",
		"Statistic": "Maximum",
		"Dimensions": [{
				"Name": "Type",
				"Value": "Resource"
			},
			{
				"Name": "Resource",
				"Value": "Workflow"
			},
			{
				"Name": "Service",
				"Value": "Glue"
			},
			{
				"Name": "Class",
				"Value": "None"
			}
		],
		"Period": 3600,
		"EvaluationPeriods": 1,
		"DatapointsToAlarm": 1,
		"Threshold": 80,
		"ComparisonOperator": "GreaterThanThreshold",
		"TreatMissingData": "notBreaching"
	}
}
```

# Logging AWS Glue API calls with AWS CloudTrail
<a name="monitor-cloudtrail"></a>

AWS Glue is integrated with AWS CloudTrail, a service that provides a record of actions taken by a user, role, or an AWS service in AWS Glue. CloudTrail captures all API calls for AWS Glue as events. The calls captured include calls from the AWS Glue console and code calls to the AWS Glue API operations. If you create a trail, you can enable continuous delivery of CloudTrail events to an Amazon S3 bucket, including events for AWS Glue. If you don't configure a trail, you can still view the most recent events in the CloudTrail console in **Event history**. Using the information collected by CloudTrail, you can determine the request that was made to AWS Glue, the IP address from which the request was made, who made the request, when it was made, and additional details. 

To learn more about CloudTrail, see the [AWS CloudTrail User Guide](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/).

## AWS Glue information in CloudTrail
<a name="monitor-cloudtrail-info"></a>

CloudTrail is enabled on your AWS account when you create the account. When activity occurs in AWS Glue, that activity is recorded in a CloudTrail event along with other AWS service events in **Event history**. You can view, search, and download recent events in your AWS account. For more information, see [Viewing Events with CloudTrail Event History](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/view-cloudtrail-events.html). 

For an ongoing record of events in your AWS account, including events for AWS Glue, create a trail. A *trail* enables CloudTrail to deliver log files to an Amazon S3 bucket. By default, when you create a trail in the console, the trail applies to all AWS Regions. The trail logs events from all Regions in the AWS partition and delivers the log files to the Amazon S3 bucket that you specify. Additionally, you can configure other AWS services to further analyze and act upon the event data collected in CloudTrail logs. For more information, see the following: 
+ [Creating a trail for your AWS account](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-create-and-update-a-trail.html)
+ [CloudTrail supported services and integrations](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-aws-service-specific-topics.html#cloudtrail-aws-service-specific-topics-integrations)
+ [Configuring Amazon SNS Notifications for CloudTrail](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/getting_notifications_top_level.html)
+ [Receiving CloudTrail Log Files from Multiple Regions](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/receive-cloudtrail-log-files-from-multiple-regions.html) and [Receiving CloudTrail Log Files from Multiple Accounts](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-receive-logs-from-multiple-accounts.html)

All AWS Glue actions are logged by CloudTrail and are documented in the [AWS Glue API](aws-glue-api.md). For example, calls to the `CreateDatabase`, `CreateTable`, and `CreateScript` actions generate entries in the CloudTrail log files.

Every event or log entry contains information about who generated the request. The identity information helps you determine the following: 
+ Whether the request was made with root or IAM user credentials.
+ Whether the request was made with temporary security credentials for a role or federated user.
+ Whether the request was made by another AWS service.

For more information, see the [CloudTrail userIdentity Element](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-event-reference-user-identity.html).

However, CloudTrail doesn't log all information regarding calls. For example, it doesn't log certain sensitive information, such as the `ConnectionProperties` used in connection requests, and it logs a `null` instead of the responses returned by the following APIs:

```
BatchGetPartition       GetCrawlers          GetJobs          GetTable
CreateScript            GetCrawlerMetrics    GetJobRun        GetTables
GetCatalogImportStatus  GetDatabase          GetJobRuns       GetTableVersions
GetClassifier           GetDatabases         GetMapping       GetTrigger
GetClassifiers          GetDataflowGraph     GetObjects       GetTriggers
GetConnection           GetDevEndpoint       GetPartition     GetUserDefinedFunction
GetConnections          GetDevEndpoints      GetPartitions    GetUserDefinedFunctions
GetCrawler              GetJob               GetPlan
```

## Understanding AWS Glue log file entries
<a name="monitor-cloudtrail-logs"></a>

A trail is a configuration that enables delivery of events as log files to an Amazon S3 bucket that you specify. CloudTrail log files contain one or more log entries. An event represents a single request from any source and includes information about the requested action, the date and time of the action, request parameters, and so on. CloudTrail log files aren't an ordered stack trace of the public API calls, so they don't appear in any specific order. 

The following example shows a CloudTrail log entry that demonstrates the `DeleteCrawler` action.

```
{
  "eventVersion": "1.05",
  "userIdentity": {
    "type": "IAMUser",
    "principalId": "AKIAIOSFODNN7EXAMPLE",
    "arn": "arn:aws:iam::123456789012:user/johndoe",
    "accountId": "123456789012",
    "accessKeyId": "AKIAIOSFODNN7EXAMPLE",
    "userName": "johndoe"
  },
  "eventTime": "2017-10-11T22:29:49Z",
  "eventSource": "glue.amazonaws.com",
  "eventName": "DeleteCrawler",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "72.21.198.64",
  "userAgent": "aws-cli/1.11.148 Python/3.6.1 Darwin/16.7.0 botocore/1.7.6",
  "requestParameters": {
    "name": "tes-alpha"
  },
  "responseElements": null,
  "requestID": "b16f4050-aed3-11e7-b0b3-75564a46954f",
  "eventID": "e73dd117-cfd1-47d1-9e2f-d1271cad838c",
  "eventType": "AwsApiCall",
  "recipientAccountId": "123456789012"
}
```
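
Fields such as `userIdentity`, `eventName`, and `eventTime` carry the who, what, and when described earlier. A minimal sketch that extracts them from an entry like the one above (abbreviated here):

```python
import json

# A minimal sketch: extract the "who, what, when" fields from a
# CloudTrail log entry like the one above (abbreviated here).
entry = json.loads("""
{
  "userIdentity": {"type": "IAMUser", "userName": "johndoe"},
  "eventTime": "2017-10-11T22:29:49Z",
  "eventSource": "glue.amazonaws.com",
  "eventName": "DeleteCrawler",
  "requestParameters": {"name": "tes-alpha"}
}
""")

summary = {
    "who": entry["userIdentity"]["userName"],
    "action": entry["eventName"],
    "when": entry["eventTime"],
}
print(summary)
```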

This example shows a CloudTrail log entry that demonstrates a `CreateConnection` action.

```
{
  "eventVersion": "1.05",
  "userIdentity": {
    "type": "IAMUser",
    "principalId": "AKIAIOSFODNN7EXAMPLE",
    "arn": "arn:aws:iam::123456789012:user/johndoe",
    "accountId": "123456789012",
    "accessKeyId": "AKIAIOSFODNN7EXAMPLE",
    "userName": "johndoe"
  },
  "eventTime": "2017-10-13T00:19:19Z",
  "eventSource": "glue.amazonaws.com",
  "eventName": "CreateConnection",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "72.21.198.66",
  "userAgent": "aws-cli/1.11.148 Python/3.6.1 Darwin/16.7.0 botocore/1.7.6",
  "requestParameters": {
    "connectionInput": {
      "name": "test-connection-alpha",
      "connectionType": "JDBC",
      "physicalConnectionRequirements": {
        "subnetId": "subnet-323232",
        "availabilityZone": "us-east-1a",
        "securityGroupIdList": [
          "sg-12121212"
        ]
      }
    }
  },
  "responseElements": null,
  "requestID": "27136ebc-afac-11e7-a7d6-ab217e5c3f19",
  "eventID": "e8b3baeb-c511-4597-880f-c16210c60a4a",
  "eventType": "AwsApiCall",
  "recipientAccountId": "123456789012"
}
```