

# Amazon S3 connections
<a name="aws-glue-programming-etl-connect-s3-home"></a>

You can use AWS Glue for Spark to read and write files in Amazon S3. AWS Glue for Spark supports many common data formats stored in Amazon S3 out of the box, including CSV, Avro, JSON, ORC, and Parquet. For more information about supported data formats, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md). Each data format may support a different set of AWS Glue features. Consult the page for your data format for the specifics of feature support. Additionally, you can read and write versioned files stored in the Hudi, Iceberg, and Delta Lake data lake frameworks. For more information about data lake frameworks, see [Using data lake frameworks with AWS Glue ETL jobs](aws-glue-programming-etl-datalake-native-frameworks.md). 

With AWS Glue, you can partition your Amazon S3 objects into a folder structure while writing, then retrieve them by partition with simple configuration to improve performance. You can also configure AWS Glue to group small files together when transforming your data, which improves performance. You can read and write `bzip2` and `gzip` archives in Amazon S3.

**Topics**
+ [Configuring S3 connections](#aws-glue-programming-etl-connect-s3-configure)
+ [Amazon S3 connection option reference](#aws-glue-programming-etl-connect-s3)
+ [Deprecated connection syntaxes for data formats](#aws-glue-programming-etl-connect-legacy-format)
+ [Excluding Amazon S3 storage classes](aws-glue-programming-etl-storage-classes.md)
+ [Managing partitions for ETL output in AWS Glue](aws-glue-programming-etl-partitions.md)
+ [Reading input files in larger groups](grouping-input-files.md)
+ [Amazon VPC endpoints for Amazon S3](vpc-endpoints-s3.md)

## Configuring S3 connections
<a name="aws-glue-programming-etl-connect-s3-configure"></a>

To connect to Amazon S3 in an AWS Glue for Spark job, you will need some prerequisites:
+ The AWS Glue job must have IAM permissions for relevant Amazon S3 buckets.

In certain cases, you will need to configure additional prerequisites:
+ When configuring cross-account access, you will need appropriate access controls on the Amazon S3 bucket.
+ For security reasons, you may choose to route your Amazon S3 requests through an Amazon VPC. This approach can introduce bandwidth and availability challenges. For more information, see [Amazon VPC endpoints for Amazon S3](vpc-endpoints-s3.md). 
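As an illustration of the IAM prerequisite, the following sketches a minimal S3 policy for a job role. The bucket name is a placeholder, and a real job role typically needs additional permissions (for example, AWS Glue service and CloudWatch Logs actions):

```python
import json

# A minimal, illustrative S3 policy for an AWS Glue job role.
# "amzn-s3-demo-bucket" is a placeholder bucket name.
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::amzn-s3-demo-bucket",
        },
    ],
}

print(json.dumps(s3_policy, indent=2))
```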

## Amazon S3 connection option reference
<a name="aws-glue-programming-etl-connect-s3"></a>

Designates a connection to Amazon S3.

Since Amazon S3 manages files rather than tables, in addition to specifying the connection properties provided in this document, you will need to specify additional configuration about your file type. You specify this information through data format options. For more information about format options, see [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md). You can also specify this information by integrating with the AWS Glue Data Catalog.

For an example of the distinction between connection options and format options, consider how the [create\_dynamic\_frame\_from\_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method takes `connection_type`, `connection_options`, `format`, and `format_options`. This section specifically discusses parameters provided to `connection_options`.

Use the following connection options with `"connectionType": "s3"`:
+ `"paths"`: (Required) A list of the Amazon S3 paths to read from.
+ `"exclusions"`: (Optional) A string containing a JSON list of Unix-style glob patterns to exclude. For example, `"[\"**.pdf\"]"` excludes all PDF files. For more information about the glob syntax that AWS Glue supports, see [Include and Exclude Patterns](https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html#crawler-data-stores-exclude).
+ `"compressionType"` or `"compression"`: (Optional) Specifies how the data is compressed. Use `"compressionType"` for Amazon S3 sources and `"compression"` for Amazon S3 targets. This is generally not necessary if the data has a standard file extension. Possible values are `"gzip"` and `"bzip2"`. Additional compression formats may be supported for specific formats. For the specifics of feature support, consult the data format page. 
+ `"groupFiles"`: (Optional) Grouping files is turned on by default when the input contains more than 50,000 files. To turn on grouping with fewer than 50,000 files, set this parameter to `"inPartition"`. To disable grouping when there are more than 50,000 files, set this parameter to `"none"`.
+ `"groupSize"`: (Optional) The target group size in bytes. The default is computed based on the input data size and the size of your cluster. When there are fewer than 50,000 input files, `"groupFiles"` must be set to `"inPartition"` for this to take effect.
+ `"recurse"`: (Optional) If set to true, recursively reads files in all subdirectories under the specified paths.
+ `"maxBand"`: (Optional, advanced) This option controls the duration in milliseconds after which the `s3` listing is likely to be consistent. Files with modification timestamps falling within the last `maxBand` milliseconds are tracked specially when using `JobBookmarks` to account for Amazon S3 eventual consistency. Most users don't need to set this option. The default is 900000 milliseconds, or 15 minutes.
+ `"maxFilesInBand"`: (Optional, advanced) This option specifies the maximum number of files to save from the last `maxBand` milliseconds. If this number is exceeded, extra files are skipped and only processed in the next job run. Most users don't need to set this option.
+ `"isFailFast"`: (Optional) This option determines whether an AWS Glue ETL job throws reader parsing exceptions. If set to `true`, jobs fail fast if four retries of the Spark task fail to parse the data correctly.
+ `"catalogPartitionPredicate"`: (Optional) Used for Read. The contents of a SQL `WHERE` clause. Used when reading from Data Catalog tables with a very large quantity of partitions. Retrieves matching partitions from Data Catalog indices. Used with `push_down_predicate`, an option on the [create\_dynamic\_frame\_from\_catalog](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_catalog) method (and other similar methods). For more information, see [Server-side filtering using catalog partition predicates](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-cat-predicates).
+ `"partitionKeys"`: (Optional) Used for Write. An array of column label strings. AWS Glue will partition your data as specified by this configuration. For more information, see [Writing partitions](aws-glue-programming-etl-partitions.md#aws-glue-programming-etl-partitions-writing).
+ `"excludeStorageClasses"`: (Optional) Used for Read. An array of strings specifying Amazon S3 storage classes. AWS Glue will exclude Amazon S3 objects based on this configuration. For more information, see [Excluding Amazon S3 storage classes](aws-glue-programming-etl-storage-classes.md).
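As a hedged sketch, several of the options above can be combined in a single `connection_options` dictionary. The S3 path and option values here are illustrative placeholders:

```python
# Illustrative connection_options for "connectionType": "s3".
# The S3 path is a placeholder.
connection_options = {
    "paths": ["s3://amzn-s3-demo-bucket/input/"],
    "exclusions": "[\"**.pdf\"]",  # skip PDF files (a JSON list as a string)
    "recurse": True,               # read subdirectories
    "groupFiles": "inPartition",   # group small files together
    "groupSize": "1048576",        # target 1 MB groups
    "compressionType": "gzip",     # input is gzip-compressed
}

# In a Glue job, this dictionary would be passed along these lines:
# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="s3",
#     connection_options=connection_options,
#     format="json")
```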

## Deprecated connection syntaxes for data formats
<a name="aws-glue-programming-etl-connect-legacy-format"></a>

Certain data formats can be accessed using a specific connection type syntax. This syntax is deprecated. We recommend you specify your formats using the `s3` connection type and the format options provided in [Data format options for inputs and outputs in AWS Glue for Spark](aws-glue-programming-etl-format.md) instead.
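For example, a read that previously used the deprecated `parquet` connection type can be expressed with the `s3` connection type and an explicit `format` argument. This is a sketch; the path is a placeholder:

```python
# Deprecated style (avoid):
# dyf = glueContext.create_dynamic_frame.from_options(
#     connection_type="parquet",
#     connection_options={"paths": ["s3://amzn-s3-demo-bucket/data/"]})

# Recommended style: connection type "s3" plus an explicit format.
recommended_args = {
    "connection_type": "s3",
    "connection_options": {"paths": ["s3://amzn-s3-demo-bucket/data/"]},
    "format": "parquet",
}

# dyf = glueContext.create_dynamic_frame.from_options(**recommended_args)
```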

### "connectionType": "Orc"
<a name="aws-glue-programming-etl-connect-orc"></a>

Designates a connection to files stored in Amazon S3 in the [Apache Hive Optimized Row Columnar (ORC)](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC) file format.

Use the following connection options with `"connectionType": "orc"`:
+ `paths`: (Required) A list of the Amazon S3 paths to read from.
+ *(Other option name/value pairs)*: Any additional options, including formatting options, are passed directly to the SparkSQL `DataSource`.

### "connectionType": "parquet"
<a name="aws-glue-programming-etl-connect-parquet"></a>

Designates a connection to files stored in Amazon S3 in the [Apache Parquet](https://parquet.apache.org/docs/) file format.

Use the following connection options with `"connectionType": "parquet"`:
+ `paths`: (Required) A list of the Amazon S3 paths to read from.
+ *(Other option name/value pairs)*: Any additional options, including formatting options, are passed directly to the SparkSQL `DataSource`.

# Excluding Amazon S3 storage classes
<a name="aws-glue-programming-etl-storage-classes"></a>

If you're running AWS Glue ETL jobs that read files or partitions from Amazon Simple Storage Service (Amazon S3), you can exclude some Amazon S3 storage class types.

The following storage classes are available in Amazon S3:
+ `STANDARD` — For general-purpose storage of frequently accessed data.
+ `INTELLIGENT_TIERING` — For data with unknown or changing access patterns.
+ `STANDARD_IA` and `ONEZONE_IA` — For long-lived, but less frequently accessed data.
+ `GLACIER`, `DEEP_ARCHIVE`, and `REDUCED_REDUNDANCY` — For long-term archive and digital preservation.

For more information, see [Amazon S3 Storage Classes](https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html) in the *Amazon S3 User Guide*.

The examples in this section show how to exclude the `GLACIER` and `DEEP_ARCHIVE` storage classes. These classes allow you to list files, but they won't let you read the files unless they are restored. (For more information, see [Restoring Archived Objects](https://docs.aws.amazon.com/AmazonS3/latest/dev/restoring-objects.html) in the *Amazon S3 Developer Guide*.)

By using storage class exclusions, you can ensure that your AWS Glue jobs will work on tables that have partitions across these storage class tiers. Without exclusions, jobs that read data from these tiers fail with the following error: `AmazonS3Exception: The operation is not valid for the object's storage class`.

There are different ways that you can filter Amazon S3 storage classes in AWS Glue.

**Topics**
+ [Excluding Amazon S3 storage classes when creating a Dynamic Frame](#aws-glue-programming-etl-storage-classes-dynamic-frame)
+ [Excluding Amazon S3 storage classes on a Data Catalog table](#aws-glue-programming-etl-storage-classes-table)

## Excluding Amazon S3 storage classes when creating a Dynamic Frame
<a name="aws-glue-programming-etl-storage-classes-dynamic-frame"></a>

To exclude Amazon S3 storage classes while creating a dynamic frame, use `excludeStorageClasses` in `additionalOptions`. AWS Glue automatically uses its own Amazon S3 `Lister` implementation to list and exclude files corresponding to the specified storage classes.

The following Python and Scala examples show how to exclude the `GLACIER` and `DEEP_ARCHIVE` storage classes when creating a dynamic frame.

Python example:

```
glueContext.create_dynamic_frame.from_catalog(
    database = "my_database",
    table_name = "my_table_name",
    redshift_tmp_dir = "",
    transformation_ctx = "my_transformation_context",
    additional_options = {
        "excludeStorageClasses" : ["GLACIER", "DEEP_ARCHIVE"]
    }
)
```

Scala example:

```
val df = glueContext.getCatalogSource(
    nameSpace, tableName, "", "my_transformation_context",  
    additionalOptions = JsonOptions(
        Map("excludeStorageClasses" -> List("GLACIER", "DEEP_ARCHIVE"))
    )
).getDynamicFrame()
```

## Excluding Amazon S3 storage classes on a Data Catalog table
<a name="aws-glue-programming-etl-storage-classes-table"></a>

You can specify storage class exclusions to be used by an AWS Glue ETL job as a table parameter in the AWS Glue Data Catalog. You can include this parameter in the `CreateTable` operation using the AWS Command Line Interface (AWS CLI) or programmatically using the API. For more information, see [Table Structure](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-Table) and [CreateTable](https://docs.aws.amazon.com/glue/latest/webapi/API_CreateTable.html). 
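If you manage tables programmatically, the exclusion can be included in the table's `Parameters` map. The following is a hedged boto3 sketch; the database and table names are placeholders, and the API call itself is shown commented out:

```python
# Sketch: setting excludeStorageClasses as a Data Catalog table parameter.
# The value is a JSON-formatted list serialized as a string.
table_parameters = {
    "excludeStorageClasses": '["GLACIER", "DEEP_ARCHIVE"]',
}

# With boto3 this might look like the following (names are placeholders,
# and TableInput requires additional fields omitted here):
# import boto3
# glue = boto3.client("glue")
# glue.update_table(
#     DatabaseName="my_database",
#     TableInput={
#         "Name": "my_table_name",
#         "Parameters": table_parameters,
#     },
# )
```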

You can also specify excluded storage classes on the AWS Glue console.

**To exclude Amazon S3 storage classes (console)**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane on the left, choose **Tables**.

1. Choose the table name in the list, and then choose **Edit table**.

1. In **Table properties**, add **excludeStorageClasses** as a key and **\["GLACIER", "DEEP\_ARCHIVE"\]** as a value.

1. Choose **Apply**.

# Managing partitions for ETL output in AWS Glue
<a name="aws-glue-programming-etl-partitions"></a>

Partitioning is an important technique for organizing datasets so they can be queried efficiently. It organizes data in a hierarchical directory structure based on the distinct values of one or more columns.

For example, you might decide to partition your application logs in Amazon Simple Storage Service (Amazon S3) by date, broken down by year, month, and day. Files that correspond to a single day's worth of data are then placed under a prefix such as `s3://my_bucket/logs/year=2018/month=01/day=23/`. Systems like Amazon Athena, Amazon Redshift Spectrum, and now AWS Glue can use these partitions to filter data by partition value without having to read all the underlying data from Amazon S3.
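The prefix scheme above can be generated mechanically. As a small illustration (the helper function and bucket name are not part of AWS Glue):

```python
from datetime import date

def partition_prefix(bucket: str, d: date) -> str:
    """Build a Hive-style year/month/day prefix like the example above."""
    return (f"s3://{bucket}/logs/"
            f"year={d.year}/month={d.month:02d}/day={d.day:02d}/")

print(partition_prefix("my_bucket", date(2018, 1, 23)))
# → s3://my_bucket/logs/year=2018/month=01/day=23/
```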

Crawlers not only infer file types and schemas, they also automatically identify the partition structure of your dataset when they populate the AWS Glue Data Catalog. The resulting partition columns are available for querying in AWS Glue ETL jobs or query engines like Amazon Athena.

After you crawl a table, you can view the partitions that the crawler created. In the AWS Glue console, choose **Tables** in the left navigation pane. Choose the table created by the crawler, and then choose **View Partitions**.

For Apache Hive-style partitioned paths in `key=val` style, crawlers automatically populate the column name using the key name. Otherwise, the crawler uses default names like `partition_0`, `partition_1`, and so on. You can change the default names on the console. To do so, navigate to the table. Check if indexes exist under the **Indexes** tab. If they do, you need to delete them to proceed (you can recreate them using the new column names afterwards). Then, choose **Edit Schema**, and modify the names of the partition columns there.

In your ETL scripts, you can then filter on the partition columns. Because the partition information is stored in the Data Catalog, use the `from_catalog` API calls to include the partition columns in the `DynamicFrame`. For example, use `create_dynamic_frame.from_catalog` instead of `create_dynamic_frame.from_options`.

Partitioning is an optimization technique that reduces the amount of data scanned. For more information about the process of identifying when this technique is appropriate, consult [Reduce the amount of data scan](https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-aws-glue-for-apache-spark/reduce-data-scan.html) in the *Best practices for performance tuning AWS Glue for Apache Spark jobs* guide on AWS Prescriptive Guidance.

## Pre-filtering using pushdown predicates
<a name="aws-glue-programming-etl-partitions-pushdowns"></a>

In many cases, you can use a pushdown predicate to filter on partitions without having to list and read all the files in your dataset. Instead of reading the entire dataset and then filtering in a DynamicFrame, you can apply the filter directly on the partition metadata in the Data Catalog. Then you only list and read what you actually need into a DynamicFrame.

For example, in Python, you could write the following.

```
glue_context.create_dynamic_frame.from_catalog(
    database = "my_S3_data_set",
    table_name = "catalog_data_table",
    push_down_predicate = my_partition_predicate)
```

This creates a DynamicFrame that loads only the partitions in the Data Catalog that satisfy the predicate expression. Depending on how small a subset of your data you are loading, this can save a great deal of processing time.

The predicate expression can be any Boolean expression supported by Spark SQL. Anything you could put in a `WHERE` clause in a Spark SQL query will work. For example, the predicate expression `pushDownPredicate = "(year=='2017' and month=='04')"` loads only the partitions in the Data Catalog that have both `year` equal to 2017 and `month` equal to 04. For more information, see the [Apache Spark SQL documentation](https://spark.apache.org/docs/2.1.1/sql-programming-guide.html), and in particular, the [Scala SQL functions reference](https://spark.apache.org/docs/2.1.1/api/scala/index.html#org.apache.spark.sql.functions$).

## Server-side filtering using catalog partition predicates
<a name="aws-glue-programming-etl-partitions-cat-predicates"></a>

The `push_down_predicate` option is applied after listing all the partitions from the catalog and before listing files from Amazon S3 for those partitions. If you have a lot of partitions for a table, catalog partition listing can still incur additional time overhead. To address this overhead, you can use server-side partition pruning with the `catalogPartitionPredicate` option that uses [partition indexes](https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html) in the AWS Glue Data Catalog. This makes partition filtering much faster when you have millions of partitions in one table. You can use both `push_down_predicate` and `catalogPartitionPredicate` in `additional_options` together if your `catalogPartitionPredicate` requires predicate syntax that is not yet supported with the catalog partition indexes.

Python:

```
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database=dbname, 
    table_name=tablename,
    transformation_ctx="datasource0",
    push_down_predicate="day>=10 and customer_id like '10%'",
    additional_options={"catalogPartitionPredicate":"year='2021' and month='06'"}
)
```

Scala:

```
val dynamicFrame = glueContext.getCatalogSource(
    database = dbname,
    tableName = tablename, 
    transformationContext = "datasource0",
    pushDownPredicate="day>=10 and customer_id like '10%'",
    additionalOptions = JsonOptions("""{
        "catalogPartitionPredicate": "year='2021' and month='06'"}""")
    ).getDynamicFrame()
```

**Note**  
The `push_down_predicate` and `catalogPartitionPredicate` options use different syntaxes. The former uses Spark SQL standard syntax and the latter uses the JSQL parser.

## Writing partitions
<a name="aws-glue-programming-etl-partitions-writing"></a>

By default, a DynamicFrame is not partitioned when it is written. All of the output files are written at the top level of the specified output path. Until recently, the only way to write a DynamicFrame into partitions was to convert it to a Spark SQL DataFrame before writing.

However, DynamicFrames now support native partitioning using a sequence of keys, using the `partitionKeys` option when you create a sink. For example, the following Python code writes out a dataset to Amazon S3 in the Parquet format, into directories partitioned by the type field. From there, you can process these partitions using other systems, such as Amazon Athena.

```
glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_type = "s3",    
    connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
    format = "parquet")
```

# Reading input files in larger groups
<a name="grouping-input-files"></a>

You can set properties of your tables to enable an AWS Glue ETL job to group files when they are read from an Amazon S3 data store. These properties enable each ETL task to read a group of input files into a single in-memory partition. This is especially useful when there are a large number of small files in your Amazon S3 data store. When you set certain properties, you instruct AWS Glue to group files within an Amazon S3 data partition and set the size of the groups to be read. You can also set these options when reading from an Amazon S3 data store with the `create_dynamic_frame.from_options` method. 

To enable grouping files for a table, you set key-value pairs in the parameters field of your table structure. Use JSON notation to set a value for the parameter field of your table. For more information about editing the properties of a table, see [Viewing and managing table details](tables-described.md#console-tables-details). 

You can use this method to enable grouping for tables in the Data Catalog with Amazon S3 data stores. 

**groupFiles**  
Set **groupFiles** to `inPartition` to enable the grouping of files within an Amazon S3 data partition. AWS Glue automatically enables grouping if there are more than 50,000 input files, as in the following example.  

```
  'groupFiles': 'inPartition'
```

**groupSize**  
Set **groupSize** to the target size of groups in bytes. The **groupSize** property is optional. If it's not provided, AWS Glue calculates a size to use all the CPU cores in the cluster while still reducing the overall number of ETL tasks and in-memory partitions.   
For example, the following sets the group size to 1 MB.  

```
  'groupSize': '1048576'
```
Note that `groupSize` should be set to the result of a calculation. For example, 1024 \* 1024 = 1048576.
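For instance, computing the value in code avoids hard-coded byte counts (purely illustrative):

```python
# Compute groupSize in bytes rather than hard-coding it.
one_mb = 1024 * 1024
group_size = str(one_mb)  # AWS Glue expects the value as a string
print(group_size)
# → 1048576
```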

**recurse**  
Set **recurse** to `True` to recursively read files in all subdirectories when specifying `paths` as an array of paths. You do not need to set **recurse** if `paths` is an array of object keys in Amazon S3, or if the input format is parquet/orc, as in the following example.  

```
  'recurse':True
```

If you are reading from Amazon S3 directly using the `create_dynamic_frame.from_options` method, add these connection options. For example, the following attempts to group files into 1 MB groups.

```
df = glueContext.create_dynamic_frame.from_options(
    "s3",
    {'paths': ["s3://s3path/"], 'recurse': True,
     'groupFiles': 'inPartition', 'groupSize': '1048576'},
    format="json")
```

**Note**  
`groupFiles` is supported for DynamicFrames created from the following data formats: csv, ion, grokLog, json, and xml. This option is not supported for avro, parquet, and orc.

# Amazon VPC endpoints for Amazon S3
<a name="vpc-endpoints-s3"></a>

For security reasons, many AWS customers run their applications within an Amazon Virtual Private Cloud environment (Amazon VPC). With Amazon VPC, you can launch Amazon EC2 instances into a virtual private cloud, which is logically isolated from other networks—including the public internet. With an Amazon VPC, you have control over its IP address range, subnets, routing tables, network gateways, and security settings.

**Note**  
If you created your AWS account after 2013-12-04, you already have a default VPC in each AWS Region. You can immediately start using your default VPC without any additional configuration.  
For more information, see [Your Default VPC and Subnets](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html) in the Amazon VPC User Guide.

Many customers have legitimate privacy and security concerns about sending and receiving data across the public internet. Customers can address these concerns by using a virtual private network (VPN) to route all Amazon S3 network traffic through their own corporate network infrastructure. However, this approach can introduce bandwidth and availability challenges.

VPC endpoints for Amazon S3 can alleviate these challenges. A VPC endpoint for Amazon S3 enables AWS Glue to use private IP addresses to access Amazon S3 with no exposure to the public internet. AWS Glue does not require public IP addresses, and you don't need an internet gateway, a NAT device, or a virtual private gateway in your VPC. You use endpoint policies to control access to Amazon S3. Traffic between your VPC and the AWS service does not leave the Amazon network.

When you create a VPC endpoint for Amazon S3, any requests to an Amazon S3 endpoint within the Region (for example, *s3.us-west-2.amazonaws.com*) are routed to a private Amazon S3 endpoint within the Amazon network. You don't need to modify your applications running on Amazon EC2 instances in your VPC—the endpoint name remains the same, but the route to Amazon S3 stays entirely within the Amazon network, and does not access the public internet.

For more information about VPC endpoints, see [VPC Endpoints](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints.html) in the Amazon VPC User Guide.

The following diagram shows how AWS Glue can use a VPC endpoint to access Amazon S3.

![\[Network traffic flow showing VPC connection to Amazon S3.\]](http://docs.aws.amazon.com/glue/latest/dg/images/PopulateCatalog-vpc-endpoint.png)


**To set up access for Amazon S3**

1. Sign in to the AWS Management Console and open the Amazon VPC console at [https://console.aws.amazon.com/vpc/](https://console.aws.amazon.com/vpc/).

1. In the left navigation pane, choose **Endpoints**.

1. Choose **Create Endpoint**, and follow the steps to create an Amazon S3 VPC endpoint of type Gateway. 
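The console steps above can also be scripted. The following is a hedged boto3 sketch; the VPC ID, Region, and route table ID are placeholders, and the API call itself is shown commented out:

```python
# Sketch: creating a Gateway VPC endpoint for Amazon S3.
# All identifiers below are placeholders.
endpoint_request = {
    "VpcEndpointType": "Gateway",
    "VpcId": "vpc-0123456789abcdef0",
    "ServiceName": "com.amazonaws.us-west-2.s3",
    "RouteTableIds": ["rtb-0123456789abcdef0"],
}

# With boto3, this might look like:
# import boto3
# ec2 = boto3.client("ec2", region_name="us-west-2")
# response = ec2.create_vpc_endpoint(**endpoint_request)
```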