

# Using the Parquet format in AWS Glue
<a name="aws-glue-programming-etl-format-parquet-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the Parquet data format, this document introduces the available features for using your data in AWS Glue. 

AWS Glue supports using the Parquet format. This format is a performance-oriented, column-based data format. For an introduction to the format by the standard authority, see [Apache Parquet Documentation Overview](https://parquet.apache.org/docs/overview/).

You can use AWS Glue to read Parquet files from Amazon S3 and from streaming sources as well as write Parquet files to Amazon S3. You can read and write `bzip` and `gzip` archives containing Parquet files from S3. You configure compression behavior on the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) instead of in the configuration discussed on this page.

The following table shows which common AWS Glue features support the Parquet format option.


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Supported | Supported | Unsupported | Supported\* | 

\* Supported in AWS Glue version 1.0+

## Example: Read Parquet files or folders from S3
<a name="aws-glue-programming-etl-format-parquet-read"></a>

**Prerequisites:** You will need the S3 paths (`s3path`) to the Parquet files or folders that you want to read. 

**Configuration:** In your function options, specify `format="parquet"`. In your `connection_options`, use the `paths` key to specify your `s3path`. 

You can configure how the reader interacts with S3 in the `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3).

You can configure how the reader interprets Parquet files in your `format_options`. For details, see [Parquet Configuration Reference](#aws-glue-programming-etl-format-parquet-reference).

The following AWS Glue ETL script shows the process of reading Parquet files or folders from S3: 

------
#### [ Python ]

For this example, use the [create\_dynamic\_frame.from\_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.

```
# Example: Read Parquet from S3

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type = "s3", 
    connection_options = {"paths": ["s3://s3path/"]}, 
    format = "parquet"
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame = spark.read.parquet("s3://s3path/")
```

------
#### [ Scala ]

For this example, use the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) method.

```
// Example: Read Parquet from S3

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType="s3",
      format="parquet",
      options=JsonOptions("""{"paths": ["s3://s3path"]}""")
    ).getDynamicFrame()
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
spark.read.parquet("s3://s3path/")
```

------

## Example: Write Parquet files and folders to S3
<a name="aws-glue-programming-etl-format-parquet-write"></a>

**Prerequisites:** You will need an initialized DataFrame (`dataFrame`) or DynamicFrame (`dynamicFrame`). You will also need your expected S3 output path, `s3path`. 

**Configuration:** In your function options, specify `format="parquet"`. In your `connection_options`, use the `paths` key to specify `s3path`. 

You can further alter how the writer interacts with S3 in the `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure how your operation writes the contents of your files in `format_options`. For details, see [Parquet Configuration Reference](#aws-glue-programming-etl-format-parquet-reference).

The following AWS Glue ETL script shows the process of writing Parquet files and folders to S3. 

We provide a custom Parquet writer with performance optimizations for DynamicFrames, through the `useGlueParquetWriter` configuration key. To determine if this writer is right for your workload, see [Glue Parquet Writer](#aws-glue-programming-etl-format-glue-parquet-writer). 

------
#### [ Python ]

For this example, use the [write\_dynamic\_frame.from\_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options) method.

```
# Example: Write Parquet to S3
# Consider whether useGlueParquetWriter is right for your workflow.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    format="parquet",
    connection_options={
        "path": "s3://s3path",
    },
    format_options={
        # "useGlueParquetWriter": True,
    },
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame.write.parquet("s3://s3path/")
```

------
#### [ Scala ]

For this example, use the [getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat) method.

```
// Example: Write Parquet to S3
// Consider whether useGlueParquetWriter is right for your workflow.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    glueContext.getSinkWithFormat(
        connectionType="s3",
        options=JsonOptions("""{"path": "s3://s3path"}"""),
        format="parquet"
    ).writeDynamicFrame(dynamicFrame)
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
dataFrame.write.parquet("s3://s3path/")
```

------

## Parquet configuration reference
<a name="aws-glue-programming-etl-format-parquet-reference"></a>

You can use the following `format_options` wherever AWS Glue libraries specify `format="parquet"`: 
+ `useGlueParquetWriter` – Specifies the use of a custom Parquet writer that has performance optimizations for DynamicFrame workflows. For usage details, see [Glue Parquet Writer](#aws-glue-programming-etl-format-glue-parquet-writer). 
  + **Type:** Boolean, **Default:**`false`
+ `compression` – Specifies the compression codec used. Values are fully compatible with `org.apache.parquet.hadoop.metadata.CompressionCodecName`. 
  + **Type:** Enumerated Text, **Default:** `"snappy"`
  + Values: `"uncompressed"`, `"snappy"`, `"gzip"`, and `"lzo"`
+ `blockSize` – Specifies the size in bytes of a row group being buffered in memory. You use this for tuning performance. The size should be an exact multiple of a megabyte.
  + **Type:** Numerical, **Default:**`134217728`
  + The default value is equal to 128 MB.
+ `pageSize` – Specifies the size in bytes of a page. You use this for tuning performance. A page is the smallest unit that must be read fully to access a single record.
  + **Type:** Numerical, **Default:**`1048576`
  + The default value is equal to 1 MB.
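As a sketch, the reference options above might be combined in a single `format_options` map like the following. The values shown for `compression`, `blockSize`, and `pageSize` are the defaults; tune the sizes for your own workload.

```python
# A sketch combining the format_options documented above.
# blockSize and pageSize are shown at their defaults (128 MB and 1 MB).
format_options = {
    "useGlueParquetWriter": True,    # opt in to the Glue Parquet writer
    "compression": "snappy",         # uncompressed, snappy, gzip, or lzo
    "blockSize": 128 * 1024 * 1024,  # row group size in bytes (134217728)
    "pageSize": 1024 * 1024,         # page size in bytes (1048576)
}
```

You would pass this map as the `format_options` argument alongside `format="parquet"`, as in the write example earlier on this page.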

**Note**  
Additionally, any options that are accepted by the underlying SparkSQL code can be passed to this format by way of the `connection_options` map parameter. For example, you can set a Spark configuration such as [mergeSchema](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging) for the AWS Glue Spark reader to merge the schema for all files.
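For example, forwarding `mergeSchema` might look like the following sketch. Note that `mergeSchema` is a Spark SQL Parquet option, not one of the `format_options` documented above, and the `s3://s3path/` path is a placeholder.

```python
# Sketch: forwarding the Spark SQL mergeSchema option through the
# connection_options map so the AWS Glue Spark reader merges the
# schemas of all Parquet files under the path.
connection_options = {
    "paths": ["s3://s3path/"],
    "mergeSchema": "true",  # Spark SQL Parquet option, passed through
}
```

You would then pass `connection_options` to `create_dynamic_frame.from_options` with `format="parquet"`, as in the read example earlier on this page.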

## Optimize write performance with AWS Glue Parquet writer
<a name="aws-glue-programming-etl-format-glue-parquet-writer"></a>

**Note**  
The AWS Glue Parquet writer was historically accessed through the `glueparquet` format type. This access pattern is no longer recommended. Instead, use the `parquet` type with `useGlueParquetWriter` enabled. 
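As a sketch, migrating a legacy call means keeping the standard `parquet` format type and enabling the writer through `format_options` instead of using `glueparquet`:

```python
# Legacy (no longer recommended): format="glueparquet"
# Current equivalent: the standard parquet format type with the
# Glue Parquet writer enabled through format_options.
format_name = "parquet"
format_options = {"useGlueParquetWriter": True}
```

These values would then be passed as the `format` and `format_options` arguments of your sink call, as in the write example earlier on this page.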

The AWS Glue Parquet writer has performance enhancements that allow faster Parquet file writes. The traditional writer computes a schema before writing. The Parquet format doesn't store the schema in a quickly retrievable fashion, so this might take some time. With the AWS Glue Parquet writer, a pre-computed schema isn't required. The writer computes and modifies the schema dynamically, as data comes in.

Note the following limitations when you specify `useGlueParquetWriter`:
+ The writer supports schema evolution (such as adding or removing columns), but it doesn't support changing column types, as with `ResolveChoice`.
+ The writer doesn't support writing empty DataFrames (for example, to write a schema-only file). When integrating with the AWS Glue Data Catalog by setting `enableUpdateCatalog=True`, attempting to write an empty DataFrame won't update the Data Catalog. This results in a table being created in the Data Catalog without a schema.

If your transform isn't affected by these limitations, turning on the AWS Glue Parquet writer should increase performance.