

# Using the CSV format in AWS Glue
<a name="aws-glue-programming-etl-format-csv-home"></a>

AWS Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the CSV data format, this document introduces the features available for using your data in AWS Glue. 

 AWS Glue supports using the comma-separated value (CSV) format. This format is a minimal, row-based data format. CSVs often don't strictly conform to a standard, but you can refer to [RFC 4180](https://tools.ietf.org/html/rfc4180) and [RFC 7111](https://tools.ietf.org/html/rfc7111) for more information. 

You can use AWS Glue to read CSVs from Amazon S3 and from streaming sources as well as write CSVs to Amazon S3. You can read and write `bzip` and `gzip` archives containing CSV files from S3. You configure compression behavior on the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) instead of in the configuration discussed on this page. 
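For example, reading `gzip`-compressed CSV archives is a matter of adding a compression key to the S3 connection options rather than to `format_options`. This is a sketch; see the S3 connection parameters page for the authoritative description of `compressionType`.

```
# Sketch: S3 connection options for reading gzip-compressed CSV files.
# compressionType is an S3 connection parameter, not a CSV format option.
connection_options = {
    "paths": ["s3://s3path"],
    "compressionType": "gzip",
}
```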

The following table shows which common AWS Glue features support the CSV format option.


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Supported | Supported | Supported | Supported | 

## Example: Read CSV files or folders from S3
<a name="aws-glue-programming-etl-format-csv-read"></a>

 **Prerequisites:** You will need the S3 paths (`s3path`) to the CSV files or folders that you want to read. 

 **Configuration:** In your function options, specify `format="csv"`. In your `connection_options`, use the `paths` key to specify `s3path`. You can configure how the reader interacts with S3 in `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure how the reader interprets CSV files in your `format_options`. For details, see [CSV Configuration Reference](#aws-glue-programming-etl-format-csv-reference). 

The following AWS Glue ETL script shows the process of reading CSV files or folders from S3.

 We provide a custom CSV reader with performance optimizations for common workflows through the `optimizePerformance` configuration key. To determine if this reader is right for your workload, see [Optimize read performance with vectorized SIMD CSV reader](#aws-glue-programming-etl-format-simd-csv-reader). 

------
#### [ Python ]

For this example, use the [create_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.

```
# Example: Read CSV from S3
# In this example, we handle a CSV with a header row. Set the withHeader option.
# Consider whether optimizePerformance is right for your workflow.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="csv",
    format_options={
        "withHeader": True,
        # "optimizePerformance": True,
    },
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame = spark.read\
    .format("csv")\
    .option("header", "true")\
    .load("s3://s3path")
```

------
#### [ Scala ]

For this example, use the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) operation.

```
// Example: Read CSV from S3
// In this example, we handle a CSV with a header row. Set the withHeader option.
// Consider whether optimizePerformance is right for your workflow.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      formatOptions=JsonOptions("""{"withHeader": true}"""),
      connectionType="s3",
      format="csv",
      options=JsonOptions("""{"paths": ["s3://s3path"], "recurse": true}""")
    ).getDynamicFrame()
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
val dataFrame = spark.read
  .option("header","true")
  .format("csv")
  .load("s3://s3path")
```

------

## Example: Write CSV files and folders to S3
<a name="aws-glue-programming-etl-format-csv-write"></a>

 **Prerequisites:** You will need an initialized DataFrame (`dataFrame`) or a DynamicFrame (`dynamicFrame`). You will also need your expected S3 output path, `s3path`. 

 **Configuration:** In your function options, specify `format="csv"`. In your `connection_options`, use the `paths` key to specify `s3path`. You can configure how the writer interacts with S3 in `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure how your operation writes the contents of your files in `format_options`. For details, see [CSV Configuration Reference](#aws-glue-programming-etl-format-csv-reference). 

The following AWS Glue ETL script shows the process of writing CSV files and folders to S3.

------
#### [ Python ]

For this example, use the [write_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options) method.

```
# Example: Write CSV to S3
# In this example, we customize how string values are written. Set quoteChar to -1 so values are not quoted.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    connection_options={"path": "s3://s3path"},
    format="csv",
    format_options={
        "quoteChar": -1,
    },
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame.write\
    .format("csv")\
    .option("quote", None)\
    .mode("append")\
    .save("s3://s3path")
```

------
#### [ Scala ]

For this example, use the [getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat) method.

```
// Example: Write CSV to S3
// In this example, we customize how string values are written. Set quoteChar to -1 so values are not quoted.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    glueContext.getSinkWithFormat(
        connectionType="s3",
        options=JsonOptions("""{"path": "s3://s3path"}"""),
        format="csv"
    ).writeDynamicFrame(dynamicFrame)
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
dataFrame.write
    .format("csv")
    .option("quote", null)
    .mode("append")
    .save("s3://s3path")
```

------

## CSV configuration reference
<a name="aws-glue-programming-etl-format-csv-reference"></a>

You can use the following `format_options` wherever AWS Glue libraries specify `format="csv"`: 
+ `separator` – Specifies the delimiter character. The default is a comma, but any other character can be specified.
  + **Type:** Text, **Default:** `","`
+ `escaper` – Specifies a character to use for escaping. This option is used only when reading CSV files, not writing. If enabled, the character that immediately follows is used as-is, except for a small set of well-known escapes (`\n`, `\r`, `\t`, and `\0`).
  + **Type:** Text, **Default:** none
+ `quoteChar` – Specifies the character to use for quoting. The default is a double quote. Set this to `-1` to turn off quoting entirely.
  + **Type:** Text, **Default:** `'"'`
+ `multiLine` – Specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to `True` if any record spans multiple lines. Enabling `multiLine` might decrease performance because it requires more cautious file-splitting while parsing.
  + **Type:** Boolean, **Default:** `false`
+ `withHeader` – Specifies whether to treat the first line as a header. This option can be used in the `DynamicFrameReader` class.
  + **Type:** Boolean, **Default:** `false`
+ `writeHeader` – Specifies whether to write the header to output. This option can be used in the `DynamicFrameWriter` class.
  + **Type:** Boolean, **Default:** `true`
+ `skipFirst` – Specifies whether to skip the first data line.
  + **Type:** Boolean, **Default:** `false`
+ `optimizePerformance` – Specifies whether to use the advanced SIMD CSV reader along with Apache Arrow–based columnar memory formats. Only available in AWS Glue 3.0+.
  + **Type:** Boolean, **Default:** `false`
+ `strictCheckForQuoting` – When writing CSVs, AWS Glue may add quotes to values it interprets as strings, to prevent ambiguity in the output. To save time when deciding what to write, it may quote in certain situations where quotes are not necessary. Enabling a strict check performs a more intensive computation and quotes only when strictly necessary. Only available in AWS Glue 3.0+.
  + **Type:** Boolean, **Default:** `false`
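To make the quoting behavior concrete, the following plain-Python sketch uses the standard `csv` module (not AWS Glue) to show how a double-quote quote character protects a field containing the separator; AWS Glue's reader applies the same CSV semantics when configured through `format_options`.

```
import csv
import io

# A field containing the separator and embedded quotes, protected by
# default double-quote quoting (equivalent to the default quoteChar '"').
raw = 'id,comment\n1,"said ""hi"", then left"\n'
rows = list(csv.reader(io.StringIO(raw)))
# The embedded comma stays inside the field, and "" unescapes to ".
```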

## Optimize read performance with vectorized SIMD CSV reader
<a name="aws-glue-programming-etl-format-simd-csv-reader"></a>

AWS Glue version 3.0 adds an optimized CSV reader that can significantly speed up overall job performance compared to row-based CSV readers. 

 The optimized reader:
+ Uses CPU SIMD instructions to read from disk
+ Immediately writes records to memory in a columnar format (Apache Arrow) 
+ Divides the records into batches

This saves processing time when records would otherwise be batched or converted to a columnar format later on, for example when changing schemas or retrieving data by column. 

To use the optimized reader, set `"optimizePerformance"` to `true` in the `format_options` or table property.

```
glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="csv",
    format_options={
        "optimizePerformance": True,
        "separator": ","
    },
    transformation_ctx="datasink2")
```

**Limitations for the vectorized CSV reader**  
Note the following limitations of the vectorized CSV reader:
+ It doesn't support the `multiLine` and `escaper` format options. It uses the double quote character `'"'` as the default escaper. When these options are set, AWS Glue automatically falls back to the row-based CSV reader.
+ It doesn't support creating a DynamicFrame with [ChoiceType](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-types.html#aws-glue-api-crawler-pyspark-extensions-types-awsglue-choicetype). 
+ It doesn't support creating a DynamicFrame with [error records](https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-apis-glue-dynamicframe-class.html#glue-etl-scala-apis-glue-dynamicframe-class-defs-errorsAsDynamicFrame).
+ It doesn't support reading CSV files with multibyte characters such as Japanese or Chinese characters.