# Using the CSV format in AWS Glue
<a name="aws-glue-programming-etl-format-csv-home"></a>

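AWS Glue retrieves data from sources and writes data to targets that are stored and transported in various data formats. If your data is stored or transported in the CSV data format, this document introduces the features available for using that data in AWS Glue.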

AWS Glue supports using the comma-separated value (CSV) format. This format is a minimal, row-based data format. CSV files often do not strictly conform to a standard, but you can refer to [RFC 4180](https://tools.ietf.org/html/rfc4180) and [RFC 7111](https://tools.ietf.org/html/rfc7111) for more information.

You can use AWS Glue to read CSV from Amazon S3 and from streaming sources, and to write CSV to Amazon S3. You can read and write `bzip` and `gzip` archives containing CSV files from S3. You configure compression behavior in the [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3) rather than in the configuration discussed on this page.

The following table shows which common AWS Glue features support the CSV format option.


| Read | Write | Streaming read | Group small files | Job bookmarks | 
| --- | --- | --- | --- | --- | 
| Supported | Supported | Supported | Supported | Supported | 

## Example: Read CSV files or folders from S3
<a name="aws-glue-programming-etl-format-csv-read"></a>

 **Prerequisites:** You need the S3 path (`s3path`) to the CSV files or folders that you want to read.

 **Configuration:** In your function options, specify `format="csv"`. In your `connection_options`, use the `paths` key to specify `s3path`. You can configure how the reader interacts with S3 in `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure how the reader interprets CSV files in your `format_options`. For details, see [CSV configuration reference](#aws-glue-programming-etl-format-csv-reference).

The following AWS Glue ETL script shows the process of reading CSV files or folders from S3.

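 The `optimizePerformance` configuration key enables a custom CSV reader with performance optimizations for common workflows. To determine whether this reader is right for your workload, see [Optimize read performance with the vectorized SIMD CSV reader](#aws-glue-programming-etl-format-simd-csv-reader).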

------
#### [ Python ]

This example uses the [create_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options) method.

```
# Example: Read CSV from S3
# For show, we handle a CSV with a header row.  Set the withHeader option.
# Consider whether optimizePerformance is right for your workflow.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="csv",
    format_options={
        "withHeader": True,
        # "optimizePerformance": True,
    },
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame = spark.read\
    .format("csv")\
    .option("header", "true")\
    .load("s3://s3path")
```

------
#### [ Scala ]

This example uses the [getSourceWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSourceWithFormat) operation.

```
// Example: Read CSV from S3
// For show, we handle a CSV with a header row.  Set the withHeader option.
// Consider whether optimizePerformance is right for your workflow.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
    val dynamicFrame = glueContext.getSourceWithFormat(
      formatOptions=JsonOptions("""{"withHeader": true}"""),
      connectionType="s3",
      format="csv",
      options=JsonOptions("""{"paths": ["s3://s3path"], "recurse": true}""")
    ).getDynamicFrame()
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
// Here `spark` must be a SparkSession, not the SparkContext named `spark` in the example above.
val dataFrame = spark.read
  .option("header","true")
  .format("csv")
  .load("s3://s3path")
```

------

## Example: Write CSV files and folders to S3
<a name="aws-glue-programming-etl-format-csv-write"></a>

 **Prerequisites:** You need an initialized DataFrame (`dataFrame`) or DynamicFrame (`dynamicFrame`). You also need your expected S3 output path (`s3path`).

 **Configuration:** In your function options, specify `format="csv"`. In your `connection_options`, use the `path` key to specify `s3path`. You can configure how the writer interacts with S3 in `connection_options`. For details, see Connection types and options for ETL in AWS Glue: [S3 connection parameters](aws-glue-programming-etl-connect-s3-home.md#aws-glue-programming-etl-connect-s3). You can configure how the operation writes the contents of your files in `format_options`. For details, see [CSV configuration reference](#aws-glue-programming-etl-format-csv-reference).

The following AWS Glue ETL script shows the process of writing CSV files and folders to S3.

------
#### [ Python ]

This example uses the [write_dynamic_frame.from_options](aws-glue-api-crawler-pyspark-extensions-glue-context.md#aws-glue-api-crawler-pyspark-extensions-glue-context-write_dynamic_frame_from_options) method.

```
# Example: Write CSV to S3
# For show, customize how we write string type values.  Set quoteChar to -1 so our values are not quoted.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    connection_options={"path": "s3://s3path"},
    format="csv",
    format_options={
        "quoteChar": -1,
    },
)
```

You can also use DataFrames in a script (`pyspark.sql.DataFrame`).

```
dataFrame.write\
    .format("csv")\
    .option("quote", None)\
    .mode("append")\
    .save("s3://s3path")
```

------
#### [ Scala ]

This example uses the [getSinkWithFormat](glue-etl-scala-apis-glue-gluecontext.md#glue-etl-scala-apis-glue-gluecontext-defs-getSinkWithFormat) method.

```
// Example: Write CSV to S3
// For show, customize how we write string type values. Set quoteChar to -1 so our values are not quoted.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    
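    glueContext.getSinkWithFormat(
        connectionType="s3",
        options=JsonOptions("""{"path": "s3://s3path"}"""),
        format="csv",
        // Pass quoteChar = -1 so that string values are written unquoted.
        formatOptions=JsonOptions("""{"quoteChar": -1}""")
    ).writeDynamicFrame(dynamicFrame)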
  }
}
```

You can also use DataFrames in a script (`org.apache.spark.sql.DataFrame`).

```
dataFrame.write
    .format("csv")
    .option("quote", null)
    .mode("Append")
    .save("s3://s3path")
```

------

## CSV configuration reference
<a name="aws-glue-programming-etl-format-csv-reference"></a>

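You can use the following `format_options` wherever AWS Glue libraries specify `format="csv"`:
+ `separator` - Specifies the delimiter character. The default is a comma (,), but you can specify any other character.
  + **Type:** Text, **Default:** `","`
+ `escaper` - Specifies the character to use for escaping. This option is used only when reading CSV files, not when writing them. If enabled, the character that immediately follows is used as-is, except for a small set of well-known escapes (`\n`, `\r`, `\t`, and `\0`).
  + **Type:** Text, **Default:** none
+ `quoteChar` - Specifies the character to use for quoting. The default is a double quote ("). Set this to `-1` to turn off quoting entirely.
  + **Type:** Text, **Default:** `'"'`
+ `multiLine` - Specifies whether a single record can span multiple lines. This happens when a field contains a quoted newline character. You must set this option to `True` if any record spans multiple lines. Enabling `multiLine` might decrease performance because it requires more cautious file splitting while parsing.
  + **Type:** Boolean, **Default:** `false`
+ `withHeader` - Specifies whether to treat the first line as a header. This option can be used in the `DynamicFrameReader` class.
  + **Type:** Boolean, **Default:** `false`
+ `writeHeader` - Specifies whether to write the header to the output. This option can be used in the `DynamicFrameWriter` class.
  + **Type:** Boolean, **Default:** `true`
+ `skipFirst` - Specifies whether to skip the first data line.
  + **Type:** Boolean, **Default:** `false`
+ `optimizePerformance` - Specifies whether to use the advanced SIMD CSV reader together with Apache Arrow-based columnar memory formats. Available only in AWS Glue 3.0 and later.
  + **Type:** Boolean, **Default:** `false`
+ `strictCheckForQuoting` - When writing CSV, Glue may add quotes to values that it interprets as strings, so that the written output is unambiguous. To save time when deciding what to write, Glue may quote in certain situations where quotes are not necessary. Enabling the strict check performs a more compute-intensive check and quotes only when strictly necessary. Available only in AWS Glue 3.0 and later.
  + **Type:** Boolean, **Default:** `false`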
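
Because `format_options` is a plain dictionary, a misspelled key is easy to miss. The following is a minimal sketch of one way to guard against that in a script. The helper `build_csv_format_options` and the `CSV_FORMAT_DEFAULTS` table are hypothetical and are not part of the AWS Glue libraries; the option names and defaults are copied from the list above.

```python
# Hypothetical helper, not part of the awsglue library.
# Option names and defaults are taken from the reference list above.
CSV_FORMAT_DEFAULTS = {
    "separator": ",",
    "escaper": None,  # documented default: none
    "quoteChar": '"',
    "multiLine": False,
    "withHeader": False,
    "writeHeader": True,
    "skipFirst": False,
    "optimizePerformance": False,
    "strictCheckForQuoting": False,
}


def build_csv_format_options(**overrides):
    """Return a format_options dict holding only non-default overrides.

    Raises ValueError for a key that is not in the documented list, which
    catches typos such as "withheader" before the job runs.
    """
    unknown = sorted(set(overrides) - set(CSV_FORMAT_DEFAULTS))
    if unknown:
        raise ValueError(f"Unknown CSV format option(s): {', '.join(unknown)}")
    # Drop values equal to the documented default to keep the dict minimal.
    return {
        key: value
        for key, value in overrides.items()
        if value != CSV_FORMAT_DEFAULTS[key]
    }


read_options = build_csv_format_options(withHeader=True, separator=",")
print(read_options)  # {'withHeader': True}
```

The resulting dictionary could then be passed as the `format_options` argument in the read and write examples on this page.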

## Optimize read performance with the vectorized SIMD CSV reader
<a name="aws-glue-programming-etl-format-simd-csv-reader"></a>

AWS Glue version 3.0 adds an optimized CSV reader that can significantly speed up overall job performance compared to row-based CSV readers.

 The optimized reader:
+ Uses CPU SIMD instructions to read from disk
+ Immediately writes records to memory in a columnar format (Apache Arrow)
+ Divides the records into batches

This saves processing time when records would otherwise be batched or converted to a columnar format later on, for example when you change schemas or retrieve data by column.
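
To make the difference between a row-based and a columnar layout concrete, the following is a purely conceptual sketch in standard-library Python. It is not how AWS Glue or Apache Arrow implement the reader and it uses no SIMD instructions; the function `rows_to_column_batches` and the sample data are made up for illustration.

```python
import csv
import io


def rows_to_column_batches(csv_text, batch_size):
    """Yield one dict per batch, mapping column name -> list of values."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    batch = {name: [] for name in header}
    count = 0
    for row in reader:
        # Append each field to its column instead of keeping the row together.
        for name, value in zip(header, row):
            batch[name].append(value)
        count += 1
        if count == batch_size:
            yield batch
            batch = {name: [] for name in header}
            count = 0
    if count:
        yield batch  # final, partially filled batch


sample = "id,city\n1,Seoul\n2,Busan\n3,Incheon\n"
batches = list(rows_to_column_batches(sample, batch_size=2))
print(batches[0]["city"])  # ['Seoul', 'Busan']
```

Once the data is laid out this way, selecting a single column is a dictionary lookup rather than a pass over every row, which is the kind of later work that a columnar in-memory format avoids.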

To use the optimized reader, set `"optimizePerformance"` to `true` in the `format_options` or in the table properties.

```
# Read CSV from S3 with the vectorized reader enabled.
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="csv",
    format_options={
        "optimizePerformance": True,
        "separator": ",",
    },
    transformation_ctx="datasource1",
)
```

**Limitations for the vectorized CSV reader**  
Note the following limitations of the vectorized CSV reader:
+ It does not support the `multiLine` and `escaper` format options. It uses the default `escaper` of the double-quote character (`'"'`). When these options are set, AWS Glue automatically falls back to the row-based CSV reader.
+ It does not support creating a DynamicFrame with [ChoiceType](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-types.html#aws-glue-api-crawler-pyspark-extensions-types-awsglue-choicetype).
+ It does not support creating a DynamicFrame with [error records](https://docs.aws.amazon.com/glue/latest/dg/glue-etl-scala-apis-glue-dynamicframe-class.html#glue-etl-scala-apis-glue-dynamicframe-class-defs-errorsAsDynamicFrame).
+ It does not support reading CSV files that contain multibyte characters, such as Japanese or Chinese characters.
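
Because the fallback to the row-based reader is silent, it can be useful to check a `format_options` dictionary before submitting a job. The following is a minimal sketch covering only the first limitation above. The function `falls_back_to_row_reader` is hypothetical and not part of the AWS Glue libraries, and it assumes that an option counts as "set" when its key is present with a value other than `None` or `False`; the other limitations depend on the data and are not checked.

```python
def falls_back_to_row_reader(format_options):
    """Return True when these options would lead to the row-based reader.

    Assumption: an option is "set" when the key is present with a value
    other than None or False.
    """
    if not format_options.get("optimizePerformance", False):
        return True  # the vectorized reader was never requested
    blocking_options = ("multiLine", "escaper")
    return any(
        format_options.get(key) not in (None, False)
        for key in blocking_options
    )


print(falls_back_to_row_reader({"optimizePerformance": True}))  # False
print(falls_back_to_row_reader({"optimizePerformance": True, "multiLine": True}))  # True
```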