# Using the Amazon Redshift integration for Apache Spark with Amazon EMR
<a name="emr-spark-redshift"></a>

With Amazon EMR release 6.4.0 and later, every release image includes a connector between [Apache Spark](https://aws.amazon.com/emr/features/spark/) and Amazon Redshift. With this connector, you can use Spark on Amazon EMR to process data stored in Amazon Redshift. For Amazon EMR releases 6.4.0 through 6.8.0, the integration is based on the [`spark-redshift` open-source connector](https://github.com/spark-redshift-community/spark-redshift#readme). For Amazon EMR releases 6.9.0 and later, the [Amazon Redshift integration for Apache Spark](https://docs.aws.amazon.com/redshift/latest/mgmt/spark-redshift-connector.html) has been migrated from the community version to a native integration.

**Topics**
+ [Launching a Spark application using the Amazon Redshift integration for Apache Spark](emr-spark-redshift-launch.md)
+ [Authenticating with the Amazon Redshift integration for Apache Spark](emr-spark-redshift-auth.md)
+ [Reading from and writing to Amazon Redshift](emr-spark-redshift-readwrite.md)
+ [Considerations and limitations when using the Spark connector](emr-spark-redshift-considerations.md)

# Launching a Spark application using the Amazon Redshift integration for Apache Spark
<a name="emr-spark-redshift-launch"></a>

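For Amazon EMR releases 6.4 through 6.9, you must use the `--jars` or `--packages` option to specify which of the following JAR files you want to use. The `--jars` option specifies dependencies that are stored locally, in HDFS, or on a server reachable over HTTP or HTTPS. To see other file locations that the `--jars` option supports, see [Advanced Dependency Management](https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management) in the Spark documentation. The `--packages` option specifies dependencies that are stored in the public Maven repository.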
+ `spark-redshift.jar`
+ `spark-avro.jar`
+ `RedshiftJDBC.jar`
+ `minimal-json.jar`

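Amazon EMR releases 6.10.0 and later don't require the `minimal-json.jar` dependency, and they automatically install the other dependencies on each cluster by default. The following examples show how to launch a Spark application with the Amazon Redshift integration for Apache Spark.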

------
#### [ Amazon EMR 6.10.0 + ]

The following example shows how to launch a Spark application with the `spark-redshift` connector on Amazon EMR releases 6.10 and later.

```
spark-submit my_script.py
```

------
#### [ Amazon EMR 6.4.0 - 6.9.x ]

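To launch a Spark application with the `spark-redshift` connector on Amazon EMR releases 6.4 through 6.9, you must use the `--jars` or `--packages` option, as the following example shows. The paths listed with the `--jars` option are the default paths for the JAR files.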

```
spark-submit \
  --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC.jar,/usr/share/aws/redshift/spark-redshift/lib/spark-redshift.jar,/usr/share/aws/redshift/spark-redshift/lib/spark-avro.jar,/usr/share/aws/redshift/spark-redshift/lib/minimal-json.jar \
  my_script.py
```

------
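
The `--jars` value in the 6.4 through 6.9 example is a single comma-separated list with no spaces, which is easy to mistype by hand. As a minimal sketch, the following Python helper assembles the same command from the default paths shown above. The helper name `build_spark_submit` is hypothetical and isn't part of Amazon EMR or Spark.

```python
import shlex

# Default JAR locations, copied from the spark-submit example above
DEFAULT_JARS = [
    "/usr/share/aws/redshift/jdbc/RedshiftJDBC.jar",
    "/usr/share/aws/redshift/spark-redshift/lib/spark-redshift.jar",
    "/usr/share/aws/redshift/spark-redshift/lib/spark-avro.jar",
    "/usr/share/aws/redshift/spark-redshift/lib/minimal-json.jar",
]


def build_spark_submit(script, jars=DEFAULT_JARS):
    """Return the spark-submit argument list (hypothetical helper)."""
    # --jars expects one comma-separated value without spaces
    return ["spark-submit", "--jars", ",".join(jars), script]


cmd = build_spark_submit("my_script.py")
# Print the command in a form that can be pasted into a shell
print(shlex.join(cmd))
```

The argument list can also be passed to `subprocess.run` as is. Adjust the paths if the JAR files on your cluster are stored elsewhere.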

# Authenticating with the Amazon Redshift integration for Apache Spark
<a name="emr-spark-redshift-auth"></a>

## Using AWS Secrets Manager to retrieve credentials and connect to Amazon Redshift
<a name="emr-spark-redshift-secrets"></a>

The following code sample shows how to use AWS Secrets Manager to retrieve credentials and connect to an Amazon Redshift cluster with PySpark, the Python interface for Apache Spark.

```
import json

import boto3
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Reuse the existing SparkContext, or create one if none is running
sc = SparkContext.getOrCreate()
sql_context = SQLContext(sc)

secretsmanager_client = boto3.client('secretsmanager')
secret_manager_response = secretsmanager_client.get_secret_value(
    SecretId='string',
    VersionId='string',
    VersionStage='string'
)
# Assumption: the secret stores a JSON document with "username" and "password" keys
secret = json.loads(secret_manager_response['SecretString'])
username = secret['username']
password = secret['password']
url = "jdbc:redshift://redshifthost:5439/database?user=" + username + "&password=" + password

# Read data from a table
df = sql_context.read \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", url) \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3://path/for/temp/data") \
    .load()
```
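
A password that contains characters such as `&` or `?` can break a JDBC URL that is built by string concatenation. The sketch below shows one possible alternative: keep the credentials out of the URL and pass them as separate reader options. It assumes that the secret's `SecretString` is a JSON document with `username` and `password` keys, and that your connector version accepts the `user` and `password` options described in the `spark-redshift` community README. Verify both for your setup. The function name and the sample secret are hypothetical.

```python
import json


def redshift_options_from_secret(secret_string, host, database, table,
                                 tempdir, port=5439):
    """Build connector options from a Secrets Manager SecretString.

    Assumes the secret is JSON with "username" and "password" keys.
    """
    secret = json.loads(secret_string)
    return {
        # The URL carries no credentials, so special characters are not an issue
        "url": f"jdbc:redshift://{host}:{port}/{database}",
        "user": secret["username"],
        "password": secret["password"],
        "dbtable": table,
        "tempdir": tempdir,
    }


# Stand-in for secret_manager_response["SecretString"]
sample_secret = json.dumps({"username": "example_user", "password": "p&ss?word"})

options = redshift_options_from_secret(
    sample_secret, "redshifthost", "database", "my_table",
    "s3://path/for/temp/data",
)
print(options["url"])
```

With a running Spark session, the dictionary can be handed to the reader in one call, for example `sql_context.read.format("io.github.spark_redshift_community.spark.redshift").options(**options).load()`.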

## Using IAM to retrieve credentials and connect to Amazon Redshift
<a name="emr-spark-redshift-iam"></a>

You can use the JDBC version 2 driver provided by Amazon Redshift to connect to Amazon Redshift with the Spark connector. To use AWS Identity and Access Management (IAM), [configure your JDBC URL to use IAM authentication](https://docs.aws.amazon.com/redshift/latest/mgmt/generating-iam-credentials-configure-jdbc-odbc.html). To connect to a Redshift cluster from Amazon EMR, you must give your IAM role permission to retrieve temporary IAM credentials. Assign the following permissions to your IAM role so that it can retrieve credentials and run Amazon S3 operations.
+  [Redshift:GetClusterCredentials](https://docs.aws.amazon.com/redshift/latest/APIReference/API_GetClusterCredentials.html) (for provisioned Amazon Redshift clusters) 
+  [Redshift:DescribeClusters](https://docs.aws.amazon.com/redshift/latest/APIReference/API_DescribeClusters.html) (for provisioned Amazon Redshift clusters) 
+ [Redshift:GetWorkgroup](https://docs.aws.amazon.com/redshift-serverless/latest/APIReference/API_GetWorkgroup.html) (for Amazon Redshift Serverless workgroups)
+  [Redshift:GetCredentials](https://docs.aws.amazon.com/redshift-serverless/latest/APIReference/API_GetCredentials.html) (for Amazon Redshift Serverless workgroups) 
+  [s3:GetBucket](https://docs.aws.amazon.com/AmazonS3/latest/API/API_control_GetBucket.html) 
+  [s3:GetBucketLocation](https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetBucketLocation.html) 
+  [s3:GetObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html) 
+  [s3:PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) 
+  [s3:GetBucketLifecycleConfiguration](https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetBucketLifecycleConfiguration.html) 

For more information about `GetClusterCredentials`, see [Resource policies for `GetClusterCredentials`](https://docs.aws.amazon.com/redshift/latest/mgmt/redshift-iam-access-control-identity-based.html#redshift-policy-resources.getclustercredentials-resources).

You must also make sure that Amazon Redshift can assume the IAM role during `COPY` and `UNLOAD` operations.

------
#### [ JSON ]

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sts:AssumeRole"
      ],
      "Resource": "arn:aws:iam::123456789012:role/RedshiftServiceRole",
      "Sid": "AllowSTSAssumerole"
    }
  ]
}
```

------

The following example uses IAM authentication between Spark and Amazon Redshift.

```
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Reuse the existing SparkContext, or create one if none is running
sc = SparkContext.getOrCreate()
sql_context = SQLContext(sc)

url = "jdbc:redshift:iam://redshift-host:redshift-port/db-name"
iam_role_arn = "arn:aws:iam::account-id:role/role-name"

# Read data from a table
df = sql_context.read \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", url) \
    .option("aws_iam_role", iam_role_arn) \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3a://path/for/temp/data") \
    .load()
```
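
A malformed URL or role ARN in this configuration typically surfaces only when the Spark job runs. As a hedged sketch, the following helper checks both values before they reach the connector and returns the reader options used in the example above. The helper name and the ARN pattern are assumptions made for illustration; the pattern is deliberately simplified and doesn't cover every valid ARN form.

```python
import re

IAM_URL_PREFIX = "jdbc:redshift:iam://"
# Simplified pattern (an assumption): partition, 12-digit account ID, role name
ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/.+$")


def iam_read_options(url, iam_role_arn, table, tempdir):
    """Validate the inputs and return reader options for IAM authentication."""
    if not url.startswith(IAM_URL_PREFIX):
        raise ValueError(f"URL must start with {IAM_URL_PREFIX!r}: {url!r}")
    if not ROLE_ARN_PATTERN.match(iam_role_arn):
        raise ValueError(f"Not an IAM role ARN: {iam_role_arn!r}")
    return {
        "url": url,
        "aws_iam_role": iam_role_arn,
        "dbtable": table,
        "tempdir": tempdir,
    }


iam_options = iam_read_options(
    "jdbc:redshift:iam://redshift-host:5439/db-name",
    "arn:aws:iam::123456789012:role/role-name",  # placeholder account ID
    "my_table",
    "s3a://path/for/temp/data",
)
print(sorted(iam_options))
```

A check like this also catches the common mistake of passing a variable name as a quoted string, such as `"iam_role_arn"`, instead of the ARN it holds.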

# Reading from and writing to Amazon Redshift
<a name="emr-spark-redshift-readwrite"></a>

The following code examples use PySpark to read sample data from and write sample data to an Amazon Redshift database, first with the data source API and then with SparkSQL.

------
#### [ Data source API ]

Use PySpark to read sample data from and write sample data to an Amazon Redshift database through the data source API.

```
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Reuse the existing SparkContext, or create one if none is running
sc = SparkContext.getOrCreate()
sql_context = SQLContext(sc)

url = "jdbc:redshift:iam://redshifthost:5439/database"
aws_iam_role_arn = "arn:aws:iam::accountID:role/roleName"

df = sql_context.read \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", url) \
    .option("dbtable", "tableName") \
    .option("tempdir", "s3://path/for/temp/data") \
    .option("aws_iam_role", aws_iam_role_arn) \
    .load()

df.write \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", url) \
    .option("dbtable", "tableName_copy") \
    .option("tempdir", "s3://path/for/temp/data") \
    .option("aws_iam_role", aws_iam_role_arn) \
    .mode("error") \
    .save()
```

------
#### [ SparkSQL ]

Use PySpark to read sample data from and write sample data to an Amazon Redshift database through SparkSQL.

```
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .getOrCreate()
    
url = "jdbc:redshift:iam://redshifthost:5439/database"
aws_iam_role_arn = "arn:aws:iam::accountID:role/roleName"
    
bucket = "s3://path/for/temp/data"
tableName = "tableName" # Redshift table name

s = f"""CREATE TABLE IF NOT EXISTS {tableName} (country string, data string) 
    USING io.github.spark_redshift_community.spark.redshift 
    OPTIONS (dbtable '{tableName}', tempdir '{bucket}', url '{url}', aws_iam_role '{aws_iam_role_arn}' ); """

spark.sql(s)
         
columns = ["country" ,"data"]
data = [("test-country","test-data")]
df = spark.sparkContext.parallelize(data).toDF(columns)

# Insert data into table
df.write.insertInto(tableName, overwrite=False)
df = spark.sql(f"SELECT * FROM {tableName}")
df.show()
```

------

# Considerations and limitations when using the Spark connector
<a name="emr-spark-redshift-considerations"></a>
+ We recommend that you turn on SSL for the JDBC connection from Spark on Amazon EMR to Amazon Redshift.
+ As a best practice, we recommend that you manage the credentials for the Amazon Redshift cluster in AWS Secrets Manager. For an example, see [Using AWS Secrets Manager to retrieve credentials for connecting to Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/mgmt/redshift-secrets-manager-integration.html).
+ We recommend that you pass an IAM role with the `aws_iam_role` parameter as the Amazon Redshift authentication parameter.
+ The `tempdir` URI points to an Amazon S3 location. This temporary directory isn't cleaned up automatically, so it can incur additional cost.
+ Consider the following recommendations for Amazon Redshift:
  + We recommend that you block public access to the Amazon Redshift cluster.
  + We recommend that you turn on [Amazon Redshift audit logging](https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html).
  + We recommend that you turn on [Amazon Redshift at-rest encryption](https://docs.aws.amazon.com/redshift/latest/mgmt/security-server-side-encryption.html).
+ Consider the following recommendations for Amazon S3:
  + We recommend that you [block public access to Amazon S3 buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-control-block-public-access.html).
  + We recommend that you use [Amazon S3 server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html) to encrypt the Amazon S3 buckets that you use.
  + We recommend that you use [Amazon S3 lifecycle policies](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) to define retention rules for the Amazon S3 bucket.
  + Amazon EMR always verifies code that is imported from open source into the image. For security, the following authentication methods from Spark to Amazon S3 aren't supported:
    + Setting AWS access keys in the `hadoop-env` configuration classification
    + Encoding AWS access keys in the `tempdir` URI
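
Because the `tempdir` location isn't cleaned up automatically, an Amazon S3 lifecycle rule is one way to expire the temporary files. The sketch below derives the bucket and prefix from a `tempdir` URI and builds a lifecycle configuration in the dictionary shape that the boto3 `put_bucket_lifecycle_configuration` call accepts. The function name, the rule ID, and the seven-day retention period are assumptions chosen for illustration.

```python
from urllib.parse import urlparse


def tempdir_lifecycle_rule(tempdir_uri, days=7):
    """Return (bucket, lifecycle configuration) for a tempdir URI."""
    parsed = urlparse(tempdir_uri)
    if parsed.scheme not in ("s3", "s3a", "s3n"):
        raise ValueError(f"Not an Amazon S3 URI: {tempdir_uri!r}")
    bucket = parsed.netloc
    prefix = parsed.path.lstrip("/")
    # End the prefix with "/" so the rule doesn't match sibling prefixes
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    config = {
        "Rules": [
            {
                "ID": "expire-redshift-tempdir",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }
    return bucket, config


bucket, config = tempdir_lifecycle_rule("s3://path/for/temp/data")
print(bucket, config["Rules"][0]["Filter"]["Prefix"])
```

To apply the rule, pass the two values to `boto3.client("s3").put_bucket_lifecycle_configuration(Bucket=bucket, LifecycleConfiguration=config)`. That call replaces the bucket's existing lifecycle configuration, so merge in any rules you already have before you apply it.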

For more information about using the connector and its supported parameters, see the following resources:
+ [Amazon Redshift integration for Apache Spark](https://docs.aws.amazon.com/redshift/latest/mgmt/spark-redshift-connector.html) in the *Amazon Redshift Management Guide*
+ The [`spark-redshift` community repository](https://github.com/spark-redshift-community/spark-redshift#readme) on GitHub