

本文属于机器翻译版本。若本译文内容与英语原文存在差异，则一律以英文原文为准。

# 使用 Nvidia RAPIDS Accelerator for Apache Spark
<a name="emr-spark-rapids"></a>

对于 Amazon EMR 发行版 6.2.0 及更高版本，您可以使用 Nvidia 的 [RAPIDS Accelerator for Apache Spark](https://docs.nvidia.com/spark-rapids/user-guide/latest/overview.html) 插件来通过 EC2 图形处理器（GPU）实例类型加速 Spark。RAPIDS Accelerator 将通过 GPU 加速您的 Apache Spark 3.0 数据科学管道，无需更改代码，并将加快数据处理和模型训练，同时大幅降低基础设施成本。

以下部分将引导您完成配置 EMR 集群来使用 Spark-RAPIDS Plugin for Spark。

## 选择实例类型
<a name="emr-spark-rapids-instancetypes"></a>

要将 Nvidia Spark-RAPIDS 插件用于 Spark，核心实例组和任务实例组必须使用符合 Spark-RAPIDS 的[硬件要求](https://nvidia.github.io/spark-rapids/)的 EC2 GPU 实例类型。要查看 Amazon EMR 支持的 GPU 实例类型的完整列表，请参阅《Amazon EMR 管理指南》**中的[支持的实例类型](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-supported-instance-types.html)。主实例组的实例类型可以是 GPU 或非 GPU 类型，但不支持 ARM 实例类型。

## 为集群设置应用程序配置
<a name="emr-spark-rapids-appconfig"></a>

**1。使 Amazon EMR 能在您的新集群上安装插件**

要安装插件，请在创建集群时提供以下配置：

```
{
	"Classification":"spark",
	"Properties":{
		"enableSparkRapids":"true"
	}
}
```

**2。将 YARN 配置为使用 GPU**

有关如何在 YARN 上使用 GPU 的详细信息，请参阅 Apache Hadoop 文档中的[在 YARN 上使用 GPU](https://hadoop.apache.org/docs/r3.2.1/hadoop-yarn/hadoop-yarn-site/UsingGpus.html)。以下示例显示了 Amazon EMR 6.x 和 7.x 发行版的 YARN 配置示例：

------
#### [ Amazon EMR 7.x ]

**Amazon EMR 7.x 的 YARN 配置示例**

```
{
    "Classification":"yarn-site",
    "Properties":{
        "yarn.nodemanager.resource-plugins":"yarn.io/gpu",
        "yarn.resource-types":"yarn.io/gpu",
        "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto",
        "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin",
        "yarn.nodemanager.linux-container-executor.cgroups.mount":"true",
        "yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/spark-rapids-cgroup",
        "yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn",
        "yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
    }
},{
    "Classification":"container-executor",
    "Properties":{
        
    },
    "Configurations":[
        {
            "Classification":"gpu",
            "Properties":{
                "module.enabled":"true"
            }
        },
        {
            "Classification":"cgroups",
            "Properties":{
                "root":"/spark-rapids-cgroup",
                "yarn-hierarchy":"yarn"
            }
        }
    ]
}
```

------
#### [ Amazon EMR 6.x ]

**Amazon EMR 6.x 的 YARN 配置示例**

```
{
    "Classification":"yarn-site",
    "Properties":{
        "yarn.nodemanager.resource-plugins":"yarn.io/gpu",
        "yarn.resource-types":"yarn.io/gpu",
        "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto",
        "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin",
        "yarn.nodemanager.linux-container-executor.cgroups.mount":"true",
        "yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/sys/fs/cgroup",
        "yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn",
        "yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
    }
},{
    "Classification":"container-executor",
    "Properties":{
        
    },
    "Configurations":[
        {
            "Classification":"gpu",
            "Properties":{
                "module.enabled":"true"
            }
        },
        {
            "Classification":"cgroups",
            "Properties":{
                "root":"/sys/fs/cgroup",
                "yarn-hierarchy":"yarn"
            }
        }
    ]
}
```

------

**3。将 Spark 配置为使用 RAPIDS**

以下是使 Spark 能够使用 RAPIDS 插件所需的配置：

```
{
	"Classification":"spark-defaults",
	"Properties":{
		"spark.plugins":"com.nvidia.spark.SQLPlugin",
		"spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh",
		"spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native"
	}
}
```

XGBoost4在集群上启用 [Spark RAPIDS 插件后， XGBoost 文档中的 J-Spark 库](https://xgboost.readthedocs.io/en/latest/jvm/xgboost4j_spark_tutorial.html)也可用。您可以使用以下配置 XGBoost 与您的 Spark 作业集成：

```
{
	"Classification":"spark-defaults",
	"Properties":{
		"spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar"
	}
}
```

有关可用于优化 GPU 加速的 EMR 集群的其他 Spark 配置，请参阅 Nvidia.github.io 文档中的 [RAPIDS Accelerator for Apache Spark Tuning Guide](https://docs.nvidia.com/spark-rapids/user-guide/latest/tuning-guide.html)。

**4。配置 YARN 容量调度器**

必须配置 `DominantResourceCalculator` 来启用 GPU 调度和隔离。有关详细信息，请参阅 Apache Hadoop 文档中的 [Using GPU On YARN](https://hadoop.apache.org/docs/r3.2.1/hadoop-yarn/hadoop-yarn-site/UsingGpus.html)。

```
{
	"Classification":"capacity-scheduler",
	"Properties":{
		"yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
	}
}
```

**5。创建一个 JSON 文件以包含您的配置**

您可以创建一个 JSON 文件，在其中包含您的配置，以便为 Spark 集群使用 RAPIDS 插件。您稍后在启动集群时需提供该文件。

您可以将文件存储在本地或 S3 上。有关如何为集群提供应用程序配置的详细信息，请参阅[配置应用程序](emr-configure-apps.md)。

使用以下示例文件作为模板来构建自己的配置。

------
#### [ Amazon EMR 7.x ]

**Amazon EMR 7.x 的示例 `my-configurations.json` 文件**

```
[
    {
        "Classification":"spark",
        "Properties":{
            "enableSparkRapids":"true"
        }
    },
    {
        "Classification":"yarn-site",
        "Properties":{
            "yarn.nodemanager.resource-plugins":"yarn.io/gpu",
            "yarn.resource-types":"yarn.io/gpu",
            "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto",
            "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin",
            "yarn.nodemanager.linux-container-executor.cgroups.mount":"true",
            "yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/spark-rapids-cgroup",
            "yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn",
            "yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
        }
    },
    {
        "Classification":"container-executor",
        "Properties":{
            
        },
        "Configurations":[
            {
                "Classification":"gpu",
                "Properties":{
                    "module.enabled":"true"
                }
            },
            {
                "Classification":"cgroups",
                "Properties":{
                    "root":"/spark-rapids-cgroup",
                    "yarn-hierarchy":"yarn"
                }
            }
        ]
    },
    {
        "Classification":"spark-defaults",
        "Properties":{
            "spark.plugins":"com.nvidia.spark.SQLPlugin",
            "spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh",
            "spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native",
            "spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar",
            "spark.rapids.sql.concurrentGpuTasks":"1",
            "spark.executor.resource.gpu.amount":"1",
            "spark.executor.cores":"2",
            "spark.task.cpus":"1",
            "spark.task.resource.gpu.amount":"0.5",
            "spark.rapids.memory.pinnedPool.size":"0",
            "spark.executor.memoryOverhead":"2G",
            "spark.locality.wait":"0s",
            "spark.sql.shuffle.partitions":"200",
            "spark.sql.files.maxPartitionBytes":"512m"
        }
    },
    {
        "Classification":"capacity-scheduler",
        "Properties":{
            "yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
        }
    }
]
```

------
#### [ Amazon EMR 6.x ]

**Amazon EMR 6.x 的示例 `my-configurations.json` 文件**

```
[
    {
        "Classification":"spark",
        "Properties":{
            "enableSparkRapids":"true"
        }
    },
    {
        "Classification":"yarn-site",
        "Properties":{
            "yarn.nodemanager.resource-plugins":"yarn.io/gpu",
            "yarn.resource-types":"yarn.io/gpu",
            "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto",
            "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin",
            "yarn.nodemanager.linux-container-executor.cgroups.mount":"true",
            "yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/sys/fs/cgroup",
            "yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn",
            "yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
        }
    },
    {
        "Classification":"container-executor",
        "Properties":{
            
        },
        "Configurations":[
            {
                "Classification":"gpu",
                "Properties":{
                    "module.enabled":"true"
                }
            },
            {
                "Classification":"cgroups",
                "Properties":{
                    "root":"/sys/fs/cgroup",
                    "yarn-hierarchy":"yarn"
                }
            }
        ]
    },
    {
        "Classification":"spark-defaults",
        "Properties":{
            "spark.plugins":"com.nvidia.spark.SQLPlugin",
            "spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh",
            "spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native",
            "spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.4.2-0.3.0.jar",
            "spark.rapids.sql.concurrentGpuTasks":"1",
            "spark.executor.resource.gpu.amount":"1",
            "spark.executor.cores":"2",
            "spark.task.cpus":"1",
            "spark.task.resource.gpu.amount":"0.5",
            "spark.rapids.memory.pinnedPool.size":"0",
            "spark.executor.memoryOverhead":"2G",
            "spark.locality.wait":"0s",
            "spark.sql.shuffle.partitions":"200",
            "spark.sql.files.maxPartitionBytes":"512m"
        }
    },
    {
        "Classification":"capacity-scheduler",
        "Properties":{
            "yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator"
        }
    }
]
```

------

## 为您的集群添加引导操作
<a name="emr-spark-rapids-bootstrap"></a>

有关如何在创建集群时提供引导操作脚本的更多信息，请参阅《Amazon EMR 管理指南》**中的[引导操作基础](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html#bootstrapUses)。

以下示例脚本展示了如何为 Amazon EMR 6.x 和 7.x 制作引导操作文件：

------
#### [ Amazon EMR 7.x ]

**Amazon EMR 7.x 的示例 `my-bootstrap-action.sh` 文件**

要使用 YARN 管理 Amazon EMR 7.x 发行版的 GPU 资源，您必须在集群上手动挂载 CGroup v1。您可以使用引导操作脚本来执行此操作，如本示例所示。

```
#!/bin/bash
set -ex
 
sudo mkdir -p /spark-rapids-cgroup/devices
sudo mount -t cgroup -o devices cgroupv1-devices /spark-rapids-cgroup/devices
sudo chmod a+rwx -R /spark-rapids-cgroup
```

------
#### [ Amazon EMR 6.x ]

**Amazon EMR 6.x 的示例 `my-bootstrap-action.sh` 文件**

对于 Amazon EMR 6.x 发行版，您必须在集群上打开 YARN 的 CGroup 权限。您可以使用引导操作脚本来执行此操作，如本示例所示。

```
#!/bin/bash
set -ex
 
sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct
sudo chmod a+rwx -R /sys/fs/cgroup/devices
```

------

## 启动您的集群。
<a name="emr-spark-rapids-launchcluster"></a>

最后一步是使用上述集群配置启动您的集群。以下是一个通过 Amazon EMR CLI 启动集群的命令示例：

```
 aws emr create-cluster \
--release-label emr-7.12.0 \
--applications Name=Hadoop Name=Spark \
--service-role EMR_DefaultRole_V2 \
--ec2-attributes KeyName=my-key-pair,InstanceProfile=EMR_EC2_DefaultRole \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.4xlarge \                 
                  InstanceGroupType=CORE,InstanceCount=1,InstanceType=g4dn.2xlarge \    
                  InstanceGroupType=TASK,InstanceCount=1,InstanceType=g4dn.2xlarge \
--configurations file:///my-configurations.json \
--bootstrap-actions Name='My Spark Rapids Bootstrap action',Path=s3://amzn-s3-demo-bucket/my-bootstrap-action.sh
```