# Apache Spark with Amazon SageMaker AI
<a name="apache-spark"></a>

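Amazon SageMaker AI Spark is an open source Spark library that helps you build Spark machine learning (ML) pipelines with SageMaker AI. It simplifies the integration of Spark ML stages with SageMaker AI stages, such as model training and hosting. For information about SageMaker AI Spark, see the [SageMaker AI Spark](https://github.com/aws/sagemaker-spark) GitHub repository. The following topics explain how to use Apache Spark with SageMaker AI.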

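The SageMaker AI Spark library is available in Python and Scala. You can use SageMaker AI Spark to train models in SageMaker AI from `org.apache.spark.sql.DataFrame` data frames in your Spark clusters. After model training, you can also host the model using SageMaker AI hosting services.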

The SageMaker AI Spark library, `com.amazonaws.services.sagemaker.sparksdk`, provides the following classes, among others:
+ `SageMakerEstimator` - Extends the `org.apache.spark.ml.Estimator` interface. You can use this estimator for model training in SageMaker AI.
+ `KMeansSageMakerEstimator`, `PCASageMakerEstimator`, and `XGBoostSageMakerEstimator` - Extend the `SageMakerEstimator` class.
+ `SageMakerModel` - Extends the `org.apache.spark.ml.Model` class. You can use this `SageMakerModel` to host models and get inferences in SageMaker AI.
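
As a rough illustration of how these classes fit together, the following PySpark sketch builds a `KMeansSageMakerEstimator`, calls `fit` to obtain a `SageMakerModel`, and calls `transform` to get inferences. It assumes that the `sagemaker_pyspark` package is installed and that AWS credentials are configured. The role ARN, instance types, and hyperparameter values are placeholders, and the `EstimatorSettings`, `build_estimator`, and `train_and_infer` names are hypothetical helpers introduced here for illustration; they are not part of the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EstimatorSettings:
    """Hypothetical settings holder for illustration; not part of sagemaker_pyspark."""
    role_arn: str                                  # placeholder IAM role ARN
    training_instance_type: str = "ml.m4.xlarge"   # assumed instance type
    training_instance_count: int = 1
    endpoint_instance_type: str = "ml.m4.xlarge"   # assumed instance type
    endpoint_initial_instance_count: int = 1
    k: int = 10              # number of clusters (example value)
    feature_dim: int = 784   # length of each feature vector (example value)


def build_estimator(settings):
    """Create a KMeansSageMakerEstimator from the settings above."""
    # Imported lazily so that this module still loads on machines
    # where sagemaker_pyspark is not installed.
    from sagemaker_pyspark import IAMRole
    from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

    estimator = KMeansSageMakerEstimator(
        sagemakerRole=IAMRole(settings.role_arn),
        trainingInstanceType=settings.training_instance_type,
        trainingInstanceCount=settings.training_instance_count,
        endpointInstanceType=settings.endpoint_instance_type,
        endpointInitialInstanceCount=settings.endpoint_initial_instance_count,
    )
    estimator.setK(settings.k)
    estimator.setFeatureDim(settings.feature_dim)
    return estimator


def train_and_infer(estimator, training_df, test_df):
    """Train with fit(), then get inferences with transform().

    fit() returns a SageMakerModel; transform() returns a DataFrame
    that contains the inferences.
    """
    model = estimator.fit(training_df)
    return model.transform(test_df)
```

To run this against real data, the Spark session also needs the library's JAR files on its classpath, for example through `sagemaker_pyspark.classpath_jars()`, and both data frames need a `features` column of vectors.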

You can download the source code for both the Python Spark (PySpark) and Scala libraries from the [SageMaker AI Spark](https://github.com/aws/sagemaker-spark) GitHub repository.

For installation instructions and examples for the SageMaker AI Spark library, see [SageMaker AI Spark for Scala examples](apache-spark-example1.md) or [Resources for using SageMaker AI Spark for Python (PySpark) examples](apache-spark-additional-examples.md).

If you use Amazon EMR on AWS to manage Spark clusters, see [Apache Spark](https://aws.amazon.com/emr/features/spark/). For more information about using Amazon EMR in SageMaker AI, see [Data preparation using Amazon EMR](studio-notebooks-emr-cluster.md).

**Topics**
+ [Integrate your Apache Spark application with SageMaker AI](#spark-sdk-common-process)
+ [SageMaker AI Spark for Scala examples](apache-spark-example1.md)
+ [Resources for using SageMaker AI Spark for Python (PySpark) examples](apache-spark-additional-examples.md)

## Integrate your Apache Spark application with SageMaker AI
<a name="spark-sdk-common-process"></a>

The following is a high-level summary of the steps for integrating your Apache Spark application with SageMaker AI.

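1. Continue data preprocessing using the Apache Spark library that you are familiar with. Your dataset remains a `DataFrame` in your Spark cluster. Load your data into a `DataFrame`, and preprocess it so that it has a `features` column containing an `org.apache.spark.ml.linalg.Vector` of `Doubles`, and an optional `label` column with values of type `Double`.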

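1. Use an estimator in the SageMaker AI Spark library to train your model. For example, if you choose the k-means algorithm provided by SageMaker AI for model training, call the `KMeansSageMakerEstimator.fit` method.

   Provide your `DataFrame` as input. The estimator returns a `SageMakerModel` object.
**Note**  
`SageMakerModel` extends `org.apache.spark.ml.Model`.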

   The `fit` method does the following:

   1. Converts the input `DataFrame` to the protobuf format by selecting the `features` and `label` columns from the input `DataFrame`. It then uploads the protobuf data to an Amazon S3 bucket. The protobuf format is efficient for model training in SageMaker AI.

   1. Starts model training in SageMaker AI by sending a SageMaker AI [`CreateTrainingJob`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) request. After model training has completed, SageMaker AI saves the model artifacts to an S3 bucket.

      SageMaker AI assumes the IAM role that you specified for model training to perform tasks on your behalf. For example, it uses the role to read training data from an S3 bucket and to write model artifacts to a bucket.

   1. Creates and returns a `SageMakerModel` object. The constructor performs the following tasks, which are related to deploying your model to SageMaker AI.

      1. Sends a [`CreateModel`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) request to SageMaker AI.

      1. Sends a [`CreateEndpointConfig`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) request to SageMaker AI.

      1. Sends a [`CreateEndpoint`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) request to SageMaker AI, which then launches the specified resources and hosts the model on them.

1. Get inferences from your model hosted in SageMaker AI with the `SageMakerModel.transform` method.

   Provide a `DataFrame` with input features. The `transform` method transforms it into a `DataFrame` that contains the inferences. Internally, the `transform` method sends a request to the [`InvokeEndpoint`](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) SageMaker API to get inferences, and then appends the inferences to the input `DataFrame`.
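
The sequence of API requests that `fit` and the `SageMakerModel` constructor send can be sketched as plain request payloads. The following is a minimal sketch under stated assumptions, not the library's implementation: the `plan_fit_requests` function, its arguments, and the default values are hypothetical, the request fields are only a subset of what each SageMaker API operation accepts, and the model artifact path assumes the usual `<output path>/<job name>/output/model.tar.gz` layout. The code only builds dictionaries and sends nothing to AWS.

```python
def plan_fit_requests(name, role_arn, image_uri, train_s3_uri, output_s3_uri,
                      instance_type="ml.m4.xlarge", instance_count=1):
    """Return (operation, request) pairs in the order described above.

    Hypothetical helper for illustration only; every argument is a placeholder.
    """
    output_prefix = output_s3_uri.rstrip("/")

    # Request that starts model training on the uploaded training data.
    training_job = {
        "TrainingJobName": name,
        "RoleArn": role_arn,  # IAM role that SageMaker AI assumes
        "AlgorithmSpecification": {
            "TrainingImage": image_uri,
            "TrainingInputMode": "File",
        },
        "InputDataConfig": [{
            "ChannelName": "train",
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": train_s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }},
        }],
        "OutputDataConfig": {"S3OutputPath": output_prefix},
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 10,  # assumed volume size
        },
        "StoppingCondition": {"MaxRuntimeInSeconds": 3600},  # assumed limit
    }

    # Request that registers the trained model artifacts.
    model = {
        "ModelName": name,
        "ExecutionRoleArn": role_arn,
        "PrimaryContainer": {
            "Image": image_uri,
            # Assumed artifact location written by the training job.
            "ModelDataUrl": f"{output_prefix}/{name}/output/model.tar.gz",
        },
    }

    # Request that describes the resources used to host the model.
    endpoint_config = {
        "EndpointConfigName": name,
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        }],
    }

    # Request that launches those resources and hosts the model on them.
    endpoint = {"EndpointName": name, "EndpointConfigName": name}

    return [
        ("CreateTrainingJob", training_job),
        ("CreateModel", model),
        ("CreateEndpointConfig", endpoint_config),
        ("CreateEndpoint", endpoint),
    ]
```

With boto3, these operations correspond to the `create_training_job`, `create_model`, `create_endpoint_config`, and `create_endpoint` methods of the SageMaker client. A caller that sends the requests directly would also need to wait for the training job to complete before creating the model.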