# Create a custom Docker container image for SageMaker and use it for model training in AWS Step Functions

*Julia Bluszcz, Aubrey Oosthuizen, Mohan Gowda Purushothama, Neha Sharma, and Mateusz Zaremba, Amazon Web Services*

## Summary

This pattern shows how to create a Docker container image for [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html) and use it to train a model in [AWS Step Functions](https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html). By packaging custom algorithms in a container, you can run almost any code in the SageMaker environment, regardless of programming language, framework, or dependencies.

In the example [SageMaker notebook](https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html) provided, the custom Docker container image is stored in [Amazon Elastic Container Registry (Amazon ECR)](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html). Step Functions then uses the container that's stored in Amazon ECR to run a Python processing script for SageMaker. Finally, the container exports the model to [Amazon Simple Storage Service (Amazon S3)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html).

## Prerequisites and limitations

**Prerequisites**

+ An active AWS account
+ An [AWS Identity and Access Management (IAM) role for SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) with Amazon S3 permissions
+ An [IAM role for Step Functions](https://sagemaker-examples.readthedocs.io/en/latest/step-functions-data-science-sdk/step_functions_mlworkflow_processing/step_functions_mlworkflow_scikit_learn_data_processing_and_model_evaluation.html#Create-an-Execution-Role-for-Step-Functions)
+ Basic familiarity with Python
+ Familiarity with the Amazon SageMaker Python SDK
+ Familiarity with the AWS Command Line Interface (AWS CLI)
+ Familiarity with the AWS SDK for Python (Boto3)
+ Familiarity with Amazon ECR
+ Familiarity with Docker

**Product versions**

+ AWS Step Functions Data Science SDK version 2.3.0
+ Amazon SageMaker Python SDK version 2.78.0

## Architecture

The following diagram shows an example workflow for creating a Docker container image for SageMaker and then using it to train a model in Step Functions:

![Workflow to create a Docker container image for SageMaker to use for model training in Step Functions.](http://docs.aws.amazon.com/zh_cn/prescriptive-guidance/latest/patterns/images/pattern-img/7857d57f-3077-4b06-8971-fb5846387693/images/37755e38-0bc4-4dd0-90c7-135d95b00053.png)

The diagram shows the following workflow:

1. A data scientist or DevOps engineer uses an Amazon SageMaker notebook to create a custom Docker container image.
1. A data scientist or DevOps engineer stores the Docker container image in an Amazon ECR private repository that's in a private registry.
1. A data scientist or DevOps engineer uses the Docker container to run a Python SageMaker processing job in a Step Functions workflow.

**Automation and scale**

The example SageMaker notebook in this pattern uses an `ml.m5.xlarge` notebook instance type. You can change the instance type to fit your use case. For more information about SageMaker notebook instance types, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

## Tools

+ [Amazon Elastic Container Registry (Amazon ECR)](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) is a managed container image registry service that's secure, scalable, and reliable.
+ [Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html) is a managed machine learning (ML) service that helps you build and train ML models and then deploy them into a production-ready hosted environment.
+ [Amazon SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) is an open source library for training and deploying ML models on SageMaker.
+ [AWS Step Functions](https://docs.aws.amazon.com/step-functions/latest/dg/welcome.html) is a serverless orchestration service that helps you combine AWS Lambda functions and other AWS services to build business-critical applications.
+ [AWS Step Functions Data Science Python SDK](https://aws-step-functions-data-science-sdk.readthedocs.io/en/stable/index.html) is an open source library that helps you create Step Functions workflows that process and publish ML models.

## Epics

### Create a custom Docker container image and store it in Amazon ECR

| Task | Description | Skills required |
| --- | --- | --- |
| Set up Amazon ECR and create a new private registry. | If you haven't already, set up Amazon ECR by following the instructions in [Setting up with Amazon ECR](https://docs.aws.amazon.com/AmazonECR/latest/userguide/get-set-up-for-amazon-ecr.html) in the *Amazon ECR User Guide*. Each AWS account is provided with a default private Amazon ECR registry. | DevOps engineer |
| Create an Amazon ECR private repository. | Follow the instructions in [Creating a private repository](https://docs.aws.amazon.com/AmazonECR/latest/userguide/repository-create.html) in the *Amazon ECR User Guide*. The repository that you create is where you'll store your custom Docker container images. | DevOps engineer |
| Create a Dockerfile that includes the specifications needed to run your SageMaker processing job. | Write a Dockerfile that specifies everything the container needs to run your SageMaker processing job. For instructions, see [Adapting your own training container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html) in the *Amazon SageMaker Developer Guide*. For more information about Dockerfiles, see the [Dockerfile reference](https://docs.docker.com/engine/reference/builder/) in the Docker documentation.

**Example Jupyter notebook code cells to create a Dockerfile**

*Cell 1*

# Make docker folder
!mkdir -p docker

*Cell 2*

%%writefile docker/Dockerfile

FROM python:3.7-slim-buster

RUN pip3 install pandas==0.25.3 scikit-learn==0.21.3
ENV PYTHONUNBUFFERED=TRUE

ENTRYPOINT ["python3"]

| DevOps engineer |
| Build your Docker container image and push it to Amazon ECR. | [See the AWS documentation website for more details](http://docs.aws.amazon.com/zh_cn/prescriptive-guidance/latest/patterns/create-a-custom-docker-container-image-for-sagemaker-and-use-it-for-model-training-in-aws-step-functions.html) For more information, see [Building and registering the container](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.html#Building-and-registering-the-container) in *Building your own algorithm container* on GitHub.

**Example Jupyter notebook code cells to build and register a Docker image**

Before you run the following cells, make sure that you've created a Dockerfile and stored it in a directory named `docker`. Also make sure that you've created an Amazon ECR repository, and replace the `ecr_repository` value in the first cell with your repository's name.

*Cell 1*

import boto3
tag = ':latest'
account_id = boto3.client('sts').get_caller_identity().get('Account')
region = boto3.Session().region_name
ecr_repository = 'byoc'

image_uri = '{}.dkr.ecr.{}.amazonaws.com/{}'.format(account_id, region, ecr_repository + tag)

*Cell 2*

# Build docker image
!docker build -t $image_uri docker

*Cell 3*

# Authenticate to ECR
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account_id}.dkr.ecr.{region}.amazonaws.com

*Cell 4*

# Push docker image
!docker push $image_uri

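You must [authenticate your Docker client to your private registry](https://docs.aws.amazon.com/AmazonECR/latest/userguide/registry_auth.html) before you can use the `docker push` and `docker pull` commands. These commands push and pull images to and from a repository in your registry. | DevOps engineer |

### Create a Step Functions workflow that uses your custom Docker container image

| Task | Description | Skills required |
| --- | --- | --- |
| Create a Python script that includes your custom processing and model training logic. | Write the custom processing logic to run in your data processing script. Then, save it as a Python script named `training.py`. For more information, see [Bring your own model with SageMaker script mode](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker-script-mode/sagemaker-script-mode.html) on GitHub.

**Example Python script that includes custom processing and model training logic**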

%%writefile training.py
import os

from joblib import dump
from sklearn import datasets, svm


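if __name__ == '__main__':
    # Load the scikit-learn handwritten digits dataset
    digits = datasets.load_digits()

    # Create the support vector classifier
    clf = svm.SVC(gamma=0.001, C=100.)

    # Fit the model on every sample except the last one
    clf.fit(digits.data[:-1], digits.target[:-1])

    # Serialize the model into the directory that the processing job uploads to Amazon S3
    output_dir = '/opt/ml/processing/model'
    os.makedirs(output_dir, exist_ok=True)
    output_path = os.path.join(output_dir, "model.joblib")
    dump(clf, output_path)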

| Data scientist |
| Create a Step Functions workflow that includes your SageMaker processing job as one of the steps. | Install and import the [AWS Step Functions Data Science SDK](https://aws-step-functions-data-science-sdk.readthedocs.io/en/stable/readmelink.html), and upload the **training.py** file to Amazon S3. Then, use the [Amazon SageMaker Python SDK](https://github.com/aws/sagemaker-python-sdk) to define a processing step in Step Functions. Make sure that you've [created an IAM execution role for Step Functions](https://sagemaker-examples.readthedocs.io/en/latest/step-functions-data-science-sdk/step_functions_mlworkflow_processing/step_functions_mlworkflow_scikit_learn_data_processing_and_model_evaluation.html#Create-an-Execution-Role-for-Step-Functions) in your AWS account.

**Example environment setup and custom training script to upload to Amazon S3**

!pip install stepfunctions

import boto3
import stepfunctions
import sagemaker
import datetime

from stepfunctions import steps
from stepfunctions.inputs import ExecutionInput
from stepfunctions.steps import (
    Chain
)
from stepfunctions.workflow import Workflow
from sagemaker.processing import ScriptProcessor, ProcessingInput, ProcessingOutput

sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket() 
role = sagemaker.get_execution_role()
prefix = 'byoc-training-model'

# See prerequisites section to create this role
workflow_execution_role = f"arn:aws:iam::{account_id}:role/AmazonSageMaker-StepFunctionsWorkflowExecutionRole"

execution_input = ExecutionInput(
    schema={
        "PreprocessingJobName": str})


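# Upload the script to Amazon S3. key_prefix is an S3 key prefix (a folder), not a
# file name, so the script is stored at s3://<bucket>/<prefix>/code/training.py
input_code = sagemaker_session.upload_data(
    "training.py",
    bucket=bucket,
    key_prefix=f"{prefix}/code",
)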

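**Example SageMaker processing step definition that uses a custom Amazon ECR image and Python script**

Make sure that you use the `execution_input` parameter to specify the job name. The parameter's value must be unique each time the job runs. Also, the **training.py** file is passed to the `ProcessingStep` through the `inputs` parameter, which means that it's copied into the container. The `destination` of the `ProcessingInput` is the directory that contains the script path given as the second argument in `container_entrypoint`.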

script_processor = ScriptProcessor(command=['python3'],
                image_uri=image_uri,
                role=role,
                instance_count=1,
                instance_type='ml.m5.xlarge')


processing_step = steps.ProcessingStep(
    "training-step",
    processor=script_processor,
    job_name=execution_input["PreprocessingJobName"],
    inputs=[
        ProcessingInput(
            source=input_code,
            destination="/opt/ml/processing/input/code",
            input_name="code",
        ),
    ],
    outputs=[
        ProcessingOutput(
            source='/opt/ml/processing/model', 
            destination="s3://{}/{}".format(bucket, prefix), 
            output_name='byoc-example')
    ],
    container_entrypoint=["python3", "/opt/ml/processing/input/code/training.py"],
)

**Example Step Functions workflow that runs a SageMaker processing job**

This example workflow includes the SageMaker processing job step only, not a complete Step Functions workflow. For a full example workflow, see [Example notebooks in SageMaker](https://aws-step-functions-data-science-sdk.readthedocs.io/en/stable/readmelink.html#example-notebooks-in-sagemaker) in the AWS Step Functions Data Science SDK documentation.

workflow_graph = Chain([processing_step])

workflow = Workflow(
    name="ProcessingWorkflow",
    definition=workflow_graph,
    role=workflow_execution_role
)

workflow.create()
# Execute the workflow. Each SageMaker processing job requires a unique name,
# so the name is derived from the current timestamp.
execution = workflow.execute(
    inputs={
        "PreprocessingJobName": datetime.datetime.now().strftime("%Y%m%d-%H%M%S"),
    }
)
execution_output = execution.get_output(wait=True)

| Data scientist |

## Related resources

+ [Process data](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html) (*Amazon SageMaker Developer Guide*)
+ [Adapting your own training container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html) (*Amazon SageMaker Developer Guide*)
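
**Supplementary sketch: generating a unique processing job name**

The workflow epic requires that the job name supplied through `execution_input` is unique on every run. The following is a minimal sketch of one way to build such a name, not part of the original pattern. The helper `make_job_name`, its `prefix` default, and the random suffix are hypothetical. The name constraint (1 to 63 alphanumeric characters with optional hyphens between them) is an assumption based on the SageMaker `CreateProcessingJob` API reference, so verify it against the current documentation.

```python
import re
import uuid
from datetime import datetime, timezone

# Assumed constraint from the SageMaker CreateProcessingJob API reference:
# 1-63 characters, alphanumeric, hyphens allowed only between alphanumerics.
JOB_NAME_PATTERN = re.compile(r"^[a-zA-Z0-9](-*[a-zA-Z0-9]){0,62}$")


def make_job_name(prefix="byoc-training", now=None):
    """Return a processing job name that differs on every call (hypothetical helper)."""
    now = now or datetime.now(timezone.utc)
    timestamp = now.strftime("%Y%m%d-%H%M%S")
    # The random suffix guards against two runs that start in the same second.
    suffix = uuid.uuid4().hex[:8]
    name = f"{prefix}-{timestamp}-{suffix}"
    if not JOB_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Invalid processing job name: {name!r}")
    return name


if __name__ == "__main__":
    print(make_job_name())
```

Under these assumptions, the result could be passed to the workflow as `inputs={"PreprocessingJobName": make_job_name()}`. Validating the name locally surfaces a malformed prefix before the state machine starts, rather than as a failed execution.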