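# Run your local code as a SageMaker training job

You can run your local machine learning (ML) Python code as a large single-node Amazon SageMaker training job or as multiple parallel jobs. To do this, annotate your code with an @remote decorator, as shown in the following code example. Remote functions do not support [distributed training](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training.html) (across multiple instances).

```
@remote(**settings)
def divide(x, y):
    return x / y
```

The SageMaker Python SDK automatically translates your existing workspace environment, and any associated data processing code and datasets, into a SageMaker training job that runs on the SageMaker training platform. You can also activate a persistent cache feature, which further reduces job start latency by caching previously downloaded dependency packages. This reduction in job latency is greater than the reduction from using SageMaker AI managed warm pools alone. For more information, see [Using persistent cache](train-warm-pools.md#train-warm-pools-persistent-cache).

**Note**
Remote functions do not support distributed training jobs.

The following sections show how to annotate your local ML code with an @remote decorator and tailor your experience for your use case. This includes customizing your environment and integrating with SageMaker Experiments.

**Topics**
+ [Set up your environment](#train-remote-decorator-env)
+ [Invoking a remote function](train-remote-decorator-invocation.md)
+ [Configuration file](train-remote-decorator-config.md)
+ [Customize your runtime environment](train-remote-decorator-customize.md)
+ [Container image compatibility](train-remote-decorator-container.md)
+ [Logging parameters and metrics with Amazon SageMaker Experiments](train-remote-decorator-experiments.md)
+ [Using modular code with the @remote decorator](train-remote-decorator-modular.md)
+ [Private repository for runtime dependencies](train-remote-decorator-private.md)
+ [Example notebooks](train-remote-decorator-examples.md)

## Set up your environment

Choose one of the following three options to set up your environment.

### Run your code from Amazon SageMaker Studio Classic

You can annotate and run your local ML code from SageMaker Studio Classic by creating a SageMaker notebook and attaching any image available in SageMaker Studio Classic. The following instructions help you create a SageMaker notebook, install the SageMaker Python SDK, and annotate your code with the decorator.

1. Create a SageMaker notebook and attach an image in SageMaker Studio Classic as follows:

   1. Follow the instructions in [Launch Amazon SageMaker Studio Classic](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-launch.html) in the *Amazon SageMaker AI Developer Guide*.

   1. Select **Studio** from the left navigation pane. This opens a new window.

   1. In the **Get Started** dialog box, select a user profile from the down arrow. This opens a new window.

   1. Select **Open Studio Classic**.

   1. Select **Open Launcher** from the main working area. This opens a new page.

   1. Select **Create notebook** from the main working area.

   1.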
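      In the **Change environment** dialog box, select **Base Python 3.0** from the down arrow next to **Image**.

   The @remote decorator automatically detects the image attached to the SageMaker Studio Classic notebook and uses it to run the SageMaker training job. If `image_uri` is specified either as an argument in the decorator or in the configuration file, the value specified in `image_uri` is used instead of the detected image.

   For more information about how to create a notebook in SageMaker Studio Classic, see the **Create a Notebook from the File Menu** section in [Create or Open an Amazon SageMaker Studio Classic Notebook](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-create-open.html#notebooks-create-file-menu).

   For a list of available images, see [Supported Docker images](https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator-container.html).

1. Install the SageMaker Python SDK.

   To annotate your code with the @remote decorator inside a SageMaker Studio Classic notebook, you must have the SageMaker Python SDK installed. Install the SageMaker Python SDK, as shown in the following code example.

   ```
   !pip install sagemaker
   ```

1. Use the @remote decorator to run functions in a SageMaker training job.

   To run your local ML code, first create a dependencies file that tells SageMaker AI where to locate your local code. To do so, follow these steps:

   1. From the SageMaker Studio Classic Launcher main working area, in **Utilities and files**, choose **Text file**. This opens a new tab with a text file called `untitled.txt`.

      For more information about the SageMaker Studio Classic user interface (UI), see [Amazon SageMaker Studio Classic UI Overview](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-ui.html).

   1. Rename `untitled.txt` to `requirements.txt`.

   1. Add all the dependencies required for the code, along with the SageMaker AI library, to `requirements.txt`.

      A minimal `requirements.txt` for the example `divide` function is as follows.

      ```
      sagemaker
      ```

   1. Run your code with the remote decorator by passing the dependencies file as follows.

      ```
      from sagemaker.remote_function import remote

      @remote(instance_type="ml.m5.xlarge", dependencies='./requirements.txt')
      def divide(x, y):
          return x / y

      divide(2, 3.0)
      ```

      For additional code examples, see the sample notebook [quick_start.ipynb](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-remote-function/quick_start/quick_start.ipynb).

      If you are already running a SageMaker Studio Classic notebook and you install the Python SDK as instructed in **2. Install the SageMaker Python SDK**, you must restart your kernel. For more information, see [Use the SageMaker Studio Classic Notebook Toolbar](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-menu.html) in the *Amazon SageMaker AI Developer Guide*.

### Run your code from an Amazon SageMaker notebook

You can annotate your local ML code from a SageMaker notebook instance. The following instructions show how to create a notebook instance with a custom kernel, install the SageMaker Python SDK, and annotate your code with the decorator.

1.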
   Create a notebook instance with a custom `conda` kernel.

   You can annotate your local ML code with an @remote decorator to use inside a SageMaker training job. First, you must create and customize a SageMaker notebook instance to use a kernel with Python version 3.7 or higher, up to 3.10.x. To do so, follow these steps:

   1. Open the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

   1. In the left navigation panel, choose **Notebook** to expand its options.

   1. Choose **Notebook Instances** from the expanded options.

   1. Choose the **Create Notebook Instance** button. This opens a new page.

   1. For **Notebook instance name**, enter a name with a maximum of 63 characters and no spaces. Valid characters: **A-Z**, **a-z**, **0-9**, and **. : + = @ _ % -** (hyphen).

   1. In the **Notebook instance settings** dialog box, expand the right arrow next to **Additional Configuration**.

   1. Under **Lifecycle configuration - optional**, expand the down arrow and select **Create a new lifecycle configuration**. This opens a new dialog box.

   1. Under **Name**, enter a name for your configuration setting.

   1. In the **Scripts** dialog box, in the **Start notebook** tab, replace the existing contents of the text box with the following script.

      ```
      #!/bin/bash

      set -e

      sudo -u ec2-user -i <<'EOF'
      unset SUDO_UID

      WORKING_DIR=/home/ec2-user/SageMaker/custom-miniconda/
      source "$WORKING_DIR/miniconda/bin/activate"

      # Register every conda environment as a Jupyter kernel
      for env in $WORKING_DIR/miniconda/envs/*; do
          BASENAME=$(basename "$env")
          source activate "$BASENAME"
          python -m ipykernel install --user --name "$BASENAME" --display-name "Custom ($BASENAME)"
      done
      EOF

      echo "Restarting the Jupyter server.."
      # restart command is dependent on current running Amazon Linux and JupyterLab
      CURR_VERSION_AL=$(cat /etc/system-release)
      CURR_VERSION_JS=$(jupyter --version)

      if [[ $CURR_VERSION_JS == *$"jupyter_core : 4.9.1"* ]] && [[ $CURR_VERSION_AL == *$" release 2018"* ]]; then
          sudo initctl restart jupyter-server --no-wait
      else
          sudo systemctl --no-block restart jupyter-server.service
      fi
      ```

   1.
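      In the **Scripts** dialog box, in the **Create notebook** tab, replace the existing contents of the text box with the following script.

      ```
      #!/bin/bash

      set -e

      sudo -u ec2-user -i <<'EOF'
      unset SUDO_UID

      # Install a separate conda installation via Miniconda
      WORKING_DIR=/home/ec2-user/SageMaker/custom-miniconda
      mkdir -p "$WORKING_DIR"
      wget https://repo.anaconda.com/miniconda/Miniconda3-4.6.14-Linux-x86_64.sh -O "$WORKING_DIR/miniconda.sh"
      bash "$WORKING_DIR/miniconda.sh" -b -u -p "$WORKING_DIR/miniconda"
      rm -rf "$WORKING_DIR/miniconda.sh"

      # Create a custom conda environment
      source "$WORKING_DIR/miniconda/bin/activate"
      KERNEL_NAME="custom_python310"
      PYTHON="3.10"
      conda create --yes --name "$KERNEL_NAME" python="$PYTHON" pip
      conda activate "$KERNEL_NAME"
      pip install --quiet ipykernel
      # Customize these lines as necessary to install the required packages
      EOF
      ```

   1. Choose the **Create configuration** button on the bottom right of the window.

   1. Choose the **Create notebook instance** button on the bottom right of the window.

   1. Wait for the notebook instance **Status** to change from **Pending** to **InService**.

1. Create a Jupyter notebook in the notebook instance.

   The following instructions show how to create a Jupyter notebook with Python 3.10 in your newly created SageMaker notebook instance.

   1. After the notebook instance **Status** from the previous step is **InService**, select **Open Jupyter** under **Actions** in the row containing your newly created notebook instance **Name**. This opens a new Jupyter server.

   1. In the Jupyter server, select **New** from the top right menu.

   1. From the down arrow, select **conda_custom_python310**. This creates a new Jupyter notebook that uses a Python 3.10 kernel. You can now use this new Jupyter notebook in the same way as a local Jupyter notebook.

1. Install the SageMaker Python SDK.

   After your virtual environment is running, install the SageMaker Python SDK by using the following code example.

   ```
   !pip install sagemaker
   ```

1. Use the @remote decorator to run functions in a SageMaker training job.

   When you annotate your local ML code with an @remote decorator inside the SageMaker notebook, SageMaker training automatically interprets the function of your code and runs it as a SageMaker training job. Set up your notebook by doing the following:

   1. In the notebook menu, select the kernel name from the SageMaker notebook instance that you created in step 1, **Create a notebook instance with a custom `conda` kernel**.

      For more information, see [Change an Image or a Kernel](https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-run-and-manage-change-image.html).

   1. From the down arrow, choose a custom `conda` kernel that uses Python 3.7 or higher. For example, selecting `conda_custom_python310` chooses the kernel for Python 3.10.

   1. Choose **Select**.

   1. Wait for the kernel's status to show as idle, which indicates that the kernel has started.

   1. In the Jupyter server home page, select **New** from the top right menu.

   1. Next to the down arrow, select **Text file**. This creates a new text file called `untitled.txt`.

   1. Rename `untitled.txt` to `requirements.txt` and add all the dependencies required for the code, along with `sagemaker`.

   1.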
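      Run your code with the remote decorator by passing the dependencies file as follows.

      ```
      from sagemaker.remote_function import remote

      @remote(instance_type="ml.m5.xlarge", dependencies='./requirements.txt')
      def divide(x, y):
          return x / y

      divide(2, 3.0)
      ```

      For additional code examples, see the sample notebook [quick_start.ipynb](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-remote-function/quick_start/quick_start.ipynb).

### Run your code from within your local IDE

You can annotate your local ML code with an @remote decorator inside your preferred local IDE. The following steps show the necessary prerequisites, how to install the Python SDK, and how to annotate your code with the @remote decorator.

1. Install the prerequisites by setting up the AWS Command Line Interface (AWS CLI) and creating a role, as follows:
   + Onboard to a SageMaker AI domain by following the instructions in the **AWS CLI Prerequisites** section of [Set Up Amazon SageMaker AI Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/gs-set-up.html#gs-cli-prereq).
   + Create an IAM role by following the **Create execution role** section of [SageMaker AI Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html).

1. Create a virtual environment by using either PyCharm or `conda`, with Python version 3.7 or higher, up to 3.10.x.
   + Set up a virtual environment using PyCharm as follows:

     1. Select **File** from the main menu.

     1. Choose **New Project**.

     1. Choose **Conda** from the down arrow under **New environment using**.

     1. In the **Python version** field, use the down arrow to select a version of Python that is 3.7 or higher. You can go up to 3.10.x from the list.

     ![\[\]](http://docs.aws.amazon.com/zh_cn/sagemaker/latest/dg/images/training-pycharm-ide.png)
   + If you have Anaconda installed, you can set up a virtual environment using `conda` as follows:
     + Open an Anaconda prompt terminal interface.
     + Create and activate a new `conda` environment using a Python version of 3.7 or higher, up to 3.10.x. The following code example shows how to create a `conda` environment using Python version 3.10.

       ```
       conda create -n sagemaker_jobs_quick_start python=3.10 pip
       conda activate sagemaker_jobs_quick_start
       ```

1. Install the SageMaker Python SDK.

   To package your code from your preferred IDE, you must have a virtual environment set up using Python 3.7 or higher, up to 3.10.x. You also need a compatible container image. Install the SageMaker Python SDK using the following code example.

   ```
   pip install sagemaker
   ```

1.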
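   Wrap your code inside the @remote decorator.

   The SageMaker Python SDK automatically interprets the function of your code and runs it as a SageMaker training job. The following code examples show how to import the necessary libraries, set up a SageMaker session, and annotate a function with the @remote decorator.

   You can run your code either by providing the required dependencies directly or by using the dependencies from the active `conda` environment.
   + To provide the dependencies directly, do the following:
     + Create a `requirements.txt` file in the working directory that the code resides in.
     + Add all the dependencies required for the code, along with the SageMaker library. A minimal `requirements.txt` for the example `divide` function is as follows.

       ```
       sagemaker
       ```
     + Run your code with the @remote decorator by passing the dependencies file. In the following code example, replace `The IAM role name` with the AWS Identity and Access Management (IAM) role ARN that you want SageMaker to use to run your job.

       ```
       import boto3
       import sagemaker
       from sagemaker.remote_function import remote

       sm_session = sagemaker.Session(boto_session=boto3.session.Session(region_name="us-west-2"))
       settings = dict(
           sagemaker_session=sm_session,
           role="<The IAM role name>",  # placeholder: replace with your IAM role ARN
           instance_type="ml.m5.xlarge",
           dependencies='./requirements.txt'
       )

       @remote(**settings)
       def divide(x, y):
           return x / y


       if __name__ == "__main__":
           print(divide(2, 3.0))
       ```
   + To use the dependencies from the active `conda` environment, use the value `auto_capture` for the `dependencies` parameter, as follows.

     ```
     import boto3
     import sagemaker
     from sagemaker.remote_function import remote

     sm_session = sagemaker.Session(boto_session=boto3.session.Session(region_name="us-west-2"))
     settings = dict(
         sagemaker_session=sm_session,
         role="<The IAM role name>",  # placeholder: replace with your IAM role ARN
         instance_type="ml.m5.xlarge",
         dependencies="auto_capture"
     )

     @remote(**settings)
     def divide(x, y):
         return x / y


     if __name__ == "__main__":
         print(divide(2, 3.0))
     ```

   **Note**
   You can also implement the previous code inside a Jupyter notebook. PyCharm Professional Edition supports Jupyter natively. For more guidance, see [Jupyter notebook support](https://www.jetbrains.com/help/pycharm/ipython-notebook-support.html) in the PyCharm documentation.

# Invoking a remote function

To invoke a function inside the @remote decorator, use either of the following methods:
+ [Use an @remote decorator to invoke a function](#train-remote-decorator-invocation-decorator).
+ [Use the `RemoteExecutor` API to invoke a function](#train-remote-decorator-invocation-api).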
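The practical difference between the two methods is whether a call blocks. The following is a minimal local sketch of that difference, not SageMaker code: it uses the standard library's `ThreadPoolExecutor` as a stand-in for `RemoteExecutor` (an assumption made purely for illustration), starts no training job, and treats `divide` as a plain local function with a simulated delay.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def divide(x, y):
    # Hypothetical stand-in for a remote function; the sleep mimics job latency.
    time.sleep(0.05)
    return x / y


# Decorator-style call: the caller blocks until the result is available.
blocking_result = divide(6, 3)

# Executor-style calls: submit() returns a future immediately, so up to
# max_workers calls overlap; result() blocks only when the value is needed.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(divide, a, 2) for a in [3, 5, 7, 9]]
    parallel_results = [future.result() for future in futures]

print(blocking_result)
print(parallel_results)
```

With a real `RemoteExecutor`, each `submit` call would start a training job rather than a thread, so the fixed cost per call is much higher than in this sketch.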
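If you use the @remote decorator method to invoke a function, the training job waits for the function to complete before starting a new task. However, if you use the `RemoteExecutor` API, you can run more than one job in parallel. The following sections show both ways of invoking a function.

## Use an @remote decorator to invoke a function

You can use the @remote decorator to annotate a function. SageMaker AI transforms the code inside the decorator into a SageMaker training job. The training job then invokes the function inside the decorator and waits for the job to complete. The following code example shows how to import the required libraries, start a SageMaker AI instance, and annotate a matrix multiplication with the @remote decorator.

```
from sagemaker.remote_function import remote
import numpy as np

@remote(instance_type="ml.m5.large")
def matrix_multiply(a, b):
    return np.matmul(a, b)

a = np.array([[1, 0],
              [0, 1]])
b = np.array([1, 2])

assert (matrix_multiply(a, b) == np.array([1,2])).all()
```

The decorator is defined as follows.

```
def remote(
    *,
    **kwarg):
        ...
```

When you invoke a decorated function, the SageMaker Python SDK loads any exception raised by an error into local memory. In the following code example, the first call to the divide function completes successfully and the result is loaded into local memory. In the second call to the divide function, the code returns an error, and this error is loaded into local memory.

```
from sagemaker.remote_function import remote
import pytest

@remote()
def divide(a, b):
    return a/b

# the underlying job is completed successfully
# and the function return is loaded
assert divide(10, 5) == 2

# the underlying job fails with "AlgorithmError"
# and the function exception is loaded into local memory
with pytest.raises(ZeroDivisionError):
    divide(10, 0)
```

**Note**
The decorated function runs as a remote job. If the thread is interrupted, the underlying job is not stopped.

### How to change the value of a local variable

The decorated function runs on a remote machine. Changing a non-local variable or an input argument inside a decorated function does not change the local value.

In the following code example, a list and a dictionary are modified inside the decorated function. The local values do not change when the decorated function is invoked.

```
a = []

@remote
def func():
    a.append(1)

# when func is invoked, a in the local memory is not modified
func()
func()

# a stays as []

a = {}

@remote
def func(a):
    # append new values to the input dictionary
    a["key-2"] = "value-2"

a = {"key": "value"}
func(a)

# a stays as {"key": "value"}
```

To change the value of a local variable from inside a decorated function, return the variable from the function. The following code example shows that the value of a local variable changes when it is returned from the function.

```
a = {"key-1": "value-1"}

@remote
def func(a):
    a["key-2"] = "value-2"
    return a

a = func(a)
# a is now {"key-1": "value-1", "key-2": "value-2"}
```

### Data serialization and deserialization

When you invoke a remote function, SageMaker AI automatically serializes your function arguments during the input and output stages. Function arguments and return values are serialized using [cloudpickle](https://github.com/cloudpipe/cloudpickle). SageMaker AI supports serializing the following Python objects and functions.
+ Built-in Python objects, including dicts, lists, floats, ints, strings, boolean values, and tuples
+ Numpy arrays
+ Pandas DataFrames
+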
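  Scikit-learn datasets and estimators
+ PyTorch models
+ TensorFlow models
+ The Booster class for XGBoost

The following can be used with some limitations.
+ Dask DataFrames
+ The XGBoost DMatrix class
+ TensorFlow datasets and subclasses
+ PyTorch models

The following sections contain best practices for using the previous Python classes that have some limitations in your remote function, along with information about where SageMaker AI stores your serialized data and how to manage access to it.

#### Best practices for Python classes with limited support for remote data serialization

You can use the Python classes listed in this section with some limitations. The next sections discuss best practices for how to use the following Python classes.
+ [Dask](https://www.dask.org/) DataFrames
+ The XGBoost DMatrix class
+ TensorFlow datasets and subclasses
+ PyTorch models

##### Best practices for Dask

[Dask](https://www.dask.org/) is an open-source library used for parallel computing in Python. This section shows the following.
+ How to pass a Dask DataFrame into your remote function
+ How to convert summary statistics from a Dask DataFrame into a Pandas DataFrame

##### How to pass a Dask DataFrame into your remote function

[Dask DataFrames](https://docs.dask.org/en/latest/dataframe.html) are often used to process large datasets because they can hold datasets that require more memory than is available. This is because a Dask DataFrame does not load your local data into memory. If you pass a Dask DataFrame as a function argument to your remote function, Dask may pass a reference to the data in your local disk or cloud storage instead of the data itself. The following code shows an example of passing a Dask DataFrame into a remote function, where the function ends up operating on an empty DataFrame.

```
#Do not pass a Dask DataFrame to your remote function as follows
def clean(df: dask.DataFrame ):
    cleaned = df[] \
    ...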
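```

Dask loads the data from a Dask DataFrame into memory only when you use the DataFrame. If you want to use a Dask DataFrame inside a remote function, provide the path to the data. Dask then reads the dataset directly from the data path that you specify when the code runs.

The following code example shows how to use a Dask DataFrame inside the remote function `clean`. In the code example, `raw_data_path` is passed to `clean` instead of the Dask DataFrame. When the code runs, the dataset is read directly from the Amazon S3 bucket location specified in `raw_data_path`. The `persist` function then keeps the dataset in memory to facilitate the subsequent `random_split` function, and the dataset is written back to the output data path in an S3 bucket using Dask DataFrame API functions.

```
import os

import dask.dataframe as dd
from sagemaker.remote_function import remote

@remote(
   instance_type='ml.m5.24xlarge',
   volume_size=300,
   keep_alive_period_in_seconds=600)
#pass the data path to your remote function rather than the Dask DataFrame itself
def clean(raw_data_path: str, output_data_path: str, split_ratio: list[float]):
    df = dd.read_parquet(raw_data_path) #pass the path to your DataFrame
    cleaned = df[(df.column_a >= 1) & (df.column_a < 5)]\
        .drop(['column_b', 'column_c'], axis=1)\
        .persist() #keep the data in memory to facilitate the following random_split operation

    train_df, test_df = cleaned.random_split(split_ratio, random_state=10)

    train_df.to_parquet(os.path.join(output_data_path, 'train'))
    test_df.to_parquet(os.path.join(output_data_path, 'test'))

clean("s3://amzn-s3-demo-bucket/raw/", "s3://amzn-s3-demo-bucket/cleaned/", split_ratio=[0.7, 0.3])
```

##### How to convert summary statistics from a Dask DataFrame into a Pandas DataFrame

Summary statistics from a Dask DataFrame can be converted into a Pandas DataFrame by invoking the `compute` method, as shown in the following example code. In the example, the S3 bucket contains a large Dask DataFrame that cannot fit into memory or into a Pandas DataFrame. The remote function scans the dataset, computes the output statistics from `describe`, and returns them as a Pandas DataFrame.

```
executor = RemoteExecutor(
    instance_type='ml.m5.24xlarge',
    volume_size=300,
    keep_alive_period_in_seconds=600)

future = executor.submit(lambda: dd.read_parquet("s3://amzn-s3-demo-bucket/raw/").describe().compute())

future.result()
```

##### Best practices for the XGBoost DMatrix class

DMatrix is an internal data structure that XGBoost uses to load data. A DMatrix object can't be pickled in order to move easily between compute sessions. Passing DMatrix instances directly fails with a `SerializationError`.

##### How to pass a data object to your remote function and train with XGBoost

To convert a Pandas DataFrame into a DMatrix instance and use it for training in your remote function, pass the Pandas DataFrame directly to the remote function, as shown in the following code example.

```
import xgboost as xgb

@remote
def train(df,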
          params):
    #Convert a pandas dataframe into a DMatrix and use it for training
    dtrain = xgb.DMatrix(df)
    return xgb.train(params, dtrain)
```

##### Best practices for TensorFlow datasets and subclasses

TensorFlow datasets and subclasses are internal objects that TensorFlow uses to load data during training. TensorFlow datasets and subclasses can't be pickled in order to move easily between compute sessions. Passing TensorFlow datasets or subclasses directly fails with a `SerializationError`. Use the TensorFlow I/O APIs to load data from storage, as shown in the following code example.

```
import tensorflow as tf
import tensorflow_io as tfio

@remote
def train(data_path: str, params):

    dataset = tf.data.TextLineDataset(tf.data.Dataset.list_files(f"{data_path}/*.txt"))
    ...

train("s3://amzn-s3-demo-bucket/data", {})
```

##### Best practices for PyTorch models

PyTorch models are serializable and can be passed between your local environment and the remote function. If your local environment and remote environment have different device types, such as GPUs and CPUs, you cannot return a trained model to your local environment. For example, if the following code is developed in a local environment without GPUs but run on an instance with GPUs, returning the trained model directly leads to a `DeserializationError`.

```
# Do not return a model trained on GPUs to a CPU-only environment as follows

@remote(instance_type='ml.g4dn.xlarge')
def train(...):
    if torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu") # a device without GPU capabilities

    model = Net().to(device)

    # train the model
    ...

    return model

model = train(...) #returns a DeserializationError if run on a device with GPU
```

To return a model trained in a GPU environment to an environment that has only CPU capabilities, use the PyTorch model I/O APIs directly, as shown in the following code example.

```
import os

import s3fs
import torch

model_path = "s3://amzn-s3-demo-bucket/folder/"

@remote(instance_type='ml.g4dn.xlarge')
def train(...):
    if torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")

    model = Net().to(device)

    # train the model
    ...

    fs = s3fs.S3FileSystem()
    with fs.open(os.path.join(model_path, 'model.pt'), 'wb') as file:
        torch.save(model.state_dict(), file) #this writes the model in a device-agnostic way (CPU vs GPU)

train(...)
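
#use the model to train on either CPUs or GPUs
model = Net()
fs = s3fs.S3FileSystem()
with fs.open(os.path.join(model_path, 'model.pt'), 'rb') as file:
    model.load_state_dict(torch.load(file, map_location=torch.device('cpu')))
```

#### Where SageMaker AI stores your serialized data

When you invoke a remote function, SageMaker AI automatically serializes your function arguments and return values during the input and output stages. This serialized data is stored under a root directory in your S3 bucket. You specify the root directory, `<s3_root_uri>`, in a configuration file. The parameter `job_name` is automatically generated for you.

Under the root directory, SageMaker AI creates a `<job_name>` folder, which holds your current working directory, the serialized function, the arguments for your serialized function, the results, and any exceptions that arose from invoking the serialized function.

Under `<job_name>`, the directory `workdir` contains a zipped archive of your current working directory. The zipped archive includes any Python files in your working directory and the `requirements.txt` file, which specifies any dependencies needed to run your remote function.

The following is an example of the folder structure under an S3 bucket that you specify in your configuration file.

```
<s3_root_uri>/ # specified by s3_root_uri or S3RootUri
    <job_name>/ #automatically generated for you
        workdir/workspace.zip # archive of the current working directory (workdir)
        function/ # serialized function
        arguments/ # serialized function arguments
        results/ # returned output from the serialized function including the model
        exception/ # any exceptions from invoking the serialized function
```

The root directory that you specify in your S3 bucket is not meant for long-term storage. The serialized data is tightly tied to the Python version and machine learning (ML) framework version that were used during serialization. If you upgrade the Python version or the ML framework, you may not be able to use your serialized data. Instead, do the following.
+ Store your model and model artifacts in a format that is agnostic to your Python version and ML framework.
+ If you upgrade your Python or ML framework, access your model results from your long-term storage.

**Important**
To delete your serialized data after a specified amount of time, set a [lifecycle configuration](https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html) on your S3 bucket.

**Note**
Files that are serialized with the Python [pickle](https://docs.python.org/3/library/pickle.html) module can be less portable than other data formats, including CSV, Parquet, and JSON. Be wary of loading pickled files from unknown sources.

For more information about what to include in a configuration file for a remote function, see [Configuration File](https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator-config.html).

#### Access to serialized data

Administrators can provide settings for your serialized data, including its location and any encryption settings, in a configuration file. By default, the serialized data is encrypted with an AWS Key Management Service (AWS KMS) key. Administrators can also restrict access to the root directory that you specify in your configuration file with a [bucket policy](https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-bucket-policies.html). The configuration file can be shared and used across projects and jobs. For more information, see [Configuration File](https://docs.aws.amazon.com/sagemaker/latest/dg/train-remote-decorator-config.html).

## Use the `RemoteExecutor` API to invoke a function

You can use the `RemoteExecutor` API to invoke a function. The SageMaker AI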
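Python SDK transforms the code inside the `RemoteExecutor` call into a SageMaker AI training job. The training job then invokes the function as an asynchronous operation and returns a future. If you use the `RemoteExecutor` API, you can run more than one training job in parallel. For more information about futures in Python, see [Futures](https://docs.python.org/3/library/asyncio-future.html).

The following code example shows how to import the required libraries, define a function, start a SageMaker AI instance, and use the API to submit a request to run `2` jobs in parallel.

```
import numpy as np
from sagemaker.remote_function import RemoteExecutor

def matrix_multiply(a, b):
    return np.matmul(a, b)

a = np.array([[1, 0],
              [0, 1]])
b = np.array([1, 2])

with RemoteExecutor(max_parallel_jobs=2, instance_type="ml.m5.large") as e:
    future = e.submit(matrix_multiply, a, b)

assert (future.result() == np.array([1,2])).all()
```

The `RemoteExecutor` class is an implementation of the [concurrent.futures.Executor](https://docs.python.org/3/library/concurrent.futures.html) library.

The following code example shows how to define a function and call it using the `RemoteExecutor` API. In this example, the `RemoteExecutor` submits `4` jobs in total, but only `2` in parallel. The last two jobs reuse the clusters with minimal overhead.

```
from sagemaker.remote_function.client import RemoteExecutor

def divide(a, b):
    return a/b

with RemoteExecutor(max_parallel_jobs=2, keep_alive_period_in_seconds=60) as e:
    futures = [e.submit(divide, a, 2) for a in [3, 5, 7, 9]]

for future in futures:
    print(future.result())
```

The `max_parallel_jobs` parameter only serves as a rate-limiting mechanism and does not optimize compute resource allocation. In the previous code example, `RemoteExecutor` does not reserve compute resources for the two parallel jobs before any jobs are submitted. For more information about `max_parallel_jobs` or other parameters for the @remote decorator, see [Remote function classes and methods specification](https://sagemaker.readthedocs.io/en/stable/remote_function/sagemaker.remote_function.html).

### Future class for the `RemoteExecutor` API

The Future class is a public class that represents the return value from the training job when it is invoked asynchronously. The Future class implements the [concurrent.futures.Future](https://docs.python.org/3/library/concurrent.futures.html) class. This class can be used to perform operations on the underlying job and to load data into memory.

# Configuration file

The Amazon SageMaker Python SDK supports setting default values for AWS infrastructure primitive types. After administrators configure these defaults, they are automatically passed when the SageMaker Python SDK calls supported APIs. The arguments for the decorator function can be put inside configuration files. This lets you separate settings that are related to the infrastructure from the code base. For more information about parameters for the remote function and methods, see [Remote function classes and methods specification](https://sagemaker.readthedocs.io/en/stable/remote_function/sagemaker.remote_function.html).

You can set infrastructure settings for the network configuration, IAM roles, the Amazon S3 folder for input and output data, and tags inside the configuration file. You can use the configuration file when you use the @remote decorator or the `RemoteExecutor` API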
to invoke a function.

The following is an example configuration file that defines the dependencies, resources, and other arguments. This example configuration file is used to invoke a function that is initiated with either the @remote decorator or the `RemoteExecutor` API.

```
SchemaVersion: '1.0'
SageMaker:
  PythonSDK:
    Modules:
      RemoteFunction:
        Dependencies: 'path/to/requirements.txt'
        EnableInterContainerTrafficEncryption: true
        EnvironmentVariables: {'EnvVarKey': 'EnvVarValue'}
        ImageUri: '366666666666.dkr.ecr.us-west-2.amazonaws.com/my-image:latest'
        IncludeLocalWorkDir: true
        CustomFileFilter:
          IgnoreNamePatterns:
          - "*.ipynb"
          - "data"
        InstanceType: 'ml.m5.large'
        JobCondaEnvironment: 'your_conda_env'
        PreExecutionCommands:
          - 'command_1'
          - 'command_2'
        PreExecutionScript: 'path/to/script.sh'
        RoleArn: 'arn:aws:iam::366666666666:role/MyRole'
        S3KmsKeyId: 'yourkmskeyid'
        S3RootUri: 's3://amzn-s3-demo-bucket/my-project'
        VpcConfig:
          SecurityGroupIds:
          - 'sg123'
          Subnets:
          - 'subnet-1234'
        Tags: [{'Key': 'yourTagKey', 'Value':'yourTagValue'}]
        VolumeKmsKeyId: 'yourkmskeyid'
```

The @remote decorator and `RemoteExecutor` look for `Dependencies` in the following configuration files:
+ An admin-defined configuration file.
+ A user-defined configuration file.

The default locations for these configuration files depend on, and are relative to, your environment. The following code example returns the default locations of your admin and user configuration files. These commands must be run in the same environment where you use the SageMaker Python SDK.

```
import os
from platformdirs import site_config_dir, user_config_dir

#Prints the location of the admin config file
print(os.path.join(site_config_dir("sagemaker"), "config.yaml"))

#Prints the location of the user config file
print(os.path.join(user_config_dir("sagemaker"), "config.yaml"))
```

You can override the default locations of these files by setting the `SAGEMAKER_ADMIN_CONFIG_OVERRIDE` and `SAGEMAKER_USER_CONFIG_OVERRIDE` environment variables for the admin-defined and user-defined configuration file paths, respectively.

If the same key exists in both the admin-defined and user-defined configuration files, the value in the user-defined file is used.

# Customize your runtime environment

You can customize your runtime environment to use your preferred local integrated development environments (IDEs), SageMaker notebooks, or SageMaker Studio Classic notebooks to write your ML code. SageMaker AI helps package and submit your functions and their dependencies as a SageMaker training job. This gives you access to the capacity of the SageMaker training server to run your training jobs.

Both the remote decorator and the `RemoteExecutor` methods for invoking a function let you define and customize the runtime environment. You can use either a `requirements.txt` file or a conda environment YAML file.

The following code example shows how to customize a runtime environment with a conda environment YAML file and with a `requirements.txt` file.

```
import numpy as np
from sagemaker.remote_function import remote

# specify a conda environment inside a yaml file
@remote(instance_type="ml.m5.large",
        image_uri = "my_base_python:latest",
        dependencies = "./environment.yml")
def matrix_multiply(a, b):
    return np.matmul(a, b)

# use a requirements.txt file to import dependencies
@remote(instance_type="ml.m5.large",
        image_uri = "my_base_python:latest",
        dependencies = './requirements.txt')
def matrix_multiply(a, b):
    return np.matmul(a, b)
```

Alternatively, you can set `dependencies` to `auto_capture` to let the SageMaker Python SDK capture the installed dependencies in the active conda environment. The following are required for `auto_capture` to work reliably:
+ You must have an active conda environment. We recommend not using the `base` conda environment for remote jobs so that you can reduce potential dependency conflicts. Not using the `base` conda environment also allows for faster environment setup in the remote job.
+ You must not have any dependencies installed using pip with a value for the parameter `--extra-index-url`.
+ You must not have any dependency conflicts between packages installed with conda and packages installed with pip in the local development environment.
+ Your local development environment must not contain operating system-specific dependencies that are not compatible with Linux.

If `auto_capture` does not work, we recommend that you pass in your dependencies as a `requirements.txt` file or a conda `environment.yml` file, as described in the first coding example in this section.

# Container image compatibility

The following table shows a list of SageMaker training images that are compatible with the @remote decorator.

| Name | Python version | Image URI - CPU | Image URI - GPU |
| --- | --- | --- | --- |
| Data Science | 3.7(py37) | For SageMaker Studio Classic notebooks only. The Python SDK automatically selects the image URI when the image is used as the SageMaker Studio Classic notebook kernel image. | For SageMaker Studio Classic notebooks only. The Python SDK automatically selects the image URI when the image is used as the SageMaker Studio Classic notebook kernel image. |
| Data Science 2.0 | 3.8(py38) | For SageMaker Studio Classic notebooks only. The Python SDK automatically selects the image URI when the image is used as the SageMaker Studio Classic notebook kernel image. | For SageMaker Studio Classic notebooks only. The Python SDK automatically selects the image URI when the image is used as the SageMaker Studio Classic notebook kernel image. |
| Data Science 3.0 | 3.10(py310) | For SageMaker Studio Classic notebooks only. The Python SDK automatically selects the image URI when the image is used as the SageMaker Studio Classic notebook kernel image. | For SageMaker Studio Classic notebooks only. The Python SDK automatically selects the image URI when the image is used as the SageMaker Studio Classic notebook kernel image. |
| Base Python 2.0 | 3.8(py38) | The Python SDK selects this image when it detects that the development environment is using the Python 3.8 runtime. Otherwise, the Python SDK automatically selects the image URI when the image is used as the SageMaker Studio Classic notebook kernel image. | For SageMaker Studio Classic notebooks only. The Python SDK automatically selects the image URI when the image is used as the SageMaker Studio Classic notebook kernel image. |
| Base Python 3.0 | 3.10(py310) | The Python SDK selects this image when it detects that the development environment is using the Python 3.8 runtime. Otherwise, the Python SDK automatically selects the image URI when the image is used as the SageMaker Studio Classic notebook kernel image. | For SageMaker Studio Classic notebooks only. The Python SDK automatically selects the image URI when the image is used as the SageMaker Studio Classic notebook kernel image. |
| DLC-TensorFlow 2.12.0 for SageMaker training | 3.10(py310) | 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.12.0-cpu-py310-ubuntu20.04-sagemaker | 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.12.0-gpu-py310-cu118-ubuntu20.04-sagemaker |
| DLC-TensorFlow 2.11.0 for SageMaker training | 3.9(py39) | 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.11.0-cpu-py39-ubuntu20.04-sagemaker | 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.11.0-gpu-py39-cu112-ubuntu20.04-sagemaker |
| DLC-TensorFlow 2.10.1 for SageMaker training | 3.9(py39) | 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.10.1-cpu-py39-ubuntu20.04-sagemaker | 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.10.1-gpu-py39-cu112-ubuntu20.04-sagemaker |
| DLC-TensorFlow 2.9.2 for SageMaker training | 3.9(py39) | 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.9.2-cpu-py39-ubuntu20.04-sagemaker | 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.9.2-gpu-py39-cu112-ubuntu20.04-sagemaker |
| DLC-TensorFlow 2.8.3 for SageMaker training | 3.9(py39) | 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.8.3-cpu-py39-ubuntu20.04-sagemaker | 763104351884.dkr.ecr.<region>.amazonaws.com/tensorflow-training:2.8.3-gpu-py39-cu112-ubuntu20.04-sagemaker |
| DLC-PyTorch 2.0.0 for SageMaker training | 3.10(py310) | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-cpu-py310-ubuntu20.04-sagemaker | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:2.0.0-gpu-py310-cu118-ubuntu20.04-sagemaker |
| DLC-PyTorch 1.13.1 for SageMaker training | 3.9(py39) | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-cpu-py39-ubuntu20.04-sagemaker | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.13.1-gpu-py39-cu117-ubuntu20.04-sagemaker |
| DLC-PyTorch 1.12.1 for SageMaker training | 3.8(py38) | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-cpu-py38-ubuntu20.04-sagemaker | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker |
| DLC-PyTorch 1.11.0 for SageMaker training | 3.8(py38) | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-cpu-py38-ubuntu20.04-sagemaker | 763104351884.dkr.ecr.<region>.amazonaws.com/pytorch-training:1.11.0-gpu-py38-cu113-ubuntu20.04-sagemaker |
| DLC-MXNet 1.9.0 for SageMaker training | 3.8(py38) | 763104351884.dkr.ecr.<region>.amazonaws.com/mxnet-training:1.9.0-cpu-py38-ubuntu20.04-sagemaker | 763104351884.dkr.ecr.<region>.amazonaws.com/mxnet-training:1.9.0-gpu-py38-cu112-ubuntu20.04-sagemaker |

**Note**
To run jobs locally using AWS Deep Learning Containers (DLC) images, use the image URIs found in the [DLC documentation](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). The DLC images do not support the `auto_capture` value for dependencies.

Jobs with [SageMaker AI Distribution in SageMaker Studio](https://github.com/aws/sagemaker-distribution#amazon-sagemaker-studio) run in a container as a non-root user named `sagemaker-user`. This user needs full permission to access `/opt/ml` and `/tmp`. Grant this permission by adding `sudo chmod -R 777 /opt/ml /tmp` to the `pre_execution_commands` list, as shown in the following snippet:

```
@remote(pre_execution_commands=["sudo chmod -R 777 /opt/ml /tmp"])
def func():
    pass
```

You can also run remote functions with your custom images. For compatibility with remote functions, custom images should be built with Python version 3.7.x-3.10.x. The following is a minimal Dockerfile example that shows how to use a Docker image with Python 3.10.

```
FROM python:3.10

#...
# Rest of the Dockerfile
```

To create `conda` environments in your image and use them to run jobs, set the environment variable `SAGEMAKER_JOB_CONDA_ENV` to the `conda` environment name. If your image has the `SAGEMAKER_JOB_CONDA_ENV` value set, the remote function cannot create a new conda environment during the training job runtime. Refer to the following Dockerfile example, which uses a `conda` environment with Python version 3.10.

```
FROM continuumio/miniconda3:4.12.0

ENV SHELL=/bin/bash \
    CONDA_DIR=/opt/conda \
    SAGEMAKER_JOB_CONDA_ENV=sagemaker-job-env

RUN conda create -n $SAGEMAKER_JOB_CONDA_ENV \
   && conda install -n $SAGEMAKER_JOB_CONDA_ENV python=3.10 -y \
   && conda clean --all -f -y
```

For SageMaker AI to use [mamba](https://mamba.readthedocs.io/en/latest/user_guide/mamba.html) to manage your Python virtual environment in the container image, install the [mamba toolkit from miniforge](https://github.com/conda-forge/miniforge). To use mamba, add the following code example to your Dockerfile. SageMaker AI then detects the `mamba` availability at runtime and uses it instead of `conda`.

```
#Mamba Installation
RUN curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh" \
    && bash Mambaforge-Linux-x86_64.sh -b -p "/opt/conda"  \
    && /opt/conda/bin/conda init bash
```

When you use a remote function, a custom conda channel on an Amazon S3 bucket is not compatible with mamba. If you choose to use mamba, make sure that you are not using a custom conda channel on Amazon S3. For more information, see the **Prerequisites** section under **Custom conda repository using Amazon S3**.

The following is a complete Dockerfile example that shows how to create a compatible Docker image.

```
FROM python:3.10

RUN apt-get update -y \
    # Needed for awscli to work
    # See: https://github.com/aws/aws-cli/issues/1957#issuecomment-687455928
    && apt-get install -y groff unzip curl \
    && pip install --upgrade \
        'boto3>1.0,<2' \
        'awscli>1.0,<2' \
        'ipykernel>6.0.0,<7.0.0' \
    #Use ipykernel with --sys-prefix flag, so that the absolute path to
    #/usr/local/share/jupyter/kernels/python3/kernel.json python is used
    # in kernelspec.json file
    && python -m ipykernel install --sys-prefix

#Install Mamba
RUN curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh" \
    && bash Mambaforge-Linux-x86_64.sh -b -p "/opt/conda"  \
    && /opt/conda/bin/conda init bash

#cleanup
RUN apt-get clean \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf ${HOME}/.cache/pip \
    && \
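    rm Mambaforge-Linux-x86_64.sh

ENV SHELL=/bin/bash \
    PATH=$PATH:/opt/conda/bin
```

The image that results from running the previous Dockerfile example can also be used as a [SageMaker Studio Classic kernel image](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-byoi.html).

# Logging parameters and metrics with Amazon SageMaker Experiments

This guide shows how to log parameters and metrics with Amazon SageMaker Experiments. A SageMaker AI experiment consists of runs, and each run consists of all the inputs, parameters, configurations, and results for a single model training interaction.

You can log parameters and metrics from a remote function using either the @remote decorator or the `RemoteExecutor` API.

To log parameters and metrics in a remote function, choose one of the following methods:
+ Instantiate a SageMaker AI experiment run inside a remote function using `Run` from the SageMaker Experiments library. For more information, see [Create an Amazon SageMaker AI Experiment](https://docs.aws.amazon.com/sagemaker/latest/dg/experiments-create.html).
+ Use the `load_run` function from the SageMaker AI Experiments library inside a remote function. This loads a `Run` instance that is declared outside of the remote function.

The following sections show how to create and track lineage with SageMaker AI experiment runs by using the previously listed methods. The sections also describe cases that are not supported by SageMaker training.

## Use the @remote decorator to integrate with SageMaker Experiments

You can either instantiate an experiment in SageMaker AI, or load a current SageMaker AI experiment from inside a remote function. The following sections show how to use either method.

### Create an experiment with SageMaker Experiments

You can create an experiment run in a SageMaker AI experiment. To do this, you pass your experiment name, run name, and other parameters into your remote function.

The following code example passes in the name of your experiment, the name of the run, and the parameters to log during each run. The parameters `param_1` and `param_2` are logged over time inside a training loop. Common parameters may include batch size or epochs. In this example, the metrics `metric_a` and `metric_b` are logged for a run over time inside a training loop. Other common metrics may include `accuracy` or `loss`.

```
from sagemaker.remote_function import remote
from sagemaker.experiments.run import Run

# Define your remote function
@remote
def train(value_1, value_2, exp_name, run_name):
    ...
    ...
    #Creates the experiment
    with Run(
        experiment_name=exp_name,
        run_name=run_name,
    ) as run:
        ...
        #Define values for the parameters to log
        run.log_parameter("param_1", value_1)
        run.log_parameter("param_2", value_2)
        ...
        #Define metrics to log
        run.log_metric("metric_a", 0.5)
        run.log_metric("metric_b", 0.1)


# Invoke your remote function
train(1.0, 2.0, "my-exp-name", "my-run-name")
```

### Load current SageMaker Experiments with a job initiated by the @remote decorator

Use the `load_run()` function from the SageMaker Experiments library to load the current run object from the run context. You can also use the `load_run()` function within your remote function. Load the run object that was initialized locally by the `with` statement on the run object, as shown in the following code example.

```
from sagemaker.remote_function import remote
from sagemaker.experiments.run import Run, load_run

# Define your remote function
@remote
def train(value_1, value_2):
    ...
    ...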
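    with load_run() as run:
        run.log_metric("metric_a", value_1)
        run.log_metric("metric_b", value_2)


# Invoke your remote function
with Run(
    experiment_name="my-exp-name",
    run_name="my-run-name",
) as run:
    train(0.5, 1.0)
```

## Load a current experiment run within a job initiated with the `RemoteExecutor` API

You can also load a current SageMaker AI experiment run if your jobs were initiated with the `RemoteExecutor` API. The following code example shows how to use the `RemoteExecutor` API with the SageMaker Experiments `load_run` function. You do this to load the current SageMaker AI experiment run and capture metrics in the job submitted by `RemoteExecutor`.

```
from sagemaker.remote_function import RemoteExecutor
from sagemaker.experiments.run import Run, load_run

def square(x):
    # Load the run that is active where the job was submitted
    with load_run() as run:
        result = x * x
        run.log_metric("result", result)
    return result


with RemoteExecutor(
    max_parallel_jobs=2,
    instance_type="ml.m5.large"
) as e:
    with Run(
        experiment_name="my-exp-name",
        run_name="my-run-name",
    ):
        future_1 = e.submit(square, 2)
```

## Unsupported uses for SageMaker Experiments while annotating your code with an @remote decorator

SageMaker AI does not support passing a `Run` type object to an @remote function or using global `Run` objects. The following examples show code that throws a `SerializationError`.

The following code example attempts to pass a `Run` type object to an @remote function, and it generates an error.

```
@remote
def func(run: Run):
    run.log_metric("metric_a", 1.0)

with Run(...) as run:
    func(run)  # ---> SerializationError caused by NotImplementedError
```

The following code example attempts to use a global `run` object instantiated outside of the remote function. In the code example, the `train()` function is defined inside the `with Run` context and references the global run object from within. When `train()` is called, it generates an error.

```
with Run(...) as run:
    @remote
    def train(metric_1, value_1, metric_2, value_2):
        run.log_parameter(metric_1, value_1)
        run.log_parameter(metric_2, value_2)

    train("p1", 1.0, "p2", 0.5)  # ---> SerializationError caused by NotImplementedError
```

# Using modular code with the @remote decorator

You can organize your code into modules for ease of workspace management during development and still use the @remote decorator to invoke a function. You can also replicate the local modules from your development environment to the remote job environment. To do so, set the parameter `include_local_workdir` to `True`, as shown in the following code example.

```
@remote(
  include_local_workdir=True,
)
```

**Note**
The @remote decorator and parameter must appear in the main file, rather than in any of the dependent files.

When `include_local_workdir` is set to `True`, SageMaker AI packages all of the Python scripts while maintaining the directory structure of the process's current directory. It also makes the dependencies available in the job's working directory.

For example, assume that your Python script that processes the MNIST dataset is divided into a `main.py` script and a dependent `pytorch_mnist.py` script. `main.py` calls the dependent script. The `main.py` script also contains code to import the dependency, as shown.

```
from mnist_impl.pytorch_mnist import ...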
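```

The `main.py` file must also contain the `@remote` decorator, and it must set the `include_local_workdir` parameter to `True`.

By default, the `include_local_workdir` parameter includes all the Python scripts in the directory. You can customize which files are uploaded to the job by using this parameter together with the `custom_file_filter` parameter. You can pass either a function that filters the job dependencies to be uploaded to S3, or a `CustomFileFilter` object that specifies the local directories and files to ignore in the remote function. You can use `custom_file_filter` only if `include_local_workdir` is set to `True`; otherwise the parameter is ignored.

The following example uses `CustomFileFilter` to ignore all notebook files, and any folders or files named `data`, when uploading files to S3.

```
@remote(
   include_local_workdir=True,
   custom_file_filter=CustomFileFilter(
      ignore_name_patterns=[ # files or directories to ignore
        "*.ipynb", # all notebook files
        "data", # folder or file named data
      ]
   )
)
```

The following example demonstrates how to package an entire workspace.

```
@remote(
   include_local_workdir=True,
   custom_file_filter=CustomFileFilter(
      ignore_name_patterns=[] # package whole workspace
   )
)
```

The following example shows how to use a function to filter files.

```
from typing import List

def my_filter(path: str, files: List[str]) -> List[str]:
    # Return the names in this directory that should not be uploaded
    to_ignore = []
    for file in files:
        if file.endswith(".txt") or file.endswith(".ipynb"):
            to_ignore.append(file)
    return to_ignore

@remote(
   include_local_workdir=True,
   custom_file_filter=my_filter
)
```

## Best practices for structuring your working directory

The following best practices suggest how to organize your directory structure when you use the `@remote` decorator in modular code.
+ Put the @remote decorator in a file that resides at the root level directory of the workspace.
+ Structure the local modules at the root level.

The following example shows the recommended directory structure. In this example structure, the `main.py` script is located at the root level directory.

```
.
├── config.yaml
├── data/
├── main.py <----------------- @remote used here
├── mnist_impl
│   ├── __pycache__/
│   │   └── pytorch_mnist.cpython-310.pyc
│   ├── pytorch_mnist.py <-------- dependency of main.py
├── requirements.txt
```

The following example shows a directory structure that results in inconsistent behavior when you annotate your code with an @remote decorator. In this example structure, the `main.py` script that contains the @remote decorator is **not** located at the root level directory. The following structure is **NOT** recommended.

```
.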
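├── config.yaml
├── entrypoint
│   ├── data
│   └── main.py <----------------- @remote used here
├── mnist_impl
│   ├── __pycache__
│   │   └── pytorch_mnist.cpython-310.pyc
│   └── pytorch_mnist.py <-------- dependency of main.py
├── requirements.txt
```

# Private repository for runtime dependencies

You can use pre-execution commands or a script to configure a dependency manager such as pip or conda in your job environment. To achieve network isolation, use either of these options to redirect your dependency manager to your private repositories and run remote functions within a VPC. The pre-execution commands or script run before your remote function runs. You can define them with the @remote decorator, the `RemoteExecutor` API, or within a configuration file.

The following sections show how to access a private Python Package Index (PyPI) repository managed with AWS CodeArtifact. The sections also show how to access a custom conda channel hosted on Amazon Simple Storage Service (Amazon S3).

## How to use a custom PyPI repository managed with AWS CodeArtifact

To use CodeArtifact to manage a custom PyPI repository, the following prerequisites are required:
+ Your private PyPI repository should already have been created. You can use AWS CodeArtifact to create and manage your private package repositories. To learn more about CodeArtifact, see the [CodeArtifact User Guide](https://docs.aws.amazon.com/codeartifact/latest/ug/welcome.html).
+ Your VPC should have access to your CodeArtifact repository. To allow a connection from your VPC to your CodeArtifact repository, you must do the following:
  + [Create VPC endpoints for CodeArtifact](https://docs.aws.amazon.com/codeartifact/latest/ug/create-vpc-endpoints.html).
  + [Create an Amazon S3 gateway endpoint](https://docs.aws.amazon.com/codeartifact/latest/ug/create-s3-gateway-endpoint.html) for your VPC, which allows CodeArtifact to store package assets.

The following pre-execution command example shows how to configure pip in the SageMaker AI training job to point to your CodeArtifact repository. For more information, see [Configure and use pip with CodeArtifact](https://docs.aws.amazon.com/codeartifact/latest/ug/python-configure-pip.html).

```
import numpy as np
from sagemaker.remote_function import remote

# use a requirements.txt file to import dependencies
@remote(
    instance_type="ml.m5.large",
    image_uri = "my_base_python:latest",
    dependencies = './requirements.txt',
    pre_execution_commands=[
        "aws codeartifact login --tool pip --domain my-org --domain-owner <000000000000> --repository my-codeartifact-python-repo --endpoint-url https://vpce-xxxxx.api.codeartifact.us-east-1.vpce.amazonaws.com"
    ]
)
def matrix_multiply(a, b):
    return np.matmul(a, b)
```

## How to use a custom conda channel hosted on Amazon S3

To use Amazon S3 to manage a custom conda repository, the following prerequisites are required:
+ Your private conda channel must already be set up in your Amazon S3 bucket, and all dependent packages must be indexed and uploaded to your Amazon S3 bucket. For instructions on how to index your conda packages, see [Creating custom channels](https://conda.io/projects/conda/en/latest/user-guide/tasks/create-custom-channels.html).
+ Your VPC should have access to the Amazon S3 bucket. For more information, see [Endpoints for Amazon S3](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-endpoints-s3.html).
+ The base conda environment in your job image should have `boto3` installed. To check your environment, enter the following in your Anaconda prompt and confirm that `boto3` appears in the resulting list.

  ```
  conda list -n base
  ```
+ Your job image should be installed with conda, not [mamba](https://mamba.readthedocs.io/en/latest/installation.html). To check your environment, make sure that the previous command does not return `mamba`.

The following pre-execution commands example shows how to configure conda in the SageMaker training job to point to your private channel on Amazon S3. The pre-execution commands remove the defaults channel and add the custom channels to the `.condarc` conda configuration file.

```
import numpy as np
from sagemaker.remote_function import remote

# specify your dependencies inside a conda yaml file
@remote(
    instance_type="ml.m5.large",
    image_uri = "my_base_python:latest",
    dependencies = "./environment.yml",
    pre_execution_commands=[
        "conda config --remove channels 'defaults'",
        "conda config --add channels 's3://my_bucket/my-conda-repository/conda-forge/'",
        "conda config --add channels 's3://my_bucket/my-conda-repository/main/'"
    ]
)
def matrix_multiply(a, b):
    return np.matmul(a, b)
```

# Example notebooks

You can transform the training code in an existing workspace environment, along with any associated data processing code and datasets, into a SageMaker training job. The following notebooks show how to customize your environment, job settings, and more for an image classification problem, using the XGBoost algorithm and Hugging Face.

The [quick_start notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-remote-function/quick_start/quick_start.ipynb) contains the following code examples:
+ How to customize your job settings with a configuration file.
+ How to invoke Python functions as jobs asynchronously.
+ How to customize the job runtime environment by bringing in additional dependencies.
+ How to use local dependencies with the @remote function method.

The following notebooks provide additional code examples for different ML problem types and implementations.
+ To see code examples that use the @remote decorator for an image classification problem, open the [pytorch_mnist.ipynb](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-remote-function/pytorch_mnist_sample_notebook) notebook. This classification problem recognizes handwritten digits using the Modified National Institute of Standards and Technology (MNIST) sample dataset.
+ To see code examples that use the @remote decorator for the previous image classification problem with a script, see the Pytorch MNIST sample script, [train.py](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-remote-function/pytorch_mnist_sample_script).
+ To see how the XGBoost algorithm is implemented with an @remote decorator, open the [xgboost_abalone.ipynb](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-remote-function/xgboost_abalone) notebook.
+ To see how Hugging Face is integrated with an @remote decorator, open the [huggingface.ipynb](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-remote-function/huggingface_text_classification) notebook.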
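
Before running any of these notebooks with `include_local_workdir=True`, it can be useful to preview locally which files a function-based `custom_file_filter` would leave out. The following is a minimal sketch that uses only the Python standard library and does not call the SageMaker Python SDK. It assumes that the filter follows the same contract as the `ignore` callable of `shutil.copytree` (it receives a directory path and the names inside it, and returns the names to skip). That contract, the `preview_packaged_files` and `build_sample_workspace` helpers, and the sample file names are assumptions made for illustration.

```python
import os
import shutil
import tempfile
from typing import Callable, List


def my_filter(path: str, files: List[str]) -> List[str]:
    """Return the names in `path` that should be left out of the package."""
    return [name for name in files if name.endswith((".txt", ".ipynb"))]


def preview_packaged_files(
    workdir: str, file_filter: Callable[[str, List[str]], List[str]]
) -> List[str]:
    """Hypothetical helper: copy `workdir` with the filter applied, list what remains."""
    with tempfile.TemporaryDirectory() as tmp:
        dest = os.path.join(tmp, "package")
        # Assumption: the filter behaves like the `ignore` argument of copytree.
        shutil.copytree(workdir, dest, ignore=file_filter)
        kept = []
        for root, _dirs, names in os.walk(dest):
            for name in names:
                kept.append(os.path.relpath(os.path.join(root, name), dest))
        return sorted(kept)


def build_sample_workspace(root: str) -> None:
    """Hypothetical helper: create a tiny workspace with a module at the root level."""
    os.makedirs(os.path.join(root, "mnist_impl"))
    sample_files = (
        "main.py",
        "requirements.txt",
        "explore.ipynb",
        os.path.join("mnist_impl", "pytorch_mnist.py"),
    )
    for rel_path in sample_files:
        with open(os.path.join(root, rel_path), "w") as handle:
            handle.write("")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workspace:
        build_sample_workspace(workspace)
        for kept_file in preview_packaged_files(workspace, my_filter):
            print(kept_file)
```

In this sketch the filter drops every `.txt` file, so `requirements.txt` is excluded from the copied tree along with the notebook. If a filter like this is used in practice, it is worth checking that it does not exclude files the job still needs.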