本文属于机器翻译版本。若本译文内容与英语原文存在差异，则一律以英文原文为准。 # 使用 SageMaker AI 分布式数据并行库创建自己的 Docker 容器要构建自己的 Docker 容器进行训练并使用 SageMaker AI 数据并行库，您必须在 Dockerfile 中包含正确的依赖项和 SageMaker AI 分布式并行库的二进制文件。本节提供有关如何使用数据 parallel 库创建具有最小依赖关系的完整 Dockerfile，以便在 SageMaker AI 中进行分布式训练。 **注意** 此自定义 Docker 选项将 SageMaker AI 数据并行库作为二进制文件仅适用于。 PyTorch **使用 SageMaker 训练工具包和数据并行库创建 Dockerfile** 1. 使用 [NVIDIA CUDA](https://hub.docker.com/r/nvidia/cuda) 提供的 Docker 映像开始。[使用包含 CUDA 运行时和开发工具（头文件和库）的 cuDNN 开发者版本从源代码进行构建。PyTorch ](https://github.com/pytorch/pytorch#from-source) ``` FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04 ``` **提示** 官方的 AWS 深度学习容器 (DLC) 镜像是基于 [NVIDIA CUDA 基础](https://hub.docker.com/r/nvidia/cuda)镜像构建的。如果你想在按照其余说明进行操作的同时使用预构建的 DLC 镜像作为参考，请参阅[适用 PyTorch 于 Dockerfiles 的 Dee AWS p Learning Container](https://github.com/aws/deep-learning-containers/tree/master/pytorch) s。 1. 添加以下参数以指定软件包 PyTorch 和其他软件包的版本。此外，请指明指向 A SageMaker I 数据并行库和其他使用 AWS 资源的软件（例如 Amazon S3 插件）的 Amazon S3 存储桶路径。要使用除以下代码示例中提供的版本之外的第三方库版本，我们建议您查看[AWS 深度学习容器的官方 Dockerfiles](https://github.com/aws/deep-learning-containers/tree/master/pytorch/training/docker)， PyTorch以查找经过测试、兼容且适合您的应用程序的版本。要 URLs 查找`SMDATAPARALLEL_BINARY`参数，请参阅中的查找表[支持的框架](distributed-data-parallel-support.md#distributed-data-parallel-supported-frameworks)。 ``` ARG PYTORCH_VERSION=1.10.2 ARG PYTHON_SHORT_VERSION=3.8 ARG EFA_VERSION=1.14.1 ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl ARG CONDA_PREFIX="/opt/conda" ARG BRANCH_OFI=1.1.3-aws ``` 1. 设置以下环境变量以正确构建 SageMaker 训练组件并运行数据 parallel 库。在后续步骤中，您将为组件使用这些变量。 ``` # Set ENV variables required to build PyTorch ENV TORCH_CUDA_ARCH_LIST="7.0+PTX 8.0" ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all" ENV NCCL_VERSION=2.10.3 # Add OpenMPI to the path. ENV PATH /opt/amazon/openmpi/bin:$PATH # Add Conda to path ENV PATH $CONDA_PREFIX/bin:$PATH # Set this enviroment variable for SageMaker AI to launch SMDDP correctly. ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main # Add enviroment variable for processes to be able to call fork() ENV RDMAV_FORK_SAFE=1 # Indicate the container type ENV DLC_CONTAINER_TYPE=training # Add EFA and SMDDP to LD library path ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH" ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH ``` 1. 在后续步骤中安装或更新 `curl`、`wget` 和 `git`，以下载和构建软件包。 ``` RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ apt-get update && apt-get install -y --no-install-recommends \ curl \ wget \ git \ && rm -rf /var/lib/apt/lists/* ``` 1. 安装用于 Amazon EC2 网络通信的 [Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html) 软件。 ``` RUN DEBIAN_FRONTEND=noninteractive apt-get update RUN mkdir /tmp/efa \ && cd /tmp/efa \ && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \ && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \ && cd aws-efa-installer \ && ./efa_installer.sh -y --skip-kmod -g \ && rm -rf /tmp/efa ``` 1. 安装 [Conda](https://docs.conda.io/en/latest/) 来处理软件包管理。 ``` RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \ chmod +x ~/miniconda.sh && \ ~/miniconda.sh -b -p $CONDA_PREFIX && \ rm ~/miniconda.sh && \ $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \ $CONDA_PREFIX/bin/conda clean -ya ``` 1. 获取、构建、安装 PyTorch 及其依赖关系。我们[使用源代码进行](https://github.com/pytorch/pytorch#from-source)构建PyTorch ，因为我们需要控制 NCCL 版本以保证与 [AWS OFI](https://github.com/aws/aws-ofi-nccl) NCCL 插件的兼容性。 1. 按照[PyTorch 官方 dockerfil](https://github.com/pytorch/pytorch/blob/master/Dockerfile) e 中的步骤，安装构建依赖项并设置 [ccache](https://ccache.dev/) 以加快重新编译速度。 ``` RUN DEBIAN_FRONTEND=noninteractive \ apt-get install -y --no-install-recommends \ build-essential \ ca-certificates \ ccache \ cmake \ git \ libjpeg-dev \ libpng-dev \ && rm -rf /var/lib/apt/lists/* # Setup ccache RUN /usr/sbin/update-ccache-symlinks RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache ``` 1. 安装程序[PyTorch的常见依赖项和 Linux 依赖项](https://github.com/pytorch/pytorch#install-dependencies)。 ``` # Common dependencies for PyTorch RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses # Linux specific dependency for PyTorch RUN conda install -c pytorch magma-cuda113 ``` 1. 克隆[PyTorch GitHub存储库](https://github.com/pytorch/pytorch)。 ``` RUN --mount=type=cache,target=/opt/ccache \ cd / \ && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION} ``` 1. 安装并构建特定的 [NCCL](https://developer.nvidia.com/nccl) 版本。为此，请将默认 NCCL 文件夹 (`/pytorch/third_party/nccl`) 中的内容替换为 NVIDIA 存储库中的特定 NCCL 版本。 PyTorchNCCL 版本在本指南的第 3 步中设置。 ``` RUN cd /pytorch/third_party/nccl \ && rm -rf nccl \ && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \ && cd nccl \ && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \ && make pkg.txz.build \ && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1 ``` 1. 构建并安装 PyTorch。完成此过程通常需要 1 个小时多一点。它使用上一步中下载的 NCCL 版本构建。 ``` RUN cd /pytorch \ && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \ python setup.py install \ && rm -rf /pytorch ``` 1. 构建并安装 [AWS OFI NCCL 插件](https://github.com/aws/aws-ofi-nccl)。这使得 [libfabric](https://github.com/ofiwg/libfabric) 支持 SageMaker 人工智能数据并行库。 ``` RUN DEBIAN_FRONTEND=noninteractive apt-get update \ && apt-get install -y --no-install-recommends \ autoconf \ automake \ libtool RUN mkdir /tmp/efa-ofi-nccl \ && cd /tmp/efa-ofi-nccl \ && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \ && cd aws-ofi-nccl \ && ./autogen.sh \ && ./configure --with-libfabric=/opt/amazon/efa \ --with-mpi=/opt/amazon/openmpi \ --with-cuda=/usr/local/cuda \ --with-nccl=$CONDA_PREFIX \ && make \ && make install \ && rm -rf /tmp/efa-ofi-nccl ``` 1. 构建并安装[TorchVision](https://github.com/pytorch/vision.git)。 ``` RUN pip install --no-cache-dir -U \ packaging \ mpi4py==3.0.3 RUN cd /tmp \ && git clone https://github.com/pytorch/vision.git -b v0.9.1 \ && cd vision \ && BUILD_VERSION="0.9.1+cu111" python setup.py install \ && cd /tmp \ && rm -rf vision ``` 1. 安装和配置 OpenSSH。MPI 需要使用 OpenSSH 在容器之间进行通信。允许 OpenSSH 与容器通信而无需请求确认。 ``` RUN apt-get update \ && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \ && apt-get install -y --no-install-recommends openssh-client openssh-server \ && mkdir -p /var/run/sshd \ && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \ && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \ && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \ && rm -rf /var/lib/apt/lists/* # Configure OpenSSH so that nodes can communicate with each other RUN mkdir -p /var/run/sshd && \ sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd RUN rm -rf /root/.ssh/ && \ mkdir -p /root/.ssh/ && \ ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \ cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \ && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config ``` 1. 安装 PT S3 插件以高效访问 Amazon S3 中的数据集。 ``` RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU} RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt ``` 1. 安装 [libboost](https://www.boost.org/) 库。此软件包是将 SageMaker AI 数据并行库的异步 IO 功能联网所必需的。 ``` WORKDIR / RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \ && tar -xzf boost_1_73_0.tar.gz \ && cd boost_1_73_0 \ && ./bootstrap.sh \ && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \ && cd .. \ && rm -rf boost_1_73_0.tar.gz \ && rm -rf boost_1_73_0 \ && cd ${CONDA_PREFIX}/include/boost ``` 1. 安装以下 SageMaker AI 工具进行 PyTorch 训练。 ``` WORKDIR /root RUN pip install --no-cache-dir -U \ smclarify \ "sagemaker>=2,<3" \ sagemaker-experiments==0.* \ sagemaker-pytorch-training ``` 1. 最后，安装 SageMaker AI 数据 parallel 二进制文件和其余依赖项。 ``` RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ apt-get update && apt-get install -y --no-install-recommends \ jq \ libhwloc-dev \ libnuma1 \ libnuma-dev \ libssl1.1 \ libtool \ hwloc \ && rm -rf /var/lib/apt/lists/* RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY} ``` 1. 创建 Dockerfile 后，请参阅[调整自己的训练容器，了解如何构建 Docker 容器](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html)，将其托管在 Amazon ECR 中，以及如何使用 Python SDK 运行训练作业。 SageMaker 以下示例代码显示了将之前所有代码块组合在一起的完整 Dockerfile。 ``` # This file creates a docker image with minimum dependencies to run SageMaker AI data parallel training FROM nvidia/cuda:11.3.1-cudnn8-devel-ubuntu20.04 # Set appropiate versions and location for components ARG PYTORCH_VERSION=1.10.2 ARG PYTHON_SHORT_VERSION=3.8 ARG EFA_VERSION=1.14.1 ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/${PYTORCH_VERSION}/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl ARG PT_S3_WHL_GPU=https://aws-s3-plugin.s3.us-west-2.amazonaws.com/binaries/0.0.1/1c3e69e/awsio-0.0.1-cp38-cp38-manylinux1_x86_64.whl ARG CONDA_PREFIX="/opt/conda" ARG BRANCH_OFI=1.1.3-aws # Set ENV variables required to build PyTorch ENV TORCH_CUDA_ARCH_LIST="3.7 5.0 7.0+PTX 8.0" ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all" ENV NCCL_VERSION=2.10.3 # Add OpenMPI to the path. ENV PATH /opt/amazon/openmpi/bin:$PATH # Add Conda to path ENV PATH $CONDA_PREFIX/bin:$PATH # Set this enviroment variable for SageMaker AI to launch SMDDP correctly. ENV SAGEMAKER_TRAINING_MODULE=sagemaker_pytorch_container.training:main # Add enviroment variable for processes to be able to call fork() ENV RDMAV_FORK_SAFE=1 # Indicate the container type ENV DLC_CONTAINER_TYPE=training # Add EFA and SMDDP to LD library path ENV LD_LIBRARY_PATH="/opt/conda/lib/python${PYTHON_SHORT_VERSION}/site-packages/smdistributed/dataparallel/lib:$LD_LIBRARY_PATH" ENV LD_LIBRARY_PATH=/opt/amazon/efa/lib/:$LD_LIBRARY_PATH # Install basic dependencies to download and build other dependencies RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ apt-get update && apt-get install -y --no-install-recommends \ curl \ wget \ git \ && rm -rf /var/lib/apt/lists/* # Install EFA. # This is required for SMDDP backend communication RUN DEBIAN_FRONTEND=noninteractive apt-get update RUN mkdir /tmp/efa \ && cd /tmp/efa \ && curl --silent -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_VERSION}.tar.gz \ && tar -xf aws-efa-installer-${EFA_VERSION}.tar.gz \ && cd aws-efa-installer \ && ./efa_installer.sh -y --skip-kmod -g \ && rm -rf /tmp/efa # Install Conda RUN curl -fsSL -v -o ~/miniconda.sh -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \ chmod +x ~/miniconda.sh && \ ~/miniconda.sh -b -p $CONDA_PREFIX && \ rm ~/miniconda.sh && \ $CONDA_PREFIX/bin/conda install -y python=${PYTHON_SHORT_VERSION} conda-build pyyaml numpy ipython && \ $CONDA_PREFIX/bin/conda clean -ya # Install PyTorch. # Start with dependencies listed in official PyTorch dockerfile # https://github.com/pytorch/pytorch/blob/master/Dockerfile RUN DEBIAN_FRONTEND=noninteractive \ apt-get install -y --no-install-recommends \ build-essential \ ca-certificates \ ccache \ cmake \ git \ libjpeg-dev \ libpng-dev && \ rm -rf /var/lib/apt/lists/* # Setup ccache RUN /usr/sbin/update-ccache-symlinks RUN mkdir /opt/ccache && ccache --set-config=cache_dir=/opt/ccache # Common dependencies for PyTorch RUN conda install astunparse numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses # Linux specific dependency for PyTorch RUN conda install -c pytorch magma-cuda113 # Clone PyTorch RUN --mount=type=cache,target=/opt/ccache \ cd / \ && git clone --recursive https://github.com/pytorch/pytorch -b v${PYTORCH_VERSION} # Note that we need to use the same NCCL version for PyTorch and OFI plugin. # To enforce that, install NCCL from source before building PT and OFI plugin. # Install NCCL. # Required for building OFI plugin (OFI requires NCCL's header files and library) RUN cd /pytorch/third_party/nccl \ && rm -rf nccl \ && git clone https://github.com/NVIDIA/nccl.git -b v${NCCL_VERSION}-1 \ && cd nccl \ && make -j64 src.build CUDA_HOME=/usr/local/cuda NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_80,code=sm_80" \ && make pkg.txz.build \ && tar -xvf build/pkg/txz/nccl_*.txz -C $CONDA_PREFIX --strip-components=1 # Build and install PyTorch. RUN cd /pytorch \ && CMAKE_PREFIX_PATH="$(dirname $(which conda))/../" \ python setup.py install \ && rm -rf /pytorch RUN ccache -C # Build and install OFI plugin. \ # It is required to use libfabric. RUN DEBIAN_FRONTEND=noninteractive apt-get update \ && apt-get install -y --no-install-recommends \ autoconf \ automake \ libtool RUN mkdir /tmp/efa-ofi-nccl \ && cd /tmp/efa-ofi-nccl \ && git clone https://github.com/aws/aws-ofi-nccl.git -b v${BRANCH_OFI} \ && cd aws-ofi-nccl \ && ./autogen.sh \ && ./configure --with-libfabric=/opt/amazon/efa \ --with-mpi=/opt/amazon/openmpi \ --with-cuda=/usr/local/cuda \ --with-nccl=$CONDA_PREFIX \ && make \ && make install \ && rm -rf /tmp/efa-ofi-nccl # Build and install Torchvision RUN pip install --no-cache-dir -U \ packaging \ mpi4py==3.0.3 RUN cd /tmp \ && git clone https://github.com/pytorch/vision.git -b v0.9.1 \ && cd vision \ && BUILD_VERSION="0.9.1+cu111" python setup.py install \ && cd /tmp \ && rm -rf vision # Install OpenSSH. # Required for MPI to communicate between containers, allow OpenSSH to talk to containers without asking for confirmation RUN apt-get update \ && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \ && apt-get install -y --no-install-recommends openssh-client openssh-server \ && mkdir -p /var/run/sshd \ && cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new \ && echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new \ && mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config \ && rm -rf /var/lib/apt/lists/* # Configure OpenSSH so that nodes can communicate with each other RUN mkdir -p /var/run/sshd && \ sed 's@session\s*required\s*pam_loginuid.so@session optional pam_loginuid.so@g' -i /etc/pam.d/sshd RUN rm -rf /root/.ssh/ && \ mkdir -p /root/.ssh/ && \ ssh-keygen -q -t rsa -N '' -f /root/.ssh/id_rsa && \ cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys \ && printf "Host *\n StrictHostKeyChecking no\n" >> /root/.ssh/config # Install PT S3 plugin. # Required to efficiently access datasets in Amazon S3 RUN pip install --no-cache-dir -U ${PT_S3_WHL_GPU} RUN mkdir -p /etc/pki/tls/certs && cp /etc/ssl/certs/ca-certificates.crt /etc/pki/tls/certs/ca-bundle.crt # Install libboost from source. # This package is needed for smdataparallel functionality (for networking asynchronous IO). WORKDIR / RUN wget https://sourceforge.net/projects/boost/files/boost/1.73.0/boost_1_73_0.tar.gz/download -O boost_1_73_0.tar.gz \ && tar -xzf boost_1_73_0.tar.gz \ && cd boost_1_73_0 \ && ./bootstrap.sh \ && ./b2 threading=multi --prefix=${CONDA_PREFIX} -j 64 cxxflags=-fPIC cflags=-fPIC install || true \ && cd .. \ && rm -rf boost_1_73_0.tar.gz \ && rm -rf boost_1_73_0 \ && cd ${CONDA_PREFIX}/include/boost # Install SageMaker AI PyTorch training. WORKDIR /root RUN pip install --no-cache-dir -U \ smclarify \ "sagemaker>=2,<3" \ sagemaker-experiments==0.* \ sagemaker-pytorch-training # Install SageMaker AI data parallel binary (SMDDP) # Start with dependencies RUN --mount=type=cache,id=apt-final,target=/var/cache/apt \ apt-get update && apt-get install -y --no-install-recommends \ jq \ libhwloc-dev \ libnuma1 \ libnuma-dev \ libssl1.1 \ libtool \ hwloc \ && rm -rf /var/lib/apt/lists/* # Install SMDDP RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY} ``` **提示** 有关创建用于 SageMaker 人工智能训练的自定义 Dockerfile 的更多一般信息，请参阅[使用自己的训练算](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)法。 **提示** 如果要扩展自定义 Dockerfile 以纳入 SageMaker AI 模型并行库，请参阅。[使用 SageMaker 分布式模型并行库创建自己的 Docker 容器](model-parallel-sm-sdk.md#model-parallel-bring-your-own-container)