# Step 2: Launch a Training Job Using the SageMaker Python SDK

The SageMaker Python SDK supports managed training of models with ML frameworks such as TensorFlow and PyTorch. To launch a training job using one of these frameworks, you define a SageMaker [TensorFlow estimator](https://sagemaker.readthedocs.io/en/v2.199.0/frameworks/tensorflow/sagemaker.tensorflow.html#tensorflow-estimator), a SageMaker [PyTorch estimator](https://sagemaker.readthedocs.io/en/v2.199.0/frameworks/pytorch/sagemaker.pytorch.html#pytorch-estimator), or a SageMaker generic [Estimator](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/estimators.html#sagemaker.estimator.Estimator) that uses the modified training script and the model parallelism configuration.

**Topics**
+ [Using the SageMaker TensorFlow and PyTorch Estimators](#model-parallel-using-sagemaker-pysdk)
+ [Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel Library](#model-parallel-customize-container)
+ [Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library](#model-parallel-bring-your-own-container)

## Using the SageMaker TensorFlow and PyTorch Estimators

The TensorFlow and PyTorch estimator classes contain the `distribution` parameter, which you can use to specify configuration parameters for distributed training frameworks. The SageMaker model parallel library internally uses MPI for hybrid data and model parallelism, so you must use the MPI option with the library.

The following TensorFlow and PyTorch estimator templates show how to configure the `distribution` parameter for using the SageMaker model parallel library with MPI.

------
#### [ Using the SageMaker TensorFlow estimator ]

```
import sagemaker
from sagemaker.tensorflow import TensorFlow

smp_options = {
    "enabled": True,              # Required
    "parameters": {               # Required
        "partitions": 2,          # Required
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "horovod": True,          # Use this for hybrid model and data parallelism
    }
}

mpi_options = {
    "enabled": True,              # Required
    "processes_per_host": 8,      # Required
    # "custom_mpi_options": "--mca btl_vader_single_copy_mechanism none"
}

smd_mp_estimator = TensorFlow(
    entry_point="{{your_training_script.py}}", # Specify your train script
    source_dir="{{location_to_your_script}}",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='{{ml.p3.16xlarge}}',
    framework_version='{{2.6.3}}',
    py_version='{{py38}}',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="{{SMD-MP-demo}}",
)

smd_mp_estimator.fit('{{s3://my_bucket/my_training_data/}}')
```

------
#### [ Using the SageMaker PyTorch estimator ]

```
import sagemaker
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,                      # Required
    "parameters": {                       # Required
        "pipeline_parallel_degree": 2,    # Required
        "microbatches": 4,
        "placement_strategy": "spread",
        "pipeline": "interleaved",
        "optimize": "speed",
        "ddp": True,
    }
}

mpi_options = {
    "enabled": True,                      # Required
    "processes_per_host": 8,              # Required
    # "custom_mpi_options": "--mca btl_vader_single_copy_mechanism none"
}

smd_mp_estimator = PyTorch(
    entry_point="{{your_training_script.py}}", # Specify your train script
    source_dir="{{location_to_your_script}}",
    role=sagemaker.get_execution_role(),
    instance_count=1,
    instance_type='{{ml.p3.16xlarge}}',
    framework_version='{{1.13.1}}',
    py_version='{{py38}}',
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="{{SMD-MP-demo}}",
)

smd_mp_estimator.fit('{{s3://my_bucket/my_training_data/}}')
```

------

To enable the library, you must pass configuration dictionaries for the `"smdistributed"` and `"mpi"` keys to the `distribution` argument of the SageMaker Python SDK estimator constructors.

**Configuration parameters for SageMaker model parallelism**
+ For the `"smdistributed"` key, pass a dictionary with the `"modelparallel"` key and the following inner dictionaries.

  **Note**
  Using `"modelparallel"` and `"dataparallel"` in one training job is not supported.

  + `"enabled"` – Required. To enable model parallelism, set `"enabled": True`.
  + `"parameters"` – Required. Specify a set of parameters for SageMaker model parallelism.
    + For a complete list of common parameters, see [Parameters for `smdistributed`](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#smdistributed-parameters) in the *SageMaker Python SDK documentation*.

      For TensorFlow, see [TensorFlow-specific Parameters](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#tensorflow-specific-parameters).

      For PyTorch, see [PyTorch-specific Parameters](https://sagemaker.readthedocs.io/en/v2.199.0/api/training/smd_model_parallel_general.html#pytorch-specific-parameters).
    + `"pipeline_parallel_degree"` (or `smdistributed-modelparallel` …

## Extend a Pre-built Docker Container that Contains SageMaker's Distributed Model Parallel Library

To extend a pre-built container and use SageMaker's model parallelism library, you must use one of the available AWS Deep Learning Containers (DLC) images for PyTorch or TensorFlow. The SageMaker model parallelism library is included in the TensorFlow (2.3.0 and later) and PyTorch (1.6.0 and later) DLC images with CUDA (`cuxyz`). For a complete list of DLC images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) in the *AWS Deep Learning Containers GitHub repository*.

**Tip**
To access the most up-to-date version of the SageMaker model parallelism library, we recommend that you use the image that contains the latest version of TensorFlow or PyTorch.

For example, your Dockerfile should contain a `FROM` statement similar to the following:

```
# Use the SageMaker DLC image URI for TensorFlow or PyTorch
FROM {{aws-dlc-account-id}}.dkr.ecr.{{aws-region}}.amazonaws.com/{{framework}}-training:{{framework-version-tag}}

# Add your dependencies here
RUN {{...}}

ENV PATH="/opt/ml/code:${PATH}"

# this environment variable is used by the SageMaker AI container to determine our user code directory.
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code
```

In addition, when you define a PyTorch or TensorFlow estimator, you must specify the `entry_point` of your training script. This should be the same path identified by `ENV SAGEMAKER_SUBMIT_DIRECTORY` in your Dockerfile.

**Tip**
You must push this Docker container to Amazon Elastic Container Registry (Amazon ECR) and use the image URI (`image_uri`) to define a SageMaker estimator for training. For more information, see [Extend a Pre-built Container](prebuilt-containers-extend.md).

After you finish hosting the Docker container and retrieving its image URI, create a SageMaker `PyTorch` estimator object as follows. This example assumes that you have already defined `smp_options` and `mpi_options`.

```
import sagemaker
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()

smd_mp_estimator = PyTorch(
    entry_point="{{your_training_script.py}}",
    role=sagemaker.get_execution_role(),
    instance_type='{{ml.p3.16xlarge}}',
    sagemaker_session=sagemaker_session,
    image_uri='{{your_aws_account_id}}.dkr.ecr.{{region}}.amazonaws.com/{{name}}:{{tag}}',
    instance_count={{1}},
    distribution={
        # Same nesting as in the estimator templates earlier on this page
        "smdistributed": {"modelparallel": smp_options},
        "mpi": mpi_options
    },
    base_job_name="{{SMD-MP-demo}}",
)

smd_mp_estimator.fit('s3://my_bucket/my_training_data/')
```

## Create Your Own Docker Container with the SageMaker Distributed Model Parallel Library

To build your own Docker container for training and use the SageMaker model parallel library, you must include the correct dependencies and the binary files of the SageMaker distributed parallel libraries in your Dockerfile. This section provides the minimum set of code blocks you must include to properly prepare a SageMaker training environment and the model parallel library in your own Docker container.

**Note**
This custom Docker option with the SageMaker model parallel library as a binary is available only for PyTorch.

**To create a Dockerfile with the SageMaker training toolkit and the model parallel library**

1. 
   Start with one of the [NVIDIA CUDA base images](https://hub.docker.com/r/nvidia/cuda).

   ```
   FROM {{}}
   ```

   **Tip**
   The official AWS Deep Learning Container (DLC) images are built from the [NVIDIA CUDA base images](https://hub.docker.com/r/nvidia/cuda). We recommend that you look into the [official Dockerfiles of AWS Deep Learning Container for PyTorch](https://github.com/aws/deep-learning-containers/tree/master/pytorch/training/docker) to find which versions of the libraries you need to install and how to configure them. The official Dockerfiles are complete, benchmark tested, and managed by the SageMaker and Deep Learning Containers service teams. In the provided link, choose the PyTorch version you use, choose the CUDA (`cuxyz`) folder, and choose the Dockerfile ending with `.gpu` or `.sagemaker.gpu`.

1. To set up a distributed training environment, you need to install software for communication and network devices, such as [Elastic Fabric Adapter (EFA)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html), [NVIDIA Collective Communications Library (NCCL)](https://developer.nvidia.com/nccl), and [Open MPI](https://www.open-mpi.org/). Depending on the PyTorch and CUDA versions you choose, you must install compatible versions of these libraries.

   **Important**
   Because the SageMaker model parallel library requires the SageMaker data parallel library in the subsequent steps, we highly recommend that you follow the instructions at [Create your own Docker container with the SageMaker AI distributed data parallel library](data-parallel-bring-your-own-container.md) to properly set up a SageMaker training environment for distributed training.

   For more information about setting up EFA with NCCL and Open MPI, see [Get started with EFA and MPI](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start.html) and [Get started with EFA and NCCL](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl.html).

1. Add the following arguments to specify the URLs of the SageMaker distributed training packages for PyTorch. The SageMaker model parallel library requires the SageMaker data parallel library to use cross-node Remote Direct Memory Access (RDMA).

   ```
   ARG SMD_MODEL_PARALLEL_URL=https://sagemaker-distributed-model-parallel.s3.us-west-2.amazonaws.com/pytorch-1.10.0/build-artifacts/2022-02-21-19-26/smdistributed_modelparallel-1.7.0-cp38-cp38-linux_x86_64.whl
   ARG SMDATAPARALLEL_BINARY=https://smdataparallel.s3.amazonaws.com/binary/pytorch/1.10.2/cu113/2022-02-18/smdistributed_dataparallel-1.4.0-cp38-cp38-linux_x86_64.whl
   ```

1. Install the dependencies that the SageMaker model parallel library requires.

   1. 
      Install the [METIS](http://glaros.dtc.umn.edu/gkhome/metis/metis/overview) library.

      ```
      ARG METIS=metis-{{5.1.0}}

      RUN rm /etc/apt/sources.list.d/* \
        && wget -nv http://glaros.dtc.umn.edu/gkhome/fetch/sw/metis/${METIS}.tar.gz \
        && gunzip -f ${METIS}.tar.gz \
        && tar -xvf ${METIS}.tar \
        && cd ${METIS} \
        && apt-get update \
        && make config shared=1 \
        && make install \
        && cd .. \
        && rm -rf ${METIS}.tar* \
        && rm -rf ${METIS} \
        && rm -rf /var/lib/apt/lists/* \
        && apt-get clean
      ```

   1. Install the [RAPIDS Memory Manager library](https://github.com/rapidsai/rmm#rmm-rapids-memory-manager). This requires [CMake](https://cmake.org/) 3.14 or later.

      ```
      ARG RMM_VERSION={{0.15.0}}

      RUN wget -nv https://github.com/rapidsai/rmm/archive/v${RMM_VERSION}.tar.gz \
        && tar -xvf v${RMM_VERSION}.tar.gz \
        && cd rmm-${RMM_VERSION} \
        && INSTALL_PREFIX=/usr/local ./build.sh librmm \
        && cd .. \
        && rm -rf v${RMM_VERSION}.tar* \
        && rm -rf rmm-${RMM_VERSION}
      ```

1. Install the SageMaker model parallel library.

   ```
   RUN pip install --no-cache-dir -U ${SMD_MODEL_PARALLEL_URL}
   ```

1. Install the SageMaker data parallel library.

   ```
   RUN SMDATAPARALLEL_PT=1 pip install --no-cache-dir ${SMDATAPARALLEL_BINARY}
   ```

1. Install the [sagemaker-training-toolkit](https://github.com/aws/sagemaker-training-toolkit). The toolkit contains the common functionality that is necessary to create a container compatible with the SageMaker training platform and the SageMaker Python SDK.

   ```
   RUN pip install sagemaker-training
   ```

1. After you finish creating the Dockerfile, see [Adapting Your Own Training Container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html) to learn how to build the Docker container and host it in Amazon ECR.

**Tip**
For more general information about creating a custom Dockerfile for training in SageMaker AI, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).
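
Whichever container option you choose, the estimator still receives the same `distribution` dictionary described in the estimator section of this page. The following is a minimal sketch of a local pre-flight check for that dictionary. It assumes only the rules stated on this page: `"enabled"` and `"parameters"` are required, `"modelparallel"` and `"dataparallel"` cannot be combined, and the MPI option must be enabled. The function name `validate_distribution` is hypothetical and is not part of the SageMaker Python SDK, and the check does not validate individual library parameters.

```python
from typing import Any, Dict


def validate_distribution(distribution: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical pre-flight check; not part of the SageMaker Python SDK."""
    smd = distribution.get("smdistributed", {})

    # "modelparallel" and "dataparallel" are not supported in the same job.
    if "modelparallel" in smd and "dataparallel" in smd:
        raise ValueError('use either "modelparallel" or "dataparallel", not both')

    smp = smd.get("modelparallel")
    if not isinstance(smp, dict):
        raise ValueError('"smdistributed" must contain a "modelparallel" dictionary')

    # Both inner keys are documented as required.
    if smp.get("enabled") is not True:
        raise ValueError('"modelparallel" must set "enabled": True')
    if not isinstance(smp.get("parameters"), dict):
        raise ValueError('"modelparallel" must contain a "parameters" dictionary')

    # The library runs on MPI, so the "mpi" options must be enabled as well.
    mpi = distribution.get("mpi", {})
    if mpi.get("enabled") is not True:
        raise ValueError('"mpi" must set "enabled": True')
    if "processes_per_host" not in mpi:
        raise ValueError('"mpi" must set "processes_per_host"')

    return distribution


# Example values taken from the PyTorch template on this page.
distribution = validate_distribution({
    "smdistributed": {
        "modelparallel": {
            "enabled": True,
            "parameters": {"pipeline_parallel_degree": 2, "microbatches": 4},
        }
    },
    "mpi": {"enabled": True, "processes_per_host": 8},
})
print(sorted(distribution))
```

Running a check like this before constructing the estimator can surface a mis-nested dictionary, such as passing `smp_options` directly under `"smdistributed"`, before a training job is launched.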