
# Pretraining tutorial for SageMaker training jobs (GPU)
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-pretrain-tutorial"></a>

This tutorial walks you through setting up and running a pretraining job using SageMaker training jobs on GPU instances.
+ Set up your environment
+ Launch a training job using SageMaker HyperPod recipes

Before you begin, make sure that you meet the following prerequisites.

**Prerequisites**  
Make sure that you have the following before you start setting up your environment:  
+ An Amazon FSx file system or an Amazon S3 bucket where you can load the data and output the training artifacts.
+ A requested service quota for 1x ml.p4d.24xlarge and 1x ml.p5.48xlarge on Amazon SageMaker AI. To request a service quota increase, go to the AWS Service Quotas console, navigate to AWS services, choose **Amazon SageMaker AI**, and then choose one ml.p4d.24xlarge and one ml.p5.48xlarge instance.
+ An AWS Identity and Access Management (IAM) role with the following managed policies, which give SageMaker AI permissions to run the examples:
  + AmazonSageMakerFullAccess
  + AmazonEC2FullAccess
+ Data in one of the following formats:
  + JSON
  + JSONGZ (compressed JSON)
  + ARROW
+ (Optional) A HuggingFace token, if you use model weights from HuggingFace for pretraining or fine-tuning. For more information about access tokens, see [User access tokens](https://huggingface.co/docs/hub/en/security-tokens).

## Set up the environment for GPU SageMaker training jobs
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-environment-setup"></a>

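Before you run a SageMaker training job, configure your AWS credentials and preferred Region by running the `aws configure` command. As an alternative to the configure command, you can provide your credentials through environment variables such as `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`. For more information, see [SageMaker AI Python SDK](https://github.com/aws/sagemaker-python-sdk).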
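
If you go the environment-variable route, a quick check before launching a job can save a failed submission. The following is a minimal sketch, not part of the SageMaker tooling: the helper name `missing_credential_vars` is hypothetical, and it treats `AWS_SESSION_TOKEN` as optional on the assumption that you only need it with temporary credentials.

```python
import os

# Variables named in the text above. The session token is only needed
# when you use temporary credentials, so it is not treated as required.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
OPTIONAL_VARS = ("AWS_SESSION_TOKEN",)


def missing_credential_vars(env=None):
    """Return the required credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credential_vars()
if missing:
    print("Not set:", ", ".join(missing), "(run `aws configure` or export them)")
else:
    print("Credential environment variables are set.")
```

The check only looks at whether the variables exist. It does not verify that the credentials are valid or that a Region is configured.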

We strongly recommend launching the SageMaker training job from a Jupyter notebook in SageMaker AI JupyterLab. For more information, see [SageMaker JupyterLab](studio-updated-jl.md).
+ (Optional) Set up the virtual environment and dependencies. If you are using a Jupyter notebook in Amazon SageMaker Studio, you can skip this step. Make sure that you are using Python 3.9 or later.

  ```
  # set up a virtual environment
  python3 -m venv ${PWD}/venv
  source venv/bin/activate
  # install dependencies after git clone.
  
  git clone --recursive git@github.com:aws/sagemaker-hyperpod-recipes.git
  cd sagemaker-hyperpod-recipes
  pip3 install -r requirements.txt
  # Set the aws region.
  
  aws configure set region <your_region>
  ```
+ Install the SageMaker AI Python SDK.

  ```
  pip3 install --upgrade sagemaker
  ```
+ `Container`: The SageMaker AI Python SDK sets the GPU container automatically. You can also provide your own container.
**Note**  
If you're running a Llama 3.2 multi-modal training job, the `transformers` version must be `4.45.2` or later.

  Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you use the SageMaker AI Python SDK directly, for example when you launch the job from a notebook in SageMaker AI JupyterLab.

  If you use HyperPod recipes to launch the job with cluster type `sm_jobs`, this is done automatically.
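
The append step in the note can be scripted so that rerunning a notebook cell does not add the same line twice. This is a minimal sketch under stated assumptions: `ensure_requirement` is a hypothetical helper, and it only skips an exact duplicate line. It does not detect or resolve a different `transformers` pin that may already be in the file.

```python
from pathlib import Path


def ensure_requirement(requirements_file, requirement="transformers==4.45.2"):
    """Append `requirement` to the file unless that exact line is already present.

    Returns True if the file was modified, False otherwise.
    """
    path = Path(requirements_file)
    lines = path.read_text().splitlines() if path.exists() else []
    if any(line.strip() == requirement for line in lines):
        return False
    lines.append(requirement)
    path.write_text("\n".join(lines) + "\n")
    return True


# Example call (the path is a placeholder for the requirements.txt in your source_dir):
# ensure_requirement("<source_dir>/requirements.txt")
```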

## Launch the training job using a Jupyter Notebook
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-launch-training-job-notebook"></a>

You can use the following Python code to run a SageMaker training job with your recipe. It uses the PyTorch estimator from the [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/) to submit the recipe. The following example launches the llama3-8b recipe on the SageMaker AI training platform.

```
import os

import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.pytorch import PyTorch

sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()

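bucket = sagemaker_session.default_bucket()
# S3 prefix under the default bucket; used below for the TensorBoard logs
output = os.path.join(f"s3://{bucket}", "output")
# S3 URI for the training job output; replace the placeholder with your own URI
output_path = "<s3-URI>"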

overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },   
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}

tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=os.path.join(output, 'tensorboard'),
    container_local_output_path=overrides["exp_manager"]["explicit_log_dir"]
)

estimator = PyTorch(
    output_path=output_path,
    base_job_name="llama-recipe",
    role=role,
    instance_type="ml.p5.48xlarge",
    training_recipe="training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain",
    recipe_overrides=overrides,  # the dictionary defined above
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)

estimator.fit(inputs={"train": "s3 or fsx input", "val": "s3 or fsx input"}, wait=True)
```

The preceding code creates a PyTorch estimator object with the training recipe and then fits the model using the `fit()` method. Use the `training_recipe` parameter to specify the recipe that you want to use for training.
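
The channel names passed to `fit()` are not arbitrary: SageMaker makes each input channel available in the training container under `/opt/ml/input/data/<channel>`, which is why the `train_dir` and `val_dir` overrides point there. The following sketch shows that relationship for S3 inputs. The helper names, the bucket `amzn-s3-demo-bucket`, and the prefix are hypothetical placeholders, not values from the recipe.

```python
def build_s3_inputs(bucket, prefix, channels=("train", "val")):
    """Map each channel name to an S3 URI of the form s3://<bucket>/<prefix>/<channel>."""
    prefix = prefix.strip("/")
    base = f"s3://{bucket}/{prefix}" if prefix else f"s3://{bucket}"
    return {channel: f"{base}/{channel}" for channel in channels}


def container_path(channel):
    """Directory where a channel's data is available inside the training container."""
    return f"/opt/ml/input/data/{channel}"


# Hypothetical bucket and prefix, for illustration only.
inputs = build_s3_inputs("amzn-s3-demo-bucket", "llama3/pretrain")
for channel, uri in inputs.items():
    print(f"{channel}: {uri} -> {container_path(channel)}")
```

A dictionary built this way could be passed as the `inputs` argument of `estimator.fit()` in place of the placeholder strings, provided the data actually exists at those URIs.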

**Note**  
If you're running a Llama 3.2 multi-modal training job, the `transformers` version must be 4.45.2 or later.

Append `transformers==4.45.2` to `requirements.txt` in `source_dir` only when you use the SageMaker AI Python SDK directly. For example, if you launch the job from a Jupyter notebook, you must add this version to the text file yourself.

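When you deploy an endpoint for a SageMaker training job, specify the image URI to use. If you don't provide the image URI, the estimator uses the training image as the image for the deployment. The training images that SageMaker HyperPod provides don't contain the dependencies required for inference and deployment. The following example shows how to use an inference image for deployment: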

```
from sagemaker import image_uris

# Use a PyTorch inference image rather than the training image
container = image_uris.retrieve(framework="pytorch", region="us-west-2", version="2.0",
                                py_version="py310", image_scope="inference", instance_type="ml.p4d.24xlarge")
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.p4d.24xlarge", image_uri=container)
```

**Note**  
Running the preceding code on a SageMaker notebook instance might require more than the default 5 GB of storage that SageMaker AI JupyterLab provides. If you run out of space, create a new notebook instance with a larger storage volume.

## Launch the training job with the recipes launcher
<a name="sagemaker-hyperpod-gpu-sagemaker-training-jobs-launch-training-job-recipes"></a>

Update the `./recipes_collection/cluster/sm_jobs.yaml` file to look like the following:

```
sm_jobs_config:
  output_path: <s3_output_path>
  tensorboard_config:
    output_path: <s3_output_path>
    container_logs_path: /opt/ml/output/tensorboard  # Path to logs on the container
  wait: True  # Whether to wait for training job to finish
  inputs:  # Inputs to call fit with. Set either s3 or file_system, not both.
    s3:  # Dictionary of channel names and s3 URIs. For GPUs, use channels for train and validation.
      train: <s3_train_data_path>
      val: null
  additional_estimator_kwargs:  # All other additional args to pass to estimator. Must be int, float or string.
    max_run: 180000
    enable_remote_debug: True
  recipe_overrides:
    exp_manager:
      explicit_log_dir: /opt/ml/output/tensorboard
    data:
      train_dir: /opt/ml/input/data/train
    model:
      model_config: /opt/ml/input/data/train/config.json
    compiler_cache_url: "<compiler_cache_url>"
```
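
The comments in the configuration above state two constraints that are easy to violate when editing by hand: set either `s3` or `file_system` under `inputs` but not both, and keep every value under `additional_estimator_kwargs` an int, float, or string. The following is a minimal sketch of a pre-launch check over the `sm_jobs_config` mapping once it has been loaded into a Python dictionary. The function name `validate_sm_jobs_config` is hypothetical and is not part of the recipes launcher.

```python
def validate_sm_jobs_config(config):
    """Return a list of problems found in an `sm_jobs_config` mapping."""
    problems = []

    # Flag an output path that is missing or still an unfilled <placeholder>.
    output_path = str(config.get("output_path") or "")
    if not output_path or output_path.startswith("<"):
        problems.append("output_path: replace the placeholder with an S3 URI")

    # The YAML comment says to set either s3 or file_system, not both.
    inputs = config.get("inputs") or {}
    if inputs.get("s3") and inputs.get("file_system"):
        problems.append("inputs: set either s3 or file_system, not both")

    # Values must be int, float or string. Python treats bool as a subclass
    # of int, so a flag such as `enable_remote_debug: True` passes this check.
    kwargs = config.get("additional_estimator_kwargs") or {}
    for key, value in kwargs.items():
        if not isinstance(value, (int, float, str)):
            problems.append(f"additional_estimator_kwargs.{key}: must be int, float or string")

    return problems


# Mirrors the YAML above with the placeholders left unfilled.
example = {
    "output_path": "<s3_output_path>",
    "inputs": {"s3": {"train": "<s3_train_data_path>", "val": None}},
    "additional_estimator_kwargs": {"max_run": 180000, "enable_remote_debug": True},
}
for problem in validate_sm_jobs_config(example):
    print(problem)
```

The check is deliberately shallow. It does not confirm that the S3 paths exist or that the estimator accepts the keyword arguments.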

Update `./recipes_collection/config.yaml` to specify `sm_jobs` for both `cluster` and `cluster_type`.

```
defaults:
  - _self_
  - cluster: sm_jobs  # set to `slurm`, `k8s` or `sm_jobs`, depending on the desired cluster
  - recipes: training/llama/hf_llama3_8b_seq8k_gpu_p5x16_pretrain
cluster_type: sm_jobs  # bcm, bcp, k8s or sm_jobs. If bcm, k8s or sm_jobs, it must match - cluster above.
```

Launch the job with the following command:

```
python3 main.py --config-path recipes_collection --config-name config
```

For more information about configuring SageMaker training jobs, see "Run a training job on SageMaker training jobs".