
# Deploy large models for inference with TorchServe
<a name="large-model-inference-tutorials-torchserve"></a>

This tutorial shows how to deploy large models and serve inference on Amazon SageMaker AI with TorchServe on GPUs. The example deploys the [OPT-30b](https://huggingface.co/facebook/opt-30b) model to an `ml.g5` instance. You can adapt it to work with other models and instance types. Replace the `italicized placeholder text` in the examples with your own information.

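TorchServe is an open platform for large distributed model inference. It supports popular libraries such as PyTorch, native PiPPy, DeepSpeed, and HuggingFace Accelerate, and it offers a uniform handler API that stays consistent across distributed large model and non-distributed model inference scenarios. For more information, see the [TorchServe large model inference documentation](https://pytorch.org/serve/large_model_inference.html#).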

## Deep learning containers with TorchServe
<a name="large-model-inference-tutorials-torchserve-dlcs"></a>

To deploy a large model with TorchServe on SageMaker AI, you can use one of the SageMaker AI deep learning containers (DLCs). By default, TorchServe is installed in all AWS PyTorch DLCs. During model loading, TorchServe can install specialized libraries for large models, such as PiPPy, DeepSpeed, and Accelerate.

The following table lists all of the [SageMaker AI DLCs that include TorchServe](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only).


| DLC category | Framework | Hardware | Example URL | 
| --- | --- | --- | --- | 
| [SageMaker AI Framework Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-containers-sm-support-only) |  PyTorch 2.0.0 or later  | CPU, GPU | 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker | 
| [SageMaker AI Framework Graviton Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#sagemaker-framework-graviton-containers-sm-support-only) |  PyTorch 2.0.0 or later  | CPU | 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference-graviton:2.0.1-cpu-py310-ubuntu20.04-sagemaker | 
| [StabilityAI Inference Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#stabilityai-inference-containers) |  PyTorch 2.0.0 or later  | GPU | 763104351884.dkr.ecr.us-east-1.amazonaws.com/stabilityai-pytorch-inference:2.0.1-sgm0.1.0-gpu-py310-cu118-ubuntu20.04-sagemaker | 
| [Neuron Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#neuron-containers) | PyTorch 1.13.1 | Neuronx | 763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-inference-neuron:1.13.1-neuron-py310-sdk2.12.0-ubuntu20.04 | 
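
The example URLs in the table share one pattern: registry account, AWS Region, repository name, and image tag. As a rough sketch, assuming that pattern holds for the image you need, you could assemble a URI for a different Region as follows. The helper name `build_dlc_uri` is hypothetical and is not part of any SDK.

```python
def build_dlc_uri(account: str, region: str, repository: str, tag: str) -> str:
    """Assemble an ECR image URI from its parts (hypothetical helper)."""
    return f"{account}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# Values copied from the first table row, with the Region swapped to us-west-2.
uri = build_dlc_uri(
    account="763104351884",
    region="us-west-2",
    repository="pytorch-inference",
    tag="2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker",
)
print(uri)
```

Whether a particular tag is published in a particular Region is not something this sketch can verify, so check the available images list before relying on the result.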

## Getting started
<a name="large-model-inference-tutorials-torchserve-getting-started"></a>

Before you deploy your model, complete the prerequisites. You can also configure the model parameters and customize the handler code.

### Prerequisites
<a name="large-model-inference-tutorials-torchserve-getting-started-prereqs"></a>

To get started, make sure that you meet the following prerequisites:

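1. Make sure that you have access to an AWS account. [Set up your environment](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) so that the AWS CLI can access your account through either an IAM user or an IAM role. We recommend using an IAM role. For testing in your personal account, you can attach the following managed permissions policies to the IAM role: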
   + [AmazonEC2ContainerRegistryFullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess)
   + [AmazonEC2FullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonEC2FullAccess)
   + [AWSServiceRoleForAmazonEKSNodegroup](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AWSServiceRoleForAmazonEKSNodegroup)
   + [AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess)
   + [AmazonS3FullAccess](https://console.aws.amazon.com/iam/home#policies/arn:aws:iam::aws:policy/AmazonS3FullAccess)

   For more information about attaching IAM policies to a role, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the *AWS IAM User Guide*.

1. Configure your dependencies locally, as shown in the following examples:

   1. Install version 2 of the AWS CLI.

      ```
      # Install the latest AWS CLI v2 if it is not installed
      !curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
      !unzip awscliv2.zip
      # Follow the instructions to install v2 on the terminal
      !cat aws/README.md
      ```

   1. Install the SageMaker AI and Boto3 clients.

      ```
      # If already installed, update your client
      #%pip install sagemaker pip --upgrade --quiet
      !pip install -U sagemaker
      !pip install -U boto
      !pip install -U botocore
      !pip install -U boto3
      ```
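
After installing, it can help to confirm which versions ended up in the environment. The following is a minimal sketch that uses only the Python standard library. The function name `installed_versions` is hypothetical, and the package list is an assumption based on the install commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


for name, version in installed_versions(["sagemaker", "boto3", "botocore"]).items():
    print(f"{name}: {version or 'not installed'}")
```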

### Configure model settings and parameters
<a name="large-model-inference-tutorials-torchserve-getting-started-config"></a>

TorchServe uses [`torchrun`](https://pytorch.org/docs/stable/elastic/run.html) to set up the distributed environment for model parallel processing. TorchServe can run multiple workers for a large model. By default, it uses a round-robin algorithm to assign GPUs to the workers on a host. For large model inference, the number of GPUs assigned to each worker is calculated automatically from the number of GPUs specified in the `model_config.yaml` file. The environment variable `CUDA_VISIBLE_DEVICES`, which specifies the GPU device IDs that are visible at a given time, is set based on this number.

For example, suppose that a node has 8 GPUs and that one worker needs 4 GPUs on that node (`nproc_per_node=4`). In this case, TorchServe assigns four GPUs to the first worker (`CUDA_VISIBLE_DEVICES="0,1,2,3"`) and four GPUs to the second worker (`CUDA_VISIBLE_DEVICES="4,5,6,7"`).

In addition to this default behavior, TorchServe lets you specify which GPUs the workers use. For example, if you set the variable `deviceIds: [2,3,4,5]` in the [model config YAML file](https://github.com/pytorch/serve/blob/5ee02e4f050c9b349025d87405b246e970ee710b/model-archiver/README.md?plain=1#L164) and set `nproc_per_node=2`, then TorchServe assigns `CUDA_VISIBLE_DEVICES="2,3"` to the first worker and `CUDA_VISIBLE_DEVICES="4,5"` to the second worker.
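
To make the arithmetic in the two examples concrete, the following sketch partitions a list of device IDs into per-worker groups of `nproc_per_node` GPUs each. It only illustrates the grouping and is not TorchServe's implementation. The function name `assign_gpus` is hypothetical, and dropping leftover devices that do not fill a whole group is an assumption made for simplicity.

```python
def assign_gpus(device_ids, nproc_per_node):
    """Split device_ids into consecutive groups, one CUDA_VISIBLE_DEVICES value per group."""
    if nproc_per_node <= 0:
        raise ValueError("nproc_per_node must be positive")
    groups = [
        device_ids[i:i + nproc_per_node]
        # Stop early so that only complete groups are produced.
        for i in range(0, len(device_ids) - nproc_per_node + 1, nproc_per_node)
    ]
    return [",".join(str(d) for d in group) for group in groups]


# 8 GPUs on the node, 4 GPUs per worker
print(assign_gpus(list(range(8)), 4))
# User-specified deviceIds [2,3,4,5], 2 GPUs per worker
print(assign_gpus([2, 3, 4, 5], 2))
```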

The following `model_config.yaml` example configures both front-end and back-end parameters for the [OPT-30b](https://huggingface.co/facebook/opt-30b) model. The configured front-end parameters are `parallelType`, `deviceType`, `deviceIds`, and `torchrun`. For details about the front-end parameters that you can configure, see the [PyTorch GitHub documentation](https://github.com/pytorch/serve/blob/2bf505bae3046b0f7d0900727ec36e611bb5dca3/docs/configuration.md?plain=1#L267). The back-end configuration is based on a YAML map that allows free-style customization. For the back-end parameters, the example defines the DeepSpeed configuration and additional parameters used by the custom handler code.

```
# TorchServe front-end parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 1200
parallelType: "tp"
deviceType: "gpu"
# example of user specified GPU deviceIds
deviceIds: [0,1,2,3] # sets CUDA_VISIBLE_DEVICES

torchrun:
    nproc-per-node: 4

# TorchServe back-end parameters
deepspeed:
    config: ds-config.json
    checkpoint: checkpoints.json

handler: # parameters for custom handler code
    model_name: "facebook/opt-30b"
    model_path: "model/models--facebook--opt-30b/snapshots/ceea0a90ac0f6fae7c2c34bcb40477438c152546"
    max_length: 50
    max_new_tokens: 10
    manual_seed: 40
```

### Customize handlers
<a name="large-model-inference-tutorials-torchserve-getting-started-handlers"></a>

TorchServe provides [base handlers](https://github.com/pytorch/serve/tree/master/ts/torch_handler/distributed) and [handler utilities](https://github.com/pytorch/serve/tree/master/ts/handler_utils) for large model inference, built on popular libraries. The following example shows how the custom handler class [TransformersSeqClassifierHandler](https://github.com/pytorch/serve/blob/ab69b69a59d6ca6074df7e6d4014f07eb48dedba/examples/large_models/deepspeed/custom_handler.py#L16C7-L16C39) extends [BaseDeepSpeedHandler](https://github.com/pytorch/serve/blob/ab69b69a59d6ca6074df7e6d4014f07eb48dedba/ts/torch_handler/distributed/base_deepspeed_handler.py#L8) and uses the [handler utilities](https://github.com/pytorch/serve/blob/master/ts/handler_utils/distributed/deepspeed.py). For the full code example, see the [`custom_handler.py` code](https://github.com/pytorch/serve/blob/master/examples/large_models/deepspeed/custom_handler.py) in the PyTorch GitHub documentation.

```
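import logging
from abc import ABC

import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
from ts.context import Context
from ts.handler_utils.distributed.deepspeed import get_ds_engine
from ts.torch_handler.distributed.base_deepspeed_handler import BaseDeepSpeedHandler

logger = logging.getLogger(__name__)


class TransformersSeqClassifierHandler(BaseDeepSpeedHandler, ABC):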
    """
    Transformers handler class for sequence, token classification and question answering.
    """

    def __init__(self):
        super(TransformersSeqClassifierHandler, self).__init__()
        self.max_length = None
        self.max_new_tokens = None
        self.tokenizer = None
        self.initialized = False

    def initialize(self, ctx: Context):
        """In this initialize function, the HF large model is loaded and
        partitioned using DeepSpeed.
        Args:
            ctx (context): It is a JSON Object containing information
            pertaining to the model artifacts parameters.
        """
        super().initialize(ctx)
        model_dir = ctx.system_properties.get("model_dir")
        self.max_length = int(ctx.model_yaml_config["handler"]["max_length"])
        self.max_new_tokens = int(ctx.model_yaml_config["handler"]["max_new_tokens"])
        model_name = ctx.model_yaml_config["handler"]["model_name"]
        model_path = ctx.model_yaml_config["handler"]["model_path"]
        seed = int(ctx.model_yaml_config["handler"]["manual_seed"])
        torch.manual_seed(seed)

        logger.info("Model %s loading tokenizer", ctx.model_name)

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        config = AutoConfig.from_pretrained(model_name)
        with torch.device("meta"):
            self.model = AutoModelForCausalLM.from_config(
                config, torch_dtype=torch.float16
            )
        self.model = self.model.eval()

        ds_engine = get_ds_engine(self.model, ctx)
        self.model = ds_engine.module
        logger.info("Model %s loaded successfully", ctx.model_name)
        self.initialized = True

    def preprocess(self, requests):
        """
        Basic text preprocessing, based on the user's choice of application mode.
        Args:
            requests (list): A list of dictionaries with a "data" or "body" field, each
                            containing the input text to be processed.
        Returns:
            tuple: A tuple with two tensors: the batch of input ids and the batch of
                attention masks.
        """

    def inference(self, input_batch):
        """
        Predicts the class (or classes) of the received text using the serialized transformers
        checkpoint.
        Args:
            input_batch (tuple): A tuple with two tensors: the batch of input ids and the batch
                                of attention masks, as returned by the preprocess function.
        Returns:
            list: A list of strings with the predicted values for each input text in the batch.
        """
        
    def postprocess(self, inference_output):
        """Post Process Function converts the predicted response into Torchserve readable format.
        Args:
            inference_output (list): It contains the predicted response of the input text.
        Returns:
            (list): Returns a list of the Predictions and Explanations.
        """
```
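
The `preprocess` docstring above says that each request carries its input text in a `"data"` or `"body"` field. As a minimal sketch of that first step, before any tokenization, the text could be extracted as follows. The helper `extract_texts` is hypothetical and is not the code from `custom_handler.py`. Decoding byte payloads as UTF-8 is an assumption.

```python
def extract_texts(requests):
    """Pull the input text out of each request dictionary (hypothetical helper)."""
    texts = []
    for request in requests:
        # Prefer the "data" field and fall back to "body" when it is absent.
        payload = request.get("data")
        if payload is None:
            payload = request.get("body")
        # Payloads may arrive as raw bytes; decode them to str.
        if isinstance(payload, (bytes, bytearray)):
            payload = payload.decode("utf-8")
        texts.append(payload)
    return texts


batch = [
    {"data": b"Today the weather is really nice"},
    {"body": "and I am planning on"},
]
print(extract_texts(batch))
```

In a real handler, the resulting strings would then be passed to the tokenizer to produce the input IDs and attention masks that the docstring describes.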

## Prepare your model artifacts
<a name="large-model-inference-tutorials-torchserve-artifacts"></a>

Before you deploy your model on SageMaker AI, you must package your model artifacts. For large models, we recommend using the PyTorch [torch-model-archiver](https://github.com/pytorch/serve/blob/master/model-archiver/README.md) tool with the argument `--archive-format no-archive`, which skips compression of the model artifacts. The following example saves all of the model artifacts to a new folder named `opt/`.

```
torch-model-archiver --model-name opt --version 1.0 --handler custom_handler.py --extra-files ds-config.json -r requirements.txt --config-file opt/model-config.yaml --archive-format no-archive
```

After the `opt/` folder is created, download the OPT-30b model to that folder using the PyTorch [Download_model](https://github.com/pytorch/serve/blob/master/examples/large_models/utils/Download_model.py) tool.

```
cd opt
python path_to/Download_model.py --model_path model --model_name facebook/opt-30b --revision main
```

Lastly, upload the model artifacts to an Amazon S3 bucket.

```
aws s3 cp opt {your_s3_bucket}/opt --recursive
```

Your model artifacts are now stored in Amazon S3 and are ready to deploy to a SageMaker AI endpoint.

## Deploy the model using the SageMaker Python SDK
<a name="large-model-inference-tutorials-torchserve-deploy"></a>

After you prepare your model artifacts, you can deploy your model to a SageMaker AI hosting endpoint. This section describes how to deploy a single large model to an endpoint and make streaming response predictions. For more information about streaming responses from endpoints, see [Invoke real-time endpoints](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-test-endpoints.html).

To deploy your model, complete the following steps:

1. Create a SageMaker AI session, as shown in the following example.

   ```
   import boto3
   import sagemaker
   from sagemaker import Model, image_uris, serializers, deserializers
   
   boto3_session = boto3.session.Session(region_name="us-west-2")
   # Create both clients from the same session so that they share its Region
   smr = boto3_session.client("sagemaker-runtime")
   sm = boto3_session.client("sagemaker")
   role = sagemaker.get_execution_role()  # execution role for the endpoint
   sess= sagemaker.session.Session(boto3_session, sagemaker_client=sm, sagemaker_runtime_client=smr)  # SageMaker AI session for interacting with different AWS APIs
   region = sess._region_name  # region name of the current SageMaker Studio Classic environment
   account = sess.account_id()  # account_id of the current SageMaker Studio Classic environment
   
   # Configuration:
   bucket_name = sess.default_bucket()
   prefix = "torchserve"
   output_path = f"s3://{bucket_name}/{prefix}"
   print(f'account={account}, region={region}, role={role}, output_path={output_path}')
   ```

1. Create an uncompressed model in SageMaker AI, as shown in the following example.

   ```
   from datetime import datetime
   
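   instance_type = "ml.g5.24xlarge"
   endpoint_name = sagemaker.utils.name_from_base("ts-opt-30b")
   # Replace {your_s3_bucket} with your S3 bucket URI; the prefix ends with "/"
   s3_uri = "{your_s3_bucket}/opt/"
   # Example PyTorch inference DLC taken from the table above; adjust the tag as needed
   container = f"763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-inference:2.0.1-gpu-py310-cu118-ubuntu20.04-sagemaker"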
   
   model = Model(
       name="torchserve-opt-30b" + datetime.now().strftime("%Y-%m-%d-%H-%M-%S"),
       # Enable SageMaker uncompressed model artifacts
       model_data={
           "S3DataSource": {
                   "S3Uri": s3_uri,
                   "S3DataType": "S3Prefix",
                   "CompressionType": "None",
           }
       },
       image_uri=container,
       role=role,
       sagemaker_session=sess,
       env={"TS_INSTALL_PY_DEP_PER_MODEL": "true"},
   )
   print(model)
   ```

1. Deploy the model to an Amazon EC2 instance, as shown in the following example.

   ```
   model.deploy(
       initial_instance_count=1,
       instance_type=instance_type,
       endpoint_name=endpoint_name,
       volume_size=512, # increase the size to store large model
       model_data_download_timeout=3600, # increase the timeout to download large model
       container_startup_health_check_timeout=600, # increase the timeout to load large model
   )
   ```

1. Initialize a class to process the streaming response, as shown in the following example.

   ```
   import io
   
   class Parser:
       """
       A helper class for parsing the byte stream input. 
       
       The output of the model will be in the following format:
       ```
       b'{"outputs": [" a"]}\n'
       b'{"outputs": [" challenging"]}\n'
       b'{"outputs": [" problem"]}\n'
       ...
       ```
       
       While usually each PayloadPart event from the event stream will contain a byte array 
       with a full json, this is not guaranteed and some of the json objects may be split across
       PayloadPart events. For example:
       ```
       {'PayloadPart': {'Bytes': b'{"outputs": '}}
       {'PayloadPart': {'Bytes': b'[" problem"]}\n'}}
       ```
       
       This class accounts for this by concatenating bytes written via the 'write' function
       and then exposing a method which will return lines (ending with a '\n' character) within
       the buffer via the 'scan_lines' function. It maintains the position of the last read 
       position to ensure that previous bytes are not exposed again. 
       """
       
       def __init__(self):
           self.buff = io.BytesIO()
           self.read_pos = 0
           
       def write(self, content):
           # Append the new bytes to the end of the buffer
           self.buff.seek(0, io.SEEK_END)
           self.buff.write(content)
           
       def scan_lines(self):
           # Yield only complete lines; a partial trailing line stays buffered
           self.buff.seek(self.read_pos)
           for line in self.buff.readlines():
               if line[-1:] == b'\n':
                   self.read_pos += len(line)
                   yield line[:-1]
                   
       def reset(self):
           self.read_pos = 0
   ```

1. Test a streaming response prediction, as shown in the following example.

   ```
   import json
   
   body = "Today the weather is really nice and I am planning on".encode('utf-8')
   resp = smr.invoke_endpoint_with_response_stream(EndpointName=endpoint_name, Body=body, ContentType="application/json")
   event_stream = resp['Body']
   parser = Parser()
   for event in event_stream:
       parser.write(event['PayloadPart']['Bytes'])
       for line in parser.scan_lines():
           print(line.decode("utf-8"), end=' ')
   ```
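
To see the line-buffering technique work without a live endpoint, the following self-contained sketch replays a simulated event stream in which one JSON object is split across two `PayloadPart` events. The `LineBuffer` class is a hypothetical stand-in that follows the same buffering idea as the parser class above. The event contents are made up for illustration and are not real model output.

```python
import io
import json


class LineBuffer:
    """Accumulates bytes and yields only complete, newline-terminated lines."""

    def __init__(self):
        self.buff = io.BytesIO()
        self.read_pos = 0

    def write(self, content):
        # Always append at the end of the buffer.
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    def scan_lines(self):
        # Resume from the last fully consumed line.
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            if line.endswith(b"\n"):
                self.read_pos += len(line)
                yield line[:-1]


# Simulated event stream; the second JSON object is split across two events.
events = [
    {"PayloadPart": {"Bytes": b'{"outputs": [" a"]}\n'}},
    {"PayloadPart": {"Bytes": b'{"outputs": '}},
    {"PayloadPart": {"Bytes": b'[" problem"]}\n'}},
]

buffer = LineBuffer()
tokens = []
for event in events:
    buffer.write(event["PayloadPart"]["Bytes"])
    for line in buffer.scan_lines():
        tokens.extend(json.loads(line)["outputs"])

print("".join(tokens))
```

Because an incomplete line is never yielded, each line passed to `json.loads` is a whole JSON object, even when the stream delivers it in pieces.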

You have now deployed your model to a SageMaker AI endpoint and should be able to invoke it for responses. For more information about SageMaker AI real-time endpoints, see [Single-model endpoints](realtime-single-model.md).