本文為英文版的機器翻譯版本，如內容有任何歧義或不一致之處，概以英文版為準。

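# Custom Docker containers with SageMaker AI
<a name="docker-containers-adapt-your-own"></a>

You can adapt an existing Docker image to work with SageMaker AI. You might need to use an existing external Docker image when you have a container that satisfies feature or safety requirements that the pre-built SageMaker AI images don't currently support. Two toolkits let you bring your own container and adapt it to work with SageMaker AI:
+ [SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit) – Use this toolkit with SageMaker AI to train models.
+ [SageMaker AI Inference Toolkit](https://github.com/aws/sagemaker-inference-toolkit) – Use this toolkit with SageMaker AI to deploy models.

The following topics show how to adapt an existing image using the SageMaker Training and Inference toolkits: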

**Topics**
+ [Individual framework libraries](#docker-containers-adapt-your-own-frameworks)
+ [SageMaker Training and Inference toolkits](amazon-sagemaker-toolkits.md)
+ [Adapting your own training container](adapt-training-container.md)
+ [Adapt your own inference container for Amazon SageMaker AI](adapt-inference-container.md)

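## Individual framework libraries
<a name="docker-containers-adapt-your-own-frameworks"></a>

In addition to the SageMaker Training Toolkit and the SageMaker AI Inference Toolkit, SageMaker AI also provides toolkits specialized for TensorFlow, MXNet, PyTorch, and Chainer. The following table links to the GitHub repositories that contain the source code for each framework and its respective serving toolkit. The linked instructions describe how to use the Python SDK to run training algorithms and host models on SageMaker AI. The functionality of these individual libraries is included in the SageMaker Training Toolkit and the SageMaker AI Inference Toolkit.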


| Framework | Toolkit source code | 
| --- | --- | 
| TensorFlow |  [SageMaker AI TensorFlow Training](https://github.com/aws/sagemaker-tensorflow-training-toolkit) [SageMaker AI TensorFlow Serving](https://github.com/aws/sagemaker-tensorflow-serving-container)  | 
| MXNet |  [SageMaker AI MXNet Training](https://github.com/aws/sagemaker-mxnet-training-toolkit) [SageMaker AI MXNet Inference](https://github.com/aws/sagemaker-mxnet-inference-toolkit)  | 
| PyTorch |  [SageMaker AI PyTorch Training](https://github.com/aws/sagemaker-pytorch-training-toolkit) [SageMaker AI PyTorch Inference](https://github.com/aws/sagemaker-pytorch-inference-toolkit)  | 
| Chainer |  [SageMaker AI Chainer Container](https://github.com/aws/sagemaker-chainer-container)  | 

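# SageMaker Training and Inference toolkits
<a name="amazon-sagemaker-toolkits"></a>

The [SageMaker Training](https://github.com/aws/sagemaker-training-toolkit) and [SageMaker AI Inference](https://github.com/aws/sagemaker-inference-toolkit) toolkits implement the functionality that you need to adapt your containers to run scripts, train algorithms, and deploy models on SageMaker AI. When installed, the library defines the following for users:
+ The locations for storing code and other resources.
+ The entry point that contains the code to run when the container is started. Your Dockerfile must copy the code that needs to be run into the location expected by a container that is compatible with SageMaker AI.
+ Other information that a container needs to manage deployments for training and inference.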

## SageMaker AI toolkits container structure
<a name="sagemaker-toolkits-structure"></a>

When SageMaker AI trains a model, it creates the following file folder structure in the container's `/opt/ml` directory.

```
/opt/ml
├── input
│   ├── config
│   │   ├── hyperparameters.json
│   │   └── resourceConfig.json
│   └── data
│       └── <channel_name>
│           └── <input data>
├── model
│
├── code
│
├── output
│
└── failure
```

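When you run a model *training* job, the SageMaker AI container uses the `/opt/ml/input/` directory, which contains the JSON files that configure the hyperparameters for the algorithm and the network layout used for distributed training. The `/opt/ml/input/` directory also contains files that specify the channels through which SageMaker AI accesses the data, which is stored in Amazon Simple Storage Service (Amazon S3). The SageMaker AI containers library places the scripts that the container will run in the `/opt/ml/code/` directory. Your script should write the model generated by your algorithm to the `/opt/ml/model/` directory. For more information, see [Containers with custom training algorithms](your-algorithms-training-algo.md).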
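
As a rough illustration of the layout described above, the following sketch reads the two JSON files under `input/config` and lists the channel directories under `input/data`. The function name `load_training_config` and the choice to treat a missing file as an empty configuration are assumptions made for this example; they are not part of either toolkit. Hyperparameter values generally arrive as strings, so a real script would convert them as needed.

```python
import json
from pathlib import Path


def load_training_config(base_dir="/opt/ml"):
    """Collect the training inputs found under the /opt/ml folder structure."""
    base = Path(base_dir)
    config_dir = base / "input" / "config"
    data_dir = base / "input" / "data"

    def read_json(path):
        # Assumption of this sketch: a missing file means "no settings".
        return json.loads(path.read_text()) if path.exists() else {}

    # Each subdirectory of input/data corresponds to one input channel.
    channels = []
    if data_dir.exists():
        channels = sorted(p.name for p in data_dir.iterdir() if p.is_dir())

    return {
        "hyperparameters": read_json(config_dir / "hyperparameters.json"),
        "resource_config": read_json(config_dir / "resourceConfig.json"),
        "channels": channels,
        "model_dir": str(base / "model"),
    }
```

Passing a different `base_dir` makes it possible to exercise the same code against a local copy of the folder structure before building the container.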

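When you *host* a trained model on SageMaker AI to make inferences, you deploy the model to an HTTP endpoint. The model makes real-time predictions in response to inference requests. The container must contain a serving stack to process these requests.

In a hosting or batch transform container, the model files are located in the same folder to which they were written during training.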

```
/opt/ml/model
│
└── <model files>
```

For more information, see [Containers with custom inference code](your-algorithms-inference-main.md).

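## Single versus multiple containers
<a name="sagemaker-toolkits-separate-images"></a>

You can provide separate Docker images for the training algorithm and the inference code, or you can use a single Docker image for both. When you create Docker images for use with SageMaker AI, consider the following:
+ Providing two Docker images can increase storage requirements and cost because common libraries might be duplicated.
+ In general, smaller containers start faster for both training and hosting. Models train faster, and the hosting service can react to increased traffic by automatically scaling more quickly.
+ You might be able to write an inference container that is significantly smaller than the training container. This is especially common when you use GPUs for training but your inference code is optimized for CPUs.
+ SageMaker AI requires that Docker containers run without privileged access.
+ Both the Docker containers that you build and those provided by SageMaker AI can send messages to the `Stdout` and `Stderr` files. SageMaker AI sends these messages to Amazon CloudWatch logs in your AWS account.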

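For more information about how to create SageMaker AI containers and how scripts are run inside them, see the [SageMaker AI Training Toolkit](https://github.com/aws/sagemaker-training-toolkit) and [SageMaker AI Inference Toolkit](https://github.com/aws/sagemaker-inference-toolkit) repositories on GitHub. These repositories also list important environment variables and the environment variables provided by SageMaker AI containers.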

# Adapting your own training container
<a name="adapt-training-container"></a>

To run your own training model, build a Docker container using the [Amazon SageMaker Training Toolkit](https://github.com/aws/sagemaker-training-toolkit) through an Amazon SageMaker notebook instance.

## Step 1: Create a SageMaker notebook instance
<a name="byoc-training-step1"></a>

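1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Notebook**, choose **Notebook instances**, and then choose **Create notebook instance**.

1. On the **Create notebook instance** page, provide the following information:

   1. For **Notebook instance name**, enter **RunScriptNotebookInstance**.

   1. For **Notebook instance type**, choose **ml.t2.medium**.

   1. In the **Permissions and encryption** section, do the following:

      1. For **IAM role**, choose **Create a new role**. This opens a new window.

      1. On the **Create an IAM role** page, choose **Specific S3 buckets**, specify an Amazon S3 bucket named **sagemaker-run-script**, and then choose **Create role**.

         SageMaker AI creates an IAM role named `AmazonSageMaker-ExecutionRole-YYYYMMDDTHHmmSS`. For example, `AmazonSageMaker-ExecutionRole-20190429T110788`. Note that the execution role naming convention uses the date and time at which the role was created, separated by a `T`.

   1. For **Root access**, choose **Enable**.

   1. Choose **Create notebook instance**.

1. On the **Notebook instances** page, the **Status** is **Pending**. It can take a few minutes for Amazon SageMaker AI to launch a machine learning compute instance (in this case, a notebook instance) and attach an ML storage volume to it. The notebook instance has a preconfigured Jupyter notebook server and a set of Anaconda libraries. For more information, see [CreateNotebookInstance](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateNotebookInstance.html).

1. Choose the **Name** of the notebook instance that you just created. This opens a new page.

1.  In the **Permissions and encryption** section, copy **the IAM role ARN number** and paste it into a notepad file to save it temporarily. You use this IAM role ARN number later to configure a local training estimator in the notebook instance. **The IAM role ARN number** looks like the following: `'arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-20190429T110788'`

1. After the status of the notebook instance changes to **InService**, choose **Open JupyterLab**.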

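## Step 2: Create and upload the Dockerfile and Python training scripts
<a name="byoc-training-step2"></a>

1. After JupyterLab opens, create a new folder in the JupyterLab home directory. In the upper-left corner, choose the **New Folder** icon, and then enter the folder name `docker_test_folder`.

1. In the `docker_test_folder` directory, create a `Dockerfile` text file.

   1. Choose the **New Launcher** icon (+) in the upper-left corner.

   1. In the right pane, under the **Other** section, choose **Text File**.

   1. Paste the following `Dockerfile` sample code into your text file.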

      ```
      #Download an open source TensorFlow Docker image
      FROM tensorflow/tensorflow:latest-gpu-jupyter
      
      # Install sagemaker-training toolkit that contains the common functionality necessary to create a container compatible with SageMaker AI and the Python SDK.
      RUN pip3 install sagemaker-training
      
      # Copies the training code inside the container
      COPY train.py /opt/ml/code/train.py
      
      # Defines train.py as script entrypoint
      ENV SAGEMAKER_PROGRAM train.py
      ```

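      The Dockerfile script performs the following tasks:
      + `FROM tensorflow/tensorflow:latest-gpu-jupyter` – Downloads the latest TensorFlow Docker base image. You can replace it with any Docker base image that you want to use to build containers, including AWS pre-built container base images.
      + `RUN pip3 install sagemaker-training` – Installs the [SageMaker AI Training Toolkit](https://github.com/aws/sagemaker-training-toolkit), which contains the common functionality necessary to create a container compatible with SageMaker AI.
      + `COPY train.py /opt/ml/code/train.py` – Copies the script to the location inside the container that SageMaker AI expects. The script must be located in this folder.
      + `ENV SAGEMAKER_PROGRAM train.py` – Sets your training script `train.py`, copied into the container's `/opt/ml/code` folder, as the entry point script. This is the only environment variable that you must specify when you build your own container.

   1.  In the left directory navigation pane, the text file name might automatically be set to `untitled.txt`. To rename the file, right-click it, choose **Rename**, rename the file to `Dockerfile` without the `.txt` extension, and then press `Ctrl+s` or `Command+s` to save the file.

1. Upload a training script `train.py` to `docker_test_folder`. For this exercise, you can use the following example script to create a model that reads handwritten digits, trained on the [MNIST dataset](https://en.wikipedia.org/wiki/MNIST_database).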

   ```
   import tensorflow as tf
   import os
   
   mnist = tf.keras.datasets.mnist
   
   (x_train, y_train), (x_test, y_test) = mnist.load_data()
   x_train, x_test = x_train / 255.0, x_test / 255.0
   
   model = tf.keras.models.Sequential([
       tf.keras.layers.Flatten(input_shape=(28, 28)),
       tf.keras.layers.Dense(128, activation='relu'),
       tf.keras.layers.Dropout(0.2),
       tf.keras.layers.Dense(10, activation='softmax')
   ])
   
   model.compile(optimizer='adam',
                 loss='sparse_categorical_crossentropy',
                 metrics=['accuracy'])
   
   model.fit(x_train, y_train, epochs=1)
   model_save_dir = f"{os.environ.get('SM_MODEL_DIR')}/1"
   
   model.evaluate(x_test, y_test)
   tf.saved_model.save(model, model_save_dir)
   ```
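
In the `train.py` example, `os.environ.get('SM_MODEL_DIR')` returns `None` when the variable is not set, for instance when the script is run outside of a SageMaker AI container, and the save path then becomes `None/1`. The following is a minimal sketch of one way to guard against that. The helper name `resolve_model_dir` and the fallback to `/opt/ml/model` are assumptions made for this example, based on the folder structure described earlier.

```python
import os


def resolve_model_dir(version="1", default_root="/opt/ml/model"):
    """Return the versioned directory where the trained model is saved."""
    # SM_MODEL_DIR is read from the environment, as train.py does.
    # Falling back to default_root when it is unset is an assumption
    # of this sketch, not documented toolkit behavior.
    root = os.environ.get("SM_MODEL_DIR") or default_root
    return f"{root.rstrip('/')}/{version}"
```

In `train.py`, `model_save_dir = resolve_model_dir()` could then replace the f-string that builds the path.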

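## Step 3: Build the container
<a name="byoc-training-step3"></a>

1. In the JupyterLab home directory, open a Jupyter notebook. To open a new notebook, choose the **New Launch** icon, and then choose the latest version of **conda_tensorflow2** in the **Notebook** section.

1. Run the following command in the first notebook cell to change to the `docker_test_folder` directory: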

   ```
   cd ~/SageMaker/docker_test_folder
   ```

   The following command returns your current directory:

   ```
   ! pwd
   ```

   `output: /home/ec2-user/SageMaker/docker_test_folder`

1. To build the Docker container, run the following Docker build command, including the space followed by a period at the end:

   ```
   ! docker build -t tf-custom-container-test .
   ```

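   The Docker build command must be run from the Docker directory that you created, in this case `docker_test_folder`.
**Note**  
If you get the following error message, Docker cannot find the Dockerfile. Make sure that the Dockerfile has the correct name and has been saved to the directory.  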

   ```
   unable to prepare context: unable to evaluate symlinks in Dockerfile path: 
   lstat /home/ec2-user/SageMaker/docker/Dockerfile: no such file or directory
   ```
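Remember that `docker` looks in the current directory for a file named exactly `Dockerfile`, without any extension. If you named it something else, you can pass the file name manually with the `-f` flag. For example, if you named your Dockerfile `Dockerfile-text.txt`, run the following command:  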

   ```
   ! docker build -t tf-custom-container-test -f Dockerfile-text.txt .
   ```
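
The two forms of the build command above differ only in the optional `-f` flag. As a small sketch, the helper below assembles the argument list for either case. The name `docker_build_command` is made up for this example, and the function only builds the command; it does not run Docker.

```python
import shlex


def docker_build_command(tag, context=".", dockerfile=None):
    """Return the argument list for a `docker build` invocation."""
    command = ["docker", "build", "-t", tag]
    if dockerfile is not None:
        # -f points Docker at a file that is not named exactly "Dockerfile".
        command += ["-f", dockerfile]
    # The build context (the trailing period in the examples) comes last.
    command.append(context)
    return command


print(shlex.join(docker_build_command("tf-custom-container-test")))
print(shlex.join(docker_build_command("tf-custom-container-test",
                                      dockerfile="Dockerfile-text.txt")))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from a directory that contains the Dockerfile.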

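## Step 4: Test the container
<a name="byoc-training-step4"></a>

1. To test the container locally in the notebook instance, open a Jupyter notebook. Choose **New Launcher**, and then choose the latest version of **conda_tensorflow2** in the **Notebook** section.

1. Paste the following example script into the notebook code cell to configure a SageMaker AI estimator.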

   ```
   import sagemaker
   from sagemaker.estimator import Estimator
   
   estimator = Estimator(image_uri='tf-custom-container-test',
                         role=sagemaker.get_execution_role(),
                         instance_count=1,
                         instance_type='local')
   
   estimator.fit()
   ```

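   In the preceding code example, the `role` argument is set to `sagemaker.get_execution_role()` to automatically retrieve the role configured for the SageMaker AI session. You can also replace it with the string value of the **IAM role ARN number** that you used when you configured the notebook instance. The ARN should look like the following: `'arn:aws:iam::111122223333:role/service-role/AmazonSageMaker-ExecutionRole-20190429T110788'`.

1. Run the code cell. This test outputs the training environment configuration, the values used for the environment variables, the source of the data, and the loss and accuracy obtained during training.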
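
If you paste the role ARN as a string instead of calling `sagemaker.get_execution_role()`, a quick format check can catch copy-and-paste mistakes before the estimator is created. The following is only a sketch: the pattern and the helper name `parse_role_arn` are assumptions written for this example, modeled on the ARN shown above, and are not an official validation rule.

```python
import re

# Pattern modeled on the example ARN in this guide (an assumption of this sketch).
ROLE_ARN_PATTERN = re.compile(
    r"^arn:(?P<partition>aws[a-z-]*):iam::(?P<account>\d{12}):role/"
    r"(?P<path>(?:[\w+=,.@-]+/)*)(?P<name>[\w+=,.@-]+)$"
)


def parse_role_arn(arn):
    """Split an IAM role ARN into partition, account, path, and name."""
    match = ROLE_ARN_PATTERN.match(arn)
    if match is None:
        raise ValueError(f"not a recognizable IAM role ARN: {arn!r}")
    return match.groupdict()


parts = parse_role_arn(
    "arn:aws:iam::111122223333:role/service-role/"
    "AmazonSageMaker-ExecutionRole-20190429T110788"
)
print(parts["account"], parts["name"])
```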

## Step 5: Push the container to Amazon Elastic Container Registry (Amazon ECR)
<a name="byoc-training-step5"></a>

1. After you successfully run this local mode test, you can push the Docker container to [Amazon ECR](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) and use it to run training jobs. If you want to use a private Docker registry instead of Amazon ECR, see [Push your training container to a private registry](https://docs.aws.amazon.com/sagemaker/latest/dg/docker-containers-adapt-your-own-private-registry.html).

   Run the following command lines in a notebook cell.

   ```
   %%sh
   
   # Specify an algorithm name
   algorithm_name=tf-custom-container-test
   
   account=$(aws sts get-caller-identity --query Account --output text)
   
   # Get the region defined in the current configuration (default to us-west-2 if none defined)
   region=$(aws configure get region)
   region=${region:-us-west-2}
   
   fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"
   
   # If the repository doesn't exist in ECR, create it.
   
   aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
   if [ $? -ne 0 ]
   then
   aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
   fi
   
   # Get the login command from ECR and execute it directly
   
   aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}
   
   # Build the docker image locally with the image name and then push it to ECR
   # with the full name.
   
   docker build -t ${algorithm_name} .
   docker tag ${algorithm_name} ${fullname}
   
   docker push ${fullname}
   ```
**Note**  
This bash shell script might raise a permissions issue with an error message similar to the following:  

   ```
   "denied: User: [ARN] is not authorized to perform: ecr:InitiateLayerUpload on resource:
   arn:aws:ecr:us-east-1:[id]:repository/tf-custom-container-test"
   ```
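If this error occurs, you need to attach the **AmazonEC2ContainerRegistryFullAccess** policy to your IAM role. Go to the [IAM console](https://console.aws.amazon.com/iam/home), choose **Roles** from the left navigation pane, and look up the IAM role that you used for the notebook instance. Under the **Permissions** tab, choose the **Attach policies** button, and search for the **AmazonEC2ContainerRegistryFullAccess** policy. Select the check box for the policy, and then choose **Add permissions** to finish.

1. Run the following code in a Studio notebook cell to retrieve the Amazon ECR image of your training container.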

   ```
   import boto3
   
   account_id = boto3.client('sts').get_caller_identity().get('Account')
   ecr_repository = 'tf-custom-container-test'
   tag = ':latest'
   
   region = boto3.session.Session().region_name
   
   uri_suffix = 'amazonaws.com'
   if region in ['cn-north-1', 'cn-northwest-1']:
       uri_suffix = 'amazonaws.com.cn'
   
   byoc_image_uri = '{}.dkr.ecr.{}.{}/{}'.format(account_id, region, uri_suffix, ecr_repository + tag)
   
   byoc_image_uri
   # This should return something like
   # 111122223333.dkr.ecr.us-east-2.amazonaws.com/tf-custom-container-test:latest
   ```

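1. Use the `byoc_image_uri` retrieved in the previous step to configure a SageMaker AI estimator object. The following code samples configure a SageMaker AI estimator with `byoc_image_uri` and start a training job on an Amazon EC2 instance.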

------
#### [ SageMaker Python SDK v1 ]

   ```
   import sagemaker
   from sagemaker import get_execution_role
   from sagemaker.estimator import Estimator
   
   # SDK v1 uses image_name and the train_-prefixed instance arguments
   estimator = Estimator(image_name=byoc_image_uri,
                         role=get_execution_role(),
                         base_job_name='tf-custom-container-test-job',
                         train_instance_count=1,
                         train_instance_type='ml.g4dn.xlarge')
   
   #train your model
   estimator.fit()
   ```

------
#### [ SageMaker Python SDK v2 ]

   ```
   import sagemaker
   from sagemaker import get_execution_role
   from sagemaker.estimator import Estimator
   
   estimator = Estimator(image_uri=byoc_image_uri,
                         role=get_execution_role(),
                         base_job_name='tf-custom-container-test-job',
                         instance_count=1,
                         instance_type='ml.g4dn.xlarge')
   
   #train your model
   estimator.fit()
   ```

------

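1. If you want to deploy your model using your own container, see [Adapting your own inference container](https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-inference-container.html). You can also use an AWS framework container that can deploy a TensorFlow model. To deploy the example model that reads handwritten digits, enter the following example script into the same notebook that you used to train your model in the previous sub-step. The script obtains the image URI (uniform resource identifier) needed for deployment and then deploys the model.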

   ```
   import boto3
   import sagemaker
   
   #obtain image uris
   from sagemaker import image_uris
   container = image_uris.retrieve(framework='tensorflow',region='us-west-2',version='2.11.0',
                       image_scope='inference',instance_type='ml.g4dn.xlarge')
   
   #create the model entity, endpoint configuration and endpoint
   predictor = estimator.deploy(1,instance_type='ml.g4dn.xlarge',image_uri=container)
   ```

   Use the following code example to test your model with an example handwritten digit from the MNIST dataset.

   ```
   #Retrieve an example test dataset to test
   import numpy as np
   import matplotlib.pyplot as plt
   from keras.datasets import mnist
   
   # Load the MNIST dataset and split it into training and testing sets
   (x_train, y_train), (x_test, y_test) = mnist.load_data()
   # Select a random example from the training set
   example_index = np.random.randint(0, x_train.shape[0])
   example_image = x_train[example_index]
   example_label = y_train[example_index]
   
   # Print the label and show the image
   print(f"Label: {example_label}")
   plt.imshow(example_image, cmap='gray')
   plt.show()
   ```

   Convert the test handwritten digit into a form that TensorFlow can ingest, and make a test prediction.

   ```
   from sagemaker.serializers import JSONSerializer
   # "instances" is a list of examples: wrap the single image and scale it as in train.py
   data = {"instances": [(example_image / 255.0).tolist()]}
   predictor.serializer=JSONSerializer() #update the predictor to use the JSONSerializer
   predictor.predict(data) #make the prediction
   ```
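
The request body for the prediction is a JSON object whose `instances` field holds a list of examples, so a single 28x28 image needs one extra level of nesting, and its pixel values should be scaled the same way as in `train.py`. The following sketch shows that shaping with plain Python lists. The helper names `build_predict_request` and `predicted_digit` are made up for this example, and the `predictions` key reflects the TensorFlow Serving row format, which is assumed to be what the endpoint returns.

```python
def build_predict_request(image, scale=255.0):
    """Wrap one 28x28 image (nested lists) in a row-format request body."""
    # Scale pixels to [0, 1] to match the preprocessing in train.py.
    scaled = [[pixel / scale for pixel in row] for row in image]
    # "instances" is a list of examples, so one image needs an outer list.
    return {"instances": [scaled]}


def predicted_digit(response):
    """Return the index of the highest score for the first example."""
    scores = response["predictions"][0]
    return max(range(len(scores)), key=scores.__getitem__)
```

With the objects from the steps above, this could be used as `predicted_digit(predictor.predict(build_predict_request(example_image.tolist())))`.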

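For a complete example that shows how to test a custom container locally and push it to an Amazon ECR image, see the [Building Your Own TensorFlow Container](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/tensorflow_bring_your_own/tensorflow_bring_your_own.html) example notebook.

**Tip**  
To profile and debug training jobs so that you can monitor system utilization issues (such as CPU bottlenecks and GPU underutilization) and identify training issues (such as overfitting, overtraining, exploding tensors, and vanishing gradients), use Amazon SageMaker Debugger. For more information, see [Use Debugger with custom training containers](debugger-bring-your-own-container.md).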

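## Step 6: Clean up resources
<a name="byoc-training-step6"></a>

**To clean up resources when you are done with the getting started example**

1. Open the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/), select the notebook instance **RunScriptNotebookInstance**, choose **Actions**, and then choose **Stop**. It can take a few minutes for the instance to stop.

1. After the instance **Status** changes to **Stopped**, choose **Actions**, choose **Delete**, and then choose **Delete** in the dialog box. It can take a few minutes for the instance to be deleted. The notebook instance disappears from the table when it has been deleted.

1. Open the [Amazon S3 console](https://console.aws.amazon.com/s3/) and delete the bucket that you created for storing model artifacts and the training dataset.

1. Open the [IAM console](https://console.aws.amazon.com/iam/) and delete the IAM role. If you created permission policies, you can delete them, too.
**Note**  
 The Docker container shuts down automatically after it has run. You don't need to delete it.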

## Blogs and case studies
<a name="byoc-blogs-and-examples"></a>

The following blog discusses a case study about using custom training containers in Amazon SageMaker AI.
+ [Why bring your own container to Amazon SageMaker AI and how to do it right](https://medium.com/@pandey.vikesh/why-bring-your-own-container-to-amazon-sagemaker-and-how-to-do-it-right-bc158fe41ed1), *Medium* (January 20, 2023)

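# Adapt your training job to access images in a private Docker registry
<a name="docker-containers-adapt-your-own-private-registry"></a>

You can use a private [Docker registry](https://docs.docker.com/registry/) instead of Amazon Elastic Container Registry (Amazon ECR) to host your images for SageMaker AI training. The following instructions show how to create a Docker registry, configure your virtual private cloud (VPC) and training job, store images, and give SageMaker AI access to the training image in the private Docker registry. These instructions also show how to use a Docker registry that requires authentication for a SageMaker training job.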

## Create and store your images in a private Docker registry
<a name="docker-containers-adapt-your-own-private-registry-prerequisites"></a>

Create a private Docker registry to store your images. Your registry must:
+ Use the [Docker Registry HTTP API](https://docs.docker.com/registry/spec/api/) protocol.
+ Be accessible from the same VPC that is specified in the [VpcConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html#API_CreateTrainingJob_RequestSyntax) parameter of the `CreateTrainingJob` API. You provide `VpcConfig` when you create your training job.
+ Be secured with a [TLS certificate](https://aws.amazon.com/what-is/ssl-certificate/) from a known public certificate authority.

For more information about creating a Docker registry, see [Deploy a registry server](https://docs.docker.com/registry/deploying/).

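## Configure your VPC and SageMaker training job
<a name="docker-containers-adapt-your-own-private-registry-configure"></a>

SageMaker AI uses a network connection within your VPC to access images in your Docker registry. To use the images in your Docker registry for training, the registry must be accessible from an Amazon VPC in your account. For more information, see [Use a Docker registry that requires authentication for training](docker-containers-adapt-your-own-private-registry-authentication.md).

You must also configure your training job to connect to the same VPC from which your Docker registry is accessible. For more information, see [Configure a training job for Amazon VPC access](https://docs.aws.amazon.com/sagemaker/latest/dg/train-vpc.html#train-vpc-configure).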

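## Create a training job using an image from your private Docker registry
<a name="docker-containers-adapt-your-own-private-registry-create"></a>

To use an image from your private Docker registry for training, use the following guide to configure your image and then configure and create a training job. The code examples that follow use the AWS SDK for Python (Boto3) client.

1. Create a training image configuration object and set the `TrainingRepositoryAccessMode` field to `Vpc`, as follows.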

   ```
   training_image_config = {
       'TrainingRepositoryAccessMode': 'Vpc'
   }
   ```
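**Note**  
If your private Docker registry requires authentication, you must add a `TrainingRepositoryAuthConfig` object to the training image configuration object. You must also specify the Amazon Resource Name (ARN) of an AWS Lambda function that provides access credentials to SageMaker AI, using the `TrainingRepositoryCredentialsProviderArn` field of the `TrainingRepositoryAuthConfig` object. For more information, see the example code structure below.  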

   ```
   training_image_config = {
      'TrainingRepositoryAccessMode': 'Vpc',
      'TrainingRepositoryAuthConfig': {
           'TrainingRepositoryCredentialsProviderArn': 'arn:aws:lambda:Region:Acct:function:FunctionName'
      }
   }
   ```

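   For information about how to create the Lambda function that provides authentication, see [Use a Docker registry that requires authentication for training](docker-containers-adapt-your-own-private-registry-authentication.md).

1. Use a Boto3 client to create a training job and pass the correct configuration to the [create_training_job](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html) API. The following instructions show how to configure the components and create a training job.

   1. Create the `AlgorithmSpecification` object that you want to pass to `create_training_job`. Use the training image configuration object that you created in the previous step, as shown in the following code example.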

      ```
      algorithm_specification = {
         'TrainingImage': 'myteam.myorg.com/docker-local/my-training-image:<IMAGE-TAG>',
         'TrainingImageConfig': training_image_config,
         'TrainingInputMode': 'File'
      }
      ```
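**Note**  
To use a fixed version of an image rather than one that can be updated, refer to the image by its [digest](https://docs.docker.com/engine/reference/commandline/pull/#pull-an-image-by-digest-immutable-identifier) instead of by name or tag.

   1. Specify the name of the training job and the role that you want to pass to `create_training_job`, as shown in the following code example.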

      ```
      training_job_name = 'private-registry-job'
      execution_role_arn = 'arn:aws:iam::123456789012:role/SageMakerExecutionRole'
      ```

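   1. Specify a security group and subnets for the VPC configuration of your training job. Your private Docker registry must allow inbound traffic from the security groups that you specify, as shown in the following code example.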

      ```
      vpc_config = {
          'SecurityGroupIds': ['sg-0123456789abcdef0'],
          'Subnets': ['subnet-0123456789abcdef0','subnet-0123456789abcdef1']
      }
      ```
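**Note**  
If your subnet is not in the same VPC as your private Docker registry, you must set up a networking connection between the two VPCs. For more information, see Connect VPCs using [VPC peering](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-peering.html).

   1. Specify the resource configuration, including the machine learning compute instances and storage volumes to use for training, as shown in the following code example.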

      ```
      resource_config = {
          'InstanceType': 'ml.m4.xlarge',
          'InstanceCount': 1,
          'VolumeSizeInGB': 10,
      }
      ```

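   1. Specify the input and output data configuration: where the training dataset is stored and where you want to store model artifacts, as shown in the following code example.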

      ```
      input_data_config = [
          {
              "ChannelName": "training",
              "DataSource":
              {
                  "S3DataSource":
                  {
                      "S3DataDistributionType": "FullyReplicated",
                      "S3DataType": "S3Prefix",
                      "S3Uri": "s3://your-training-data-bucket/training-data-folder"
                  }
              }
          }
      ]
      
      output_data_config = {
          'S3OutputPath': 's3://your-output-data-bucket/model-folder'
      }
      ```

   1. Specify the maximum number of seconds that a model training job can run, as shown in the following code example.

      ```
      stopping_condition = {
          'MaxRuntimeInSeconds': 1800
      }
      ```

   1. Finally, create the training job using the parameters that you specified in the previous steps, as shown in the following code example.

      ```
      import boto3
      sm = boto3.client('sagemaker')
      try:
          resp = sm.create_training_job(
              TrainingJobName=training_job_name,
              AlgorithmSpecification=algorithm_specification,
              RoleArn=execution_role_arn,
              InputDataConfig=input_data_config,
              OutputDataConfig=output_data_config,
              ResourceConfig=resource_config,
              VpcConfig=vpc_config,
              StoppingCondition=stopping_condition
          )
      except Exception as e:
          print(f'error calling CreateTrainingJob operation: {e}')
      else:
          print(resp)
      ```
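
The training image configuration in the steps above comes in two variants, with and without authentication. As a sketch of how the two can be produced from one place, the helpers below build the `TrainingImageConfig` and `AlgorithmSpecification` dictionaries from plain arguments. The function names are assumptions made for this example; the dictionary keys are the ones shown in the preceding steps.

```python
def build_training_image_config(credentials_provider_arn=None):
    """Return the TrainingImageConfig for a private Docker registry."""
    config = {"TrainingRepositoryAccessMode": "Vpc"}
    if credentials_provider_arn:
        # Only registries that require authentication need this block.
        config["TrainingRepositoryAuthConfig"] = {
            "TrainingRepositoryCredentialsProviderArn": credentials_provider_arn
        }
    return config


def build_algorithm_specification(image, credentials_provider_arn=None):
    """Return the AlgorithmSpecification passed to create_training_job."""
    return {
        "TrainingImage": image,
        "TrainingImageConfig": build_training_image_config(credentials_provider_arn),
        "TrainingInputMode": "File",
    }


spec = build_algorithm_specification(
    "myteam.myorg.com/docker-local/my-training-image:<IMAGE-TAG>"
)
print(spec["TrainingImageConfig"])
```

The returned dictionary can be passed as `AlgorithmSpecification=spec` in the `create_training_job` call shown above.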

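# Use a SageMaker AI estimator to run a training job
<a name="docker-containers-adapt-your-own-private-registry-estimator"></a>

You can also use an [estimator](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) from the SageMaker Python SDK to handle the configuration and running of your SageMaker training job. The following code examples show how to configure and run an estimator using images from a private Docker registry.

1. Import the required libraries and dependencies, as shown in the following code example.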

   ```
   import boto3
   import sagemaker
   from sagemaker.estimator import Estimator
   
   session = sagemaker.Session()
   
   role = sagemaker.get_execution_role()
   ```

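1. Provide the Uniform Resource Identifier (URI) of your training image, and the security groups and subnets for the VPC configuration of your training job, as shown in the following code example.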

   ```
   image_uri = "myteam.myorg.com/docker-local/my-training-image:<IMAGE-TAG>"
   
   security_groups = ["sg-0123456789abcdef0"]
   subnets = ["subnet-0123456789abcdef0", "subnet-0123456789abcdef1"]
   ```

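   For more information about `security_group_ids` and `subnets`, see the description of the corresponding parameters in the [Estimators](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) section of the SageMaker Python SDK.
**Note**  
SageMaker AI uses a network connection within your VPC to access images in your Docker registry. To use the images in your Docker registry for training, the registry must be accessible from an Amazon VPC in your account.

1. (Optional) If your Docker registry requires authentication, you must also specify the Amazon Resource Name (ARN) of an AWS Lambda function that provides access credentials to SageMaker AI. The following code example shows how to specify the ARN.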

   ```
   training_repository_credentials_provider_arn = "arn:aws:lambda:us-west-2:123456789012:function:test"
   ```

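   For more information about using images in a Docker registry that requires authentication, see **Use a Docker registry that requires authentication for training** below.

1. Configure an estimator using the values from the previous steps, as shown in the following code example.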

   ```
   # The training repository access mode must be 'Vpc' for private docker registry jobs 
   training_repository_access_mode = "Vpc"
   
   # Specify the instance type, instance count you want to use
   instance_type="ml.m5.xlarge"
   instance_count=1
   
   # Specify the maximum number of seconds that a model training job can run
   max_run_time = 1800
   
   # Specify the output path for the model artifacts
   output_path = "s3://your-output-bucket/your-output-path"
   
   estimator = Estimator(
       image_uri=image_uri,
       role=role,
       subnets=subnets,
       security_group_ids=security_groups,
       training_repository_access_mode=training_repository_access_mode,
       training_repository_credentials_provider_arn=training_repository_credentials_provider_arn,  # remove this line if auth is not needed
       instance_type=instance_type,
       instance_count=instance_count,
       output_path=output_path,
       max_run=max_run_time
   )
   ```

1. Start your training job by calling `estimator.fit` with your job name and input path as parameters, as shown in the following code example.

   ```
   input_path = "s3://your-input-bucket/your-input-path"
   job_name = "your-job-name"
   
   estimator.fit(
       inputs=input_path,
       job_name=job_name
   )
   ```
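
The estimator configuration above asks you to remove one argument by hand when the registry does not need authentication. One possible alternative, sketched below, is to assemble the keyword arguments in a helper and add the credentials provider only when it is given. The helper name `build_estimator_kwargs` is an assumption for this example; the argument names are the ones used in the estimator example above.

```python
def build_estimator_kwargs(image_uri, role, subnets, security_group_ids,
                           output_path, credentials_provider_arn=None,
                           instance_type="ml.m5.xlarge", instance_count=1,
                           max_run=1800):
    """Collect Estimator arguments for a private Docker registry job."""
    kwargs = {
        "image_uri": image_uri,
        "role": role,
        "subnets": subnets,
        "security_group_ids": security_group_ids,
        # The access mode must be 'Vpc' for private Docker registry jobs.
        "training_repository_access_mode": "Vpc",
        "instance_type": instance_type,
        "instance_count": instance_count,
        "output_path": output_path,
        "max_run": max_run,
    }
    if credentials_provider_arn:
        # Added only when the registry requires authentication.
        kwargs["training_repository_credentials_provider_arn"] = credentials_provider_arn
    return kwargs
```

The estimator can then be created with `Estimator(**build_estimator_kwargs(...))`.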

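# Use a Docker registry that requires authentication for training
<a name="docker-containers-adapt-your-own-private-registry-authentication"></a>

If your Docker registry requires authentication, you must create an AWS Lambda function that provides access credentials to SageMaker AI. Then, create a training job and provide the ARN of this Lambda function in the [create_training_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_training_job) API. Lastly, you can optionally create an interface VPC endpoint so that your VPC can communicate with your Lambda function without sending traffic over the internet. The following guide shows how to create a Lambda function, assign it the correct role, and create an interface VPC endpoint.

## Create the Lambda function
<a name="docker-containers-adapt-your-own-private-registry-authentication-create-lambda"></a>

Create an AWS Lambda function that passes access credentials to SageMaker AI and returns a response. The following code example creates the Lambda function handler.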

```
def handler(event, context):
   response = {
      "Credentials": {"Username": "username", "Password": "password"}
   }
   return response
```

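The type of authentication configured for your private Docker registry determines the contents of the response returned by your Lambda function, as follows.
+ If your private Docker registry uses basic authentication, the Lambda function returns the username and password needed to authenticate to the registry.
+ If your private Docker registry uses [bearer token authentication](https://docs.docker.com/registry/spec/auth/token/), the username and password are sent to your authorization server, which returns a bearer token. This token is then used to authenticate to your private Docker registry.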
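
The handler shown earlier hardcodes the username and password. The following sketch keeps the same response shape but reads the values from the function's environment instead. The variable names `REGISTRY_USERNAME` and `REGISTRY_PASSWORD` are hypothetical, and using environment variables is only an assumption chosen to keep the example self-contained; a dedicated secrets store is a more typical home for a password.

```python
import os


def handler(event, context):
    """Return registry credentials in the shape shown in the handler above."""
    # REGISTRY_USERNAME and REGISTRY_PASSWORD are hypothetical names
    # chosen for this sketch.
    username = os.environ.get("REGISTRY_USERNAME")
    password = os.environ.get("REGISTRY_PASSWORD")
    if not username or not password:
        # Fail loudly rather than return an incomplete response.
        raise RuntimeError("registry credentials are not configured")
    return {"Credentials": {"Username": username, "Password": password}}
```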

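**Note**  
If you have more than one Lambda function for your registries in the same account, and the execution role is the same for your training jobs, then training jobs for one registry have access to the Lambda functions for the other registries.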

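## Grant the correct role permission to your Lambda function
<a name="docker-containers-adapt-your-own-private-registry-authentication-lambda-role"></a>

The [IAM role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) that you use in the `create_training_job` API must have permission to call an AWS Lambda function. The following code example shows how to extend the permissions policy of an IAM role so that it can call `myLambdaFunction`.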

```
{
    "Effect": "Allow",
    "Action": [
        "lambda:InvokeFunction"
    ],
    "Resource": [
        "arn:aws:lambda:*:*:function:*myLambdaFunction*"
    ]
}
```
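
The JSON above is a single statement, which has to sit inside a policy document before it can be attached to a role. The sketch below wraps it in a document with the standard `Version` and `Statement` fields. The helper name `lambda_invoke_policy` is an assumption for this example, and the resource pattern is the one from the statement above.

```python
import json


def lambda_invoke_policy(function_name_pattern):
    """Return a policy document that allows invoking matching Lambda functions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["lambda:InvokeFunction"],
                "Resource": [
                    f"arn:aws:lambda:*:*:function:{function_name_pattern}"
                ],
            }
        ],
    }


print(json.dumps(lambda_invoke_policy("*myLambdaFunction*"), indent=4))
```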

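For more information about editing a role permissions policy, see [Modifying a role permissions policy (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/roles-managingrole-editing-console.html#roles-modify_permissions-policy) in the *AWS Identity and Access Management User Guide*.

**Note**  
An IAM role with the **AmazonSageMakerFullAccess** managed policy attached has permission to call any Lambda function with "SageMaker" in its name.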

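## Create an interface VPC endpoint for Lambda
<a name="docker-containers-adapt-your-own-private-registry-authentication-lambda-endpoint"></a>

If you create an interface endpoint, your Amazon VPC can communicate with your Lambda function without sending traffic over the internet. For more information, see [Configuring interface VPC endpoints for Lambda](https://docs.aws.amazon.com/lambda/latest/dg/configuration-vpc-endpoints.html) in the *AWS Lambda Developer Guide*.

After you create the interface endpoint, SageMaker training calls your Lambda function by sending a request through your VPC to `lambda.region.amazonaws.com`. If you select **Enable DNS name** when you create the interface endpoint, [Amazon Route 53](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/Welcome.html) routes the call to the Lambda interface endpoint. If you use a different DNS provider, you must map `lambda.region.amazonaws.com` to your Lambda interface endpoint.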

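# Adapt your own inference container for Amazon SageMaker AI
<a name="adapt-inference-container"></a>

If you can't use any of the images listed in [Pre-built SageMaker AI Docker images](docker-containers-prebuilt.md) for your use case, you can build your own Docker container and use it in SageMaker AI for training and inference. To be compatible with SageMaker AI, your container must have the following characteristics:
+ Your container must have a web server listening on port `8080`.
+ Your container must accept `POST` requests to the `/invocations` real-time endpoint and `GET` requests to the `/ping` health check endpoint. Responses to these requests must be returned within 60 seconds for regular responses and within 8 minutes for streaming responses, and they have a maximum size of 25 MB.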
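
To make these requirements concrete, the following is a minimal sketch of a server that answers `/ping` and `/invocations` using only the Python standard library. It is not the NGINX, Gunicorn, and Flask stack used in the rest of this guide, and the "model" is a placeholder that echoes the `input` field back. The names `InferenceHandler` and `make_server` are made up for this example, and `/ping` is served over `GET` to match the `predictor.py` script later in this guide.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    """Answers GET /ping and POST /invocations with JSON responses."""

    def _send(self, status, body=b""):
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Health check: 200 tells the caller that the container is ready.
        self._send(200 if self.path == "/ping" else 404, b"\n")

    def do_POST(self):
        if self.path != "/invocations":
            self._send(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            text = json.loads(self.rfile.read(length))["input"]
        except (ValueError, KeyError, TypeError):
            self._send(400, b'{"error": "expected a JSON body with an input field"}')
            return
        # Placeholder for a real model: echo the input back.
        self._send(200, json.dumps({"output": text}).encode("utf-8"))


def make_server(port=8080):
    """Create the server without starting it; 8080 is the required port."""
    return HTTPServer(("0.0.0.0", port), InferenceHandler)
```

Calling `make_server().serve_forever()` starts the server on port `8080`; a production container would typically use a more capable serving stack, such as the one described in the rest of this guide.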

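For more information and an example of how to build your own Docker container for training and inference with SageMaker AI, see [Building your own algorithm container](https://github.com/aws/amazon-sagemaker-examples/blob/main/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb).

The following guide shows how to use a `JupyterLab` space with Amazon SageMaker Studio Classic to adapt an inference container to work with SageMaker AI hosting. The example uses an NGINX web server, Gunicorn as a Python web server gateway interface, and Flask as a web application framework. You can use different applications to adapt your container, as long as it meets the previously listed requirements. For more information about using your own inference code, see [Custom inference code with hosting services](your-algorithms-inference-code.md).

**Adapt your inference container**

Use the following steps to adapt your own inference container to work with SageMaker AI hosting. The example in the following steps uses a pre-trained [Named Entity Recognition (NER) model](https://spacy.io/universe/project/video-spacys-ner-model-alt) that uses the [spaCy](https://spacy.io/) natural language processing (NLP) library for `Python`, together with the following:
+ A Dockerfile to build the container that contains the NER model.
+ Inference scripts to serve the NER model.

If you adapt this example for your use case, you must use the Dockerfile and inference scripts that are needed to deploy and serve your model.

1. (Optional) Create a JupyterLab space with Amazon SageMaker Studio Classic.

   You can use any notebook to run the scripts that adapt your inference container to SageMaker AI hosting. This example shows how to use a JupyterLab space in Amazon SageMaker Studio Classic to launch a JupyterLab application that comes with a SageMaker AI Distribution image. For more information, see [SageMaker JupyterLab](studio-updated-jl.md).

1. Upload a Dockerfile and inference scripts.

   1. Create a new folder in your home directory. If you're using JupyterLab, choose the **New Folder** icon in the upper-left corner, and enter a name for the folder that will contain your Dockerfile. In this example, the folder is called `docker_test_folder`.

   1. Upload a Dockerfile text file into your new folder. The following example Dockerfile creates a Docker container with a pre-trained [Named Entity Recognition (NER) model](https://spacy.io/universe/project/video-spacys-ner-model) from [spaCy](https://spacy.io/), along with the applications and environment variables needed to run the example: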

      ```
      FROM python:3.8
      
      RUN apt-get -y update && apt-get install -y --no-install-recommends \
               wget \
               python3 \
               nginx \
               ca-certificates \
          && rm -rf /var/lib/apt/lists/*
      
      RUN wget https://bootstrap.pypa.io/get-pip.py && python3 get-pip.py && \
          pip install flask gevent gunicorn && \
              rm -rf /root/.cache
      
      #pre-trained model package installation
      RUN pip install spacy
      RUN python -m spacy download en_core_web_sm
      
      
      # Set environment variables
      ENV PYTHONUNBUFFERED=TRUE
      ENV PYTHONDONTWRITEBYTECODE=TRUE
      ENV PATH="/opt/program:${PATH}"
      
      COPY NER /opt/program
      WORKDIR /opt/program
      ```

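      In the previous code example, the environment variable `PYTHONUNBUFFERED` keeps Python from buffering the standard output stream, which allows logs to be delivered to the user more quickly. The environment variable `PYTHONDONTWRITEBYTECODE` keeps Python from writing compiled bytecode `.pyc` files, which are unnecessary for this use case. The environment variable `PATH` is used to identify the location of the `train` and `serve` programs when the container is invoked.

   1. Create a new directory inside your new folder to contain the scripts that serve your model. This example uses a directory called `NER`, which contains the following files needed to run this example:
      + `predictor.py` – A Python script that contains the logic to load your model and perform inference with it.
      + `nginx.conf` – A configuration file for the web server.
      + `serve` – A script that starts the inference server.
      + `wsgi.py` – A helper script to serve the model.
**Important**  
If you copy your inference scripts into a notebook ending in `.ipynb` and then rename them, your scripts might contain formatting characters that prevent your endpoint from deploying. Instead, create a text file and rename it.

   1. Upload a script to make your model available for inference. The following example script, called `predictor.py`, uses Flask to provide the `/ping` and `/invocations` endpoints: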

      ```
      from flask import Flask
      import flask
      import spacy
      import os
      import json
      import logging
      
      #Load in model
      nlp = spacy.load('en_core_web_sm') 
      #If you plan to use your own model artifacts, 
      #your model artifacts should be stored in /opt/ml/model/ 
      
      
      # The flask app for serving predictions
      app = Flask(__name__)
      @app.route('/ping', methods=['GET'])
      def ping():
          # Check if the classifier was loaded correctly
          health = nlp is not None
          status = 200 if health else 404
          return flask.Response(response= '\n', status=status, mimetype='application/json')
      
      
      @app.route('/invocations', methods=['POST'])
      def transformation():
          
          #Process input
          input_json = flask.request.get_json()
          resp = input_json['input']
          
          #NER
          doc = nlp(resp)
          entities = [(X.text, X.label_) for X in doc.ents]
      
          # Transform predictions to JSON
          result = {
              'output': entities
              }
      
          resultjson = json.dumps(result)
          return flask.Response(response=resultjson, status=200, mimetype='application/json')
      ```

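      The `/ping` endpoint in the previous script example returns a status code of `200` if the model is loaded correctly, and `404` if it is not. The `/invocations` endpoint processes a request formatted in JSON, extracts the `input` field, and uses the NER model to identify entities and store them in the variable `entities`. The Flask application returns a response that contains these entities. For more information about these required health requests, see [How your container should respond to health check (ping) requests](your-algorithms-inference-code.md#your-algorithms-inference-algo-ping-requests).

   1. Upload a script to start the inference server. The following example script, called `serve`, uses Gunicorn as an application server and Nginx as a web server: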

      ```
      #!/usr/bin/env python
      
      # This file implements the scoring service shell. You don't necessarily need to modify it for various
      # algorithms. It starts nginx and gunicorn with the correct configurations and then simply waits until
      # gunicorn exits.
      #
      # The flask server is specified to be the app object in wsgi.py
      #
      # We set the following parameters:
      #
      # Parameter                Environment Variable              Default Value
      # ---------                --------------------              -------------
      # number of workers        MODEL_SERVER_WORKERS              the number of CPU cores
      # timeout                  MODEL_SERVER_TIMEOUT              60 seconds
      
      import multiprocessing
      import os
      import signal
      import subprocess
      import sys
      
      cpu_count = multiprocessing.cpu_count()
      
      model_server_timeout = os.environ.get('MODEL_SERVER_TIMEOUT', 60)
      model_server_workers = int(os.environ.get('MODEL_SERVER_WORKERS', cpu_count))
      
      def sigterm_handler(nginx_pid, gunicorn_pid):
          try:
              os.kill(nginx_pid, signal.SIGQUIT)
          except OSError:
              pass
          try:
              os.kill(gunicorn_pid, signal.SIGTERM)
          except OSError:
              pass
      
          sys.exit(0)
      
      def start_server():
          print('Starting the inference server with {} workers.'.format(model_server_workers))
      
      
          # link the log streams to stdout/err so they will be logged to the container logs
          subprocess.check_call(['ln', '-sf', '/dev/stdout', '/var/log/nginx/access.log'])
          subprocess.check_call(['ln', '-sf', '/dev/stderr', '/var/log/nginx/error.log'])
      
          nginx = subprocess.Popen(['nginx', '-c', '/opt/program/nginx.conf'])
          gunicorn = subprocess.Popen(['gunicorn',
                                       '--timeout', str(model_server_timeout),
                                       '-k', 'sync',
                                       '-b', 'unix:/tmp/gunicorn.sock',
                                       '-w', str(model_server_workers),
                                       'wsgi:app'])
      
          signal.signal(signal.SIGTERM, lambda a, b: sigterm_handler(nginx.pid, gunicorn.pid))
      
          # Exit the inference server upon exit of either subprocess
          pids = set([nginx.pid, gunicorn.pid])
          while True:
              pid, _ = os.wait()
              if pid in pids:
                  break
      
          sigterm_handler(nginx.pid, gunicorn.pid)
          print('Inference server exiting')
      
      # The main routine to invoke the start function.
      
      if __name__ == '__main__':
          start_server()
      ```

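      The preceding example script defines a signal handler function, `sigterm_handler`, which shuts down the Nginx and Gunicorn subprocesses when it receives a `SIGTERM` signal. The `start_server` function registers the signal handler, starts and monitors the Nginx and Gunicorn subprocesses, and links the log streams to the container's standard output and standard error.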
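
      To make the environment-variable handling easier to reason about, the following sketch isolates how the Gunicorn command line is derived from `MODEL_SERVER_TIMEOUT` and `MODEL_SERVER_WORKERS`. The helper name `build_gunicorn_command` is hypothetical, and unlike the `serve` script it converts the timeout to an integer so that a malformed value fails early; treat it as an illustration rather than a replacement for the script.

      ```python
      import multiprocessing
      import os


      def build_gunicorn_command(environ=None):
          """Return the Gunicorn argument list that the serve script would launch."""
          environ = os.environ if environ is None else environ
          # Fall back to the same defaults as the serve script:
          # 60 seconds and one worker per CPU core.
          timeout = int(environ.get("MODEL_SERVER_TIMEOUT", 60))
          workers = int(environ.get("MODEL_SERVER_WORKERS", multiprocessing.cpu_count()))
          return [
              "gunicorn",
              "--timeout", str(timeout),
              "-k", "sync",
              "-b", "unix:/tmp/gunicorn.sock",
              "-w", str(workers),
              "wsgi:app",
          ]
      ```

      Passing a plain dictionary, such as `build_gunicorn_command({"MODEL_SERVER_WORKERS": "2"})`, lets you inspect the resulting command without changing the process environment.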

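   1. Upload a script to configure your web server. The following example file, named `nginx.conf`, configures an Nginx web server that uses Gunicorn as the application server to serve your model for inference: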

      ```
      worker_processes 1;
      daemon off; # Prevent forking
      
      
      pid /tmp/nginx.pid;
      error_log /var/log/nginx/error.log;
      
      events {
        # defaults
      }
      
      http {
        include /etc/nginx/mime.types;
        default_type application/octet-stream;
        access_log /var/log/nginx/access.log combined;
        
        upstream gunicorn {
          server unix:/tmp/gunicorn.sock;
        }
      
        server {
          listen 8080 deferred;
          client_max_body_size 5m;
      
          keepalive_timeout 5;
          proxy_read_timeout 1200s;
      
          location ~ ^/(ping|invocations) {
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header Host $http_host;
            proxy_redirect off;
            proxy_pass http://gunicorn;
          }
      
          location / {
            return 404 "{}";
          }
        }
      }
      ```

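      The preceding example configures Nginx to run in the foreground, sets the location where the `error_log` is captured, and defines `upstream` as the Gunicorn server's socket. The `server` block listens on port `8080` and sets limits on the client request body size and on timeout values. It forwards requests whose paths begin with `/ping` or `/invocations` to the Gunicorn upstream at `http://gunicorn`, and returns a `404` error for all other paths.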
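
      The routing rule can be approximated in a few lines of Python. This sketch reuses the regular expression from the `location ~ ^/(ping|invocations)` block; the function name `route` and its string return values are illustrative only, and the sketch ignores details such as how Nginx normalizes the URI before matching.

      ```python
      import re

      # Same pattern as the `location ~ ^/(ping|invocations)` block.
      PROXIED = re.compile(r"^/(ping|invocations)")


      def route(path):
          """Return 'gunicorn' for proxied paths, otherwise the 404 fallback."""
          if PROXIED.search(path):
              return "gunicorn"
          return "404"
      ```

      Because the pattern is anchored only at the start, it is a prefix match: a path such as `/pingpong` is also forwarded to Gunicorn, while `/api/ping` is not.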

   1. Upload any other scripts needed to serve your model. This example needs the following script, named `wsgi.py`, to help Gunicorn find your application:

      ```
      import predictor as myapp
      
      # This is just a simple wrapper for gunicorn to find your app.
      # If you want to change the algorithm file, simply change "predictor" above to the
      # new file.
      
      app = myapp.app
      ```

   From the folder `docker_test_folder`, your directory structure should contain a Dockerfile and the folder NER. The NER folder should contain the files `nginx.conf`, `predictor.py`, `serve`, and `wsgi.py`, as follows:

    ![\[The Dockerfile structure has inference scripts under the NER directory next to the Dockerfile.\]](http://docs.aws.amazon.com/zh_tw/sagemaker/latest/dg/images/docker-file-struct-adapt-ex.png) 
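
   Before building, you might want to confirm that the layout matches. The following sketch checks a build folder for the Dockerfile and the four files in the NER folder; the helper name `missing_build_files` is hypothetical.

   ```python
   from pathlib import Path

   # Files that this example expects inside the NER folder.
   REQUIRED_FILES = ["nginx.conf", "predictor.py", "serve", "wsgi.py"]


   def missing_build_files(root):
       """Return the relative paths that are missing from the build folder."""
       root = Path(root)
       expected = [Path("Dockerfile")] + [Path("NER") / name for name in REQUIRED_FILES]
       return [path.as_posix() for path in expected if not (root / path).is_file()]
   ```

   An empty list from `missing_build_files("docker_test_folder")` means that every expected file is in place.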

1. Build your own container.

   From the folder `docker_test_folder`, build your Docker container. The following example command builds the Docker container that is configured in your Dockerfile:

   ```
   ! docker build -t byo-container-test .
   ```

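   The preceding command builds a container named `byo-container-test` from the current working directory. For more information about the Docker build parameters, see [Build arguments](https://docs.docker.com/build/guide/build-args/).
**Note**  
If you get the following error message, Docker cannot find the Dockerfile. Make sure that the Dockerfile has the correct name and has been saved to the directory.  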

   ```
   unable to prepare context: unable to evaluate symlinks in Dockerfile path:
   lstat /home/ec2-user/SageMaker/docker_test_folder/Dockerfile: no such file or directory
   ```
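Docker looks for a file named Dockerfile, without any extension, in the current directory. If you gave the file a different name, you can pass the file name manually with the `-f` flag. For example, if you named your Dockerfile `Dockerfile-text.txt`, build your Docker container using the `-f` flag followed by the file name, as follows:  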

   ```
   ! docker build -t byo-container-test -f Dockerfile-text.txt .
   ```

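1. Push your Docker image to Amazon Elastic Container Registry (Amazon ECR).

   In a notebook cell, push your Docker image to ECR. The following code example shows how to build your container locally, log in, and push it to ECR: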

   ```
   %%sh
   # Name of algo -> ECR
   algorithm_name=sm-pretrained-spacy
   
   #make serve executable
   chmod +x NER/serve
   account=$(aws sts get-caller-identity --query Account --output text)
   # Region, defaults to us-east-1
   region=$(aws configure get region)
   region=${region:-us-east-1}
   fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"
   # If the repository doesn't exist in ECR, create it.
   aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
   if [ $? -ne 0 ]
   then
       aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
   fi
   # Get the login command from ECR and execute it directly
   aws ecr get-login-password --region ${region}|docker login --username AWS --password-stdin ${fullname}
   # Build the docker image locally with the image name and then push it to ECR
   # with the full name.
   
   docker build  -t ${algorithm_name} .
   docker tag ${algorithm_name} ${fullname}
   
   docker push ${fullname}
   ```

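   The preceding example shows how to complete the following steps, which are necessary to push the example Docker container to ECR:

   1. Define the algorithm name as `sm-pretrained-spacy`.

   1. Make the `serve` file inside the NER folder executable.

   1. Set the AWS Region.

   1. Create an ECR repository if one doesn't already exist.

   1. Log in to ECR.

   1. Build the Docker container locally.

   1. Push the Docker image to ECR.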
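
   The image name used in these steps follows a fixed pattern. As a small sketch, the following functions compose the registry host and the full image URI the same way that the shell script builds `fullname`. The function names are hypothetical, and the `amazonaws.com` suffix is an assumption that matches the example; some AWS partitions use a different domain suffix.

   ```python
   def ecr_registry(account_id, region):
       """Return the ECR registry host for an account and Region."""
       return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


   def ecr_image_uri(account_id, region, repository, tag="latest"):
       """Return the full image URI, matching `fullname` in the shell script."""
       return f"{ecr_registry(account_id, region)}/{repository}:{tag}"
   ```

   For example, `ecr_image_uri("123456789012", "us-east-2", "sm-pretrained-spacy")` yields the same form of URI that the later model-creation step passes as the container image.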

1. Set up the SageMaker AI client.

   If you want to use SageMaker AI hosting services for inference, you must [create a model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_model.html), create an [endpoint configuration](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_endpoint_config.html#), and [create an endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_endpoint.html#). To get inferences from your endpoint, you can use the SageMaker AI boto3 Runtime client to invoke the endpoint. The following code shows how to set up both the SageMaker AI client and the SageMaker Runtime client using the [SageMaker AI boto3 client](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html):

   ```
   import boto3
   from sagemaker import get_execution_role
   
   sm_client = boto3.client(service_name='sagemaker')
   runtime_sm_client = boto3.client(service_name='sagemaker-runtime')
   
   account_id = boto3.client('sts').get_caller_identity()['Account']
   region = boto3.Session().region_name
   
   #used to store model artifacts which SageMaker AI will extract to /opt/ml/model in the container, 
   #in this example case we will not be making use of S3 to store the model artifacts
   #s3_bucket = '<S3Bucket>'
   
   role = get_execution_role()
   ```

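   The preceding code example does not use an Amazon S3 bucket. The bucket is included as a comment to show where you would store model artifacts.

   If you receive a permission error after you run the preceding code example, you might need to add permissions to your IAM role. For more information about IAM roles, see [Amazon SageMaker Role Manager](role-manager.md). For more information about adding permissions to your current role, see [AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md).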

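1. Create your model.

   If you want to use SageMaker AI hosting services for inference, you must create a model in SageMaker AI. The following code example shows how to create the spaCy NER model in SageMaker AI: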

   ```
   from time import gmtime, strftime
   
   model_name = 'spacy-nermodel-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
   # Model S3 URL containing model artifacts as either model.tar.gz or extracted artifacts.
   # This example does not store model artifacts in S3, so the URL stays commented out.
   #model_url = 's3://{}/spacy/'.format(s3_bucket) 
   
   container = '{}.dkr.ecr.{}.amazonaws.com/sm-pretrained-spacy:latest'.format(account_id, region)
   instance_type = 'ml.c5d.18xlarge'
   
   print('Model name: ' + model_name)
   #print('Model data Url: ' + model_url)
   print('Container image: ' + container)
   
   container = {
   'Image': container
   }
   
   create_model_response = sm_client.create_model(
       ModelName = model_name,
       ExecutionRoleArn = role,
       Containers = [container])
   
   print("Model Arn: " + create_model_response['ModelArn'])
   ```

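   The preceding code example shows how you would define `model_url` from `s3_bucket` if you were to use the Amazon S3 bucket from the comments in step 5, and it defines the ECR URI for the container image. The example defines `ml.c5d.18xlarge` as the instance type. You can also choose a different instance type. For more information about available instance types, see [Amazon EC2 instance types](https://aws.amazon.com/ec2/instance-types/).

   In the preceding code example, the `Image` key points to the container image URI. The `create_model_response` assignment calls the `create_model` method with the model name, the role, and a list that contains the container information. The response contains the model ARN.

   The following is example output from the preceding script: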

   ```
   Model name: spacy-nermodel-YYYY-MM-DD-HH-MM-SS
   Container image: 123456789012.dkr.ecr.us-east-2.amazonaws.com/sm-pretrained-spacy:latest
   Model Arn: arn:aws:sagemaker:us-east-2:123456789012:model/spacy-nermodel-YYYY-MM-DD-HH-MM-SS
   ```

1. Configure and create an endpoint.

   1. **Create an endpoint configuration.**

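      To use SageMaker AI hosting for inference, you must also configure and create an endpoint. SageMaker AI uses this endpoint for inference. The following configuration example shows how to generate and configure an endpoint using the instance type and model name that you defined previously: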

      ```
      endpoint_config_name = 'spacy-ner-config' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
      print('Endpoint config name: ' + endpoint_config_name)
      
      create_endpoint_config_response = sm_client.create_endpoint_config(
          EndpointConfigName = endpoint_config_name,
          ProductionVariants=[{
              'InstanceType': instance_type,
              'InitialInstanceCount': 1,
              'InitialVariantWeight': 1,
              'ModelName': model_name,
              'VariantName': 'AllTraffic'}])
              
      print("Endpoint config Arn: " + create_endpoint_config_response['EndpointConfigArn'])
      ```

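      In the preceding configuration example, `create_endpoint_config_response` associates `model_name` with a unique endpoint configuration name, `endpoint_config_name`, that is created with a timestamp.

      The following is example output from the preceding script: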

      ```
      Endpoint config name: spacy-ner-configYYYY-MM-DD-HH-MM-SS
      Endpoint config Arn: arn:aws:sagemaker:us-east-2:123456789012:endpoint-config/spacy-ner-configYYYY-MM-DD-HH-MM-SS
      ```

      For more information about endpoint errors, see [Why does my Amazon SageMaker AI endpoint go into the failed state when I create or update an endpoint?](https://repost.aws/knowledge-center/sagemaker-endpoint-creation-fail)
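
      The example uses a single variant named `AllTraffic`, so it receives every request. As a sketch of how variant weights translate into traffic when there is more than one variant, the following function divides each weight by the sum of all weights. The function name `traffic_shares` is hypothetical, and the two-variant input in the usage note is illustrative.

      ```python
      def traffic_shares(production_variants):
          """Map each variant name to its fraction of the total variant weight."""
          total = sum(v["InitialVariantWeight"] for v in production_variants)
          return {
              v["VariantName"]: v["InitialVariantWeight"] / total
              for v in production_variants
          }
      ```

      For instance, two variants with weights `3` and `1` would receive three quarters and one quarter of the traffic, respectively.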

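   1. **Create an endpoint and wait for the endpoint to be in service.**

      The following code example creates the endpoint and deploys the model using the configuration from the preceding configuration example: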

      ```
      %%time
      
      import time
      
      endpoint_name = 'spacy-ner-endpoint' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
      print('Endpoint name: ' + endpoint_name)
      
      create_endpoint_response = sm_client.create_endpoint(
          EndpointName=endpoint_name,
          EndpointConfigName=endpoint_config_name)
      print('Endpoint Arn: ' + create_endpoint_response['EndpointArn'])
      
      resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
      status = resp['EndpointStatus']
      print("Endpoint Status: " + status)
      
      print('Waiting for {} endpoint to be in service...'.format(endpoint_name))
      waiter = sm_client.get_waiter('endpoint_in_service')
      waiter.wait(EndpointName=endpoint_name)
      ```

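      In the preceding code example, the `create_endpoint` method creates the endpoint with the endpoint name generated earlier in the same example, and prints the Amazon Resource Name (ARN) of the endpoint. The `describe_endpoint` method returns information about the endpoint and its status. A SageMaker AI waiter then waits for the endpoint to be in service.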
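
      Conceptually, the waiter is a polling loop around `describe_endpoint`. The following simplified sketch shows that idea with an injected `describe_endpoint` callable so that it can run without AWS access. The function name `wait_until_in_service` and the `delay` and `max_attempts` defaults are illustrative assumptions, not the waiter's actual settings.

      ```python
      import time


      def wait_until_in_service(describe_endpoint, endpoint_name,
                                delay=30, max_attempts=120, sleep=time.sleep):
          """Poll describe_endpoint until the endpoint is InService."""
          for _ in range(max_attempts):
              status = describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
              if status == "InService":
                  return status
              if status == "Failed":
                  raise RuntimeError("Endpoint {} failed to deploy".format(endpoint_name))
              # Still creating or updating: wait before polling again.
              sleep(delay)
          raise TimeoutError("Endpoint {} was not in service in time".format(endpoint_name))
      ```

      In a notebook, you could pass `sm_client.describe_endpoint` as the first argument. In most cases the built-in waiter is the simpler choice.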

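1. Test your endpoint.

   After your endpoint is in service, send an [invocation request](https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/sagemaker-runtime.html#SageMakerRuntime.Client.invoke_endpoint) to your endpoint. The following code example shows how to send a test request to your endpoint: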

   ```
   import json
   content_type = "application/json"
   request_body = {"input": "This is a test with NER in America with \
       Amazon and Microsoft in Seattle, writing random stuff."}
   
   #Serialize data for endpoint
   #data = json.loads(json.dumps(request_body))
   payload = json.dumps(request_body)
   
   #Endpoint invocation
   response = runtime_sm_client.invoke_endpoint(
       EndpointName=endpoint_name,
       ContentType=content_type,
       Body=payload)
   
   #Parse results
   result = json.loads(response['Body'].read().decode())['output']
   result
   ```

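   In the preceding code example, the `json.dumps` method serializes `request_body` into a JSON-formatted string and saves it in the variable `payload`. The SageMaker AI Runtime client then uses the [invoke endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime/client/invoke_endpoint.html) method to send the payload to your endpoint. The `result` variable contains the response from your endpoint after the `output` field is extracted.

   The preceding code example should return the following output: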

   ```
   [['NER', 'ORG'],
    ['America', 'GPE'],
    ['Amazon', 'ORG'],
    ['Microsoft', 'ORG'],
    ['Seattle', 'GPE']]
   ```

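1. Delete your endpoint.

   After you have completed your invocations, delete your endpoint to conserve resources. The following code example shows how to delete your endpoint, along with its endpoint configuration and model: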

   ```
   sm_client.delete_endpoint(EndpointName=endpoint_name)
   sm_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
   sm_client.delete_model(ModelName=model_name)
   ```

   For a complete notebook that contains the code in this example, see [BYOC-Single-Model](https://github.com/aws-samples/sagemaker-hosting/tree/main/Bring-Your-Own-Container/BYOC-Single-Model).