# Autoscale an asynchronous endpoint
<a name="async-inference-autoscale"></a>

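Amazon SageMaker AI supports automatic scaling (autoscaling) of asynchronous endpoints. Autoscaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. Unlike other hosted models that Amazon SageMaker AI supports, with Asynchronous Inference you can also scale your asynchronous endpoint's instances down to zero. Requests received while there are zero instances are queued and processed once the endpoint scales up.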

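To autoscale your asynchronous endpoint, you must at a minimum:
+ Register a deployed model (production variant).
+ Define a scaling policy.
+ Apply the autoscaling policy.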

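Before you can use autoscaling, you must deploy a model to a SageMaker AI endpoint. Deployed models are referred to as [production variants](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html). For more information about deploying a model to an endpoint, see [Deploy the Model to SageMaker Hosting Services](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-model-deployment.html#ex1-deploy-model). To specify the metrics and target values for scaling, you configure a scaling policy. For information on how to define a scaling policy, see [Define a scaling policy](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling-add-code-define.html). After you register your model and define a scaling policy, apply the scaling policy to the registered model. For information on how to apply the scaling policy, see [Apply a scaling policy](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling-add-code-apply.html).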

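For more information on how to define an optional additional scaling policy that scales up your endpoint when a request arrives after the endpoint has been scaled down to zero, see [Optional: Define a scaling policy that scales up from zero for new requests](#async-inference-autoscale-scale-up). If you don't specify this optional policy, your endpoint starts scaling up from zero only after the number of backlog requests exceeds the target tracking value.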

For more details on other prerequisites and components used with autoscaling, see the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling-prerequisites.html) section in the SageMaker AI autoscaling documentation.

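**Note**  
If you attach multiple scaling policies to the same Auto Scaling group, you might have scaling conflicts. When a conflict occurs, Amazon EC2 Auto Scaling chooses the policy that provisions the largest capacity for both scale out and scale in. For more information about this behavior, see [Multiple dynamic scaling policies](https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-scale-based-on-demand.html#multiple-scaling-policy-resolution) in the *Amazon EC2 Auto Scaling documentation*.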

## Define a scaling policy
<a name="async-inference-autoscale-define-async"></a>

To specify the metrics and target values for scaling, you configure a target-tracking scaling policy. Define the scaling policy as a JSON block in a text file. You use that text file when you invoke the AWS CLI or the Application Auto Scaling API. For more information about the policy configuration syntax, see [TargetTrackingScalingPolicyConfiguration](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html) in the Application Auto Scaling API Reference.

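For asynchronous endpoints, SageMaker AI strongly recommends that you create a target-tracking scaling policy configuration for the variant. This configuration example uses a custom metric called `ApproximateBacklogSizePerInstance`, specified through `CustomizedMetricSpecification`.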

```
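# Placeholder: replace with the name of your asynchronous endpoint.
endpoint_name = '<endpoint_name>'

# Passed as the TargetTrackingScalingPolicyConfiguration argument of put_scaling_policy.
TargetTrackingScalingPolicyConfiguration = {
    'TargetValue': 5.0,  # The target value for the metric. Here the metric is: ApproximateBacklogSizePerInstance
    'CustomizedMetricSpecification': {
        'MetricName': 'ApproximateBacklogSizePerInstance',
        'Namespace': 'AWS/SageMaker',
        'Dimensions': [
            {'Name': 'EndpointName', 'Value': endpoint_name}
        ],
        'Statistic': 'Average',
    }
}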
```
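To show how such a configuration might be attached to a variant, the following is a minimal sketch. It assumes the variant has already been registered as a scalable target with Application Auto Scaling. The policy name `AsyncBacklog-TargetTracking-ScalingPolicy`, the helper function names, and the cooldown values are illustrative assumptions rather than values prescribed by this guide. The arguments are built separately from the API call, so they can be inspected without AWS credentials.

```python
def build_target_tracking_policy(endpoint_name, variant_name, target_backlog_per_instance=5.0):
    """Build keyword arguments for put_scaling_policy. No AWS call is made here."""
    return {
        "PolicyName": "AsyncBacklog-TargetTracking-ScalingPolicy",  # hypothetical name
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_backlog_per_instance,
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
                "Statistic": "Average",
            },
            "ScaleInCooldown": 600,   # seconds; illustrative value (assumption)
            "ScaleOutCooldown": 300,  # seconds; illustrative value (assumption)
        },
    }


def apply_policy(policy_kwargs):
    """Send the policy to Application Auto Scaling. Requires AWS credentials."""
    import boto3  # imported here so that building the arguments works without boto3

    client = boto3.client("application-autoscaling")
    return client.put_scaling_policy(**policy_kwargs)
```

For example, `apply_policy(build_target_tracking_policy("my-endpoint", "variant1"))` would submit the policy for a hypothetical endpoint named `my-endpoint`.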

## Define a scaling policy that scales to zero
<a name="async-inference-autoscale-define-async-zero"></a>

The following shows how to define and register your endpoint variant with Application Auto Scaling using the AWS SDK for Python (Boto3). After defining a low-level client object that represents Application Auto Scaling with Boto3, we use the [register_scalable_target](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/application-autoscaling.html#ApplicationAutoScaling.Client.register_scalable_target) method to register the production variant. We set `MinCapacity` to 0 because Asynchronous Inference lets you scale down to zero instances when there are no requests to process.

```
import boto3

# Placeholders: replace with your endpoint name and production variant name.
endpoint_name = '<endpoint_name>'
variant_name = 'variant1'

# Low-level client representing Application Auto Scaling
client = boto3.client('application-autoscaling')

# This is the format in which Application Auto Scaling references the endpoint
resource_id = 'endpoint/' + endpoint_name + '/variant/' + variant_name

# Define and register your endpoint variant
response = client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',  # The number of EC2 instances for your Amazon SageMaker model endpoint variant.
    MinCapacity=0,
    MaxCapacity=5
)
```

For a detailed description of the Application Auto Scaling API, see the [Application Auto Scaling Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/application-autoscaling.html#ApplicationAutoScaling.Client.register_scalable_target) documentation.
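After registering a variant, it can be useful to confirm that the capacity bounds were recorded as intended. The following is a minimal sketch that reads them back with `describe_scalable_targets`. The helper names `build_resource_id`, `capacity_bounds`, and `fetch_capacity_bounds` are hypothetical and introduced only for illustration.

```python
def build_resource_id(endpoint_name, variant_name):
    """Return the resource ID format that Application Auto Scaling uses for a variant."""
    return f"endpoint/{endpoint_name}/variant/{variant_name}"


def capacity_bounds(describe_response, resource_id):
    """Return (MinCapacity, MaxCapacity) for resource_id, or None if it is not registered."""
    for target in describe_response.get("ScalableTargets", []):
        if target["ResourceId"] == resource_id:
            return target["MinCapacity"], target["MaxCapacity"]
    return None


def fetch_capacity_bounds(endpoint_name, variant_name):
    """Query Application Auto Scaling for the variant. Requires AWS credentials."""
    import boto3  # imported here so the helpers above work without boto3

    client = boto3.client("application-autoscaling")
    resource_id = build_resource_id(endpoint_name, variant_name)
    response = client.describe_scalable_targets(
        ServiceNamespace="sagemaker",
        ResourceIds=[resource_id],
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    )
    return capacity_bounds(response, resource_id)
```

With the registration shown earlier, `fetch_capacity_bounds` would be expected to return `(0, 5)` for the registered variant, and `None` for a variant that has not been registered.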

## Optional: Define a scaling policy that scales up from zero for new requests
<a name="async-inference-autoscale-scale-up"></a>

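You might have a use case with sporadic requests or periods with low numbers of requests. If your endpoint has been scaled down to zero instances during these periods, it won't scale up again until the number of requests in the queue exceeds the target specified in your scaling policy. This can result in long waiting times for requests in the queue. The following section shows you how to create an additional scaling policy that scales your endpoint up from zero instances as soon as any new request arrives in the queue. Your endpoint can then respond to new requests more quickly instead of waiting for the queue size to exceed the specified target.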

To create a scaling policy for your endpoint that scales up from zero instances, do the following:

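1. Create a scaling policy that defines the desired behavior, which is to scale up your endpoint when it has zero instances but requests are in the queue. The following shows you how to define a scaling policy called `HasBacklogWithoutCapacity-ScalingPolicy` using the AWS SDK for Python (Boto3). When the queue size is greater than zero and the current instance count of your endpoint is zero, the policy scales up your endpoint. In all other cases, the policy does not affect scaling for your endpoint.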

   ```
   response = client.put_scaling_policy(
       PolicyName="HasBacklogWithoutCapacity-ScalingPolicy",
       ServiceNamespace="sagemaker",  # The namespace of the service that provides the resource.
       ResourceId=resource_id,  # Resource ID of the endpoint variant (endpoint/<endpoint_name>/variant/<variant_name>)
       ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only Instance Count
       PolicyType="StepScaling",  # 'StepScaling' or 'TargetTrackingScaling'
       StepScalingPolicyConfiguration={
           "AdjustmentType": "ChangeInCapacity",  # Specifies whether the ScalingAdjustment value in the StepAdjustment property is an absolute number or a percentage of the current capacity.
           "MetricAggregationType": "Average",  # The aggregation type for the CloudWatch metrics.
           "Cooldown": 300,  # The amount of time, in seconds, to wait for a previous scaling activity to take effect.
           # A set of adjustments that enable you to scale based on the size of the alarm breach.
           "StepAdjustments": [
               {
                   "MetricIntervalLowerBound": 0,
                   "ScalingAdjustment": 1
               }
           ]
       },
   )
   ```

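1. Create a CloudWatch alarm with the custom metric `HasBacklogWithoutCapacity`. When triggered, the alarm initiates the previously defined scaling policy. For more information about the `HasBacklogWithoutCapacity` metric, see [Asynchronous Inference endpoint metrics](async-inference-monitor.md#async-inference-monitor-cloudwatch-async).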

   ```
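   import boto3

   # Low-level client representing Amazon CloudWatch
   cw_client = boto3.client('cloudwatch')

   # ARN of the step scaling policy, returned by put_scaling_policy in the previous step
   step_scaling_policy_arn = response['PolicyARN']
   # Placeholder: replace with an alarm name of your choice
   step_scaling_policy_alarm_name = '<alarm_name>'

   response = cw_client.put_metric_alarm(
       AlarmName=step_scaling_policy_alarm_name,
       MetricName='HasBacklogWithoutCapacity',
       Namespace='AWS/SageMaker',
       Statistic='Average',
       EvaluationPeriods=2,
       DatapointsToAlarm=2,
       Threshold=1,
       ComparisonOperator='GreaterThanOrEqualToThreshold',
       TreatMissingData='missing',
       Dimensions=[
           {'Name': 'EndpointName', 'Value': endpoint_name},
       ],
       Period=60,
       AlarmActions=[step_scaling_policy_arn]
   )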
   ```

With the scaling policy and CloudWatch alarm in place, your endpoint should now scale up from zero instances whenever the queue has pending requests.
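To get a feel for how quickly the alarm reacts with the settings above (`Period=60`, `EvaluationPeriods=2`, `DatapointsToAlarm=2`, `Threshold=1`), the following is a simplified sketch of the "M out of N datapoints" evaluation. It is an illustrative model only, not CloudWatch's actual implementation: it ignores missing-data handling and metric delivery delays, and the function names are hypothetical.

```python
def alarm_fires(datapoints, threshold=1, evaluation_periods=2, datapoints_to_alarm=2):
    """Simplified model: the alarm fires if, among the most recent
    `evaluation_periods` datapoints, at least `datapoints_to_alarm`
    are greater than or equal to `threshold`."""
    window = datapoints[-evaluation_periods:]
    if len(window) < evaluation_periods:
        return False  # not enough data yet in this simplified model
    breaching = sum(1 for value in window if value >= threshold)
    return breaching >= datapoints_to_alarm


def minutes_until_alarm(per_minute_values, **alarm_settings):
    """Return the 1-based minute at which the alarm first fires, or None."""
    for minute in range(1, len(per_minute_values) + 1):
        if alarm_fires(per_minute_values[:minute], **alarm_settings):
            return minute
    return None


if __name__ == "__main__":
    # One HasBacklogWithoutCapacity value per 60-second period:
    # the backlog appears in minute 3 and persists.
    print(minutes_until_alarm([0, 0, 1, 1, 1]))  # prints 4
```

Under this model, a backlog has to persist for two consecutive one-minute periods before the alarm starts the step scaling policy, so a scale-up from zero begins roughly two minutes after the first request is queued, plus the time needed to launch an instance.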