

# Asynchronous inference


Amazon SageMaker Asynchronous Inference is a capability in SageMaker AI that queues incoming requests and processes them asynchronously. This option is ideal for requests with large payload sizes (up to 1GB), long processing times (up to one hour), and near real-time latency requirements. Asynchronous Inference enables you to save on costs by autoscaling the instance count to zero when there are no requests to process, so you only pay when your endpoint is processing requests.

## How It Works


Creating an asynchronous inference endpoint is similar to creating real-time inference endpoints. You can use your existing SageMaker AI models and only need to specify the `AsyncInferenceConfig` object while creating your endpoint configuration with the `EndpointConfig` field in the `CreateEndpointConfig` API. The following diagram shows the architecture and workflow of Asynchronous Inference.

![\[Architecture diagram of Asynchronous Inference showing how a user invokes an endpoint.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/async-architecture.png)


To invoke the endpoint, you need to place the request payload in Amazon S3. You also need to provide a pointer to this payload as a part of the `InvokeEndpointAsync` request. Upon invocation, SageMaker AI queues the request for processing and returns an identifier and output location as a response. Upon processing, SageMaker AI places the result in the Amazon S3 location. You can optionally choose to receive success or error notifications with Amazon SNS. For more information about how to set up asynchronous notifications, see [Check prediction results](async-inference-check-predictions.md).

**Note**  
The presence of an asynchronous inference configuration (`AsyncInferenceConfig`) object in the endpoint configuration implies that the endpoint can only receive asynchronous invocations.

## How Do I Get Started?


If you are a first-time user of Amazon SageMaker Asynchronous Inference, we recommend that you do the following:
+ Read [Asynchronous endpoint operations](async-inference-create-invoke-update-delete.md) for information on how to create, invoke, update, and delete an asynchronous endpoint.
+ Explore the [Asynchronous Inference example notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/async-inference/Async-Inference-Walkthrough.ipynb) in the [aws/amazon-sagemaker-examples](https://github.com/aws/amazon-sagemaker-examples) GitHub repository.

Note that if your endpoint uses any of the features listed in this [Exclusions](deployment-guardrails-exclusions.md) page, you cannot use Asynchronous Inference.

# Asynchronous endpoint operations


This guide describes the prerequisites you must satisfy to create an asynchronous endpoint, along with how to create, invoke, update, and delete your asynchronous endpoints. You can create, update, delete, and invoke asynchronous endpoints with the AWS SDKs and the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#sagemaker-asynchronous-inference).

**Topics**
+ [Complete the prerequisites](async-inference-create-endpoint-prerequisites.md)
+ [How to create an Asynchronous Inference Endpoint](async-inference-create-endpoint.md)
+ [Invoke an Asynchronous Endpoint](async-inference-invoke-endpoint.md)
+ [Update an Asynchronous Endpoint](async-inference-update-endpoint.md)
+ [Delete an Asynchronous Endpoint](async-inference-delete-endpoint.md)

# Complete the prerequisites


The following topic describes the prerequisites that you must complete before creating an asynchronous endpoint. These prerequisites include properly storing your model artifacts, configuring an AWS IAM role with the correct permissions, and selecting a container image.

**To complete the prerequisites**

1. **Create an IAM role for Amazon SageMaker AI.**

   Asynchronous Inference needs access to your Amazon S3 bucket URI. To facilitate this, create an IAM role that can run SageMaker AI and has permission to access Amazon S3 and Amazon SNS. Using this role, SageMaker AI can run under your account and access your Amazon S3 bucket and Amazon SNS topics.

   You can create an IAM role by using the IAM console, AWS SDK for Python (Boto3), or AWS CLI. The following is an example of how to create an IAM role and attach the necessary policies with the IAM console.

   1. Sign in to the AWS Management Console and open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

   1. In the navigation pane of the IAM console, choose **Roles**, and then choose **Create role**.

   1. For **Select type of trusted entity**, choose **AWS service**.

   1. Choose the service that you want to allow to assume this role. In this case, choose **SageMaker AI**. Then choose **Next: Permissions**.
      + This automatically creates an IAM policy that grants access to related services such as Amazon S3, Amazon ECR, and CloudWatch Logs.

   1. Choose **Next: Tags**.

   1. (Optional) Add metadata to the role by attaching tags as key–value pairs. For more information about using tags in IAM, see [Tagging IAM resources](https://docs.aws.amazon.com//IAM/latest/UserGuide/id_tags.html).

   1. Choose **Next: Review**.

   1. Type in a **Role name**. Role names must be unique within your AWS account and are not distinguished by case. For example, you cannot create roles named both `PRODROLE` and `prodrole`. Because other AWS resources might reference the role, you cannot change the name of the role after it has been created.

   1. (Optional) For **Role description**, type a description for the new role.

   1. Review the role and then choose **Create role**.

      Note the SageMaker AI role ARN. To find the role ARN using the console, do the following:

      1. Go to the IAM console: [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/)

      1. Select **Roles**.

      1. Search for the role you just created by typing in the name of the role in the search field.

      1. Select the role.

      1. The role ARN is at the top of the **Summary** page.

1. **Add Amazon SageMaker AI, Amazon S3, and Amazon SNS permissions to your IAM role.**

   Once the role is created, grant SageMaker AI, Amazon S3, and optionally Amazon SNS permissions to your IAM role.

   Choose **Roles** in the IAM console. Search for the role you created by typing in your role name in the **Search** field.

   1. Choose your role.

   1. Next, choose **Attach Policies**.

   1. Amazon SageMaker Asynchronous Inference needs permission to perform the following actions: `"sagemaker:CreateModel"`, `"sagemaker:CreateEndpointConfig"`, `"sagemaker:CreateEndpoint"`, and `"sagemaker:InvokeEndpointAsync"`. 

      These actions are included in the `AmazonSageMakerFullAccess` policy. Add this policy to your IAM role. Search for `AmazonSageMakerFullAccess` in the **Search** field. Select `AmazonSageMakerFullAccess`.

   1. Choose **Attach policy**.

   1. Next, choose **Attach Policies** to add Amazon S3 permissions.

   1. Select **Create policy**.

   1. Select the `JSON` tab.

   1. Add the following policy statement:

------
#### [ JSON ]

****  

      ```
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Action": [
                      "s3:GetObject",
                      "s3:PutObject",
                      "s3:AbortMultipartUpload",
                      "s3:ListBucket"  
                  ],
                  "Effect": "Allow",
                  "Resource": "arn:aws:s3:::bucket_name/*"
              }
          ]
      }
      ```

------

   1. Choose **Next: Tags**.

   1. Type in a **Policy name**.

   1. Choose **Create policy**.

   1. Repeat the same steps you completed to add Amazon S3 permissions in order to add Amazon SNS permissions. For the policy statement, attach the following:

------
#### [ JSON ]

****  

      ```
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Action": [
                      "sns:Publish"
                  ],
                  "Effect": "Allow",
                  "Resource": "arn:aws:sns:us-east-1:111122223333:SNS_Topic"
              }
          ]
      }
      ```

------

1. **Upload your inference data (e.g., machine learning model, sample data) to Amazon S3.**

1. **Select a prebuilt Docker inference image or create your own Inference Docker Image.**

   SageMaker AI provides containers for its built-in algorithms and prebuilt Docker images for some of the most common machine learning frameworks, such as Apache MXNet, TensorFlow, PyTorch, and Chainer. For a full list of the available SageMaker AI images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). If you choose to use a SageMaker AI provided container, you can increase the endpoint timeout and payload sizes from the default by setting the environment variables in the container. To learn how to set the different environment variables for each framework, see the Create a Model step of creating an asynchronous endpoint.

   If none of the existing SageMaker AI containers meet your needs and you don't have an existing container of your own, you may need to create a new Docker container. See [Containers with custom inference code](your-algorithms-inference-main.md) for information on how to create your Docker image.

1. **Create an Amazon SNS topic (optional)**

   Create an Amazon Simple Notification Service (Amazon SNS) topic that sends notifications about requests that have completed processing. Amazon SNS is a notification service for messaging-oriented applications, in which subscribers receive "push" notifications of time-critical messages over a choice of transport protocols, including HTTP, Amazon SQS, and email. You specify Amazon SNS topics in the `AsyncInferenceConfig` object when you create an endpoint configuration with the `CreateEndpointConfig` API.

   Follow the steps to create and subscribe to an Amazon SNS topic.

   1. Using Amazon SNS console, create a topic. For instructions, see [Creating an Amazon SNS topic](https://docs.aws.amazon.com/sns/latest/dg/CreateTopic.html) in the *Amazon Simple Notification Service* *Developer Guide*.

   1. Subscribe to the topic. For instructions, see [Subscribing to an Amazon SNS topic](https://docs.aws.amazon.com/sns/latest/dg/sns-create-subscribe-endpoint-to-topic.html) in the *Amazon Simple Notification Service Developer Guide*.

   1. When you receive an email requesting that you confirm your subscription to the topic, confirm the subscription.

   1. Note the topic Amazon Resource Name (ARN). The Amazon SNS topic you created is another resource in your AWS account, and it has a unique ARN. The ARN is in the following format:

      ```
      arn:aws:sns:aws-region:account-id:topic-name
      ```

   For more information about Amazon SNS, see the [Amazon SNS Developer Guide](https://docs.aws.amazon.com/sns/latest/dg/welcome.html).
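If you prefer to script the IAM setup, the console steps above can be sketched with the AWS SDK for Python (Boto3). This is a minimal sketch, not the only valid setup: the role name and the `create_sagemaker_role` helper are illustrative, and it mirrors only the core steps of creating a SageMaker-trusted role and attaching the `AmazonSageMakerFullAccess` managed policy.

```python
import json

# Hypothetical role name for illustration; substitute your own.
ROLE_NAME = "AsyncInferenceDemoRole"

# Trust policy that lets the SageMaker AI service assume the role.
SAGEMAKER_TRUST_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

def create_sagemaker_role(iam_client, role_name=ROLE_NAME):
    """Create the role, attach AmazonSageMakerFullAccess, and return the role ARN."""
    role = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(SAGEMAKER_TRUST_POLICY),
    )
    iam_client.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/AmazonSageMakerFullAccess",
    )
    return role["Role"]["Arn"]
```

You would call `create_sagemaker_role(boto3.client("iam"))` and note the returned ARN; the customer-managed Amazon S3 and Amazon SNS policies from the steps above still need to be created and attached separately.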

# How to create an Asynchronous Inference Endpoint

Create an asynchronous endpoint the same way you would create an endpoint using SageMaker AI hosting services:
+ Create a model in SageMaker AI with `CreateModel`.
+ Create an endpoint configuration with `CreateEndpointConfig`.
+ Create an HTTPS endpoint with `CreateEndpoint`.

To create an endpoint, you first create a model with [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html), where you point to the model artifact and a Docker registry path (Image). You then create a configuration using [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) where you specify one or more models that were created using the `CreateModel` API to deploy and the resources that you want SageMaker AI to provision. Create your endpoint with [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) using the endpoint configuration specified in the request. You can update an asynchronous endpoint with the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API. Send and receive inference requests from the model hosted at the endpoint with `InvokeEndpointAsync`. You can delete your endpoints with the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpoint.html) API.

For a full list of the available SageMaker Images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). See [Containers with custom inference code](your-algorithms-inference-main.md) for information on how to create your Docker image.

**Topics**
+ [Create a Model](async-inference-create-endpoint-create-model.md)
+ [Create an Endpoint Configuration](async-inference-create-endpoint-create-endpoint-config.md)
+ [Create Endpoint](async-inference-create-endpoint-create-endpoint.md)

# Create a Model


The following example shows how to create a model using the AWS SDK for Python (Boto3). The first few lines define:
+ `sagemaker_client`: A low-level SageMaker AI client object that makes it easy to send and receive requests to AWS services.
+ `sagemaker_role`: A string variable with the SageMaker AI IAM role Amazon Resource Name (ARN).
+ `aws_region`: A string variable with the name of your AWS region.

```
import boto3

# Specify your AWS Region
aws_region='<aws_region>'

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# Role to give SageMaker permission to access AWS services.
sagemaker_role= "arn:aws:iam::<account>:role/*"
```

Next, specify the location of the pre-trained model stored in Amazon S3. In this example, we use a pre-trained XGBoost model named `demo-xgboost-model.tar.gz`. The full Amazon S3 URI is stored in a string variable `model_url`:

```
#Create a variable w/ the model S3 URI
s3_bucket = '<your-bucket-name>' # Provide the name of your S3 bucket
bucket_prefix='saved_models'
model_s3_key = f"{bucket_prefix}/demo-xgboost-model.tar.gz"

#Specify S3 bucket w/ model
model_url = f"s3://{s3_bucket}/{model_s3_key}"
```

Specify a primary container. For the primary container, you specify the Docker image that contains inference code, artifacts (from prior training), and a custom environment map that the inference code uses when you deploy the model for predictions.

 In this example, we specify an XGBoost built-in algorithm container image: 

```
from sagemaker import image_uris

# Specify an AWS container image. 
container = image_uris.retrieve(region=aws_region, framework='xgboost', version='0.90-1')
```

Create a model in Amazon SageMaker AI with `CreateModel`. Specify the following:
+ `ModelName`: A name for your model (in this example it is stored as a string variable called `model_name`).
+ `ExecutionRoleArn`: The Amazon Resource Name (ARN) of the IAM role that Amazon SageMaker AI can assume to access model artifacts and Docker images for deployment on ML compute instances or for batch transform jobs.
+ `PrimaryContainer`: The location of the primary Docker image containing inference code, associated artifacts, and custom environment maps that the inference code uses when the model is deployed for predictions.

```
model_name = '<The_name_of_the_model>'

#Create model
create_model_response = sagemaker_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = sagemaker_role,
    PrimaryContainer = {
        'Image': container,
        'ModelDataUrl': model_url,
    })
```

See [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) description in the SageMaker API Reference Guide for a full list of API parameters.

If you're using a SageMaker AI provided container, you can increase the model server timeout and payload sizes from the default values to the framework‐supported maximums by setting environment variables in this step. You might not be able to leverage the maximum timeout and payload sizes that Asynchronous Inference supports if you don't explicitly set these variables. The following example shows how you can set the environment variables for a PyTorch Inference container based on TorchServe.

```
model_name = '<The_name_of_the_model>'

#Create model
create_model_response = sagemaker_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = sagemaker_role,
    PrimaryContainer = {
        'Image': container,
        'ModelDataUrl': model_url,
        'Environment': {
            'TS_MAX_REQUEST_SIZE': '100000000',
            'TS_MAX_RESPONSE_SIZE': '100000000',
            'TS_DEFAULT_RESPONSE_TIMEOUT': '1000'
        },
    })
```

After you finish creating your endpoint, you should test that you've set the environment variables correctly by printing them out from your `inference.py` script. The following table lists the environment variables for several frameworks that you can set to change the default values.


| Framework | Environment variables | 
| --- | --- | 
|  PyTorch 1.8 (based on TorchServe)  |  `TS_MAX_REQUEST_SIZE`: '100000000' `TS_MAX_RESPONSE_SIZE`: '100000000' `TS_DEFAULT_RESPONSE_TIMEOUT`: '1000'  | 
|  PyTorch 1.4 (based on MMS)  |  `MMS_MAX_REQUEST_SIZE`: '1000000000' `MMS_MAX_RESPONSE_SIZE`: '1000000000' `MMS_DEFAULT_RESPONSE_TIMEOUT`: '900'  | 
|  HuggingFace Inference Container (based on MMS)  |  `MMS_MAX_REQUEST_SIZE`: '2000000000' `MMS_MAX_RESPONSE_SIZE`: '2000000000' `MMS_DEFAULT_RESPONSE_TIMEOUT`: '900'  | 
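Per the verification tip above, you can confirm the limits took effect by printing the variables from inside your `inference.py` script. A small sketch, assuming the TorchServe-based variable names (the `log_async_limits` helper is illustrative, not part of any framework):

```python
import os

def log_async_limits():
    """Print the serving-stack limits so you can confirm they were applied.
    Variable names shown are for the TorchServe-based PyTorch container;
    MMS-based containers use the MMS_* names instead."""
    for var in ("TS_MAX_REQUEST_SIZE",
                "TS_MAX_RESPONSE_SIZE",
                "TS_DEFAULT_RESPONSE_TIMEOUT"):
        # os.environ.get returns None if the variable was not set on the container
        print(f"{var}={os.environ.get(var)}")
```

Calling `log_async_limits()` from, for example, your `model_fn` writes the values to the container logs, which you can read in CloudWatch Logs.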

# Create an Endpoint Configuration


Once you have a model, create an endpoint configuration with [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html). Amazon SageMaker AI hosting services uses this configuration to deploy models. In the configuration, you identify one or more models, created with [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html), to deploy, and the resources that you want Amazon SageMaker AI to provision. Specify the `AsyncInferenceConfig` object and provide an output Amazon S3 location for `OutputConfig`. You can optionally specify [Amazon SNS](https://docs.aws.amazon.com/sns/latest/dg/welcome.html) topics on which to send notifications about prediction results. For more information about Amazon SNS topics, see [Configuring Amazon SNS](https://docs.aws.amazon.com/sns/latest/dg/sns-configuring.html).

The following example shows how to create an endpoint configuration using AWS SDK for Python (Boto3):

```
import datetime
from time import gmtime, strftime

# Create an endpoint config name. Here we create one based on the date
# so we can search endpoints based on creation time.
endpoint_config_name = f"XGBoostEndpointConfig-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}"

# The name of the model that you want to host. This is the name that you specified when creating the model.
model_name='<The_name_of_your_model>'

create_endpoint_config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name, # You will specify this name in a CreateEndpoint request.
    # List of ProductionVariant objects, one for each model that you want to host at this endpoint.
    ProductionVariants=[
        {
            "VariantName": "variant1", # The name of the production variant.
            "ModelName": model_name, 
            "InstanceType": "ml.m5.xlarge", # Specify the compute instance type.
            "InitialInstanceCount": 1 # Number of instances to launch initially.
        }
    ],
    AsyncInferenceConfig={
        "OutputConfig": {
            # Location to upload response outputs when no location is provided in the request.
            "S3OutputPath": f"s3://{s3_bucket}/{bucket_prefix}/output",
            # (Optional) specify Amazon SNS topics
            "NotificationConfig": {
                "SuccessTopic": "arn:aws:sns:aws-region:account-id:topic-name",
                "ErrorTopic": "arn:aws:sns:aws-region:account-id:topic-name",
            }
        },
        "ClientConfig": {
            # (Optional) Specify the max number of inflight invocations per instance
            # If no value is provided, Amazon SageMaker will choose an optimal value for you
            "MaxConcurrentInvocationsPerInstance": 4
        }
    }
)

print(f"Created EndpointConfig: {create_endpoint_config_response['EndpointConfigArn']}")
```

In the preceding example, you specify the following keys for `OutputConfig` in the `AsyncInferenceConfig` field:
+ `S3OutputPath`: Location to upload response outputs when no location is provided in the request.
+ `NotificationConfig`: (Optional) SNS topics that post notifications to you when an inference request is successful (`SuccessTopic`) or if it fails (`ErrorTopic`).

You can also specify the following optional argument for `ClientConfig` in the `AsyncInferenceConfig` field:
+ `MaxConcurrentInvocationsPerInstance`: (Optional) The maximum number of concurrent requests sent by the SageMaker AI client to the model container.

# Create Endpoint


Once you have your model and endpoint configuration, use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) API to create your endpoint. The endpoint name must be unique within an AWS Region in your AWS account. 

The following creates an endpoint using the endpoint configuration specified in the request. Amazon SageMaker AI uses the endpoint to provision resources and deploy models.

```
# The name of the endpoint. The name must be unique within an AWS Region in your AWS account.
endpoint_name = '<endpoint-name>' 

# The name of the endpoint configuration associated with this endpoint.
endpoint_config_name='<endpoint-config-name>'

create_endpoint_response = sagemaker_client.create_endpoint(
                                            EndpointName=endpoint_name, 
                                            EndpointConfigName=endpoint_config_name)
```

When you call the `CreateEndpoint` API, Amazon SageMaker Asynchronous Inference sends a test notification to check that you have configured an Amazon SNS topic. Amazon SageMaker Asynchronous Inference also sends test notifications after calls to `UpdateEndpoint` and `UpdateEndpointWeightsAndCapacities`. This lets SageMaker AI check that you have the required permissions. The notification can simply be ignored. The test notification has the following form:

```
{
    "eventVersion":"1.0",
    "eventSource":"aws:sagemaker",
    "eventName":"TestNotification"
}
```
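A consumer subscribed to the SNS topic can filter these test notifications out by checking the `eventName` field. A minimal sketch (the `handle_sns_message` helper is illustrative; real success and error notifications carry additional fields describing the request outcome):

```python
import json

def handle_sns_message(message_body: str):
    """Parse an Asynchronous Inference notification delivered via Amazon SNS.

    Returns None for the endpoint's TestNotification events, which can be
    safely ignored; otherwise returns the parsed notification for processing.
    """
    notification = json.loads(message_body)
    if notification.get("eventName") == "TestNotification":
        return None  # permission-check ping sent on Create/UpdateEndpoint
    return notification
```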

# Invoke an Asynchronous Endpoint

Get inferences from the model hosted at your asynchronous endpoint with `InvokeEndpointAsync`. 

**Note**  
If you have not done so already, upload your inference data (e.g., machine learning model, sample data) to Amazon S3.

Specify the following fields in your request:
+ For `InputLocation`, specify the location of your inference data.
+ For `EndpointName`, specify the name of your endpoint.
+ (Optional) For `InvocationTimeoutSeconds`, you can set the max timeout for the requests. You can set this value to a maximum of 3600 seconds (one hour) on a per-request basis. If you don't specify this field in your request, by default the request times out at 15 minutes.

```
import boto3

# Create a low-level client representing Amazon SageMaker Runtime
sagemaker_runtime = boto3.client("sagemaker-runtime", region_name='<aws_region>')

# Specify the location of the input. Here, a single SVM sample
input_location = "s3://bucket-name/test_point_0.libsvm"

# The name of the endpoint. The name must be unique within an AWS Region in your AWS account. 
endpoint_name='<endpoint-name>'

# After you deploy a model into production using SageMaker AI hosting 
# services, your client applications use this API to get inferences 
# from the model hosted at the specified endpoint.
response = sagemaker_runtime.invoke_endpoint_async(
                            EndpointName=endpoint_name, 
                            InputLocation=input_location,
                            InvocationTimeoutSeconds=3600)
```

You receive a JSON response that contains your request ID and the Amazon S3 location that will hold the response to the API call after it is processed.
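The `OutputLocation` field of that response points at the S3 object that will eventually hold the result. A minimal polling sketch (the helper functions and their parameters are illustrative, not part of the SageMaker API; the S3 client is passed in so you would supply `boto3.client("s3")`):

```python
import time
import urllib.parse

def parse_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key)."""
    parsed = urllib.parse.urlparse(uri)
    return parsed.netloc, parsed.path.lstrip("/")

def wait_for_result(s3_client, output_location, timeout=3600, poll=15):
    """Poll the OutputLocation returned by InvokeEndpointAsync until the
    result object appears in Amazon S3, then return its body as bytes."""
    bucket, key = parse_s3_uri(output_location)
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            return s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
        except s3_client.exceptions.NoSuchKey:
            time.sleep(poll)  # not there yet; the request is still queued or running
    raise TimeoutError(f"No result at {output_location} after {timeout} seconds")
```

In production you would typically subscribe to the SNS success/error topics instead of polling; this sketch is useful for quick tests.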

# Update an Asynchronous Endpoint

Update an asynchronous endpoint with the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API. When you update an endpoint, SageMaker AI first provisions and switches to the new endpoint configuration you specify before it deletes the resources that were provisioned in the previous endpoint configuration. Do not delete an `EndpointConfig` with an endpoint that is live or while the `UpdateEndpoint` or `CreateEndpoint` operations are being performed on the endpoint. 

```
# The name of the endpoint. The name must be unique within an AWS Region in your AWS account.
endpoint_name='<endpoint-name>'

# The name of the endpoint configuration associated with this endpoint.
endpoint_config_name='<endpoint-config-name>'

sagemaker_client.update_endpoint(
                                EndpointConfigName=endpoint_config_name,
                                EndpointName=endpoint_name
                                )
```

When Amazon SageMaker AI receives the request, it sets the endpoint status to **Updating**. After updating the asynchronous endpoint, it sets the status to **InService**. To check the status of an endpoint, use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API. For a full list of parameters you can specify when updating an endpoint, see the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API.
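Rather than calling `DescribeEndpoint` in a loop yourself, you can use the built-in Boto3 waiter to block until the update completes. A sketch (the `wait_until_in_service` helper name is illustrative; the `endpoint_in_service` waiter is part of the Boto3 SageMaker client):

```python
def wait_until_in_service(sagemaker_client, endpoint_name):
    """Block until the endpoint reaches InService. The built-in waiter polls
    DescribeEndpoint and raises if the endpoint enters a failed state."""
    waiter = sagemaker_client.get_waiter("endpoint_in_service")
    waiter.wait(EndpointName=endpoint_name)
    return sagemaker_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
```

You would call `wait_until_in_service(boto3.client("sagemaker"), "<endpoint-name>")` after `update_endpoint` returns.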

# Delete an Asynchronous Endpoint

Delete an asynchronous endpoint in a similar manner to how you would delete a SageMaker AI hosted endpoint with the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpoint.html) API. Specify the name of the asynchronous endpoint you want to delete. When you delete an endpoint, SageMaker AI frees up all of the resources that were deployed when the endpoint was created. Deleting a model does not delete model artifacts, inference code, or the IAM role that you specified when creating the model.

Delete your SageMaker AI model with the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteModel.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteModel.html) API or with the SageMaker AI console.

------
#### [ Boto3 ]

```
import boto3 

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name='<aws_region>')
sagemaker_client.delete_endpoint(EndpointName='<endpoint-name>')
```

------
#### [ SageMaker AI console ]

1. Navigate to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Expand the **Inference** dropdown list.

1. Select **Endpoints**.

1. Search for endpoint in the **Search endpoints** search bar.

1. Select your endpoint.

1. Choose **Delete**.

------

In addition to deleting the asynchronous endpoint, you might want to clear up other resources that were used to create the endpoint, such as the Amazon ECR repository (if you created a custom inference image), the SageMaker AI model, and the asynchronous endpoint configuration itself. 
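A sketch of that cleanup with Boto3 (the `delete_async_resources` helper is illustrative; ECR repository cleanup is not shown):

```python
def delete_async_resources(sagemaker_client, endpoint_name,
                           endpoint_config_name, model_name):
    """Tear down the endpoint and the resources created for it.
    Order matters: delete the endpoint first, then its configuration,
    then the model."""
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    sagemaker_client.delete_model(ModelName=model_name)
```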

# Alarms and logs for tracking metrics from asynchronous endpoints

You can monitor SageMaker AI using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics. With Amazon CloudWatch, you can access historical information and gain a better perspective on how your web application or service is performing. For more information about Amazon CloudWatch, see [What is Amazon CloudWatch?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html)

## Monitoring with CloudWatch


The metrics below are an exhaustive list of the metrics published for asynchronous endpoints; they are in the `AWS/SageMaker` namespace. Any metric not listed below is not published if the endpoint is enabled for asynchronous inference. Such metrics include (but are not limited to):
+ OverheadLatency
+ Invocations
+ InvocationsPerInstance

### Common Endpoint Metrics


These metrics are the same as the metrics published for real-time endpoints today. For more information about other metrics in Amazon CloudWatch, see [Monitor SageMaker AI with Amazon CloudWatch](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html).


| Metric Name | Description | Unit/Stats | 
| --- | --- | --- | 
| `Invocation4XXErrors` | The number of requests where the model returned a 4xx HTTP response code. For each 4xx response, 1 is sent; otherwise, 0 is sent. | Units: None. Valid statistics: Average, Sum | 
| `Invocation5XXErrors` | The number of InvokeEndpoint requests where the model returned a 5xx HTTP response code. For each 5xx response, 1 is sent; otherwise, 0 is sent. | Units: None. Valid statistics: Average, Sum | 
| `ModelLatency` | The interval of time taken by a model to respond as viewed from SageMaker AI. This interval includes the local communication times taken to send the request and to fetch the response from the container of a model and the time taken to complete the inference in the container. | Units: Microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count | 

### Asynchronous Inference Endpoint Metrics


These metrics are published for endpoints enabled for asynchronous inference. The following metrics are published with the `EndpointName` dimension:


| Metric Name | Description | Unit/Stats | 
| --- | --- | --- | 
| `ApproximateBacklogSize` | The number of items in the queue for an endpoint that are currently being processed or yet to be processed. | Units: Count. Valid statistics: Average, Max, Min | 
| `ApproximateBacklogSizePerInstance` | The number of items in the queue divided by the number of instances behind an endpoint. This metric is primarily used for setting up application autoscaling for an async-enabled endpoint. | Units: Count. Valid statistics: Average, Max, Min | 
| `ApproximateAgeOfOldestRequest` | The age of the oldest request in the queue. | Units: Seconds. Valid statistics: Average, Max, Min | 
| `HasBacklogWithoutCapacity` | The value of this metric is `1` when there are requests in the queue but zero instances behind the endpoint. The value is `0` at all other times. You can use this metric for autoscaling your endpoint up from zero instances upon receiving a new request in the queue. | Units: Count. Valid statistics: Average | 
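
To watch the backlog outside the console, you can query these metrics with the CloudWatch `GetMetricStatistics` API. The following is a minimal sketch that builds the request parameters; the helper name and time window are illustrative, and in practice you would pass the resulting dictionary to a `boto3` CloudWatch client as `cloudwatch.get_metric_statistics(**params)`.

```
import datetime

def backlog_metric_request(endpoint_name, minutes=15, now=None):
    # Build GetMetricStatistics parameters for the per-instance backlog metric.
    # Helper name and window length are illustrative, not part of any API.
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "ApproximateBacklogSizePerInstance",
        "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
        "StartTime": now - datetime.timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Average", "Maximum"],
    }

params = backlog_metric_request("<endpoint_name>")
print(params["MetricName"])
```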

The following metrics are published with the `EndpointName` and `VariantName` dimensions:


| Metric Name | Description | Unit/Stats | 
| --- | --- | --- | 
| `RequestDownloadFailures` | The number of inference failures caused by an issue downloading the request from Amazon S3. | Units: Count. Valid statistics: Sum | 
| `ResponseUploadFailures` | The number of inference failures caused by an issue uploading the response to Amazon S3. | Units: Count. Valid statistics: Sum | 
| `NotificationFailures` | The number of failures that occur while publishing notifications. | Units: Count. Valid statistics: Sum | 
| `RequestDownloadLatency` | The total time to download the request payload. | Units: Microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count | 
| `ResponseUploadLatency` | The total time to upload the response payload. | Units: Microseconds. Valid statistics: Average, Sum, Min, Max, Sample Count | 
| `ExpiredRequests` | The number of requests in the queue that fail because they reached their specified request TTL. | Units: Count. Valid statistics: Sum | 
| `InvocationFailures` | The number of invocations that fail for any reason. | Units: Count. Valid statistics: Sum | 
| `InvocationsProcesssed` | The number of asynchronous invocations processed by the endpoint. | Units: Count. Valid statistics: Sum | 
| `TimeInBacklog` | The total time a request was queued before being processed. This does not include the actual processing time (that is, download time, upload time, and model latency). | Units: Milliseconds. Valid statistics: Average, Sum, Min, Max, Sample Count | 
| `TotalProcessingTime` | The time from when the inference request was received by SageMaker AI to when the request finished processing. This includes time in the backlog and time to upload the response and send notifications, if any. | Units: Milliseconds. Valid statistics: Average, Sum, Min, Max, Sample Count | 

Amazon SageMaker Asynchronous Inference also includes host-level metrics. For information on host-level metrics, see [SageMaker AI Jobs and Endpoint Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-jobs).

## Logs


In addition to the [Model container logs](https://docs.aws.amazon.com/sagemaker/latest/dg/logging-cloudwatch.html) that are published to Amazon CloudWatch in your account, you also get a new platform log for tracing and debugging inference requests.

The new logs are published under the Endpoint Log Group:

```
/aws/sagemaker/Endpoints/[EndpointName]
```

The log stream name has the following format:

```
[production-variant-name]/[instance-id]/data-log
```

Log lines contain the request’s inference ID so that errors can be easily mapped to a particular request.
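
If you want to locate these logs programmatically, the group and stream names can be assembled from the endpoint, variant, and instance identifiers. A minimal sketch (the helper names and identifiers below are illustrative):

```
def endpoint_log_group(endpoint_name):
    # Log group that holds the platform data logs for an async endpoint.
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"

def data_log_stream(variant_name, instance_id):
    # Stream name format: [production-variant-name]/[instance-id]/data-log
    return f"{variant_name}/{instance_id}/data-log"

print(endpoint_log_group("<endpoint_name>"))
print(data_log_stream("variant1", "<instance_id>"))
```

You could then pass these names to CloudWatch Logs APIs such as `filter_log_events` to search for a specific inference ID.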

# Check prediction results


There are several ways to check prediction results from your asynchronous endpoint. Some options are:

1. Amazon SNS topics.

1. Check for outputs in your Amazon S3 bucket.

## Amazon SNS Topics


Amazon SNS is a notification service for messaging-oriented applications, with multiple subscribers requesting and receiving "push" notifications of time-critical messages through a choice of transport protocols, including HTTP, Amazon SQS, and email. Amazon SageMaker Asynchronous Inference posts notifications when you create an endpoint with [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) and specify an Amazon SNS topic.

**Note**  
In order to receive Amazon SNS notifications, your IAM role must have `sns:Publish` permissions. See the [Complete the prerequisites](async-inference-create-endpoint-prerequisites.md) for information on requirements you must satisfy to use Asynchronous Inference.

To use Amazon SNS to check prediction results from your asynchronous endpoint, you first need to create a topic, subscribe to the topic, confirm your subscription to the topic, and note the Amazon Resource Name (ARN) of that topic. For detailed information on how to create, subscribe, and find the Amazon ARN of an Amazon SNS topic, see [Configuring Amazon SNS](https://docs.aws.amazon.com/sns/latest/dg/sns-configuring.html).

Provide the Amazon SNS topic ARN(s) in the `AsyncInferenceConfig` field when you create an endpoint configuration with `CreateEndpointConfig`. You can specify both an Amazon SNS `ErrorTopic` and a `SuccessTopic`.

```
import boto3

sagemaker_client = boto3.client('sagemaker', region_name=<aws_region>)

sagemaker_client.create_endpoint_config(
    EndpointConfigName=<endpoint_config_name>, # You specify this name in a CreateEndpoint request.
    # List of ProductionVariant objects, one for each model that you want to host at this endpoint.
    ProductionVariants=[
        {
            "VariantName": "variant1", # The name of the production variant.
            "ModelName": "model_name", 
            "InstanceType": "ml.m5.xlarge", # Specify the compute instance type.
            "InitialInstanceCount": 1 # Number of instances to launch initially.
        }
    ],
    AsyncInferenceConfig={
        "OutputConfig": {
            # Location to upload response outputs when no location is provided in the request.
            "S3OutputPath": "s3://<bucket>/<output_directory>",
            "NotificationConfig": {
                "SuccessTopic": "arn:aws:sns:aws-region:account-id:topic-name",
                "ErrorTopic": "arn:aws:sns:aws-region:account-id:topic-name",
            }
        }
    }
)
```

After creating your endpoint and invoking it, you receive a notification from your Amazon SNS topic. For example, if you subscribed to receive email notifications from your topic, you receive an email notification every time you invoke your endpoint. The following example shows the JSON content of a successful invocation email notification.

```
{
   "awsRegion":"us-east-1",
   "eventTime":"2022-01-25T22:46:00.608Z",
   "receivedTime":"2022-01-25T22:46:00.455Z",
   "invocationStatus":"Completed",
   "requestParameters":{
      "contentType":"text/csv",
      "endpointName":"<example-endpoint>",
      "inputLocation":"s3://<bucket>/<input-directory>/input-data.csv"
   },
   "responseParameters":{
      "contentType":"text/csv; charset=utf-8",
      "outputLocation":"s3://<bucket>/<output_directory>/prediction.out"
   },
   "inferenceId":"11111111-2222-3333-4444-555555555555", 
   "eventVersion":"1.0",
   "eventSource":"aws:sagemaker",
   "eventName":"InferenceResult"
}
```
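
If you subscribe an HTTP endpoint or queue rather than email, your code receives the same JSON document. The following sketch parses a notification of this shape and pulls out the fields you typically need; the helper name is illustrative, and the message body is a hand-built copy of the example above, not live output.

```
import json

def summarize_notification(message_body):
    # Extract the status, inference ID, and result location from an
    # Asynchronous Inference SNS notification body.
    event = json.loads(message_body)
    summary = {
        "status": event["invocationStatus"],
        "inference_id": event.get("inferenceId"),
    }
    if event["invocationStatus"] == "Completed":
        summary["output_location"] = event["responseParameters"]["outputLocation"]
    return summary

body = '''{"invocationStatus": "Completed",
           "inferenceId": "11111111-2222-3333-4444-555555555555",
           "responseParameters": {"contentType": "text/csv; charset=utf-8",
                                  "outputLocation": "s3://<bucket>/<output_directory>/prediction.out"}}'''
print(summarize_notification(body))
```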

## Check Your S3 Bucket


When you invoke an endpoint with `InvokeEndpointAsync`, it returns a response object. You can use the response object to get the Amazon S3 URI where your output is stored. With the output location, you can use the SageMaker Python SDK `Session` class to programmatically check for an output.

The following example stores the output dictionary of `InvokeEndpointAsync` in a variable named `response`. With the `response` variable, you then get the Amazon S3 output URI and store it as a string variable called `output_location`. 

```
import uuid
import boto3

sagemaker_runtime = boto3.client("sagemaker-runtime", region_name=<aws_region>)

# Specify the S3 URI of the input. Here, a single SVM sample
input_location = "s3://bucket-name/test_point_0.libsvm" 

response = sagemaker_runtime.invoke_endpoint_async(
    EndpointName='<endpoint-name>',
    InputLocation=input_location,
    InferenceId=str(uuid.uuid4()), 
    ContentType="text/libsvm" #Specify the content type of your data
)

output_location = response['OutputLocation']
print(f"OutputLocation: {output_location}")
```

For information about supported content types, see [Common data formats for inference](cdf-inference.md).

With the Amazon S3 output location, you can then use the [SageMaker Python SDK Session class](https://sagemaker.readthedocs.io/en/stable/api/utility/session.html?highlight=session) to read Amazon S3 files. The following code example shows how to create a function (`get_output`) that repeatedly attempts to read a file from the Amazon S3 output location:

```
import sagemaker
import urllib, time
from botocore.exceptions import ClientError

sagemaker_session = sagemaker.session.Session()

def get_output(output_location):
    output_url = urllib.parse.urlparse(output_location)
    bucket = output_url.netloc
    key = output_url.path[1:]  # Strip the leading "/"
    while True:
        try:
            return sagemaker_session.read_s3_file(bucket=bucket, key_prefix=key)
        except ClientError as e:
            if e.response['Error']['Code'] == 'NoSuchKey':
                print("waiting for output...")
                time.sleep(2)
                continue
            raise
            
output = get_output(output_location)
print(f"Output: {output}")
```

# Autoscale an asynchronous endpoint


Amazon SageMaker AI supports automatic scaling (autoscaling) for your asynchronous endpoint. Autoscaling dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. Unlike other hosted models that Amazon SageMaker AI supports, with Asynchronous Inference you can also scale your asynchronous endpoint's instances down to zero. Requests received while there are zero instances are queued for processing once the endpoint scales up.

To autoscale your asynchronous endpoint, you must at a minimum:
+ Register a deployed model (production variant).
+ Define a scaling policy.
+ Apply the autoscaling policy.

Before you can use autoscaling, you must have already deployed a model to a SageMaker AI endpoint. Deployed models are referred to as a [production variant](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html). See [Deploy the Model to SageMaker Hosting Services](https://docs.aws.amazon.com/sagemaker/latest/dg/ex1-model-deployment.html#ex1-deploy-model) for more information about deploying a model to an endpoint. To specify the metrics and target values for a scaling policy, you configure a scaling policy. For information on how to define a scaling policy, see [Define a scaling policy](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling-add-code-define.html). After registering your model and defining a scaling policy, apply the scaling policy to the registered model. For information on how to apply the scaling policy, see [Apply a scaling policy](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling-add-code-apply.html).

For more information on how to define an optional additional scaling policy that scales up your endpoint upon receiving a request after your endpoint has been scaled down to zero, see [Optional: Define a scaling policy that scales up from zero for new requests](#async-inference-autoscale-scale-up). If you don’t specify this optional policy, then your endpoint only initiates scaling up from zero after the number of backlog requests exceeds the target tracking value.

 For details on other prerequisites and components used with autoscaling, see the [Prerequisites](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling-prerequisites.html) section in the SageMaker AI autoscaling documentation.

**Note**  
If you attach multiple scaling policies to the same autoscaling group, you might have scaling conflicts. When a conflict occurs, Amazon EC2 Auto Scaling chooses the policy that provisions the largest capacity for both scale out and scale in. For more information about this behavior, see [Multiple dynamic scaling policies](https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-scale-based-on-demand.html#multiple-scaling-policy-resolution) in the *Amazon EC2 Auto Scaling documentation*.

## Define a scaling policy


To specify the metrics and target values for a scaling policy, you configure a target-tracking scaling policy. Define the scaling policy as a JSON block in a text file. You use that text file when invoking the AWS CLI or the Application Auto Scaling API. For more information about policy configuration syntax, see [https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html) in the Application Auto Scaling API Reference.

For asynchronous endpoints, SageMaker AI strongly recommends that you create a target-tracking scaling policy configuration for a variant. In this configuration example, we use the custom metric `ApproximateBacklogSizePerInstance` through a `CustomizedMetricSpecification`.

```
TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 5.0, # The target value for the metric. Here the metric is: ApproximateBacklogSizePerInstance
        'CustomizedMetricSpecification': {
            'MetricName': 'ApproximateBacklogSizePerInstance',
            'Namespace': 'AWS/SageMaker',
            'Dimensions': [
                {'Name': 'EndpointName', 'Value': <endpoint_name> }
            ],
            'Statistic': 'Average',
        }
    }
```
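
To apply a configuration like this, you pass it to the Application Auto Scaling `put_scaling_policy` API. The following sketch assembles the full set of arguments; the helper and policy names are illustrative, and in practice you would pass the resulting dictionary to a `boto3` `application-autoscaling` client as `client.put_scaling_policy(**args)`.

```
def target_tracking_policy_args(endpoint_name, variant_name, target=5.0):
    # Build put_scaling_policy arguments that track the per-instance backlog.
    # The policy name below is a hypothetical choice, not a required value.
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    return {
        "PolicyName": f"{endpoint_name}-backlog-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target,
            "CustomizedMetricSpecification": {
                "MetricName": "ApproximateBacklogSizePerInstance",
                "Namespace": "AWS/SageMaker",
                "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
                "Statistic": "Average",
            },
        },
    }

args = target_tracking_policy_args("<endpoint_name>", "variant1")
print(args["ResourceId"])
```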

## Define a scaling policy that scales to zero


The following shows you how to both define and register your endpoint variant with application autoscaling using the AWS SDK for Python (Boto3). After defining a low-level client object representing application autoscaling with Boto3, we use the [register_scalable_target](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/application-autoscaling.html#ApplicationAutoScaling.Client.register_scalable_target) method to register the production variant. We set `MinCapacity` to 0 because Asynchronous Inference enables you to scale down to 0 instances when there are no requests to process.

```
import boto3

# Low-level client representing Application Auto Scaling
client = boto3.client('application-autoscaling') 

# This is the format in which Application Auto Scaling references the endpoint
resource_id='endpoint/' + <endpoint_name> + '/variant/' + <variant_name> 

# Define and register your endpoint variant
response = client.register_scalable_target(
    ServiceNamespace='sagemaker', 
    ResourceId=resource_id,
    ScalableDimension='sagemaker:variant:DesiredInstanceCount', # The number of EC2 instances for your Amazon SageMaker model endpoint variant.
    MinCapacity=0,
    MaxCapacity=5
)
```

For a detailed description of the Application Auto Scaling API, see the [Application Auto Scaling Boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/application-autoscaling.html#ApplicationAutoScaling.Client.register_scalable_target) documentation.

## Optional: Define a scaling policy that scales up from zero for new requests


You might have a use case where you have sporadic requests or periods with low numbers of requests. If your endpoint has been scaled down to zero instances during these periods, then your endpoint won’t scale up again until the number of requests in the queue exceeds the target specified in your scaling policy. This can result in long waiting times for requests in the queue. The following section shows you how to create an additional scaling policy that scales your endpoint up from zero instances after receiving any new request in the queue. Your endpoint will be able to respond to new requests more quickly instead of waiting for the queue size to exceed the target.

To create a scaling policy for your endpoint that scales up from zero instances, do the following:

1. Create a scaling policy that defines the desired behavior, which is to scale up your endpoint when it’s at zero instances but has requests in the queue. The following shows you how to define a scaling policy called `HasBacklogWithoutCapacity-ScalingPolicy` using the AWS SDK for Python (Boto3). When the queue is greater than zero and the current instance count for your endpoint is also zero, the policy scales your endpoint up. In all other cases, the policy does not affect scaling for your endpoint.

   ```
   response = client.put_scaling_policy(
       PolicyName="HasBacklogWithoutCapacity-ScalingPolicy",
       ServiceNamespace="sagemaker",  # The namespace of the service that provides the resource.
       ResourceId=resource_id,  # Endpoint name
       ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # SageMaker supports only Instance Count
       PolicyType="StepScaling",  # 'StepScaling' or 'TargetTrackingScaling'
       StepScalingPolicyConfiguration={
           "AdjustmentType": "ChangeInCapacity", # Specifies whether the ScalingAdjustment value in the StepAdjustment property is an absolute number or a percentage of the current capacity. 
           "MetricAggregationType": "Average", # The aggregation type for the CloudWatch metrics.
           "Cooldown": 300, # The amount of time, in seconds, to wait for a previous scaling activity to take effect. 
           "StepAdjustments": # A set of adjustments that enable you to scale based on the size of the alarm breach.
           [ 
               {
                 "MetricIntervalLowerBound": 0,
                 "ScalingAdjustment": 1
               }
             ]
       },    
   )
   ```

1. Create a CloudWatch alarm with the custom metric `HasBacklogWithoutCapacity`. When triggered, the alarm initiates the previously defined scaling policy. For more information about the `HasBacklogWithoutCapacity` metric, see [Asynchronous Inference Endpoint Metrics](async-inference-monitor.md#async-inference-monitor-cloudwatch-async).

   ```
   response = cw_client.put_metric_alarm(
       AlarmName=step_scaling_policy_alarm_name,
       MetricName='HasBacklogWithoutCapacity',
       Namespace='AWS/SageMaker',
       Statistic='Average',
       EvaluationPeriods= 2,
       DatapointsToAlarm= 2,
       Threshold= 1,
       ComparisonOperator='GreaterThanOrEqualToThreshold',
       TreatMissingData='missing',
       Dimensions=[
           { 'Name':'EndpointName', 'Value':endpoint_name },
       ],
       Period= 60,
       AlarmActions=[step_scaling_policy_arn]
   )
   ```

You should now have a scaling policy and CloudWatch alarm that scale up your endpoint from zero instances whenever your queue has pending requests.

# Troubleshooting


The following FAQs can help you troubleshoot issues with your Amazon SageMaker Asynchronous Inference endpoints.

## Q: I have autoscaling enabled. How can I find the instance count behind the endpoint at any given point?


You can use the following methods to find the instance count behind your endpoint:
+ You can use the SageMaker AI [DescribeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API to describe the number of instances behind the endpoint at any given point in time.
+ You can get the instance count by viewing your Amazon CloudWatch metrics. View the [metrics for your endpoint instances](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html#cloudwatch-metrics-jobs), such as `CPUUtilization` or `MemoryUtilization`, and check the sample count statistic over a 1-minute period. The count should equal the number of active instances. The following screenshot shows the `CPUUtilization` metric graphed in the CloudWatch console, where the **Statistic** is set to `Sample count`, the **Period** is set to `1 minute`, and the resulting count is 5.

![\[CloudWatch console showing the graph of the count of active instances for an endpoint.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/cloudwatch-sample-count.png)
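
For the `DescribeEndpoint` route, the current instance count is reported per production variant in the API response. The helper below is a sketch that extracts it from a response of that shape; the response shown is a hand-built stand-in, not live API output.

```
def current_instance_counts(describe_endpoint_response):
    # Map each production variant to its current instance count.
    return {
        variant["VariantName"]: variant["CurrentInstanceCount"]
        for variant in describe_endpoint_response["ProductionVariants"]
    }

# Stand-in for: boto3.client("sagemaker").describe_endpoint(EndpointName="<endpoint_name>")
response = {
    "EndpointName": "<endpoint_name>",
    "ProductionVariants": [
        {"VariantName": "variant1", "CurrentInstanceCount": 5, "DesiredInstanceCount": 5}
    ],
}
print(current_instance_counts(response))  # {'variant1': 5}
```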


## Q: What are the common tunable environment variables for SageMaker AI containers?


The following tables outline the common tunable environment variables for SageMaker AI containers by framework type.

**TensorFlow**


| Environment variable | Description | 
| --- | --- | 
|  `SAGEMAKER_TFS_INSTANCE_COUNT`  |  For TensorFlow-based models, the `tensorflow_model_server` binary is the operational piece that is responsible for loading a model in memory, running inputs against a model graph, and deriving outputs. Typically, a single instance of this binary is launched to serve models in an endpoint. This binary is internally multi-threaded and spawns multiple threads to respond to an inference request. In certain instances, if you observe that the CPU is reasonably utilized (over 30% utilization) but the memory is underutilized (less than 10% utilization), increasing this parameter might help. Increasing the number of `tensorflow_model_servers` available to serve typically increases the throughput of an endpoint.  | 
|  `SAGEMAKER_TFS_FRACTIONAL_GPU_MEM_MARGIN`  |  This parameter governs the fraction of the available GPU memory to initialize CUDA/cuDNN and other GPU libraries. `0.2` means 20% of the available GPU memory is reserved for initializing CUDA/cuDNN and other GPU libraries, and 80% of the available GPU memory is allocated equally across the TF processes. GPU memory is pre-allocated unless the `allow_growth` option is enabled.  | 
| `SAGEMAKER_TFS_INTER_OP_PARALLELISM` | This ties back to the `inter_op_parallelism_threads` variable. This variable determines the number of threads used by independent non-blocking operations. `0` means that the system picks an appropriate number. | 
| `SAGEMAKER_TFS_INTRA_OP_PARALLELISM` | This ties back to the `intra_op_parallelism_threads` variable. This determines the number of threads that can be used for certain operations like matrix multiplication and reductions for speedups. A value of `0` means that the system picks an appropriate number. | 
| `SAGEMAKER_GUNICORN_WORKERS` | This governs the number of worker processes that Gunicorn is requested to spawn for handling requests. This value is used in combination with other parameters to derive a set that maximizes inference throughput. In addition to this, the `SAGEMAKER_GUNICORN_WORKER_CLASS` governs the type of workers spawned, typically `async` or `gevent`. | 
| `SAGEMAKER_GUNICORN_WORKER_CLASS` | This governs the class of worker processes that Gunicorn spawns to handle requests, typically `async` or `gevent`. | 
| `OMP_NUM_THREADS` | Python internally uses OpenMP for implementing multithreading within processes. Typically, threads equivalent to the number of CPU cores are spawned. But when implemented on top of Simultaneous Multi-Threading (SMT), such as Intel Hyper-Threading, a certain process might oversubscribe a particular core by spawning twice as many threads as the number of actual CPU cores. In certain cases, a Python binary might end up spawning up to four times as many threads as there are available processor cores. Therefore, if you have oversubscribed the available cores with worker threads, an ideal setting for this parameter is `1`, or half the number of CPU cores on a CPU with SMT turned on. | 
|  `TF_DISABLE_MKL` `TF_DISABLE_POOL_ALLOCATOR`  | In some cases, turning off MKL can speed up inference if `TF_DISABLE_MKL` and `TF_DISABLE_POOL_ALLOCATOR` are set to `1`. | 
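
These variables are set on the model container through the `Environment` map of the `CreateModel` request. The following is a sketch with illustrative values only; the specific numbers are examples, not recommendations, and the commented call shows where the map would be passed.

```
# Illustrative environment settings for a TensorFlow-based SageMaker container.
# All values must be strings; the numbers below are examples only.
tf_environment = {
    "SAGEMAKER_TFS_INSTANCE_COUNT": "2",
    "SAGEMAKER_TFS_INTER_OP_PARALLELISM": "0",   # 0 lets the system pick
    "SAGEMAKER_TFS_INTRA_OP_PARALLELISM": "0",
    "SAGEMAKER_GUNICORN_WORKERS": "4",
    "OMP_NUM_THREADS": "1",
}

# Passed when creating the model, for example:
# sagemaker_client.create_model(
#     ModelName="<model_name>",
#     PrimaryContainer={"Image": "<image_uri>",
#                       "ModelDataUrl": "<s3_model_artifact>",
#                       "Environment": tf_environment},
#     ExecutionRoleArn="<role_arn>")
print(tf_environment["OMP_NUM_THREADS"])
```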

**PyTorch**


| Environment variable | Description | 
| --- | --- | 
|  `SAGEMAKER_TS_MAX_BATCH_DELAY`  |  The maximum batch delay time that TorchServe waits for a batch of requests to fill before sending the received requests to the model handler.  | 
|  `SAGEMAKER_TS_BATCH_SIZE`  |  If TorchServe doesn’t receive the number of requests specified in `batch_size` before the timer runs out, it sends the requests that were received to the model handler.  | 
|  `SAGEMAKER_TS_MIN_WORKERS`  |  The minimum number of workers to which TorchServe is allowed to scale down.  | 
|  `SAGEMAKER_TS_MAX_WORKERS`  |  The maximum number of workers to which TorchServe is allowed to scale up.  | 
|  `SAGEMAKER_TS_RESPONSE_TIMEOUT`  |  The time delay after which an inference times out in the absence of a response.  | 
|  `SAGEMAKER_TS_MAX_REQUEST_SIZE`  |  The maximum payload size for TorchServe.  | 
|  `SAGEMAKER_TS_MAX_RESPONSE_SIZE`  |  The maximum response size for TorchServe.  | 

**Multi Model Server (MMS)**


| Environment variable | Description | 
| --- | --- | 
|  `job_queue_size`  |  This parameter is useful to tune when you have a scenario where the type of the inference request payload is large, and due to the size of payload being larger, you may have higher heap memory consumption of the JVM in which this queue is being maintained. Ideally you might want to keep the heap memory requirements of JVM lower and allow Python workers to allot more memory for actual model serving. JVM is only for receiving the HTTP requests, queuing them, and dispatching them to the Python-based workers for inference. If you increase the `job_queue_size`, you might end up increasing the heap memory consumption of the JVM and ultimately taking away memory from the host that could have been used by Python workers. Therefore, exercise caution when tuning this parameter as well.  | 
|  `default_workers_per_model`  |  This parameter is for the backend model serving and might be valuable to tune since this is the critical component of the overall model serving, based on which the Python processes spawn threads for each Model. If this component is slower (or not tuned properly), the front-end tuning might not be effective.  | 

## Q: How do I make sure my container supports Asynchronous Inference?


You can use the same container for Asynchronous Inference that you do for Real-Time Inference or Batch Transform. You should confirm that the timeouts and payload size limits on your container are set to handle larger payloads and longer timeouts.

## Q: What are the limits specific to Asynchronous Inference, and can they be adjusted?


Refer to the following limits for Asynchronous Inference:
+ Payload size limit: 1 GB
+ Timeout limit: A request can take up to 60 minutes.
+ Queue message TimeToLive (TTL): 6 hours
+ Number of messages that can be put inside Amazon SQS: Unlimited. However, there is a quota of 120,000 for the number of in-flight messages for a standard queue, and 20,000 for a FIFO queue.

## Q: What metrics are best to define for autoscaling on Asynchronous Inference? Can I have multiple scaling policies?


In general, with Asynchronous Inference, you can scale out based on invocations or instances. For invocation metrics, it's a good idea to look at `ApproximateBacklogSize`, a metric that reports the number of items in your queue that have yet to be processed. You can use this metric or the `InvocationsPerInstance` metric to understand the TPS at which you may be getting throttled. At the instance level, check your instance type and its CPU/GPU utilization to decide when to scale out. If a single instance is above 60-70% utilization, this is often a good sign that you are saturating your hardware.

We don't recommend having multiple scaling policies, as these can conflict and lead to confusion at the hardware level, causing delays when scaling out.

## Q: Why is my asynchronous endpoint terminating an instance as `Unhealthy` and the update requests from autoscaling are failing?


Check whether your container can handle ping and invoke requests concurrently. SageMaker AI invoke requests can take approximately 3 minutes, and during this time multiple ping requests often fail due to the timeout, causing SageMaker AI to mark your container as `Unhealthy`.

## Q: Can `MaxConcurrentInvocationsPerInstance` work for my BYOC model container with the nginx/gunicorn/flask settings?


Yes. `MaxConcurrentInvocationsPerInstance` is a feature of asynchronous endpoints. This does not depend on the custom container implementation. `MaxConcurrentInvocationsPerInstance` controls the rate at which invoke requests are sent to the customer container. If this value is set as `1`, then only 1 request is sent to the container at a time, no matter how many workers are on the customer container.
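
For reference, you set this value in the `ClientConfig` section of `AsyncInferenceConfig` when creating your endpoint configuration. A minimal sketch of that fragment (the output path is a placeholder):

```
# AsyncInferenceConfig fragment limiting the container to one in-flight request.
async_inference_config = {
    "ClientConfig": {
        "MaxConcurrentInvocationsPerInstance": 1
    },
    "OutputConfig": {
        "S3OutputPath": "s3://<bucket>/<output_directory>"
    },
}
print(async_inference_config["ClientConfig"]["MaxConcurrentInvocationsPerInstance"])
```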

## Q: How can I debug model server errors (500) on my asynchronous endpoint?


This error means that the customer container returned an error. SageMaker AI does not control the behavior of customer containers; it simply returns the response from the model container and does not retry. If you want, you can configure the invocation to retry on failure. We suggest that you turn on container logging and check your container logs to find the root cause of the 500 error from your model. Also check the corresponding `CPUUtilization` and `MemoryUtilization` metrics at the point of failure. You can also configure an [S3FailurePath](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_AsyncInferenceOutputConfig.html#sagemaker-Type-AsyncInferenceOutputConfig-S3FailurePath) for the model response and use it with Amazon SNS as part of the Async Error Notifications to investigate failures.

## Q: How can I know if `MaxConcurrentInvocationsPerInstance=1` takes effect? Are there any metrics that I can check?


You can check the metric `InvocationsProcesssed`, which should align with the number of invocations that you expect to be processed in a minute based on single concurrency.

## Q: How can I track the success and failures of my invocation requests? What are the best practices?


The best practice is to enable Amazon SNS, which is a notification service for messaging-oriented applications, with multiple subscribers requesting and receiving "push" notifications of time-critical messages from a choice of transport protocols, including HTTP, Amazon SQS, and email. Asynchronous Inference posts notifications when you create an endpoint with `CreateEndpointConfig` and specify an Amazon SNS topic.

To use Amazon SNS to check prediction results from your asynchronous endpoint, you first need to create a topic, subscribe to the topic, confirm your subscription to the topic, and note the Amazon Resource Name (ARN) of that topic. For detailed information on how to create, subscribe, and find the Amazon ARN of an Amazon SNS topic, see [Configuring Amazon SNS](https://docs.aws.amazon.com/sns/latest/dg/sns-configuring.html) in the *Amazon SNS Developer Guide*. For more information about how to use Amazon SNS with Asynchronous Inference, see [Check prediction results](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-check-predictions.html).

## Q: Can I define a scaling policy that scales up from zero instances upon receiving a new request?


Yes. Asynchronous Inference provides a mechanism to scale down to zero instances when there are no requests. If your endpoint has been scaled down to zero instances during these periods, then your endpoint won’t scale up again until the number of requests in the queue exceeds the target specified in your scaling policy. This can result in long waiting times for requests in the queue. In such cases, if you want to scale up from zero instances for new requests while the queue is below the specified target, you can use an additional scaling policy based on the `HasBacklogWithoutCapacity` metric. For more information about how to define this scaling policy, see [Autoscale an asynchronous endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-autoscale.html#async-inference-autoscale-scale-up).

## Q: I’m getting an error that the instance type is not supported for Asynchronous Inference. What are the instance types Asynchronous Inference supports?


For an exhaustive list of instances supported by Asynchronous Inference per region, see [SageMaker pricing](https://aws.amazon.com/sagemaker/pricing/). Check if the required instance is available in your region before proceeding.