

# Real-time inference


Real-time inference is ideal for inference workloads with real-time, interactive, low-latency requirements. You can deploy your model to SageMaker AI hosting services and get an endpoint that can be used for inference. These endpoints are fully managed and support automatic scaling (see [Automatic scaling of Amazon SageMaker AI models](endpoint-auto-scaling.md)).

**Topics**
+ [Deploy models for real-time inference](realtime-endpoints-deploy-models.md)
+ [Invoke models for real-time inference](realtime-endpoints-test-endpoints.md)
+ [Endpoints](realtime-endpoints-manage.md)
+ [Hosting options](realtime-endpoints-options.md)
+ [Automatic scaling of Amazon SageMaker AI models](endpoint-auto-scaling.md)
+ [Instance storage volumes](host-instance-storage.md)
+ [Validation of models in production](model-validation.md)
+ [Online explainability with SageMaker Clarify](clarify-online-explainability.md)
+ [Fine-tune models with adapter inference components](realtime-endpoints-adapt.md)

# Deploy models for real-time inference

**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.
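For example, a custom policy statement that allows model creation should grant the tagging permission alongside the create permission. The following statement is a minimal, illustrative sketch rather than a complete policy:

```
{
    "Effect": "Allow",
    "Action": [
        "sagemaker:CreateModel",
        "sagemaker:AddTags"
    ],
    "Resource": "*"
}
```

Without the `sagemaker:AddTags` action, the automatic tagging that Studio performs fails, and the resource creation is denied.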

There are several options to deploy a model using SageMaker AI hosting services. You can interactively deploy a model with SageMaker Studio. Or, you can programmatically deploy a model using an AWS SDK, such as the SageMaker Python SDK or the SDK for Python (Boto3). You can also deploy by using the AWS CLI.

## Before you begin


Before you deploy a SageMaker AI model, locate and make note of the following:
+ The AWS Region where your Amazon S3 bucket is located
+ The Amazon S3 URI path where the model artifacts are stored
+ The IAM role for SageMaker AI
+ The Amazon ECR registry path for the custom Docker image that contains the inference code, or the framework and version of a built-in Docker image that is supported by AWS

For a list of AWS services available in each AWS Region, see [Region Maps and Edge Networks](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/). For information about how to create an IAM role, see [Creating IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create.html).

**Important**  
The Amazon S3 bucket where the model artifacts are stored must be in the same AWS Region as the model that you are creating.

## Shared resource utilization with multiple models

You can deploy one or more models to an endpoint with Amazon SageMaker AI. When multiple models share an endpoint, they jointly utilize the resources that are hosted there, such as the ML compute instances, CPUs, and accelerators. The most flexible way to deploy multiple models to an endpoint is to define each model as an *inference component*.

### Inference components


An inference component is a SageMaker AI hosting object that you can use to deploy a model to an endpoint. In the inference component settings, you specify the model, the endpoint, and how the model utilizes the resources that the endpoint hosts. To specify the model, you can specify a SageMaker AI Model object, or you can directly specify the model artifacts and image.

In the settings, you can optimize resource utilization by tailoring how the required CPU cores, accelerators, and memory are allocated to the model. You can deploy multiple inference components to an endpoint, where each inference component contains one model and the resource utilization needs for that model. 

After you deploy an inference component, you can directly invoke the associated model when you use the InvokeEndpoint action in the SageMaker API.
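For example, with the SDK for Python (Boto3), you route a request to a specific model by setting the `InferenceComponentName` parameter of `invoke_endpoint`. The following sketch builds the request arguments; the endpoint name, component name, and payload shape are hypothetical placeholders:

```
import json

def build_invoke_args(endpoint_name, component_name, payload):
    # The InferenceComponentName parameter routes the request
    # to one specific model hosted on the shared endpoint.
    return {
        "EndpointName": endpoint_name,
        "InferenceComponentName": component_name,
        "ContentType": "application/json",
        "Body": json.dumps(payload),
    }

args = build_invoke_args("my-endpoint", "my-inference-component", {"inputs": "Hello"})

# To send the request, pass the arguments to the SageMaker runtime client:
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(**args)
```

If you omit `InferenceComponentName` on an endpoint that hosts multiple inference components, SageMaker AI cannot determine which model to invoke, so the parameter is required for inference component based endpoints.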

Inference components provide the following benefits:

**Flexibility**  
The inference component decouples the details of hosting the model from the endpoint itself. This provides more flexibility and control over how models are hosted and served with an endpoint. You can host multiple models on the same infrastructure, and you can add or remove models from an endpoint as needed. You can update each model independently.

**Scalability**  
You can specify how many copies of each model to host, and you can set a minimum number of copies to ensure that the model loads in the quantity that you require to serve requests. You can scale any inference component copy down to zero, which makes room for another copy to scale up. 
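As an illustration, the copy count of an inference component can be scaled with Application Auto Scaling by registering it as a scalable target. The component name below is a placeholder, and the resource ID format and scalable dimension string follow the Application Auto Scaling conventions for inference components:

```
def build_scaling_target(component_name, min_copies=0, max_copies=4):
    # Registers the inference component's copy count as a scalable target.
    # A MinCapacity of 0 allows the copy count to scale down to zero.
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"inference-component/{component_name}",
        "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
        "MinCapacity": min_copies,
        "MaxCapacity": max_copies,
    }

target = build_scaling_target("my-inference-component")

# To apply it, pass the arguments to the Application Auto Scaling client:
# autoscaling = boto3.client("application-autoscaling")
# autoscaling.register_scalable_target(**target)
```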

SageMaker AI packages your models as inference components when you deploy them by using:
+ SageMaker Studio Classic.
+ The SageMaker Python SDK to deploy a Model object (where you set the endpoint type to `EndpointType.INFERENCE_COMPONENT_BASED`).
+ The AWS SDK for Python (Boto3) to define `InferenceComponent` objects that you deploy to an endpoint.

## Deploy models with SageMaker Studio

Complete the following steps to create and deploy your model interactively through SageMaker Studio. For more information about Studio, see the [Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html) documentation. For more walkthroughs of various deployment scenarios, see the blog [Package and deploy classical ML models and LLMs easily with Amazon SageMaker AI – Part 2](https://aws.amazon.com/blogs/machine-learning/package-and-deploy-classical-ml-and-llms-easily-with-amazon-sagemaker-part-2-interactive-user-experiences-in-sagemaker-studio/).

### Prepare your artifacts and permissions


Complete this section before creating a model in SageMaker Studio.

You have two options for bringing your artifacts and creating a model in Studio:

1. You can bring a pre-packaged `tar.gz` archive, which should include your model artifacts, any custom inference code, and any dependencies listed in a `requirements.txt` file.

1. SageMaker AI can package your artifacts for you. You only have to bring your raw model artifacts and any dependencies in a `requirements.txt` file, and SageMaker AI can provide default inference code for you (or you can override the default code with your own custom inference code). SageMaker AI supports this option for the following frameworks: PyTorch and XGBoost.

In addition to bringing your model, your AWS Identity and Access Management (IAM) role, and a Docker container (or desired framework and version for which SageMaker AI has a pre-built container), you must also grant permissions to create and deploy models through SageMaker AI Studio.

You should have the [AmazonSageMakerFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonSageMakerFullAccess.html) policy attached to your IAM role so that you can access SageMaker AI and other relevant services. To see the prices of the instance types in Studio, you also must attach the [AWSPriceListServiceFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AWSPriceListServiceFullAccess.html) policy (or if you don’t want to attach the whole policy, more specifically, the `pricing:GetProducts` action).

If you choose to upload your model artifacts when creating a model (or upload a sample payload file for inference recommendations), then you must create an Amazon S3 bucket. The bucket name must be prefixed by the word `sagemaker`. Alternate capitalizations are also acceptable: `Sagemaker` or `SageMaker`.

We recommend that you use the bucket naming convention `sagemaker-{Region}-{accountID}`. This bucket is used to store the artifacts that you upload.
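For example, the recommended name can be assembled from your Region and account ID (both values below are placeholders):

```
region = "us-west-2"         # Your AWS Region
account_id = "111122223333"  # Your AWS account ID

# Follows the recommended sagemaker-{Region}-{accountID} convention,
# which also satisfies the required "sagemaker" prefix.
bucket_name = f"sagemaker-{region}-{account_id}"
```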

After creating the bucket, attach the following CORS (cross-origin resource sharing) policy to the bucket:

```
[
    {
        "AllowedHeaders": ["*"],
        "ExposeHeaders": ["Etag"],
        "AllowedMethods": ["PUT", "POST"],
        "AllowedOrigins": ["https://*.sagemaker.aws"]
    }
]
```

You can attach a CORS policy to an Amazon S3 bucket by using any of the following methods:
+ Through the [Edit cross-origin resource sharing (CORS)](https://s3.console.aws.amazon.com/s3/bucket/bucket-name/property/cors/edit) page in the Amazon S3 console
+ Using the Amazon S3 API [PutBucketCors](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutBucketCors.html)
+ Using the put-bucket-cors AWS CLI command:

  ```
  aws s3api put-bucket-cors --bucket="..." --cors-configuration="..."
  ```
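As an alternative sketch using the SDK for Python (Boto3), the same rule can be applied with the `put_bucket_cors` method; the bucket name is a placeholder:

```
cors_configuration = {
    "CORSRules": [
        {
            "AllowedHeaders": ["*"],
            "ExposeHeaders": ["Etag"],
            "AllowedMethods": ["PUT", "POST"],
            "AllowedOrigins": ["https://*.sagemaker.aws"],
        }
    ]
}

# To apply the rule, pass the configuration to the Amazon S3 client:
# s3 = boto3.client("s3")
# s3.put_bucket_cors(Bucket="amzn-s3-demo-bucket", CORSConfiguration=cors_configuration)
```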

### Create a deployable model


In this step, you create a deployable version of your model in SageMaker AI by providing your artifacts along with additional specifications, such as your desired container and framework, any custom inference code, and network settings.

Create a deployable model in SageMaker Studio by doing the following:

1. Open the SageMaker Studio application.

1. In the left navigation pane, choose **Models**.

1. Choose the **Deployable models** tab.

1. On the **Deployable models** page, choose **Create**.

1. On the **Create deployable model** page, for the **Model name** field, enter a name for the model.

There are several more sections for you to fill out on the **Create deployable model** page.

The **Container definition** section looks like the following screenshot:

![\[Screenshot of the Container definition section for creating a model in Studio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-container-definition.png)


For the **Container definition** section, do the following:

1. For **Container type**, select **Pre-built container** if you'd like to use a SageMaker AI managed container, or select **Bring your own container** if you have your own container.

1. If you selected **Pre-built container**, select the **Container framework**, **Framework version**, and **Hardware type** that you'd like to use.

1. If you selected **Bring your own container**, enter an Amazon ECR path for **ECR path to container image**.

Then, fill out the **Artifacts** section, which looks like the following screenshot:

![\[Screenshot of the Artifacts section for creating a model in Studio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-artifacts-section.png)


For the **Artifacts** section, do the following:

1. If you're using one of the frameworks that SageMaker AI supports for packaging model artifacts (PyTorch or XGBoost), then for **Artifacts**, you can choose the **Upload artifacts** option. With this option, you specify your raw model artifacts, any custom inference code you have, and your `requirements.txt` file, and SageMaker AI handles packaging the archive for you. Do the following:

   1. For **Artifacts**, select **Upload artifacts** to continue providing your files. Otherwise, if you already have a `tar.gz` archive that contains your model files, inference code, and `requirements.txt` file, then select **Input S3 URI to pre-packaged artifacts**.

   1. If you chose to upload your artifacts, then for **S3 bucket**, enter the Amazon S3 path to a bucket where you'd like SageMaker AI to store your artifacts after packaging them for you. Then, complete the following steps.

   1. For **Upload model artifacts**, upload your model files.

   1. For **Inference code**, select **Use default inference code** if you'd like to use default code that SageMaker AI provides for serving inference. Otherwise, select **Upload customized inference code** to use your own inference code.

   1. For **Upload requirements.txt**, upload a text file that lists any dependencies that you want to install at runtime.

1. If you're not using a framework that SageMaker AI supports for packaging model artifacts, then Studio shows you the **Pre-packaged artifacts** option, and you must provide all of your artifacts already packaged as a `tar.gz` archive. Do the following:

   1. For **Pre-packaged artifacts**, select **Input S3 URI for pre-packaged model artifacts** if you have your `tar.gz` archive already uploaded to Amazon S3. Select **Upload pre-packaged model artifacts** if you want to directly upload your archive to SageMaker AI.

   1. If you selected **Input S3 URI for pre-packaged model artifacts**, enter the Amazon S3 path to your archive for **S3 URI**. Otherwise, select and upload the archive from your local machine.

The next section is **Security**, which looks like the following screenshot:

![\[Screenshot of the Security section for creating a model in Studio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-security-section.png)


For the **Security** section, do the following:

1. For **IAM role**, enter the ARN for an IAM role.

1. (Optional) For **Virtual Private Cloud (VPC)**, you can select an Amazon VPC for storing your model configuration and artifacts.

1. (Optional) Turn on the **Network isolation** toggle if you want to restrict your container's internet access.

Finally, you can optionally fill out the **Advanced options** section, which looks like the following screenshot:

![\[Screenshot of the Advanced options section for creating a model in Studio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-advanced-options.png)


(Optional) For the **Advanced options** section, do the following:

1. Turn on the **Customized instance recommendations** toggle if you want to run an Amazon SageMaker Inference Recommender job on your model after its creation. Inference Recommender is a feature that provides you with recommended instance types for optimizing inference performance and cost. You can view these instance recommendations when preparing to deploy your model.

1. For **Add environment variables**, enter environment variables for your container as key-value pairs.

1. For **Tags**, enter any tags as key-value pairs.

1. After finishing your model and container configuration, choose **Create deployable model**.

You should now have a model in SageMaker Studio that is ready for deployment.

### Deploy your model


Finally, you deploy the model you configured in the previous step to an HTTPS endpoint. You can deploy either a single model or multiple models to the endpoint.

**Model and endpoint compatibility**  
Before you can deploy a model to an endpoint, the model and endpoint must be compatible by having the same values for the following settings:
+ The IAM role
+ The Amazon VPC, including its subnets and security groups
+ The network isolation setting (enabled or disabled)

Studio prevents you from deploying models to incompatible endpoints in the following ways:
+ If you attempt to deploy a model to a new endpoint, SageMaker AI configures the endpoint with initial settings that are compatible. If you break the compatibility by changing these settings, Studio shows an alert and prevents your deployment.
+ If you attempt to deploy to an existing endpoint, and that endpoint is incompatible, Studio shows an alert and prevents your deployment.
+ If you attempt to add multiple models to a deployment, Studio prevents you from deploying models that are incompatible with each other.

When Studio shows the alert about model and endpoint incompatibility, you can choose **View details** in the alert to see which settings are incompatible.

One way to deploy a model is by doing the following in Studio:

1. Open the SageMaker Studio application.

1. In the left navigation pane, choose **Models**.

1. On the **Models** page, select one or more models from the list of SageMaker AI models.

1. Choose **Deploy**.

1. For **Endpoint name**, open the dropdown menu. You can either select an existing endpoint or you can create a new endpoint to which you deploy the model.

1. For **Instance type**, select the instance type that you want to use for the endpoint. If you previously ran an Inference Recommender job for the model, your recommended instance types appear in the list under the title **Recommended**. Otherwise, you'll see a few **Prospective instances** that might be suitable for your model.
**Instance type compatibility for JumpStart**  
If you're deploying a JumpStart model, Studio only shows instance types that the model supports.

1. For **Initial instance count**, enter the initial number of instances that you'd like to provision for your endpoint.

1. For **Maximum instance count**, specify the maximum number of instances that the endpoint can provision when it scales up to accommodate an increase in traffic.

1. If the model you're deploying is one of the most used JumpStart LLMs from the model hub, then the **Alternate configurations** option appears after the instance type and instance count fields.

   For the most popular JumpStart LLMs, AWS has pre-benchmarked instance types to optimize for either cost or performance. This data can help you decide which instance type to use for deploying your LLM. Choose **Alternate configurations** to open a dialog box that contains the pre-benchmarked data. The panel looks like the following screenshot:  
![\[Screenshot of the Alternate configurations box\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-jumpstart-alternate-configurations.png)

   In the **Alternate configurations** box, do the following:

   1. Select an instance type. You can choose **Cost per hour** or **Best performance** to see instance types that optimize either cost or performance for the specified model. You can also choose **Other supported instances** to see a list of other instance types that are compatible with the JumpStart model. Note that selecting an instance type here overwrites any previous instance selection specified in Step 6.

   1. (Optional) Turn on the **Customize the selected configuration** toggle to specify **Max total tokens** (the maximum number of tokens that you want to allow, which is the sum of your input tokens and the model's generated output), **Max input token length** (the maximum number of tokens you want to allow for the input of each request), and **Max concurrent requests** (the maximum number of requests that the model can process at a time).

   1. Choose **Select** to confirm your instance type and configuration settings.

1. The **Model** field should already be populated with the name of the model or models that you're deploying. You can choose **Add model** to add more models to the deployment. For each model that you add, fill out the following fields:

   1. For **Number of CPU cores**, enter the CPU cores that you'd like to dedicate for the model's usage.

   1. For **Min number of copies**, enter the minimum number of model copies that you want to have hosted on the endpoint at any given time.

   1. For **Min CPU memory (MB)**, enter the minimum amount of memory (in MB) that the model requires.

   1. For **Max CPU memory (MB)**, enter the maximum amount of memory (in MB) that you'd like to allow the model to use.

1. (Optional) For the **Advanced options**, do the following:

   1. For **IAM role**, use either the default SageMaker AI IAM execution role, or specify your own role that has the permissions you need. Note that this IAM role must be the same as the role that you specified when creating the deployable model.

   1. For **Virtual Private Cloud (VPC)**, you can specify a VPC in which you want to host your endpoint.

   1. For **Encryption KMS key**, select an AWS KMS key to encrypt data on the storage volume attached to the ML compute instance that hosts the endpoint.

   1. Turn on the **Enable network isolation** toggle to restrict your container's internet access.

   1. For **Timeout configuration**, enter values for the **Model data download timeout (seconds)** and **Container startup health check timeout (seconds)** fields. These values determine the maximum amount of time that SageMaker AI allows for downloading the model to the container and starting up the container, respectively.

   1. For **Tags**, enter any tags as key-value pairs.
**Note**  
SageMaker AI configures the IAM role, VPC, and network isolation settings with initial values that are compatible with the model that you're deploying. If you break the compatibility by changing these settings, Studio shows an alert and prevents your deployment.

After configuring your options, the page should look like the following screenshot.

![\[Screenshot of the Deploy model page in Studio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-deploy-realtime-model-2.png)


After configuring your deployment, choose **Deploy** to create the endpoint and deploy your model.

## Deploy models with the Python SDKs

Using the SageMaker Python SDK, you can build your model in two ways. The first is to create a model object from the `Model` or `ModelBuilder` class. If you use the `Model` class to create your `Model` object, you need to specify the model package or inference code (depending on your model server), scripts to handle serialization and deserialization of data between the client and server, and any dependencies to be uploaded to Amazon S3 for consumption. The second way is to use `ModelBuilder`, for which you provide model artifacts or inference code. `ModelBuilder` automatically captures your dependencies, infers the needed serialization and deserialization functions, and packages your dependencies to create your `Model` object. For more information about `ModelBuilder`, see [Create a model in Amazon SageMaker AI with ModelBuilder](how-it-works-modelbuilder-creation.md).

The following section describes both methods to create your model and deploy your model object.

### Set up


The following examples prepare for the model deployment process. They import the necessary libraries and define the S3 URL that locates the model artifacts.

------
#### [ SageMaker Python SDK ]

**Example import statements**  
The following example imports modules from the SageMaker Python SDK, the SDK for Python (Boto3), and the Python Standard Library. These modules provide useful methods that help you deploy models, and they're used by the remaining examples that follow.  

```
import boto3
from datetime import datetime
from sagemaker.compute_resource_requirements.resource_requirements import ResourceRequirements
from sagemaker.predictor import Predictor
from sagemaker.enums import EndpointType
from sagemaker.model import Model
from sagemaker.session import Session
```

------
#### [ boto3 inference components ]

**Example import statements**  
The following example imports modules from the SDK for Python (Boto3) and the Python Standard Library. These modules provide useful methods that help you deploy models, and they're used by the remaining examples that follow.  

```
import boto3
import botocore
import sys
import time
```

------
#### [ boto3 models (without inference components) ]

**Example import statements**  
The following example imports modules from the SDK for Python (Boto3) and the Python Standard Library. These modules provide useful methods that help you deploy models, and they're used by the remaining examples that follow.  

```
import boto3
import botocore
import datetime
from time import gmtime, strftime
```

------

**Example model artifact URL**  
The following code builds an example Amazon S3 URL. The URL locates the model artifacts for a pre-trained model in an Amazon S3 bucket.  

```
# Create a variable w/ the model S3 URL

# The name of your S3 bucket:
s3_bucket = "amzn-s3-demo-bucket"
# The directory within your S3 bucket your model is stored in:
bucket_prefix = "sagemaker/model/path"
# The file name of your model artifact:
model_filename = "my-model-artifact.tar.gz"
# Relative S3 path:
model_s3_key = f"{bucket_prefix}/{model_filename}"
# Combine the bucket name, relative S3 path, and model file name to create the S3 model URL:
model_url = f"s3://{s3_bucket}/{model_s3_key}"
```
The full Amazon S3 URL is stored in the variable `model_url`, which is used in the examples that follow. 

### Overview


There are multiple ways that you can deploy models with the SageMaker Python SDK or the SDK for Python (Boto3). The following sections summarize the steps that you complete for several possible approaches. These steps are demonstrated by the examples that follow.

------
#### [ SageMaker Python SDK ]

Using the SageMaker Python SDK, you can build your model in either of the following ways:
+ **Create a model object from the `Model` class** – You must specify the model package or inference code (depending on your model server), scripts to handle serialization and deserialization of data between the client and server, and any dependencies to be uploaded to Amazon S3 for consumption. 
+ **Create a model object from the `ModelBuilder` class** – You provide model artifacts or inference code, and `ModelBuilder` automatically captures your dependencies, infers the needed serialization and deserialization functions, and packages your dependencies to create your `Model` object.

  For more information about `ModelBuilder`, see [Create a model in Amazon SageMaker AI with ModelBuilder](how-it-works-modelbuilder-creation.md). You can also see the blog [Package and deploy classical ML models and LLMs easily with SageMaker AI – Part 1](https://aws.amazon.com/blogs/machine-learning/package-and-deploy-classical-ml-and-llms-easily-with-amazon-sagemaker-part-1-pysdk-improvements/) for more information.

The examples that follow describe both methods to create your model and deploy your model object. To deploy a model in these ways, you complete the following steps:

1. Define the endpoint resources to allocate to the model with a `ResourceRequirements` object.

1. Create a model object from the `Model` or `ModelBuilder` classes. The `ResourceRequirements` object is specified in the model settings.

1. Deploy the model to an endpoint by using the `deploy` method of the `Model` object.

------
#### [ boto3 inference components ]

The examples that follow demonstrate how to assign a model to an inference component and then deploy the inference component to an endpoint. To deploy a model in this way, you complete the following steps:

1. (Optional) Create a SageMaker AI model object by using the [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_model.html) method.

1. Specify the settings for your endpoint by creating an endpoint configuration object. To create one, you use the [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_endpoint_config.html#create-endpoint-config) method.

1. Create your endpoint by using the [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_endpoint.html) method, and in your request, provide the endpoint configuration that you created.

1. Create an inference component by using the `create_inference_component` method. In the settings, you specify a model by doing either of the following:
   + Specifying a SageMaker AI model object
   + Specifying the model image URI and S3 URL

   You also allocate endpoint resources to the model. By creating the inference component, you deploy the model to the endpoint. You can deploy multiple models to an endpoint by creating multiple inference components, one for each model.

------
#### [ boto3 models (without inference components) ]

The examples that follow demonstrate how to create a model object and then deploy the model to an endpoint. To deploy a model in this way, you complete the following steps:

1. Create a SageMaker AI model by using the [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_model.html) method.

1. Specify the settings for your endpoint by creating an endpoint configuration object. To create one, you use the [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_endpoint_config.html#create-endpoint-config) method. In the endpoint configuration, you assign the model object to a production variant.

1. Create your endpoint by using the [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_endpoint.html) method. In your request, provide the endpoint configuration that you created.

   When you create the endpoint, SageMaker AI provisions the endpoint resources, and it deploys the model to the endpoint.

------

### Configure


The following examples configure the resources that you require to deploy a model to an endpoint.

------
#### [ SageMaker Python SDK ]

The following example assigns endpoint resources to a model with a `ResourceRequirements` object. These resources include CPU cores, accelerators, and memory. Then, the example creates a model object from the `Model` class. Alternatively, you can create a model object by instantiating the [ModelBuilder](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-modelbuilder-creation.html) class and running its `build` method; this approach is also shown in the example. `ModelBuilder` provides a unified interface for model packaging, and in this instance, prepares a model for a large model deployment. The example uses `ModelBuilder` to construct a Hugging Face model. (You can also pass a JumpStart model.) After you build the model, you can specify resource requirements in the model object. In the next step, you use this object to deploy the model to an endpoint.

```
resources = ResourceRequirements(
    requests = {
        "num_cpus": 2,  # Number of CPU cores required
        "num_accelerators": 1,  # Number of accelerators required
        "memory": 8192,  # Minimum memory required in MB (required)
        "copies": 1,
    },
    limits = {},
)

now = datetime.now()
dt_string = now.strftime("%d-%m-%Y-%H-%M-%S")
model_name = "my-sm-model-" + dt_string

# Build your model with the Model class
model = Model(
    name = model_name,
    image_uri = "image-uri",
    model_data = model_url,
    role = "arn:aws:iam::111122223333:role/service-role/role-name",
    resources = resources,
    predictor_cls = Predictor,
)

# Alternate mechanism using ModelBuilder
# Uncomment the following section to use ModelBuilder
#
# model_builder = ModelBuilder(
#     model="<HuggingFace-ID>",  # for example, "meta-llama/Llama-2-7b-hf"
#     schema_builder=SchemaBuilder(sample_input, sample_output),
#     env_vars={ "HUGGING_FACE_HUB_TOKEN": "<HuggingFace_token>" },
# )
#
# # Build your Model object
# model = model_builder.build()
#
# # Create a unique name from the string 'mb-inference-component'
# model.model_name = unique_name_from_base("mb-inference-component")
#
# # Assign resources to your model
# model.resources = resources
```

------
#### [ boto3 inference components ]

The following example configures an endpoint with the `create_endpoint_config` method. You assign this configuration to an endpoint when you create it. In the configuration, you define one or more production variants. For each variant, you can choose the instance type that you want Amazon SageMaker AI to provision, and you can enable managed instance scaling.

```
endpoint_config_name = "endpoint-config-name"
endpoint_name = "endpoint-name"
inference_component_name = "inference-component-name"
variant_name = "variant-name"

sagemaker_client.create_endpoint_config(
    EndpointConfigName = endpoint_config_name,
    ExecutionRoleArn = "arn:aws:iam::111122223333:role/service-role/role-name",
    ProductionVariants = [
        {
            "VariantName": variant_name,
            "InstanceType": "ml.p4d.24xlarge",
            "InitialInstanceCount": 1,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 1,
                "MaxInstanceCount": 2,
            },
        }
    ],
)
```

------
#### [ boto3 models (without inference components) ]

**Example model definition**  
The following example defines a SageMaker AI model with the `create_model` method in the AWS SDK for Python (Boto3).  

```
model_name = "model-name"

create_model_response = sagemaker_client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = "arn:aws:iam::111122223333:role/service-role/role-name",
    PrimaryContainer = {
        "Image": "image-uri",
        "ModelDataUrl": model_url,
    }
)
```
This example specifies the following:  
+ `ModelName`: A name for your model (in this example it is stored as a string variable called `model_name`).
+ `ExecutionRoleArn`: The Amazon Resource Name (ARN) of the IAM role that Amazon SageMaker AI can assume to access model artifacts and Docker images for deployment on ML compute instances or for batch transform jobs.
+ `PrimaryContainer`: The location of the primary Docker image containing inference code, associated artifacts, and custom environment maps that the inference code uses when the model is deployed for predictions.

**Example endpoint configuration**  
The following example configures an endpoint with the `create_endpoint_config` method. Amazon SageMaker AI uses this configuration to deploy models. In the configuration, you identify one or more models, created with the `create_model` method, to deploy and the resources that you want Amazon SageMaker AI to provision.  

```
endpoint_config_response = sagemaker_client.create_endpoint_config(
    EndpointConfigName = "endpoint-config-name", 
    # List of ProductionVariant objects, one for each model that you want to host at this endpoint:
    ProductionVariants = [
        {
            "VariantName": "variant-name", # The name of the production variant.
            "ModelName": model_name, 
            "InstanceType": "ml.p4d.24xlarge",
            "InitialInstanceCount": 1 # Number of instances to launch initially.
        }
    ]
)
```
This example specifies the following keys for the `ProductionVariants` field:  
+ `VariantName`: The name of the production variant.
+ `ModelName`: The name of the model that you want to host. This is the name that you specified when creating the model.
+ `InstanceType`: The compute instance type. See the `InstanceType` field in [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html) and [SageMaker AI Pricing](https://aws.amazon.com/sagemaker/pricing/) for a list of supported compute instance types and pricing for each instance type.

------

### Deploy


The following examples deploy a model to an endpoint.

------
#### [ SageMaker Python SDK ]

The following example deploys the model to a real-time, HTTPS endpoint with the `deploy` method of the model object. If you specify a value for the `resources` argument for both model creation and deployment, the resources you specify for deployment take precedence.

```
predictor = model.deploy(
    initial_instance_count = 1,
    instance_type = "ml.p4d.24xlarge", 
    endpoint_type = EndpointType.INFERENCE_COMPONENT_BASED,
    resources = resources,
)
```

For the `instance_type` field, the example specifies the name of the Amazon EC2 instance type for the model. For the `initial_instance_count` field, it specifies the initial number of instances to run the endpoint on.

The following code sample demonstrates another case where you deploy a model to an endpoint and then deploy another model to the same endpoint. In this case, you must supply the same endpoint name to the `deploy` methods of both models.

```
# Deploy the model to inference-component-based endpoint
falcon_predictor = falcon_model.deploy(
    initial_instance_count = 1,
    instance_type = "ml.p4d.24xlarge", 
    endpoint_type = EndpointType.INFERENCE_COMPONENT_BASED,
    endpoint_name = "<endpoint_name>",
    resources = resources,
)

# Deploy another model to the same inference-component-based endpoint
llama2_predictor = llama2_model.deploy( # resources already set inside llama2_model
    endpoint_type = EndpointType.INFERENCE_COMPONENT_BASED,
    endpoint_name = "<endpoint_name>"  # same endpoint name as for falcon model
)
```

------
#### [ boto3 inference components ]

Once you have an endpoint configuration, use the [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_endpoint.html) method to create your endpoint. The endpoint name must be unique within an AWS Region in your AWS account. 

The following example creates an endpoint using the endpoint configuration specified in the request. Amazon SageMaker AI uses the endpoint to provision resources.

```
sagemaker_client.create_endpoint(
    EndpointName = endpoint_name,
    EndpointConfigName = endpoint_config_name,
)
```

After you've created an endpoint, you can deploy one or more models to it by creating inference components. The following example creates one with the `create_inference_component` method.

```
sagemaker_client.create_inference_component(
    InferenceComponentName = inference_component_name,
    EndpointName = endpoint_name,
    VariantName = variant_name,
    Specification = {
        "Container": {
            "Image": "image-uri",
            "ArtifactUrl": model_url,
        },
        "ComputeResourceRequirements": {
            "NumberOfCpuCoresRequired": 1, 
            "MinMemoryRequiredInMb": 1024
        }
    },
    RuntimeConfig = {"CopyCount": 2}
)
```

------
#### [ boto3 models (without inference components) ]

**Example deployment**  

Provide the endpoint configuration to SageMaker AI. The service launches the ML compute instances and deploys the model or models as specified in the configuration.

Once you have your model and endpoint configuration, use the [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/create_endpoint.html) method to create your endpoint. The endpoint name must be unique within an AWS Region in your AWS account. 

The following example creates an endpoint using the endpoint configuration specified in the request. Amazon SageMaker AI uses the endpoint to provision resources and deploy models.

```
create_endpoint_response = sagemaker_client.create_endpoint(
    # The endpoint name must be unique within an AWS Region in your AWS account:
    EndpointName = "endpoint-name",
    # The name of the endpoint configuration associated with this endpoint:
    EndpointConfigName = "endpoint-config-name")
```

------

## Deploy models with the AWS CLI
Deploy with the AWS CLI

You can deploy a model to an endpoint by using the AWS CLI.

### Overview


When you deploy a model with the AWS CLI, you can deploy it with or without using an inference component. The following sections summarize the commands that you run for both approaches. These commands are demonstrated by the examples that follow.

------
#### [ With inference components ]

To deploy a model with an inference component, do the following:

1. (Optional) Create a model with the [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-model.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-model.html) command.

1. Specify the settings for your endpoint by creating an endpoint configuration. To create one, you run the [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint-config.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint-config.html) command.

1. Create your endpoint by using the [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint.html) command. In the command body, specify the endpoint configuration that you created.

1. Create an inference component by using the `create-inference-component` command. In the settings, you specify a model by doing either of the following:
   + Specifying a SageMaker AI model object
   + Specifying the model image URI and S3 URL

   You also allocate endpoint resources to the model. By creating the inference component, you deploy the model to the endpoint. You can deploy multiple models to an endpoint by creating multiple inference components — one for each model.
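   The examples in the following sections specify the model by its container image URI and S3 URL. As a sketch of the first option, the inference component's specification file can instead reference an existing SageMaker AI model object by name (the model name and resource values here are placeholders):

   ```json
   {
       "ModelName": "model-name",
       "ComputeResourceRequirements": {
           "NumberOfCpuCoresRequired": 1,
           "MinMemoryRequiredInMb": 1024
       }
   }
   ```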

------
#### [ Without inference components ]

To deploy a model without using an inference component, do the following:

1. Create a SageMaker AI model by using the [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-model.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-model.html) command.

1. Specify the settings for your endpoint by creating an endpoint configuration object. To create one, you use the [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint-config.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint-config.html) command. In the endpoint configuration, you assign the model object to a production variant.

1. Create your endpoint by using the [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint.html) command. In your command body, specify the endpoint configuration that you created.

   When you create the endpoint, SageMaker AI provisions the endpoint resources, and it deploys the model to the endpoint.

------

### Configure


The following examples configure the resources that you require to deploy a model to an endpoint.

------
#### [ With inference components ]

**Example create-endpoint-config command**  
The following example creates an endpoint configuration with the [create-endpoint-config](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint-config.html) command.  

```
aws sagemaker create-endpoint-config \
--endpoint-config-name endpoint-config-name \
--execution-role-arn arn:aws:iam::111122223333:role/service-role/role-name \
--production-variants file://production-variants.json
```
In this example, the file `production-variants.json` defines a production variant with the following JSON:  

```
[
    {
        "VariantName": "variant-name",
        "ModelName": "model-name",
        "InstanceType": "ml.p4d.24xlarge",
        "InitialInstanceCount": 1
    }
]
```
If the command succeeds, the AWS CLI responds with the ARN for the resource you created.  

```
{
    "EndpointConfigArn": "arn:aws:sagemaker:us-west-2:111122223333:endpoint-config/endpoint-config-name"
}
```

------
#### [ Without inference components ]

**Example create-model command**  
The following example creates a model with the [create-model](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-model.html) command.  

```
aws sagemaker create-model \
--model-name model-name \
--execution-role-arn arn:aws:iam::111122223333:role/service-role/role-name \
--primary-container "{ \"Image\": \"image-uri\", \"ModelDataUrl\": \"model-s3-url\"}"
```
If the command succeeds, the AWS CLI responds with the ARN for the resource you created.  

```
{
    "ModelArn": "arn:aws:sagemaker:us-west-2:111122223333:model/model-name"
}
```

**Example create-endpoint-config command**  
The following example creates an endpoint configuration with the [create-endpoint-config](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint-config.html) command.  

```
aws sagemaker create-endpoint-config \
--endpoint-config-name endpoint-config-name \
--production-variants file://production-variants.json
```
In this example, the file `production-variants.json` defines a production variant with the following JSON:  

```
[
    {
        "VariantName": "variant-name",
        "ModelName": "model-name",
        "InstanceType": "ml.p4d.24xlarge",
        "InitialInstanceCount": 1
    }
]
```
If the command succeeds, the AWS CLI responds with the ARN for the resource you created.  

```
{
    "EndpointConfigArn": "arn:aws:sagemaker:us-west-2:111122223333:endpoint-config/endpoint-config-name"
}
```

------

### Deploy


The following examples deploy a model to an endpoint.

------
#### [ With inference components ]

**Example create-endpoint command**  
The following example creates an endpoint with the [create-endpoint](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint.html) command.  

```
aws sagemaker create-endpoint \
--endpoint-name endpoint-name \
--endpoint-config-name endpoint-config-name
```
If the command succeeds, the AWS CLI responds with the ARN for the resource you created.  

```
{
    "EndpointArn": "arn:aws:sagemaker:us-west-2:111122223333:endpoint/endpoint-name"
}
```

**Example create-inference-component command**  
The following example creates an inference component with the create-inference-component command.  

```
aws sagemaker create-inference-component \
--inference-component-name inference-component-name \
--endpoint-name endpoint-name \
--variant-name variant-name \
--specification file://specification.json \
--runtime-config "{\"CopyCount\": 2}"
```
In this example, the file `specification.json` defines the container and compute resources with the following JSON:  

```
{
    "Container": {
        "Image": "image-uri",
        "ArtifactUrl": "model-s3-url"
    },
    "ComputeResourceRequirements": {
        "NumberOfCpuCoresRequired": 1,
        "MinMemoryRequiredInMb": 1024
    }
}
```
If the command succeeds, the AWS CLI responds with the ARN for the resource you created.  

```
{
    "InferenceComponentArn": "arn:aws:sagemaker:us-west-2:111122223333:inference-component/inference-component-name"
}
```

------
#### [ Without inference components ]

**Example create-endpoint command**  
The following example creates an endpoint with the [create-endpoint](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/create-endpoint.html) command.  

```
aws sagemaker create-endpoint \
--endpoint-name endpoint-name \
--endpoint-config-name endpoint-config-name
```
If the command succeeds, the AWS CLI responds with the ARN for the resource you created.  

```
{
    "EndpointArn": "arn:aws:sagemaker:us-west-2:111122223333:endpoint/endpoint-name"
}
```

------

# Invoke models for real-time inference
Invoke models

After you use Amazon SageMaker AI to deploy a model to an endpoint, you can interact with the model by sending inference requests to it. To send an inference request to a model, you invoke the endpoint that hosts it. You can invoke your endpoints using Amazon SageMaker Studio, the AWS SDKs, or the AWS CLI.

## Invoke Your Model Using Amazon SageMaker Studio
Invoke Using Studio

After you deploy your model to an endpoint, you can view the endpoint through Amazon SageMaker Studio and test your endpoint by sending single inference requests.

**Note**  
SageMaker AI only supports endpoint testing in Studio for real-time endpoints.

**To send a test inference request to your endpoint**

1. Launch Amazon SageMaker Studio.

1. In the navigation pane on the left, choose **Deployments**.

1. From the dropdown, choose **Endpoints**.

1. Find your endpoint by name, and choose the name in the table. The endpoint names listed in the **Endpoints** panel are defined when you deploy a model. The Studio workspace opens the **Endpoint** page in a new tab.

1. Choose the **Test inference** tab.

1. For **Testing Options**, select one of the following:

   1. Select **Test the sample request** to immediately send a request to your endpoint. Use the **JSON editor** to provide sample data in JSON format, and choose **Send Request** to submit the request to your endpoint. After submitting your request, Studio shows the inference output in a card to the right of the JSON editor.

   1. Select **Use Python SDK example code** to view the code for sending a request to the endpoint. Then, copy the code example from the **Example inference request** section and run the code from your testing environment.

The top of the card shows the type of request that was sent to the endpoint (only JSON is accepted). The card shows the following fields:
+ **Status** – displays one of the following status types:
  + `Success` – The request succeeded.
  + `Failed` – The request failed. A response appears under **Failure Reason**.
  + `Pending` – While the inference request is pending, the status shows a spinning, circular icon.
+ **Execution Length** – How long the invocation took (end time minus the start time) in milliseconds.
+ **Request Time** – How many minutes have passed since the request was sent.
+ **Result Time** – How many minutes have passed since the result was returned.

## Invoke Your Model by Using the AWS SDK for Python (Boto3)
Invoke Using the SDK for Python (Boto3)

If you want to invoke a model endpoint in your application code, you can use one of the AWS SDKs, including the AWS SDK for Python (Boto3). To invoke your endpoint with this SDK, you use one of the following Python methods:
+ `invoke_endpoint` – Sends an inference request to a model endpoint and returns the response that the model generates. 

  This method returns the inference payload as one response after the model finishes generating it. For more information, see [invoke_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime/client/invoke_endpoint.html) in the *AWS SDK for Python (Boto3) API Reference*.
+ `invoke_endpoint_with_response_stream` – Sends an inference request to a model endpoint and streams the response incrementally while the model generates it. 

  With this method, your application receives parts of the response as soon as the parts become available. For more information, see [invoke_endpoint_with_response_stream](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime/client/invoke_endpoint_with_response_stream.html) in the *AWS SDK for Python (Boto3) API Reference*.

  Use this method only to invoke models that support inference streaming.

Before you can use these methods in your application code, you must initialize a SageMaker AI Runtime client and specify the name of your endpoint. The following example sets up the client and endpoint name for the examples that follow:

```
import boto3

sagemaker_runtime = boto3.client(
    "sagemaker-runtime", region_name='aws_region')

endpoint_name='endpoint-name'
```

### Invoke to Get an Inference Response


The following example uses the `invoke_endpoint` method to invoke an endpoint with the AWS SDK for Python (Boto3):

```
# Gets inference from the model hosted at the specified endpoint:
response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name, 
    Body=bytes('{"features": ["This is great!"]}', 'utf-8')
    )

# Decodes and prints the response body:
print(response['Body'].read().decode('utf-8'))
```

This example provides input data in the `Body` field for SageMaker AI to pass to the model. This data must be in the same format that was used for training. The example assigns the response to the `response` variable.

The `response` variable provides access to the HTTP status, the name of the deployed model, and other fields. The following snippet prints the HTTP status code:

```
print(response["ResponseMetadata"]["HTTPStatusCode"])
```

### Invoke to Stream an Inference Response


If you deployed a model that supports inference streaming, you can invoke the model to receive its inference payload as a stream of parts. The model delivers these parts incrementally as the model generates them. When an application receives an inference stream, the application doesn't need to wait for the model to generate the whole response payload. Instead, the application immediately receives parts of the response as they become available. 

By consuming an inference stream in your application, you can create interactions where your users perceive the inference to be fast because they get the first part immediately. You can implement streaming to support fast interactive experiences, such as chatbots, virtual assistants, and music generators. For example, you could create a chatbot that incrementally shows the text generated by a large language model (LLM).

To get an inference stream, you can use the `invoke_endpoint_with_response_stream` method. In the response body, the SDK provides an `EventStream` object, which gives the inference as a series of `PayloadPart` objects.

**Example Inference Stream**  
The following example is a stream of `PayloadPart` objects:  

```
{'PayloadPart': {'Bytes': b'{"outputs": [" a"]}\n'}}
{'PayloadPart': {'Bytes': b'{"outputs": [" challenging"]}\n'}}
{'PayloadPart': {'Bytes': b'{"outputs": [" problem"]}\n'}}
. . .
```
In each payload part, the `Bytes` field provides a portion of the inference response from the model. This portion can be any content type that a model generates, such as text, image, or audio data. In this example, the portions are JSON objects that contain generated text from an LLM.  
Usually, the payload part contains a discrete chunk of data from the model. In this example, the discrete chunks are whole JSON objects. Occasionally, the streaming response splits the chunks over multiple payload parts, or it combines multiple chunks into one payload part. The following example shows a chunk of data in JSON format that's split over two payload parts:  

```
{'PayloadPart': {'Bytes': b'{"outputs": '}}
{'PayloadPart': {'Bytes': b'[" problem"]}\n'}}
```
When you write application code that processes an inference stream, include logic that handles these occasional splits and combinations of data. As one strategy, you could write code that concatenates the contents of `Bytes` while your application receives the payload parts. By concatenating the example JSON data here, you would combine the data into a newline-delimited JSON body. Then, your code could process the stream by parsing the whole JSON object on each line.  
The following example shows the newline-delimited JSON that you would create when you concatenate the example contents of `Bytes`:  

```
{"outputs": [" a"]}
{"outputs": [" challenging"]}
{"outputs": [" problem"]}
. . .
```
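As a minimal, self-contained sketch of that concatenation strategy (no endpoint required; the simulated payload parts below are assumptions for illustration), the following code buffers the `Bytes` contents and parses each complete line as JSON:

```python
import io
import json

# Simulated PayloadPart contents, including one JSON chunk
# that's split across two parts:
parts = [
    b'{"outputs": [" a"]}\n',
    b'{"outputs": ',
    b'[" challenging"]}\n',
]

# Concatenate the contents of Bytes into one buffer:
buff = io.BytesIO()
for part in parts:
    buff.write(part)

# Parse the newline-delimited JSON body line by line:
buff.seek(0)
pieces = []
for line in buff.read().decode("utf-8").splitlines():
    pieces.append(json.loads(line)["outputs"][0])

print("".join(pieces))  # prints " a challenging"
```

Because the split chunk is only complete after the third part arrives, parsing line by line from the buffer recovers both JSON objects intact.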

**Example Code to Process an Inference Stream**  

The following example Python class, `SmrInferenceStream`, demonstrates how you can process an inference stream that sends text data in JSON format:

```
import io
import json

# Example class that processes an inference stream:
class SmrInferenceStream:
    
    def __init__(self, sagemaker_runtime, endpoint_name):
        self.sagemaker_runtime = sagemaker_runtime
        self.endpoint_name = endpoint_name
        # A buffered I/O stream to combine the payload parts:
        self.buff = io.BytesIO() 
        self.read_pos = 0
        
    def stream_inference(self, request_body):
        # Gets a streaming inference response 
        # from the specified model endpoint:
        response = self.sagemaker_runtime\
            .invoke_endpoint_with_response_stream(
                EndpointName=self.endpoint_name, 
                Body=json.dumps(request_body), 
                ContentType="application/json"
        )
        # Gets the EventStream object returned by the SDK:
        event_stream = response['Body']
        for event in event_stream:
            # Passes the contents of each payload part
            # to be concatenated:
            self._write(event['PayloadPart']['Bytes'])
            # Iterates over lines to parse whole JSON objects:
            for line in self._readlines():
                resp = json.loads(line)
                part = resp.get("outputs")[0]
                # Returns parts incrementally:
                yield part
    
    # Writes to the buffer to concatenate the contents of the parts:
    def _write(self, content):
        self.buff.seek(0, io.SEEK_END)
        self.buff.write(content)

    # The JSON objects in buffer end with '\n'.
    # This method reads lines to yield a series of JSON objects:
    def _readlines(self):
        self.buff.seek(self.read_pos)
        for line in self.buff.readlines():
            self.read_pos += len(line)
            yield line[:-1]
```

This example processes the inference stream by doing the following:
+ Initializes a SageMaker AI Runtime client and sets the name of a model endpoint. Before you can get an inference stream, the model that the endpoint hosts must support inference streaming.
+ In the example `stream_inference` method, receives a request body and passes it to the `invoke_endpoint_with_response_stream` method of the SDK.
+ Iterates over each event in the `EventStream` object that the SDK returns.
+ From each event, gets the contents of the `Bytes` object in the `PayloadPart` object.
+ In the example `_write` method, writes to a buffer to concatenate the contents of the `Bytes` objects. The combined contents form a newline-delimited JSON body.
+ Uses the example `_readlines` method to get an iterable series of JSON objects.
+ In each JSON object, gets a piece of the inference.
+ With the `yield` expression, returns the pieces incrementally.

The following example creates and uses a `SmrInferenceStream` object:

```
request_body = {"inputs": ["Large model inference is"],
                "parameters": {"max_new_tokens": 100,
                               "enable_sampling": "true"}}
smr_inference_stream = SmrInferenceStream(
    sagemaker_runtime, endpoint_name)
stream = smr_inference_stream.stream_inference(request_body)
for part in stream:
    print(part, end='')
```

This example passes a request body to the `stream_inference` method. It iterates over the response to print each piece that the inference stream returns.

The example assumes that the model at the specified endpoint is an LLM that generates text. The output from this example is a body of generated text that prints incrementally:

```
a challenging problem in machine learning. The goal is to . . .
```

## Invoke Your Model by Using the AWS CLI
Invoke Using the AWS CLI

You can invoke your model endpoint by running commands with the AWS Command Line Interface (AWS CLI). The AWS CLI supports standard inference requests with the `invoke-endpoint` command, and it supports asynchronous inference requests with the `invoke-endpoint-async` command.

**Note**  
The AWS CLI doesn't support streaming inference requests.

The following example uses the `invoke-endpoint` command to send an inference request to a model endpoint:

```
aws sagemaker-runtime invoke-endpoint \
    --endpoint-name endpoint_name \
    --body fileb://$file_name \
    output_file.txt
```

For the `--endpoint-name` parameter, provide the endpoint name that you specified when you created the endpoint. For the `--body` parameter, provide input data for SageMaker AI to pass to the model. The data must be in the same format that was used for training. This example shows how to send binary data to your endpoint.

For more information on when to use `file://` over `fileb://` when passing the contents of a file to a parameter of the AWS CLI, see [Best Practices for Local File Parameters](https://aws.amazon.com/blogs/developer/best-practices-for-local-file-parameters/).

For more information, and to see additional parameters that you can pass, see [https://docs.aws.amazon.com/cli/latest/reference/sagemaker-runtime/invoke-endpoint.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker-runtime/invoke-endpoint.html) in the *AWS CLI Command Reference*.

If the `invoke-endpoint` command succeeds, it returns a response such as the following:

```
{
    "ContentType": "<content_type>; charset=utf-8",
    "InvokedProductionVariant": "<Variant>"
}
```

If the command doesn't succeed, check whether the input payload is in the correct format.
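For example, if your model expects JSON, a quick local check (a sketch; the payload here is a placeholder) can confirm that the request body parses cleanly before you send it:

```python
import json

payload = '{"features": ["This is great!"]}'  # placeholder request body

try:
    json.loads(payload)
    print("payload is valid JSON")
except json.JSONDecodeError as err:
    print(f"invalid payload: {err}")
```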

View the output of the invocation by checking the output file (`output_file.txt` in this example).

```
more output_file.txt
```

## Invoke Your Model by Using the AWS SDK for Python
Invoke Using the AWS SDK for Python

### Invoke to Bidirectionally Stream an Inference Request and Response


If you want to invoke a model endpoint that supports bidirectional streaming in your application code, you can use the [new experimental SDK for Python](https://github.com/awslabs/aws-sdk-python), which supports bidirectional streaming with HTTP/2. This SDK enables real-time, two-way communication between your client application and the SageMaker endpoint, allowing you to send inference requests incrementally while simultaneously receiving streaming responses as the model generates them. This is particularly useful for interactive applications where both the client and server need to exchange data continuously over a persistent connection.

**Note**  
The new experimental SDK is different from the standard Boto3 SDK and supports persistent bidirectional connections for data exchange. If you use the experimental Python SDK for any non-experimental use case, we strongly advise pinning to a specific version of the SDK.

To invoke your endpoint with bidirectional streaming, use the `invoke_endpoint_with_bidirectional_stream` method. This method establishes a persistent connection that allows you to stream multiple payload chunks to your model while receiving responses in real-time as the model processes data. The connection remains open until you explicitly close the input stream or the endpoint closes the connection, supporting up to 30 minutes of connection time.

### Prerequisites


Before you can use bidirectional streaming in your application code, you must:

1. Install the experimental SageMaker Runtime HTTP/2 SDK

1. Set up AWS credentials for your SageMaker Runtime client

1. Deploy a model that supports bidirectional streaming to a SageMaker endpoint

### Set up the bidirectional streaming client


The following example shows how to initialize the required components for bidirectional streaming:

```
from sagemaker_runtime_http2.client import SageMakerRuntimeHTTP2Client
from sagemaker_runtime_http2.config import Config, HTTPAuthSchemeResolver
from smithy_aws_core.identity import EnvironmentCredentialsResolver
from smithy_aws_core.auth.sigv4 import SigV4AuthScheme

# Configuration
AWS_REGION = "us-west-2"
BIDI_ENDPOINT = f"https://runtime.sagemaker.{AWS_REGION}.amazonaws.com:8443"
ENDPOINT_NAME = "your-endpoint-name"

# Initialize the client configuration
config = Config(
    endpoint_uri=BIDI_ENDPOINT,
    region=AWS_REGION,
    aws_credentials_identity_resolver=EnvironmentCredentialsResolver(),
    auth_scheme_resolver=HTTPAuthSchemeResolver(),
    auth_schemes={"aws.auth#sigv4": SigV4AuthScheme(service="sagemaker")}
)

# Create the SageMaker Runtime HTTP/2 client
client = SageMakerRuntimeHTTP2Client(config=config)
```

### Complete bidirectional streaming client


The following example demonstrates how to create a bidirectional streaming client that sends multiple text payloads to a SageMaker endpoint and processes responses in real-time:

```
import asyncio
import logging
from sagemaker_runtime_http2.client import SageMakerRuntimeHTTP2Client
from sagemaker_runtime_http2.config import Config, HTTPAuthSchemeResolver
from sagemaker_runtime_http2.models import (
    InvokeEndpointWithBidirectionalStreamInput, 
    RequestStreamEventPayloadPart, 
    RequestPayloadPart
)
from smithy_aws_core.identity import EnvironmentCredentialsResolver
from smithy_aws_core.auth.sigv4 import SigV4AuthScheme

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class SageMakerBidirectionalClient:
    
    def __init__(self, endpoint_name, region="us-west-2"):
        self.endpoint_name = endpoint_name
        self.region = region
        self.client = None
        self.stream = None
        self.response_task = None
        self.is_active = False
        
    def _initialize_client(self):
        bidi_endpoint = f"runtime.sagemaker.{self.region}.amazonaws.com:8443"
        config = Config(
            endpoint_uri=bidi_endpoint,
            region=self.region,
            aws_credentials_identity_resolver=EnvironmentCredentialsResolver(),
            auth_scheme_resolver=HTTPAuthSchemeResolver(),
            auth_schemes={"aws.auth#sigv4": SigV4AuthScheme(service="sagemaker")}
        )
        self.client = SageMakerRuntimeHTTP2Client(config=config)
    
    async def start_session(self):
        """Establish a bidirectional streaming connection with the endpoint."""
        if not self.client:
            self._initialize_client()
            
        logger.info(f"Starting session with endpoint: {self.endpoint_name}")
        self.stream = await self.client.invoke_endpoint_with_bidirectional_stream(
            InvokeEndpointWithBidirectionalStreamInput(endpoint_name=self.endpoint_name)
        )
        self.is_active = True
        
        # Start processing responses concurrently
        self.response_task = asyncio.create_task(self._process_responses())
    
    async def send_message(self, message):
        """Send a single message to the endpoint."""
        if not self.is_active:
            raise RuntimeError("Session not active. Call start_session() first.")
            
        logger.info(f"Sending message: {message}")
        payload = RequestPayloadPart(bytes_=message.encode('utf-8'))
        event = RequestStreamEventPayloadPart(value=payload)
        await self.stream.input_stream.send(event)
    
    async def send_multiple_messages(self, messages, delay=1.0):
        """Send multiple messages with a delay between each."""
        for message in messages:
            await self.send_message(message)
            await asyncio.sleep(delay)
    
    async def end_session(self):
        """Close the bidirectional streaming connection."""
        if not self.is_active:
            return
            
        await self.stream.input_stream.close()
        self.is_active = False
        logger.info("Stream closed")
        
        # Cancel the response processing task
        if self.response_task and not self.response_task.done():
            self.response_task.cancel()
    
    async def _process_responses(self):
        """Process incoming responses from the endpoint."""
        try:
            output = await self.stream.await_output()
            output_stream = output[1]
            
            while self.is_active:
                result = await output_stream.receive()
                
                if result is None:
                    logger.info("No more responses")
                    break
                
                if result.value and result.value.bytes_:
                    response_data = result.value.bytes_.decode('utf-8')
                    logger.info(f"Received: {response_data}")
                    
        except Exception as e:
            logger.error(f"Error processing responses: {e}")

# Example usage
async def run_bidirectional_client():
    client = SageMakerBidirectionalClient(endpoint_name="your-endpoint-name")
    
    try:
        # Start the session
        await client.start_session()
        
        # Send multiple messages
        messages = [
            "I need help with", 
            "my account balance", 
            "I can help with that", 
            "and recent charges"
        ]
        await client.send_multiple_messages(messages)
        
        # Wait for responses to be processed
        await asyncio.sleep(2)
        
        # End the session
        await client.end_session()
        logger.info("Session ended successfully")
        
    except Exception as e:
        logger.error(f"Client error: {e}")
        await client.end_session()

if __name__ == "__main__":
    asyncio.run(run_bidirectional_client())
```

The client initializes the SageMaker Runtime HTTP/2 client with the regional endpoint URI on port 8443, which is required for bidirectional streaming connections. The `start_session()` method calls `invoke_endpoint_with_bidirectional_stream()` to establish the persistent connection and creates an asynchronous task to process incoming responses concurrently.

The `send_message()` method wraps payload data in the appropriate request objects and sends it through the input stream, while the `_process_responses()` method continuously listens for and processes responses from the endpoint as they arrive. This bidirectional approach enables real-time interaction where sending requests and receiving responses happen simultaneously over the same connection.

# Endpoints


After deploying your model to an endpoint, you might want to view and manage the endpoint. With SageMaker AI, you can view the status and details of your endpoint, check metrics and logs to monitor your endpoint’s performance, update the models deployed to your endpoint, and more.

The following sections describe how to interactively view and manage your endpoints using Amazon SageMaker Studio or the Amazon SageMaker AI console.

**Topics**
+ [View endpoint details in SageMaker Studio](manage-endpoints-studio.md)
+ [View endpoint details in the SageMaker AI console](manage-endpoints-console.md)

# View endpoint details in SageMaker Studio


In Amazon SageMaker Studio, you can view and manage your SageMaker AI Hosting endpoints. To learn more about Studio, see [Amazon SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html).

To find the list of your endpoints in SageMaker Studio do the following:

1. Open the Studio application.

1. In the left navigation pane, choose **Deployments**.

1. From the dropdown menu, choose **Endpoints**.

The **Endpoints** page opens, which lists all of your SageMaker AI Hosting endpoints. From this page, you can see the endpoints and their **Status**. You can also create a new endpoint, edit an existing endpoint, or delete an endpoint.

To see the details for a specific endpoint, choose an endpoint from the list. On the endpoint’s details page, you get an overview like the following screenshot.

![\[Screenshot of an endpoint's main page showing a summary of the endpoint details in Studio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-endpoint-details-page.png)


Each endpoint details page contains the following tabs of information:

# View Variants (or Models)


The **Variants** tab (also called the **Models** tab if your endpoint has multiple models deployed) shows you the list of [model variants](https://docs.aws.amazon.com/sagemaker/latest/dg/model-ab-testing.html) or models currently deployed to your endpoint. The following screenshot shows you what the overview and **Models** section looks like for an endpoint with multiple models deployed.

![\[Screenshot of an endpoint's main page showing multiple models deployed.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-goldfinch-multi-model-endpoint.png)


You can add or edit the settings for each variant or model. You can also select a variant and enable a default auto-scaling policy, which you can edit later in the **Auto-scaling** tab.

# View settings


On the **Settings** tab, you can view the endpoint’s associated AWS IAM role, the AWS KMS key used for encryption (if applicable), the name of your VPC, and the network isolation settings.

# Test inference


On the **Test inference** tab, you can send a test inference request to a deployed model. This is useful if you’d like to verify that your endpoint responds to requests as expected.

To test inference, do the following:

1. On the model's **Test inference** tab, choose one of the following options:

   1. Select **Enter the request body** if you’d like to test the endpoint and receive a response through the Studio interface.

   1. Select **Copy example code (Python)** if you’d like to copy an AWS SDK for Python (Boto3) example that you can use to invoke your endpoint from a local environment and receive a response programmatically.

1. For **Model**, select the model that you want to test on the endpoint.

1. If you chose the Studio interface testing method, then you can also choose your desired **Content type** for the response from the dropdown.

After configuring your request, then you can either choose **Send request** (to receive a response through the Studio interface) or **Copy** to copy the Python example.

If you receive a response through the Studio interface, it’ll look like the following screenshot.

![\[Screenshot of a successful inference test request on an endpoint in Studio.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/endpoint-test-inference.png)


# Auto-scaling


On the **Auto-scaling** tab, you can view any auto-scaling policies configured for the models hosted on your endpoint. The following screenshot shows you the **Auto-scaling** tab.

![\[Screenshot of the Auto-scaling tab, showing one active policy.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/inference/studio-endpoint-autoscaling.png)


You can choose **Edit auto-scaling** to change any of the policies and turn on or turn off the default auto-scaling policy.

To learn more about auto-scaling for real-time endpoints, see [Automatically Scale Amazon SageMaker AI Models](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html). If you’re not sure how to configure an auto-scaling policy for your endpoint, you can use an [Inference Recommender autoscaling recommendations job](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-recommender-autoscaling.html) to get recommendations for an auto-scaling policy.
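If you prefer to configure scaling programmatically rather than through the **Auto-scaling** tab, you can set up an equivalent policy through the Application Auto Scaling API. The following is a minimal sketch with the AWS SDK for Python (Boto3); the endpoint name, variant name, capacity bounds, and target value are placeholder assumptions that you would replace with your own.

```
# Sketch: attach a target-tracking scaling policy to an endpoint variant
# through the Application Auto Scaling API. All names are placeholders.

def build_scaling_requests(endpoint_name, variant_name,
                           min_capacity=1, max_capacity=4,
                           invocations_per_instance=70.0):
    """Build RegisterScalableTarget and PutScalingPolicy parameters as
    plain dicts so that they can be inspected before calling AWS."""
    resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"
    target = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }
    policy = {
        "PolicyName": f"{variant_name}-invocations-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy

def apply_scaling(endpoint_name, variant_name):
    """Apply the policy (requires AWS credentials and permissions)."""
    import boto3
    client = boto3.client("application-autoscaling")
    target, policy = build_scaling_requests(endpoint_name, variant_name)
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)
```

The target value here tracks `SageMakerVariantInvocationsPerInstance`, so the variant scales out when average invocations per instance exceed the configured value.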

# View endpoint details in the SageMaker AI console


To view your endpoints in the SageMaker AI console, do the following:

1. Go to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. In the left navigation pane, choose **Inference**.

1. From the dropdown list, choose **Endpoints**.

1. On the **Endpoints** page, choose your endpoint.

The endpoint details page should open, showing you a summary of your endpoint and metrics that have been collected for your endpoint.

The following sections describe the tabs on the endpoints details page.

# Endpoints monitoring


After creating a SageMaker AI Hosting endpoint, you can monitor your endpoint using Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics. Using these metrics, you can access historical information and gain a better perspective on how your endpoint is performing. For more information, see the *[Amazon CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/)*.

From the **Monitoring** tab on the endpoint details page, you can view CloudWatch metrics data that has been collected from your endpoint.

The **Monitoring** tab includes the following sections:
+ **Operational metrics**: View metrics that track the utilization of your endpoint’s resources, such as CPU Utilization and Memory Utilization.
+ **Invocation metrics**: View metrics that track the number, health, and status of `InvokeEndpoint` requests coming to your endpoint, such as Invocation Model Errors and Model Latency.
+ **Health metrics**: View metrics that track your endpoint’s overall health, such as Invocation Failures and Notification Failures.

For detailed descriptions of each metric, see [Monitor SageMaker AI with CloudWatch](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html).
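The metrics shown on the **Monitoring** tab can also be retrieved programmatically from CloudWatch. The following is a minimal sketch using the AWS SDK for Python (Boto3); the endpoint and variant names are placeholders, and the one-hour window and five-minute period are assumed defaults.

```
# Sketch: query CloudWatch for a SageMaker endpoint metric, such as
# ModelLatency or Invocations. Endpoint/variant names are placeholders.
from datetime import datetime, timedelta, timezone

def build_metric_query(endpoint_name, variant_name, metric_name,
                       stat="Average", period=300):
    """Build GetMetricStatistics parameters for the last hour of data."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": now - timedelta(hours=1),
        "EndTime": now,
        "Period": period,   # seconds per data point
        "Statistics": [stat],
    }

def fetch_metric(endpoint_name, variant_name, metric_name):
    """Run the query (requires AWS credentials)."""
    import boto3
    cw = boto3.client("cloudwatch")
    params = build_metric_query(endpoint_name, variant_name, metric_name)
    return cw.get_metric_statistics(**params)["Datapoints"]
```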

The following screenshot shows the **Operational metrics** section for a serverless endpoint.

![\[Screenshot of metrics graphs in the operational metrics section of the endpoint details page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hosting-operational-metrics.png)




You can adjust the **Period** and **Statistic** that you want to track for the metrics in a given section, as well as the length of time for which you want to view metrics data. You can also add and remove metric widgets from the view for each section by choosing **Add widget**. In the **Add widget** dialog box, you can select and deselect the metrics that you want to see.

The metrics that are available may depend on your endpoint type. For example, serverless endpoints have some metrics that aren’t available for real-time endpoints. For more specific metrics information by endpoint type, see the following pages:
+ [Monitor a serverless endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints-monitoring.html)
+ [Monitor an asynchronous endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference-monitor.html)
+ [CloudWatch Metrics for Multi-Model Endpoint Deployments](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoint-cloudwatch-metrics.html)
+ [Inference Pipeline Logs and Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipeline-logs-metrics.html)

# Settings


You can choose the **Settings** tab to view additional information about your endpoint, such as the data capture settings, the endpoint configuration, and tags.

# Create and view alarms


From the **Alarms** tab on your endpoint details page, you can view and create simple static threshold metric alarms, where you specify a threshold value for a metric. If the metric breaches the threshold value, the alarm goes into the `ALARM` state. For more information about CloudWatch alarms, see [Using Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html).

In the **Endpoint summary** section, you can view the **Alarms** field, which tells you how many alarms are currently active on your endpoint.

To view which alarms are in the `ALARM` state, choose the **Alarms** tab. The **Alarms** tab shows you a full list of your endpoint alarms, along with details about their status and conditions. The following screenshot shows a list of alarms in this section that have been configured for an endpoint.

![\[Screenshot of the alarms tab on the endpoint details page which shows a list of CloudWatch alarms.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/hosting-alarms-tab.png)


An alarm’s status can be `In alarm`, `OK`, or `Insufficient data` if there isn’t enough metrics data being collected.

To create a new alarm for your endpoint, do the following:

1. In the **Alarms** tab, choose **Create alarm**.

1. The **Create alarm** page opens. For **Alarm name**, enter a name for the alarm.

1. (Optional) Enter a description for the alarm.

1. For **Metric**, choose the CloudWatch metric that you want the alarm to track.

1. For **Variant name**, choose the endpoint model variant that you want to monitor.

1. For **Statistic**, choose one of the available statistics for the metric you selected.

1. For **Period**, choose the time period to use for calculating each statistical value. For example, if you choose the Average statistic and a 5 minute period, each data point monitored by the alarm is the average of the metric’s data points at 5 minute intervals.

1. For **Evaluation periods**, enter the number of data points that you want the alarm to consider when evaluating whether to enter the alarm state.

1. For **Condition**, choose the condition that you want to use for your alarm threshold.

1. For **Threshold value**, enter the desired value for your threshold.

1. (Optional) For **Notification**, you can choose **Add notification** to create or specify an Amazon SNS topic that receives a notification when your alarm state changes.

1. Choose **Create alarm**.

After creating your alarm, you can return to the **Alarms** tab to view its status at any time. From this section, you can also select the alarm and either **Edit** or **Delete** it.
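The console steps above can also be expressed as a single CloudWatch `PutMetricAlarm` call. The following Boto3 sketch creates a static threshold alarm on `ModelLatency` (reported in microseconds); the alarm name, endpoint and variant names, threshold, and SNS topic are placeholder assumptions.

```
# Sketch: create a static threshold alarm on a SageMaker endpoint metric.
# All names and the threshold below are placeholders.

def build_alarm(endpoint_name, variant_name,
                threshold_micros=100_000, sns_topic_arn=None):
    """Alarm when average ModelLatency (microseconds) breaches the
    threshold for 3 consecutive 5-minute periods."""
    params = {
        "AlarmName": f"{endpoint_name}-model-latency",
        "AlarmDescription": "Average model latency is too high",
        "Namespace": "AWS/SageMaker",
        "MetricName": "ModelLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Average",
        "Period": 300,            # seconds per evaluation period
        "EvaluationPeriods": 3,   # data points to evaluate
        "Threshold": threshold_micros,
        "ComparisonOperator": "GreaterThanThreshold",
    }
    if sns_topic_arn:
        # Notify this SNS topic when the alarm state changes to ALARM
        params["AlarmActions"] = [sns_topic_arn]
    return params

def create_alarm(endpoint_name, variant_name):
    """Create the alarm (requires AWS credentials)."""
    import boto3
    boto3.client("cloudwatch").put_metric_alarm(
        **build_alarm(endpoint_name, variant_name))
```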

# Hosting options


The following topics describe the available SageMaker AI real-time hosting options, along with how to set up, invoke, and delete each hosting option.

**Topics**
+ [Single-model endpoints](realtime-single-model.md)
+ [Multi-model endpoints](multi-model-endpoints.md)
+ [Multi-container endpoints](multi-container-endpoints.md)
+ [Inference pipelines in Amazon SageMaker AI](inference-pipelines.md)
+ [Delete Endpoints and Resources](realtime-endpoints-delete-resources.md)

# Single-model endpoints


You can create, update, and delete real-time inference endpoints that host a single model with Amazon SageMaker Studio, the AWS SDK for Python (Boto3), the SageMaker Python SDK, or the AWS CLI. For procedures and code examples, see [Deploy models for real-time inference](realtime-endpoints-deploy-models.md).
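As a rough sketch of what the programmatic path involves, the following example builds the three AWS SDK for Python (Boto3) requests behind a single-model endpoint: `CreateModel`, `CreateEndpointConfig`, and `CreateEndpoint`. The container image URI, model artifact location, IAM role, and instance type are placeholders.

```
# Sketch: the three Boto3 requests behind a single-model endpoint.
# Image URI, model data URL, and role ARN below are placeholders.

def build_deployment(name, image_uri, model_data_url, role_arn,
                     instance_type="ml.m5.xlarge"):
    """Build CreateModel, CreateEndpointConfig, and CreateEndpoint
    parameters as plain dicts for inspection before calling AWS."""
    model = {
        "ModelName": name,
        "PrimaryContainer": {
            "Image": image_uri,
            "ModelDataUrl": model_data_url,  # e.g. s3://bucket/model.tar.gz
        },
        "ExecutionRoleArn": role_arn,
    }
    config = {
        "EndpointConfigName": f"{name}-config",
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InstanceType": instance_type,
            "InitialInstanceCount": 1,
        }],
    }
    endpoint = {"EndpointName": name, "EndpointConfigName": f"{name}-config"}
    return model, config, endpoint

def deploy(name, image_uri, model_data_url, role_arn):
    """Issue the three calls (requires AWS credentials). The endpoint
    reaches InService status after several minutes."""
    import boto3
    sm = boto3.client("sagemaker")
    model, config, endpoint = build_deployment(
        name, image_uri, model_data_url, role_arn)
    sm.create_model(**model)
    sm.create_endpoint_config(**config)
    sm.create_endpoint(**endpoint)
```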

# Multi-model endpoints


Multi-model endpoints provide a scalable and cost-effective solution for deploying large numbers of models. They use the same fleet of resources and a shared serving container to host all of your models. This reduces hosting costs by improving endpoint utilization compared with using single-model endpoints. It also reduces deployment overhead because Amazon SageMaker AI manages loading models in memory and scaling them based on the traffic patterns to your endpoint.

The following diagram shows how multi-model endpoints work compared to single-model endpoints.

![\[Diagram that shows how multi-model versus how single-model endpoints host models.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/multi-model-endpoints-diagram.png)


Multi-model endpoints are ideal for hosting a large number of models that use the same ML framework on a shared serving container. If you have a mix of frequently and infrequently accessed models, a multi-model endpoint can efficiently serve this traffic with fewer resources and higher cost savings. Your application should be tolerant of occasional cold start-related latency penalties that occur when invoking infrequently used models.

Multi-model endpoints support hosting both CPU and GPU backed models. By using GPU backed models, you can lower your model deployment costs through increased usage of the endpoint and its underlying accelerated compute instances.

Multi-model endpoints also enable time-sharing of memory resources across your models. This works best when the models are fairly similar in size and invocation latency. When this is the case, multi-model endpoints can effectively use instances across all models. If you have models that have significantly higher transactions per second (TPS) or latency requirements, we recommend hosting them on dedicated endpoints.

You can use multi-model endpoints with the following features:
+ [AWS PrivateLink](https://docs.aws.amazon.com/vpc/latest/userguide/endpoint-services-overview.html) and VPCs
+ [Auto scaling](multi-model-endpoints-autoscaling.md)
+ [Serial inference pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipelines.html) (but only one multi-model enabled container can be included in an inference pipeline)
+ A/B testing

You can use the AWS SDK for Python (Boto) or the SageMaker AI console to create a multi-model endpoint. For CPU backed multi-model endpoints, you can create your endpoint with custom-built containers by integrating the [Multi Model Server](https://github.com/awslabs/multi-model-server) library.
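As a minimal sketch, the key difference from a single-model `CreateModel` request is the container definition: `Mode` is set to `MultiModel`, and `ModelDataUrl` points to an S3 prefix that holds your model archives rather than a single artifact. The names and S3 locations below are placeholders.

```
# Sketch: a CreateModel request for a multi-model endpoint.
# Names, image URI, S3 prefix, and role ARN are placeholders.

def build_multi_model(name, image_uri, model_prefix_url, role_arn):
    """Container definition for a multi-model endpoint. ModelDataUrl is
    an S3 *prefix* (trailing slash) containing the model archives."""
    return {
        "ModelName": name,
        "PrimaryContainer": {
            "Image": image_uri,
            "Mode": "MultiModel",
            "ModelDataUrl": model_prefix_url,  # e.g. s3://bucket/models/
        },
        "ExecutionRoleArn": role_arn,
    }

def create_multi_model(name, image_uri, model_prefix_url, role_arn):
    """Create the model resource (requires AWS credentials). An endpoint
    config and endpoint are then created as for a single-model endpoint."""
    import boto3
    boto3.client("sagemaker").create_model(
        **build_multi_model(name, image_uri, model_prefix_url, role_arn))
```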

**Topics**
+ [How multi-model endpoints work](#how-multi-model-endpoints-work)
+ [Sample notebooks for multi-model endpoints](#multi-model-endpoint-sample-notebooks)
+ [Supported algorithms, frameworks, and instances for multi-model endpoints](multi-model-support.md)
+ [Instance recommendations for multi-model endpoint deployments](multi-model-endpoint-instance.md)
+ [Create a Multi-Model Endpoint](create-multi-model-endpoint.md)
+ [Invoke a Multi-Model Endpoint](invoke-multi-model-endpoint.md)
+ [Add or Remove Models](add-models-to-endpoint.md)
+ [Build Your Own Container for SageMaker AI Multi-Model Endpoints](build-multi-model-build-container.md)
+ [Multi-Model Endpoint Security](multi-model-endpoint-security.md)
+ [CloudWatch Metrics for Multi-Model Endpoint Deployments](multi-model-endpoint-cloudwatch-metrics.md)
+ [Set SageMaker AI multi-model endpoint model caching behavior](multi-model-caching.md)
+ [Set Auto Scaling Policies for Multi-Model Endpoint Deployments](multi-model-endpoints-autoscaling.md)

## How multi-model endpoints work

SageMaker AI manages the lifecycle of the models that a multi-model endpoint hosts in the container's memory. Instead of downloading all of the models from an Amazon S3 bucket to the container when you create the endpoint, SageMaker AI dynamically loads and caches them when you invoke them. When SageMaker AI receives an invocation request for a particular model, it does the following:

1. Routes the request to an instance behind the endpoint.

1. Downloads the model from the S3 bucket to that instance's storage volume.

1. Loads the model into the container's memory (CPU or GPU memory, depending on whether the instance is CPU or GPU backed) on that instance. If the model is already loaded in the container's memory, invocation is faster because SageMaker AI doesn't need to download and load it.

SageMaker AI continues to route requests for a model to the instance where the model is already loaded. However, if the model receives many invocation requests, and there are additional instances for the multi-model endpoint, SageMaker AI routes some requests to another instance to accommodate the traffic. If the model isn't already loaded on the second instance, the model is downloaded to that instance's storage volume and loaded into the container's memory.

When an instance's memory utilization is high and SageMaker AI needs to load another model into memory, it unloads unused models from that instance's container to ensure that there is enough memory to load the model. Models that are unloaded remain on the instance's storage volume and can be loaded into the container's memory later without being downloaded again from the S3 bucket. If the instance's storage volume reaches its capacity, SageMaker AI deletes any unused models from the storage volume.

To delete a model, stop sending requests and delete it from the S3 bucket. SageMaker AI provides multi-model endpoint capability in a serving container. Adding models to, and deleting them from, a multi-model endpoint doesn't require updating the endpoint itself. To add a model, you upload it to the S3 bucket and invoke it. You don’t need code changes to use it.
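Invoking a specific model on a multi-model endpoint uses the standard `InvokeEndpoint` API with the additional `TargetModel` parameter, which gives the model archive's path relative to the S3 prefix in `ModelDataUrl`. The following Boto3 sketch uses placeholder names, and the CSV content type is an assumption about the serving container.

```
# Sketch: invoke one model on a multi-model endpoint. The endpoint name,
# model archive path, and content type are placeholder assumptions.

def build_invocation(endpoint_name, relative_model_path, payload):
    """InvokeEndpoint parameters for a multi-model endpoint. TargetModel
    is relative to the S3 prefix configured in ModelDataUrl."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": relative_model_path,  # e.g. "model-42.tar.gz"
        "ContentType": "text/csv",
        "Body": payload,
    }

def invoke(endpoint_name, relative_model_path, payload):
    """Send the request (requires AWS credentials) and return the body."""
    import boto3
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        **build_invocation(endpoint_name, relative_model_path, payload))
    return response["Body"].read()
```

Because adding a model is just an S3 upload, the first invocation of a newly added archive triggers the dynamic download and load described above.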

**Note**  
When you update a multi-model endpoint, initial invocation requests on the endpoint might experience higher latencies as Smart Routing in multi-model endpoints adapts to your traffic pattern. However, once it learns your traffic pattern, you can experience low latencies for the most frequently used models. Less frequently used models may incur some cold start latency because models are dynamically loaded onto an instance.

## Sample notebooks for multi-model endpoints

To learn more about how to use multi-model endpoints, you can try the following sample notebooks:
+ Examples for multi-model endpoints using CPU backed instances:
  + [Multi-Model Endpoint XGBoost Sample Notebook](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/multi_model_xgboost_home_value/xgboost_multi_model_endpoint_home_value.html) – This notebook shows how to deploy multiple XGBoost models to an endpoint.
  + [Multi-Model Endpoints BYOC Sample Notebook](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/multi_model_bring_your_own/multi_model_endpoint_bring_your_own.html) – This notebook shows how to set up and deploy a custom container that supports multi-model endpoints in SageMaker AI.
+ Example for multi-model endpoints using GPU backed instances:
  + [Run multiple deep learning models on GPUs with Amazon SageMaker AI Multi-model endpoints (MME)](https://github.com/aws/amazon-sagemaker-examples/blob/main/multi-model-endpoints/mme-on-gpu/cv/resnet50_mme_with_gpu.ipynb) – This notebook shows how to use an NVIDIA Triton Inference container to deploy ResNet-50 models to a multi-model endpoint.

For instructions on how to create and access Jupyter notebook instances that you can use to run the previous examples in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). After you've created a notebook instance and opened it, choose the **SageMaker AI Examples** tab to see a list of all the SageMaker AI samples. The multi-model endpoint notebooks are located in the **ADVANCED FUNCTIONALITY** section. To open a notebook, choose its **Use** tab and choose **Create copy**.

For more information about use cases for multi-model endpoints, see the following blogs and resources:
+ Video: [Hosting thousands of models on SageMaker AI](https://www.youtube.com/watch?v=XqCNTWmHsLc&t=751s)
+ Video: [SageMaker AI ML for SaaS](https://www.youtube.com/watch?v=BytpYlJ3vsQ)
+ Blog: [How to scale machine learning inference for multi-tenant SaaS use cases](https://aws.amazon.com/blogs/machine-learning/how-to-scale-machine-learning-inference-for-multi-tenant-saas-use-cases/)
+ Case study: [Veeva Systems](https://aws.amazon.com/partners/success/advanced-clinical-veeva/)

# Supported algorithms, frameworks, and instances for multi-model endpoints


For information about the algorithms, frameworks, and instance types that you can use with multi-model endpoints, see the following sections.

## Supported algorithms, frameworks, and instances for multi-model endpoints using CPU backed instances


The inference containers for the following algorithms and frameworks support multi-model endpoints:
+ [XGBoost algorithm with Amazon SageMaker AI](xgboost.md)
+ [K-Nearest Neighbors (k-NN) Algorithm](k-nearest-neighbors.md)
+ [Linear Learner Algorithm](linear-learner.md)
+ [Random Cut Forest (RCF) Algorithm](randomcutforest.md)
+ [Resources for using TensorFlow with Amazon SageMaker AI](tf.md)
+ [Resources for using Scikit-learn with Amazon SageMaker AI](sklearn.md)
+ [Resources for using Apache MXNet with Amazon SageMaker AI](mxnet.md)
+ [Resources for using PyTorch with Amazon SageMaker AI](pytorch.md)

To use any other framework or algorithm, use the SageMaker AI inference toolkit to build a container that supports multi-model endpoints. For information, see [Build Your Own Container for SageMaker AI Multi-Model Endpoints](build-multi-model-build-container.md).

Multi-model endpoints support all of the CPU instance types.

## Supported algorithms, frameworks, and instances for multi-model endpoints using GPU backed instances


Hosting multiple GPU backed models on multi-model endpoints is supported through the [SageMaker AI Triton Inference server](https://docs.aws.amazon.com/sagemaker/latest/dg/triton.html). This supports all major inference frameworks such as NVIDIA® TensorRT™, PyTorch, MXNet, Python, ONNX, XGBoost, scikit-learn, RandomForest, OpenVINO, custom C++, and more.

To use any other framework or algorithm, you can use the Triton backend for Python or C++ to write your model logic and serve any custom model. After the server is ready, you can start deploying hundreds of deep learning models behind one endpoint.

Multi-model endpoints support the following GPU instance types:


| Instance family | Instance type | vCPUs | GiB of memory per vCPU | GPUs | GPU memory | 
| --- | --- | --- | --- | --- | --- | 
| p2 | ml.p2.xlarge | 4 | 15.25 | 1 | 12 | 
| p3 | ml.p3.2xlarge | 8 | 7.62 | 1 | 16 | 
| g5 | ml.g5.xlarge | 4 | 4 | 1 | 24 | 
| g5 | ml.g5.2xlarge | 8 | 4 | 1 | 24 | 
| g5 | ml.g5.4xlarge | 16 | 4 | 1 | 24 | 
| g5 | ml.g5.8xlarge | 32 | 4 | 1 | 24 | 
| g5 | ml.g5.16xlarge | 64 | 4 | 1 | 24 | 
| g4dn | ml.g4dn.xlarge | 4 | 4 | 1 | 16 | 
| g4dn | ml.g4dn.2xlarge | 8 | 4 | 1 | 16 | 
| g4dn | ml.g4dn.4xlarge | 16 | 4 | 1 | 16 | 
| g4dn | ml.g4dn.8xlarge | 32 | 4 | 1 | 16 | 
| g4dn | ml.g4dn.16xlarge | 64 | 4 | 1 | 16 | 

# Instance recommendations for multi-model endpoint deployments


There are several items to consider when selecting a SageMaker AI ML instance type for a multi-model endpoint:
+ Provision sufficient [Amazon Elastic Block Store (Amazon EBS)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html) capacity for all of the models that need to be served.
+ Balance performance (minimize cold starts) and cost (don’t over-provision instance capacity). For information about the size of the storage volume that SageMaker AI attaches for each instance type for an endpoint and for a multi-model endpoint, see [Instance storage volumes](host-instance-storage.md).
+ For a container configured to run in `MultiModel` mode, the storage volume provisioned for its instances is larger than in the default `SingleModel` mode. This allows more models to be cached on the instance storage volume than in `SingleModel` mode.

When choosing a SageMaker AI ML instance type, consider the following:
+ Multi-model endpoints are currently supported for all CPU instance types and on single-GPU instance types.
+ For the traffic distribution (access patterns) to the models that you want to host behind the multi-model endpoint, along with the model size (how many models could be loaded in memory on the instance), keep the following information in mind:
  + Think of the amount of memory on an instance as the cache space for models to be loaded, and think of the number of vCPUs as the concurrency limit to perform inference on the loaded models (assuming that invoking a model is bound to CPU).
  + For CPU backed instances, the number of vCPUs impacts your maximum concurrent invocations per instance (assuming that invoking a model is bound to CPU). A higher amount of vCPUs enables you to invoke more unique models concurrently.
  + For GPU backed instances, a higher amount of instance and GPU memory enables you to have more models loaded and ready to serve inference requests.
  + For both CPU and GPU backed instances, have some "slack" memory available so that unused models can be unloaded, especially for multi-model endpoints with multiple instances. If an instance or an Availability Zone fails, the models on those instances are rerouted to other instances behind the endpoint.
+ Determine your tolerance to loading/downloading times:
  + Instance type families with a `d` suffix (for example, m5d, c5d, or r5d) and g5 instance types come with an NVMe (non-volatile memory express) SSD, which offers high I/O performance and might reduce the time it takes to download models to the storage volume and for the container to load the model from the storage volume.
  + Because d and g5 instance types come with NVMe SSD storage, SageMaker AI does not attach an Amazon EBS storage volume to the ML compute instances that host the multi-model endpoint. Auto scaling works best when the models are similarly sized and homogeneous, that is, when they have similar inference latency and resource requirements.
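As a rough illustration of the memory-as-cache guidance above, the following sketch estimates how many models of a given average size could be resident on an instance at once. The function name, the 20% headroom default, and the example sizes are hypothetical illustrations, not SageMaker AI APIs or published sizing rules:

```python
def max_cached_models(instance_memory_gib, avg_model_size_gib, headroom_fraction=0.2):
    """Estimate how many models fit in instance memory, reserving some
    "slack" headroom so models can be unloaded and reloaded without thrashing."""
    usable = instance_memory_gib * (1.0 - headroom_fraction)
    return int(usable // avg_model_size_gib)

# Example: with 64 GiB of instance memory, 2 GiB models, and 20% headroom,
# roughly 25 models could be resident at once.
print(max_cached_models(64, 2))
```

Pair an estimate like this with your traffic distribution: if the working set of frequently invoked models exceeds the estimate, expect frequent unloading and loading.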

You can also use the following guidance to help you optimize model loading on your multi-model endpoints:

**Choosing an instance type that can't hold all of the targeted models in memory**

In some cases, you might opt to reduce costs by choosing an instance type that can't hold all of the targeted models in memory at once. SageMaker AI dynamically unloads models when it runs out of memory to make room for a newly targeted model. For infrequently requested models, you sacrifice dynamic load latency. In cases with more stringent latency needs, you might opt for larger instance types or more instances. Investing time up front for performance testing and analysis helps you to have successful production deployments.

**Evaluating your model cache hits**

Amazon CloudWatch metrics can help you evaluate your models. For more information about metrics you can use with multi-model endpoints, see [CloudWatch Metrics for Multi-Model Endpoint Deployments](multi-model-endpoint-cloudwatch-metrics.md).

You can use the `Average` statistic of the `ModelCacheHit` metric to monitor the ratio of requests where the model is already loaded. You can use the `SampleCount` statistic for the `ModelUnloadingTime` metric to monitor the number of unload requests sent to the container during a time period. If models are unloaded too frequently (an indicator of *thrashing*, where models are being unloaded and loaded again because there is insufficient cache space for the working set of models), consider using a larger instance type with more memory or increasing the number of instances behind the multi-model endpoint. For multi-model endpoints with multiple instances, be aware that a model might be loaded on more than one instance.
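As a sketch of how you might pull the `ModelCacheHit` statistic with Boto3 (the endpoint and variant names are placeholders, and the helper names are hypothetical; `AWS/SageMaker` is the CloudWatch namespace used for endpoint metrics):

```python
from datetime import datetime, timedelta, timezone

def mean_cache_hit(datapoints):
    """Average the 'Average' statistic across the returned datapoints."""
    values = [dp["Average"] for dp in datapoints]
    return sum(values) / len(values) if values else None

def fetch_model_cache_hit(endpoint_name, variant_name, hours=1):
    """Query the ModelCacheHit metric for the trailing time window."""
    import boto3  # imported here so the helper above stays dependency-free
    cloudwatch = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ModelCacheHit",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        Period=300,  # 5-minute buckets
        Statistics=["Average"],
    )
    return mean_cache_hit(response["Datapoints"])

# Usage (requires AWS credentials):
# ratio = fetch_model_cache_hit("<ENDPOINT_NAME>", "AllTraffic")
```

A ratio well below 1.0 for your hot models suggests the working set does not fit in the cache.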

# Create a Multi-Model Endpoint


You can use the SageMaker AI console or the AWS SDK for Python (Boto3) to create a multi-model endpoint. To create either a CPU or GPU backed endpoint through the console, see the console procedure in the following sections. If you want to create a multi-model endpoint with the AWS SDK for Python (Boto3), use either the CPU or GPU procedure in the following sections. The CPU and GPU workflows are similar but have several differences, such as the container requirements.

**Topics**
+ [Create a multi-model endpoint (console)](#create-multi-model-endpoint-console)
+ [Create a multi-model endpoint using CPUs with the AWS SDK for Python (Boto3)](#create-multi-model-endpoint-sdk-cpu)
+ [Create a multi-model endpoint using GPUs with the AWS SDK for Python (Boto3)](#create-multi-model-endpoint-sdk-gpu)

## Create a multi-model endpoint (console)

You can create both CPU and GPU backed multi-model endpoints through the console. Use the following procedure to create a multi-model endpoint through the SageMaker AI console.

**To create a multi-model endpoint (console)**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **Model**, and then from the **Inference** group, choose **Create model**. 

1. For **Model name**, enter a name.

1. For **IAM role**, choose or create an IAM role that has the `AmazonSageMakerFullAccess` IAM policy attached. 

1.  In the **Container definition** section, for **Provide model artifacts and inference image options**, choose **Use multiple models**.  
![\[The section of the Create model page where you can choose Use multiple models.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/mme-create-model-ux-2.PNG)

1. For the **Inference container image**, enter the Amazon ECR path for your desired container image.

   For GPU models, you must use a container backed by the NVIDIA Triton Inference Server. For a list of container images that work with GPU backed endpoints, see the [NVIDIA Triton Inference Containers (SM support only)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only). For more information about the NVIDIA Triton Inference Server, see [Use Triton Inference Server with SageMaker AI](https://docs.aws.amazon.com/sagemaker/latest/dg/triton.html).

1. Choose **Create model**.

1. Deploy your multi-model endpoint as you would a single model endpoint. For instructions, see [Deploy the Model to SageMaker AI Hosting Services](ex1-model-deployment.md#ex1-deploy-model).

## Create a multi-model endpoint using CPUs with the AWS SDK for Python (Boto3)

Use the following section to create a multi-model endpoint backed by CPU instances. You create a multi-model endpoint using the Amazon SageMaker AI [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model), [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config), and [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint) APIs just as you would create a single model endpoint, but with two changes. When defining the model container, you need to pass a new `Mode` parameter value, `MultiModel`. You also need to pass the `ModelDataUrl` field that specifies the prefix in Amazon S3 where the model artifacts are located, instead of the path to a single model artifact, as you would when deploying a single model.

For a sample notebook that uses SageMaker AI to deploy multiple XGBoost models to an endpoint, see [Multi-Model Endpoint XGBoost Sample Notebook](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/multi_model_xgboost_home_value/xgboost_multi_model_endpoint_home_value.html). 

The following procedure outlines the key steps used in that sample to create a CPU backed multi-model endpoint.

**To deploy the model (AWS SDK for Python (Boto 3))**

1. Get a container with an image that supports deploying multi-model endpoints. For a list of built-in algorithms and framework containers that support multi-model endpoints, see [Supported algorithms, frameworks, and instances for multi-model endpoints](multi-model-support.md). For this example, we use the [K-Nearest Neighbors (k-NN) Algorithm](k-nearest-neighbors.md) built-in algorithm. We call the [SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html) utility function `image_uris.retrieve()` to get the address for the K-Nearest Neighbors built-in algorithm image.

   ```
   import sagemaker

   # Create a session to look up the current AWS Region
   sagemaker_session = sagemaker.Session()
   region = sagemaker_session.boto_region_name
   image = sagemaker.image_uris.retrieve("knn", region=region)
   container = { 
                 'Image':        image,
                 'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                 'Mode':         'MultiModel'
               }
   ```

1. Get an AWS SDK for Python (Boto3) SageMaker AI client and create the model that uses this container.

   ```
   import boto3
   sagemaker_client = boto3.client('sagemaker')
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [container])
   ```

1. (Optional) If you are using a serial inference pipeline, get the additional containers to include in the pipeline, and include them in the `Containers` argument of `CreateModel`:

   ```
   preprocessor_container = { 
                  'Image': '<ACCOUNT_ID>.dkr.ecr.<REGION_NAME>.amazonaws.com/<PREPROCESSOR_IMAGE>:<TAG>'
               }
   
   multi_model_container = { 
                 'Image': '<ACCOUNT_ID>.dkr.ecr.<REGION_NAME>.amazonaws.com/<IMAGE>:<TAG>',
                 'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                 'Mode':         'MultiModel'
               }
   
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [preprocessor_container, multi_model_container]
               )
   ```
**Note**  
You can use only one multi-model-enabled endpoint in a serial inference pipeline.

1. (Optional) If your use case does not benefit from model caching, set the value of the `ModelCacheSetting` field of the `MultiModelConfig` parameter to `Disabled`, and include it in the container definition that you pass in the `Containers` argument of the call to `create_model`. The value of the `ModelCacheSetting` field is `Enabled` by default.

   ```
   container = { 
                   'Image': image, 
                   'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                   'Mode': 'MultiModel',
                   'MultiModelConfig': {
                           # Default value is 'Enabled'
                           'ModelCacheSetting': 'Disabled'
                   }
              }
   
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [container]
               )
   ```

1. Configure the multi-model endpoint for the model. We recommend configuring your endpoints with at least two instances. This allows SageMaker AI to provide a highly available set of predictions across multiple Availability Zones for the models.

   ```
   response = sagemaker_client.create_endpoint_config(
                   EndpointConfigName = '<ENDPOINT_CONFIG_NAME>',
                   ProductionVariants=[
                        {
                           'InstanceType':        'ml.m4.xlarge',
                           'InitialInstanceCount': 2,
                           'InitialVariantWeight': 1,
                           'ModelName':            '<MODEL_NAME>',
                           'VariantName':          'AllTraffic'
                         }
                   ]
              )
   ```

1. Create the multi-model endpoint using the `EndpointName` and `EndpointConfigName` parameters.

   ```
   response = sagemaker_client.create_endpoint(
                 EndpointName       = '<ENDPOINT_NAME>',
                 EndpointConfigName = '<ENDPOINT_CONFIG_NAME>')
   ```
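Endpoint creation is asynchronous, so a new endpoint is not immediately ready to serve requests. As a sketch (the function name and endpoint name are placeholders), you can block until the endpoint is in service with the Boto3 `endpoint_in_service` waiter:

```python
def wait_until_in_service(endpoint_name):
    """Block until the endpoint reaches InService, then return its status."""
    import boto3  # lazy import keeps the sketch self-contained
    sagemaker_client = boto3.client("sagemaker")
    # The waiter polls DescribeEndpoint until the endpoint is InService
    waiter = sagemaker_client.get_waiter("endpoint_in_service")
    waiter.wait(EndpointName=endpoint_name)
    description = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    return description["EndpointStatus"]

# Usage (requires AWS credentials):
# wait_until_in_service('<ENDPOINT_NAME>')
```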

## Create a multi-model endpoint using GPUs with the AWS SDK for Python (Boto3)

Use the following section to create a GPU backed multi-model endpoint. You create a multi-model endpoint using the Amazon SageMaker AI [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model), [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config), and [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint) APIs similarly to creating single model endpoints, but there are several changes. When defining the model container, you need to pass a new `Mode` parameter value, `MultiModel`. You also need to pass the `ModelDataUrl` field that specifies the prefix in Amazon S3 where the model artifacts are located, instead of the path to a single model artifact, as you would when deploying a single model. For GPU backed multi-model endpoints, you also must use a container with the NVIDIA Triton Inference Server that is optimized for running on GPU instances. For a list of container images that work with GPU backed endpoints, see the [NVIDIA Triton Inference Containers (SM support only)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only).

For an example notebook that demonstrates how to create a multi-model endpoint backed by GPUs, see [Run multiple deep learning models on GPUs with Amazon SageMaker AI Multi-model endpoints (MME)](https://github.com/aws/amazon-sagemaker-examples/blob/main/multi-model-endpoints/mme-on-gpu/cv/resnet50_mme_with_gpu.ipynb).

The following procedure outlines the key steps to create a GPU backed multi-model endpoint.

**To deploy the model (AWS SDK for Python (Boto 3))**

1. Define the container image. To create a multi-model endpoint with GPU support for ResNet models, define the container to use the [NVIDIA Triton Server image](https://docs.aws.amazon.com/sagemaker/latest/dg/triton.html). This container supports multi-model endpoints and is optimized for running on GPU instances. We call the [SageMaker AI Python SDK](https://sagemaker.readthedocs.io/en/stable/v2.html) utility function `image_uris.retrieve()` to get the address for the image. For example:

   ```
   import sagemaker

   # Create a session to look up the current AWS Region
   sagemaker_session = sagemaker.Session()
   region = sagemaker_session.boto_region_name
   
   # Find the sagemaker-tritonserver image at
   # https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-triton/resnet50/triton_resnet50.ipynb
   # Find available tags at https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only
   # account_id_map (defined in the sample notebook) maps each Region to the
   # account that hosts the Triton images
   
   image = "{account_id}.dkr.ecr.{region}.amazonaws.com/sagemaker-tritonserver:<TAG>".format(
       account_id=account_id_map[region], region=region
   )
   
   container = { 
                 'Image':        image,
                 'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                 'Mode':         'MultiModel',
                 'Environment':  {'SAGEMAKER_TRITON_DEFAULT_MODEL_NAME': 'resnet'},
               }
   ```

1. Get an AWS SDK for Python (Boto3) SageMaker AI client and create the model that uses this container.

   ```
   import boto3
   sagemaker_client = boto3.client('sagemaker')
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [container])
   ```

1. (Optional) If you are using a serial inference pipeline, get the additional containers to include in the pipeline, and include them in the `Containers` argument of `CreateModel`:

   ```
   preprocessor_container = { 
                  'Image': '<ACCOUNT_ID>.dkr.ecr.<REGION_NAME>.amazonaws.com/<PREPROCESSOR_IMAGE>:<TAG>'
               }
   
   multi_model_container = { 
                 'Image': '<ACCOUNT_ID>.dkr.ecr.<REGION_NAME>.amazonaws.com/<IMAGE>:<TAG>',
                 'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                 'Mode':         'MultiModel'
               }
   
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [preprocessor_container, multi_model_container]
               )
   ```
**Note**  
You can use only one multi-model-enabled endpoint in a serial inference pipeline.

1. (Optional) If your use case does not benefit from model caching, set the value of the `ModelCacheSetting` field of the `MultiModelConfig` parameter to `Disabled`, and include it in the container definition that you pass in the `Containers` argument of the call to `create_model`. The value of the `ModelCacheSetting` field is `Enabled` by default.

   ```
   container = { 
                   'Image': image, 
                   'ModelDataUrl': 's3://<BUCKET_NAME>/<PATH_TO_ARTIFACTS>',
                   'Mode': 'MultiModel',
                   'MultiModelConfig': {
                           # Default value is 'Enabled'
                           'ModelCacheSetting': 'Disabled'
                   }
              }
   
   response = sagemaker_client.create_model(
                 ModelName        = '<MODEL_NAME>',
                 ExecutionRoleArn = role,
                 Containers       = [container]
               )
   ```

1. Configure the multi-model endpoint with GPU backed instances for the model. We recommend configuring your endpoints with more than one instance to allow for high availability and higher cache hits.

   ```
   response = sagemaker_client.create_endpoint_config(
                   EndpointConfigName = '<ENDPOINT_CONFIG_NAME>',
                   ProductionVariants=[
                        {
                           'InstanceType':        'ml.g4dn.4xlarge',
                           'InitialInstanceCount': 2,
                           'InitialVariantWeight': 1,
                           'ModelName':            '<MODEL_NAME>',
                           'VariantName':          'AllTraffic'
                         }
                   ]
              )
   ```

1. Create the multi-model endpoint using the `EndpointName` and `EndpointConfigName` parameters.

   ```
   response = sagemaker_client.create_endpoint(
                 EndpointName       = '<ENDPOINT_NAME>',
                 EndpointConfigName = '<ENDPOINT_CONFIG_NAME>')
   ```

# Invoke a Multi-Model Endpoint

To invoke a multi-model endpoint, use [invoke_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html#SageMakerRuntime.Client.invoke_endpoint) from the SageMaker AI Runtime just as you would invoke a single model endpoint, with one change. Pass a new `TargetModel` parameter that specifies which of the models at the endpoint to target. The SageMaker AI Runtime `InvokeEndpoint` request supports `X-Amzn-SageMaker-Target-Model` as a new header that takes the relative path of the model specified for invocation. The SageMaker AI system constructs the absolute path of the model by combining the prefix that is provided as part of the `CreateModel` API call with the relative path of the model.

The following procedures are the same for both CPU and GPU-backed multi-model endpoints.

------
#### [ AWS SDK for Python (Boto 3) ]

The following example prediction request uses the [AWS SDK for Python (Boto 3)](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html) in the sample notebook.

```
response = runtime_sagemaker_client.invoke_endpoint(
                        EndpointName = "<ENDPOINT_NAME>",
                        ContentType  = "text/csv",
                        TargetModel  = "<MODEL_FILENAME>.tar.gz",
                        Body         = body)
```

------
#### [ AWS CLI ]

 The following example shows how to make a CSV request with two rows using the AWS Command Line Interface (AWS CLI):

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name "<ENDPOINT_NAME>" \
  --body "1.0,2.0,5.0"$'\n'"2.0,3.0,4.0" \
  --content-type "text/csv" \
  --target-model "<MODEL_NAME>.tar.gz" \
  output_file.txt
```

If the inference request succeeds, the response is written to `output_file.txt`. For more examples on how to make predictions with the AWS CLI, see [Making predictions with the AWS CLI](https://sagemaker.readthedocs.io/en/stable/frameworks/tensorflow/deploying_tensorflow_serving.html#making-predictions-with-the-aws-cli) in the SageMaker Python SDK documentation.

------

The multi-model endpoint dynamically loads target models as needed. You can observe this when running the [MME Sample Notebook](https://sagemaker-examples.readthedocs.io/en/latest/advanced_functionality/multi_model_xgboost_home_value/xgboost_multi_model_endpoint_home_value.html) as it iterates through random invocations against multiple target models hosted behind a single endpoint. The first request against a given model takes longer because the model has to be downloaded from Amazon Simple Storage Service (Amazon S3) and loaded into memory. This is called a *cold start*, and it is expected behavior on multi-model endpoints, which trade occasional load latency for better price performance. Subsequent calls finish faster because there is no additional overhead after the model has loaded.
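You can observe the cold start effect yourself by timing successive invocations of the same target model. This is a sketch; the wrapper function is hypothetical, and the endpoint name, model file name, and request body are placeholders:

```python
import time

def timed_invoke(runtime_client, endpoint_name, target_model, body):
    """Invoke a target model and return (latency_seconds, response)."""
    start = time.perf_counter()
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        TargetModel=target_model,
        Body=body,
    )
    return time.perf_counter() - start, response

# Usage (requires AWS credentials):
# import boto3
# runtime = boto3.client('sagemaker-runtime')
# cold, _ = timed_invoke(runtime, '<ENDPOINT_NAME>', '<MODEL>.tar.gz', body)  # downloads and loads the model
# warm, _ = timed_invoke(runtime, '<ENDPOINT_NAME>', '<MODEL>.tar.gz', body)  # model already in memory
# print(f"cold: {cold:.2f}s, warm: {warm:.2f}s")
```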

**Note**  
For GPU backed instances, an HTTP 507 response code from the GPU container indicates a lack of memory or other resources. In this case, SageMaker AI unloads unused models from the container in order to load more frequently used models.

## Retry Requests on ModelNotReadyException Errors


The first time you call `invoke_endpoint` for a model, the model is downloaded from Amazon Simple Storage Service and loaded into the inference container. This makes the first call take longer to return. Subsequent calls to the same model finish faster, because the model is already loaded.

SageMaker AI returns a response for a call to `invoke_endpoint` within 60 seconds. Some models are too large to download within 60 seconds. If the model does not finish loading before the 60 second timeout limit, the request to `invoke_endpoint` returns with the error code `ModelNotReadyException`, and the model continues to download and load into the inference container for up to 360 seconds. If you get a `ModelNotReadyException` error code for an `invoke_endpoint` request, retry the request. By default, the AWS SDKs for Python (Boto 3) (using [Legacy retry mode](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html#legacy-retry-mode)) and Java retry `invoke_endpoint` requests that result in `ModelNotReadyException` errors. You can configure the retry strategy to continue retrying the request for up to 360 seconds. If you expect your model to take longer than 60 seconds to download and load into the container, set the SDK socket timeout to 70 seconds. For more information about configuring the retry strategy for the AWS SDK for Python (Boto3), see [Configuring a retry mode](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html#configuring-a-retry-mode). The following code shows an example that configures the retry strategy to retry calls to `invoke_endpoint` for up to 180 seconds.

```
import boto3
from botocore.config import Config

# This example retry strategy sets the retry attempts to 2.
# With this setting, the request can attempt to download and/or load the model
# for up to 180 seconds: 1 original request (60 seconds) + 2 retries (120 seconds)
config = Config(
    read_timeout=70,
    retries={
        'max_attempts': 2  # This value can be adjusted to 5 to go up to the 360s max timeout
    }
)
runtime_sagemaker_client = boto3.client('sagemaker-runtime', config=config)
```

# Add or Remove Models

You can deploy additional models to a multi-model endpoint and invoke them through that endpoint immediately. When adding a new model, you don't need to update or bring down the endpoint, so you avoid the cost of creating and running a separate endpoint for each new model. The process for adding and removing models is the same for CPU and GPU-backed multi-model endpoints.

SageMaker AI unloads unused models from the container when the instance is reaching memory capacity and more models need to be downloaded into the container. SageMaker AI also deletes unused model artifacts from the instance storage volume when the volume is reaching capacity and new models need to be downloaded. The first invocation to a newly added model takes longer because the endpoint takes time to download the model from Amazon S3 and load it into the memory of the container on the instance that hosts the endpoint.

With the endpoint already running, copy a new set of model artifacts to the Amazon S3 location where you store your models.

```
# Add an AdditionalModel to the endpoint and exercise it
aws s3 cp AdditionalModel.tar.gz s3://amzn-s3-demo-bucket/path/to/artifacts/
```

**Important**  
To update a model, proceed as you would when adding a new model. Use a new and unique name. Don't overwrite model artifacts in Amazon S3 because the old version of the model might still be loaded in the containers or on the storage volume of the instances on the endpoint. Invocations to the new model could then invoke the old version of the model. 

Client applications can request predictions from the additional target model as soon as it is stored in S3.

```
response = runtime_sagemaker_client.invoke_endpoint(
                        EndpointName='<ENDPOINT_NAME>',
                        ContentType='text/csv',
                        TargetModel='AdditionalModel.tar.gz',
                        Body=body)
```

To delete a model from a multi-model endpoint, stop invoking the model from the clients and remove it from the S3 location where model artifacts are stored.
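As a sketch of the removal step with Boto3 (the function name is hypothetical, and the bucket and key in the usage note are placeholders), you might delete the artifact after clients stop targeting it:

```python
def remove_model_artifact(bucket, key):
    """Delete a model artifact from S3 after clients stop invoking it.
    Note: a previously loaded copy may remain cached on endpoint instances
    until SageMaker AI unloads it."""
    import boto3  # lazy import keeps the sketch self-contained
    s3 = boto3.client("s3")
    s3.delete_object(Bucket=bucket, Key=key)

# Usage (requires AWS credentials):
# remove_model_artifact('amzn-s3-demo-bucket', 'path/to/artifacts/AdditionalModel.tar.gz')
```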

# Build Your Own Container for SageMaker AI Multi-Model Endpoints

Refer to the following sections for bringing your own container and dependencies to multi-model endpoints.

**Topics**
+ [Bring your own dependencies for multi-model endpoints on CPU backed instances](#build-multi-model-container-cpu)
+ [Bring your own dependencies for multi-model endpoints on GPU backed instances](#build-multi-model-container-gpu)
+ [Use the SageMaker AI Inference Toolkit](#multi-model-inference-toolkit)
+ [Custom Containers Contract for Multi-Model Endpoints](mms-container-apis.md)

## Bring your own dependencies for multi-model endpoints on CPU backed instances


If none of the pre-built container images serve your needs, you can build your own container for use with CPU backed multi-model endpoints.

Custom Amazon Elastic Container Registry (Amazon ECR) images deployed in Amazon SageMaker AI are expected to adhere to the basic contract described in [Custom Inference Code with Hosting Services](your-algorithms-inference-code.md), which governs how SageMaker AI interacts with a Docker container that runs your own inference code. For a container to be capable of loading and serving multiple models concurrently, there are additional APIs and behaviors that must be followed. This additional contract includes new APIs to load, list, get, and unload models, and a different API to invoke models. There are also different behaviors for error scenarios that the APIs need to abide by. To indicate that the container complies with the additional requirements, you can add the following command to your Dockerfile:

```
LABEL com.amazonaws.sagemaker.capabilities.multi-models=true
```

SageMaker AI also injects the following environment variable into the container:

```
SAGEMAKER_MULTI_MODEL=true
```
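Inside your container's startup code, you can branch on this environment variable to decide how to serve. A minimal sketch (the helper name and the printed messages are hypothetical):

```python
import os

def is_multi_model_mode():
    """Return True when SageMaker AI started the container in MultiModel mode."""
    return os.environ.get("SAGEMAKER_MULTI_MODEL", "").lower() == "true"

# Example: choose a serving path at startup.
if is_multi_model_mode():
    print("Starting multi-model serving front end")
else:
    print("Starting single-model serving front end")
```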

If you are creating a multi-model endpoint for a serial inference pipeline, your Dockerfile must have the required labels for both multi-model endpoints and serial inference pipelines. For more information about serial inference pipelines, see [Run Real-time Predictions with an Inference Pipeline](inference-pipeline-real-time.md).

To help you implement these requirements for a custom container, two libraries are available:
+ [Multi Model Server](https://github.com/awslabs/multi-model-server) is an open source framework for serving machine learning models that can be installed in containers to provide the front end that fulfills the requirements for the new multi-model endpoint container APIs. It provides the HTTP front end and model management capabilities required by multi-model endpoints to host multiple models within a single container, load models into and unload models out of the container dynamically, and perform inference on a specified loaded model. It also provides a pluggable backend handler where you can implement your own algorithm.
+ [SageMaker AI Inference Toolkit](https://github.com/aws/sagemaker-inference-toolkit) is a library that bootstraps Multi Model Server with a configuration and settings that make it compatible with SageMaker AI multi-model endpoints. It also allows you to tweak important performance parameters, such as the number of workers per model, depending on the needs of your scenario. 

## Bring your own dependencies for multi-model endpoints on GPU backed instances


The bring your own container (BYOC) capability on multi-model endpoints with GPU backed instances is not currently supported by the Multi Model Server and SageMaker AI Inference Toolkit libraries.

To create multi-model endpoints with GPU backed instances, use the SageMaker AI supported [NVIDIA Triton Inference Server](https://docs.aws.amazon.com/sagemaker/latest/dg/triton.html) with the [NVIDIA Triton Inference Containers](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#nvidia-triton-inference-containers-sm-support-only). To bring your own dependencies, build your own container with a SageMaker AI supported Triton Inference Server container as the base image in your Dockerfile:

```
FROM 301217895009.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:22.07-py3
```

**Important**  
Containers with the Triton Inference Server are the only supported containers you can use for GPU backed multi-model endpoints.

## Use the SageMaker AI Inference Toolkit


**Note**  
The SageMaker AI Inference Toolkit is only supported for CPU backed multi-model endpoints. It is not currently supported for GPU backed multi-model endpoints.

Pre-built containers that support multi-model endpoints are listed in [Supported algorithms, frameworks, and instances for multi-model endpoints](multi-model-support.md). If you want to use any other framework or algorithm, you need to build a container. The easiest way to do this is to use the [SageMaker AI Inference Toolkit](https://github.com/aws/sagemaker-inference-toolkit) to extend an existing pre-built container. The SageMaker AI inference toolkit is an implementation of Multi Model Server (MMS) that creates endpoints that can be deployed in SageMaker AI. For a sample notebook that shows how to set up and deploy a custom container that supports multi-model endpoints in SageMaker AI, see the [Multi-Model Endpoint BYOC Sample Notebook](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/multi_model_bring_your_own).

**Note**  
The SageMaker AI inference toolkit supports only Python model handlers. If you want to implement your handler in any other language, you must build your own container that implements the additional multi-model endpoint APIs. For information, see [Custom Containers Contract for Multi-Model Endpoints](mms-container-apis.md).

**To extend a container by using the SageMaker AI inference toolkit**

1. Create a model handler. MMS expects a model handler, which is a Python file that implements functions to preprocess the input, get predictions from the model, and process the output. For an example of a model handler, see [model\_handler.py](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/multi_model_bring_your_own/container/model_handler.py) from the sample notebook.

1. Import the inference toolkit and use its `model_server.start_model_server` function to start MMS. The following example is from the `dockerd-entrypoint.py` file from the sample notebook. Notice that the call to `model_server.start_model_server` passes the model handler described in the previous step:

   ```
   import subprocess
   import sys
   import shlex
   import os
   from retrying import retry
   from subprocess import CalledProcessError
   from sagemaker_inference import model_server
   
   def _retry_if_error(exception):
       return isinstance(exception, (CalledProcessError, OSError))
   
   @retry(stop_max_delay=1000 * 50,
          retry_on_exception=_retry_if_error)
   def _start_mms():
       # by default the number of workers per model is 1, but we can configure it through the
       # environment variable below if desired.
       # os.environ['SAGEMAKER_MODEL_SERVER_WORKERS'] = '2'
       model_server.start_model_server(handler_service='/home/model-server/model_handler.py:handle')
   
   def main():
       if sys.argv[1] == 'serve':
           _start_mms()
       else:
           subprocess.check_call(shlex.split(' '.join(sys.argv[1:])))
   
       # prevent docker exit
       subprocess.call(['tail', '-f', '/dev/null'])
       
   main()
   ```

1. In your `Dockerfile`, copy the model handler from the first step and specify the Python file from the previous step as the entrypoint in your `Dockerfile`. The following lines are from the [Dockerfile](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/multi_model_bring_your_own/container/Dockerfile) used in the sample notebook:

   ```
   # Copy the default custom service file to handle incoming data and inference requests
   COPY model_handler.py /home/model-server/model_handler.py
   
   # Define an entrypoint script for the docker image
   ENTRYPOINT ["python", "/usr/local/bin/dockerd-entrypoint.py"]
   ```

1. Build and register your container. The following shell script from the sample notebook builds the container and uploads it to an Amazon Elastic Container Registry repository in your AWS account:

   ```
   %%sh
   
   # The name of our algorithm
   algorithm_name=demo-sagemaker-multimodel
   
   cd container
   
   account=$(aws sts get-caller-identity --query Account --output text)
   
   # Get the region defined in the current configuration (default to us-west-2 if none defined)
   region=$(aws configure get region)
   region=${region:-us-west-2}
   
   fullname="${account}.dkr.ecr.${region}.amazonaws.com/${algorithm_name}:latest"
   
   # If the repository doesn't exist in ECR, create it.
   aws ecr describe-repositories --repository-names "${algorithm_name}" > /dev/null 2>&1
   
   if [ $? -ne 0 ]
   then
       aws ecr create-repository --repository-name "${algorithm_name}" > /dev/null
   fi
   
   # Get the login command from ECR and execute it directly
   $(aws ecr get-login --region ${region} --no-include-email)
   
   # Build the docker image locally with the image name and then push it to ECR
   # with the full name.
   
   docker build -q -t ${algorithm_name} .
   docker tag ${algorithm_name} ${fullname}
   
   docker push ${fullname}
   ```

You can now use this container to deploy multi-model endpoints in SageMaker AI.

**Topics**
+ [

## Bring your own dependencies for multi-model endpoints on CPU backed instances
](#build-multi-model-container-cpu)
+ [

## Bring your own dependencies for multi-model endpoints on GPU backed instances
](#build-multi-model-container-gpu)
+ [

## Use the SageMaker AI Inference Toolkit
](#multi-model-inference-toolkit)
+ [

# Custom Containers Contract for Multi-Model Endpoints
](mms-container-apis.md)

# Custom Containers Contract for Multi-Model Endpoints
API Container Contract

To handle multiple models, your container must support a set of APIs that enable Amazon SageMaker AI to communicate with the container for loading, listing, getting, and unloading models as required. The `model_name` is used in the new set of APIs as the key input parameter. The customer container is expected to keep track of the loaded models using `model_name` as the mapping key. Also, the `model_name` is an opaque identifier and is not necessarily the value of the `TargetModel` parameter passed into the `InvokeEndpoint` API. The original `TargetModel` value in the `InvokeEndpoint` request is passed to the container in the APIs as an `X-Amzn-SageMaker-Target-Model` header that can be used for logging purposes.

**Note**  
Multi-model endpoints for GPU backed instances are currently supported only with SageMaker AI's [NVIDIA Triton Inference Server container](https://docs.aws.amazon.com/sagemaker/latest/dg/triton.html). This container already implements the contract defined below. Customers can directly use this container with their multi-model GPU endpoints, without any additional work.

You can configure the following APIs on your containers for CPU backed multi-model endpoints.

**Topics**
+ [

## Load Model API
](#multi-model-api-load-model)
+ [

## List Model API
](#multi-model-api-list-model)
+ [

## Get Model API
](#multi-model-api-get-model)
+ [

## Unload Model API
](#multi-model-api-unload-model)
+ [

## Invoke Model API
](#multi-model-api-invoke-model)

## Load Model API
Load Model

Instructs the container to load a particular model present in the `url` field of the body into the memory of the customer container and to keep track of it with the assigned `model_name`. After a model is loaded, the container should be ready to serve inference requests using this `model_name`.

```
POST /models HTTP/1.1
Content-Type: application/json
Accept: application/json

{
     "model_name" : "{model_name}",
     "url" : "/opt/ml/models/{model_name}/model",
}
```

**Note**  
If `model_name` is already loaded, this API should return 409. Any time a model cannot be loaded because of a lack of memory or any other resource, this API should return a 507 HTTP status code to SageMaker AI, which then initiates unloading unused models to reclaim memory.
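Container-side handling of this API can be sketched as a small in-memory registry; the capacity limit and dictionary layout below are assumptions for illustration:

```
# In-memory registry of loaded models; MAX_MODELS stands in for real resource limits.
MAX_MODELS = 2
loaded_models = {}


def load_model(model_name, url):
    """Return an HTTP-style status code per the multi-model container contract."""
    if model_name in loaded_models:
        return 409  # model is already loaded
    if len(loaded_models) >= MAX_MODELS:
        return 507  # out of resources; SageMaker AI reacts by unloading unused models
    loaded_models[model_name] = {"url": url}  # a real container would deserialize here
    return 200
```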

## List Model API
List Model

Returns the list of models loaded into the memory of the customer container.

```
GET /models HTTP/1.1
Accept: application/json

Response = 
{
    "models": [
        {
             "modelName" : "{model_name}",
             "modelUrl" : "/opt/ml/models/{model_name}/model",
        },
        {
            "modelName" : "{model_name}",
            "modelUrl" : "/opt/ml/models/{model_name}/model",
        },
        ....
    ]
}
```

This API also supports pagination.

```
GET /models?next_page_token={next_page_token} HTTP/1.1
Accept: application/json

Response = 
{
    "models": [
        {
             "modelName" : "{model_name}",
             "modelUrl" : "/opt/ml/models/{model_name}/model",
        },
        ....
    ],
    "nextPageToken" : "{next_page_token}",
}
```

SageMaker AI can initially call the List Models API without providing a value for `next_page_token`. If a `nextPageToken` field is returned as part of the response, it will be provided as the value for `next_page_token` in a subsequent List Models call. If a `nextPageToken` is not returned, it means that there are no more models to return.
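One way the container might implement this token exchange is sketched below; the page size and the token encoding (a stringified offset) are assumptions:

```
def list_models(loaded_models, next_page_token=None, page_size=2):
    """Return one page of loaded models, plus a nextPageToken when more remain."""
    names = sorted(loaded_models)
    start = int(next_page_token) if next_page_token else 0
    page = names[start:start + page_size]
    response = {
        "models": [
            {"modelName": n, "modelUrl": loaded_models[n]["url"]} for n in page
        ]
    }
    if start + page_size < len(names):
        # More models remain; hand back a token for the next call.
        response["nextPageToken"] = str(start + page_size)
    return response
```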

## Get Model API
Get Model

This is a simple read API on the `model_name` entity.

```
GET /models/{model_name} HTTP/1.1
Accept: application/json

{
     "modelName" : "{model_name}",
     "modelUrl" : "/opt/ml/models/{model_name}/model",
}
```

**Note**  
If `model_name` is not loaded, this API should return 404.

## Unload Model API
Unload Model

Instructs the customer container to unload a model from memory. SageMaker AI invokes this API to evict a candidate model when it starts the process of loading a new model. The container should reclaim the resources provisioned to `model_name` before this API returns a response.

```
DELETE /models/{model_name}
```

**Note**  
If `model_name` is not loaded, this API should return 404.
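The Get and Unload APIs reduce to simple lookups against the same registry that the Load API maintains. A sketch, with status codes following the contract above:

```
def get_model(loaded_models, model_name):
    """Return (status, body) for the Get Model API."""
    if model_name not in loaded_models:
        return 404, None
    return 200, {
        "modelName": model_name,
        "modelUrl": loaded_models[model_name]["url"],
    }


def unload_model(loaded_models, model_name):
    """Return a status code for the Unload Model API."""
    if model_name not in loaded_models:
        return 404
    del loaded_models[model_name]  # resources for model_name are reclaimed here
    return 200
```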

## Invoke Model API
Invoke Model

Makes a prediction request from the particular `model_name` supplied. The SageMaker AI Runtime `InvokeEndpoint` request supports `X-Amzn-SageMaker-Target-Model` as a new header that takes the relative path of the model specified for invocation. The SageMaker AI system constructs the absolute path of the model by combining the prefix that is provided as part of the `CreateModel` API call with the relative path of the model.

```
POST /models/{model_name}/invoke HTTP/1.1
Content-Type: ContentType
Accept: Accept
X-Amzn-SageMaker-Custom-Attributes: CustomAttributes
X-Amzn-SageMaker-Target-Model: [relativePath]/{artifactName}.tar.gz
```

**Note**  
If `model_name` is not loaded, this API should return 404.

Additionally, on GPU instances, if `InvokeEndpoint` fails because of a lack of memory or other resources, this API should return a 507 HTTP status code to SageMaker AI, which then initiates unloading unused models to reclaim memory.
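The absolute-path construction described above amounts to joining the `CreateModel` S3 prefix with the relative path from the `X-Amzn-SageMaker-Target-Model` header. A sketch, where the bucket and object names are hypothetical:

```
def resolve_model_url(model_data_prefix, target_model):
    """Join the CreateModel S3 prefix with the relative TargetModel path."""
    return model_data_prefix.rstrip("/") + "/" + target_model.lstrip("/")
```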

# Multi-Model Endpoint Security
Security

Models and data in a multi-model endpoint are co-located on the instance storage volume and in container memory. All instances for Amazon SageMaker AI endpoints run in a single-tenant container that you own. Only your models can run on your multi-model endpoint. It's your responsibility to manage the mapping of requests to models and to provide access for users to the correct target models. SageMaker AI uses [IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) to provide IAM identity-based policies that you use to specify allowed or denied actions and resources and the conditions under which actions are allowed or denied.

By default, an IAM principal with [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html) permissions on a multi-model endpoint can invoke any model at the address of the S3 prefix defined in the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) operation, provided that the IAM execution role defined in that operation has permissions to download the model. If you need to restrict `InvokeEndpoint` access to a limited set of models in S3, you can do one of the following:
+ Restrict `InvokeEndpoint` calls to specific models hosted at the endpoint by using the `sagemaker:TargetModel` IAM condition key. For example, the following policy allows `InvokeEndpoint` requests only when the value of the `TargetModel` field matches one of the specified regular expressions:


  ```
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Action": [
                  "sagemaker:InvokeEndpoint"
              ],
              "Effect": "Allow",
              "Resource":
              "arn:aws:sagemaker:us-east-1:111122223333:endpoint/endpoint_name",
              "Condition": {
                  "StringLike": {
                      "sagemaker:TargetModel": ["company_a/*", "common/*"]
                  }
              }
          }
      ]
  }
  ```


  For information about SageMaker AI condition keys, see [Condition Keys for Amazon SageMaker AI](https://docs.aws.amazon.com/IAM/latest/UserGuide/list_amazonsagemaker.html#amazonsagemaker-policy-keys) in the *AWS Identity and Access Management User Guide*.
+ Create multi-model endpoints with more restrictive S3 prefixes. 

For more information about how SageMaker AI uses roles to manage access to endpoints and perform operations on your behalf, see [How to use SageMaker AI execution roles](sagemaker-roles.md). Your customers might also have certain data isolation requirements dictated by their own compliance requirements that can be satisfied using IAM identities.
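An invocation that the example policy permits targets a model under one of the allowed prefixes. The following sketch uses the SageMaker AI Runtime client; the endpoint name, model path, and payload are placeholders, and `boto3` is imported inside the function so the sketch stands alone:

```
def invoke_allowed_model(payload: bytes, region_name: str = "us-east-1"):
    """Invoke a model under the company_a/ prefix that the policy above allows."""
    import boto3  # imported here so the sketch can be read without boto3 installed

    client = boto3.client("sagemaker-runtime", region_name=region_name)
    return client.invoke_endpoint(
        EndpointName="endpoint_name",
        TargetModel="company_a/model-a.tar.gz",  # matches "company_a/*" in the policy
        ContentType="application/json",
        Body=payload,
    )
```

A request with `TargetModel="company_b/model.tar.gz"` would be denied by the same policy, because it matches neither `company_a/*` nor `common/*`.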

# CloudWatch Metrics for Multi-Model Endpoint Deployments


Amazon SageMaker AI provides metrics for endpoints so you can monitor the cache hit rate, the number of models loaded and the model wait times for loading, downloading, and uploading at a multi-model endpoint. Some of the metrics are different for CPU and GPU backed multi-model endpoints, so the following sections describe the Amazon CloudWatch metrics that you can use for each type of multi-model endpoint.

For more information about the metrics, see **Multi-Model Endpoint Model Loading Metrics** and **Multi-Model Endpoint Model Instance Metrics** in [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md). Per-model metrics aren't supported. 

## CloudWatch metrics for CPU backed multi-model endpoints


You can monitor the following metrics on CPU backed multi-model endpoints.

The `AWS/SageMaker` namespace includes the following model loading metrics from calls to [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html).

Metrics are available at a 1-minute frequency.

For information about how long CloudWatch metrics are retained for, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*.

**Multi-Model Endpoint Model Loading Metrics**


| Metric | Description | 
| --- | --- | 
| ModelLoadingWaitTime  |  The interval of time that an invocation request has waited for the target model to be downloaded, or loaded, or both in order to perform inference.  Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelUnloadingTime  |  The interval of time that it took to unload the model through the container's `UnloadModel` API call.  Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelDownloadingTime |  The interval of time that it took to download the model from Amazon Simple Storage Service (Amazon S3). Units: Microseconds Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelLoadingTime  |  The interval of time that it took to load the model through the container's `LoadModel` API call. Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelCacheHit  |  The number of `InvokeEndpoint` requests sent to the multi-model endpoint for which the model was already loaded. The Average statistic shows the ratio of requests for which the model was already loaded. Units: None Valid statistics: Average, Sum, Sample Count  | 

**Dimensions for Multi-Model Endpoint Model Loading Metrics**


| Dimension | Description | 
| --- | --- | 
| EndpointName, VariantName |  Filters endpoint invocation metrics for a `ProductionVariant` of the specified endpoint and variant.  | 

The `/aws/sagemaker/Endpoints` namespace includes the following instance metrics from calls to [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html).

Metrics are available at a 1-minute frequency.

For information about how long CloudWatch metrics are retained for, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*.

**Multi-Model Endpoint Model Instance Metrics**


| Metric | Description | 
| --- | --- | 
| LoadedModelCount  |  The number of models loaded in the containers of the multi-model endpoint. This metric is emitted per instance. The Average statistic with a period of 1 minute tells you the average number of models loaded per instance. The Sum statistic tells you the total number of models loaded across all instances in the endpoint. The models that this metric tracks are not necessarily unique because a model might be loaded in multiple containers at the endpoint. Units: None Valid statistics: Average, Sum, Min, Max, Sample Count  | 
| CPUUtilization  |  The sum of each individual CPU core's utilization. The CPU utilization of each core ranges from 0–100. For example, if there are four CPUs, the `CPUUtilization` range is 0%–400%. For endpoint variants, the value is the sum of the CPU utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| MemoryUtilization |  The percentage of memory that is used by the containers on an instance. This value range is 0%–100%. For endpoint variants, the value is the sum of the memory utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| DiskUtilization |  The percentage of disk space used by the containers on an instance. This value range is 0%–100%. For endpoint variants, the value is the sum of the disk space utilization of the primary and supplementary containers on the instance. Units: Percent  | 

## CloudWatch metrics for GPU multi-model endpoint deployments


You can monitor the following metrics on GPU backed multi-model endpoints.

The `AWS/SageMaker` namespace includes the following model loading metrics from calls to [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html).

Metrics are available at a 1-minute frequency.

For information about how long CloudWatch metrics are retained for, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*.

**Multi-Model Endpoint Model Loading Metrics**


| Metric | Description | 
| --- | --- | 
| ModelLoadingWaitTime  |  The interval of time that an invocation request has waited for the target model to be downloaded, or loaded, or both in order to perform inference.  Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelUnloadingTime  |  The interval of time that it took to unload the model through the container's `UnloadModel` API call.  Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelDownloadingTime |  The interval of time that it took to download the model from Amazon Simple Storage Service (Amazon S3). Units: Microseconds Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelLoadingTime  |  The interval of time that it took to load the model through the container's `LoadModel` API call. Units: Microseconds  Valid statistics: Average, Sum, Min, Max, Sample Count   | 
| ModelCacheHit  |  The number of `InvokeEndpoint` requests sent to the multi-model endpoint for which the model was already loaded. The Average statistic shows the ratio of requests for which the model was already loaded. Units: None Valid statistics: Average, Sum, Sample Count  | 

**Dimensions for Multi-Model Endpoint Model Loading Metrics**


| Dimension | Description | 
| --- | --- | 
| EndpointName, VariantName |  Filters endpoint invocation metrics for a `ProductionVariant` of the specified endpoint and variant.  | 

The `/aws/sagemaker/Endpoints` namespace includes the following instance metrics from calls to [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html).

Metrics are available at a 1-minute frequency.

For information about how long CloudWatch metrics are retained for, see [GetMetricStatistics](https://docs.aws.amazon.com/AmazonCloudWatch/latest/APIReference/API_GetMetricStatistics.html) in the *Amazon CloudWatch API Reference*.

**Multi-Model Endpoint Model Instance Metrics**


| Metric | Description | 
| --- | --- | 
| LoadedModelCount  |  The number of models loaded in the containers of the multi-model endpoint. This metric is emitted per instance. The Average statistic with a period of 1 minute tells you the average number of models loaded per instance. The Sum statistic tells you the total number of models loaded across all instances in the endpoint. The models that this metric tracks are not necessarily unique because a model might be loaded in multiple containers at the endpoint. Units: None Valid statistics: Average, Sum, Min, Max, Sample Count  | 
| CPUUtilization  |  The sum of each individual CPU core's utilization. The CPU utilization of each core ranges from 0–100. For example, if there are four CPUs, the `CPUUtilization` range is 0%–400%. For endpoint variants, the value is the sum of the CPU utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| MemoryUtilization |  The percentage of memory that is used by the containers on an instance. This value range is 0%–100%. For endpoint variants, the value is the sum of the memory utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| GPUUtilization |  The percentage of GPU units that are used by the containers on an instance. The value range is 0–100 and is multiplied by the number of GPUs. For example, if there are four GPUs, the `GPUUtilization` range is 0%–400%. For endpoint variants, the value is the sum of the GPU utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| GPUMemoryUtilization |  The percentage of GPU memory used by the containers on an instance. The value range is 0–100 and is multiplied by the number of GPUs. For example, if there are four GPUs, the `GPUMemoryUtilization` range is 0%–400%. For endpoint variants, the value is the sum of the GPU memory utilization of the primary and supplementary containers on the instance. Units: Percent  | 
| DiskUtilization |  The percentage of disk space used by the containers on an instance. This value range is 0%–100%. For endpoint variants, the value is the sum of the disk space utilization of the primary and supplementary containers on the instance. Units: Percent  | 

# Set SageMaker AI multi-model endpoint model caching behavior


By default, multi-model endpoints cache frequently used models in memory (CPU or GPU, depending on whether you have CPU or GPU backed instances) and on disk to provide low latency inference. The cached models are unloaded and/or deleted from disk only when a container runs out of memory or disk space to accommodate a newly targeted model.

You can change the caching behavior of a multi-model endpoint and explicitly enable or disable model caching by setting the parameter `ModelCacheSetting` when you call [create\_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model).

We recommend setting the value of the `ModelCacheSetting` parameter to `Disabled` for use cases that do not benefit from model caching, such as when a large number of models need to be served from the endpoint but each model is invoked only once (or very infrequently). For such use cases, setting `ModelCacheSetting` to `Disabled` allows higher transactions per second (TPS) for `invoke_endpoint` requests compared to the default caching mode, because SageMaker AI does the following after each `invoke_endpoint` request:
+ Asynchronously unloads the model from memory and deletes it from disk immediately after it is invoked.
+ Provides higher concurrency for downloading and loading models in the inference container. For both CPU and GPU backed endpoints, the concurrency is a factor of the number of the vCPUs of the container instance.

For guidelines on choosing a SageMaker AI ML instance type for a multi-model endpoint, see [Instance recommendations for multi-model endpoint deployments](multi-model-endpoint-instance.md).
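For example, a `create_model` call that disables caching sets `ModelCacheSetting` inside the container's `MultiModelConfig`. The model name, image URI, S3 prefix, and role ARN below are placeholders:

```
def create_uncached_mme(sm_client, model_name, image_uri, model_data_prefix, role_arn):
    """Create a multi-model endpoint model with caching explicitly disabled."""
    return sm_client.create_model(
        ModelName=model_name,
        ExecutionRoleArn=role_arn,
        PrimaryContainer={
            "Image": image_uri,
            "Mode": "MultiModel",
            "ModelDataUrl": model_data_prefix,  # S3 prefix containing the model archives
            "MultiModelConfig": {"ModelCacheSetting": "Disabled"},
        },
    )
```

Pass a `boto3` SageMaker client as `sm_client`; omitting `MultiModelConfig` leaves the default caching behavior in place.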

# Set Auto Scaling Policies for Multi-Model Endpoint Deployments


SageMaker AI multi-model endpoints fully support automatic scaling, which manages replicas of models to ensure models scale based on traffic patterns. We recommend that you configure your multi-model endpoint and the size of your instances based on [Instance recommendations for multi-model endpoint deployments](multi-model-endpoint-instance.md) and also set up instance based auto scaling for your endpoint. The invocation rates used to trigger an auto-scale event are based on the aggregate set of predictions across the full set of models served by the endpoint. For additional details on setting up endpoint auto scaling, see [Automatically Scale Amazon SageMaker AI Models](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html).

You can set up auto scaling policies with predefined and custom metrics on both CPU and GPU backed multi-model endpoints.

**Note**  
SageMaker AI multi-model endpoint metrics are available at one-minute granularity.

## Define a scaling policy


To specify the metrics and target values for a scaling policy, you can configure a target-tracking scaling policy. You can use either a predefined metric or a custom metric.

Scaling policy configuration is represented by a JSON block. You save your scaling policy configuration as a JSON block in a text file. You use that text file when invoking the AWS CLI or the Application Auto Scaling API. For more information about policy configuration syntax, see `[TargetTrackingScalingPolicyConfiguration](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html)` in the *Application Auto Scaling API Reference*.

The following options are available for defining a target-tracking scaling policy configuration.

### Use a predefined metric


To quickly define a target-tracking scaling policy for a variant, use the `SageMakerVariantInvocationsPerInstance` predefined metric. `SageMakerVariantInvocationsPerInstance` is the average number of times per minute that each instance for a variant is invoked. We strongly recommend using this metric.

To use a predefined metric in a scaling policy, create a target tracking configuration for your policy. In the target tracking configuration, include a `PredefinedMetricSpecification` for the predefined metric and a `TargetValue` for the target value of that metric.

The following example is a typical policy configuration for target-tracking scaling for a variant. In this configuration, we use the `SageMakerVariantInvocationsPerInstance` predefined metric to adjust the number of variant instances so that each instance has an `InvocationsPerInstance` metric of `70`.

```
{"TargetValue": 70.0,
    "PredefinedMetricSpecification":
    {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    }
}
```

**Note**  
We recommend that you use `InvocationsPerInstance` while using multi-model endpoints. The `TargetValue` for this metric depends on your application’s latency requirements. We also recommend that you load test your endpoints to set up suitable scaling parameter values. To learn more about load testing and setting up autoscaling for your endpoints, see the blog [Configuring autoscaling inference endpoints in Amazon SageMaker AI](https://aws.amazon.com/blogs/machine-learning/configuring-autoscaling-inference-endpoints-in-amazon-sagemaker/).

### Use a custom metric


If you need to define a target-tracking scaling policy that meets your custom requirements, define a custom metric. You can define a custom metric based on any production variant metric that changes in proportion to scaling.

Not all SageMaker AI metrics work for target tracking. The metric must be a valid utilization metric that describes how busy an instance is. The value of the metric must be inversely proportional to the number of variant instances; that is, the value of the metric should decrease when the number of instances increases.

**Important**  
Before deploying automatic scaling in production, you must test automatic scaling with your custom metric.

#### Example custom metric for a CPU backed multi-model endpoint


The following example is a target-tracking configuration for a scaling policy. In this configuration, for a model named `my-model`, a custom metric of `CPUUtilization` adjusts the instance count on the endpoint based on an average CPU utilization of 50% across all instances.

```
{"TargetValue": 50,
    "CustomizedMetricSpecification":
    {"MetricName": "CPUUtilization",
        "Namespace": "/aws/sagemaker/Endpoints",
        "Dimensions": [
            {"Name": "EndpointName", "Value": "my-endpoint" },
            {"Name": "ModelName","Value": "my-model"}
        ],
        "Statistic": "Average",
        "Unit": "Percent"
    }
}
```

#### Example custom metric for a GPU backed multi-model endpoint


The following example is a target-tracking configuration for a scaling policy. In this configuration, for a model named `my-model`, a custom metric of `GPUUtilization` adjusts the instance count on the endpoint based on an average GPU utilization of 50% across all instances.

```
{"TargetValue": 50,
    "CustomizedMetricSpecification":
    {"MetricName": "GPUUtilization",
        "Namespace": "/aws/sagemaker/Endpoints",
        "Dimensions": [
            {"Name": "EndpointName", "Value": "my-endpoint" },
            {"Name": "ModelName","Value": "my-model"}
        ],
        "Statistic": "Average",
        "Unit": "Percent"
    }
}
```

## Add a cooldown period


To add a cooldown period for scaling out your endpoint, specify a value, in seconds, for `ScaleOutCooldown`. Similarly, to add a cooldown period for scaling in your model, add a value, in seconds, for `ScaleInCooldown`. For more information about `ScaleInCooldown` and `ScaleOutCooldown`, see `[TargetTrackingScalingPolicyConfiguration](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html)` in the *Application Auto Scaling API Reference*.

The following is an example target-tracking configuration for a scaling policy. In this configuration, the `SageMakerVariantInvocationsPerInstance` predefined metric is used to adjust scaling based on an average of `70` across all instances of that variant. The configuration provides a scale-in cooldown period of 10 minutes and a scale-out cooldown period of 5 minutes.

```
{"TargetValue": 70.0,
    "PredefinedMetricSpecification":
    {"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    },
    "ScaleInCooldown": 600,
    "ScaleOutCooldown": 300
}
```
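Applying a configuration like this programmatically uses the Application Auto Scaling `put_scaling_policy` API against the endpoint variant. A sketch, where the policy name is a placeholder and the variant must already be registered as a scalable target with `register_scalable_target`:

```
def apply_invocations_policy(aas_client, endpoint_name, variant_name, target=70.0):
    """Attach the target-tracking policy above to a SageMaker AI endpoint variant."""
    return aas_client.put_scaling_policy(
        PolicyName="invocations-target-tracking",  # placeholder name
        ServiceNamespace="sagemaker",
        ResourceId=f"endpoint/{endpoint_name}/variant/{variant_name}",
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": target,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 600,
            "ScaleOutCooldown": 300,
        },
    )
```

Pass a `boto3` `application-autoscaling` client as `aas_client`.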

# Multi-container endpoints


SageMaker AI multi-container endpoints enable customers to deploy multiple containers that use different models or frameworks on a single SageMaker AI endpoint. The containers can be run in sequence as an inference pipeline, or each container can be accessed individually by using direct invocation to improve endpoint utilization and optimize costs.

For information about invoking the containers in a multi-container endpoint in sequence, see [Inference pipelines in Amazon SageMaker AI](inference-pipelines.md).

For information about invoking a specific container in a multi-container endpoint, see [Invoke a multi-container endpoint with direct invocation](multi-container-direct.md).

**Topics**
+ [

# Create a multi-container endpoint (Boto 3)
](multi-container-create.md)
+ [

# Update a multi-container endpoint
](multi-container-update.md)
+ [

# Invoke a multi-container endpoint with direct invocation
](multi-container-direct.md)
+ [

# Security with multi-container endpoints with direct invocation
](multi-container-security.md)
+ [

# Metrics for multi-container endpoints with direct invocation
](multi-container-metrics.md)
+ [

# Autoscale multi-container endpoints
](multi-container-auto-scaling.md)
+ [

# Troubleshoot multi-container endpoints
](multi-container-troubleshooting.md)

# Create a multi-container endpoint (Boto 3)


Create a multi-container endpoint by calling the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html), [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html), and [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) APIs as you would to create any other endpoint. You can run these containers sequentially as an inference pipeline, or run each individual container by using direct invocation. Multi-container endpoints have the following requirements when you call `create_model`:
+ Use the `Containers` parameter instead of `PrimaryContainer`, and include more than one container in the `Containers` parameter.
+ The `ContainerHostname` parameter is required for each container in a multi-container endpoint with direct invocation.
+ Set the `Mode` parameter of the `InferenceExecutionConfig` field to `Direct` for direct invocation of each container, or `Serial` to use containers as an inference pipeline. The default mode is `Serial`. 

**Note**  
Currently there is a limit of up to 15 containers supported on a multi-container endpoint.

The following example creates a multi-container model for direct invocation.

1. Create container elements and `InferenceExecutionConfig` with direct invocation.

   ```
   container1 = {
                    'Image': '123456789012.dkr.ecr.us-east-1.amazonaws.com/myimage1:mytag',
                    'ContainerHostname': 'firstContainer'
                }
   
   container2 = {
                    'Image': '123456789012.dkr.ecr.us-east-1.amazonaws.com/myimage2:mytag',
                    'ContainerHostname': 'secondContainer'
                }
   inferenceExecutionConfig = {'Mode': 'Direct'}
   ```

1. Create the model with the container elements and set the `InferenceExecutionConfig` field.

   ```
   import boto3
   sm_client = boto3.Session().client('sagemaker')
   
   response = sm_client.create_model(
                  ModelName = 'my-direct-mode-model-name',
                  InferenceExecutionConfig = inferenceExecutionConfig,
                  ExecutionRoleArn = role,
                  Containers = [container1, container2]
              )
   ```

To create an endpoint, you would then call [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config) and [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint) as you would to create any other endpoint.
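As a sketch, those two calls might be wrapped in a helper like the following. The variant name, instance type, and endpoint-config naming convention are placeholder assumptions; `sm_client` is the same SageMaker AI Boto3 client used above.

```
def deploy_multi_container_model(sm_client, model_name, endpoint_name,
                                 instance_type="ml.m5.xlarge"):
    """Create an endpoint config for the multi-container model, then
    create the endpoint from that config."""
    config_name = endpoint_name + "-config"

    # The endpoint config ties the model to instance settings.
    sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        }],
    )
    # The endpoint is created asynchronously from the config.
    return sm_client.create_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config_name,
    )
```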

# Update a multi-container endpoint


To update an Amazon SageMaker AI multi-container endpoint, complete the following steps.

1.  Call [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model) to create a new model with a new value for the `Mode` parameter in the `InferenceExecutionConfig` field.

1.  Call [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config) to create a new endpoint config with a different name by using the new model you created in the previous step.

1.  Call [update_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.update_endpoint) to update the endpoint with the new endpoint config you created in the previous step.
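The three steps might be sketched as follows. The naming convention, variant settings, and instance type are placeholder assumptions; `sm_client` is a SageMaker AI Boto3 client.

```
def switch_endpoint_mode(sm_client, endpoint_name, role, containers,
                         new_mode="Serial"):
    """Update a multi-container endpoint to a new InferenceExecutionConfig
    mode: new model, new endpoint config, then update_endpoint."""
    model_name = f"{endpoint_name}-{new_mode.lower()}-model"
    config_name = f"{endpoint_name}-{new_mode.lower()}-config"

    # Step 1: new model with the new Mode value.
    sm_client.create_model(
        ModelName=model_name,
        ExecutionRoleArn=role,
        Containers=containers,
        InferenceExecutionConfig={"Mode": new_mode},
    )
    # Step 2: new endpoint config that references the new model.
    sm_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": "ml.m5.xlarge",
        }],
    )
    # Step 3: point the live endpoint at the new config.
    return sm_client.update_endpoint(
        EndpointName=endpoint_name,
        EndpointConfigName=config_name,
    )
```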

# Invoke a multi-container endpoint with direct invocation


SageMaker AI multi-container endpoints enable customers to deploy multiple containers that host different models on a single SageMaker AI endpoint. You can host up to 15 different inference containers on a single endpoint. By using direct invocation, you can send a request to a specific inference container hosted on a multi-container endpoint.

To invoke a multi-container endpoint with direct invocation, call [invoke_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html#SageMakerRuntime.Client.invoke_endpoint) as you would invoke any other endpoint, and specify which container you want to invoke by using the `TargetContainerHostname` parameter.

 

 The following example directly invokes the `secondContainer` of a multi-container endpoint to get a prediction.

```
import boto3
runtime_sm_client = boto3.Session().client('sagemaker-runtime')

response = runtime_sm_client.invoke_endpoint(
   EndpointName ='my-endpoint',
   ContentType = 'text/csv',
   TargetContainerHostname='secondContainer', 
   Body = body)
```

For each direct invocation request to a multi-container endpoint, only the container with the specified `TargetContainerHostname` processes the request. You get validation errors if you do any of the following:
+ Specify a `TargetContainerHostname` that does not exist in the endpoint
+ Do not specify a value for `TargetContainerHostname` in a request to an endpoint configured for direct invocation
+ Specify a value for `TargetContainerHostname` in a request to an endpoint that is not configured for direct invocation
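These rules can be summarized as a purely illustrative client-side pre-check. This helper is hypothetical, not a SageMaker AI API; it only mirrors the server-side validation described above.

```
def check_target_container(hostnames, direct_mode, target=None):
    """Return None if an invoke_endpoint request would pass validation,
    otherwise a string describing the validation error."""
    if direct_mode and target is None:
        return "TargetContainerHostname is required for direct invocation"
    if not direct_mode and target is not None:
        return "TargetContainerHostname is only valid for direct invocation"
    if target is not None and target not in hostnames:
        return f"container '{target}' does not exist in the endpoint"
    return None
```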

# Security with multi-container endpoints with direct invocation


For multi-container endpoints with direct invocation, multiple containers are hosted on a single instance and share memory and a storage volume. It's your responsibility to use secure containers, maintain the correct mapping of requests to target containers, and provide users with the correct access to target containers. SageMaker AI uses IAM roles to provide IAM identity-based policies that you use to specify whether access to a resource is allowed or denied to that role, and under what conditions. For information about IAM roles, see [IAM roles](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html) in the *AWS Identity and Access Management User Guide*. For information about identity-based policies, see [Identity-based policies and resource-based policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_identity-vs-resource.html).

By default, an IAM principal with `InvokeEndpoint` permissions on a multi-container endpoint with direct invocation can invoke any container inside the endpoint with the endpoint name that you specify when you call `invoke_endpoint`. If you need to restrict `invoke_endpoint` access to a limited set of containers inside a multi-container endpoint, use the `sagemaker:TargetContainerHostname` IAM condition key. The following policies show how to limit calls to specific containers within an endpoint.

The following policy allows `invoke_endpoint` requests only when the value of the `TargetContainerHostname` field matches one of the specified regular expressions.

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "sagemaker:InvokeEndpoint"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:endpoint/endpoint_name",
            "Condition": {
                "StringLike": {
                    "sagemaker:TargetContainerHostname": [
                        "customIps*",
                        "common*"
                    ]
                }
            }
        }
    ]
}
```

------

The following policy denies `invoke_endpoint` requests when the value of the `TargetContainerHostname` field matches one of the specified regular expressions in the `Deny` statement.

------
#### [ JSON ]


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "sagemaker:InvokeEndpoint"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:endpoint/endpoint_name",
            "Condition": {
                "StringLike": {
                    "sagemaker:TargetContainerHostname": [
                        "model_name*"
                    ]
                }
            }
        },
        {
            "Action": [
                "sagemaker:InvokeEndpoint"
            ],
            "Effect": "Deny",
            "Resource": "arn:aws:sagemaker:us-east-1:111122223333:endpoint/endpoint_name",
            "Condition": {
                "StringLike": {
                    "sagemaker:TargetContainerHostname": [
                        "special-model_name*"
                    ]
                }
            }
        }
    ]
}
```

------

 For information about SageMaker AI condition keys, see [Condition Keys for SageMaker AI](https://docs.aws.amazon.com/IAM/latest/UserGuide/list_amazonsagemaker.html#amazonsagemaker-policy-keys) in the *AWS Identity and Access Management User Guide*.

# Metrics for multi-container endpoints with direct invocation


In addition to the endpoint metrics that are listed in [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md), SageMaker AI also provides per-container metrics.

Per-container metrics for multi-container endpoints with direct invocation are located in CloudWatch and categorized into two namespaces: `AWS/SageMaker` and `aws/sagemaker/Endpoints`. The `AWS/SageMaker` namespace includes invocation-related metrics, and the `aws/sagemaker/Endpoints` namespace includes memory and CPU utilization metrics.

The following table lists the per-container metrics for multi-container endpoints with direct invocation. All the metrics use the [`EndpointName, VariantName, ContainerName`] dimension set, which filters metrics to a specific container on a specific variant of a specific endpoint. These metrics share the same names as those for inference pipelines, but are emitted at the per-container level.

 


|  Metric Name  |  Description  |  Dimension  |  Namespace  | 
| --- |--- |--- |--- |
|  Invocations  |  The number of InvokeEndpoint requests sent to a container inside an endpoint. To get the total number of requests sent to that container, use the Sum statistic. Units: None Valid statistics: Sum, Sample Count |  EndpointName, VariantName, ContainerName  | AWS/SageMaker | 
|  Invocation4XXErrors  |  The number of InvokeEndpoint requests for which the model returned a 4xx HTTP response code on a specific container. For each 4xx response, SageMaker AI sends a 1. Units: None Valid statistics: Average, Sum  |  EndpointName, VariantName, ContainerName  | AWS/SageMaker | 
|  Invocation5XXErrors  |  The number of InvokeEndpoint requests for which the model returned a 5xx HTTP response code on a specific container. For each 5xx response, SageMaker AI sends a 1. Units: None Valid statistics: Average, Sum  |  EndpointName, VariantName, ContainerName  | AWS/SageMaker | 
|  ContainerLatency  |  The time it took for the target container to respond as viewed from SageMaker AI. ContainerLatency includes the time it took to send the request, to fetch the response from the model's container, and to complete inference in the container. Units: Microseconds Valid statistics: Average, Sum, Min, Max, Sample Count |  EndpointName, VariantName, ContainerName  | AWS/SageMaker | 
|  OverheadLatency  |  The time added to the time taken to respond to a client request by SageMaker AI for overhead. OverheadLatency is measured from the time that SageMaker AI receives the request until it returns a response to the client, minus the ModelLatency. Overhead latency can vary depending on request and response payload sizes, request frequency, and authentication or authorization of the request, among other factors. Units: Microseconds Valid statistics: Average, Sum, Min, Max, Sample Count  |  EndpointName, VariantName, ContainerName  | AWS/SageMaker | 
|  CPUUtilization  | The percentage of CPU units that are used by each container running on an instance. The value ranges from 0% to 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, CPUUtilization can range from 0% to 400%. For endpoints with direct invocation, the number of CPUUtilization metrics equals the number of containers in that endpoint. Units: Percent  |  EndpointName, VariantName, ContainerName  | aws/sagemaker/Endpoints | 
|  MemoryUtilization  |  The percentage of memory that is used by each container running on an instance. This value ranges from 0% to 100%. As with CPUUtilization, in endpoints with direct invocation, the number of MemoryUtilization metrics equals the number of containers in that endpoint. Units: Percent  |  EndpointName, VariantName, ContainerName  | aws/sagemaker/Endpoints | 

All the metrics in the previous table are specific to multi-container endpoints with direct invocation. Besides these per-container metrics, there are also variant-level metrics with the dimension `[EndpointName, VariantName]` for all the metrics in the table except `ContainerLatency`.
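As a sketch, per-container datapoints such as `Invocations` can be retrieved with the CloudWatch `GetMetricStatistics` API by supplying all three dimensions. The names below are placeholders; `cw` is a CloudWatch Boto3 client (for example, `boto3.client('cloudwatch')`).

```
def container_invocations(cw, endpoint_name, variant_name, container_name,
                          start, end):
    """Return the total Invocations count for one container over a time
    window, using the per-container dimension set described above."""
    resp = cw.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="Invocations",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
            {"Name": "ContainerName", "Value": container_name},
        ],
        StartTime=start,
        EndTime=end,
        Period=300,          # 5-minute buckets
        Statistics=["Sum"],  # Sum gives the total request count
    )
    return sum(point["Sum"] for point in resp["Datapoints"])
```

The same pattern applies to the `aws/sagemaker/Endpoints` metrics (`CPUUtilization`, `MemoryUtilization`) by changing `Namespace` and `MetricName`.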

# Autoscale multi-container endpoints


If you want to configure automatic scaling for a multi-container endpoint using the `InvocationsPerInstance` metric, we recommend that the model in each container exhibit similar CPU utilization and latency on each inference request. We recommend this because if traffic to the multi-container endpoint shifts from a low-CPU-utilization model to a high-CPU-utilization model while the overall call volume remains the same, the endpoint does not scale out, and there may not be enough instances to handle all the requests to the high-CPU-utilization model. For information about automatically scaling endpoints, see [Automatic scaling of Amazon SageMaker AI models](endpoint-auto-scaling.md).

# Troubleshoot multi-container endpoints


The following sections can help you troubleshoot errors with multi-container endpoints.

## Ping Health Check Errors


With multiple containers, endpoint memory and CPU are under higher pressure during endpoint creation. Specifically, the `MemoryUtilization` and `CPUUtilization` metrics are higher than for single-container endpoints, because utilization pressure is proportional to the number of containers. Because of this, we recommend that you choose instance types with enough memory and CPU so that all the models can be loaded on the instance (the same guidance applies to deploying an inference pipeline). Otherwise, your endpoint creation might fail with an error such as `XXX did not pass the ping health check`.

## Missing accept-bind-to-port=true Docker label


The containers in a multi-container endpoint listen on the port specified in the `SAGEMAKER_BIND_TO_PORT` environment variable instead of port 8080. When a container runs in a multi-container endpoint, SageMaker AI automatically provides this environment variable to the container. If this environment variable isn't present, containers default to using port 8080. To indicate that your container complies with this requirement, use the following command to add a label to your Dockerfile:

```
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
```

Otherwise, you see an error message such as `Your Ecr Image XXX does not contain required com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true Docker label(s).`

 If your container needs to listen on a second port, choose a port in the range specified by the `SAGEMAKER_SAFE_PORT_RANGE` environment variable. Specify the value as an inclusive range in the format *XXXX*-*YYYY*, where XXXX and YYYY are multi-digit integers. SageMaker AI provides this value automatically when you run the container in a multi-container endpoint. 
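Inside the container, the serving process can resolve these variables with ordinary environment lookups, as in this minimal sketch (the helper name is hypothetical):

```
import os

def serving_ports():
    """Resolve the primary serving port and any extra ports a container
    may use inside a multi-container endpoint."""
    # SageMaker AI sets SAGEMAKER_BIND_TO_PORT; fall back to 8080 otherwise.
    primary = int(os.environ.get("SAGEMAKER_BIND_TO_PORT", "8080"))

    # SAGEMAKER_SAFE_PORT_RANGE is an inclusive "XXXX-YYYY" range.
    safe_range = os.environ.get("SAGEMAKER_SAFE_PORT_RANGE")
    if safe_range:
        lo, hi = (int(part) for part in safe_range.split("-"))
        extra_ports = range(lo, hi + 1)
    else:
        extra_ports = range(0)  # no extra ports available
    return primary, extra_ports
```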

# Inference pipelines in Amazon SageMaker AI
Inference pipelines

An *inference pipeline* is an Amazon SageMaker AI model that is composed of a linear sequence of two to fifteen containers that process requests for inferences on data. You use an inference pipeline to define and deploy any combination of pretrained SageMaker AI built-in algorithms and your own custom algorithms packaged in Docker containers. You can use an inference pipeline to combine preprocessing, predictions, and post-processing data science tasks. Inference pipelines are fully managed.

You can add SageMaker AI Spark ML Serving and scikit-learn containers that reuse the data transformers developed for training models. The entire assembled inference pipeline can be considered a SageMaker AI model that you can use to make real-time predictions or to process batch transforms directly without any external preprocessing.

Within an inference pipeline model, SageMaker AI handles invocations as a sequence of HTTP requests. The first container in the pipeline handles the initial request, then the intermediate response is sent as a request to the second container, and so on, for each container in the pipeline. SageMaker AI returns the final response to the client. 

When you deploy the pipeline model, SageMaker AI installs and runs all of the containers on each Amazon Elastic Compute Cloud (Amazon EC2) instance in the endpoint or transform job. Feature processing and inferences run with low latency because the containers are co-located on the same EC2 instances. You define the containers for a pipeline model using the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) operation or from the console. Instead of setting one `PrimaryContainer`, you use the `Containers` parameter to set the containers that make up the pipeline. You also specify the order in which the containers are executed.

A pipeline model is immutable, but you can update an inference pipeline by deploying a new one using the [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) operation. This modularity supports greater flexibility during experimentation.

For information on how to create an inference pipeline with the SageMaker Model Registry, see [Model Registration Deployment with Model Registry](model-registry.md).

There are no additional costs for using this feature. You pay only for the instances running on an endpoint.

**Topics**
+ [

## Sample Notebooks for Inference Pipelines
](#inference-pipeline-sample-notebooks)
+ [

# Feature Processing with Spark ML and Scikit-learn
](inference-pipeline-mleap-scikit-learn-containers.md)
+ [

# Create a Pipeline Model
](inference-pipeline-create-console.md)
+ [

# Run Real-time Predictions with an Inference Pipeline
](inference-pipeline-real-time.md)
+ [

# Batch transforms with inference pipelines
](inference-pipeline-batch.md)
+ [

# Inference Pipeline Logs and Metrics
](inference-pipeline-logs-metrics.md)
+ [

# Troubleshoot Inference Pipelines
](inference-pipeline-troubleshoot.md)

## Sample Notebooks for Inference Pipelines
Sample Notebooks

For an example that shows how to create and deploy inference pipelines, see the [Inference Pipeline with Scikit-learn and Linear Learner](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-python-sdk/scikit_learn_inference_pipeline) sample notebook. For instructions on creating and accessing Jupyter notebook instances that you can use to run the example in SageMaker AI, see [Amazon SageMaker notebook instances](nbi.md). 

To see a list of all the SageMaker AI samples, after creating and opening a notebook instance, choose the **SageMaker AI Examples** tab. There are three inference pipeline notebooks: the first two are located in the `advanced_functionality` folder, and the third is in the `sagemaker-python-sdk` folder. To open a notebook, choose its **Use** tab, then choose **Create copy**.

# Feature Processing with Spark ML and Scikit-learn
Process Features with Spark ML and Scikit-learn

Before training a model with either Amazon SageMaker AI built-in algorithms or custom algorithms, you can use Spark and scikit-learn preprocessors to transform your data and engineer features. 

## Feature Processing with Spark ML


You can run Spark ML jobs with [AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/what-is-glue.html), a serverless ETL (extract, transform, load) service, from your SageMaker AI notebook. You can also connect to existing EMR clusters to run Spark ML jobs with [Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html). To do this, you need an AWS Identity and Access Management (IAM) role that grants permission for making calls from your SageMaker AI notebook to AWS Glue. 

**Note**  
To see which Python and Spark versions AWS Glue supports, refer to the [AWS Glue Release Notes](https://docs.aws.amazon.com/glue/latest/dg/release-notes.html).

After engineering features, you package and serialize Spark ML jobs with MLeap into MLeap containers that you can add to an inference pipeline. You don't need to use externally managed Spark clusters. With this approach, you can seamlessly scale from a sample of rows to terabytes of data. The same transformers work for both training and inference, so you don't need to duplicate preprocessing and feature engineering logic or develop a one-time solution to make the models persist. With inference pipelines, you don't need to maintain outside infrastructure, and you can make predictions directly from data inputs.

When you run a Spark ML job on AWS Glue, a Spark ML pipeline is serialized into [MLeap](https://github.com/combust/mleap) format. Then, you can use the job with the [SparkML Model Serving Container](https://github.com/aws/sagemaker-sparkml-serving-container) in a SageMaker AI Inference Pipeline. *MLeap* is a serialization format and execution engine for machine learning pipelines. It supports Spark, Scikit-learn, and TensorFlow for training pipelines and exporting them to a serialized pipeline called an MLeap Bundle. You can deserialize Bundles back into Spark for batch-mode scoring or into the MLeap runtime to power real-time API services. 

For an example that shows how to feature process with Spark ML, see the [Train an ML Model using Apache Spark in Amazon EMR and deploy in SageMaker AI](https://github.com/aws/amazon-sagemaker-examples/tree/main/sagemaker-python-sdk/sparkml_serving_emr_mleap_abalone) sample notebook.

## Feature Processing with Scikit-Learn


You can run and package scikit-learn jobs into containers directly in Amazon SageMaker AI. For an example of Python code for building a scikit-learn featurizer model that trains on [Fisher's Iris flower data set](http://archive.ics.uci.edu/ml/datasets/Iris) and predicts the species of Iris based on morphological measurements, see [IRIS Training and Prediction with Sagemaker Scikit-learn](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/scikit_learn_iris). 

# Create a Pipeline Model
Create a Pipeline Model

To create a pipeline model that can be deployed to an endpoint or used for a batch transform job, use the Amazon SageMaker AI console or the `CreateModel` operation. 

**To create an inference pipeline (console)**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **Models**, and then choose **Create models** from the **Inference** group. 

1. On the **Create model** page, provide a model name, choose an IAM role, and, if you want to use a private VPC, specify VPC values.   
![\[The page for creating a model for an Inference Pipeline.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/create-pipeline-model.png)

1. To add information about the containers in the inference pipeline, choose **Add container**, then choose **Next**.

1. Complete the fields for each container in the order that you want to execute them, up to the maximum of fifteen. Complete the **Container input options**, **Location of inference code image**, and, optionally, **Location of model artifacts**, **Container host name**, and **Environmental variables** fields.  
![\[Creating a pipeline model with containers.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/create-pipeline-model-containers.png)

   The **MyInferencePipelineModel** page summarizes the settings for the containers that provide input for the model. If you provided the environment variables in a corresponding container definition, SageMaker AI shows them in the **Environment variables** field.  
![\[The summary of container settings for the pipeline model.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/pipeline-MyInferencePipelinesModel-recap.png)

# Run Real-time Predictions with an Inference Pipeline
Real-time Inference

You can use trained models in an inference pipeline to make real-time predictions directly without performing external preprocessing. When you configure the pipeline, you can choose to use the built-in feature transformers already available in Amazon SageMaker AI. Or, you can implement your own transformation logic using just a few lines of scikit-learn or Spark code. 

[MLeap](https://combust.github.io/mleap-docs/), a serialization format and execution engine for machine learning pipelines, supports Spark, scikit-learn, and TensorFlow for training pipelines and exporting them to a serialized pipeline called an MLeap Bundle. You can deserialize Bundles back into Spark for batch-mode scoring or into the MLeap runtime to power real-time API services.

The containers in a pipeline listen on the port specified in the `SAGEMAKER_BIND_TO_PORT` environment variable (instead of 8080). When running in an inference pipeline, SageMaker AI automatically provides this environment variable to containers. If this environment variable isn't present, containers default to using port 8080. To indicate that your container complies with this requirement, use the following command to add a label to your Dockerfile:

```
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
```

If your container needs to listen on a second port, choose a port in the range specified by the `SAGEMAKER_SAFE_PORT_RANGE` environment variable. Specify the value as an inclusive range in the format **"XXXX-YYYY"**, where `XXXX` and `YYYY` are multi-digit integers. SageMaker AI provides this value automatically when you run the container in a multi-container pipeline.

**Note**  
To use custom Docker images in a pipeline that includes [SageMaker AI built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html), you need an [Amazon Elastic Container Registry (Amazon ECR) policy](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html). Your Amazon ECR repository must grant SageMaker AI permission to pull the image. For more information, see [Troubleshoot Amazon ECR Permissions for Inference Pipelines](inference-pipeline-troubleshoot.md#inference-pipeline-troubleshoot-permissions).

## Create and Deploy an Inference Pipeline Endpoint
Real-time Inference Pipeline Endpoint

The following code creates and deploys a real-time inference pipeline model with SparkML and XGBoost models in series using the SageMaker AI SDK.

```
from sagemaker.model import Model
from sagemaker.pipeline_model import PipelineModel
from sagemaker.sparkml.model import SparkMLModel

sparkml_data = 's3://{}/{}/{}'.format(s3_model_bucket, s3_model_key_prefix, 'model.tar.gz')
sparkml_model = SparkMLModel(model_data=sparkml_data)
xgb_model = Model(model_data=xgb_model.model_data, image=training_image)

model_name = 'serial-inference-' + timestamp_prefix
endpoint_name = 'serial-inference-ep-' + timestamp_prefix
sm_model = PipelineModel(name=model_name, role=role, models=[sparkml_model, xgb_model])
sm_model.deploy(initial_instance_count=1, instance_type='ml.c4.xlarge', endpoint_name=endpoint_name)
```

## Request Real-Time Inference from an Inference Pipeline Endpoint
Call an Inference Endpoint

The following example shows how to make real-time predictions by calling an inference endpoint and passing a request payload in JSON format:

```
import sagemaker
from sagemaker.predictor import json_serializer, json_deserializer, Predictor

payload = {
        "input": [
            {
                "name": "Pclass",
                "type": "float",
                "val": "1.0"
            },
            {
                "name": "Embarked",
                "type": "string",
                "val": "Q"
            },
            {
                "name": "Age",
                "type": "double",
                "val": "48.0"
            },
            {
                "name": "Fare",
                "type": "double",
                "val": "100.67"
            },
            {
                "name": "SibSp",
                "type": "double",
                "val": "1.0"
            },
            {
                "name": "Sex",
                "type": "string",
                "val": "male"
            }
        ],
        "output": {
            "name": "features",
            "type": "double",
            "struct": "vector"
        }
    }

predictor = Predictor(endpoint=endpoint_name, sagemaker_session=sagemaker.Session(), serializer=json_serializer,
                                content_type='text/csv', accept='application/json')

print(predictor.predict(payload))
```

The response you get from `predictor.predict(payload)` is the model's inference result.

## Realtime inference pipeline example


You can run this [example notebook using the SKLearn predictor](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/scikit_learn_randomforest/Sklearn_on_SageMaker_end2end.ipynb) that shows how to deploy an endpoint, run an inference request, then deserialize the response. Find this notebook and more examples in the [Amazon SageMaker example GitHub repository](https://github.com/awslabs/amazon-sagemaker-examples).

# Batch transforms with inference pipelines
Batch transforms

To get inferences on an entire dataset, you run a batch transform on a trained model. You can use the same inference pipeline model that you created and deployed to an endpoint for real-time processing in a batch transform job. When you run a batch transform job in a pipeline, the job downloads the input data from Amazon S3 and sends it in one or more HTTP requests to the inference pipeline model. For an example that shows how to prepare data for a batch transform, see "Section 2 - Preprocess the raw housing data using Scikit Learn" of the [Amazon SageMaker Multi-Model Endpoints using Linear Learner sample notebook](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/multi_model_linear_learner_home_value). For information about Amazon SageMaker AI batch transforms, see [Batch transform for inference with Amazon SageMaker AI](batch-transform.md).

**Note**  
To use custom Docker images in a pipeline that includes [Amazon SageMaker AI built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html), you need an [Amazon Elastic Container Registry (ECR) policy](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html). Your Amazon ECR repository must grant SageMaker AI permission to pull the image. For more information, see [Troubleshoot Amazon ECR Permissions for Inference Pipelines](inference-pipeline-troubleshoot.md#inference-pipeline-troubleshoot-permissions).

The following example shows how to run a transform job using the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable). In this example, `model_name` is the inference pipeline that combines SparkML and XGBoost models (created in previous examples). The Amazon S3 location specified by `input_data_path` contains the input data, in CSV format, to be downloaded and sent to the Spark ML model. After the transform job has finished, the Amazon S3 location specified by `output_data_path` contains the output data returned by the XGBoost model in CSV format.

```
import sagemaker
# CONTENT_TYPE_CSV ('text/csv') is available in SageMaker Python SDK v1:
from sagemaker.content_types import CONTENT_TYPE_CSV
input_data_path = 's3://{}/{}/{}'.format(default_bucket, 'key', 'file_name')
output_data_path = 's3://{}/{}'.format(default_bucket, 'key')
transform_job = sagemaker.transformer.Transformer(
    model_name = model_name,
    instance_count = 1,
    instance_type = 'ml.m4.xlarge',
    strategy = 'SingleRecord',
    assemble_with = 'Line',
    output_path = output_data_path,
    base_transform_job_name='inference-pipelines-batch',
    sagemaker_session=sagemaker.Session(),
    accept = CONTENT_TYPE_CSV)
transform_job.transform(data = input_data_path, 
                        content_type = CONTENT_TYPE_CSV, 
                        split_type = 'Line')
```

# Inference Pipeline Logs and Metrics
Logs and Metrics

Monitoring is important for maintaining the reliability, availability, and performance of Amazon SageMaker AI resources. To monitor and troubleshoot inference pipeline performance, use Amazon CloudWatch logs and error messages. For information about the monitoring tools that SageMaker AI provides, see [Monitoring AWS resources in Amazon SageMaker AI](monitoring-overview.md).

## Use Metrics to Monitor Multi-container Models
Metrics

To monitor the multi-container models in Inference Pipelines, use Amazon CloudWatch. CloudWatch collects raw data and processes it into readable, near real-time metrics. SageMaker AI training jobs and endpoints write CloudWatch metrics and logs in the `AWS/SageMaker` namespace. 

The following tables list the metrics and dimensions for the following:
+ Endpoint invocations
+ Training jobs, batch transform jobs, and endpoint instances

A *dimension* is a name/value pair that uniquely identifies a metric. You can assign up to 10 dimensions to a metric. For more information on monitoring with CloudWatch, see [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md). 

**Endpoint Invocation Metrics**

The `AWS/SageMaker` namespace includes the following request metrics from calls to [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_InvokeEndpoint.html).

Metrics are reported at 1-minute intervals.


| Metric | Description | 
| --- | --- | 
| Invocation4XXErrors |  The number of `InvokeEndpoint` requests that the model returned a `4xx` HTTP response code for. For each `4xx` response, SageMaker AI sends a `1`. Units: None Valid statistics: `Average`, `Sum`  | 
| Invocation5XXErrors |  The number of `InvokeEndpoint` requests that the model returned a `5xx` HTTP response code for. For each `5xx` response, SageMaker AI sends a `1`. Units: None Valid statistics: `Average`, `Sum`  | 
| Invocations |  The number of `InvokeEndpoint` requests sent to a model endpoint.  To get the total number of requests sent to a model endpoint, use the `Sum` statistic. Units: None Valid statistics: `Sum`, `Sample Count`  | 
| InvocationsPerInstance |  The number of endpoint invocations sent to a model, normalized by `InstanceCount` in each `ProductionVariant`. SageMaker AI sends 1/`numberOfInstances` as the value for each request, where `numberOfInstances` is the number of active instances for the ProductionVariant at the endpoint at the time of the request. Units: None Valid statistics: `Sum`  | 
| ModelLatency | The time the model or models took to respond. This includes the time it took to send the request, to fetch the response from the model container, and to complete the inference in the container. `ModelLatency` is the total time taken by all containers in an inference pipeline. Units: Microseconds Valid statistics: `Average`, `Sum`, `Min`, `Max`, `Sample Count` | 
| OverheadLatency |  The time added to the time taken to respond to a client request by SageMaker AI for overhead. `OverheadLatency` is measured from the time that SageMaker AI receives the request until it returns a response to the client, minus the `ModelLatency`. Overhead latency can vary depending on request and response payload sizes, request frequency, and authentication or authorization of the request, among other factors. Units: Microseconds Valid statistics: `Average`, `Sum`, `Min`, `Max`, `Sample Count`  | 
| ContainerLatency | The time it took for an Inference Pipelines container to respond as viewed from SageMaker AI. `ContainerLatency` includes the time it took to send the request, to fetch the response from the model's container, and to complete inference in the container. Units: Microseconds Valid statistics: `Average`, `Sum`, `Min`, `Max`, `Sample Count` | 

**Dimensions for Endpoint Invocation Metrics**


| Dimension | Description | 
| --- | --- | 
| EndpointName, VariantName, ContainerName |  Filters endpoint invocation metrics for a `ProductionVariant` at the specified endpoint and for the specified variant.  | 

For an inference pipeline endpoint, CloudWatch lists per-container latency metrics in your account as **Endpoint Container Metrics** and **Endpoint Variant Metrics** in the **SageMaker AI** namespace, as follows. The `ContainerLatency` metric appears only for inference pipelines.

![\[The CloudWatch dashboard for an inference pipeline.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/pipeline-endpoint-metrics.png)


For each endpoint and each container, latency metrics display names for the container, endpoint, variant, and metric.

![\[The latency metrics for an endpoint.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/pipeline-endpoint-metrics-details.png)


**Training Job, Batch Transform Job, and Endpoint Instance Metrics**

The namespaces `/aws/sagemaker/TrainingJobs`, `/aws/sagemaker/TransformJobs`, and `/aws/sagemaker/Endpoints` include the following metrics for training jobs and endpoint instances.

Metrics are reported at 1-minute intervals.


| Metric | Description | 
| --- | --- | 
| CPUUtilization |  The percentage of CPU units that are used by the containers running on an instance. The value ranges from 0% to 100%, and is multiplied by the number of CPUs. For example, if there are four CPUs, `CPUUtilization` can range from 0% to 400%. For training jobs, `CPUUtilization` is the CPU utilization of the algorithm container running on the instance. For batch transform jobs, `CPUUtilization` is the CPU utilization of the transform container running on the instance. For multi-container models, `CPUUtilization` is the sum of CPU utilization by all containers running on the instance. For endpoint variants, `CPUUtilization` is the sum of CPU utilization by all of the containers running on the instance. Units: Percent  | 
| MemoryUtilization | The percentage of memory that is used by the containers running on an instance. This value ranges from 0% to 100%. For training jobs, `MemoryUtilization` is the memory used by the algorithm container running on the instance. For batch transform jobs, `MemoryUtilization` is the memory used by the transform container running on the instance. For multi-container models, `MemoryUtilization` is the sum of memory used by all containers running on the instance. For endpoint variants, `MemoryUtilization` is the sum of memory used by all of the containers running on the instance. Units: Percent | 
| GPUUtilization |  The percentage of GPU units that are used by the containers running on an instance. `GPUUtilization` ranges from 0% to 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, `GPUUtilization` can range from 0% to 400%. For training jobs, `GPUUtilization` is the GPU used by the algorithm container running on the instance. For batch transform jobs, `GPUUtilization` is the GPU used by the transform container running on the instance. For multi-container models, `GPUUtilization` is the sum of GPU used by all containers running on the instance. For endpoint variants, `GPUUtilization` is the sum of GPU used by all of the containers running on the instance. Units: Percent  | 
| GPUMemoryUtilization |  The percentage of GPU memory used by the containers running on an instance. `GPUMemoryUtilization` ranges from 0% to 100% and is multiplied by the number of GPUs. For example, if there are four GPUs, `GPUMemoryUtilization` can range from 0% to 400%. For training jobs, `GPUMemoryUtilization` is the GPU memory used by the algorithm container running on the instance. For batch transform jobs, `GPUMemoryUtilization` is the GPU memory used by the transform container running on the instance. For multi-container models, `GPUMemoryUtilization` is the sum of GPU memory used by all containers running on the instance. For endpoint variants, `GPUMemoryUtilization` is the sum of the GPU memory used by all of the containers running on the instance. Units: Percent  | 
| DiskUtilization |  The percentage of disk space used by the containers running on an instance. DiskUtilization ranges from 0% to 100%. This metric is not supported for batch transform jobs. For training jobs, `DiskUtilization` is the disk space used by the algorithm container running on the instance. For endpoint variants, `DiskUtilization` is the sum of the disk space used by all of the provided containers running on the instance. Units: Percent  | 

**Dimensions for Training Job, Batch Transform Job, and Endpoint Instance Metrics**


| Dimension | Description | 
| --- | --- | 
| Host |  For training jobs, `Host` has the format `[training-job-name]/algo-[instance-number-in-cluster]`. Use this dimension to filter instance metrics for the specified training job and instance. This dimension format is present only in the `/aws/sagemaker/TrainingJobs` namespace. For batch transform jobs, `Host` has the format `[transform-job-name]/[instance-id]`. Use this dimension to filter instance metrics for the specified batch transform job and instance. This dimension format is present only in the `/aws/sagemaker/TransformJobs` namespace. For endpoints, `Host` has the format `[endpoint-name]/[ production-variant-name ]/[instance-id]`. Use this dimension to filter instance metrics for the specified endpoint, variant, and instance. This dimension format is present only in the `/aws/sagemaker/Endpoints` namespace.  | 
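
As a sketch (all names are placeholders), the `Host` dimension for an endpoint instance can be assembled from the format above and used to filter instance metrics in the `/aws/sagemaker/Endpoints` namespace:

```python
def endpoint_host_dimension(endpoint_name, variant_name, instance_id):
    """Build the Host dimension value for one endpoint instance."""
    return {
        'Name': 'Host',
        'Value': '{}/{}/{}'.format(endpoint_name, variant_name, instance_id),
    }

# Example use with CloudWatch (call commented out; cloudwatch is a boto3
# 'cloudwatch' client):
# cloudwatch.get_metric_statistics(
#     Namespace='/aws/sagemaker/Endpoints', MetricName='CPUUtilization',
#     Dimensions=[endpoint_host_dimension('my-endpoint', 'AllTraffic',
#                                         'i-0123456789abcdef0')],
#     StartTime=start, EndTime=end, Period=60, Statistics=['Average'])
print(endpoint_host_dimension('my-endpoint', 'AllTraffic', 'i-0123456789abcdef0'))
```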

To help you debug your training jobs, endpoints, and notebook instance lifecycle configurations, SageMaker AI also sends anything an algorithm container, a model container, or a notebook instance lifecycle configuration sends to `stdout` or `stderr` to Amazon CloudWatch Logs. You can use this information for debugging and to analyze progress.

## Use Logs to Monitor an Inference Pipeline
Logs

The following table lists the log groups and log streams that SageMaker AI sends to Amazon CloudWatch.

A *log stream* is a sequence of log events that share the same source. Each separate source of logs into CloudWatch makes up a separate log stream. A *log group* is a group of log streams that share the same retention, monitoring, and access control settings.

**Logs**

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker/latest/dg/inference-pipeline-logs-metrics.html)

**Note**  
SageMaker AI creates the `/aws/sagemaker/NotebookInstances` log group when you create a notebook instance with a lifecycle configuration. For more information, see [Customization of a SageMaker notebook instance using an LCC script](notebook-lifecycle-config.md).

For more information about SageMaker AI logging, see [CloudWatch Logs for Amazon SageMaker AI](logging-cloudwatch.md). 

# Troubleshoot Inference Pipelines
Troubleshooting

To troubleshoot inference pipeline issues, use CloudWatch logs and error messages. If you are using custom Docker images in a pipeline that includes Amazon SageMaker AI built-in algorithms, you might also encounter permissions problems. To grant the required permissions, create an Amazon Elastic Container Registry (Amazon ECR) policy.

**Topics**
+ [

## Troubleshoot Amazon ECR Permissions for Inference Pipelines
](#inference-pipeline-troubleshoot-permissions)
+ [

## Use CloudWatch Logs to Troubleshoot SageMaker AI Inference Pipelines
](#inference-pipeline-troubleshoot-logs)
+ [

## Use Error Messages to Troubleshoot Inference Pipelines
](#inference-pipeline-troubleshoot-errors)

## Troubleshoot Amazon ECR Permissions for Inference Pipelines
ECR Permissions

When you use custom Docker images in a pipeline that includes [SageMaker AI built-in algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-algo-docker-registry-paths.html), you need an [Amazon ECR policy](https://docs.aws.amazon.com/AmazonECR/latest/userguide/what-is-ecr.html) that allows SageMaker AI to pull the image from your Amazon ECR repository. The policy must grant the following permissions:

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "allowSageMakerToPull",
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": [
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "ecr:BatchCheckLayerAvailability"
            ],
            "Resource": "*"
        }
    ]
}
```

------

## Use CloudWatch Logs to Troubleshoot SageMaker AI Inference Pipelines
Logs

SageMaker AI publishes the container logs for endpoints that deploy an inference pipeline to Amazon CloudWatch at the following path for each container.

```
/aws/sagemaker/Endpoints/{EndpointName}/{Variant}/{InstanceId}/{ContainerHostname}
```

For example, an endpoint with the following values publishes its logs to the log group and streams that follow:

```
EndpointName: MyInferencePipelinesEndpoint
Variant: MyInferencePipelinesVariant
InstanceId: i-0179208609ff7e488
ContainerHostname: MyContainerName1 and MyContainerName2
```

```
logGroup: /aws/sagemaker/Endpoints/MyInferencePipelinesEndpoint
logStream: MyInferencePipelinesVariant/i-0179208609ff7e488/MyContainerName1
logStream: MyInferencePipelinesVariant/i-0179208609ff7e488/MyContainerName2
```

A *log stream* is a sequence of log events that share the same source. Each separate source of logs into CloudWatch makes up a separate log stream. A *log group* is a group of log streams that share the same retention, monitoring, and access control settings.

**To see the log groups and streams**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation page, choose **Logs**.

1. In **Log Groups**, filter on **MyInferencePipelinesEndpoint**:   
![\[The CloudWatch log groups filtered for the inference pipeline endpoint.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/pipeline-log-group-filter.png)

1. To see the log streams, on the CloudWatch **Log Groups** page, choose **MyInferencePipelinesEndpoint**, and then **Search Log Group**.  
![\[The CloudWatch log stream for the inference pipeline.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/pipeline-log-streams-2.png)

For a list of the logs that SageMaker AI publishes, see [Inference Pipeline Logs and Metrics](inference-pipeline-logs-metrics.md).

## Use Error Messages to Troubleshoot Inference Pipelines
Error Messages

The inference pipeline error messages indicate which containers failed. 

If an error occurs while SageMaker AI is invoking an endpoint, the service returns a `ModelError` (error code 424), which indicates which container failed. If the request payload (the response from the previous container) exceeds the limit of 5 MB, SageMaker AI provides a detailed error message, such as: 

```
Received response from MyContainerName1 with status code 200. However, the request payload from MyContainerName1 to MyContainerName2 is 6000000 bytes, which has exceeded the maximum limit of 5 MB.
```

If a container fails the ping health check while SageMaker AI is creating an endpoint, it returns a `ClientError` and indicates all of the containers that failed the ping check in the last health check.

# Delete Endpoints and Resources


Delete endpoints to stop incurring charges.

## Delete Endpoint


Delete your endpoint programmatically using AWS SDK for Python (Boto3), with the AWS CLI, or interactively using the SageMaker AI console.

SageMaker AI frees up all of the resources that were deployed when the endpoint was created. Deleting an endpoint will not delete the endpoint configuration or the SageMaker AI model. See [Delete Endpoint Configuration](#realtime-endpoints-delete-endpoint-config) and [Delete Model](#realtime-endpoints-delete-model) for information on how to delete your endpoint configuration and SageMaker AI model.

------
#### [ AWS SDK for Python (Boto3) ]

Use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpoint.html) API to delete your endpoint. Specify the name of your endpoint for the `EndpointName` field.

```
import boto3

# Specify your AWS Region
aws_region='<aws_region>'

# Specify the name of your endpoint
endpoint_name='<endpoint_name>'

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# Delete endpoint
sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
```

------
#### [ AWS CLI ]

Use the [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-endpoint.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-endpoint.html) command to delete your endpoint. Specify the name of your endpoint for the `endpoint-name` flag.

```
aws sagemaker delete-endpoint --endpoint-name <endpoint-name>
```

------
#### [ SageMaker AI Console ]

Delete your endpoint interactively with the SageMaker AI console.

1. In the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/), in the navigation menu, choose **Inference**.

1. Choose **Endpoints** from the dropdown menu. A list of the endpoints created in your AWS account appears, showing each endpoint's name, Amazon Resource Name (ARN), creation time, status, and the time stamp of when it was last updated.

1. Select the endpoint you want to delete.

1. Select the **Actions** dropdown button in the top right corner.

1. Choose **Delete**.

------

## Delete Endpoint Configuration


Delete your endpoint configuration programmatically using the AWS SDK for Python (Boto3), with the AWS CLI, or interactively using the SageMaker AI console. Deleting an endpoint configuration does not delete endpoints created using this configuration. See [Delete Endpoint](#realtime-endpoints-delete-endpoint) for information on how to delete your endpoint.

Do not delete an endpoint configuration in use by an endpoint that is live or while the endpoint is being updated or created. You might lose visibility into the instance type the endpoint is using if you delete the endpoint configuration of an endpoint that is active or being created or updated.

------
#### [ AWS SDK for Python (Boto3) ]

Use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpointConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpointConfig.html) API to delete your endpoint configuration. Specify the name of your endpoint configuration for the `EndpointConfigName` field.

```
import boto3

# Specify your AWS Region
aws_region='<aws_region>'

# Specify the name of your endpoint configuration
endpoint_config_name='<endpoint_config_name>'

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# Delete endpoint configuration
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
```

You can optionally use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API to look up the name of the endpoint configuration associated with a deployed endpoint. Provide the name of your endpoint for the `EndpointName` field.

```
# Specify the name of your endpoint
endpoint_name='<endpoint_name>'

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# Store the DescribeEndpoint response in a variable so that we can index it in the next step.
response = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)

# Retrieve the name of the endpoint configuration
endpoint_config_name = response['EndpointConfigName']

# Delete the endpoint configuration
sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
```

For more information about other response elements returned by `DescribeEndpoint`, see [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) in the [SageMaker API Reference guide](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Operations_Amazon_SageMaker_Service.html).

------
#### [ AWS CLI ]

Use the [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-endpoint-config.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-endpoint-config.html) command to delete your endpoint configuration. Specify the name of your endpoint configuration for the `endpoint-config-name` flag.

```
aws sagemaker delete-endpoint-config \
                        --endpoint-config-name <endpoint-config-name>
```

You can optionally use the [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-endpoint.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-endpoint.html) command to look up the name of the endpoint configuration associated with a deployed endpoint. Provide the name of your endpoint for the `endpoint-name` flag.

```
aws sagemaker describe-endpoint --endpoint-name <endpoint-name>
```

The command returns a JSON response. You can copy and paste it, use a JSON parser, or use a tool built for JSON parsing to obtain the endpoint configuration name (`EndpointConfigName`) associated with that endpoint.

------
#### [ SageMaker AI Console ]

Delete your endpoint configuration interactively with the SageMaker AI console.

1. In the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/), in the navigation menu, choose **Inference**.

1. Choose **Endpoint configurations** from the dropdown menu. A list of the endpoint configurations created in your AWS account appears, showing each configuration's name, Amazon Resource Name (ARN), and creation time.

1. Select the endpoint configuration you want to delete.

1. Select the **Actions** dropdown button in the top right corner.

1. Choose **Delete**.

------

## Delete Model


Delete your SageMaker AI model programmatically using the AWS SDK for Python (Boto3), with the AWS CLI, or interactively using the SageMaker AI console. Deleting a SageMaker AI model only deletes the model entry that was created in SageMaker AI. Deleting a model does not delete model artifacts, inference code, or the IAM role that you specified when creating the model.

------
#### [ AWS SDK for Python (Boto3) ]

Use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteModel.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteModel.html) API to delete your SageMaker AI model. Specify the name of your model for the `ModelName` field.

```
import boto3

# Specify your AWS Region
aws_region='<aws_region>'

# Specify the name of your model
model_name='<model_name>'

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# Delete model
sagemaker_client.delete_model(ModelName=model_name)
```

You can optionally use the [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpointConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpointConfig.html) API to return information about your deployed models (production variants), such as the name of the model associated with an endpoint configuration. Provide the name of your endpoint configuration for the `EndpointConfigName` field. 

```
# Specify the name of your endpoint configuration
endpoint_config_name='<endpoint_config_name>'

# Create a low-level SageMaker service client.
sagemaker_client = boto3.client('sagemaker', region_name=aws_region)

# Store the DescribeEndpointConfig response in a variable so that we can index it in the next step.
response = sagemaker_client.describe_endpoint_config(EndpointConfigName=endpoint_config_name)

# Retrieve the model name and delete the model
model_name = response['ProductionVariants'][0]['ModelName']
sagemaker_client.delete_model(ModelName=model_name)
```

For more information about other response elements returned by `DescribeEndpointConfig`, see [https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpointConfig.html](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpointConfig.html) in the [SageMaker API Reference guide](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_Operations_Amazon_SageMaker_Service.html).

------
#### [ AWS CLI ]

Use the [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-model.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/delete-model.html) command to delete your SageMaker AI model. Specify the name of your model for the `model-name` flag.

```
aws sagemaker delete-model \
                        --model-name <model-name>
```

You can optionally use the [https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-endpoint-config.html](https://docs.aws.amazon.com/cli/latest/reference/sagemaker/describe-endpoint-config.html) command to return information about your deployed models (production variants), such as the name of the model associated with an endpoint configuration. Provide the name of your endpoint configuration for the `endpoint-config-name` flag.

```
aws sagemaker describe-endpoint-config --endpoint-config-name <endpoint-config-name>
```

The command returns a JSON response. You can copy and paste it, use a JSON parser, or use a tool built for JSON parsing to obtain the name of the model associated with that endpoint configuration.

------
#### [ SageMaker AI Console ]

Delete your SageMaker AI model interactively with the SageMaker AI console.

1. In the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/), in the navigation menu, choose **Inference**.

1. Choose **Models** from the dropdown menu. A list of the models created in your AWS account appears, showing each model's name, Amazon Resource Name (ARN), and creation time.

1. Select the model you want to delete.

1. Select the **Actions** dropdown button in the top right corner.

1. Choose **Delete**.

------

# Automatic scaling of Amazon SageMaker AI models
Automatic scaling

Amazon SageMaker AI supports automatic scaling (auto scaling) for your hosted models. *Auto scaling* dynamically adjusts the number of instances provisioned for a model in response to changes in your workload. When the workload increases, auto scaling brings more instances online. When the workload decreases, auto scaling removes unnecessary instances so that you don't pay for provisioned instances that you aren't using.

**Topics**
+ [

# Auto scaling policy overview
](endpoint-auto-scaling-policy.md)
+ [

# Auto scaling prerequisites
](endpoint-auto-scaling-prerequisites.md)
+ [

# Configure model auto scaling with the console
](endpoint-auto-scaling-add-console.md)
+ [

# Register a model
](endpoint-auto-scaling-add-policy.md)
+ [

# Define a scaling policy
](endpoint-auto-scaling-add-code-define.md)
+ [

# Apply a scaling policy
](endpoint-auto-scaling-add-code-apply.md)
+ [

# Instructions for editing a scaling policy
](endpoint-auto-scaling-edit.md)
+ [

# Temporarily turn off scaling policies
](endpoint-auto-scaling-suspend-scaling-activities.md)
+ [

# Delete a scaling policy
](endpoint-auto-scaling-delete.md)
+ [

# Check the status of a scaling activity by describing scaling activities
](endpoint-scaling-query-history.md)
+ [

# Scale an endpoint to zero instances
](endpoint-auto-scaling-zero-instances.md)
+ [

# Load testing your auto scaling configuration
](endpoint-scaling-loadtest.md)
+ [

# Use CloudFormation to create a scaling policy
](endpoint-scaling-cloudformation.md)
+ [

# Update endpoints that use auto scaling
](endpoint-scaling-update.md)
+ [

# Delete endpoints configured for auto scaling
](endpoint-delete-with-scaling.md)

# Auto scaling policy overview


To use auto scaling, you define a scaling policy that adjusts the number of instances for your production variant in response to actual workloads.

To automatically scale as workload changes occur, you have two options: target tracking and step scaling policies. 

In most cases, we recommend using target tracking scaling policies. With target tracking, you choose an Amazon CloudWatch metric and a target value. Auto scaling creates and manages the CloudWatch alarms for the scaling policy and calculates the scaling adjustment based on the metric and the target value. The policy adds and removes instances as required to keep the metric at, or close to, the specified target value. For example, a scaling policy that uses the predefined `InvocationsPerInstance` metric with a target value of 70 can keep `InvocationsPerInstance` at, or close to, 70. For more information, see [Target tracking scaling policies](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-target-tracking.html) in the *Application Auto Scaling User Guide*.

You can use step scaling when you require an advanced configuration, such as specifying how many instances to deploy under what conditions. For example, you must use step scaling if you want to enable an endpoint to scale out from zero active instances. For an overview of step scaling policies and how they work, see [Step scaling policies](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html) in the *Application Auto Scaling User Guide*.

To create a target tracking scaling policy, you specify the following:
+ **Metric** — The CloudWatch metric to track, such as the average number of invocations per instance. 
+ **Target value** — The target value for the metric, such as 70 invocations per instance per minute.
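These two settings map directly onto the policy configuration JSON that the AWS CLI examples later in this guide pass in a `config.json` file. As a minimal sketch (assuming the predefined `SageMakerVariantInvocationsPerInstance` metric and a hypothetical target of 70 invocations per instance per minute):

```python
import json

# Sketch only: build a target tracking policy configuration from a
# metric and a target value, then save it for use with the AWS CLI.
policy_config = {
    "TargetValue": 70.0,  # hypothetical target value
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    },
}

# Write the file that later CLI examples reference as file://config.json.
with open("config.json", "w") as f:
    json.dump(policy_config, f, indent=4)
```

Generating the file programmatically like this keeps the target value in one place if you tune it after load testing.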

You can create target tracking scaling policies with either predefined metrics or custom metrics. A predefined metric is defined in an enumeration so that you can specify it by name in code or use it in the SageMaker AI console. Alternatively, you can use either the AWS CLI or the Application Auto Scaling API to apply a target tracking scaling policy based on a predefined or custom metric.

Note that scaling activities are performed with cooldown periods between them to prevent rapid fluctuations in capacity. You can optionally configure the cooldown periods for your scaling policy. 

For more information about the key concepts of auto scaling, see the following section.

## Schedule-based scaling


You can also create scheduled actions to perform scaling activities at specific times. You can create scheduled actions that scale one time only or that scale on a recurring schedule. After a scheduled action runs, your scaling policy can continue to make decisions about whether to scale dynamically as workload changes occur. Scheduled scaling can be managed only from the AWS CLI or the Application Auto Scaling API. For more information, see [Scheduled scaling](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-scheduled-scaling.html) in the *Application Auto Scaling User Guide*.

## Minimum and maximum scaling limits


When configuring auto scaling, you must specify your scaling limits before creating a scaling policy. You set limits separately for the minimum and maximum values.

The minimum value must be at least 1, and equal to or less than the value specified for the maximum value.

The maximum value must be equal to or greater than the value specified for the minimum value. SageMaker AI auto scaling does not enforce a limit for this value.

To determine the scaling limits that you need for typical traffic, test your auto scaling configuration with the expected rate of traffic to your model.

If a variant’s traffic becomes zero, SageMaker AI automatically scales in to the minimum number of instances specified. In this case, SageMaker AI emits metrics with a value of zero.

There are three options for specifying the minimum and maximum capacity:

1. Use the console to update the **Minimum instance count** and **Maximum instance count** settings.

1. Use the AWS CLI and include the `--min-capacity` and `--max-capacity` options when running the [register-scalable-target](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/register-scalable-target.html) command.

1. Call the [RegisterScalableTarget](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_RegisterScalableTarget.html) API and specify the `MinCapacity` and `MaxCapacity` parameters.

**Tip**  
You can manually scale out by increasing the minimum value, or manually scale in by decreasing the maximum value.

## Cooldown period


A *cooldown period* is used to protect against over-scaling when your model is scaling in (reducing capacity) or scaling out (increasing capacity). It does this by slowing down subsequent scaling activities until the period expires. Specifically, it blocks the deletion of instances for scale-in requests, and limits the creation of instances for scale-out requests. For more information, see [Define cooldown periods](https://docs.aws.amazon.com/autoscaling/application/userguide/target-tracking-scaling-policy-overview.html#target-tracking-cooldown) in the *Application Auto Scaling User Guide*. 

You configure the cooldown period in your scaling policy. 

If you don't specify a scale-in or a scale-out cooldown period, your scaling policy uses the default, which is 300 seconds for each.

If instances are being added or removed too quickly when you test your scaling configuration, consider increasing this value. You might see this behavior if the traffic to your model has a lot of spikes, or if you have multiple scaling policies defined for a variant.

If instances are not being added quickly enough to address increased traffic, consider decreasing this value.

## Related resources


For more information about configuring auto scaling, see the following resources:
+ [application-autoscaling](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling) section of the *AWS CLI Command Reference*
+ [Application Auto Scaling API Reference](https://docs.aws.amazon.com/autoscaling/application/APIReference/)
+ [Application Auto Scaling User Guide](https://docs.aws.amazon.com/autoscaling/application/userguide/)

**Note**  
SageMaker AI recently introduced new inference capabilities built on real-time inference endpoints. You create a SageMaker AI endpoint with an endpoint configuration that defines the instance type and initial instance count for the endpoint. Then, create an inference component, which is a SageMaker AI hosting object that you can use to deploy a model to an endpoint. For information about scaling inference components, see [SageMaker AI adds new inference capabilities to help reduce foundation model deployment costs and latency](https://aws.amazon.com/blogs/aws/amazon-sagemaker-adds-new-inference-capabilities-to-help-reduce-foundation-model-deployment-costs-and-latency/) and [Reduce model deployment costs by 50% on average using the latest features of SageMaker AI](https://aws.amazon.com/blogs/machine-learning/reduce-model-deployment-costs-by-50-on-average-using-sagemakers-latest-features/) on the AWS Blog.

# Auto scaling prerequisites

Before you can use auto scaling, you must have already created an Amazon SageMaker AI model endpoint. You can have multiple model versions for the same endpoint. Each model is referred to as a [production (model) variant](model-ab-testing.md). For more information about deploying a model endpoint, see [Deploy the Model to SageMaker AI Hosting Services](ex1-model-deployment.md#ex1-deploy-model).

To activate auto scaling for a model, you can use the SageMaker AI console, the AWS Command Line Interface (AWS CLI), or an AWS SDK through the Application Auto Scaling API. 
+ If this is your first time configuring scaling for a model, we recommend you [Configure model auto scaling with the console](endpoint-auto-scaling-add-console.md). 
+ When you use the AWS CLI or the Application Auto Scaling API, the flow is to register the model as a scalable target, define the scaling policy, and then apply it. You must specify both the endpoint name and the variant name to activate auto scaling for a model. To find the variant name, open the SageMaker AI console, choose **Endpoints** under **Inference** in the navigation pane, and then choose your endpoint.

Auto scaling is made possible by a combination of the Amazon SageMaker AI, Amazon CloudWatch, and Application Auto Scaling APIs. For information about the minimum required permissions, see [Application Auto Scaling identity-based policy examples](https://docs.aws.amazon.com/autoscaling/application/userguide/security_iam_id-based-policy-examples.html) in the *Application Auto Scaling User Guide*.

The `SagemakerFullAccessPolicy` IAM policy has all the IAM permissions required to perform auto scaling. For more information about SageMaker AI IAM permissions, see [How to use SageMaker AI execution roles](sagemaker-roles.md).

If you manage your own permission policy, you must include the following permissions:


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:DescribeEndpoint",
        "sagemaker:DescribeEndpointConfig",
        "sagemaker:UpdateEndpointWeightsAndCapacities"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "application-autoscaling:*"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:CreateServiceLinkedRole",
      "Resource": "arn:aws:iam::*:role/aws-service-role/sagemaker.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint",
      "Condition": {
        "StringLike": { "iam:AWSServiceName": "sagemaker.application-autoscaling.amazonaws.com" }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricAlarm",
        "cloudwatch:DescribeAlarms",
        "cloudwatch:DeleteAlarms"
      ],
      "Resource": "*"
    }
  ]
}
```


## Service-linked role


Auto scaling uses the `AWSServiceRoleForApplicationAutoScaling_SageMakerEndpoint` service-linked role. This service-linked role grants Application Auto Scaling permission to describe the alarms for your policies, to monitor current capacity levels, and to scale the target resource. This role is created for you automatically. For automatic role creation to succeed, you must have permission for the `iam:CreateServiceLinkedRole` action. For more information, see [Service-linked roles](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-service-linked-roles.html) in the *Application Auto Scaling User Guide*.

# Configure model auto scaling with the console


**To configure auto scaling for a model (console)**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the navigation pane, choose **Inference**, and then choose **Endpoints**. 

1. Choose your endpoint, and then for **Endpoint runtime settings**, choose the variant.

1. Choose **Configure auto scaling**.

1. On the **Configure variant automatic scaling** page, for **Variant automatic scaling**, do the following:

   1. For **Minimum instance count**, type the minimum number of instances that you want the scaling policy to maintain. At least 1 instance is required.

   1. For **Maximum instance count**, type the maximum number of instances that you want the scaling policy to maintain.

1. For **Built-in scaling policy**, do the following:

   1. For the **Target metric**, `SageMakerVariantInvocationsPerInstance` is automatically selected for the metric and cannot be changed.

   1. For the **Target value**, type the average number of invocations per instance per minute for the model. To determine this value, follow the guidelines in [Load testing](endpoint-scaling-loadtest.md).

   1. (Optional) For **Scale-in cool down (seconds)** and **Scale-out cool down (seconds)**, enter the amount of time, in seconds, for each cool down period.

   1. (Optional) Select **Disable scale in** if you don’t want auto scaling to terminate instances when traffic decreases.

1. Choose **Save**.

This procedure registers a model as a scalable target with Application Auto Scaling. When you register a model, Application Auto Scaling performs validation checks to ensure the following:
+ The model exists
+ The permissions are sufficient
+ You aren't registering a variant with an instance that is a burstable performance instance such as T2
**Note**  
SageMaker AI doesn't support auto scaling for burstable instances such as T2, because they already allow for increased capacity under increased workloads. For information about burstable performance instances, see [Amazon EC2 instance types](https://aws.amazon.com/ec2/instance-types/).

# Register a model


Before you add a scaling policy to your model, you first must register your model for auto scaling and define the scaling limits for the model.

The following procedures cover how to register a model (production variant) for auto scaling using the AWS Command Line Interface (AWS CLI) or Application Auto Scaling API.

**Topics**
+ [

## Register a model (AWS CLI)
](#endpoint-auto-scaling-add-cli)
+ [

## Register a model (Application Auto Scaling API)
](#endpoint-auto-scaling-add-api)

## Register a model (AWS CLI)


To register your production variant, use the [register-scalable-target](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/register-scalable-target.html) command with the following parameters:
+ `--service-namespace`—Set this value to `sagemaker`.
+ `--resource-id`—The resource identifier for the model (specifically, the production variant). For this parameter, the resource type is `endpoint` and the unique identifier is the name of the production variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `--scalable-dimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.
+ `--min-capacity`—The minimum number of instances. This value must be set to at least 1 and must be equal to or less than the value specified for `max-capacity`.
+ `--max-capacity`—The maximum number of instances. This value must be set to at least 1 and must be equal to or greater than the value specified for `min-capacity`.

**Example**  
The following example shows how to register a variant named `my-variant`, running on the `my-endpoint` endpoint, that can be dynamically scaled to have one to eight instances.  

```
aws application-autoscaling register-scalable-target \
  --service-namespace sagemaker \
  --resource-id endpoint/my-endpoint/variant/my-variant \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --min-capacity 1 \
  --max-capacity 8
```
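If you prefer the SDK for Python (Boto3), the same registration can be sketched as follows. The parameter names match the `RegisterScalableTarget` API action; the `boto3` call itself is left as a comment so the sketch runs without AWS credentials.

```python
import json

# Request parameters for RegisterScalableTarget, mirroring the CLI example above.
register_params = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,
    "MaxCapacity": 8,
}

# With credentials configured, you would send the request like this:
#   import boto3
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**register_params)
print(json.dumps(register_params, indent=4))
```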

## Register a model (Application Auto Scaling API)


To register your model with Application Auto Scaling, use the [RegisterScalableTarget](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_RegisterScalableTarget.html) Application Auto Scaling API action with the following parameters:
+ `ServiceNamespace`—Set this value to `sagemaker`.
+ `ResourceId`—The resource identifier for the production variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `ScalableDimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.
+ `MinCapacity`—The minimum number of instances. This value must be set to at least 1 and must be equal to or less than the value specified for `MaxCapacity`.
+ `MaxCapacity`—The maximum number of instances. This value must be set to at least 1 and must be equal to or greater than the value specified for `MinCapacity`.

**Example**  
The following example shows how to register a variant named `my-variant`, running on the `my-endpoint` endpoint, that can be dynamically scaled to use one to eight instances.  

```
POST / HTTP/1.1
Host: application-autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.RegisterScalableTarget
X-Amz-Date: 20230506T182145Z
User-Agent: aws-cli/2.0.0 Python/3.7.5 Windows/10 botocore/2.0.0dev4
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 1,
    "MaxCapacity": 8
}
```

# Define a scaling policy


Before you add a scaling policy to your model, save your policy configuration as a JSON block in a text file. You use that text file when you invoke the AWS Command Line Interface (AWS CLI) or the Application Auto Scaling API. Choosing an appropriate CloudWatch metric helps you optimize scaling. Before you use a custom metric in production, test auto scaling with that metric.

**Topics**
+ [

## Specify a predefined metric (CloudWatch metric: InvocationsPerInstance)
](#endpoint-auto-scaling-add-code-predefined)
+ [

## Specify a high-resolution predefined metric (CloudWatch metrics: ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy)
](#endpoint-auto-scaling-add-code-high-res)
+ [

## Define a custom metric (CloudWatch metric: CPUUtilization)
](#endpoint-auto-scaling-add-code-custom)
+ [

## Define a custom metric (CloudWatch metric: ExplanationsPerInstance)
](#endpoint-auto-scaling-online-explainability)
+ [

## Specify cooldown periods
](#endpoint-auto-scaling-add-code-cooldown)

This section shows you example policy configurations for target tracking scaling policies.

## Specify a predefined metric (CloudWatch metric: InvocationsPerInstance)


**Example**  
The following is an example target tracking policy configuration for a variant that keeps the average invocations per instance at 70. Save this configuration in a file named `config.json`.  

```
{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification":
    {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    }
}
```
For more information, see [TargetTrackingScalingPolicyConfiguration](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html) in the *Application Auto Scaling API Reference*.

## Specify a high-resolution predefined metric (CloudWatch metrics: ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy)


With the following high-resolution CloudWatch metrics, you can set scaling policies for the volume of concurrent requests that your models receive:

**ConcurrentRequestsPerModel**  
The number of concurrent requests being received by a model container.

**ConcurrentRequestsPerCopy**  
The number of concurrent requests being received by an inference component.

These metrics track the number of simultaneous requests that your model containers handle, including the requests that are queued inside the containers. For models that send their inference response as a stream of tokens, these metrics track each request until the model sends the last token for the request.

As high-resolution metrics, they emit data more frequently than standard CloudWatch metrics. Standard metrics, such as the `InvocationsPerInstance` metric, emit data once every minute. However, these high-resolution metrics emit data every 10 seconds. Therefore, as the concurrent traffic to your models increases, your policy reacts by scaling out much more quickly than it would for standard metrics. However, as the traffic to your models decreases, your policy scales in at the same speed as it would for standard metrics.

The following is an example target tracking policy configuration that adds instances if the number of concurrent requests per model exceeds 5. Save this configuration in a file named `config.json`.

```
{
    "TargetValue": 5.0,
    "PredefinedMetricSpecification":
    {
        "PredefinedMetricType": "SageMakerVariantConcurrentRequestsPerModelHighResolution"
    }
}
```

If you use inference components to deploy multiple models to the same endpoint, you can create an equivalent policy. In that case, set `PredefinedMetricType` to `SageMakerInferenceComponentConcurrentRequestsPerCopyHighResolution`.

For more information, see [TargetTrackingScalingPolicyConfiguration](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html) in the *Application Auto Scaling API Reference*.

## Define a custom metric (CloudWatch metric: CPUUtilization)


To create a target tracking scaling policy with a custom metric, specify the metric's name, namespace, unit, statistic, and zero or more dimensions. A dimension consists of a dimension name and a dimension value. You can use any production variant metric that changes in proportion to capacity. 

**Example**  
The following example configuration shows a target tracking scaling policy with a custom metric. The policy scales the variant based on an average CPU utilization of 50 percent across all instances. Save this configuration in a file named `config.json`.  

```
{
    "TargetValue": 50.0,
    "CustomizedMetricSpecification":
    {
        "MetricName": "CPUUtilization",
        "Namespace": "/aws/sagemaker/Endpoints",
        "Dimensions": [
            {"Name": "EndpointName", "Value": "my-endpoint" },
            {"Name": "VariantName","Value": "my-variant"}
        ],
        "Statistic": "Average",
        "Unit": "Percent"
    }
}
```
For more information, see [CustomizedMetricSpecification](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_CustomizedMetricSpecification.html) in the *Application Auto Scaling API Reference*. 

## Define a custom metric (CloudWatch metric: ExplanationsPerInstance)


When the endpoint has online explainability activated, it emits an `ExplanationsPerInstance` metric that reports the average number of records explained per minute, per instance, for a variant. The resource utilization of explaining records can differ significantly from that of predicting records. We strongly recommend using this metric for target tracking scaling of endpoints with online explainability activated.

You can create multiple target tracking policies for a scalable target. Consider adding the `InvocationsPerInstance` policy from the [Specify a predefined metric (CloudWatch metric: InvocationsPerInstance)](#endpoint-auto-scaling-add-code-predefined) section (in addition to the `ExplanationsPerInstance` policy). If most invocations don't return an explanation because of the threshold value set in the `EnableExplanations` parameter, then the endpoint can choose the `InvocationsPerInstance` policy. If there is a large number of explanations, the endpoint can use the `ExplanationsPerInstance` policy. 

**Example**  
The following example configuration shows a target tracking scaling policy with a custom metric. The policy adjusts the number of variant instances so that each instance has an `ExplanationsPerInstance` metric of 20. Save this configuration in a file named `config.json`.  

```
{
    "TargetValue": 20.0,
    "CustomizedMetricSpecification":
    {
        "MetricName": "ExplanationsPerInstance",
        "Namespace": "AWS/SageMaker",
        "Dimensions": [
            {"Name": "EndpointName", "Value": "my-endpoint" },
            {"Name": "VariantName","Value": "my-variant"}
        ],
        "Statistic": "Sum"
    }
}
```

For more information, see [CustomizedMetricSpecification](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_CustomizedMetricSpecification.html) in the *Application Auto Scaling API Reference*. 

## Specify cooldown periods


You can optionally define cooldown periods in your target tracking scaling policy by specifying the `ScaleOutCooldown` and `ScaleInCooldown` parameters. 

**Example**  
The following is an example target tracking policy configuration for a variant that keeps the average invocations per instance at 70. The policy configuration provides a scale-in cooldown period of 10 minutes (600 seconds) and a scale-out cooldown period of 5 minutes (300 seconds). Save this configuration in a file named `config.json`.   

```
{
    "TargetValue": 70.0,
    "PredefinedMetricSpecification":
    {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    },
    "ScaleInCooldown": 600,
    "ScaleOutCooldown": 300
}
```
For more information, see [TargetTrackingScalingPolicyConfiguration](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_TargetTrackingScalingPolicyConfiguration.html) in the *Application Auto Scaling API Reference*. 

# Apply a scaling policy


After you register your model and define a scaling policy, apply the scaling policy to the registered model. This section shows how to apply a scaling policy using the AWS Command Line Interface (AWS CLI) or the Application Auto Scaling API. 

**Topics**
+ [

## Apply a target tracking scaling policy (AWS CLI)
](#endpoint-auto-scaling-add-code-apply-cli)
+ [

## Apply a scaling policy (Application Auto Scaling API)
](#endpoint-auto-scaling-add-code-apply-api)

## Apply a target tracking scaling policy (AWS CLI)


To apply a scaling policy to your model, use the [put-scaling-policy](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/put-scaling-policy.html) AWS CLI command with the following parameters:
+ `--policy-name`—The name of the scaling policy.
+ `--policy-type`—Set this value to `TargetTrackingScaling`.
+ `--resource-id`—The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `--service-namespace`—Set this value to `sagemaker`.
+ `--scalable-dimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.
+ `--target-tracking-scaling-policy-configuration`—The target-tracking scaling policy configuration to use for the model.

**Example**  
The following example applies a target tracking scaling policy named `my-scaling-policy` to a variant named `my-variant`, running on the `my-endpoint` endpoint. For the `--target-tracking-scaling-policy-configuration` option, specify the `config.json` file that you created previously.   

```
aws application-autoscaling put-scaling-policy \
  --policy-name my-scaling-policy \
  --policy-type TargetTrackingScaling \
  --resource-id endpoint/my-endpoint/variant/my-variant \
  --service-namespace sagemaker \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --target-tracking-scaling-policy-configuration file://config.json
```
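An equivalent sketch with the SDK for Python (Boto3) is shown below. The parameters mirror the `PutScalingPolicy` API action; the `boto3` call is left as a comment so the snippet runs without credentials.

```python
import json

# Parameters for PutScalingPolicy, mirroring the CLI example above.
policy_params = {
    "PolicyName": "my-scaling-policy",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
}

# With credentials configured:
#   import boto3
#   boto3.client("application-autoscaling").put_scaling_policy(**policy_params)
print(json.dumps(policy_params, indent=4))
```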

## Apply a scaling policy (Application Auto Scaling API)


To apply a scaling policy to a variant with the Application Auto Scaling API, use the [PutScalingPolicy](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_PutScalingPolicy.html) Application Auto Scaling API action with the following parameters:
+ `PolicyName`—The name of the scaling policy.
+ `ServiceNamespace`—Set this value to `sagemaker`.
+ `ResourceId`—The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `ScalableDimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.
+ `PolicyType`—Set this value to `TargetTrackingScaling`.
+ `TargetTrackingScalingPolicyConfiguration`—The target-tracking scaling policy configuration to use for the variant.

**Example**  
The following example applies a target tracking scaling policy named `my-scaling-policy` to a variant named `my-variant`, running on the `my-endpoint` endpoint. The policy configuration keeps the average invocations per instance at 70.  

```
POST / HTTP/1.1
Host: application-autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.PutScalingPolicy
X-Amz-Date: 20230506T182145Z
User-Agent: aws-cli/2.0.0 Python/3.7.5 Windows/10 botocore/2.0.0dev4
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "PolicyName": "my-scaling-policy",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 70.0,
        "PredefinedMetricSpecification":
        {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        }
    }
}
```

# Instructions for editing a scaling policy


After creating a scaling policy, you can edit any of its settings except the name.

To edit a target tracking scaling policy with the AWS Management Console, use the same procedure that you used to [Configure model auto scaling with the console](endpoint-auto-scaling-add-console.md).

You can use the AWS CLI or the Application Auto Scaling API to edit a scaling policy in the same way that you create a new scaling policy. For more information, see [Apply a scaling policy](endpoint-auto-scaling-add-code-apply.md).

# Temporarily turn off scaling policies


After you configure auto scaling, you have the following options if you need to investigate an issue without interference from scaling policies (dynamic scaling):
+ Temporarily suspend and then resume scaling activities by calling the [register-scalable-target](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/register-scalable-target.html) CLI command or [RegisterScalableTarget](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_RegisterScalableTarget.html) API action, specifying a Boolean value for both `DynamicScalingInSuspended` and `DynamicScalingOutSuspended`.   
**Example**  

  The following example shows how to suspend scaling policies for a variant named `my-variant`, running on the `my-endpoint` endpoint.

  ```
  aws application-autoscaling register-scalable-target \
    --service-namespace sagemaker \
    --resource-id endpoint/my-endpoint/variant/my-variant \
    --scalable-dimension sagemaker:variant:DesiredInstanceCount \
    --suspended-state '{"DynamicScalingInSuspended":true,"DynamicScalingOutSuspended":true}'
  ```
+ Prevent specific target tracking scaling policies from scaling in your variant by disabling the policy's scale-in portion. This method prevents the scaling policy from deleting instances, while still allowing it to create them as needed.

  Temporarily disable and then enable scale-in activities by editing the policy using the [put-scaling-policy](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/put-scaling-policy.html) CLI command or the [PutScalingPolicy](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_PutScalingPolicy.html) API action, specifying a Boolean value for `DisableScaleIn`.  
**Example**  

  The following is an example of a target tracking configuration for a scaling policy that will scale out but not scale in. 

  ```
  {
      "TargetValue": 70.0,
      "PredefinedMetricSpecification":
      {
          "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
      },
      "DisableScaleIn": true
  }
  ```
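The suspend option above can also be sketched with the SDK for Python (Boto3) by passing the `SuspendedState` parameter to `RegisterScalableTarget`. The call is left as a comment so the snippet runs without credentials.

```python
# Parameters that suspend both scale-in and scale-out for the variant.
suspend_params = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "SuspendedState": {
        "DynamicScalingInSuspended": True,
        "DynamicScalingOutSuspended": True,
    },
}

# With credentials configured:
#   import boto3
#   boto3.client("application-autoscaling").register_scalable_target(**suspend_params)
```

To resume scaling, send the same request with both flags set to `False`.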

# Delete a scaling policy


If you no longer need a scaling policy, you can delete it at any time.

**Topics**
+ [

## Delete all scaling policies and deregister the model (console)
](#endpoint-auto-scaling-delete-console)
+ [

## Delete a scaling policy (AWS CLI or Application Auto Scaling API)
](#endpoint-auto-scaling-delete-code)

## Delete all scaling policies and deregister the model (console)


**To delete all scaling policies and deregister the variant as a scalable target**

1. Open the Amazon SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. On the navigation pane, choose **Endpoints**.

1. Choose your endpoint, and then for **Endpoint runtime settings**, choose the variant.

1. Choose **Configure auto scaling**.

1. Choose **Deregister auto scaling**.

## Delete a scaling policy (AWS CLI or Application Auto Scaling API)


You can use the AWS CLI or the Application Auto Scaling API to delete a scaling policy from a variant.

### Delete a scaling policy (AWS CLI)


To delete a scaling policy from a variant, use the [delete-scaling-policy](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/delete-scaling-policy.html) command with the following parameters:
+ `--policy-name`—The name of the scaling policy.
+ `--resource-id`—The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `--service-namespace`—Set this value to `sagemaker`.
+ `--scalable-dimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.

**Example**  
The following example deletes a target tracking scaling policy named `my-scaling-policy` from a variant named `my-variant`, running on the `my-endpoint` endpoint.  

```
aws application-autoscaling delete-scaling-policy \
  --policy-name my-scaling-policy \
  --resource-id endpoint/my-endpoint/variant/my-variant \
  --service-namespace sagemaker \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount
```

### Delete a scaling policy (Application Auto Scaling API)


To delete a scaling policy from your variant, use the [DeleteScalingPolicy](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_DeleteScalingPolicy.html) Application Auto Scaling API action with the following parameters:
+ `PolicyName`—The name of the scaling policy.
+ `ServiceNamespace`—Set this value to `sagemaker`.
+ `ResourceID`—The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example, `endpoint/my-endpoint/variant/my-variant`.
+ `ScalableDimension`—Set this value to `sagemaker:variant:DesiredInstanceCount`.

**Example**  
The following example deletes a target tracking scaling policy named `my-scaling-policy` from a variant named `my-variant`, running on the `my-endpoint` endpoint.  

```
POST / HTTP/1.1
Host: application-autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.DeleteScalingPolicy
X-Amz-Date: 20230506T182145Z
User-Agent: aws-cli/2.0.0 Python/3.7.5 Windows/10 botocore/2.0.0dev4
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "PolicyName": "my-scaling-policy",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount"
}
```

# Check the status of a scaling activity by describing scaling activities

You can check the status of a scaling activity for your auto scaled endpoint by describing scaling activities. Application Auto Scaling provides descriptive information about the scaling activities in the specified namespace from the previous six weeks. For more information, see [Scaling activities for Application Auto Scaling](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-scaling-activities.html) in the *Application Auto Scaling User Guide*.

To check the status of a scaling activity, use the [describe-scaling-activities](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/describe-scaling-activities.html) command. You can't check the status of a scaling activity using the console.

**Topics**
+ [Describe scaling activities (AWS CLI)](#endpoint-how-to)
+ [Identify blocked scaling activities from instance quotas (AWS CLI)](#endpoint-identify-blocked-autoscaling)

## Describe scaling activities (AWS CLI)


To describe scaling activities for all SageMaker AI resources that are registered with Application Auto Scaling, use the [describe-scaling-activities](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/describe-scaling-activities.html) command, specifying `sagemaker` for the `--service-namespace` option.

```
aws application-autoscaling describe-scaling-activities \
  --service-namespace sagemaker
```

To describe scaling activities for a specific resource, include the `--resource-id` option. 

```
aws application-autoscaling describe-scaling-activities \
  --service-namespace sagemaker \
  --resource-id endpoint/my-endpoint/variant/my-variant
```

The following example shows one of the scaling activities contained in the output of this command.

```
{
    "ActivityId": "activity-id",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "Description": "string",
    "Cause": "string",
    "StartTime": timestamp,
    "EndTime": timestamp,
    "StatusCode": "string",
    "StatusMessage": "string"
}
```

## Identify blocked scaling activities from instance quotas (AWS CLI)


When you scale out (add more instances), you might reach your account-level instance quota. You can use the [describe-scaling-activities](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/describe-scaling-activities.html) command to check whether you have reached your instance quota. If a scaling activity would exceed this quota, auto scaling is blocked.

To check if you have reached your instance quota, use the [describe-scaling-activities](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/describe-scaling-activities.html) command and specify the resource ID for the `--resource-id` option. 

```
aws application-autoscaling describe-scaling-activities \
    --service-namespace sagemaker \
    --resource-id endpoint/my-endpoint/variant/my-variant
```

Within the return syntax, check the [StatusCode](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_ScalingActivity.html#autoscaling-Type-ScalingActivity-StatusCode) and [StatusMessage](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_ScalingActivity.html#autoscaling-Type-ScalingActivity-StatusMessage) keys and their associated values. If the quota was reached, `StatusCode` is `Failed`, and `StatusMessage` contains a message indicating that the account-level service quota was reached. The following is an example of what that message might look like:

```
{
    "ActivityId": "activity-id",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/my-endpoint/variant/my-variant",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "Description": "string",
    "Cause": "minimum capacity was set to 110",
    "StartTime": timestamp,
    "EndTime": timestamp,
    "StatusCode": "Failed",
    "StatusMessage": "Failed to set desired instance count to 110. Reason: The 
    account-level service limit 'ml.xx.xxxxxx for endpoint usage' is 1000 
    Instances, with current utilization of 997 Instances and a request delta 
    of 20 Instances. Please contact AWS support to request an increase for this 
    limit. (Service: AmazonSageMaker; Status Code: 400; 
    Error Code: ResourceLimitExceeded; Request ID: request-id)."
}
```

# Scale an endpoint to zero instances


When you set up auto scaling for an endpoint, you can allow the scale-in process to reduce the number of in-service instances to zero. By doing so, you save costs during periods when your endpoint isn't serving inference requests and therefore doesn't require any active instances. 

However, after scaling in to zero instances, your endpoint can't respond to any incoming inference requests until it provisions at least one instance. To automate the provisioning process, you create a step scaling policy with Application Auto Scaling. Then, you assign the policy to an Amazon CloudWatch alarm.

After you set up the step scaling policy and the alarm, your endpoint will automatically provision an instance soon after it receives an inference request that it can't respond to. Be aware that the provisioning process takes several minutes. During that time, any attempts to invoke the endpoint will produce an error.

The following procedures explain how to set up auto scaling for an endpoint so that it scales in to, and out from, zero instances. The procedures use commands with the AWS CLI.

**Before you begin**

Before your endpoint can scale in to, and out from, zero instances, it must meet the following requirements:
+ It is in service.
+ It hosts one or more inference components. An endpoint can scale to and from zero instances only if it hosts inference components.

  For information about hosting inference components on SageMaker AI endpoints, see [Deploy models for real-time inference](realtime-endpoints-deploy-models.md).
+ In the endpoint configuration, for the production variant `ManagedInstanceScaling` object, you've set the `MinInstanceCount` parameter to `0`.

  For reference information about this parameter, see [ProductionVariantManagedInstanceScaling](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariantManagedInstanceScaling.html).

**To enable an endpoint to scale in to zero instances (AWS CLI)**

For each inference component that the endpoint hosts, do the following:

1. Register the inference component as a scalable target. When you register it, set the minimum capacity to `0`, as shown by the following command:

   ```
   aws application-autoscaling register-scalable-target \
     --service-namespace sagemaker \
     --resource-id inference-component/inference-component-name \
     --scalable-dimension sagemaker:inference-component:DesiredCopyCount \
     --min-capacity 0 \
     --max-capacity n
   ```

   In this example, replace *inference-component-name* with the name of your inference component. Replace *n* with the maximum number of inference component copies to provision when scaling out.

   For more information about this command and each of its parameters, see [register-scalable-target](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/register-scalable-target.html) in the *AWS CLI Command Reference*.

1. Apply a target tracking policy to the inference component, as shown by the following command:

   ```
   aws application-autoscaling put-scaling-policy \
     --policy-name my-scaling-policy \
     --policy-type TargetTrackingScaling \
     --resource-id inference-component/inference-component-name \
     --service-namespace sagemaker \
     --scalable-dimension sagemaker:inference-component:DesiredCopyCount \
     --target-tracking-scaling-policy-configuration file://config.json
   ```

   In this example, replace *inference-component-name* with the name of your inference component.

   In the example, the `config.json` file contains a target tracking policy configuration, such as the following:

   ```
   {
     "PredefinedMetricSpecification": {
         "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
     },
     "TargetValue": 1,
     "ScaleInCooldown": 300,
     "ScaleOutCooldown": 300
   }
   ```

   For more example tracking policy configurations, see [Define a scaling policy](endpoint-auto-scaling-add-code-define.md).

   For more information about this command and each of its parameters, see [put-scaling-policy](https://docs.aws.amazon.com/cli/latest/reference/application-autoscaling/put-scaling-policy.html) in the *AWS CLI Command Reference*.

**To enable an endpoint to scale out from zero instances (AWS CLI)**

For each inference component that the endpoint hosts, do the following:

1. Apply a step scaling policy to the inference component, as shown by the following command:

   ```
   aws application-autoscaling put-scaling-policy \
     --policy-name my-scaling-policy \
     --policy-type StepScaling \
     --resource-id inference-component/inference-component-name \
     --service-namespace sagemaker \
     --scalable-dimension sagemaker:inference-component:DesiredCopyCount \
     --step-scaling-policy-configuration file://config.json
   ```

   In this example, replace *my-scaling-policy* with a unique name for your policy. Replace *inference-component-name* with the name of your inference component.

   In the example, the `config.json` file contains a step scaling policy configuration, such as the following:

   ```
   {
       "AdjustmentType": "ChangeInCapacity",
       "MetricAggregationType": "Maximum",
       "Cooldown": 60,
       "StepAdjustments":
         [
            {
              "MetricIntervalLowerBound": 0,
              "ScalingAdjustment": 1
            }
         ]
   }
   ```

   When this step scaling policy is triggered, SageMaker AI provisions the necessary instances to support the inference component copies.

   After you create the step scaling policy, take note of its Amazon Resource Name (ARN). You need the ARN for the CloudWatch alarm in the next step.

   For more information about step scaling polices, see [Step scaling policies](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html) in the *Application Auto Scaling User Guide*.

1. Create a CloudWatch alarm and assign the step scaling policy to it, as shown by the following example:

   ```
   aws cloudwatch put-metric-alarm \
   --alarm-actions step-scaling-policy-arn \
   --alarm-description "Alarm when SM IC endpoint invoked that has 0 instances." \
   --alarm-name ic-step-scaling-alarm \
   --comparison-operator GreaterThanThreshold  \
   --datapoints-to-alarm 1 \
   --dimensions "Name=InferenceComponentName,Value=inference-component-name" \
   --evaluation-periods 1 \
   --metric-name NoCapacityInvocationFailures \
   --namespace AWS/SageMaker \
   --period 60 \
   --statistic Sum \
   --threshold 1
   ```

   In this example, replace *step-scaling-policy-arn* with the ARN of your step scaling policy. Replace *ic-step-scaling-alarm* with a name of your choice. Replace *inference-component-name* with the name of your inference component. 

   This example sets the `--metric-name` parameter to `NoCapacityInvocationFailures`. SageMaker AI emits this metric when an endpoint receives an inference request, but the endpoint has no active instances to serve the request. When that event occurs, the alarm initiates the step scaling policy in the previous step.

   For more information about this command and each of its parameters, see [put-metric-alarm](https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/put-metric-alarm.html) in the *AWS CLI Command Reference*.
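Under the hood, the alarm and the step policy cooperate: when `NoCapacityInvocationFailures` breaches the alarm threshold, the policy's `ChangeInCapacity` adjustment raises the copy count, and SageMaker AI provisions an instance to host the new copy. The following is a hedged, pure-Python sketch of that decision (the helper and its names are illustrative, not a SageMaker AI API):

```python
# Mirrors the single step in the config.json shown earlier: any breach
# of the alarm threshold adds one inference component copy.
STEP_CONFIG = {
    "AdjustmentType": "ChangeInCapacity",
    "StepAdjustments": [
        {"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1}
    ],
}

def desired_copies(current_copies, no_capacity_failures, threshold=1):
    """Return the new copy count after evaluating the alarm
    (NoCapacityInvocationFailures, GreaterThanThreshold) against the
    step policy's ChangeInCapacity adjustment."""
    if no_capacity_failures <= threshold:
        return current_copies  # alarm not breached; no scaling action
    adjustment = STEP_CONFIG["StepAdjustments"][0]["ScalingAdjustment"]
    return current_copies + adjustment

# Endpoint scaled to zero, then a burst of invocations fails for lack
# of capacity: the policy requests the first copy.
print(desired_copies(0, 2))  # prints 1
```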

# Load testing your auto scaling configuration

Perform load tests to choose a scaling configuration that works the way you want.

The following guidelines for load testing assume you are using a scaling policy that uses the predefined target metric `SageMakerVariantInvocationsPerInstance`.

**Topics**
+ [Determine the performance characteristics](#endpoint-scaling-loadtest-variant)
+ [Calculate the target load](#endpoint-scaling-loadtest-calc)

## Determine the performance characteristics


Perform load testing to find the peak `InvocationsPerInstance` that your model's production variant can handle, and the latency of requests, as concurrency increases.

This value depends on the instance type chosen, payloads that clients of your model typically send, and the performance of any external dependencies your model has.

**To find the peak requests-per-second (RPS) your model's production variant can handle and latency of requests**

1. Set up an endpoint with your model using a single instance. For information about how to set up an endpoint, see [Deploy the Model to SageMaker AI Hosting Services](ex1-model-deployment.md#ex1-deploy-model).

1. Use a load testing tool to generate an increasing number of parallel requests, and monitor the RPS and model latency in the output of the load testing tool. 
**Note**  
You can also monitor requests-per-minute instead of RPS. In that case, don't multiply by 60 in the equation to calculate `SageMakerVariantInvocationsPerInstance` shown below.

   The point at which model latency increases or the proportion of successful transactions decreases indicates the peak RPS that your model can handle.
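As an illustrative sketch of that analysis (the helper and the sample measurements are hypothetical), you can treat the peak as the highest measured RPS whose latency still met your target:

```python
def peak_rps(measurements, max_latency_ms):
    """measurements: list of (rps, latency_ms) pairs taken as the load
    test ramps up concurrency. Return the highest RPS observed while
    latency stayed within the target, or 0 if none did."""
    acceptable = [rps for rps, latency in measurements
                  if latency <= max_latency_ms]
    return max(acceptable, default=0)

# Example ramp: latency degrades sharply past 180 RPS.
ramp = [(50, 20), (100, 22), (150, 25), (180, 40), (220, 250)]
print(peak_rps(ramp, max_latency_ms=100))  # prints 180
```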

## Calculate the target load


After you find the performance characteristics of the variant, you can determine the maximum RPS that you should allow to be sent to an instance. The threshold used for scaling must be less than this maximum value. Use the following equation in combination with load testing to determine the correct value for the `SageMakerVariantInvocationsPerInstance` target metric in your scaling configuration.

```
SageMakerVariantInvocationsPerInstance = (MAX_RPS * SAFETY_FACTOR) * 60
```

Where `MAX_RPS` is the maximum RPS that you determined previously, and `SAFETY_FACTOR` is the safety factor that you chose to ensure that your clients don't exceed the maximum RPS. Multiply by 60 to convert from RPS to invocations-per-minute to match the per-minute CloudWatch metric that SageMaker AI uses to implement auto scaling (you don't need to do this if you measured requests-per-minute instead of requests-per-second).

**Note**  
SageMaker AI recommends that you start testing with a `SAFETY_FACTOR` of 0.5. Test your scaling configuration to ensure it operates in the way you expect with your model for both increasing and decreasing customer traffic on your endpoint.
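The equation translates directly into code. A minimal sketch (the 0.5 default reflects the suggested starting safety factor from the note above):

```python
def invocations_per_instance(max_rps, safety_factor=0.5):
    """Convert the measured peak RPS into the per-minute
    SageMakerVariantInvocationsPerInstance target value."""
    return max_rps * safety_factor * 60

# A variant that peaked at 200 RPS in load testing:
print(invocations_per_instance(200))  # prints 6000.0
```

If your load testing tool reports requests-per-minute, drop the `* 60` conversion.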

# Use CloudFormation to create a scaling policy


The following example shows how to configure model auto scaling on an endpoint using CloudFormation.

```
  Endpoint:
    Type: "AWS::SageMaker::Endpoint"
    Properties:
      EndpointName: yourEndpointName
      EndpointConfigName: yourEndpointConfigName

  ScalingTarget:
    Type: "AWS::ApplicationAutoScaling::ScalableTarget"
    Properties:
      MaxCapacity: 10
      MinCapacity: 2
      ResourceId: endpoint/yourEndpointName/variant/yourVariantName
      RoleARN: arn
      ScalableDimension: sagemaker:variant:DesiredInstanceCount
      ServiceNamespace: sagemaker

  ScalingPolicy:
    Type: "AWS::ApplicationAutoScaling::ScalingPolicy"
    Properties:
      PolicyName: my-scaling-policy
      PolicyType: TargetTrackingScaling
      ScalingTargetId:
        Ref: ScalingTarget
      TargetTrackingScalingPolicyConfiguration:
        TargetValue: 70.0
        ScaleInCooldown: 600
        ScaleOutCooldown: 30
        PredefinedMetricSpecification:
          PredefinedMetricType: SageMakerVariantInvocationsPerInstance
```

For more information, see [Create Application Auto Scaling resources with AWS CloudFormation](https://docs.aws.amazon.com/autoscaling/application/userguide/creating-resources-with-cloudformation.html) in the *Application Auto Scaling User Guide*.

# Update endpoints that use auto scaling


When you update an endpoint, Application Auto Scaling checks to see whether any of the models on that endpoint are targets for auto scaling. If the update would change the instance type for any model that is a target for auto scaling, the update fails. 

In the AWS Management Console, you see a warning that you must deregister the model from auto scaling before you can update it. If you are trying to update the endpoint by calling the [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API, the call fails. Before you update the endpoint, delete any scaling policies configured for it and deregister the variant as a scalable target by calling the [DeregisterScalableTarget](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_DeregisterScalableTarget.html) Application Auto Scaling API action. After you update the endpoint, you can register the updated variant as a scalable target and attach a scaling policy.

There is one exception. If you change the model for a variant that is configured for auto scaling, Amazon SageMaker AI auto scaling allows the update. This is because changing the model doesn't typically affect performance enough to change scaling behavior. If you do update a model for a variant configured for auto scaling, ensure that the change to the model doesn't significantly affect performance and scaling behavior.

When you update SageMaker AI endpoints that have auto scaling applied, complete the following steps:

**To update an endpoint that has auto scaling applied**

1. Deregister the variant as a scalable target by calling [DeregisterScalableTarget](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_DeregisterScalableTarget.html).

1. Because auto scaling is blocked while the update operation is in progress (or if you turned off auto scaling in the previous step), you might want to take the additional precaution of increasing the number of instances for your endpoint during the update. To do this, update the instance counts for the production variants hosted at the endpoint by calling [UpdateEndpointWeightsAndCapacities](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpointWeightsAndCapacities.html).

1. Call [DescribeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) repeatedly until the value of the `EndpointStatus` field of the response is `InService`.

1. Call [DescribeEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpointConfig.html) to get the values of the current endpoint config.

1. Create a new endpoint config by calling [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html). For the production variants where you want to keep the existing instance count or weight, use the same variant name from the response to the [DescribeEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpointConfig.html) call in the previous step. For all other values, also use the values from that response.

1. Update the endpoint by calling [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html). Specify the endpoint config that you created in the previous step as the `EndpointConfig` field. If you want to retain variant properties such as instance count or weight, set the `RetainAllVariantProperties` parameter to `True`. This specifies that production variants with the same name are updated with the most recent `DesiredInstanceCount` from the response to the `DescribeEndpoint` call, regardless of the values of the `InitialInstanceCount` field in the new `EndpointConfig`.

1. (Optional) Re-activate auto scaling by calling [RegisterScalableTarget](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_RegisterScalableTarget.html) and [PutScalingPolicy](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_PutScalingPolicy.html).

**Note**  
Steps 1 and 7 are required only if you are updating an endpoint with the following changes:  
+ Changing the instance type for a production variant that has auto scaling configured.
+ Removing a production variant that has auto scaling configured.

# Delete endpoints configured for auto scaling


If you delete an endpoint, Application Auto Scaling checks to see whether any of the models on that endpoint are targets for auto scaling. If any are and you have permission to deregister the model, Application Auto Scaling deregisters those models as scalable targets without notifying you. If you use a custom permission policy that doesn't provide permission for the [DeregisterScalableTarget](https://docs.aws.amazon.com/autoscaling/application/APIReference/API_DeregisterScalableTarget.html) action, you must request access to this action before deleting the endpoint.

**Note**  
As an IAM user, you might not have sufficient permission to delete an endpoint if another user configured auto scaling for a variant on that endpoint.

# Instance storage volumes


When you create an endpoint, Amazon SageMaker AI attaches an Amazon Elastic Block Store (Amazon EBS) storage volume to the Amazon EC2 instances that host the endpoint. The size of the storage volume is scalable, and storage options are divided into two categories: SSD-backed storage and HDD-backed storage.

For more information about Amazon EBS storage and features, see the following pages:
+ [Amazon EBS Features](https://aws.amazon.com/ebs/features/)
+ [Amazon EBS User Guide](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonEBS.html)

For a full list of the host instance storage volumes, see the [Host Instance Storage Volumes Table](https://aws.amazon.com/releasenotes/host-instance-storage-volumes-table/).

**Note**  
Amazon SageMaker AI attaches an Amazon Elastic Block Store (Amazon EBS) storage volume to Amazon EC2 instances only when you create [Asynchronous inference](async-inference.md) or [Real-time inference](realtime-endpoints.md) endpoint types. For more information on customizing Amazon EBS storage volumes, see [SageMaker AI endpoint parameters for large model inference](large-model-inference-hosting.md).

# Validation of models in production


 With SageMaker AI, you can test multiple models or model versions behind the same endpoint using variants. A variant consists of an ML instance and the serving components specified in a SageMaker AI model. You can have multiple variants behind an endpoint. Each variant can have a different instance type or a SageMaker AI model that can be autoscaled independently of the others. The models within the variants can be trained using different datasets, different algorithms, different ML frameworks, or any combination of all of these. All the variants behind an endpoint share the same inference code. SageMaker AI supports two types of variants, production variants and shadow variants. 

 If you have multiple production variants behind an endpoint, then you can allocate a portion of your inference requests to each variant. Each request is routed to only one of the production variants. The production variant to which the request was routed provides the response to the caller. You can compare how the production variants perform relative to each other. 

 You can also have a shadow variant corresponding to a production variant behind an endpoint. A portion of the inference requests that goes to the production variant is replicated to the shadow variant. The responses of the shadow variant are logged for comparison and not returned to the caller. This lets you test the performance of the shadow variant without exposing the caller to the response produced by the shadow variant. 

**Topics**
+ [Testing models with production variants](model-ab-testing.md)
+ [Testing models with shadow variants](model-shadow-deployment.md)

# Testing models with production variants


 In production ML workflows, data scientists and engineers frequently try to improve performance using various methods, such as [Automatic model tuning with SageMaker AI](automatic-model-tuning.md), training on additional or more recent data, improving feature selection, and using updated instances and serving containers. You can use production variants to compare your models, instances, and containers, and choose the best-performing candidate to respond to inference requests. 

 With SageMaker AI multi-variant endpoints, you can distribute endpoint invocation requests across multiple production variants by providing the traffic distribution for each variant, or you can invoke a specific variant directly for each request. In this topic, we look at both methods for testing ML models. 

**Topics**
+ [Test models by specifying traffic distribution](#model-testing-traffic-distribution)
+ [Test models by invoking specific variants](#model-testing-target-variant)
+ [Model A/B test example](#model-ab-test-example)

## Test models by specifying traffic distribution


 To test multiple models by distributing traffic between them, specify the percentage of traffic that gets routed to each model by setting the weight for each production variant in the endpoint configuration. For more information, see [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html). The following diagram shows how this works in more detail. 

![\[Example showing how distributing traffic between models using InvokeEndpoint works in SageMaker AI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-traffic-distribution.png)


## Test models by invoking specific variants


 To test multiple models by invoking specific models for each request, specify the specific version of the model you want to invoke by providing a value for the `TargetVariant` parameter when you call [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html). SageMaker AI ensures that the request is processed by the production variant you specify. If you have already provided traffic distribution and specify a value for the `TargetVariant` parameter, the targeted routing overrides the random traffic distribution. The following diagram shows how this works in more detail. 

![\[Example showing how invoking specific models for each request using InvokeEndpoint works in SageMaker AI.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-target-variant.png)
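The routing rule described above can be sketched as follows (a pure-Python illustration, not the service's implementation): a request with `TargetVariant` set goes to that variant; otherwise, it is assigned randomly in proportion to the variant weights.

```python
import random

def route_request(variants, target_variant=None, rng=random):
    """variants: dict mapping variant name -> traffic weight.
    Targeted routing overrides the random traffic distribution."""
    if target_variant is not None:
        return target_variant
    names = list(variants)
    weights = [variants[n] for n in names]
    # Weighted random choice approximates the traffic distribution.
    return rng.choices(names, weights=weights, k=1)[0]

weights = {"Variant1": 1, "Variant2": 1}
print(route_request(weights, target_variant="Variant2"))  # prints Variant2
```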


## Model A/B test example


 Performing A/B testing between a new model and an old model with production traffic can be an effective final step in the validation process for a new model. In A/B testing, you test different variants of your models and compare how each variant performs. If the newer version of the model delivers better performance than the previously existing version, replace the old version of the model with the new version in production. 

 The following example shows how to perform A/B model testing. For a sample notebook that implements this example, see [A/B Testing ML models in production](https://sagemaker-examples.readthedocs.io/en/latest/sagemaker_endpoints/a_b_testing/a_b_testing.html). 

### Step 1: Create and deploy models


 First, we define where our models are located in Amazon S3. These locations are used when we deploy our models in subsequent steps: 

```
model_url = f"s3://{path_to_model_1}"
model_url2 = f"s3://{path_to_model_2}"
```

 Next, we create the model objects with the image and model data. These model objects are used to deploy production variants on an endpoint. The models are developed by training ML models on different data sets, different algorithms or ML frameworks, and different hyperparameters: 

```
import boto3
from datetime import datetime

from sagemaker.amazon.amazon_estimator import get_image_uri

model_name = f"DEMO-xgb-churn-pred-{datetime.now():%Y-%m-%d-%H-%M-%S}"
model_name2 = f"DEMO-xgb-churn-pred2-{datetime.now():%Y-%m-%d-%H-%M-%S}"
image_uri = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-1')
image_uri2 = get_image_uri(boto3.Session().region_name, 'xgboost', '0.90-2')

sm_session.create_model(
    name=model_name,
    role=role,
    container_defs={
        'Image': image_uri,
        'ModelDataUrl': model_url
    }
)

sm_session.create_model(
    name=model_name2,
    role=role,
    container_defs={
        'Image': image_uri2,
        'ModelDataUrl': model_url2
    }
)
```

 We now create two production variants, each with its own model and resource requirements (instance type and count). This also enables you to test models on different instance types. 

 We set an initial weight of 1 for both variants. Because the sum of weights across both variants is 2 and each variant has a weight assignment of 1, each variant receives 1/2, or 50%, of the total traffic: 50% of requests go to `Variant1`, and the remaining 50% go to `Variant2`. 
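A variant's share of traffic is its weight divided by the sum of all variant weights. As a sanity check, a small hypothetical helper (not part of the SageMaker SDK) makes this arithmetic explicit:

```python
def traffic_fraction(weights):
    """Map each variant name to the fraction of traffic it receives.

    A variant's share is its weight divided by the sum of all weights.
    """
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Equal weights split traffic 50/50:
print(traffic_fraction({"Variant1": 1, "Variant2": 1}))  # {'Variant1': 0.5, 'Variant2': 0.5}
```

The same function predicts the 75/25 split used later: weights of 75 and 25 yield fractions of 0.75 and 0.25.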

```
from sagemaker.session import production_variant

variant1 = production_variant(
               model_name=model_name,
               instance_type="ml.m5.xlarge",
               initial_instance_count=1,
               variant_name='Variant1',
               initial_weight=1,
           )

variant2 = production_variant(
               model_name=model_name2,
               instance_type="ml.m5.xlarge",
               initial_instance_count=1,
               variant_name='Variant2',
               initial_weight=1,
           )
```

 Finally, we’re ready to deploy these production variants on a SageMaker AI endpoint. 

```
endpoint_name = f"DEMO-xgb-churn-pred-{datetime.now():%Y-%m-%d-%H-%M-%S}"
print(f"EndpointName={endpoint_name}")

sm_session.endpoint_from_production_variants(
    name=endpoint_name,
    production_variants=[variant1, variant2]
)
```

### Step 2: Invoke the deployed models


 Now we send requests to this endpoint to get inferences in real time. We use both traffic distribution and direct targeting. 

 First, we use traffic distribution that we configured in the previous step. Each inference response contains the name of the production variant that processes the request, so we can see that traffic to the two production variants is roughly equal. 

```
# get a subset of test data for a quick test
!tail -120 test_data/test-dataset-input-cols.csv > test_data/test_sample_tail_input_cols.csv
print(f"Sending test traffic to the endpoint {endpoint_name}. \nPlease wait...")

with open('test_data/test_sample_tail_input_cols.csv', 'r') as f:
    for row in f:
        print(".", end="", flush=True)
        payload = row.rstrip('\n')
        sm_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="text/csv",
            Body=payload
        )
        time.sleep(0.5)

print("Done!")
```

 SageMaker AI emits metrics such as `Latency` and `Invocations` for each variant in Amazon CloudWatch. For a complete list of metrics that SageMaker AI emits, see [Amazon SageMaker AI metrics in Amazon CloudWatch](monitoring-cloudwatch.md). Let’s query CloudWatch to get the number of invocations per variant, to show how invocations are split across variants by default: 

![\[Example CloudWatch number of invocations per variant.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-variant-invocations.png)
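The per-variant invocation counts shown above can also be pulled programmatically. The sketch below builds the `GetMetricStatistics` parameters for the per-variant `Invocations` metric; the commented lines show how it might be used with a boto3 CloudWatch client (the endpoint and variant names are assumptions from this example):

```python
from datetime import datetime, timedelta, timezone

def invocation_metric_query(endpoint_name, variant_name, minutes=60):
    """Build GetMetricStatistics parameters for a variant's Invocations metric."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocations",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 60,           # one datapoint per minute
        "Statistics": ["Sum"],
    }

# cloudwatch = boto3.client("cloudwatch")
# for variant in ("Variant1", "Variant2"):
#     stats = cloudwatch.get_metric_statistics(**invocation_metric_query(endpoint_name, variant))
#     print(variant, sum(point["Sum"] for point in stats["Datapoints"]))
```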


 Now let's invoke a specific version of the model by specifying `Variant1` as the `TargetVariant` in the call to `invoke_endpoint`. 

```
print(f"Sending test traffic to the endpoint {endpoint_name}. \nPlease wait...")
with open('test_data/test_sample_tail_input_cols.csv', 'r') as f:
    for row in f:
        print(".", end="", flush=True)
        payload = row.rstrip('\n')
        sm_runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="text/csv",
            Body=payload,
            TargetVariant="Variant1"
        ) 
        time.sleep(0.5)
```

 To confirm that all new invocations were processed by `Variant1`, we can query CloudWatch to get the number of invocations per variant. We see that for the most recent invocations (latest timestamp), all requests were processed by `Variant1`, as we had specified. There were no invocations made for `Variant2`. 

![\[Example CloudWatch number of invocations for each variant.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-invocations-target1.png)


### Step 3: Evaluate model performance


 To see which model version performs better, let's evaluate the accuracy, precision, recall, F1 score, and receiver operating characteristic/area under the curve (ROC/AUC) for each variant. First, let's look at these metrics for `Variant1`: 

![\[Example receiver operating characteristic curve for Variant1.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-curve.png)


Now let's look at the metrics for `Variant2`:

![\[Example receiver operating characteristic curve for Variant2.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model2-curve.png)


 For most of our defined metrics, `Variant2` is performing better, so this is the one that we want to use in production. 
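The comparison above relies on standard classification metrics. As a sketch of how they could be computed from each variant's captured predictions (pure Python; `sklearn.metrics` offers equivalent functions):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall) if precision + recall else 0.0,
    }
```

Running this over the labeled test set for each variant produces the metrics to compare; ROC/AUC additionally requires the raw probability scores rather than the hard labels.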

### Step 4: Increase traffic to the best model


 Now that we have determined that `Variant2` performs better than `Variant1`, we shift more traffic to it. We can continue to use `TargetVariant` to invoke a specific model variant, but a simpler approach is to update the weights assigned to each variant by calling [UpdateEndpointWeightsAndCapacities](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpointWeightsAndCapacities.html). This changes the traffic distribution to your production variants without requiring updates to your endpoint. Recall from the setup section that we set variant weights to split traffic 50/50. The CloudWatch metrics for the total invocations for each variant below show us the invocation patterns for each variant: 

![\[Example CloudWatch metrics for the total invocations for each variant.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-invocations-even-dist.png)


 Now we shift 75% of the traffic to `Variant2` by assigning new weights to each variant using `UpdateEndpointWeightsAndCapacities`. SageMaker AI now sends 75% of the inference requests to `Variant2` and the remaining 25% of requests to `Variant1`. 

```
sm.update_endpoint_weights_and_capacities(
    EndpointName=endpoint_name,
    DesiredWeightsAndCapacities=[
        {
            "DesiredWeight": 25,
            "VariantName": variant1["VariantName"]
        },
        {
            "DesiredWeight": 75,
            "VariantName": variant2["VariantName"]
        }
    ]
)
```

 The CloudWatch metrics for total invocations for each variant shows us higher invocations for `Variant2` than for `Variant1`: 

![\[Example CloudWatch metrics for total invocations for each variant.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-invocations-75-25.png)


 We can continue to monitor our metrics, and when we're satisfied with a variant's performance, we can route 100% of the traffic to that variant. We use [UpdateEndpointWeightsAndCapacities](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpointWeightsAndCapacities.html) to update the traffic assignments for the variants. The weight for `Variant1` is set to 0 and the weight for `Variant2` is set to 1. SageMaker AI now sends 100% of all inference requests to `Variant2`. 

```
sm.update_endpoint_weights_and_capacities(
    EndpointName=endpoint_name,
    DesiredWeightsAndCapacities=[
        {
            "DesiredWeight": 0,
            "VariantName": variant1["VariantName"]
        },
        {
            "DesiredWeight": 1,
            "VariantName": variant2["VariantName"]
        }
    ]
)
```

 The CloudWatch metrics for the total invocations for each variant show that all inference requests are being processed by `Variant2` and there are no inference requests processed by `Variant1`. 

![\[Example CloudWatch metrics for the total invocations for each variant.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/model-invocations-best-model.png)


 You can now safely update your endpoint and delete `Variant1` from your endpoint. You can also continue testing new models in production by adding new variants to your endpoint and following steps 2 - 4. 
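Removing a variant is done by pointing the endpoint at a new endpoint configuration that omits it. The following sketch shows the shape of that update; the helper name is hypothetical, `sm` is a boto3 SageMaker client, and `variant2` is the production-variant definition from step 1:

```python
def promote_variant(sm, endpoint_name, winning_variant, config_name):
    """Create a config containing only the winning variant and update the endpoint.

    Pointing the endpoint at the new config removes the other variants
    without taking the endpoint offline.
    """
    sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[winning_variant],
    )
    sm.update_endpoint(EndpointName=endpoint_name, EndpointConfigName=config_name)

# promote_variant(sm, endpoint_name, variant2, f"{endpoint_name}-variant2-only")
```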

# Testing models with shadow variants


 You can use SageMaker AI Model Shadow Deployments to create long-running shadow variants to validate any new candidate component of your model serving stack before promoting it to production. The following diagram shows how shadow variants work in more detail. 

![\[Details of a shadow variant.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/juxtaposer/shadow-variant.png)


## Deploy shadow variants


 The following code example shows how you can programmatically deploy shadow variants. Replace the *user placeholder text* in the example with your own information. 

1.  Create two SageMaker AI models: one for your production variant, and one for your shadow variant. 

   ```
   import boto3
   from sagemaker import get_execution_role, Session
                   
   aws_region = "aws-region"
   
   boto_session = boto3.Session(region_name=aws_region)
   sagemaker_client = boto_session.client("sagemaker")
   
   role = get_execution_role()
   
   bucket = Session(boto_session).default_bucket()
   
   model_name1 = "name-of-your-first-model"
   model_name2 = "name-of-your-second-model"
   
   sagemaker_client.create_model(
       ModelName = model_name1,
       ExecutionRoleArn = role,
       Containers=[
           {
               "Image": "ecr-image-uri-for-first-model",
               "ModelDataUrl": "s3-location-of-trained-first-model" 
           }
       ]
   )
   
   sagemaker_client.create_model(
       ModelName = model_name2,
       ExecutionRoleArn = role,
       Containers=[
           {
               "Image": "ecr-image-uri-for-second-model",
               "ModelDataUrl": "s3-location-of-trained-second-model" 
           }
       ]
   )
   ```

1.  Create an endpoint configuration. Specify both your production and shadow variants in the configuration. 

   ```
   endpoint_config_name = "name-of-your-endpoint-config"
   
   create_endpoint_config_response = sagemaker_client.create_endpoint_config(
       EndpointConfigName=endpoint_config_name,
       ProductionVariants=[
           {
               "VariantName": "name-of-your-production-variant",
               "ModelName": model_name1,
               "InstanceType": "ml.m5.xlarge",
               "InitialInstanceCount": 1,
               "InitialVariantWeight": 1,
           }
       ],
       ShadowProductionVariants=[
           {
               "VariantName": "name-of-your-shadow-variant",
               "ModelName": model_name2,
               "InstanceType": "ml.m5.xlarge",
               "InitialInstanceCount": 1,
               "InitialVariantWeight": 1,
           }
      ]
   )
   ```

1. Create an endpoint.

   ```
   create_endpoint_response = sagemaker_client.create_endpoint(
       EndpointName="name-of-your-endpoint",
       EndpointConfigName=endpoint_config_name,
   )
   ```

# Online explainability with SageMaker Clarify

This guide shows how to configure online explainability with SageMaker Clarify. With SageMaker AI [real-time inference](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) endpoints, you can analyze explainability in real time, continuously. The online explainability function fits into the **Deploy to production** part of the [Amazon SageMaker AI Machine Learning workflow](https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-mlconcepts.html).

## How Clarify Online Explainability Works

The following graphic depicts SageMaker AI architecture for hosting an endpoint that serves explainability requests. It depicts interactions between an endpoint, the model container, and the SageMaker Clarify explainer.

![\[SageMaker AI architecture showing hosting an endpoint that serves on-demand explainability requests.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/clarify/DeveloperGuideArchitecture.png)


Here's how Clarify online explainability works. The application sends a REST-style `InvokeEndpoint` request to the SageMaker AI Runtime Service. The service routes this request to a SageMaker AI endpoint to obtain predictions and explanations. Then, the service receives the response from the endpoint. Lastly, the service sends the response back to the application.

To increase the endpoint availability, SageMaker AI automatically attempts to distribute endpoint instances in multiple Availability Zones, according to the instance count in the endpoint configuration. On an endpoint instance, upon a new explainability request, the SageMaker Clarify explainer calls the model container for predictions. Then it computes and returns the feature attributions.

Here are the four steps to create an endpoint that uses SageMaker Clarify online explainability:

1. Check if your pre-trained SageMaker AI model is compatible with online explainability by following the [pre-check](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-online-explainability-precheck.html) steps.

1. [Create an endpoint configuration](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) with the SageMaker Clarify explainer configuration using the `CreateEndpointConfig` API.

1. [Create an endpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) and provide the endpoint configuration to SageMaker AI using the `CreateEndpoint` API. The service launches the ML compute instance and deploys the model as specified in the configuration.

1. [Invoke the endpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html): After the endpoint is in service, call the SageMaker AI Runtime API `InvokeEndpoint` to send requests to the endpoint. The endpoint then returns explanations and predictions.

# Pre-check the model container


This section shows how to pre-check the model container inputs and outputs for compatibility before configuring an endpoint. The SageMaker Clarify explainer is **model agnostic**, but it has requirements for model container input and output.

**Note**  
You can increase efficiency by configuring your container to support batch requests, which contain two or more records in a single request. For example, a single record is a single line of CSV data or a single line of JSON Lines data. SageMaker Clarify attempts to send a mini-batch of records to the model container first before falling back to single-record requests.

## Model container input


------
#### [ CSV ]

The model container supports input in CSV with MIME type `text/csv`. The following table shows example inputs that SageMaker Clarify supports.


| Model container input (string representation) | Comments | 
| --- | --- | 
|  '1,2,3,4'  |  Single record that uses four numerical features.  | 
|  '1,2,3,4\n5,6,7,8'  |  Two records, separated by the line break '\n'.  | 
|  '"This is a good product",5'  |  Single record that contains a text feature and a numerical feature.  | 
|  '"This is a good product",5\n"Bad shopping experience",1'  |  Two records.  | 

------
#### [ JSON Lines ]

SageMaker AI also supports input in [JSON Lines dense format](https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html#cm-jsonlines) with MIME type `application/jsonlines`, as shown in the following table.


| Model container input | Comments | 
| --- | --- | 
|  '{"data":{"features":[1,2,3,4]}}'  |  Single record; a list of features can be extracted by the JMESPath expression `data.features`.  | 
|  '{"data":{"features":[1,2,3,4]}}\n{"data":{"features":[5,6,7,8]}}'  |  Two records.  | 
|  '{"features":["This is a good product",5]}'  |  Single record; a list of features can be extracted by the JMESPath expression `features`.  | 
|  '{"features":["This is a good product",5]}\n{"features":["Bad shopping experience",1]}'  |  Two records.  | 

------

## Model container output


Your model container output should also be in either CSV or JSON Lines dense format. Additionally, the model container output should include the probabilities of the input records, which SageMaker Clarify uses to compute feature attributions.

The following data examples are for model container outputs in **CSV format**.

------
#### [ Probability only ]

For regression and binary classification problems, the model container outputs a single probability value (score) of the predicted label, which can be extracted using column index 0. For multi-class problems, the model container outputs a list of probabilities (scores); if no index is provided, all values are extracted.


| Model container input | Model container output (string representation) | 
| --- | --- | 
|  Single record  |  '0.6'  | 
|  Two records (results in one line)  |  '0.6,0.3'  | 
|  Two records (results in two lines)  |  '0.6\n0.3'  | 
|  Single record of a multi-class model (three classes)  |  '0.1,0.6,0.3'  | 
|  Two records of a multi-class model (three classes)  |  '0.1,0.6,0.3\n0.2,0.5,0.3'  | 

------
#### [ Predicted label and probabilities ]

The model container outputs the predicted label followed by its probability in **CSV** format. The probabilities can be extracted using index `1`.


| Model container input | Model container output | 
| --- | --- | 
|  Single record  |  '1,0.6'  | 
|  Two records  |  '1,0.6\n0,0.3'  | 

------
#### [ Predicted labels header and probabilities ]

A multi-class model container trained by Autopilot can be configured to output **the string representation** of the list of predicted labels and probabilities in **CSV** format. In the following example, the probabilities can be extracted using index `1`, and the label headers can be extracted using index `0`.


| Model container input | Model container output | 
| --- | --- | 
|  Single record  |  '"['cat','dog','fish']","[0.1,0.6,0.3]"'  | 
|  Two records  |  '"['cat','dog','fish']","[0.1,0.6,0.3]"\n"['cat','dog','fish']","[0.2,0.5,0.3]"'  | 

------
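The CSV output layouts above can be parsed with a small generic helper. The following is a hypothetical illustration, not part of any SageMaker SDK:

```python
def extract_probabilities(csv_body, index=None):
    """Parse a CSV model-container response into per-record probabilities.

    index=None returns every value in the record (multi-class output);
    an integer index selects a single column (e.g. 1 when the output is
    'predicted_label,probability').
    """
    records = []
    for line in csv_body.strip().split("\n"):
        values = [float(v) for v in line.split(",")]
        records.append(values if index is None else values[index])
    return records

print(extract_probabilities("0.6\n0.3", index=0))         # [0.6, 0.3]
print(extract_probabilities("0.1,0.6,0.3\n0.2,0.5,0.3"))  # [[0.1, 0.6, 0.3], [0.2, 0.5, 0.3]]
```

Note that the quoted label-header layout would need CSV-aware parsing (for example, `csv.reader`) rather than a plain comma split.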

The following data examples are for model container outputs in **JSON Lines** format.

------
#### [ Probability only ]

In this example, the model container outputs the probability, which can be extracted by the [JMESPath](https://jmespath.org/) expression `score`, in **JSON Lines** format.


| Model container input | Model container output | 
| --- | --- | 
|  Single record  |  '{"score":0.6}'  | 
|  Two records  |  '{"score":0.6}\n{"score":0.3}'  | 

------
#### [ Predicted label and probabilities ]

In this example, the model container outputs each record's predicted label followed by its probability in **JSON Lines** format. The probability can be extracted by the `JMESPath` expression `probability`.


| Model container input | Model container output | 
| --- | --- | 
|  Single record  |  '{"predicted_label":1,"probability":0.6}'  | 
|  Two records  |  '{"predicted_label":1,"probability":0.6}\n{"predicted_label":0,"probability":0.3}'  | 

------
#### [ Predicted labels header and probabilities ]

In this example, a multi-class model container outputs a list of label headers and probabilities in **JSON Lines** format. The probabilities can be extracted by the `JMESPath` expression `probabilities`, and the label headers can be extracted by the `JMESPath` expression `predicted_labels`.


| Model container input | Model container output | 
| --- | --- | 
|  Single record  |  '{"predicted_labels":["cat","dog","fish"],"probabilities":[0.1,0.6,0.3]}'  | 
|  Two records  |  '{"predicted_labels":["cat","dog","fish"],"probabilities":[0.1,0.6,0.3]}\n{"predicted_labels":["cat","dog","fish"],"probabilities":[0.2,0.5,0.3]}'  | 

------

## Model container validation


We recommend that you deploy your model to a SageMaker AI real-time inference endpoint, and send requests to the endpoint. Manually examine the requests (model container inputs) and responses (model container outputs) to make sure that both are compliant with the requirements in the **Model Container Input** section and **Model Container Output** section. If your model container supports batch requests, you can start with a single record request, and then try two or more records.

The following commands show how to request a response using the AWS CLI. The AWS CLI is pre-installed in SageMaker Studio Classic, and SageMaker Notebook instances. If you need to install the AWS CLI, follow this [installation guide](https://aws.amazon.com/cli/).

```
aws sagemaker-runtime invoke-endpoint \
  --endpoint-name $ENDPOINT_NAME \
  --content-type $CONTENT_TYPE \
  --accept $ACCEPT_TYPE \
  --body $REQUEST_DATA \
  $CLI_BINARY_FORMAT \
  /dev/stderr 1>/dev/null
```

The parameters are defined as follows:
+ `$ENDPOINT_NAME`: The name of the endpoint.
+ `$CONTENT_TYPE`: The MIME type of the request (model container input).
+ `$ACCEPT_TYPE`: The MIME type of the response (model container output).
+ `$REQUEST_DATA`: The request payload string.
+ `$CLI_BINARY_FORMAT`: The format of the command line interface (CLI) parameter. For AWS CLI v1, this parameter should remain blank. For v2, this parameter should be set to `--cli-binary-format raw-in-base64-out`.

**Note**  
AWS CLI v2 passes binary parameters as base64-encoded strings by [default](https://docs.aws.amazon.com/cli/latest/userguide/cliv2-migration.html#cliv2-migration-binaryparam).

The following examples use AWS CLI v1:

------
#### [ Request and response in CSV format ]
+ The request consists of a single record and the response is its probability value.

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-sagemaker-xgboost-model \
    --content-type text/csv \
    --accept text/csv \
    --body '1,2,3,4' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `0.6`
+ The request consists of two records, and the response includes their probabilities separated by a comma. The `$'...'` quoting of the `--body` value tells the shell to interpret `\n` in the content as a line break.

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-sagemaker-xgboost-model \
    --content-type text/csv \
    --accept text/csv \
    --body $'1,2,3,4\n5,6,7,8' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `0.6,0.3`
+ The request consists of two records, the response includes their probabilities, and the model separates the probabilities with a line break.

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-csv-1 \
    --content-type text/csv \
    --accept text/csv \
    --body $'1,2,3,4\n5,6,7,8' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `0.6`

  `0.3`
+ The request consists of a single record, and the response is probability values (multi-class model, three classes).

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-csv-1 \
    --content-type text/csv \
    --accept text/csv \
    --body '1,2,3,4' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `0.1,0.6,0.3`
+ The request consists of two records, and the response includes their probability values (multi-class model, three classes).

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-csv-1 \
    --content-type text/csv \
    --accept text/csv \
    --body $'1,2,3,4\n5,6,7,8' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `0.1,0.6,0.3`

  `0.2,0.5,0.3`
+ The request consists of two records, and the response includes predicted label and probability.

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-csv-2 \
    --content-type text/csv \
    --accept text/csv \
    --body $'1,2,3,4\n5,6,7,8' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `1,0.6`

  `0,0.3`
+ The request consists of two records and the response includes label headers and probabilities.

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-csv-3 \
    --content-type text/csv \
    --accept text/csv \
    --body $'1,2,3,4\n5,6,7,8' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `"['cat','dog','fish']","[0.1,0.6,0.3]"`

  `"['cat','dog','fish']","[0.2,0.5,0.3]"`

------
#### [ Request and response in JSON Lines format ]
+ The request consists of a single record and the response is its probability value.

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-jsonlines \
    --content-type application/jsonlines \
    --accept application/jsonlines \
    --body '{"features":["This is a good product",5]}' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `{"score":0.6}`
+ The request contains two records, and the response includes predicted label and probability.

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-jsonlines-2 \
    --content-type application/jsonlines \
    --accept application/jsonlines \
    --body $'{"features":[1,2,3,4]}\n{"features":[5,6,7,8]}' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `{"predicted_label":1,"probability":0.6}`

  `{"predicted_label":0,"probability":0.3}`
+ The request contains two records and the response includes label headers and probabilities.

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-jsonlines-3 \
    --content-type application/jsonlines \
    --accept application/jsonlines \
    --body $'{"data":{"features":[1,2,3,4]}}\n{"data":{"features":[5,6,7,8]}}' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `{"predicted_labels":["cat","dog","fish"],"probabilities":[0.1,0.6,0.3]}`

  `{"predicted_labels":["cat","dog","fish"],"probabilities":[0.2,0.5,0.3]}`

------
#### [ Request and response in different formats ]
+ The request is in CSV format and the response is in JSON Lines format:

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-csv-in-jsonlines-out \
    --content-type text/csv \
    --accept application/jsonlines \
    --body $'1,2,3,4\n5,6,7,8' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `{"probability":0.6}`

  `{"probability":0.3}`
+ The request is in JSON Lines format and the response is in CSV format:

  ```
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name test-endpoint-jsonlines-in-csv-out \
    --content-type application/jsonlines \
    --accept text/csv \
    --body $'{"features":[1,2,3,4]}\n{"features":[5,6,7,8]}' \
    /dev/stderr 1>/dev/null
  ```

  Output:

  `0.6`

  `0.3`

------

After the validations are complete, [delete](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-delete-resources.html) the testing endpoint.

# Configure and create an endpoint

Create a new endpoint configuration to fit your model, and use this configuration to create the endpoint. You can use the model container validated in the [pre-check step ](https://docs.aws.amazon.com/sagemaker/latest/dg/clarify-online-explainability-precheck.html) to create an endpoint and enable the SageMaker Clarify online explainability feature.

Use the `sagemaker_client` object to create an endpoint configuration using the [CreateEndpointConfig](https://docs.aws.amazon.com//sagemaker/latest/APIReference/API_CreateEndpointConfig.html) API. Set the member `ClarifyExplainerConfig` inside the `ExplainerConfig` parameter as follows:

```
sagemaker_client.create_endpoint_config(
    EndpointConfigName='name-of-your-endpoint-config',
    ExplainerConfig={
        'ClarifyExplainerConfig': {
            'EnableExplanations': '`true`',
            'InferenceConfig': {
                ...
            },
            'ShapConfig': {
                ...
            }
        },
    },
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': 'name-of-your-model',
        'InitialInstanceCount': 1,
        'InstanceType': 'ml.m5.xlarge',
    }]
     ...
)
sagemaker_client.create_endpoint(
    EndpointName='name-of-your-endpoint',
    EndpointConfigName='name-of-your-endpoint-config'
)
```

The first call to the `sagemaker_client` object creates a new endpoint configuration with the explainability feature enabled. The second call uses the endpoint configuration to launch the endpoint.

**Note**  
You can also host multiple models in one container behind a [SageMaker AI real-time inference multi-model endpoint](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html) and configure online explainability with SageMaker Clarify.

# The `EnableExplanations` expression


The `EnableExplanations` parameter is a [JMESPath](https://jmespath.org/) Boolean expression string. It is evaluated for **each record** in the explainability request. If the expression evaluates to **true**, the record is explained. If it evaluates to **false**, no explanations are generated.

SageMaker Clarify deserializes the model container output for each record into a JSON compatible data structure, and then uses the `EnableExplanations` parameter to evaluate the data.

**Notes**  
There are two options for records depending on the format of the model container output.  
If the model container output is in CSV format, then a record is loaded as a JSON array.
If the model container output is in JSON Lines format, then a record is loaded as a JSON object.

The `EnableExplanations` parameter is a JMESPath expression that can be passed either during the `InvokeEndpoint` or `CreateEndpointConfig` operations. If the JMESPath expression that you supplied is not valid, the endpoint creation will fail. If the expression is valid, but the expression evaluation result is unexpected, then the endpoint will be created successfully, but an error will be generated when the endpoint is invoked. Test your `EnableExplanations` expression by using the `InvokeEndpoint` API, and then apply it to the endpoint configuration.
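Testing an expression per request, as suggested above, can be done by passing `EnableExplanations` to `InvokeEndpoint`. A sketch, with the runtime client passed in and the endpoint name and payload as assumptions:

```python
def invoke_with_expression(runtime_client, endpoint_name, payload, expression):
    """Try an EnableExplanations JMESPath expression on a single request
    before writing it into the endpoint configuration."""
    return runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=payload,
        EnableExplanations=expression,  # for example: '[1]>`0.5`'
    )

# import boto3
# invoke_with_expression(boto3.client("sagemaker-runtime"),
#                        "name-of-your-endpoint", "1,2,3,4", '[1]>`0.5`')
```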

The following are some examples of valid `EnableExplanations` expression. In the examples, a JMESPath expression encloses a literal using backtick characters. For example, ``true`` means true.


| Expression (string representation) | Model container output (string representation) | Evaluation result (Boolean) | Meaning | 
| --- | --- | --- | --- | 
|  '`true`'  |  (N/A)  |  True  |  Activate online explainability unconditionally.  | 
|  '`false`'  |  (N/A)  |  False  |  Deactivate online explainability unconditionally.  | 
|  '[1]>`0.5`'  |  '1,0.6'  |  True  |  For each record, the model container outputs its predicted label and probability. Explains a record if its probability (at index 1) is greater than 0.5.  | 
|  'probability>`0.5`'  |  '{"predicted_label":1,"probability":0.6}'  |  True  |  For each record, the model container outputs JSON data. Explain a record if its probability is greater than 0.5.  | 
|  '!contains(probabilities[:-1], max(probabilities))'  |  '{"probabilities": [0.4, 0.1, 0.4], "labels":["cat","dog","fish"]}'  |  False  |  For a multi-class model: Explains a record if its predicted label (the class that has the max probability value) is the last class. Literally, the expression means that the max probability value is not in the list of probabilities excluding the last one.  | 

# Synthetic dataset


SageMaker Clarify uses the Kernel SHAP algorithm. Given a record (also called a sample or an instance) and the SHAP configuration, the explainer first generates a synthetic dataset. SageMaker Clarify then queries the model container for the predictions of the dataset, and then computes and returns the feature attributions. The size of the synthetic dataset affects the runtime for the Clarify explainer. Larger synthetic datasets take more time to obtain model predictions than smaller ones.

 The synthetic dataset size is determined by the following formula:

```
Synthetic dataset size = SHAP baseline size * n_samples
```

The SHAP baseline size is the number of records in the SHAP baseline data. This information is taken from the `ShapBaselineConfig`.

The size of `n_samples` is set by the parameter `NumberOfSamples` in the explainer configuration and the number of features. If the number of features is `n_features`, then `n_samples` is the following: 

```
n_samples = MIN(NumberOfSamples, 2^n_features - 2)
```

The following shows `n_samples` if `NumberOfSamples` is not provided.

```
n_samples = MIN(2*n_features + 2^11, 2^n_features - 2)
```

For example, consider a tabular record with 10 features and a SHAP baseline that contains a single record (a SHAP baseline size of 1). If `NumberOfSamples` is not provided, the synthetic dataset contains 1022 records. If the record has 20 features, the synthetic dataset contains 2088 records.

For NLP problems, `n_features` is equal to the number of non-text features plus the number of text units.
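
The sizing arithmetic above can be sketched as follows. This is an illustration of the formulas, not SageMaker Clarify code:

```python
# Illustration of the synthetic dataset sizing formulas.
def n_samples(n_features, number_of_samples=None):
    """Number of synthetic samples generated per baseline record."""
    default = 2 * n_features + 2**11  # used when NumberOfSamples is not provided
    requested = default if number_of_samples is None else number_of_samples
    return min(requested, 2**n_features - 2)

def synthetic_dataset_size(shap_baseline_size, n_features, number_of_samples=None):
    return shap_baseline_size * n_samples(n_features, number_of_samples)

print(synthetic_dataset_size(1, 10))  # 1022 records
print(synthetic_dataset_size(1, 20))  # 2088 records
```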

**Note**  
The `InvokeEndpoint` API has a request timeout limit. If the synthetic dataset is too large, the explainer may not be able to complete the computation within this limit. If necessary, use the previous information to understand and reduce the SHAP baseline size and `NumberOfSamples`. If your model container is set up to handle batch requests, then you can also adjust the value of `MaxRecordCount`.

# Invoke the endpoint


After the endpoint is running, use the [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) API in the SageMaker AI Runtime service to send requests to, or invoke, the endpoint. The SageMaker Clarify explainer handles these requests as explainability requests.

**Note**  
To invoke an endpoint, choose one of the following options:  
For instructions to use Boto3 or the AWS CLI to invoke an endpoint, see [Invoke models for real-time inference](realtime-endpoints-test-endpoints.md).
To use the SageMaker SDK for Python to invoke an endpoint, see the [Predictor](https://sagemaker.readthedocs.io/en/stable/api/inference/predictors.html) API.

## Request


The `InvokeEndpoint` API has an optional parameter `EnableExplanations`, which is mapped to the HTTP header `X-Amzn-SageMaker-Enable-Explanations`. If this parameter is provided, it overrides the `EnableExplanations` parameter of the `ClarifyExplainerConfig`.

**Note**  
The `ContentType` and `Accept` parameters of the `InvokeEndpoint` API are required. Supported MIME types are `text/csv` and `application/jsonlines`.

Use the `sagemaker_runtime_client` to send a request to the endpoint, as follows:

```
response = sagemaker_runtime_client.invoke_endpoint(
    EndpointName='name-of-your-endpoint',
    EnableExplanations='`true`',
    ContentType='text/csv',
    Accept='text/csv',
    Body='1,2,3,4',  # single record (of four numerical features)
)
```

For multi-model endpoints, pass an additional `TargetModel` parameter in the previous example request to specify which model to target at the endpoint. The multi-model endpoint dynamically loads target models as needed. For more information about multi-model endpoints, see [Multi-model endpoints](multi-model-endpoints.md). See the [SageMaker Clarify Online Explainability on Multi-Model Endpoint Sample Notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-clarify/online_explainability/tabular_multi_model_endpoint/multi_model_xgboost_with_online_explainability.ipynb) for an example of how to set up and invoke multiple target models from a single endpoint.

## Response


If the endpoint is created with `ExplainerConfig`, a new response schema is used. This schema is different from, and is not compatible with, the response schema of an endpoint that lacks the `ExplainerConfig` parameter.

The MIME type of the response is `application/json`, and the response payload can be decoded from UTF-8 bytes to a JSON object. The JSON object has the following members:
+ `version`: The version of the response schema in string format. For example, `1.0`.
+ `predictions`: The predictions that the model makes for the records in the request. This member contains the following:
  + `content_type`: The MIME type of the predictions, referring to the `ContentType` of the model container response.
  + `data`: The predictions data string delivered as the payload of the model container response for the request.
+ `label_headers`: The label headers from the `LabelHeaders` parameter. This is provided in either the explainer configuration or the model container output.
+ `explanations`: The explanations for the records in the request payload. If no records are explained, this member returns the empty object `{}`. Otherwise, it contains the following:
  + `kernel_shap`: A key that refers to an array of Kernel SHAP explanations for each record in the request. If a record is not explained, the corresponding explanation is `null`.

The `kernel_shap` element has the following members:
+ `feature_header`: The header name of the features provided by the `FeatureHeaders` parameter in the explainer configuration `ExplainerConfig`.
+ `feature_type`: The feature type inferred by the explainer or provided in the `FeatureTypes` parameter in the `ExplainerConfig`. This element is only available for NLP explainability problems.
+ `attributions`: An array of attribution objects. Text features can have multiple attribution objects, each for a unit. The attribution object has the following members:
  + `attribution`: A list of probability values, given for each class.
  + `description`: The description of the text units, available only for NLP explainability problems.
    + `partial_text`: The portion of the text explained by the explainer.
    + `start_idx`: A zero-based index to identify the array location of the beginning of the partial text fragment.

# Code examples: SDK for Python


This section provides sample code to create and invoke an endpoint that uses SageMaker Clarify online explainability. These code examples use the [AWS SDK for Python](https://aws.amazon.com/sdk-for-python/).

## Tabular data


The following example uses tabular data and a SageMaker AI model called `model_name`. In this example, the model container accepts data in CSV format, and each record has four numerical features. In this minimal configuration, **for demonstration purposes only**, the SHAP baseline data is set to zero. Refer to [SHAP Baselines for Explainability](clarify-feature-attribute-shap-baselines.md) to learn how to choose more appropriate values for `ShapBaseline`.

Configure the endpoint, as follows:

```
endpoint_config_name = 'tabular_explainer_endpoint_config'
response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': model_name,
        'InitialInstanceCount': 1,
        'InstanceType': 'ml.m5.xlarge',
    }],
    ExplainerConfig={
        'ClarifyExplainerConfig': {
            'ShapConfig': {
                'ShapBaselineConfig': {
                    'ShapBaseline': '0,0,0,0',
                },
            },
        },
    },
)
```

Use the endpoint configuration to create an endpoint, as follows:

```
endpoint_name = 'tabular_explainer_endpoint'
response = sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
```

Use the `DescribeEndpoint` API to inspect the progress of creating an endpoint, as follows:

```
response = sagemaker_client.describe_endpoint(
    EndpointName=endpoint_name,
)
response['EndpointStatus']
```

After the endpoint status is `InService`, invoke the endpoint with a test record, as follows:

```
response = sagemaker_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='text/csv',
    Accept='text/csv',
    Body='1,2,3,4',
)
```

**Note**  
In the previous code example, for multi-model endpoints, pass an additional `TargetModel` parameter in the request to specify which model to target at the endpoint.

Assume that the response has a status code of 200 (no error), and load the response body, as follows:

```
import codecs
import json
json.load(codecs.getreader('utf-8')(response['Body']))
```

The default action for the endpoint is to explain the record. The following shows example output in the returned JSON object.

```
{
    "version": "1.0",
    "predictions": {
        "content_type": "text/csv; charset=utf-8",
        "data": "0.0006380207487381"
    },
    "explanations": {
        "kernel_shap": [
            [
                {
                    "attributions": [
                        {
                            "attribution": [-0.00433456]
                        }
                    ]
                },
                {
                    "attributions": [
                        {
                            "attribution": [-0.005369821]
                        }
                    ]
                },
                {
                    "attributions": [
                        {
                            "attribution": [0.007917749]
                        }
                    ]
                },
                {
                    "attributions": [
                        {
                            "attribution": [-0.00261214]
                        }
                    ]
                }
            ]
        ]
    }
}
```
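
A response body shaped like the example above can be unpacked into one SHAP value per feature. In this sketch, `body` stands in for the decoded JSON object (in practice it would come from `json.load` on the response `Body`, as shown earlier):

```python
# Decoded response body, shaped like the example output above.
body = {
    "version": "1.0",
    "predictions": {"content_type": "text/csv; charset=utf-8", "data": "0.0006380207487381"},
    "explanations": {
        "kernel_shap": [[
            {"attributions": [{"attribution": [-0.00433456]}]},
            {"attributions": [{"attribution": [-0.005369821]}]},
            {"attributions": [{"attribution": [0.007917749]}]},
            {"attributions": [{"attribution": [-0.00261214]}]},
        ]]
    },
}

record = body["explanations"]["kernel_shap"][0]  # explanations for the first record
shap_values = [feature["attributions"][0]["attribution"][0] for feature in record]
print(shap_values)  # one SHAP value per feature
```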

Use the `EnableExplanations` parameter to enable on-demand explanations, as follows:

```
response = sagemaker_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='text/csv',
    Accept='text/csv',
    Body='1,2,3,4',
    EnableExplanations='[0]>`0.8`',
)
```

**Note**  
In the previous code example, for multi-model endpoints, pass an additional `TargetModel` parameter in the request to specify which model to target at the endpoint.

In this example, the prediction value is less than the threshold value of `0.8`, so the record is not explained:

```
{
    "version": "1.0",
    "predictions": {
        "content_type": "text/csv; charset=utf-8",
        "data": "0.6380207487381995"
    },
    "explanations": {}
}
```

Use visualization tools to help interpret the returned explanations. The following image shows how SHAP plots can be used to understand how each feature contributes to the prediction. The base value on the diagram, also called the expected value, is the mean prediction over the training dataset. Features that push the expected value higher are red, and features that push the expected value lower are blue. See [SHAP additive force layout](https://shap.readthedocs.io/en/latest/generated/shap.plots.force.html) for additional information.

![\[Example SHAP plot, that can be used to understand how each feature contributes to the prediction.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/clarify/force-plot.png)


See the [full example notebook for tabular data](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-clarify/online_explainability/tabular/tabular_online_explainability_with_sagemaker_clarify.ipynb). 

## Text data


This section provides a code example to create and invoke an online explainability endpoint for text data. The code example uses SDK for Python.

The following example uses text data and a SageMaker AI model called `model_name`. In this example, the model container accepts data in CSV format, and each record is a single string.

```
endpoint_config_name = 'text_explainer_endpoint_config'
response = sagemaker_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[{
        'VariantName': 'AllTraffic',
        'ModelName': model_name,
        'InitialInstanceCount': 1,
        'InstanceType': 'ml.m5.xlarge',
    }],
    ExplainerConfig={
        'ClarifyExplainerConfig': {
            'InferenceConfig': {
                'FeatureTypes': ['text'],
                'MaxRecordCount': 100,
            },
            'ShapConfig': {
                'ShapBaselineConfig': {
                    'ShapBaseline': '"<MASK>"',
                },
                'TextConfig': {
                    'Granularity': 'token',
                    'Language': 'en',
                },
                'NumberOfSamples': 100,
            },
        },
    },
)
```
+ `ShapBaseline`: A special token reserved for natural language processing (NLP).
+ `FeatureTypes`: Identifies the feature as text. If this parameter is not provided, the explainer will attempt to infer the feature type.
+ `TextConfig`: Specifies the unit of granularity and language for the analysis of text features. In this example, the language is English, and granularity `token` means a word in English text.
+ `NumberOfSamples`: A limit to set the upper bounds of the size of the synthetic dataset.
+ `MaxRecordCount`: The maximum number of records in a request that the model container can handle. This parameter is set to stabilize performance.

Use the endpoint configuration to create the endpoint, as follows:

```
endpoint_name = 'text_explainer_endpoint'
response = sagemaker_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)
```

After the status of the endpoint becomes `InService`, invoke the endpoint with a test record, as follows:

```
response = sagemaker_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='text/csv',
    Accept='text/csv',
    Body='"This is a good product"',
)
```

If the request completes successfully, the response body will return a valid JSON object that's similar to the following:

```
{
    "version": "1.0",
    "predictions": {
        "content_type": "text/csv",
        "data": "0.9766594\n"
    },
    "explanations": {
        "kernel_shap": [
            [
                {
                    "attributions": [
                        {
                            "attribution": [
                                -0.007270948666666712
                            ],
                            "description": {
                                "partial_text": "This",
                                "start_idx": 0
                            }
                        },
                        {
                            "attribution": [
                                -0.018199033666666628
                            ],
                            "description": {
                                "partial_text": "is",
                                "start_idx": 5
                            }
                        },
                        {
                            "attribution": [
                                0.01970993241666666
                            ],
                            "description": {
                                "partial_text": "a",
                                "start_idx": 8
                            }
                        },
                        {
                            "attribution": [
                                0.1253469515833334
                            ],
                            "description": {
                                "partial_text": "good",
                                "start_idx": 10
                            }
                        },
                        {
                            "attribution": [
                                0.03291143366666657
                            ],
                            "description": {
                                "partial_text": "product",
                                "start_idx": 15
                            }
                        }
                    ],
                    "feature_type": "text"
                }
            ]
        ]
    }
}
```
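
In the response above, each `start_idx` is a zero-based character offset into the original input string, and `partial_text` is the token found at that offset. The following sketch checks that relationship using the tokens from the example:

```python
# Verify that each (partial_text, start_idx) pair from the example response
# indexes into the original input string.
sentence = "This is a good product"
tokens = [("This", 0), ("is", 5), ("a", 8), ("good", 10), ("product", 15)]

for partial_text, start_idx in tokens:
    assert sentence[start_idx:start_idx + len(partial_text)] == partial_text
print("all token offsets match")
```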

Use visualization tools to help interpret the returned text attributions. The following image shows how the captum visualization utility can be used to understand how each word contributes to the prediction. The higher the color saturation, the higher the importance given to the word. In this example, a highly saturated bright red color indicates a strong negative contribution. A highly saturated green color indicates a strong positive contribution. The color white indicates that the word has a neutral contribution. See the [captum](https://github.com/pytorch/captum) library for additional information on parsing and rendering the attributions.

![\[Captum visualization utility used to understand how each word contributes to the prediction.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/clarify/word-importance.png)


See the [full example notebook for text](https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-clarify/online_explainability/natural_language_processing/nlp_online_explainability_with_sagemaker_clarify.ipynb) data. 

# Troubleshooting guide


If you encounter errors using SageMaker Clarify online explainability, consult the topics in this section.

**`InvokeEndpoint` API fails with the error "ReadTimeoutError:Read timeout on endpoint..."** 

This error means that the request could not be completed within the 60-second time limit set by the [request timeout](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html).

To reduce the request latency, try the following:
+ Tune the model's performance during inference. For example, SageMaker AI [Neo](https://aws.amazon.com/sagemaker/neo/) can optimize models for inference.
+ Allow the model container to handle batch requests.
+ Use a larger `MaxRecordCount` to reduce the number of calls from the explainer to the model container. This will reduce network latency and overhead.
+ Use an instance type that has more resources allocated to it. Alternately, assign more instances to the endpoint to help balance the load.
+ Reduce the number of records inside a single `InvokeEndpoint` request.
+ Reduce the number of records in the baseline data.
+ Use a smaller `NumberOfSamples` value to reduce the size of the synthetic dataset. For more information about how the number of samples affects your synthetic dataset, see [Synthetic dataset](clarify-online-explainability-create-endpoint-synthetic.md).

# Fine-tune models with adapter inference components
Fine-tune with adapters

With Amazon SageMaker AI, you can host pre-trained foundation models without needing to create your own models from scratch. However, to tailor a general-purpose foundation model for the unique needs of your business, you must create a fine-tuned version of it. One cost-effective fine-tuning technique is Low-Rank Adaptation (LoRA). The principle behind LoRA is that only a small part of a large foundation model needs updating to adapt it to new tasks or domains. A LoRA adapter augments the inference from a base foundation model with just a few extra adapter layers.

If you host your base foundation model by using a SageMaker AI inference component, you can fine-tune that base model with LoRA adapters by creating *adapter inference components*. When you create an adapter inference component, you specify the following:
+ The *base inference component* that is to contain the adapter inference component. The base inference component contains the foundation model that you want to adapt. The adapter inference component uses the compute resources that you assigned to the base inference component.
+ The location where you've stored the LoRA adapter in Amazon S3.

After you create the adapter inference component, you can invoke it directly. When you do, SageMaker AI combines the adapter with the base model to augment the generated response.

**Before you begin**

Before you can create an adapter inference component, you must meet the following requirements: 
+ You have a base inference component that contains the foundation model to adapt. You've deployed this inference component to a SageMaker AI endpoint. 

  For more information about deploying inference components to endpoints, see [Deploy models for real-time inference](realtime-endpoints-deploy-models.md).
+ You have a LoRA adapter model, and you've stored the model artifacts as a `tar.gz` file in Amazon S3. You specify the S3 URI of the artifacts when you create the adapter inference component.
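
The following is a minimal sketch of packaging adapter artifacts as a `tar.gz` file. The file names here are hypothetical placeholders; use the artifact files that your LoRA adapter training actually produces:

```python
import tarfile
import tempfile
from pathlib import Path

# Hypothetical local directory of LoRA adapter artifacts (placeholder files).
workdir = Path(tempfile.mkdtemp())
adapter_dir = workdir / "adapter"
adapter_dir.mkdir()
(adapter_dir / "adapter_config.json").write_text("{}")
(adapter_dir / "adapter_model.safetensors").write_bytes(b"")

# Archive the artifacts at the root of the tar.gz file.
archive_path = workdir / "adapter.tar.gz"
with tarfile.open(archive_path, "w:gz") as tar:
    for artifact in sorted(adapter_dir.iterdir()):
        tar.add(artifact, arcname=artifact.name)

# Upload archive_path to Amazon S3 (for example, with the boto3 S3 client's
# upload_file method), and use that S3 URI when you create the adapter
# inference component.
```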

The following examples use the SDK for Python (Boto3) to create and invoke an adapter inference component.

**Example `create_inference_component` call to create an adapter inference component**  
The following example creates an adapter inference component and assigns it to a base inference component:  

```
sm_client.create_inference_component(
    InferenceComponentName = adapter_ic_name,
    EndpointName = endpoint_name,
    Specification={
        "BaseInferenceComponentName": base_inference_component_name,
        "Container": {
            "ArtifactUrl": adapter_s3_uri
        },
    },
)
```
When you use this example in your own code, replace the placeholder values as follows:  
+ *adapter_ic_name* – A unique name for your adapter inference component.
+ *endpoint_name* – The name of the endpoint that hosts the base inference component.
+ *base_inference_component_name* – The name of the base inference component that contains the foundation model to adapt.
+ *adapter_s3_uri* – The S3 URI that locates the `tar.gz` file with your LoRA adapter artifacts.
You create an adapter inference component with code that is similar to the code for a normal inference component. One difference is that, for the `Specification` parameter, you omit the `ComputeResourceRequirements` key. When you invoke an adapter inference component, it is loaded by the base inference component. The adapter inference component uses the compute resources of the base inference component.  
For more information about creating and deploying inference components with the SDK for Python (Boto3), see [Deploy models with the Python SDKs](realtime-endpoints-deploy-models.md#deploy-models-python).

After you create an adapter inference component, you invoke it by specifying its name in an `invoke_endpoint` request.

**Example `invoke_endpoint` call to invoke an adapter inference component**  
The following example invokes an adapter inference component:  

```
response = sm_rt_client.invoke_endpoint(
    EndpointName = endpoint_name,
    InferenceComponentName = adapter_ic_name,
    Body = json.dumps(
        {
            "inputs": prompt,
            "parameters": {"max_new_tokens": 100, "temperature":0.9}
        }
    ),
    ContentType = "application/json",
)

adapter_response = json.loads(response["Body"].read().decode("utf-8"))["generated_text"]
```
When you use this example in your own code, replace the placeholder values as follows:  
+ *endpoint_name* – The name of the endpoint that hosts the base and adapter inference components.
+ *adapter_ic_name* – The name of the adapter inference component.
+ *prompt* – The prompt for the inference request.
For more information about invoking inference components with the SDK for Python (Boto3), see [Invoke models for real-time inference](realtime-endpoints-test-endpoints.md).