

# Deploy models with Amazon SageMaker Serverless Inference

Amazon SageMaker Serverless Inference is a purpose-built inference option that enables you to deploy and scale ML models without configuring or managing any of the underlying infrastructure. On-demand Serverless Inference is ideal for workloads which have idle periods between traffic spurts and can tolerate cold starts. Serverless endpoints automatically launch compute resources and scale them in and out depending on traffic, eliminating the need to choose instance types or manage scaling policies. This takes away the undifferentiated heavy lifting of selecting and managing servers. Serverless Inference integrates with AWS Lambda to offer you high availability, built-in fault tolerance and automatic scaling. With a pay-per-use model, Serverless Inference is a cost-effective option if you have an infrequent or unpredictable traffic pattern. During times when there are no requests, Serverless Inference scales your endpoint down to 0, helping you to minimize your costs. For more information about pricing for on-demand Serverless Inference, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

Optionally, you can also use Provisioned Concurrency with Serverless Inference. Serverless Inference with Provisioned Concurrency is a cost-effective option when you have predictable bursts in your traffic. Provisioned Concurrency allows you to deploy models on serverless endpoints with predictable performance and high scalability by keeping your endpoints warm. SageMaker AI ensures that for the amount of Provisioned Concurrency that you allocate, the compute resources are initialized and ready to respond within milliseconds. For Serverless Inference with Provisioned Concurrency, you pay for the compute capacity used to process inference requests, billed by the millisecond, and the amount of data processed. You also pay for Provisioned Concurrency usage, based on the memory configured, the duration provisioned, and the amount of concurrency enabled. For more information about pricing for Serverless Inference with Provisioned Concurrency, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

You can integrate Serverless Inference with your MLOps Pipelines to streamline your ML workflow, and you can use a serverless endpoint to host a model registered with [Model Registry](model-registry.md).

Serverless Inference is generally available in 21 AWS Regions: US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Africa (Cape Town), Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Osaka), Asia Pacific (Singapore), Asia Pacific (Sydney), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), Europe (Stockholm), Europe (Milan), Middle East (Bahrain), South America (São Paulo). For more information about Amazon SageMaker AI regional availability, see the [AWS Regional Services List](https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/).

## How it works


The following diagram shows the workflow of on-demand Serverless Inference and the benefits of using a serverless endpoint.

![\[Diagram showing the Serverless Inference workflow.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/serverless-endpoints-how-it-works.png)


When you create an on-demand serverless endpoint, SageMaker AI provisions and manages the compute resources for you. Then, you can make inference requests to the endpoint and receive model predictions in response. SageMaker AI scales the compute resources up and down as needed to handle your request traffic, and you only pay for what you use.
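Once the endpoint is in service, you invoke it exactly as you would a real-time endpoint. The following is a minimal sketch using the AWS SDK for Python (Boto3); the endpoint name and CSV payload are placeholders, and `text/csv` is just one example content type:

```python
def build_invoke_args(endpoint_name, payload_csv):
    """Assemble the request parameters for an InvokeEndpoint call."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "text/csv",
        "Body": payload_csv.encode("utf-8"),
    }

def invoke_serverless_endpoint(endpoint_name, payload_csv):
    """Send one inference request; SageMaker AI provisions compute on demand."""
    import boto3  # imported here so the helper above stays dependency-free
    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(**build_invoke_args(endpoint_name, payload_csv))
    return response["Body"].read().decode("utf-8")
```

The first request after an idle period may take longer than subsequent requests because of a cold start, as described later in this topic.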

For Provisioned Concurrency, Serverless Inference also integrates with Application Auto Scaling, so that you can manage Provisioned Concurrency based on a target metric or on a schedule. For more information, see [Automatically scale Provisioned Concurrency for a serverless endpoint](serverless-endpoints-autoscale.md).

The following sections provide additional details about Serverless Inference and how it works.

**Topics**
+ [Container support](#serverless-endpoints-how-it-works-containers)
+ [Memory size](#serverless-endpoints-how-it-works-memory)
+ [Concurrent invocations](#serverless-endpoints-how-it-works-concurrency)
+ [Minimizing cold starts](#serverless-endpoints-how-it-works-cold-starts)
+ [Feature exclusions](#serverless-endpoints-how-it-works-exclusions)

### Container support


For your endpoint container, you can choose either a SageMaker AI-provided container or bring your own. SageMaker AI provides containers for its built-in algorithms and prebuilt Docker images for some of the most common machine learning frameworks, such as Apache MXNet, TensorFlow, PyTorch, and Chainer. For a list of available SageMaker images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md). If you are bringing your own container, you must modify it to work with SageMaker AI. For more information about bringing your own container, see [Adapt your own inference container for Amazon SageMaker AI](adapt-inference-container.md).

The maximum size of the container image you can use is 10 GB. For serverless endpoints, we recommend creating only one worker in the container and only loading one copy of the model. Note that this is unlike real-time endpoints, where some SageMaker AI containers may create a worker for each vCPU to process inference requests and load the model in each worker.

If you already have a container for a real-time endpoint, you can use the same container for your serverless endpoint, though some capabilities are excluded. To learn more about the container capabilities that are not supported in Serverless Inference, see [Feature exclusions](#serverless-endpoints-how-it-works-exclusions). If you choose to use the same container, SageMaker AI escrows (retains) a copy of your container image until you delete all endpoints that use the image. SageMaker AI encrypts the copied image at rest with a SageMaker AI-owned AWS KMS key.

### Memory size


Your serverless endpoint has a minimum RAM size of 1024 MB (1 GB), and the maximum RAM size you can choose is 6144 MB (6 GB). The memory sizes you can choose are 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB. Serverless Inference auto-assigns compute resources proportional to the memory you select. If you choose a larger memory size, your container has access to more vCPUs. Choose your endpoint’s memory size according to your model size. Generally, the memory size should be at least as large as your model size. You might need to benchmark in order to choose the right memory size for your model based on your latency SLAs. For a step-by-step benchmarking guide, see [Introducing the Amazon SageMaker Serverless Inference Benchmarking Toolkit](https://aws.amazon.com/blogs/machine-learning/introducing-the-amazon-sagemaker-serverless-inference-benchmarking-toolkit/). The memory size increments have different pricing; see the [Amazon SageMaker AI pricing page](https://aws.amazon.com/sagemaker/pricing/) for more information.

Regardless of the memory size you choose, your serverless endpoint has 5 GB of ephemeral disk storage available. For help with container permissions issues when working with storage, see [Troubleshooting](serverless-endpoints-troubleshooting.md).
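As a rough starting point, you can pick the smallest allowed memory size that fits your model programmatically. This is a sketch only; treat the result as a baseline to benchmark against your latency SLA, not a final answer:

```python
# The six memory sizes Serverless Inference currently accepts, in MB.
VALID_MEMORY_MB = [1024, 2048, 3072, 4096, 5120, 6144]

def smallest_memory_for_model(model_size_mb):
    """Return the smallest allowed memory size at least as large as the model."""
    for size in VALID_MEMORY_MB:
        if size >= model_size_mb:
            return size
    raise ValueError("Model exceeds the 6144 MB serverless maximum")
```

For example, a 1.5 GB (1536 MB) model would map to the 2048 MB setting as a starting point.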

### Concurrent invocations


On-demand Serverless Inference manages predefined scaling policies and quotas for the capacity of your endpoint. Serverless endpoints have a quota for how many concurrent invocations can be processed at the same time. If the endpoint is invoked before it finishes processing the first request, then it handles the second request concurrently.

The total concurrency that you can share between all serverless endpoints in your account depends on your region:
+ For the US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Europe (Frankfurt), and Europe (Ireland) Regions, the total concurrency you can share between all serverless endpoints per Region in your account is 1000.
+ For the US West (N. California), Africa (Cape Town), Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Osaka), Asia Pacific (Seoul), Canada (Central), Europe (London), Europe (Milan), Europe (Paris), Europe (Stockholm), Middle East (Bahrain), and South America (São Paulo) Regions, the total concurrency per Region in your account is 500.

You can set the maximum concurrency for a single endpoint up to 200, and the total number of serverless endpoints you can host in a Region is 50. The maximum concurrency for an individual endpoint prevents that endpoint from taking up all of the invocations allowed for your account, and any endpoint invocations beyond the maximum are throttled.

**Note**  
Provisioned Concurrency that you assign to a serverless endpoint should always be less than or equal to the maximum concurrency that you assigned to that endpoint.

To learn how to set the maximum concurrency for your endpoint, see [Create an endpoint configuration](serverless-endpoints-create-config.md). For more information about quotas and limits, see [Amazon SageMaker AI endpoints and quotas](https://docs.aws.amazon.com/general/latest/gr/sagemaker.html) in the *AWS General Reference*. To request a service limit increase, contact [AWS Support](https://console.aws.amazon.com/support). For instructions on how to request a service limit increase, see [Supported Regions and Quotas](regions-quotas.md).
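The per-endpoint rules above can be checked client-side before you call the CreateEndpointConfig API. This sketch encodes only the per-endpoint limits; the 1000 or 500 account-level quota depends on your Region and is not checked here:

```python
def validate_serverless_concurrency(max_concurrency, provisioned_concurrency=None):
    """Raise ValueError if per-endpoint concurrency settings violate the documented limits."""
    if not 1 <= max_concurrency <= 200:
        raise ValueError("MaxConcurrency must be between 1 and 200")
    if provisioned_concurrency is not None and not 1 <= provisioned_concurrency <= max_concurrency:
        raise ValueError("ProvisionedConcurrency must be between 1 and MaxConcurrency")
    return True
```

Validating locally fails fast instead of waiting for the service to reject the endpoint configuration.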

### Minimizing cold starts


If your on-demand Serverless Inference endpoint does not receive traffic for a while and then your endpoint suddenly receives new requests, it can take some time for your endpoint to spin up the compute resources to process the requests. This is called a *cold start*. Since serverless endpoints provision compute resources on demand, your endpoint may experience cold starts. A cold start can also occur if the number of concurrent requests exceeds the capacity that is currently provisioned, because SageMaker AI must launch additional compute resources to handle the excess. The cold start time depends on your model size, how long it takes to download your model, and the start-up time of your container.

To monitor how long your cold start time is, you can use the Amazon CloudWatch metric `OverheadLatency` to monitor your serverless endpoint. This metric tracks the time it takes to launch new compute resources for your endpoint. To learn more about using CloudWatch metrics with serverless endpoints, see [Alarms and logs for tracking metrics from serverless endpoints](serverless-endpoints-monitoring.md).
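The following is a sketch of querying `OverheadLatency` with Boto3. The `AllTraffic` variant name and the 5-minute period are assumptions to adapt to your endpoint configuration, and the `get_metric_statistics` call requires AWS credentials:

```python
import datetime

def build_overhead_latency_query(endpoint_name, start, end):
    """Parameters for a CloudWatch GetMetricStatistics call on OverheadLatency."""
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": "OverheadLatency",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": "AllTraffic"},  # assumed variant name
        ],
        "StartTime": start,
        "EndTime": end,
        "Period": 300,  # 5-minute buckets
        "Statistics": ["Average", "Maximum"],
    }

def overhead_latency_stats(endpoint_name, hours=24):
    """Fetch OverheadLatency statistics for the trailing time window."""
    import boto3  # imported here so the query builder stays dependency-free
    now = datetime.datetime.now(datetime.timezone.utc)
    query = build_overhead_latency_query(
        endpoint_name, now - datetime.timedelta(hours=hours), now
    )
    return boto3.client("cloudwatch").get_metric_statistics(**query)
```

A sustained high `Maximum` relative to `Average` over a window is one signal that cold starts are affecting a subset of your requests.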

You can minimize cold starts by using Provisioned Concurrency. SageMaker AI keeps the endpoint warm and ready to respond within milliseconds for the amount of Provisioned Concurrency that you allocate.

### Feature exclusions


Some of the features currently available for SageMaker AI Real-time Inference are not supported for Serverless Inference, including GPUs, AWS Marketplace model packages, private Docker registries, Multi-Model Endpoints, VPC configuration, network isolation, data capture, multiple production variants, Model Monitor, and inference pipelines.

You cannot convert your instance-based, real-time endpoint to a serverless endpoint. If you try to update your real-time endpoint to serverless, you receive a `ValidationError` message. You can convert a serverless endpoint to real-time, but once you make the update, you cannot roll it back to serverless.

## Getting started


You can create, update, describe, and delete a serverless endpoint using the SageMaker AI console, the AWS SDKs, the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#sagemaker-serverless-inference), and the AWS CLI. You can invoke your endpoint using the AWS SDKs, the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#sagemaker-serverless-inference), and the AWS CLI. For serverless endpoints with Provisioned Concurrency, you can use Application Auto Scaling to auto scale Provisioned Concurrency based on a target metric or a schedule. For more information about how to set up and use a serverless endpoint, read the guide [Serverless endpoint operations](serverless-endpoints-create-invoke-update-delete.md). For more information on auto scaling serverless endpoints with Provisioned Concurrency, see [Automatically scale Provisioned Concurrency for a serverless endpoint](serverless-endpoints-autoscale.md).

**Note**  
 Application Auto Scaling for Serverless Inference with Provisioned Concurrency is currently not supported on AWS CloudFormation. 

### Example notebooks and blogs


For Jupyter notebook examples that show end-to-end serverless endpoint workflows, see the [Serverless Inference example notebooks](https://github.com/aws/amazon-sagemaker-examples/tree/master/serverless-inference).

# Serverless endpoint operations


Unlike other SageMaker AI real-time endpoints, Serverless Inference manages compute resources for you, reducing complexity so you can focus on your ML model instead of on managing infrastructure. The following guide highlights the key capabilities of serverless endpoints: how to create, invoke, update, describe, or delete an endpoint. You can use the SageMaker AI console, the AWS SDKs, the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#sagemaker-serverless-inference), or the AWS CLI to manage your serverless endpoints.

**Topics**
+ [Complete the prerequisites](serverless-endpoints-prerequisites.md)
+ [Serverless endpoint creation](serverless-endpoints-create.md)
+ [Invoke a serverless endpoint](serverless-endpoints-invoke.md)
+ [Update a serverless endpoint](serverless-endpoints-update.md)
+ [Describe a serverless endpoint](serverless-endpoints-describe.md)
+ [Delete a serverless endpoint](serverless-endpoints-delete.md)

# Complete the prerequisites


The following topic describes the prerequisites that you must complete before creating a serverless endpoint. These prerequisites include properly storing your model artifacts, configuring an AWS IAM role with the correct permissions, and selecting a container image.

**To complete the prerequisites**

1. **Set up an AWS account.** You first need an AWS account and an AWS Identity and Access Management administrator user. For instructions on how to set up an AWS account, see [How do I create and activate a new AWS account?](https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/). For instructions on how to secure your account with an IAM administrator user, see [Creating your first IAM admin user and user group](https://docs.aws.amazon.com/IAM/latest/UserGuide/getting-started_create-admin-group.html) in the *IAM User Guide*.

1. **Create an Amazon S3 bucket.** You use an Amazon S3 bucket to store your model artifacts. To learn how to create a bucket, see [Create your first S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html) in the *Amazon S3 User Guide*.

1. **Upload your model artifacts to your S3 bucket.** For instructions on how to upload your model to your bucket, see [Upload an object to your bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/uploading-an-object-bucket.html) in the *Amazon S3 User Guide*.

1. **Create an IAM role for Amazon SageMaker AI.** Amazon SageMaker AI needs access to the S3 bucket that stores your model. Create an IAM role with a policy that gives SageMaker AI read access to your bucket. The following procedure shows how to create a role in the console, but you can also use the [CreateRole](https://docs.aws.amazon.com/IAM/latest/APIReference/API_CreateRole.html) API. For information on giving your role more granular permissions based on your use case, see [How to use SageMaker AI execution roles](sagemaker-roles.md#sagemaker-roles-createmodel-perms).

   1. Sign in to the [IAM console](https://console.aws.amazon.com/iam/).

   1. In the navigation tab, choose **Roles**.

   1. Choose **Create role**.

   1. For **Select type of trusted entity**, choose **AWS service** and then choose **SageMaker AI**.

   1. Choose **Next: Permissions** and then choose **Next: Tags**.

   1. (Optional) Add tags as key-value pairs if you want to have metadata for the role.

   1. Choose **Next: Review**.

   1.  For **Role name**, enter a name for the new role that is unique within your AWS account. You cannot edit the role name after creating the role.

   1. (Optional) For **Role description**, enter a description for the new role.

   1. Choose **Create role**.

1. **Attach S3 bucket permissions to your SageMaker AI role.** After creating an IAM role, attach a policy that gives SageMaker AI permission to access the S3 bucket containing your model artifacts.

   1. In the IAM console navigation tab, choose **Roles**.

   1. From the list of roles, search for the role you created in the previous step by name.

   1. Choose your role, and then choose **Attach policies**.

   1. For **Attach permissions**, choose **Create policy**.

   1. In the **Create policy** view, select the **JSON** tab.

   1. Add the following policy statement into the JSON editor. Make sure to replace `<your-bucket-name>` with the name of the S3 bucket that stores your model artifacts. If you want to restrict the access to a specific folder or file in your bucket, you can also specify the Amazon S3 folder path, for example, `<your-bucket-name>/<model-folder>`.

      ```
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "VisualEditor0",
                  "Effect": "Allow",
                  "Action": "s3:GetObject",
                  "Resource": "arn:aws:s3:::<your-bucket-name>/*"
              }
          ]
      }
      ```

   1. Choose **Next: Tags**.

   1. (Optional) Add tags in key-value pairs to the policy.

   1. Choose **Next: Review**.

   1. For **Name**, enter a name for the new policy.

   1. (Optional) Add a **Description** for the policy.

   1. Choose **Create policy**.

   1. After creating the policy, return to **Roles** in the [IAM console](https://console.aws.amazon.com/iam/) and select your SageMaker AI role.

   1. Choose **Attach policies**.

   1. For **Attach permissions**, search for the policy you created by name. Select it and choose **Attach policy**.

1. **Select a prebuilt Docker container image or bring your own.** The container you choose serves inference on your endpoint. SageMaker AI provides containers for built-in algorithms and prebuilt Docker images for some of the most common machine learning frameworks, such as Apache MXNet, TensorFlow, PyTorch, and Chainer. For a full list of the available SageMaker images, see [Available Deep Learning Containers Images](https://github.com/aws/deep-learning-containers/blob/master/available_images.md).

   If none of the existing SageMaker AI containers meet your needs, you may need to create your own Docker container. For information about how to create your Docker image and make it compatible with SageMaker AI, see [Containers with custom inference code](your-algorithms-inference-main.md). To use your container with a serverless endpoint, the container image must reside in an Amazon ECR repository within the same AWS account that creates the endpoint.

1. **(Optional) Register your model with Model Registry.** [SageMaker Model Registry](model-registry.md) helps you catalog and manage versions of your models for use in ML pipelines. For more information about registering a version of your model, see [Create a Model Group](model-registry-model-group.md) and [Register a Model Version](model-registry-version.md). For an example of a Model Registry and Serverless Inference workflow, see the following [example notebook](https://github.com/aws/amazon-sagemaker-examples/blob/main/serverless-inference/serverless-model-registry.ipynb).

1. **(Optional) Bring an AWS KMS key.** When setting up a serverless endpoint, you have the option to specify a KMS key that SageMaker AI uses to encrypt your Amazon ECR image. Note that the key policy for the KMS key must grant access to the IAM role you specify when setting up your endpoint. To learn more about KMS keys, see the [AWS Key Management Service Developer Guide](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).
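Steps 4 and 5 of the console procedure above can also be scripted. The following is a hedged sketch using Boto3; the role and bucket names are placeholders, and the inline policy name is invented for illustration:

```python
import json

def sagemaker_execution_trust_policy():
    """Trust policy that lets the SageMaker service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

def s3_read_policy(bucket_name):
    """Read-only access to the model artifact bucket, matching the JSON above."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": f"arn:aws:s3:::{bucket_name}/*",
        }],
    }

def create_sagemaker_role(role_name, bucket_name):
    """Create the execution role and attach the S3 read policy inline."""
    import boto3  # imported here so the policy builders stay dependency-free
    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(sagemaker_execution_trust_policy()),
    )
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=f"{role_name}-s3-read",  # hypothetical policy name
        PolicyDocument=json.dumps(s3_read_policy(bucket_name)),
    )
    return role["Role"]["Arn"]
```

This uses an inline policy via `put_role_policy` rather than the standalone managed policy created in the console steps; either approach grants the same S3 read access.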

# Serverless endpoint creation


**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

To create a serverless endpoint, you can use the Amazon SageMaker AI console, the APIs, or the AWS CLI. You can create a serverless endpoint using a similar process as a [real-time endpoint](realtime-endpoints.md).

**Topics**
+ [Create a model](serverless-endpoints-create-model.md)
+ [Create an endpoint configuration](serverless-endpoints-create-config.md)
+ [Create an endpoint](serverless-endpoints-create-endpoint.md)

# Create a model


To create your model, you must provide the location of your model artifacts and container image. You can also use a model version from [SageMaker Model Registry](model-registry.md). The examples in the following sections show you how to create a model using the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API, Model Registry, and the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/home).

## To create a model (using Model Registry)


[Model Registry](model-registry.md) is a feature of SageMaker AI that helps you catalog and manage versions of your model for use in ML pipelines. To use Model Registry with Serverless Inference, you must first register a model version in a Model Registry model group. To learn how to register a model in Model Registry, follow the procedures in [Create a Model Group](model-registry-model-group.md) and [Register a Model Version](model-registry-version.md).

The following example requires you to have the ARN of a registered model version and uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to call the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API. For Serverless Inference, Model Registry is currently only supported by the AWS SDK for Python (Boto3). For the example, specify the following values:
+ For `model_name`, enter a name for the model.
+ For `sagemaker_role`, you can use the default SageMaker AI-created role or a customized SageMaker AI IAM role from Step 4 of the [Complete the prerequisites](serverless-endpoints-prerequisites.md) section.
+ For `ModelPackageName`, specify the ARN for your model version, which must be registered to a model group in Model Registry.

```
#Setup
import boto3
import sagemaker
region = boto3.Session().region_name
client = boto3.client("sagemaker", region_name=region)

#Role to give SageMaker AI permission to access AWS services.
sagemaker_role = sagemaker.get_execution_role()

#Specify a name for the model
model_name = "<name-for-model>"

#Specify a Model Registry model version
container_list = [
    {
        "ModelPackageName": "<model-version-arn>"
    }
]

#Create the model
response = client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = sagemaker_role,
    Containers = container_list
)
```

## To create a model (using API)


The following example uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to call the [CreateModel](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html) API. Specify the following values:
+ For `sagemaker_role`, you can use the default SageMaker AI-created role or a customized SageMaker AI IAM role from Step 4 of the [Complete the prerequisites](serverless-endpoints-prerequisites.md) section.
+ For `model_url`, specify the Amazon S3 URI to your model.
+ For `container`, retrieve the container you want to use by its Amazon ECR path. This example uses a SageMaker AI-provided XGBoost container. If you have not selected a SageMaker AI container or brought your own, see Step 6 of the [Complete the prerequisites](serverless-endpoints-prerequisites.md) section for more information.
+ For `model_name`, enter a name for the model.

```
#Setup
import boto3
import sagemaker
region = boto3.Session().region_name
client = boto3.client("sagemaker", region_name=region)

#Role to give SageMaker AI permission to access AWS services.
sagemaker_role = sagemaker.get_execution_role()

#Get model from S3
model_url = "s3://amzn-s3-demo-bucket/models/model.tar.gz"

#Get container image (prebuilt example)
from sagemaker import image_uris
container = image_uris.retrieve("xgboost", region, "0.90-1")

#Create model
model_name = "<name-for-model>"

response = client.create_model(
    ModelName = model_name,
    ExecutionRoleArn = sagemaker_role,
    Containers = [{
        "Image": container,
        "Mode": "SingleModel",
        "ModelDataUrl": model_url,
    }]
)
```

## To create a model (using the console)


1. Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/home).

1. In the navigation tab, choose **Inference**.

1. Next, choose **Models**.

1. Choose **Create model**.

1. For **Model name**, enter a name for the model that is unique to your account and AWS Region.

1. For **IAM role**, either select an IAM role you have already created (see [Complete the prerequisites](serverless-endpoints-prerequisites.md)) or allow SageMaker AI to create one for you.

1. In **Container definition 1**, for **Container input options**, select **Provide model artifacts and input location**.

1. For **Provide model artifacts and inference image options**, select **Use a single model**.

1. For **Location of inference code image**, enter an Amazon ECR path to a container. The image must either be a SageMaker AI-provided first party image (e.g. TensorFlow, XGBoost) or an image that resides in an Amazon ECR repository within the same account in which you are creating the endpoint. If you do not have a container, go back to Step 6 of the [Complete the prerequisites](serverless-endpoints-prerequisites.md) section for more information.

1. For **Location of model artifacts**, enter the Amazon S3 URI to your ML model. For example, `s3://amzn-s3-demo-bucket/models/model.tar.gz`.

1. (Optional) For **Tags**, add key-value pairs to create metadata for your model.

1. Choose **Create model**.

# Create an endpoint configuration


After you create a model, create an endpoint configuration. You can then deploy your model using the specifications in your endpoint configuration. In the configuration, you specify whether you want a real-time or serverless endpoint. To create a serverless endpoint configuration, you can use the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/home), the [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) API, or the AWS CLI. The API and console approaches are outlined in the following sections.

## To create an endpoint configuration (using API)


The following example uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to call the [CreateEndpointConfig](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpointConfig.html) API. Specify the following values:
+ For `EndpointConfigName`, choose a name for the endpoint configuration. The name should be unique within your account in a Region.
+ (Optional) For `KmsKeyId`, use the key ID, key ARN, alias name, or alias ARN for an AWS KMS key that you want to use. SageMaker AI uses this key to encrypt your Amazon ECR image.
+ For `ModelName`, use the name of the model you want to deploy. It should be the same model that you used in the [Create a model](serverless-endpoints-create-model.md) step.
+ For `ServerlessConfig`:
  + Set `MemorySizeInMB` to `2048`. For this example, we set the memory size to 2048 MB, but you can choose any of the following values for your memory size: 1024 MB, 2048 MB, 3072 MB, 4096 MB, 5120 MB, or 6144 MB. 
  + Set `MaxConcurrency` to `20`. For this example, we set the maximum concurrency to 20. The maximum number of concurrent invocations you can set for a serverless endpoint is 200, and the minimum value you can choose is 1.
  + (Optional) To use Provisioned Concurrency, set `ProvisionedConcurrency` to `10`. For this example, we set the Provisioned Concurrency to 10. The `ProvisionedConcurrency` number for a serverless endpoint must be less than or equal to the `MaxConcurrency` number. Leave this field empty if you want an on-demand Serverless Inference endpoint. You can dynamically scale Provisioned Concurrency. For more information, see [Automatically scale Provisioned Concurrency for a serverless endpoint](serverless-endpoints-autoscale.md).

```
response = client.create_endpoint_config(
   EndpointConfigName="<your-endpoint-configuration>",
   KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/143ef68f-76fd-45e3-abba-ed28fc8d3d5e",
   ProductionVariants=[
        {
            "ModelName": "<your-model-name>",
            "VariantName": "AllTraffic",
            "ServerlessConfig": {
                "MemorySizeInMB": 2048,
                "MaxConcurrency": 20,
                "ProvisionedConcurrency": 10,
            }
        } 
    ]
)
```

## To create an endpoint configuration (using the console)


1. Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/home).

1. In the navigation tab, choose **Inference**.

1. Next, choose **Endpoint configurations**.

1. Choose **Create endpoint configuration**.

1. For **Endpoint configuration name**, enter a name that is unique within your account in a Region.

1. For **Type of endpoint**, select **Serverless**.  
![\[Screenshot of the endpoint type option in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/serverless-endpoints-endpoint-config.png)

1. For **Production variants**, choose **Add model**.

1. Under **Add model**, select the model you want to use from the list of models and then choose **Save**.

1. After adding your model, under **Actions**, choose **Edit**.

1. For **Memory size**, choose the memory size you want in GB.  
![\[Screenshot of the memory size option in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/serverless-endpoints-endpoint-config-2.png)

1. For **Max Concurrency**, enter your desired maximum concurrent invocations for the endpoint. The maximum value you can enter is 200 and the minimum is 1.

1. (Optional) To use Provisioned Concurrency, enter the desired number of concurrent invocations in the **Provisioned Concurrency setting** field. The number of provisioned concurrent invocations must be less than or equal to the number of maximum concurrent invocations.

1. Choose **Save**.

1. (Optional) For **Tags**, enter key-value pairs if you want to create metadata for your endpoint configuration.

1. Choose **Create endpoint configuration**.

# Create an endpoint


To create a serverless endpoint, you can use the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/home), the [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) API, or the AWS CLI. The API and console approaches are outlined in the following sections. Once you create your endpoint, it can take a few minutes for the endpoint to become available.

## To create an endpoint (using API)


The following example uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to call the [CreateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateEndpoint.html) API. Specify the following values:
+ For `EndpointName`, enter a name for the endpoint that is unique within a Region in your account.
+ For `EndpointConfigName`, use the name of the endpoint configuration that you created in the previous section.

```
response = client.create_endpoint(
    EndpointName="<your-endpoint-name>",
    EndpointConfigName="<your-endpoint-config>"
)
```
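
Because endpoint creation can take a few minutes, you may want to block until the endpoint is `InService` before invoking it. The following is a minimal sketch using boto3's built-in `endpoint_in_service` waiter; the `waiter_config` helper and the 600-second timeout are assumptions you can tune:

```python
import math

def waiter_config(timeout_seconds, delay=30):
    # Number of DescribeEndpoint polls needed to cover the timeout at this delay.
    return {"Delay": delay, "MaxAttempts": math.ceil(timeout_seconds / delay)}

def wait_until_in_service(endpoint_name, timeout_seconds=600):
    # Not invoked here; calling it blocks until EndpointStatus is InService.
    import boto3

    client = boto3.client("sagemaker")
    client.get_waiter("endpoint_in_service").wait(
        EndpointName=endpoint_name,
        WaiterConfig=waiter_config(timeout_seconds),
    )
```

Call `wait_until_in_service("<your-endpoint-name>")` after `create_endpoint` returns to pause until the endpoint is ready.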

## To create an endpoint (using the console)


1. Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/home).

1. In the navigation tab, choose **Inference**.

1. Next, choose **Endpoints**.

1. Choose **Create endpoint**.

1. For **Endpoint name**, enter a name that is unique within a Region in your account.

1. For **Attach endpoint configuration**, select **Use an existing endpoint configuration**.

1. For **Endpoint configuration**, select the name of the endpoint configuration you created in the previous section and then choose **Select endpoint configuration**.

1. (Optional) For **Tags**, enter key-value pairs if you want to create metadata for your endpoint.

1. Choose **Create endpoint**.  
![\[Screenshot of the create and configure endpoint page in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/serverless-endpoints-create.png)

# Invoke a serverless endpoint


In order to perform inference using a serverless endpoint, you must send an HTTP request to the endpoint. You can use the [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) API or the AWS CLI, which make a `POST` request to invoke your endpoint. The maximum request and response payload size for serverless invocations is 4 MB. For serverless endpoints:
+ The model must download and the server must respond successfully to `/ping` within 3 minutes.
+ The timeout for the container to respond to inference requests to `/invocations` is 1 minute.

## To invoke an endpoint


The following example uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to call the [InvokeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_runtime_InvokeEndpoint.html) API. Note that unlike the other API calls in this guide, for `InvokeEndpoint`, you must use SageMaker Runtime as the client. Specify the following values:
+ For `endpoint_name`, use the name of the in-service serverless endpoint you want to invoke.
+ For `content_type`, specify the MIME type of your input data in the request body (for example, `application/json`).
+ For `payload`, use your request payload for inference. Your payload should be in bytes or a file-like object.

```
import boto3

runtime = boto3.client("sagemaker-runtime")

endpoint_name = "<your-endpoint-name>"
content_type = "<request-mime-type>"
payload = <your-request-body>

response = runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType=content_type,
    Body=payload
)
```
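
The `Body` field in the `InvokeEndpoint` response is a streaming object, so you must read it before use. The following sketch decodes a JSON response body; the in-memory `BytesIO` object is an illustrative stand-in for a real response body:

```python
import io
import json

def parse_json_body(body):
    # Read the streaming Body and decode it according to the content type
    # your container returns (JSON is assumed here).
    return json.loads(body.read().decode("utf-8"))

# Stand-in for response["Body"]; a real call streams bytes from the endpoint.
fake_body = io.BytesIO(b'{"predictions": [0.42]}')
print(parse_json_body(fake_body))
```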

# Update a serverless endpoint


Before updating your endpoint, create a new endpoint configuration or use an existing endpoint configuration. The endpoint configuration is where you specify the changes for your update. Then, you can update your endpoint with the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/home), the [UpdateEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_UpdateEndpoint.html) API, or the AWS CLI. The process for updating a serverless endpoint is the same as the process for updating a [real-time endpoint](realtime-endpoints.md). Note that when updating your endpoint, you can experience cold starts when making requests to the endpoint because SageMaker AI must re-initialize your container and model.

You may want to update an on-demand serverless endpoint to a serverless endpoint with provisioned concurrency or adjust the Provisioned Concurrency value for an existing serverless endpoint with provisioned concurrency. For both cases, you will have to create a new serverless endpoint configuration with the desired value for Provisioned Concurrency, and apply `UpdateEndpoint` to the existing serverless endpoint. For more information on creating a new serverless endpoint configuration with Provisioned Concurrency, see [Create an endpoint configuration](serverless-endpoints-create-config.md).

If you want to remove Provisioned Concurrency from a serverless endpoint, you will have to create a new endpoint configuration without specifying any value for Provisioned Concurrency, and then apply `UpdateEndpoint` to the endpoint.
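
In both cases, the update comes down to creating a new endpoint configuration whose `ServerlessConfig` either includes or omits `ProvisionedConcurrency`. The following is a sketch of that workflow; the helper function and the `AllTraffic` variant name mirror the earlier examples, and the endpoint, model, and config names are placeholders:

```python
def serverless_config(memory_mb=2048, max_concurrency=20, provisioned_concurrency=None):
    # Omit the ProvisionedConcurrency key entirely for an on-demand endpoint.
    config = {"MemorySizeInMB": memory_mb, "MaxConcurrency": max_concurrency}
    if provisioned_concurrency is not None:
        config["ProvisionedConcurrency"] = provisioned_concurrency
    return config

def remove_provisioned_concurrency(endpoint_name, model_name, new_config_name):
    # Not invoked here; creates an on-demand config and applies it to the endpoint.
    import boto3

    client = boto3.client("sagemaker")
    client.create_endpoint_config(
        EndpointConfigName=new_config_name,
        ProductionVariants=[{
            "ModelName": model_name,
            "VariantName": "AllTraffic",
            "ServerlessConfig": serverless_config(),  # no ProvisionedConcurrency key
        }],
    )
    client.update_endpoint(EndpointName=endpoint_name, EndpointConfigName=new_config_name)
```

To add or adjust Provisioned Concurrency instead, pass the desired value, for example `serverless_config(provisioned_concurrency=10)`.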

**Note**  
Updating a real-time inference endpoint to either an on-demand serverless endpoint or a serverless endpoint with Provisioned Concurrency is currently not supported.

## Update the endpoint


After creating a new serverless endpoint configuration you can use the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) or the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/) to update an existing serverless endpoint. Examples of how to update your endpoint using the AWS SDK for Python (Boto3) and the SageMaker AI console are outlined in the following sections.

### To update the endpoint (using Boto3)


The following example uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to call the [update_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker/client/update_endpoint.html) method. Specify at least the following parameters when calling the method:
+ For `EndpointName`, use the name of the endpoint you’re updating.
+ For `EndpointConfigName`, use the name of the endpoint configuration that you want to use for the update.

```
response = client.update_endpoint(
    EndpointName="<your-endpoint-name>",
    EndpointConfigName="<new-endpoint-config>",
)
```

### To update the endpoint (using the console)


1. Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. In the navigation tab, choose **Inference**.

1. Next, choose **Endpoints**.

1. From the list of endpoints, select the endpoint you want to update.

1. Choose **Change** in the **Endpoint configuration settings** section.

1. For **Change the Endpoint configuration**, choose **Use an existing endpoint configuration**.

1. From the list of endpoint configurations, select the one you want to use for your update.

1. Choose **Select endpoint configuration**.

1. Choose **Update endpoint**.

# Describe a serverless endpoint


You might want to retrieve information about your endpoint, including details such as the endpoint’s ARN, current status, deployment configuration, and failure reasons. You can find information about your endpoint using the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/home), the [DescribeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API, or the AWS CLI.

## To describe an endpoint (using API)


The following example uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#id309) to call the [DescribeEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeEndpoint.html) API. For `EndpointName`, use the name of the endpoint you want to check.

```
response = client.describe_endpoint(
    EndpointName="<your-endpoint-name>",
)
```
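
Two fields you will typically check in the response are `EndpointStatus` and, when the status is `Failed`, `FailureReason`. The following sketch summarizes a `DescribeEndpoint` response; the sample dict is illustrative, not real API output:

```python
def summarize_endpoint(description):
    # Report the fields most useful when monitoring a deployment.
    status = description["EndpointStatus"]
    if status == "Failed":
        return f"Failed: {description.get('FailureReason', 'no reason reported')}"
    return status

sample = {"EndpointName": "my-endpoint", "EndpointStatus": "InService"}
print(summarize_endpoint(sample))
```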

## To describe an endpoint (using the console)


1. Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/home).

1. In the navigation tab, choose **Inference**.

1. Next, choose **Endpoints**.

1. From the list of endpoints, choose the endpoint you want to check.

The endpoint page contains the information about your endpoint.

# Delete a serverless endpoint


You can delete your serverless endpoint using the [SageMaker AI console](https://console.aws.amazon.com/sagemaker/home), the [DeleteEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpoint.html) API, or the AWS CLI. The following examples show you how to delete your endpoint through the API and the SageMaker AI console.

## To delete an endpoint (using API)


The following example uses the [AWS SDK for Python (Boto3)](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) to call the [DeleteEndpoint](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeleteEndpoint.html) API. For `EndpointName`, use the name of the serverless endpoint you want to delete.

```
response = client.delete_endpoint(
    EndpointName="<your-endpoint-name>",
)
```

## To delete an endpoint (using the console)


1. Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/home).

1. In the navigation tab, choose **Inference**.

1. Next, choose **Endpoints**.

1. From the list of endpoints, select the endpoint you want to delete.

1. Choose the **Actions** drop-down list, and then choose **Delete**.

1. When prompted again, choose **Delete**.

Your endpoint should now begin the deletion process.

# Alarms and logs for tracking metrics from serverless endpoints
Alarms and logs

To monitor your serverless endpoint, you can use Amazon CloudWatch alarms. CloudWatch is a service that collects metrics in real time from your AWS applications and resources. An alarm watches metrics as they are collected and gives you the ability to pre-specify a threshold and the actions to take if that threshold is breached. For example, your CloudWatch alarm can send you a notification if your endpoint breaches an error threshold. By setting up CloudWatch alarms, you gain visibility into the performance and functionality of your endpoint. For more information about CloudWatch alarms, see [Using Amazon CloudWatch alarms](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html) in the *Amazon CloudWatch User Guide*.
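
As an illustration, the following sketch builds the parameters for a CloudWatch alarm on the `Invocations5XXErrors` endpoint metric using boto3's `put_metric_alarm`; the alarm name, threshold, period, and `AllTraffic` variant name are assumptions you would tune for your workload:

```python
def alarm_params(endpoint_name, variant_name="AllTraffic", threshold=5.0):
    # Alarm when the endpoint returns server-side errors within a 5-minute window.
    return {
        "AlarmName": f"{endpoint_name}-5xx-errors",
        "Namespace": "AWS/SageMaker",
        "MetricName": "Invocations5XXErrors",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Sum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }

def create_alarm(endpoint_name):
    # Not invoked here; calling it creates the alarm in your account.
    import boto3

    boto3.client("cloudwatch").put_metric_alarm(**alarm_params(endpoint_name))
```

You can attach an SNS topic to the alarm's `AlarmActions` to receive notifications when the threshold is breached.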

## Monitoring with CloudWatch


The metrics below are an exhaustive list of metrics for serverless endpoints. Any metric not listed below is not published for serverless endpoints. For information about the following metrics, see [Monitor Amazon SageMaker AI with Amazon CloudWatch](https://docs.aws.amazon.com/sagemaker/latest/dg/monitoring-cloudwatch.html).

### Common endpoint metrics


These CloudWatch metrics are the same as the metrics published for real-time endpoints.

The `OverheadLatency` metric tracks the additional latency that SageMaker AI adds, which includes the cold start time for launching new compute resources for your serverless endpoint. `OverheadLatency` is generally significantly lower for serverless endpoints with Provisioned Concurrency than for on-demand serverless endpoints.

Serverless endpoints can also use the `Invocations4XXErrors`, `Invocations5XXErrors`, `Invocations`, `ModelLatency`, `ModelSetupTime` and `MemoryUtilization` metrics. To learn more about these metrics, see [SageMaker AI endpoint invocation metrics](monitoring-cloudwatch.md#cloudwatch-metrics-endpoint-invocation).

### Common serverless endpoint metrics


These CloudWatch metrics are published for both on-demand serverless endpoints and serverless endpoints with Provisioned Concurrency.


| Metric Name | Description | Unit/Stats | 
| --- | --- | --- | 
| ServerlessConcurrentExecutionsUtilization | The number of concurrent executions divided by the maximum concurrency. | Units: None. Valid statistics: Average, Max, Min |

### Serverless endpoint with Provisioned Concurrency metrics


These CloudWatch metrics are published for serverless endpoints with Provisioned Concurrency.


| Metric Name | Description | Unit/Stats | 
| --- | --- | --- | 
| ServerlessProvisionedConcurrencyExecutions | The number of concurrent executions handled by the endpoint. | Units: Count. Valid statistics: Average, Max, Min |
| ServerlessProvisionedConcurrencyUtilization | The number of concurrent executions divided by the allocated Provisioned Concurrency. | Units: None. Valid statistics: Average, Max, Min |
| ServerlessProvisionedConcurrencyInvocations | The number of InvokeEndpoint requests handled by Provisioned Concurrency. | Units: Count. Valid statistics: Average, Max, Min |
| ServerlessProvisionedConcurrencySpilloverInvocations | The number of InvokeEndpoint requests not handled by Provisioned Concurrency and instead handled by on-demand Serverless Inference. | Units: Count. Valid statistics: Average, Max, Min |

## Logs


If you want to monitor the logs from your endpoint for debugging or progress analysis, you can use Amazon CloudWatch Logs. The SageMaker AI-provided log group that you can use for serverless endpoints is `/aws/sagemaker/Endpoints/[EndpointName]`. For more information about using CloudWatch Logs in SageMaker AI, see [CloudWatch Logs for Amazon SageMaker AI](logging-cloudwatch.md). To learn more about CloudWatch Logs, see [What is Amazon CloudWatch Logs?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) in the *Amazon CloudWatch Logs User Guide*.
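
A short sketch of pulling recent events from that log group with boto3 follows; the endpoint name is a placeholder, and note that `filter_log_events` returns at most one page per call, so production code would follow `nextToken`:

```python
def log_group_for(endpoint_name):
    # SageMaker AI writes serverless endpoint logs under this log group.
    return f"/aws/sagemaker/Endpoints/{endpoint_name}"

def print_recent_events(endpoint_name):
    # Not invoked here; calling it prints one page of log events.
    import boto3

    logs = boto3.client("logs")
    response = logs.filter_log_events(logGroupName=log_group_for(endpoint_name))
    for event in response["events"]:
        print(event["timestamp"], event["message"])
```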

# Automatically scale Provisioned Concurrency for a serverless endpoint


 Amazon SageMaker AI automatically scales in or out on-demand serverless endpoints. For serverless endpoints with Provisioned Concurrency you can use Application Auto Scaling to scale up or down the Provisioned Concurrency based on your traffic profile, thus optimizing costs. 

 To autoscale Provisioned Concurrency on serverless endpoints, complete the following steps: 
+ [Register a model](#serverless-endpoints-autoscale-register)
+ [Define a scaling policy](#serverless-endpoints-autoscale-define)
+ [Apply a scaling policy](#serverless-endpoints-autoscale-apply)

 Before you can use autoscaling, you must have already deployed a model to a serverless endpoint with Provisioned Concurrency. Deployed models are referred to as [production variants](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_ProductionVariant.html). See [Create an endpoint configuration](serverless-endpoints-create-config.md) and [Create an endpoint](serverless-endpoints-create-endpoint.md) for more information about deploying a model to a serverless endpoint with Provisioned Concurrency. To specify the metrics and target values for a scaling policy, you must configure a scaling policy. For more information on how to define a scaling policy, see [Define a scaling policy](#serverless-endpoints-autoscale-define). After registering your model and defining a scaling policy, apply the scaling policy to the registered model. For information on how to apply the scaling policy, see [Apply a scaling policy](#serverless-endpoints-autoscale-apply). 

 For details on other prerequisites and components used with autoscaling, see the [Auto scaling prerequisites](endpoint-auto-scaling-prerequisites.md) section in the [SageMaker AI autoscaling documentation](endpoint-auto-scaling.md). 

## Register a model


 To add autoscaling to a serverless endpoint with Provisioned Concurrency, you first must register your model (production variant) using AWS CLI or Application Auto Scaling API. 

### Register a model (AWS CLI)


 To register your model, use the `register-scalable-target` AWS CLI command with the following parameters: 
+  `--service-namespace` – Set this value to `sagemaker`. 
+  `--resource-id` – The resource identifier for the model (specifically the production variant). For this parameter, the resource type is `endpoint` and the unique identifier is the name of the production variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `--scalable-dimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 
+  `--min-capacity` – The minimum number of Provisioned Concurrency for the model. Set `--min-capacity` to at least 1. It must be equal to or less than the value specified for `--max-capacity`. 
+  `--max-capacity` – The maximum number of Provisioned Concurrency that should be enabled through Application Auto Scaling. Set `--max-capacity` to a minimum of 1. It must be greater than or equal to the value specified for `--min-capacity`. 

 The following example shows how to register a model named `MyVariant` that is dynamically scaled to a Provisioned Concurrency value between 1 and 10: 

```
aws application-autoscaling register-scalable-target \
    --service-namespace sagemaker \
    --scalable-dimension sagemaker:variant:DesiredProvisionedConcurrency \
    --resource-id endpoint/MyEndpoint/variant/MyVariant \
    --min-capacity 1 \
    --max-capacity 10
```
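
If you prefer the SDK to the CLI, the same registration can be sketched with boto3's Application Auto Scaling client; the names and capacities mirror the CLI example above:

```python
def variant_resource_id(endpoint_name, variant_name):
    # Resource IDs for production variants follow the endpoint/<name>/variant/<name> form.
    return f"endpoint/{endpoint_name}/variant/{variant_name}"

def register_variant(endpoint_name, variant_name, min_capacity=1, max_capacity=10):
    # Not invoked here; calling it registers the variant as a scalable target.
    import boto3

    autoscaling = boto3.client("application-autoscaling")
    autoscaling.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=variant_resource_id(endpoint_name, variant_name),
        ScalableDimension="sagemaker:variant:DesiredProvisionedConcurrency",
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )
```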

### Register a model (Application Auto Scaling API)


 To register your model, use the `RegisterScalableTarget` Application Auto Scaling API action with the following parameters: 
+  `ServiceNamespace` – Set this value to `sagemaker`. 
+  `ResourceId` – The resource identifier for the model (specifically the production variant). For this parameter, the resource type is `endpoint` and the unique identifier is the name of the production variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `ScalableDimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 
+  `MinCapacity` – The minimum number of Provisioned Concurrency for the model. Set `MinCapacity` to at least 1. It must be equal to or less than the value specified for `MaxCapacity`. 
+  `MaxCapacity` – The maximum number of Provisioned Concurrency that should be enabled through Application Auto Scaling. Set `MaxCapacity` to a minimum of 1. It must be greater than or equal to the value specified for `MinCapacity`. 

 The following example shows how to register a model named `MyVariant` that is dynamically scaled to a Provisioned Concurrency value between 1 and 10: 

```
POST / HTTP/1.1
Host: autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.RegisterScalableTarget
X-Amz-Date: 20160506T182145Z
User-Agent: aws-cli/1.10.23 Python/2.7.11 Darwin/15.4.0 botocore/1.4.8
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/MyEndPoint/variant/MyVariant",
    "ScalableDimension": "sagemaker:variant:DesiredProvisionedConcurrency",
    "MinCapacity": 1,
    "MaxCapacity": 10
}
```

## Define a scaling policy


 To specify the metrics and target values for a scaling policy, you can configure a target-tracking scaling policy. Define the scaling policy as a JSON block in a text file. You can then use that text file when invoking the AWS CLI or the Application Auto Scaling API. To quickly define a target-tracking scaling policy for a serverless endpoint, use the `SageMakerVariantProvisionedConcurrencyUtilization` predefined metric. 

```
{
    "TargetValue": 0.5,
    "PredefinedMetricSpecification": 
    {
        "PredefinedMetricType": "SageMakerVariantProvisionedConcurrencyUtilization"
    },
    "ScaleOutCooldown": 1,
    "ScaleInCooldown": 1
}
```

## Apply a scaling policy


 After registering your model, you can apply a scaling policy to your serverless endpoint with Provisioned Concurrency. See [Apply a target-tracking scaling policy](#serverless-endpoints-autoscale-apply-target) to apply a target-tracking scaling policy that you have defined. If the traffic flow to your serverless endpoint has a predictable routine then instead of applying a target-tracking scaling policy you might want to schedule scaling actions at specific times. For more information on scheduling scaling actions, see [Scheduled scaling](#serverless-endpoints-autoscale-apply-scheduled). 

### Apply a target-tracking scaling policy


 You can use the AWS Management Console, AWS CLI or the Application Auto Scaling API to apply a target-tracking scaling policy to your serverless endpoint with Provisioned Concurrency. 

#### Apply a target-tracking scaling policy (AWS CLI)


 To apply a scaling policy to your model, use the `put-scaling-policy` AWS CLI command with the following parameters: 
+  `--policy-name` – The name of the scaling policy. 
+  `--policy-type` – Set this value to `TargetTrackingScaling`. 
+  `--resource-id` – The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `--service-namespace` – Set this value to `sagemaker`. 
+  `--scalable-dimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 
+  `--target-tracking-scaling-policy-configuration` – The target-tracking scaling policy configuration to use for the model. 

 The following example shows how to apply a target-tracking scaling policy named `MyScalingPolicy` to a model named `MyVariant`. The policy configuration is saved in a file named `scaling-policy.json`. 

```
aws application-autoscaling put-scaling-policy \
    --policy-name MyScalingPolicy \
    --policy-type TargetTrackingScaling \
    --service-namespace sagemaker \
    --scalable-dimension sagemaker:variant:DesiredProvisionedConcurrency \
    --resource-id endpoint/MyEndpoint/variant/MyVariant \
    --target-tracking-scaling-policy-configuration file://[file-location]/scaling-policy.json
```
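
The same call can be sketched from Python with boto3, passing the policy configuration as a dict rather than a JSON file; the policy and resource names mirror the CLI example above:

```python
# Target-tracking configuration using the predefined serverless utilization metric.
POLICY_CONFIG = {
    "TargetValue": 0.5,
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "SageMakerVariantProvisionedConcurrencyUtilization"
    },
}

def apply_policy():
    # Not invoked here; calling it attaches the target-tracking policy.
    import boto3

    boto3.client("application-autoscaling").put_scaling_policy(
        PolicyName="MyScalingPolicy",
        ServiceNamespace="sagemaker",
        ResourceId="endpoint/MyEndpoint/variant/MyVariant",
        ScalableDimension="sagemaker:variant:DesiredProvisionedConcurrency",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=POLICY_CONFIG,
    )
```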

#### Apply a target-tracking scaling policy (Application Auto Scaling API)


 To apply a scaling policy to your model, use the `PutScalingPolicy` Application Auto Scaling API action with the following parameters: 
+  `PolicyName` – The name of the scaling policy. 
+  `PolicyType` – Set this value to `TargetTrackingScaling`. 
+  `ResourceId` – The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `ServiceNamespace` – Set this value to `sagemaker`. 
+  `ScalableDimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 
+  `TargetTrackingScalingPolicyConfiguration` – The target-tracking scaling policy configuration to use for the model. 

 The following example shows how to apply a target-tracking scaling policy named `MyScalingPolicy` to a model named `MyVariant`. The policy configuration is saved in a file named `scaling-policy.json`. 

```
POST / HTTP/1.1
Host: autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.PutScalingPolicy
X-Amz-Date: 20160506T182145Z
User-Agent: aws-cli/1.10.23 Python/2.7.11 Darwin/15.4.0 botocore/1.4.8
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "PolicyName": "MyScalingPolicy",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/MyEndpoint/variant/MyVariant",
    "ScalableDimension": "sagemaker:variant:DesiredProvisionedConcurrency",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": 
    {
        "TargetValue": 0.5,
        "PredefinedMetricSpecification": 
        {
            "PredefinedMetricType": "SageMakerVariantProvisionedConcurrencyUtilization"
        }
    }
}
```

#### Apply a target-tracking scaling policy (AWS Management Console)


 To apply a target-tracking scaling policy with the AWS Management Console: 

1.  Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/). 

1.  In the navigation panel, choose **Inference**. 

1.  Choose **Endpoints** to view a list of all of your endpoints. 

1.  Choose the endpoint to which you want to apply the scaling policy. A page with the settings of the endpoint appears, with the models (production variants) listed under the **Endpoint runtime settings** section. 

1.  Select the production variant to which you want to apply the scaling policy, and choose **Configure auto scaling**. The **Configure variant automatic scaling** dialog box appears.   
![\[Screenshot of the configure variant automatic scaling dialog box in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/serverless-endpoints-variant-autoscaling.png)

1.  Enter the minimum and maximum Provisioned Concurrency values in the **Minimum provisioned concurrency** and **Maximum provisioned concurrency** fields, respectively, in the **Variant automatic scaling** section. Minimum Provisioned Concurrency must be less than or equal to maximum Provisioned Concurrency. 

1.  Enter the target value in the **Target value** field for the target metric, `SageMakerVariantProvisionedConcurrencyUtilization`. 

1.  (Optional) Enter scale-in and scale-out cooldown values (in seconds) in the **Scale in cool down** and **Scale out cool down** fields, respectively. 

1.  (Optional) Select **Disable scale in** if you don't want auto scaling to reduce Provisioned Concurrency when traffic decreases. 

1.  Select **Save**. 

### Scheduled scaling


 If the traffic to your serverless endpoint with Provisioned Concurrency follows a routine pattern you might want to schedule scaling actions at specific times, to scale in or scale out Provisioned Concurrency. You can use the AWS CLI or the Application Auto Scaling to schedule scaling actions. 

#### Scheduled scaling (AWS CLI)


 To schedule a scaling action for your model, use the `put-scheduled-action` AWS CLI command with the following parameters: 
+  `--scheduled-action-name` – The name of the scaling action. 
+  `--schedule` – A cron expression that specifies the start and end times of the scaling action with a recurring schedule. 
+  `--resource-id` – The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `--service-namespace` – Set this value to `sagemaker`. 
+  `--scalable-dimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 
+  `--scalable-target-action` – The target of the scaling action. 

 The following example shows how to add a scaling action named `MyScalingAction` to a model named `MyVariant` on a recurring schedule. On the specified schedule (every day at 12:15 PM UTC), if the current Provisioned Concurrency is below the value specified for `MinCapacity`, Application Auto Scaling scales out the Provisioned Concurrency to the value specified by `MinCapacity`. 

```
aws application-autoscaling put-scheduled-action \
    --scheduled-action-name 'MyScalingAction' \
    --schedule 'cron(15 12 * * ? *)' \
    --service-namespace sagemaker \
    --resource-id endpoint/MyEndpoint/variant/MyVariant \
    --scalable-dimension sagemaker:variant:DesiredProvisionedConcurrency \
    --scalable-target-action 'MinCapacity=10'
```
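
The equivalent call can be sketched with boto3; note that `ScalableTargetAction` is a structure with `MinCapacity`/`MaxCapacity` fields, not a string:

```python
def daily_utc_cron(hour, minute):
    # Cron expression for a daily action at the given UTC time.
    return f"cron({minute} {hour} * * ? *)"

def schedule_min_capacity(min_capacity=10):
    # Not invoked here; calling it registers the recurring scaling action.
    import boto3

    boto3.client("application-autoscaling").put_scheduled_action(
        ScheduledActionName="MyScalingAction",
        ServiceNamespace="sagemaker",
        ResourceId="endpoint/MyEndpoint/variant/MyVariant",
        ScalableDimension="sagemaker:variant:DesiredProvisionedConcurrency",
        Schedule=daily_utc_cron(12, 15),
        ScalableTargetAction={"MinCapacity": min_capacity},
    )
```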

#### Scheduled scaling (Application Auto Scaling API)


 To schedule a scaling action for your model, use the `PutScheduledAction` Application Auto Scaling API action with the following parameters: 
+  `ScheduledActionName` – The name of the scaling action. 
+  `Schedule` – A cron expression that specifies the start and end times of the scaling action with a recurring schedule. 
+  `ResourceId` – The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `ServiceNamespace` – Set this value to `sagemaker`. 
+  `ScalableDimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 
+  `ScalableTargetAction` – The target of the scaling action. 

 The following example shows how to add a scaling action named `MyScalingAction` to a model named `MyVariant` on a recurring schedule. On the specified schedule (every day at 12:15 PM UTC), if the current Provisioned Concurrency is below the value specified for `MinCapacity`, Application Auto Scaling scales out the Provisioned Concurrency to the value specified by `MinCapacity`. 

```
POST / HTTP/1.1
Host: autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.PutScheduledAction
X-Amz-Date: 20160506T182145Z
User-Agent: aws-cli/1.10.23 Python/2.7.11 Darwin/15.4.0 botocore/1.4.8
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "ScheduledActionName": "MyScalingAction",
    "Schedule": "cron(15 12 * * ? *)",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/MyEndpoint/variant/MyVariant",
    "ScalableDimension": "sagemaker:variant:DesiredProvisionedConcurrency",
    "ScalableTargetAction": "MinCapacity=10"
        }
    }
}
```

# Clean up


 After you have finished using autoscaling for your serverless endpoint with Provisioned Concurrency, you should clean up the resources you created. This involves deleting the scaling policy and deregistering the model from Application Auto Scaling. Cleaning up ensures that you don't incur unnecessary costs for resources you're no longer using. 

## Delete a scaling policy


 You can delete a scaling policy with the AWS Management Console, the AWS CLI, or the Application Auto Scaling API. For more information on deleting a scaling policy with the AWS Management Console, see [Delete a scaling policy](endpoint-auto-scaling-delete.md) in the [SageMaker AI autoscaling documentation](endpoint-auto-scaling.md). 

### Delete a scaling policy (AWS CLI)


 To delete a scaling policy from your model, use the `delete-scaling-policy` AWS CLI command with the following parameters: 
+  `--policy-name` – The name of the scaling policy. 
+  `--resource-id` – The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `--service-namespace` – Set this value to `sagemaker`. 
+  `--scalable-dimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 

 The following example deletes a scaling policy named `MyScalingPolicy` from a model named `MyVariant`. 

```
aws application-autoscaling delete-scaling-policy \
    --policy-name MyScalingPolicy \
    --service-namespace sagemaker \
    --scalable-dimension sagemaker:variant:DesiredProvisionedConcurrency \
    --resource-id endpoint/MyEndpoint/variant/MyVariant
```
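To confirm that the policy was removed, you can list the remaining scaling policies for the variant. Using the same resource identifier as above, the `ScalingPolicies` list in the response should be empty:

```
# Verify that no scaling policies remain on the variant
aws application-autoscaling describe-scaling-policies \
    --service-namespace sagemaker \
    --resource-id endpoint/MyEndpoint/variant/MyVariant
```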

### Delete a scaling policy (Application Auto Scaling API)


 To delete a scaling policy from your model, use the `DeleteScalingPolicy` Application Auto Scaling API action with the following parameters: 
+  `PolicyName` – The name of the scaling policy. 
+  `ResourceId` – The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `ServiceNamespace` – Set this value to `sagemaker`. 
+  `ScalableDimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 

 The following example uses the Application Auto Scaling API to delete a scaling policy named `MyScalingPolicy` from a model named `MyVariant`. 

```
POST / HTTP/1.1
Host: autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.DeleteScalingPolicy
X-Amz-Date: 20160506T182145Z
User-Agent: aws-cli/1.10.23 Python/2.7.11 Darwin/15.4.0 botocore/1.4.8
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "PolicyName": "MyScalingPolicy",
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/MyEndpoint/variant/MyVariant",
    "ScalableDimension": "sagemaker:variant:DesiredProvisionedConcurrency",
}
```

## Deregister a model


 You can deregister a model with the AWS Management Console, the AWS CLI, or the Application Auto Scaling API. 

### Deregister a model (AWS CLI)


 To deregister a model from Application Auto Scaling, use the `deregister-scalable-target` AWS CLI command with the following parameters: 
+  `--resource-id` – The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `--service-namespace` – Set this value to `sagemaker`. 
+  `--scalable-dimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 

 The following example deregisters a model named `MyVariant` from Application Auto Scaling. 

```
aws application-autoscaling deregister-scalable-target \
    --service-namespace sagemaker \
    --scalable-dimension sagemaker:variant:DesiredProvisionedConcurrency \
    --resource-id endpoint/MyEndpoint/variant/MyVariant
```
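To verify that the model is no longer registered, you can describe the scalable targets for the variant. Using the same resource identifier as above, the `ScalableTargets` list in the response should be empty:

```
# Verify that the variant is no longer a registered scalable target
aws application-autoscaling describe-scalable-targets \
    --service-namespace sagemaker \
    --resource-ids endpoint/MyEndpoint/variant/MyVariant
```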

### Deregister a model (Application Auto Scaling API)


 To deregister a model from Application Auto Scaling, use the `DeregisterScalableTarget` Application Auto Scaling API action with the following parameters: 
+  `ResourceId` – The resource identifier for the variant. For this parameter, the resource type is `endpoint` and the unique identifier is the name of the variant. For example `endpoint/MyEndpoint/variant/MyVariant`. 
+  `ServiceNamespace` – Set this value to `sagemaker`. 
+  `ScalableDimension` – Set this value to `sagemaker:variant:DesiredProvisionedConcurrency`. 

 The following example uses the Application Auto Scaling API to deregister a model named `MyVariant` from Application Auto Scaling. 

```
POST / HTTP/1.1
Host: autoscaling.us-east-2.amazonaws.com
Accept-Encoding: identity
X-Amz-Target: AnyScaleFrontendService.DeregisterScalableTarget
X-Amz-Date: 20160506T182145Z
User-Agent: aws-cli/1.10.23 Python/2.7.11 Darwin/15.4.0 botocore/1.4.8
Content-Type: application/x-amz-json-1.1
Authorization: AUTHPARAMS

{
    "ServiceNamespace": "sagemaker",
    "ResourceId": "endpoint/MyEndpoint/variant/MyVariant",
    "ScalableDimension": "sagemaker:variant:DesiredProvisionedConcurrency",
}
```

### Deregister a model (AWS Management Console)


 To deregister a model (production variant) with the AWS Management Console: 

1.  Open the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/). 

1.  In the navigation pane, choose **Inference**. 

1.  Choose **Endpoints** to view a list of your endpoints. 

1.  Choose the serverless endpoint hosting the production variant. A page with the settings of the endpoint appears, with the production variants listed in the **Endpoint runtime settings** section. 

1.  Select the production variant that you want to deregister, and choose **Configure auto scaling**. The **Configure variant automatic scaling** dialog box appears. 

1.  Choose **Deregister auto scaling**. 

# Troubleshooting


**Important**  
Custom IAM policies that allow Amazon SageMaker Studio or Amazon SageMaker Studio Classic to create Amazon SageMaker resources must also grant permissions to add tags to those resources. The permission to add tags to resources is required because Studio and Studio Classic automatically tag any resources they create. If an IAM policy allows Studio and Studio Classic to create resources but does not allow tagging, "AccessDenied" errors can occur when trying to create resources. For more information, see [Provide permissions for tagging SageMaker AI resources](security_iam_id-based-policy-examples.md#grant-tagging-permissions).  
[AWS managed policies for Amazon SageMaker AI](security-iam-awsmanpol.md) that give permissions to create SageMaker resources already include permissions to add tags while creating those resources.

If you are having trouble with Serverless Inference, refer to the following troubleshooting tips.

## Container issues


If the container you use for a serverless endpoint is the same one you used on an instance-based endpoint, your container may not have permissions to write files. You can identify this issue by the following symptoms:
+ Your serverless endpoint fails to create or update due to a ping health check failure.
+ The Amazon CloudWatch logs for the endpoint show that the container is failing to write to a file or directory due to a permissions error.

To fix this issue, you can try to add read, write, and execute permissions for `other` on the file or directory and then rebuild the container. You can perform the following steps to complete this process:

1. In the Dockerfile you used to build your container, add the following command: `RUN chmod o+rwX <file or directory name>`

1. Rebuild the container.

1. Upload the new container image to Amazon ECR.

1. Try to create or update the serverless endpoint again.
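As a sketch of the rebuild and upload steps, assuming a hypothetical image named `my-inference-image` and placeholder account ID and Region values, the process might look like the following:

```
# Rebuild the container image with the updated Dockerfile
docker build -t my-inference-image .

# Authenticate Docker to Amazon ECR (account ID and Region are placeholders)
aws ecr get-login-password --region us-east-2 | \
    docker login --username AWS --password-stdin 111122223333.dkr.ecr.us-east-2.amazonaws.com

# Tag and push the image to your ECR repository
docker tag my-inference-image:latest \
    111122223333.dkr.ecr.us-east-2.amazonaws.com/my-inference-image:latest
docker push 111122223333.dkr.ecr.us-east-2.amazonaws.com/my-inference-image:latest
```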