Diagnose InvocationLatency increases using output tokens per second (OTPS)
The InvocationLatency metric reports the wall-clock time of an inference
request, from when the request is received to when the last output token is produced. By itself,
this metric cannot tell you why latency increased. The same elevated value
can result from two different conditions:
-
The model is generating tokens more slowly — a service-side throughput change.
-
The model is generating more tokens per request — a workload change such as a longer prompt, an updated system prompt, or a model update that produces longer responses.
Output tokens per second (OTPS) isolates the throughput component, so you can alarm on service-side degradation without producing false positives when output length grows.
Note
OTPS calculation requires the TimeToFirstToken metric, which Amazon Bedrock publishes
only for the streaming API operations ConverseStream and InvokeModelWithResponseStream. The procedures
in this section apply only to traffic on those operations.
How InvocationLatency, TimeToFirstToken, and OTPS relate
An inference request passes through two compute-bound stages on the model host:
-
Prefill. The model processes the entire input prompt in a single forward pass and produces the first output token. The duration of this stage scales primarily with input length and is the main driver of TimeToFirstToken.
-
Decode. The model generates each subsequent output token sequentially, one token per forward pass. The total time of this stage scales with the number of output tokens. Per-token decode time is fairly stable for a given model and host load, which is what makes OTPS a useful throughput signal.
These stages produce the following relationship between Amazon Bedrock runtime metrics:
InvocationLatency (ms) = TimeToFirstToken (ms) + (OutputTokenCount / OTPS) * 1000
Solving for OTPS gives the formula you can compute from published CloudWatch metrics:
OTPS = OutputTokenCount / (InvocationLatency - TimeToFirstToken) * 1000
A stable OTPS over time indicates that the model is generating at expected throughput,
even if InvocationLatency is elevated due to longer prompts or longer responses.
A drop in OTPS indicates a model-side throughput change, which is the signal you typically want
to alarm on.
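To make the relationship concrete, the following minimal Python sketch evaluates both formulas for three hypothetical requests. The function names and all sample numbers are illustrative assumptions, not values published by Amazon Bedrock.

```python
def invocation_latency_ms(ttft_ms: float, output_tokens: int, otps: float) -> float:
    """Forward formula: total latency from TTFT, output length, and throughput."""
    return ttft_ms + (output_tokens / otps) * 1000


def compute_otps(output_tokens: int, latency_ms: float, ttft_ms: float) -> float:
    """Inverse formula: output tokens per second during the decode stage."""
    decode_ms = latency_ms - ttft_ms
    if decode_ms <= 0:
        raise ValueError("InvocationLatency must exceed TimeToFirstToken")
    return output_tokens / decode_ms * 1000


# Hypothetical request A: 400 ms TTFT, 550 output tokens, 10,400 ms total.
baseline = compute_otps(550, 10_400, 400)        # about 55 tokens/s
# Request B: output length doubles, so latency rises, but throughput is unchanged.
longer_output = compute_otps(1100, 20_400, 400)  # about 55 tokens/s
# Request C: same output as A, but latency rises because decode is slower.
slower_decode = compute_otps(550, 14_150, 400)   # about 40 tokens/s

print(round(baseline, 1), round(longer_output, 1), round(slower_decode, 1))
```

Requests B and C both show elevated InvocationLatency, but only request C shows a lower OTPS, which is the case the alarm is meant to catch.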
Calculate OTPS using a CloudWatch metric math expression
You can graph OTPS in the CloudWatch console by combining published Amazon Bedrock runtime metrics in a
metric math expression. The metrics needed are InvocationLatency,
OutputTokenCount, and TimeToFirstToken, all in the
AWS/Bedrock namespace. For descriptions of these metrics, see Amazon Bedrock runtime metrics.
-
Open the CloudWatch console and choose Metrics, then All metrics.
-
Search for Bedrock and select the By ModelId dimension.
-
Select InvocationLatency, OutputTokenCount, and TimeToFirstToken for the model ID you want to monitor.
-
For each selected metric, set Statistic to p50 and Period to 5 minutes.
-
Choose Add math, then Start with empty expression.
-
Enter the following expression and label it OTPS. Adjust the metric IDs (m1, m2, m3) to match the IDs assigned to InvocationLatency, OutputTokenCount, and TimeToFirstToken in your selection.
m2 / (m1 - m3) * 1000
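The graph now shows OTPS per 5-minute window for the selected model, computed from the p50 of each input metric. Because the expression combines three separate medians, treat the result as an approximation of typical throughput rather than the exact median of per-request OTPS. You can use this metric math expression as the basis for an alarm.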
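If you prefer to retrieve the same series outside the console, the expression can be passed to the CloudWatch GetMetricData API. The following is a minimal sketch, not a complete tool: the helper names build_otps_queries and fetch_otps, the default Region, and the 3-hour window are assumptions for illustration, and pagination through NextToken is omitted.

```python
from datetime import datetime, timedelta, timezone


def build_otps_queries(model_id: str, period: int = 300, stat: str = "p50") -> list:
    """Build MetricDataQueries that mirror the expression m2 / (m1 - m3) * 1000."""

    def metric(query_id: str, name: str) -> dict:
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Bedrock",
                    "MetricName": name,
                    "Dimensions": [{"Name": "ModelId", "Value": model_id}],
                },
                "Period": period,
                "Stat": stat,
            },
            "ReturnData": False,  # input only; just the expression is returned
        }

    return [
        metric("m1", "InvocationLatency"),
        metric("m2", "OutputTokenCount"),
        metric("m3", "TimeToFirstToken"),
        {
            "Id": "otps",
            "Expression": "m2 / (m1 - m3) * 1000",
            "Label": "OTPS",
            "ReturnData": True,
        },
    ]


def fetch_otps(model_id: str, hours: int = 3, region: str = "us-east-1") -> list:
    """Return (timestamp, OTPS) pairs for the last `hours` hours.

    Requires AWS credentials with permission to call cloudwatch:GetMetricData.
    """
    import boto3  # imported lazily so the query builder works without the SDK

    cw = boto3.client("cloudwatch", region_name=region)
    end = datetime.now(timezone.utc)
    response = cw.get_metric_data(
        MetricDataQueries=build_otps_queries(model_id),
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        ScanBy="TimestampAscending",
    )
    # Only the "otps" query has ReturnData=True, so one result is expected.
    result = response["MetricDataResults"][0]
    return list(zip(result["Timestamps"], result["Values"]))
```

Calling fetch_otps with your model ID returns the same datapoints that the console graph plots, which can be useful for establishing a baseline before you choose an alarm threshold.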
Create a CloudWatch alarm on OTPS
Because OTPS is a metric math expression rather than a published metric, you alarm on it by creating a metric math alarm. Two patterns are useful, depending on whether you have an established throughput baseline.
Static threshold alarm
Use a static threshold alarm when you have an established baseline OTPS for your model, for example from benchmarking or historical traffic.
-
From the OTPS metric math expression created in the preceding procedure, choose the alarm icon to create an alarm.
-
For Threshold type, choose Static.
-
For the alarm condition, choose Lower than and enter your threshold. A common starting point is 80 percent of your expected baseline. For example, if your model typically achieves 55 tokens per second, set the threshold to 44 tokens per second.
-
Under Additional configuration, set the evaluation to 3 out of 5 datapoints breaching to reduce noise from transient dips.
-
Set the missing data treatment to Treat missing data as breaching if you want gaps to count as degradation, or Treat missing data as missing if missing data is expected during low-traffic periods.
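The threshold arithmetic and the 3 out of 5 rule in the preceding steps can be sketched in a few lines of Python. This is a simplified model for reasoning about the settings, not the CloudWatch evaluation algorithm: it ignores missing data handling, and the function names and sample datapoints are assumptions.

```python
def static_threshold(baseline_otps: float, fraction: float = 0.8) -> float:
    """Threshold as a fraction of the expected baseline (80 percent by default)."""
    return baseline_otps * fraction


def would_alarm(datapoints, threshold, evaluation_periods=5, datapoints_to_alarm=3):
    """Simplified M out of N check over the most recent datapoints."""
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value < threshold)
    return breaching >= datapoints_to_alarm


threshold = static_threshold(55)  # about 44 tokens/s for a 55 tokens/s baseline

# Hypothetical OTPS datapoints, one per 5-minute period, oldest first.
sustained_drop = [56, 43, 55, 42, 41]  # three of five below the threshold
transient_dip = [56, 43, 55, 54, 41]   # only two of five below the threshold

print(would_alarm(sustained_drop, threshold))
print(would_alarm(transient_dip, threshold))
```

The second series contains two low datapoints but does not reach the alarm condition, which is the noise reduction that the 3 out of 5 setting is intended to provide.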
Anomaly detection alarm
Use an anomaly detection alarm when workload patterns vary over time and you want the threshold to adapt automatically. Anomaly detection requires sufficient historical data (at least two weeks) to build an accurate model. For new deployments, start with a static threshold.
-
Create the alarm from the OTPS metric math expression as in the preceding procedure, but for Threshold type, choose Anomaly detection.
-
Choose Lower than the band. OTPS drops, not spikes, indicate degradation.
-
Set the anomaly detection threshold to 2 or 3 standard deviations. Lower values produce a more sensitive alarm.
-
Under Additional configuration, set the evaluation to 3 out of 5 datapoints breaching.
-
Set the missing data treatment as described in the static threshold procedure.
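For reference, an anomaly detection alarm on the same expression might be expressed through the PutMetricAlarm API roughly as follows. This is a sketch under assumptions: the helper name build_anomaly_alarm_kwargs, the alarm name, and the band ID are hypothetical, and you should verify in the CloudWatch documentation whether an anomaly detector for the math expression must be created separately before the alarm can evaluate.

```python
def build_anomaly_alarm_kwargs(model_id, alarm_name="Bedrock-OTPS-Anomaly", band_width=2):
    """Build put_metric_alarm arguments for a lower-band anomaly alarm on OTPS."""

    def metric(query_id, name):
        return {
            "Id": query_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Bedrock",
                    "MetricName": name,
                    "Dimensions": [{"Name": "ModelId", "Value": model_id}],
                },
                "Period": 300,
                "Stat": "p50",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": alarm_name,
        "Metrics": [
            metric("m1", "InvocationLatency"),
            metric("m2", "OutputTokenCount"),
            metric("m3", "TimeToFirstToken"),
            {
                "Id": "otps",
                "Expression": "m2 / (m1 - m3) * 1000",
                "Label": "OTPS",
                "ReturnData": True,
            },
            {
                # band_width is the number of standard deviations (2 or 3 in the steps above)
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(otps, {band_width})",
                "Label": "OTPS (expected)",
                "ReturnData": True,
            },
        ],
        "ThresholdMetricId": "band",  # compare against the band, not a static value
        "ComparisonOperator": "LessThanLowerThreshold",  # alarm on drops only
        "EvaluationPeriods": 5,
        "DatapointsToAlarm": 3,
        "TreatMissingData": "missing",
    }


kwargs = build_anomaly_alarm_kwargs("your-model-id")
# To create the alarm (requires AWS credentials and the boto3 package):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**kwargs)
```

Note that an anomaly detection alarm uses ThresholdMetricId in place of the Threshold parameter that a static alarm requires.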
Create the alarm programmatically with the AWS SDK for Python (Boto3)
The following Python example uses the AWS SDK for Python (Boto3) to create the static
threshold alarm described in the preceding section. Replace MODEL_ID,
OTPS_THRESHOLD, and AlarmActions with values appropriate for your
environment.
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

MODEL_ID = "us.anthropic.claude-sonnet-4-5-20250929-v1:0"
ALARM_NAME = "Bedrock-OTPS-Low"
OTPS_THRESHOLD = 44  # tokens/s; set to ~80% of your expected baseline

cw.put_metric_alarm(
    AlarmName=ALARM_NAME,
    AlarmDescription="Fires when Bedrock OTPS drops below threshold, indicating model-side throughput degradation.",
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Bedrock",
                    "MetricName": "InvocationLatency",
                    "Dimensions": [{"Name": "ModelId", "Value": MODEL_ID}],
                },
                "Period": 300,
                "Stat": "p50",
            },
            "ReturnData": False,
        },
        {
            "Id": "m2",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Bedrock",
                    "MetricName": "OutputTokenCount",
                    "Dimensions": [{"Name": "ModelId", "Value": MODEL_ID}],
                },
                "Period": 300,
                "Stat": "p50",
            },
            "ReturnData": False,
        },
        {
            "Id": "m3",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Bedrock",
                    "MetricName": "TimeToFirstToken",
                    "Dimensions": [{"Name": "ModelId", "Value": MODEL_ID}],
                },
                "Period": 300,
                "Stat": "p50",
            },
            "ReturnData": False,
        },
        {
            "Id": "otps",
            "Expression": "m2 / (m1 - m3) * 1000",
            "Label": "OTPS",
            "ReturnData": True,
        },
    ],
    ComparisonOperator="LessThanThreshold",
    Threshold=OTPS_THRESHOLD,
    EvaluationPeriods=5,
    DatapointsToAlarm=3,
    TreatMissingData="ignore",
    AlarmActions=[],  # add SNS ARN, for example "arn:aws:sns:us-east-1:123456789012:my-topic"
)

print(f"Alarm '{ALARM_NAME}' created.")