AGENTPERF07-BP02 Implement tenant-aware performance isolation and throttling
Trust in a shared agent service is built through consistent, predictable performance for every tenant, even during demand spikes. In pooled multitenant deployments, effective isolation requires throttling at multiple layers (API, inference, memory, and tools), monitoring per-tenant resource consumption, and adaptive fairness mechanisms that distribute shared resources equitably based on current load.
Desired outcome:
- You have per-tenant throttling enforced at every shared resource layer.
- You have tenant resource consumption monitored in real time with alerts for tenants approaching their limits.
- You have graceful throttling that provides clear feedback to throttled tenants.
- You have performance isolation validated through regular load testing that simulates noisy neighbor scenarios.
Common anti-patterns:
- Applying throttling only at the API gateway layer without enforcing limits at downstream shared resources, letting tenants bypass API-level limits through long-running operations.
- Using static throttling limits that don't adapt to current system load, wasting available capacity during low-load periods or failing to protect isolation during high-load periods.
- Throttling all tenants equally regardless of their service tier, failing to honor premium SLAs.
Benefits of establishing this best practice:
- Multi-layer throttling distributes shared resources fairly across tenants.
- Real-time per-tenant consumption metrics support proactive management.
- Per-tenant performance monitoring validates SLA compliance.
Level of risk exposed if this best practice is not established: High
Implementation guidance
Multi-layer throttling enforces tenant limits at every shared resource. At the API layer, Amazon API Gateway usage plans with per-tenant API keys enforce rate and burst limits at ingress. Adaptive throttling adjusts limits based on current system load: during low-load periods, tenants can burst above their baseline limits to use available capacity, and during high-load periods, strict limits protect isolation. Per-tenant Amazon CloudWatch metrics and alarms track consumption approaching limits and latency exceeding SLA thresholds.
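The adaptive behavior described above can be sketched as a per-tenant token bucket whose refill rate and capacity scale with system headroom. This is a minimal illustration, not a prescribed implementation: the class and method names (`AdaptiveThrottle`, `try_acquire`, `boost`), the `max_boost` multiplier, the `high_load` threshold of 0.8, and the idea of passing load in as a 0.0 to 1.0 utilization figure are all assumptions made for the example.

```python
import time
from dataclasses import dataclass


@dataclass
class Decision:
    allowed: bool
    retry_after_seconds: float  # 0.0 when the request is allowed


@dataclass
class _Bucket:
    rate: float     # baseline refill rate, requests per second
    burst: float    # baseline bucket capacity, requests
    tokens: float   # tokens currently available
    updated: float  # clock reading at the last refill


class AdaptiveThrottle:
    """Per-tenant token buckets whose limits grow when the system has headroom.

    Hypothetical sketch: max_boost and high_load are illustrative defaults.
    """

    def __init__(self, max_boost=3.0, high_load=0.8, clock=time.monotonic):
        self.max_boost = max_boost  # multiplier applied when the system is idle
        self.high_load = high_load  # utilization at which strict limits apply
        self.clock = clock
        self._buckets = {}

    def register(self, tenant_id, rate, burst):
        # Each tenant starts with a full baseline bucket.
        self._buckets[tenant_id] = _Bucket(rate, burst, tokens=burst, updated=self.clock())

    def boost(self, load):
        # 1.0 at or above high_load (strict), rising linearly to max_boost at idle.
        load = min(max(load, 0.0), 1.0)
        headroom = max(0.0, 1.0 - load / self.high_load)
        return 1.0 + (self.max_boost - 1.0) * headroom

    def try_acquire(self, tenant_id, load):
        bucket = self._buckets[tenant_id]
        factor = self.boost(load)
        now = self.clock()
        refill = (now - bucket.updated) * bucket.rate * factor
        # Clamping to the current capacity removes banked burst once load rises.
        bucket.tokens = min(bucket.burst * factor, bucket.tokens + refill)
        bucket.updated = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return Decision(True, 0.0)
        wait = (1.0 - bucket.tokens) / (bucket.rate * factor)
        return Decision(False, wait)
```

One way to use the result: when `allowed` is false, the caller could return an HTTP 429 response with `retry_after_seconds` in a `Retry-After` header, which gives throttled tenants clear feedback. Different baseline `rate` and `burst` values per tenant are one way to reflect service tiers.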
Implementation steps
- Define per-tenant throttling limits for each resource layer: Set per-tenant limits for API requests per second, concurrent inference calls, memory storage quota, and tool invocations per minute.
- Implement API Gateway usage plans with per-tenant API keys and rate/burst limits: Use Amazon API Gateway usage plans with per-tenant API keys to enforce rate and burst limits at ingress.
- Deploy tenant-aware inference queuing with per-tenant concurrency limits: Queue Amazon Bedrock inference calls per tenant so no single tenant can consume all inference capacity.
- Configure adaptive throttling that adjusts limits based on current system load: Allow bursts during low-load periods and enforce strict limits during high-load periods to protect isolation.
- Create per-tenant CloudWatch dashboards and configure SLA-based alarms: Publish per-tenant metrics in Amazon CloudWatch and alarm on consumption approaching limits or latency exceeding SLA thresholds.
- Establish regular noisy neighbor load testing to validate isolation effectiveness: Schedule noisy neighbor load tests that simulate high-load scenarios for individual tenants and verify others stay within SLA.
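The inference queuing and noisy neighbor testing steps can be illustrated together with a small sketch. It assumes an asyncio-based service and caps in-flight inference calls per tenant with one semaphore each; excess calls wait in that tenant's own queue. `TenantInferenceGate`, `fake_inference`, and `simulate_noisy_neighbor` are hypothetical names, and `fake_inference` is a stand-in for a real Amazon Bedrock call, which is not made here.

```python
import asyncio
from collections import defaultdict


class TenantInferenceGate:
    """Caps in-flight inference calls per tenant (hypothetical sketch)."""

    def __init__(self, limits, default_limit=1):
        self._limits = dict(limits)    # tenant_id -> max concurrent calls
        self._default = default_limit  # applied to tenants without an entry
        self._semaphores = {}
        self.in_flight = defaultdict(int)
        self.peak = defaultdict(int)   # highest concurrency observed per tenant

    def _semaphore(self, tenant_id):
        # Created lazily so the semaphore belongs to the running event loop.
        if tenant_id not in self._semaphores:
            limit = self._limits.get(tenant_id, self._default)
            self._semaphores[tenant_id] = asyncio.Semaphore(limit)
        return self._semaphores[tenant_id]

    async def run(self, tenant_id, call):
        # Calls beyond the tenant's limit wait here, not in a shared queue.
        async with self._semaphore(tenant_id):
            self.in_flight[tenant_id] += 1
            self.peak[tenant_id] = max(self.peak[tenant_id], self.in_flight[tenant_id])
            try:
                return await call()
            finally:
                self.in_flight[tenant_id] -= 1


async def fake_inference(prompt):
    # Placeholder for a model invocation; the delay is arbitrary.
    await asyncio.sleep(0.01)
    return f"echo:{prompt}"


async def simulate_noisy_neighbor():
    # One tenant floods the service while another sends a single request.
    gate = TenantInferenceGate({"noisy": 2, "quiet": 2})
    noisy = [gate.run("noisy", lambda i=i: fake_inference(f"n{i}")) for i in range(20)]
    quiet = [gate.run("quiet", lambda: fake_inference("q0"))]
    results = await asyncio.gather(*noisy, *quiet)
    return gate, results


if __name__ == "__main__":
    gate, results = asyncio.run(simulate_noisy_neighbor())
    print(dict(gate.peak), len(results))
```

The quiet tenant's request never queues behind the noisy tenant's backlog because each tenant waits on its own semaphore. This only protects shared capacity if the sum of per-tenant limits stays within what the inference backend can serve, so the limits themselves still need to be sized against real capacity and checked in load tests.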