AGENTPERF07-BP02 Implement tenant-aware performance isolation and throttling - Agentic AI Lens


Trust in a shared agent service is built through consistent, predictable performance for every tenant, even during demand spikes. In pooled multitenant deployments, effective isolation requires throttling at multiple layers (API, inference, memory, and tools), monitoring per-tenant resource consumption, and adaptive fairness mechanisms that distribute shared resources equitably based on current load.

Desired outcome:

  • You have per-tenant throttling enforced at every shared resource layer.

  • You have tenant resource consumption monitored in real time with alerts for tenants approaching their limits.

  • You have graceful throttling that provides clear feedback to throttled tenants.

  • You have performance isolation validated through regular load testing that simulates noisy neighbor scenarios.
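As an illustration of the "clear feedback" outcome above, a throttled request can return HTTP 429 with a Retry-After header and a body that names the limited layer. The following is a minimal sketch, assuming a Lambda proxy-style response dictionary; the function name, the `TenantThrottled` error code, and the body field names are hypothetical.

```python
import json
import math


def throttle_response(tenant_id: str, layer: str, retry_after_seconds: float) -> dict:
    """Build an HTTP 429 response that says what was limited and when to retry."""
    # Retry-After is expressed in whole seconds, so round up and never return 0.
    retry_after = max(1, math.ceil(retry_after_seconds))
    return {
        "statusCode": 429,
        "headers": {
            "Content-Type": "application/json",
            "Retry-After": str(retry_after),
        },
        "body": json.dumps({
            "error": "TenantThrottled",   # hypothetical error code
            "tenantId": tenant_id,
            "layer": layer,               # for example "api", "inference", "memory", "tools"
            "retryAfterSeconds": retry_after,
        }),
    }
```

Naming the layer in the response lets a tenant distinguish an ingress rate limit from an inference concurrency cap and back off accordingly.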

Common anti-patterns:

  • Applying throttling only at the API gateway layer without enforcing limits at downstream shared resources, so a tenant can stay within its request rate limit yet still monopolize shared capacity through long-running operations.

  • Using static throttling limits that don't adapt to current system load, wasting available capacity during low-load periods or failing to protect isolation during high-load periods.

  • Throttling all tenants equally regardless of their service tier, failing to honor premium SLAs.

Benefits of establishing this best practice:

  • Multi-layer throttling distributes shared resources fairly across tenants.

  • Real-time per-tenant consumption metrics support proactive management.

  • Per-tenant performance monitoring validates SLA compliance.

Level of risk exposed if this best practice is not established: High

Implementation guidance

Multi-layer throttling enforces tenant limits at every shared resource. At the API layer, Amazon API Gateway usage plans with per-tenant API keys enforce request rate and burst limits. At the inference layer, tenant-aware request queuing caps concurrent Amazon Bedrock inference calls per tenant. At the memory and tool layers, per-tenant rate limiting applies to shared endpoints. For agents deployed on Amazon Bedrock AgentCore Runtime, the runtime's session isolation provides natural per-session resource boundaries.
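The inference-layer cap described above can be sketched as a per-tenant concurrency limiter that wraps each model call. This is a minimal in-process sketch, not a prescribed implementation: the class and exception names are hypothetical, it rejects excess calls instead of queuing them, and a pooled deployment with several workers would need a shared store for the counters.

```python
import threading
from contextlib import contextmanager


class TenantThrottled(Exception):
    """Raised when a tenant is already at its concurrency cap."""


class TenantConcurrencyLimiter:
    """Caps in-flight inference calls per tenant (hypothetical helper)."""

    def __init__(self, limits: dict, default_limit: int = 1):
        self._limits = limits            # tenant_id -> max concurrent calls
        self._default = default_limit    # cap for tenants without an explicit entry
        self._semaphores = {}
        self._lock = threading.Lock()

    def _semaphore(self, tenant_id: str) -> threading.BoundedSemaphore:
        # Create each tenant's semaphore lazily, guarded so two threads
        # cannot create two semaphores for the same tenant.
        with self._lock:
            if tenant_id not in self._semaphores:
                limit = self._limits.get(tenant_id, self._default)
                self._semaphores[tenant_id] = threading.BoundedSemaphore(limit)
            return self._semaphores[tenant_id]

    @contextmanager
    def slot(self, tenant_id: str):
        sem = self._semaphore(tenant_id)
        # Non-blocking acquire: fail fast instead of waiting for a free slot.
        if not sem.acquire(blocking=False):
            raise TenantThrottled(tenant_id)
        try:
            yield
        finally:
            sem.release()
```

A caller would place the Amazon Bedrock invocation inside `with limiter.slot(tenant_id):` and convert `TenantThrottled` into a throttling response. To queue rather than reject, the acquire could block with a timeout instead.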

Adaptive throttling adjusts limits based on current system load. During low-load periods, tenants can burst above their baseline limits to use available capacity. During high-load periods, strict limits protect isolation. Per-tenant Amazon CloudWatch dashboards and metrics track request volume, inference consumption, latency percentiles, throttle rates, and error rates. Alarms fire when a tenant approaches its limits or when per-tenant latency exceeds SLA thresholds. Regular noisy neighbor testing, which simulates high-load scenarios for individual tenants, validates that other tenants' performance stays within SLA bounds.
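One way to express the adaptive behavior is a function that maps current system utilization to the limit enforced for a tenant. The sketch below assumes a linear taper between two watermarks; the function name, the watermark defaults, and the choice of a linear taper are all illustrative assumptions, not values taken from this guidance.

```python
def effective_limit(
    baseline: float,
    burst_ceiling: float,
    utilization: float,
    low_watermark: float = 0.5,    # assumed: at or below this, full burst is allowed
    high_watermark: float = 0.8,   # assumed: at or above this, only baseline is allowed
) -> float:
    """Return the per-tenant limit to enforce at the given utilization (0.0 to 1.0)."""
    if burst_ceiling < baseline:
        raise ValueError("burst_ceiling must be >= baseline")
    if utilization <= low_watermark:
        return burst_ceiling
    if utilization >= high_watermark:
        return baseline
    # Between the watermarks, shrink the burst headroom linearly toward zero.
    fraction = (high_watermark - utilization) / (high_watermark - low_watermark)
    return baseline + (burst_ceiling - baseline) * fraction
```

With a baseline of 10 and a burst ceiling of 30, a tenant gets the full ceiling while the system is lightly loaded and falls back to its baseline once utilization passes the high watermark.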

Implementation steps

  1. Define per-tenant throttling limits for each resource layer: Set per-tenant limits for API requests per second, concurrent inference calls, memory storage quota, and tool invocations per minute.

  2. Implement API Gateway usage plans with per-tenant API keys and rate/burst limits: Issue each tenant its own API key, associate it with an Amazon API Gateway usage plan, and enforce rate and burst limits at ingress.

  3. Deploy tenant-aware inference queuing with per-tenant concurrency limits: Queue Amazon Bedrock inference calls per tenant so no single tenant can consume all inference capacity.

  4. Configure adaptive throttling that adjusts limits based on current system load: Allow bursts during low-load periods and enforce strict limits during high-load periods to protect isolation.

  5. Create per-tenant CloudWatch dashboards and configure SLA-based alarms: Publish per-tenant metrics in Amazon CloudWatch and alarm on consumption approaching limits or latency exceeding SLA thresholds.

  6. Establish regular noisy neighbor load testing to validate isolation effectiveness: Schedule load tests that drive individual tenants to high load and verify that all other tenants stay within SLA.
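Steps 1, 2, and 5 can be tied together by keeping per-tier limits in one place and deriving service parameters from them. The sketch below builds plain dictionaries shaped for the boto3 `create_usage_plan` and `put_metric_alarm` calls without making any AWS calls. The tier names, all numeric values, the `AgentService/Tenants` namespace, the `RequestLatency` metric, and the `TenantId` dimension are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierLimits:
    api_rps: float                  # steady-state API requests per second
    api_burst: int                  # API burst capacity
    max_concurrent_inference: int   # concurrent inference calls
    tool_calls_per_minute: int      # tool invocations per minute
    p95_latency_sla_ms: float       # latency SLA used for alarms


# Hypothetical tiers and numbers, for illustration only.
TIERS = {
    "standard": TierLimits(10, 20, 2, 60, 5000),
    "premium": TierLimits(50, 100, 8, 300, 2000),
}


def usage_plan_params(tenant_id: str, tier: str) -> dict:
    """Keyword arguments for boto3.client("apigateway").create_usage_plan(**params)."""
    limits = TIERS[tier]
    return {
        "name": f"{tenant_id}-{tier}",
        "throttle": {"rateLimit": limits.api_rps, "burstLimit": limits.api_burst},
    }


def latency_alarm_params(tenant_id: str, tier: str,
                         namespace: str = "AgentService/Tenants") -> dict:
    """Keyword arguments for boto3.client("cloudwatch").put_metric_alarm(**params)."""
    limits = TIERS[tier]
    return {
        "AlarmName": f"{tenant_id}-p95-latency-sla",
        "Namespace": namespace,               # assumed custom namespace
        "MetricName": "RequestLatency",       # assumed custom metric, in milliseconds
        "Dimensions": [{"Name": "TenantId", "Value": tenant_id}],
        "ExtendedStatistic": "p95",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": limits.p95_latency_sla_ms,
        "ComparisonOperator": "GreaterThanThreshold",
    }
```

The usage plan still needs API stages attached and the tenant's API key associated with it (for example through `create_api_key` and `create_usage_plan_key`), and the alarm only evaluates if the service publishes the assumed per-tenant metric.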
