AGENTREL07-BP03 Implement distributed tracing to track system dependencies and facilitate recovery
Correlating logs across services by hand is slow, and during an incident slow is worse than expensive. Distributed tracing across all components with agent-specific annotations gives operators the full request path in one view and turns broad restarts into targeted recovery actions.
Desired outcome:
- You have distributed tracing across every agent component with agent-specific annotations.
- You propagate trace context through synchronous and asynchronous communication boundaries.
- You correlate traces, metrics, and logs in a unified view that surfaces root causes quickly.
Common anti-patterns:
- Tracing only at the application boundary without propagating context through internal service calls.
- Skipping correlation of traces with logs and metrics, which rules out unified analysis during incidents.
- Omitting the agent-specific annotations that make it possible to filter by agent ID, task type, or model used.
Benefits of establishing this best practice:
- Root causes surface quickly because the request flow is visible end to end.
- Mean time to recovery drops because trace data drives targeted actions instead of broad restarts.
- Latency bottlenecks become visible, enabling proactive performance optimization.
Level of risk exposed if this best practice is not established: High
Implementation guidance
Amazon Bedrock AgentCore Observability captures the full execution path of each agent invocation through OpenTelemetry-compatible telemetry. Turn it on across every agent component, not only the outer boundary, so the trace actually spans the request end to end. Without component-level coverage, traces have gaps where invisible work happens, which is exactly where debugging slows down during an incident.
Annotations are what make traces searchable at the scale of real systems. Agent-specific tags (agent ID, task type, model ID, workflow ID) let you filter traces to a specific agent or failure scenario instead of grepping through an undifferentiated stream. Instrument Strands Agents framework-level traces to capture reasoning steps, tool invocations, and their outcomes in a unified trace view, because the agent's internal decisions are where most of the diagnostic signal lives.
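As a minimal sketch of why these annotations matter, the following uses a plain Python stand-in for spans rather than a real tracing SDK. The `Span` class, the `annotate` and `filter_spans` helpers, and the attribute keys (`agent.id`, `agent.task_type`, `agent.model_id`, `workflow.id`) are illustrative assumptions, not names defined by AgentCore Observability or OpenTelemetry.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Simplified stand-in for a trace span (hypothetical, not an SDK type)."""
    name: str
    status: str = "OK"
    attributes: dict = field(default_factory=dict)


def annotate(span, agent_id, task_type, model_id, workflow_id):
    """Attach agent-specific attributes; the key names are assumptions."""
    span.attributes.update({
        "agent.id": agent_id,
        "agent.task_type": task_type,
        "agent.model_id": model_id,
        "workflow.id": workflow_id,
    })
    return span


def filter_spans(spans, criteria, status=None):
    """Return spans whose attributes match every criterion (and status, if given)."""
    return [
        s for s in spans
        if all(s.attributes.get(k) == v for k, v in criteria.items())
        and (status is None or s.status == status)
    ]


spans = [
    annotate(Span("invoke", "ERROR"), "billing-agent", "refund", "model-a", "wf-1"),
    annotate(Span("invoke", "OK"), "billing-agent", "lookup", "model-a", "wf-2"),
    annotate(Span("invoke", "ERROR"), "search-agent", "lookup", "model-b", "wf-3"),
]

# Narrow an incident to failed invocations of one agent.
failed_billing = filter_spans(spans, {"agent.id": "billing-agent"}, status="ERROR")
print([s.attributes["workflow.id"] for s in failed_billing])
```

Without the `agent.id` attribute, the only way to isolate the failing billing invocation would be to inspect every span by hand, which is the undifferentiated stream the paragraph above warns about.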
Context propagation is the detail that decides whether asynchronous paths are visible. For queue-based communication, propagate trace headers through message attributes so traces continue across queue boundaries. Without propagation, the trace ends at the producer and a new trace starts at the consumer, and the fact that they relate is lost. Build trace-based alerting in Amazon CloudWatch and correlate traces, metrics, and logs in a unified view, so a trace anomaly, a metric spike, and a log error appear together rather than separately.
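The queue propagation described above can be sketched with the W3C Trace Context `traceparent` header, which has the form `version-traceid-parentid-flags`. This is a hand-rolled illustration using only the standard library. In practice an OpenTelemetry propagator would normally do this work, and the helper names (`new_traceparent`, `inject`, `extract`) are hypothetical. The attribute shape follows the `DataType` / `StringValue` layout of Amazon SQS message attributes, and no queue is actually called.

```python
import re
import secrets

# W3C Trace Context, version 00: 32-hex trace ID, 16-hex parent ID, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(trace_id=None, sampled=True):
    """Build a traceparent value, reusing trace_id when continuing a trace."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)                # 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def inject(message_attributes, traceparent):
    """Producer side: copy the header into SQS-style message attributes."""
    message_attributes["traceparent"] = {
        "DataType": "String",
        "StringValue": traceparent,
    }
    return message_attributes


def extract(message_attributes):
    """Consumer side: return (trace_id, parent_span_id), or None if absent or invalid."""
    attr = message_attributes.get("traceparent")
    if not attr:
        return None
    match = _TRACEPARENT.match(attr.get("StringValue", ""))
    if not match:
        return None
    trace_id, parent_id, _flags = match.groups()
    # All-zero IDs are invalid under the W3C specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return trace_id, parent_id


# Producer attaches its context to the outgoing message.
producer_tp = new_traceparent()
attrs = inject({}, producer_tp)

# Consumer continues the same trace with a new span of its own.
trace_id, parent_span_id = extract(attrs)
consumer_tp = new_traceparent(trace_id=trace_id)
```

The consumer keeps the producer's trace ID and records the producer's span ID as its parent, so both sides of the queue appear in one trace. If `extract` returns `None`, the consumer has no choice but to start a new, unrelated trace.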
Implementation steps
- Enable AgentCore Observability with OpenTelemetry tracing: Turn on Amazon Bedrock AgentCore Observability across every agent component.
- Add agent-specific annotations: Tag traces with agent ID, task type, and model ID so filtering during incidents is possible.
- Propagate trace context across boundaries: Cover both synchronous and asynchronous paths, using message attributes to carry trace headers in queue-based communication.
- Instrument Strands Agents framework-level traces: Capture reasoning steps and tool invocations in the unified trace view.
- Create CloudWatch dashboards that correlate traces, metrics, and logs: Build one unified view so responders work from correlated signals instead of piecing them together during an incident.
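Correlating logs with traces in the last step depends on every log line carrying the ID of the trace it belongs to. A minimal sketch of that idea follows, using only the Python standard library. The `trace_id` field name, the `agent.tools` logger name, and the `current_trace_id` context variable are assumptions for illustration. In a real deployment the ID would come from the active OpenTelemetry span rather than being set by hand.

```python
import contextvars
import io
import json
import logging

# Holds the trace ID of the request currently being handled (hypothetical helper).
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so a log backend can index trace_id."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.addFilter(TraceIdFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("agent.tools")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# Simulate handling one request inside a known trace.
token = current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
try:
    logger.error("tool call failed: timeout")
finally:
    current_trace_id.reset(token)

entry = json.loads(stream.getvalue())
```

With the trace ID present as a structured field, a log query for that ID returns exactly the lines emitted during the traced request, which is what lets a dashboard place a log error next to the span that produced it.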