AGENTCOST02-BP02 Cost optimize token consumption through efficient prompt engineering
Every token in a system prompt is paid for on every invocation, so prompt bloat compounds linearly with traffic. Compressing system prompts, tool descriptions, and output formats makes the fixed cost of each agent call proportional to the decision it has to make, not the verbosity of its instructions.
Desired outcome:
-
You have agent system prompts compressed to the minimum tokens needed for accurate task completion.
-
You have tool descriptions presented dynamically based on the current task rather than transmitted in full on every call.
-
You constrain output length explicitly so verbose responses don't compound across multi-turn reasoning.
-
You version prompts and track cost-per-task per version so efficiency changes are measurable over time.
Common anti-patterns:
-
Writing verbose system prompts with lengthy persona descriptions and redundant explanations, inflating the fixed token cost on every invocation.
-
Including every tool description in every invocation regardless of task relevance, which inflates input tokens when only a subset of tools applies.
-
Allowing unconstrained output length without formatting directives, enabling verbose responses that compound across multi-turn reasoning cycles.
-
Treating prompts as uncontrolled strings rather than versioned artifacts, so regressions in token efficiency go unnoticed until a billing review.
Benefits of establishing this best practice:
-
Fixed-cost reductions from prompt compression compound across every agent invocation in high-volume deployments.
-
Dynamic tool loading transmits only task-relevant tools, reducing tool description overhead in proportion to catalog size.
-
Prompt versioning with token tracking makes compression an ongoing, measurable practice rather than a one-off cleanup.
Level of risk exposed if this best practice is not established: Medium
Implementation guidance
The system prompt is the largest fixed cost in every model invocation, which means every unnecessary sentence is paid for again each time the agent is called. Start by auditing prompts with Amazon Bedrock AgentCore Observability to measure token footprints, then compress systematically. Replace verbose instructions with structured directives, tighten role definitions, and remove redundant explanations.
Tool descriptions behave the same way. If the prompt carries the full tool catalog even when only three tools are relevant, you are paying for the rest of the catalog on every call. Amazon Bedrock AgentCore Gateway uses MCP-based Semantic Tool Selection to present only the tools relevant to the current intent, and Amazon Bedrock AgentCore Memory manages conversation state across multi-turn sessions so you don't pay for manual history concatenation.
In your context window, allocate tokens across the system prompt (20 to 30%), user context (30 to 40%), few-shot examples (10 to 20%), and agent scratchpad (20 to 30%), and adjust based on your agent's reasoning patterns. For few-shot examples in particular, test whether the model performs well zero-shot before paying for examples on every call. When examples are needed, identify the minimum count that maintains task accuracy against the quality baseline. Two or three examples are typically enough, and dynamic example selection from a semantic index keeps only the relevant ones in context.
Output length is the last thing to adjust. Explicit formatting directives in the system prompt (response structure, maximum length) directly control output token costs. Treat prompts as versioned artifacts. Record token count, task success rate, and cost-per-task for each version using AgentCore Observability telemetry, and establish a monthly review cadence that tracks cumulative savings in AWS Cost Explorer.
Implementation steps
-
Audit and compress system prompts: Use Amazon Bedrock AgentCore Observability to measure token footprints for every prompt, then apply compression to minimize prompt size while maintaining decision accuracy.
-
Present tools dynamically through Gateway: Use Amazon Bedrock AgentCore Gateway Semantic Tool Selection to present only relevant tools per invocation, and compress tool descriptions to tool name, one-sentence description, and concise parameter schema.
-
Constrain output length: Add explicit output formatting directives to the system prompt and configure model-specific token limit parameters to enforce hard caps on response size.
-
Use managed memory for multi-turn context: Configure Amazon Bedrock AgentCore Memory so conversation state is maintained automatically instead of re-transmitted as full history.
-
Reduce few-shot example overhead: Evaluate whether each agent task needs examples at all, then reduce to the minimum effective count of two to three. Load examples dynamically from an Amazon S3-hosted library using semantic similarity retrieval.
-
Version prompts and track efficiency: Record token count and task success rate per prompt version, and establish a monthly optimization review cadence that tracks cumulative savings against cost-per-task targets.
Resources
Related best practices:
Related documents:
Related videos:
Related examples:
Related tools:
Related services: