Domain 4: Operational Efficiency and Optimization (12%)
Exam Tip
This is the smallest domain (12%) but the questions are direct. Know the Provisioned Throughput vs. On-Demand trade-off cold: PTU = predictable, consistent traffic; On-Demand = sporadic. Don't choose PTU for development or unpredictable workloads. Know that streaming doesn't save money; it only improves perceived latency.
4.1 Provisioned Throughput vs. On-Demand
Comparison Table
| | Provisioned Throughput (PTU) | On-Demand |
|---|---|---|
| Pricing model | Fixed (per Model Unit per hour) | Pay per input/output token |
| Commitment | 1 month or 6 months | No commitment |
| Traffic pattern | Predictable, steady-state | Sporadic, variable |
| Performance | Guaranteed Model Units (no throttling) | May throttle during peak demand |
| Best for | Production 24/7 applications | Development, testing, burst workloads |
Model Units (MUs)
- Provisioned Throughput is measured in Model Units (MUs)
- Each MU provides a specific tokens-per-minute (TPM) capacity
- You purchase a fixed number of MUs for a 1-month or 6-month term
- Unused MUs are still billed; the commitment cost applies regardless of usage
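The fixed-vs-variable trade-off above can be sketched as simple break-even arithmetic. All prices below are hypothetical illustrations, not real Bedrock rates:

```python
# Break-even sketch: Provisioned Throughput (PTU) vs. On-Demand.
# All prices are HYPOTHETICAL illustrations, not real Bedrock rates.

HOURS_PER_MONTH = 730

def monthly_cost_ptu(model_units: int, price_per_mu_hour: float) -> float:
    """Fixed cost: you pay for every committed MU-hour, used or not."""
    return model_units * price_per_mu_hour * HOURS_PER_MONTH

def monthly_cost_on_demand(tokens: int, price_per_1k_tokens: float) -> float:
    """Variable cost: pay only for tokens actually processed."""
    return tokens / 1000 * price_per_1k_tokens

# Example: 1 MU at a hypothetical $40/hour vs. on-demand at $0.01 per 1K tokens.
ptu = monthly_cost_ptu(1, 40.0)                      # fixed, usage-independent
light = monthly_cost_on_demand(50_000_000, 0.01)     # 50M tokens/month
heavy = monthly_cost_on_demand(5_000_000_000, 0.01)  # 5B tokens/month

print(f"PTU fixed:       ${ptu:,.0f}")     # $29,200
print(f"On-demand 50M:   ${light:,.0f}")   # $500 -- sporadic: on-demand wins
print(f"On-demand 5B:    ${heavy:,.0f}")   # $50,000 -- steady volume: PTU wins
```

The point is the shape of the curves, not the numbers: below the break-even volume, the fixed PTU commitment is wasted spend; above it, the per-token discount pays off.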
When to Use Each

Traffic is consistent and predictable (24/7)?
└─ Provisioned Throughput (guaranteed throughput, lower per-token cost at scale)

Traffic is unpredictable, bursty, or for dev/test?
└─ On-Demand (no commitment risk, pay only for what you use)

High-volume, non-real-time batch jobs?
└─ Batch Inference (see 4.3)

Exam Trap
Do not choose PTU for sporadic or development workloads. Even though PTU is cheaper per token at high volume, you pay for the full commitment period regardless of usage. On-Demand is the correct choice for unpredictable or low-volume scenarios.
Also: PTU commits you to a 1-month or 6-month term; the exam tests this commitment-period detail.
4.2 Token Efficiency & Cost Optimization
Token Cost Drivers
- Input tokens: The length of your prompt (system prompt + user message + retrieved context)
- Output tokens: The length of the model's response
- Model choice: Larger, more capable models cost more per token
Optimization Techniques
| Technique | How It Helps |
|---|---|
| Concise system prompts | Shorter system prompts = fewer input tokens on every call |
| Set maxTokens explicitly | Caps output length, preventing runaway long responses |
| Streaming | Does NOT reduce cost; improves perceived latency only |
| Smaller model for simple tasks | Use Claude Haiku instead of Sonnet for tasks that don't need full reasoning |
| Truncate conversation history | Only include recent relevant turns, not the full history |
| Reduce top-K in RAG | Fewer retrieved chunks = shorter input context = lower cost |
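The "truncate conversation history" technique from the table can be sketched as a small helper. The function name and message shape below are illustrative, not a Bedrock API:

```python
# Sketch of the "truncate conversation history" technique: keep the system
# prompt plus only the most recent turns. Helper name and message format
# are illustrative, not part of any Bedrock SDK.

def truncate_history(messages: list[dict], max_turns: int = 4) -> list[dict]:
    """Keep the system prompt plus the last `max_turns` messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]

history = (
    [{"role": "system", "content": "You are a helpful assistant."}]
    + [{"role": "user", "content": f"question {i}"} for i in range(10)]
)
trimmed = truncate_history(history, max_turns=4)
print(len(trimmed))  # 5: system prompt + last 4 turns
```

Every dropped turn is input tokens you stop paying for on every subsequent call, which compounds quickly in long chat sessions.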
TIP
The exam distinguishes between techniques that reduce cost vs. improve latency. Streaming improves perceived latency but does not reduce token count or cost. Cost optimization is exclusively about reducing input and output tokens.
4.3 Batch Inference
What Is Batch Inference?
Batch inference allows you to submit a large dataset of prompts at once and receive all responses asynchronously, at a lower cost than on-demand inference.
Key characteristics:
- Jobs submitted to a queue and processed asynchronously
- Input: S3 object (JSONL file with prompts)
- Output: S3 object (JSONL file with responses)
- Not real-time; not suitable for interactive or latency-sensitive workloads
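A minimal sketch of building the JSONL input file described above. The record shape (`recordId` / `modelInput`) follows Bedrock's batch input format as I understand it; treat the exact `modelInput` body as model-specific and verify against the current documentation:

```python
import json

# Building a JSONL input for a batch inference job: one JSON record per
# line, each with a recordId and a model-specific modelInput body.
# Verify the exact schema against current Bedrock documentation.

reviews = ["Great product!", "Arrived broken.", "Okay for the price."]

lines = []
for i, review in enumerate(reviews):
    record = {
        "recordId": f"review-{i:06d}",
        "modelInput": {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 100,
            "messages": [
                {"role": "user",
                 "content": f"Classify the sentiment of this review: {review}"}
            ],
        },
    }
    lines.append(json.dumps(record))

jsonl = "\n".join(lines)  # upload this file to S3, then start the batch job
print(len(lines), "records")
```

The job reads this file from S3, processes it asynchronously, and writes a corresponding JSONL of responses back to S3.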
When to Use Batch Inference
- Running nightly analysis on thousands of customer reviews
- Generating product descriptions in bulk
- Processing a large document corpus for classification or summarization
- Any high-volume, non-time-sensitive workload
Cost Advantage
Batch inference is typically ~50% cheaper per token than on-demand pricing (the exact discount varies by model), making it the most cost-effective option for non-urgent, high-volume jobs.
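The savings are easy to quantify. The on-demand rate below is hypothetical; the ~50% discount is the figure quoted above:

```python
# Savings arithmetic for batch vs. on-demand, assuming a HYPOTHETICAL
# on-demand price of $0.003 per 1K tokens and the ~50% batch discount.

tokens = 200_000_000      # e.g., a nightly review-analysis corpus
on_demand_rate = 0.003    # $ per 1K tokens (hypothetical)
batch_discount = 0.50

on_demand_cost = tokens / 1000 * on_demand_rate
batch_cost = on_demand_cost * (1 - batch_discount)
print(f"On-demand: ${on_demand_cost:,.0f}  Batch: ${batch_cost:,.0f}")
# On-demand: $600  Batch: $300
```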
4.4 Monitoring Operational Metrics
CloudWatch Metrics for Bedrock
| Metric | What It Measures |
|---|---|
| InvocationLatency | End-to-end latency of model calls |
| InputTokenCount | Number of input tokens consumed |
| OutputTokenCount | Number of output tokens generated |
| InvocationClientErrors | 4xx errors (bad requests; client-side issues) |
| InvocationServerErrors | 5xx errors (Bedrock service errors) |
| ThrottledRequests | Requests throttled due to exceeding on-demand TPS limit |
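Querying these metrics is a standard CloudWatch call. The sketch below builds the request parameters for daily input-token usage; Bedrock publishes under the `AWS/Bedrock` namespace with a `ModelId` dimension, but verify the dimension names for your setup:

```python
from datetime import datetime, timedelta, timezone

# Sketch of a CloudWatch query for Bedrock token usage. Pass these params
# to boto3: boto3.client("cloudwatch").get_metric_statistics(**params)
# The ModelId value is an example; verify dimensions for your account.

end = datetime.now(timezone.utc)
params = {
    "Namespace": "AWS/Bedrock",
    "MetricName": "InputTokenCount",
    "Dimensions": [{"Name": "ModelId",
                    "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],
    "StartTime": end - timedelta(hours=24),
    "EndTime": end,
    "Period": 3600,          # one datapoint per hour
    "Statistics": ["Sum"],   # total input tokens per hour
}
print(params["Namespace"], params["MetricName"])
```

Summing `InputTokenCount` and `OutputTokenCount` over time is the foundation for the cost-tracking and alarm thresholds this domain expects you to reason about.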
Responding to ThrottlingException
When you receive a ThrottlingException:
- Short-term: Implement exponential backoff and retry
- Long-term: Switch to Provisioned Throughput for guaranteed Model Units and no throttling
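The short-term fix can be sketched as a retry loop. The exception class below is a stand-in; with boto3 you would catch botocore's `ClientError` and inspect the error code:

```python
import random
import time

# Minimal exponential backoff with jitter for throttling errors.
# ThrottlingError is a stand-in for the real SDK exception.

class ThrottlingError(Exception):
    pass

def invoke_with_backoff(call, max_retries=5, base_delay=0.5):
    for attempt in range(max_retries + 1):
        try:
            return call()
        except ThrottlingError:
            if attempt == max_retries:
                raise  # give up after the final retry
            # Exponential backoff: base, 2x, 4x, ... plus random jitter
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Demo: a fake model call that is throttled twice, then succeeds.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ThrottlingError("Too many requests")
    return "ok"

result = invoke_with_backoff(flaky_call, base_delay=0.01)
print(result)  # ok (after 2 retries)
```

The jitter matters: without it, many throttled clients retry in lockstep and re-trigger the same throttling spike.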
Flashcards
Q: When should you use Provisioned Throughput instead of On-Demand?
A: When traffic is predictable and steady-state (e.g., a 24/7 production application) and you can commit to a 1-month or 6-month term of Model Units.