Domain 4: Operational Efficiency and Optimization (12%)
Exam Tip
This is the smallest domain (12%) but the questions are direct. Know the Provisioned Throughput vs. On-Demand trade-off cold: PTU = predictable, consistent traffic; On-Demand = sporadic. Don't choose PTU for development or unpredictable workloads. Know that streaming doesn't save money; it only improves perceived latency.
4.1 Provisioned Throughput vs. On-Demand
Comparison Table
| | Provisioned Throughput (PTU) | On-Demand |
|---|---|---|
| Pricing model | Fixed (per Model Unit per hour) | Pay per input/output token |
| Commitment | 1 month or 6 months | No commitment |
| Traffic pattern | Predictable, steady-state | Sporadic, variable |
| Performance | Guaranteed Model Units (no throttling) | May throttle during peak demand |
| Best for | Production 24/7 applications | Development, testing, burst workloads |
Model Units (MUs)
- Provisioned Throughput is measured in Model Units (MUs)
- Each MU provides a specific tokens-per-minute (TPM) capacity
- You purchase a fixed number of MUs for a 1-month or 6-month term
- Unused MUs are still billed; the commitment cost applies regardless of usage (see the provisioning sketch below)
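A minimal sketch of purchasing Model Units with the boto3 control-plane client, assuming the workload justifies PTU; the model ID, name, and MU count are illustrative, and the call commits you (and bills you) for the full term:

```python
import boto3

bedrock = boto3.client("bedrock")  # control-plane client, not bedrock-runtime

# Purchase 2 Model Units for a 1-month commitment (illustrative values).
response = bedrock.create_provisioned_model_throughput(
    provisionedModelName="prod-claude-throughput",          # placeholder name
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",       # placeholder model
    modelUnits=2,
    commitmentDuration="OneMonth",                            # or "SixMonths"
)
print(response["provisionedModelArn"])  # use this ARN as modelId for invocations
```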
When to Use Each
Traffic is consistent and predictable (24/7)?
└─ Provisioned Throughput
Traffic is unpredictable, bursty, or for dev/test?
└─ On-Demand
High-volume, non-real-time batch jobs?
└─ Batch Inference
Exam Trap
Do not choose PTU for sporadic or development workloads. Even though PTU is cheaper per token at high volume, you pay for the full commitment period regardless of usage. On-Demand is correct for unpredictable or low-volume scenarios.
Also: PTU commits you for 1 month or 6 months; the exam tests this commitment period detail.
4.2 Token Efficiency & Cost Optimization
Token Cost Drivers
- Input tokens: The length of your prompt (system prompt + user message + retrieved context)
- Output tokens: The length of the model's response
- Model choice: Larger, more capable models cost more per token
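A quick back-of-the-envelope formula ties these drivers together: cost ≈ (input tokens × input price) + (output tokens × output price). The sketch below uses placeholder per-1K-token prices, not actual Bedrock pricing:

```python
def estimate_request_cost(input_tokens: int, output_tokens: int,
                          price_per_1k_input: float, price_per_1k_output: float) -> float:
    """Rough per-request cost: both prompt and completion tokens are billed."""
    return (input_tokens / 1000) * price_per_1k_input \
         + (output_tokens / 1000) * price_per_1k_output

# Placeholder prices for illustration only; check the current Bedrock price list.
# A 2,000-token prompt (system prompt + question + retrieved context) with a 500-token answer:
cost = estimate_request_cost(2000, 500, price_per_1k_input=0.003, price_per_1k_output=0.015)
print(f"~${cost:.4f} per request")
```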
Optimization Techniques
| Technique | How It Helps |
|---|---|
| Concise system prompts | Shorter system prompts = fewer input tokens on every call |
| Set maxTokens explicitly | Caps output length, preventing runaway long responses |
| Streaming | Does NOT reduce cost; improves perceived latency only |
| Smaller model for simple tasks | Use Claude Haiku instead of Sonnet for tasks that don't need full reasoning |
| Truncate conversation history | Only include recent relevant turns, not the full history |
| Reduce top-K in RAG | Fewer retrieved chunks = shorter input context = lower cost |
TIP
The exam distinguishes between techniques that reduce cost vs. improve latency. Streaming improves perceived latency but does not reduce token count or cost.
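To make the distinction concrete, here is a hedged sketch of a streaming call with the Converse API (model ID and prompt are illustrative): chunks arrive incrementally, which improves perceived latency, but the billed token counts are the same as a non-streaming call.

```python
import boto3

runtime = boto3.client("bedrock-runtime")

# Streaming returns partial output as it is generated - better perceived latency,
# same input/output token counts (and therefore the same cost) as a blocking call.
stream = runtime.converse_stream(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "Summarize our return policy in two sentences."}]}],
)
for event in stream["stream"]:
    if "contentBlockDelta" in event:
        print(event["contentBlockDelta"]["delta"]["text"], end="", flush=True)
```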
Inference Parameter Quality Controls
- Low temperature improves consistency for standardized outputs
- Higher temperature increases diversity for creative generation
- Lower topP keeps sampling tighter and more predictable
- maxTokens limits output size and cost
- stopSequences stops generation when a specified token or phrase appears, helping enforce cleaner output boundaries
Best-fit examples:
- Contracts, policy language, compliance wording → lower temperature
- Brainstorming, alternate copy, variant generation → higher temperature
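A minimal sketch of setting these controls on a single Converse API call; the model ID, prompt, and parameter values are illustrative:

```python
import boto3

runtime = boto3.client("bedrock-runtime")

response = runtime.converse(
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    messages=[{"role": "user", "content": [{"text": "Draft a one-paragraph refund clause."}]}],
    inferenceConfig={
        "temperature": 0.1,            # low temperature -> consistent, standardized wording
        "topP": 0.5,                   # tighter sampling -> more predictable output
        "maxTokens": 300,              # hard cap on output length (and output-token cost)
        "stopSequences": ["\n\n---"],  # stop cleanly when this marker appears
    },
)
print(response["output"]["message"]["content"][0]["text"])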
Prompt Caching
Prompt caching reduces repeated inference cost and latency when many requests share the same large static prompt prefix.
How to think about it:
- Cache the repeated prefix, such as a long system prompt, policy block, or embedded manual excerpt
- Later calls reuse that cached prefix instead of recomputing it from scratch
- Best fit when the expensive part of the prompt stays mostly unchanged across many requests
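Assuming a model that supports Bedrock prompt caching, a sketch of marking the static prefix with a cachePoint block in the Converse API; the file name and model ID are placeholders:

```python
import boto3

runtime = boto3.client("bedrock-runtime")

# Large, static prefix that is identical across many requests.
long_policy_text = open("policy_manual.txt").read()  # placeholder file

response = runtime.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    system=[
        {"text": long_policy_text},
        {"cachePoint": {"type": "default"}},  # content before this block can be cached and reused
    ],
    messages=[{"role": "user", "content": [{"text": "Does the policy cover water damage?"}]}],
)
print(response["output"]["message"]["content"][0]["text"])
```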
TIP
If the question asks for the lowest-cost or lowest-latency way to reuse a long, repeated prompt prefix, think Prompt Caching.
Semantic Caching
Semantic caching avoids FM invocations entirely for repeat questions by returning a cached response when a new question is semantically equivalent to one already answered.
How it works:
Incoming question
  ↓ embed (Titan / Cohere)
Query vector cache (ElastiCache Redis)
  ↓ cosine similarity check
Similarity ≥ threshold?
  ├─ Yes → return cached response (no FM call, zero token cost)
  └─ No → invoke FM → cache response + embedding → return
Key characteristics:
- Uses embeddings + cosine similarity (or other distance metrics) against a configurable threshold
- ElastiCache (Redis) is the AWS service for storing and querying the vector embeddings
- Handles paraphrased questions that are textually different but semantically identical, e.g., "What is your return policy?" vs "How do returns work?"
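A minimal sketch of the flow above, using a Titan embedding model and an in-memory list as a stand-in for ElastiCache (Redis) vector search; the model IDs and threshold are illustrative:

```python
import json
import boto3
import numpy as np

runtime = boto3.client("bedrock-runtime")
cache = []          # in-memory stand-in for ElastiCache (Redis) vector search
THRESHOLD = 0.85    # illustrative similarity threshold; tune per workload

def embed(text: str) -> np.ndarray:
    """Embed a question with Titan Text Embeddings (placeholder model ID)."""
    resp = runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(resp["body"].read())["embedding"])

def invoke_fm(question: str) -> str:
    """Cache miss path: call the foundation model."""
    resp = runtime.converse(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",
        messages=[{"role": "user", "content": [{"text": question}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]

def answer(question: str) -> str:
    q_vec = embed(question)
    for vec, cached_response in cache:
        similarity = float(np.dot(q_vec, vec) / (np.linalg.norm(q_vec) * np.linalg.norm(vec)))
        if similarity >= THRESHOLD:
            return cached_response        # cache hit: no FM call, zero token cost
    response = invoke_fm(question)        # cache miss: invoke FM, then store for next time
    cache.append((q_vec, response))
    return response
```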
Semantic Caching vs. Prompt Caching
These are two different techniques; the exam may use them as distractors:
| | Semantic Caching | Prompt Caching |
|---|---|---|
| What is cached | Full FM response, keyed by embedding similarity | Static system prompt portion |
| Where | External: ElastiCache (Redis) + vector search | Bedrock native feature |
| Bypasses FM? | Yes: a cache hit skips the FM entirely | No: the FM is still invoked but reuses the cached prompt context |
| Best for | Repeat or paraphrased user questions | Large, stable system prompts reused across many calls |
4.3 Batch Inference
What Is Batch Inference?
Batch inference allows you to submit a large dataset of prompts at once and receive all responses asynchronously, at a lower cost than on-demand inference.
Key characteristics:
- Jobs submitted to a queue and processed asynchronously
- Input: S3 object (JSONL file with prompts)
- Output: S3 object (JSONL file with responses)
- Not real-time: not suitable for interactive or latency-sensitive workloads (see the job sketch below)
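A hedged sketch of submitting a batch job with boto3; the bucket, IAM role, job name, and model ID are placeholders, and the commented JSONL record shows the general input shape rather than an exact schema:

```python
import boto3

bedrock = boto3.client("bedrock")

# Each line of the S3 input file is one JSON record, roughly shaped like:
# {"recordId": "review-0001",
#  "modelInput": {"anthropic_version": "bedrock-2023-05-31", "max_tokens": 200,
#                 "messages": [{"role": "user", "content": "Classify the sentiment: ..."}]}}

job = bedrock.create_model_invocation_job(
    jobName="nightly-review-analysis",                                   # placeholder
    roleArn="arn:aws:iam::123456789012:role/BedrockBatchRole",           # placeholder role
    modelId="anthropic.claude-3-haiku-20240307-v1:0",
    inputDataConfig={"s3InputDataConfig": {"s3Uri": "s3://my-bucket/batch/input/reviews.jsonl"}},
    outputDataConfig={"s3OutputDataConfig": {"s3Uri": "s3://my-bucket/batch/output/"}},
)
print(job["jobArn"])  # job runs asynchronously; results land in the output S3 prefix
```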
When to Use Batch Inference
- Running nightly analysis on thousands of customer reviews
- Generating product descriptions in bulk
- Processing a large document corpus for classification or summarization
- Any high-volume, non-time-sensitive workload
Cost Advantage
Batch inference is typically ~50% cheaper per token compared to on-demand pricing (varies by model), making it the most cost-effective option for non-urgent, high-volume jobs.
Batch Inference Hard Constraints
- Input/output format: S3 JSONL only; no other formats are accepted
- Not real-time: jobs are queued and processed asynchronously; never use for interactive workloads
- Custom models require Provisioned Throughput: you cannot run batch inference on a custom fine-tuned model without first purchasing PTU
4.4 Monitoring Operational Metrics
CloudWatch Metrics for Bedrock
| Metric | What It Measures |
|---|---|
| InvocationLatency | End-to-end latency of model calls |
| InputTokenCount | Number of input tokens consumed |
| OutputTokenCount | Number of output tokens generated |
| InvocationClientErrors | 4xx errors |
| InvocationServerErrors | 5xx errors |
| ThrottledRequests | Requests throttled due to exceeding on-demand TPS limit |
How to use these on the exam:
- CloudWatch dashboards help visualize token consumption and latency trends over time
- CloudWatch Alarms are the right answer when you need notifications for high token usage, error spikes, or throttling
- Token metrics can be used for usage analysis, cost forecasting, and identifying unusually expensive prompt patterns
- Questions may describe this as monitoring usage "by model" or over time; think CloudWatch metrics + dashboards
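As an illustration, a sketch of an alarm on input-token consumption for one model; the namespace and ModelId dimension follow the commonly documented Bedrock metric layout, and the threshold and SNS topic are placeholders:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when hourly input-token consumption for one model exceeds a budget threshold.
cloudwatch.put_metric_alarm(
    AlarmName="bedrock-high-input-tokens",
    Namespace="AWS/Bedrock",
    MetricName="InputTokenCount",
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],
    Statistic="Sum",
    Period=3600,                      # one hour
    EvaluationPeriods=1,
    Threshold=5_000_000,              # illustrative token budget
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:token-usage-alerts"],  # placeholder topic
)
```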
Responding to ThrottlingException
When you receive a ThrottlingException:
- Short-term: Implement exponential backoff and retry
- Long-term: Switch to Provisioned Throughput
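A minimal sketch of the short-term fix: retrying throttled Converse calls with exponential backoff and jitter. Retry counts and sleep values are illustrative; boto3's built-in retry configuration is an alternative.

```python
import random
import time
import boto3
from botocore.exceptions import ClientError

runtime = boto3.client("bedrock-runtime")

def invoke_with_backoff(messages, model_id, max_retries=5):
    """Retry throttled calls with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return runtime.converse(modelId=model_id, messages=messages)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise                                 # only retry throttling errors
            time.sleep((2 ** attempt) + random.random())  # 1s, 2s, 4s, ... plus jitter
    raise RuntimeError("Still throttled after retries; consider Provisioned Throughput")
```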