
Domain 4: Operational Efficiency and Optimization (12%)

โ† Domain 3 ยท Next: Domain 5 โ†’

Exam Tip

This is the smallest domain (12%) but the questions are direct. Know the Provisioned Throughput vs. On-Demand trade-off cold: PTU = predictable, consistent traffic; On-Demand = sporadic. Don't choose PTU for development or unpredictable workloads. Know that streaming doesn't save money; it only improves perceived latency.


4.1 Provisioned Throughput vs. On-Demand

Comparison Table

| | Provisioned Throughput (PTU) | On-Demand |
| --- | --- | --- |
| Pricing model | Fixed (per Model Unit per hour) | Pay per input/output token |
| Commitment | 1 month or 6 months | No commitment |
| Traffic pattern | Predictable, steady-state | Sporadic, variable |
| Performance | Guaranteed Model Units (no throttling) | May throttle during peak demand |
| Best for | Production 24/7 applications | Development, testing, burst workloads |

Model Units (MUs)

  • Provisioned Throughput is measured in Model Units (MUs)
  • Each MU provides a specific tokens-per-minute (TPM) capacity
  • You purchase a fixed number of MUs for a 1-month or 6-month term
  • Unused MUs are still billed; the commitment cost applies regardless of usage
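The trade-off above comes down to a back-of-the-envelope cost comparison, sketched below. All prices are hypothetical placeholders, not real AWS rates; check current Bedrock pricing for your model and region.

```python
# Sketch: monthly cost of On-Demand vs. a PTU commitment.
# Rates below are HYPOTHETICAL, for illustration only.

def on_demand_cost(input_tokens, output_tokens,
                   price_in_per_1k=0.003, price_out_per_1k=0.015):
    """Pay-per-token: cost scales directly with usage."""
    return (input_tokens / 1000) * price_in_per_1k + \
           (output_tokens / 1000) * price_out_per_1k

def ptu_cost(model_units, hourly_rate_per_mu=40.0, hours=730):
    """Fixed: billed for the full term whether or not the MUs are used."""
    return model_units * hourly_rate_per_mu * hours

# Steady 24/7 production workload: high volume favors the commitment.
steady = on_demand_cost(input_tokens=15_000_000_000,
                        output_tokens=3_000_000_000)   # 90,000.0
fixed = ptu_cost(model_units=1)                        # 29,200.0

# Sporadic dev workload: On-Demand is far cheaper, because PTU
# bills the full commitment even when idle.
sporadic = on_demand_cost(input_tokens=2_000_000,
                          output_tokens=400_000)       # 12.0
```

The crossover is what the exam probes: at steady high volume the fixed commitment wins, while at low or bursty volume you would pay 29,200 (in this toy example) for 12 worth of usage.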

When to Use Each

```text
Traffic is consistent and predictable (24/7)?
└─ Provisioned Throughput

Traffic is unpredictable, bursty, or for dev/test?
└─ On-Demand

High-volume, non-real-time batch jobs?
└─ Batch Inference
```

Exam Trap

Do not choose PTU for sporadic or development workloads. Even though PTU is cheaper per token at high volume, you pay for the full commitment period regardless of usage. On-Demand is correct for unpredictable or low-volume scenarios.

Also: PTU commits you for 1 month or 6 months; the exam tests this commitment period detail.


4.2 Token Efficiency & Cost Optimization

Token Cost Drivers

  1. Input tokens: The length of your prompt (system prompt + user message + retrieved context)
  2. Output tokens: The length of the model's response
  3. Model choice: Larger, more capable models cost more per token

Optimization Techniques

| Technique | How It Helps |
| --- | --- |
| Concise system prompts | Shorter system prompts = fewer input tokens on every call |
| Set maxTokens explicitly | Caps output length; prevents runaway long responses |
| Streaming | Does NOT reduce cost; improves perceived latency only |
| Smaller model for simple tasks | Use Claude Haiku instead of Sonnet for tasks that don't need full reasoning |
| Truncate conversation history | Only include recent relevant turns, not the full history |
| Reduce top-K in RAG | Fewer retrieved chunks = shorter input context = lower cost |
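History truncation is the easiest of these to picture in code. A minimal sketch, assuming messages in the role/content shape used by the Bedrock Converse API (the helper itself is plain Python):

```python
# Sketch: keep only the most recent turns of a conversation before
# sending it to the model -- fewer input tokens on every call.

def truncate_history(messages, max_turns=3):
    """Keep the last max_turns user/assistant turn pairs (2 entries each)."""
    return messages[-(max_turns * 2):]

# Toy 5-turn conversation: alternating user / assistant messages.
history = []
for i in range(10):
    role = "user" if i % 2 == 0 else "assistant"
    history.append({"role": role, "content": [{"text": f"message {i}"}]})

recent = truncate_history(history, max_turns=3)  # 6 most recent messages
```

Note the slice keeps whole turn pairs, so the truncated history still starts with a user message, which the Converse API requires.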

TIP

The exam distinguishes between techniques that reduce cost vs. improve latency. Streaming improves perceived latency but does not reduce token count or cost.

Inference Parameter Quality Controls

  • Low temperature improves consistency for standardized outputs
  • Higher temperature increases diversity for creative generation
  • Lower topP keeps sampling tighter and more predictable
  • maxTokens limits output size and cost
  • stopSequences stops generation when a specified token or phrase appears, helping enforce cleaner output boundaries

Best-fit examples:

  • Contracts, policy language, compliance wording → lower temperature
  • Brainstorming, alternate copy, variant generation → higher temperature
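These parameters map directly onto the inferenceConfig object passed to the Bedrock Converse API (bedrock-runtime client, converse call). A sketch of the two best-fit profiles above; the key names come from the API, the values and stop sequence are illustrative:

```python
# Sketch: inferenceConfig payloads for the two use cases above.

def compliance_config():
    """Standardized output: low temperature, tight topP, hard caps."""
    return {
        "maxTokens": 512,        # bound output length and cost
        "temperature": 0.1,      # consistent, repeatable wording
        "topP": 0.5,             # tighter, more predictable sampling
        "stopSequences": ["END_OF_CLAUSE"],  # hypothetical boundary marker
    }

def brainstorm_config():
    """Creative generation: higher temperature for diversity."""
    return {"maxTokens": 1024, "temperature": 0.9, "topP": 0.95}

# Usage (not run here):
# client.converse(modelId=..., messages=..., inferenceConfig=compliance_config())
```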

Prompt Caching

Prompt caching reduces repeated inference cost and latency when many requests share the same large static prompt prefix.

How to think about it:

  • Cache the repeated prefix, such as a long system prompt, policy block, or embedded manual excerpt
  • Later calls reuse that cached prefix instead of recomputing it from scratch
  • Best fit when the expensive part of the prompt stays mostly unchanged across many requests

TIP

If the question asks for the lowest-cost or lowest-latency way to reuse a long, repeated prompt prefix, think Prompt Caching.
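In the Converse API, prompt caching is requested by placing a cachePoint marker after the static prefix; everything before the marker is eligible for caching. A sketch of the request shape, to the best of my reading of the API (the model ID and prompt text are illustrative):

```python
# Sketch: Converse API request using a cachePoint to cache a long,
# static system prompt so later calls reuse the cached prefix.

LONG_SYSTEM_PROMPT = "You are a support agent. Policy manual: ..." * 100

def build_request(user_question):
    return {
        "modelId": "anthropic.claude-3-5-haiku-20241022-v1:0",
        "system": [
            {"text": LONG_SYSTEM_PROMPT},
            {"cachePoint": {"type": "default"}},  # cache boundary marker
        ],
        "messages": [
            {"role": "user", "content": [{"text": user_question}]}
        ],
    }

# Two requests share the cached prefix; only the user turn differs.
req_a = build_request("What is your return policy?")
req_b = build_request("Do you ship internationally?")
```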

Semantic Caching

Semantic caching avoids FM invocations entirely for repeat questions by returning a cached response when a new question is semantically equivalent to one already answered.

How it works:

```text
Incoming question
    ↓ embed (Titan / Cohere)
Query vector cache (ElastiCache Redis)
    ↓ cosine similarity check
Similarity ≥ threshold?
    ├─ Yes → return cached response (no FM call, zero token cost)
    └─ No  → invoke FM → cache response + embedding → return
```

Key characteristics:

  • Uses embeddings + cosine similarity (or other distance metrics) against a configurable threshold
  • ElastiCache (Redis) is the AWS service for storing and querying the vector embeddings
  • Handles paraphrased questions that are textually different but semantically identical, e.g., "What is your return policy?" vs. "How do returns work?"
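The cache-hit decision in the flow above reduces to a cosine-similarity check against a threshold. A minimal pure-Python sketch, with toy vectors standing in for real Titan/Cohere embeddings and a list standing in for the Redis vector store:

```python
# Sketch: semantic-cache lookup -- embed, compare, return on a hit.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def lookup(cache, query_vec, threshold=0.9):
    """Return the cached response if any stored question is close enough."""
    best = max(cache, key=lambda entry: cosine(entry["vec"], query_vec))
    if cosine(best["vec"], query_vec) >= threshold:
        return best["response"]   # cache hit: no FM call, zero token cost
    return None                   # cache miss: invoke the FM, then cache

cache = [{"vec": [0.9, 0.1, 0.0], "response": "30-day returns."}]
hit = lookup(cache, [0.88, 0.12, 0.01])   # paraphrase: nearby vector
miss = lookup(cache, [0.0, 0.2, 0.95])    # unrelated question
```

The threshold is the tuning knob: too low and unrelated questions get stale answers, too high and paraphrases miss the cache.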

Semantic Caching vs. Prompt Caching

These are two different techniques; the exam may use them as distractors:

| | Semantic Caching | Prompt Caching |
| --- | --- | --- |
| What is cached | Full FM response, keyed by embedding similarity | Static system prompt portion |
| Where | External: ElastiCache (Redis) + vector search | Bedrock native feature |
| Bypasses FM? | Yes, a cache hit skips the FM entirely | No, the FM is still invoked but reuses cached prompt context |
| Best for | Repeat or paraphrased user questions | Large, stable system prompts reused across many calls |

4.3 Batch Inference

What Is Batch Inference?

Batch inference allows you to submit a large dataset of prompts at once and receive all responses asynchronously, at a lower cost than on-demand inference.

Key characteristics:

  • Jobs submitted to a queue and processed asynchronously
  • Input: S3 object (JSONL file with prompts)
  • Output: S3 object (JSONL file with responses)
  • Not real-time; not suitable for interactive or latency-sensitive workloads

When to Use Batch Inference

  • Running nightly analysis on thousands of customer reviews
  • Generating product descriptions in bulk
  • Processing a large document corpus for classification or summarization
  • Any high-volume, non-time-sensitive workload
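The S3 input object is JSONL: one record per line, each with a recordId and the model's native request body under modelInput. A sketch for the nightly-reviews example (the Anthropic Messages body shape is shown; field values are illustrative):

```python
# Sketch: build the JSONL input for a Bedrock batch inference job.
import json

def build_batch_jsonl(prompts):
    lines = []
    for i, prompt in enumerate(prompts):
        record = {
            "recordId": f"REC{i:07d}",          # unique ID per prompt
            "modelInput": {                      # model-native request body
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 300,
                "messages": [{"role": "user", "content": prompt}],
            },
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)

reviews = ["Great product!", "Arrived broken.", "Average quality."]
jsonl = build_batch_jsonl(reviews)
# Upload this object to S3, then start the asynchronous job with
# bedrock.create_model_invocation_job(...); results land in S3 as JSONL.
```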

Cost Advantage

Batch inference is typically ~50% cheaper per token compared to on-demand pricing (varies by model), making it the most cost-effective option for non-urgent, high-volume jobs.

Batch Inference Hard Constraints

  • Input/output format: S3 JSONL only; no other formats accepted
  • Not real-time: jobs are queued and processed asynchronously; never use for interactive workloads
  • Custom models require Provisioned Throughput: you cannot run batch inference on a custom fine-tuned model without first purchasing PTU

4.4 Monitoring Operational Metrics

CloudWatch Metrics for Bedrock

| Metric | What It Measures |
| --- | --- |
| InvocationLatency | End-to-end latency of model calls |
| InputTokenCount | Number of input tokens consumed |
| OutputTokenCount | Number of output tokens generated |
| InvocationClientErrors | 4xx errors |
| InvocationServerErrors | 5xx errors |
| ThrottledRequests | Requests throttled due to exceeding the on-demand TPS limit |

How to use these on the exam:

  • CloudWatch dashboards help visualize token consumption and latency trends over time
  • CloudWatch Alarms are the right answer when you need notifications for high token usage, error spikes, or throttling
  • Token metrics can be used for usage analysis, cost forecasting, and identifying unusually expensive prompt patterns
  • Questions may describe this as monitoring usage "by model" or over time; think CloudWatch metrics + dashboards
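A CloudWatch alarm on these metrics might look like the sketch below. The namespace and metric name follow the table above; the threshold, alarm name, and SNS topic ARN are illustrative placeholders:

```python
# Sketch: parameters for a CloudWatch alarm on daily Bedrock token usage.

def token_alarm_params(model_id, daily_token_threshold):
    return {
        "AlarmName": f"bedrock-high-input-tokens-{model_id}",
        "Namespace": "AWS/Bedrock",
        "MetricName": "InputTokenCount",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "Statistic": "Sum",
        "Period": 86400,                     # one day, in seconds
        "EvaluationPeriods": 1,
        "Threshold": daily_token_threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        # Hypothetical SNS topic for notifications:
        "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    }

params = token_alarm_params("anthropic.claude-3-5-sonnet-20240620-v1:0",
                            5_000_000)
# boto3.client("cloudwatch").put_metric_alarm(**params)  # creates the alarm
```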

Responding to ThrottlingException

When you receive a ThrottlingException:

  • Short-term: Implement exponential backoff and retry
  • Long-term: Switch to Provisioned Throughput
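The short-term fix can be sketched as exponential backoff with full jitter wrapped around any invocation callable. A minimal sketch; the exception handling is simplified (a real client would catch the specific ThrottlingException from botocore):

```python
# Sketch: exponential backoff with full jitter for throttled calls.
import random
import time

def compute_delay(attempt, base=0.5, cap=30.0):
    """Delay window doubles each attempt, capped; full jitter within it."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_backoff(invoke, max_attempts=5):
    """invoke() is any callable that raises when throttled."""
    for attempt in range(max_attempts):
        try:
            return invoke()
        except Exception:                    # e.g. ThrottlingException
            if attempt == max_attempts - 1:
                raise                        # retries exhausted
            time.sleep(compute_delay(attempt))

# Usage: call_with_backoff(lambda: client.converse(modelId=..., messages=...))
```

If backoff fires routinely rather than during rare spikes, that is the signal to move to the long-term fix: Provisioned Throughput.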

โ† Domain 3 ยท Next: Domain 5 โ†’

Happy Studying! ๐Ÿš€ โ€ข Privacy-friendly analytics โ€” no cookies, no personal data
Privacy Policy โ€ข AI Disclaimer โ€ข Report an issue