
Domain 4: Operational Efficiency and Optimization (12%)

โ† Domain 3 ยท Next: Domain 5 โ†’

Exam Tip

This is the smallest domain (12%) but the questions are direct. Know the Provisioned Throughput vs. On-Demand trade-off cold: PTU = predictable, consistent traffic; On-Demand = sporadic. Don't choose PTU for development or unpredictable workloads. Know that streaming doesn't save money: it only improves perceived latency.


4.1 Provisioned Throughput vs. On-Demand

Comparison Table

|                 | Provisioned Throughput (PTU)           | On-Demand                             |
|-----------------|----------------------------------------|---------------------------------------|
| Pricing model   | Fixed (per Model Unit per hour)        | Pay per input/output token            |
| Commitment      | 1 month or 6 months                    | No commitment                         |
| Traffic pattern | Predictable, steady-state              | Sporadic, variable                    |
| Performance     | Guaranteed Model Units (no throttling) | May throttle during peak demand       |
| Best for        | Production 24/7 applications           | Development, testing, burst workloads |

Model Units (MUs)

  • Provisioned Throughput is measured in Model Units (MUs)
  • Each MU provides a specific tokens-per-minute (TPM) capacity
  • You purchase a fixed number of MUs for a 1-month or 6-month term
  • Unused MUs are still billed: the commitment cost applies regardless of usage

When to Use Each

Traffic is consistent and predictable (24/7)?
└─ Provisioned Throughput (guaranteed throughput, lower per-token cost at scale)

Traffic is unpredictable, bursty, or for dev/test?
└─ On-Demand (no commitment risk, pay only for what you use)

High-volume, non-real-time batch jobs?
└─ Batch Inference (see 4.3)

Exam Trap

Do not choose PTU for sporadic or development workloads. Even though PTU is cheaper per token at high volume, you pay for the full commitment period regardless of usage. On-Demand is correct for unpredictable or low-volume scenarios.

Also: PTU commits you for 1 month or 6 months; the exam tests this commitment-period detail.
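The trade-off above boils down to a fixed cost vs. a usage-proportional cost. A minimal break-even sketch, where all prices are hypothetical placeholders (check current Bedrock pricing for real numbers):

```python
# Break-even sketch: PTU vs. On-Demand. Prices are hypothetical placeholders.

HOURS_PER_MONTH = 730  # average hours in a month

def on_demand_cost(tokens_per_month, price_per_1k_tokens):
    """Pay-per-token: cost scales linearly with usage."""
    return tokens_per_month / 1000 * price_per_1k_tokens

def ptu_cost(model_units, price_per_mu_hour, hours=HOURS_PER_MONTH):
    """Fixed commitment: billed for every hour whether used or not."""
    return model_units * price_per_mu_hour * hours

# Example: 1 MU at a hypothetical $20/hour vs. on-demand at $0.01/1K tokens
fixed = ptu_cost(1, 20.0)                     # $14,600/month regardless of traffic
variable = on_demand_cost(500_000_000, 0.01)  # $5,000 for 500M tokens
```

At this (made-up) price point, PTU only wins once monthly volume pushes the on-demand bill past the fixed commitment; below that, you are paying for capacity you don't use.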


4.2 Token Efficiency & Cost Optimization

Token Cost Drivers

  1. Input tokens: The length of your prompt (system prompt + user message + retrieved context)
  2. Output tokens: The length of the model's response
  3. Model choice: Larger, more capable models cost more per token

Optimization Techniques

| Technique                      | How It Helps                                                             |
|--------------------------------|--------------------------------------------------------------------------|
| Concise system prompts         | Shorter system prompts = fewer input tokens on every call                |
| Set maxTokens explicitly       | Caps output length, preventing runaway long responses                    |
| Streaming                      | Does NOT reduce cost; improves perceived latency only                    |
| Smaller model for simple tasks | Use Claude Haiku instead of Sonnet for tasks that don't need full reasoning |
| Truncate conversation history  | Only include recent, relevant turns, not the full history                |
| Reduce top-K in RAG            | Fewer retrieved chunks = shorter input context = lower cost              |

TIP

The exam distinguishes between techniques that reduce cost vs. improve latency. Streaming improves perceived latency but does not reduce token count or cost. Cost optimization is exclusively about reducing input and output tokens.
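Two of the token-reducing techniques above (truncating history and capping output) can be sketched as request preparation. The helper names and the turn limit are illustrative; the `inferenceConfig`/`maxTokens` fields follow the shape of Bedrock's Converse API request:

```python
# Sketch: trim conversation history and cap output tokens before calling
# Bedrock. Helper names and the turn limit are illustrative choices.

def truncate_history(messages, max_turns=6):
    """Keep only the most recent turns to shrink the input-token count."""
    return messages[-max_turns:]

def build_request(model_id, messages, max_output_tokens=512):
    """Assemble a Converse-style request with an explicit output cap."""
    return {
        "modelId": model_id,
        "messages": truncate_history(messages),
        "inferenceConfig": {"maxTokens": max_output_tokens},
    }

# Usage, assuming `history` is already a Converse-format message list:
#   req = build_request("anthropic.claude-3-haiku-20240307-v1:0", history)
#   response = boto3.client("bedrock-runtime").converse(**req)
```

Every dropped turn saves input tokens on *every* subsequent call, which is why history truncation compounds over a long conversation.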


4.3 Batch Inference

What Is Batch Inference?

Batch inference allows you to submit a large dataset of prompts at once and receive all responses asynchronously, at a lower cost than on-demand inference.

Key characteristics:

  • Jobs submitted to a queue and processed asynchronously
  • Input: S3 object (JSONL file with prompts)
  • Output: S3 object (JSONL file with responses)
  • Not real-time; not suitable for interactive or latency-sensitive workloads
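Preparing the S3 input described above means writing one JSON record per line. A minimal sketch, assuming the `recordId`/`modelInput` record shape used by Bedrock batch jobs and the Anthropic messages schema as an example prompt body:

```python
# Sketch: write a batch-inference input file as JSONL, one prompt per line.
# The recordId/modelInput shape is assumed from Bedrock's batch job input
# format; the prompt body uses the Anthropic messages schema as an example.
import json

def write_batch_input(prompts, path):
    """Serialize prompts to JSONL for upload to the job's input S3 location."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            record = {
                "recordId": f"rec-{i:06d}",  # unique ID to match outputs back
                "modelInput": {
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 256,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            f.write(json.dumps(record) + "\n")

# Usage: write_batch_input(reviews, "batch-input.jsonl"), upload to S3,
# then start the job (e.g. via bedrock.create_model_invocation_job).
```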

When to Use Batch Inference

  • Running nightly analysis on thousands of customer reviews
  • Generating product descriptions in bulk
  • Processing a large document corpus for classification or summarization
  • Any high-volume, non-time-sensitive workload

Cost Advantage

Batch inference is typically ~50% cheaper per token than on-demand pricing (varies by model), making it the most cost-effective option for non-urgent, high-volume jobs.


4.4 Monitoring Operational Metrics

CloudWatch Metrics for Bedrock

| Metric                 | What It Measures                                          |
|------------------------|-----------------------------------------------------------|
| InvocationLatency      | End-to-end latency of model calls                         |
| InputTokenCount        | Number of input tokens consumed                           |
| OutputTokenCount       | Number of output tokens generated                         |
| InvocationClientErrors | 4xx errors (bad requests: client-side issues)             |
| InvocationServerErrors | 5xx errors (Bedrock service errors)                       |
| InvocationThrottles    | Requests throttled for exceeding the on-demand rate limit |

Responding to ThrottlingException

When you receive a ThrottlingException:

  • Short-term: Implement exponential backoff and retry
  • Long-term: Switch to Provisioned Throughput for guaranteed Model Units and no throttling
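The short-term fix can be sketched as a retry wrapper. This is a minimal illustration; production code would catch botocore's `ClientError` and inspect its error code rather than string-matching the exception message:

```python
# Sketch: exponential backoff with jitter for throttled Bedrock calls.
# The string-based exception check is illustrative only.
import random
import time

def invoke_with_backoff(call, max_retries=5, base_delay=0.5):
    """Retry a throttled call, roughly doubling the wait each attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as err:
            if "ThrottlingException" not in str(err):
                raise  # only retry throttling; surface everything else
            if attempt == max_retries - 1:
                raise  # out of retries
            # Exponential backoff with jitter: ~0.5s, ~1s, ~2s, ...
            time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))

# Usage: invoke_with_backoff(lambda: client.converse(**req))
```

The jitter matters: without it, many throttled clients retry in lockstep and re-trigger the same throttle.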

Flashcards

Q: When should you use Provisioned Throughput instead of On-Demand?

A: Use Provisioned Throughput when your application has predictable, consistent, high-volume traffic running 24/7. Avoid it for development, testing, or unpredictable workloads; you pay the full commitment cost regardless of actual usage.

โ† Domain 3 ยท Next: Domain 5 โ†’

Happy Studying! 🚀