
Domain 5: Testing, Validation, and Troubleshooting (11%)

โ† Domain 4 ยท โ† Back to Overview

Exam Tip

This domain is 11% of the exam. Know the five automated evaluation metrics cold, especially Groundedness (it detects hallucinations in RAG). Know when human evaluation is required vs. automated, and always remember: CloudWatch = operational metrics, CloudTrail = audit trail.


5.1 Model Evaluation

What Is Model Evaluation in Bedrock?

Amazon Bedrock includes a built-in Model Evaluation feature that lets you evaluate and compare FM performance using standardized metrics, both automated and human-based.

Automated Evaluation Metrics

| Metric | What It Measures | RAG-Specific? |
| --- | --- | --- |
| Groundedness | Is the response factually supported by the retrieved context? | Yes (detects hallucinations) |
| Relevance | Does the response actually answer the user's question? | No |
| Accuracy | Is the factual content of the response correct? | No |
| Fluency | Is the response natural, readable, and well-written? | No |
| Robustness | Does the model perform consistently across varied prompt phrasings? | No |

Key Distinction

Groundedness is the most RAG-specific metric: it checks whether the model's answer is supported by the retrieved documents, not invented. A low Groundedness score means the model is hallucinating (generating information not in the retrieved context).
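As a rough intuition for what Groundedness measures, the hypothetical helper below scores a response by the fraction of its words that appear in the retrieved context. Bedrock's real metric uses an LLM-based judge, not this token-overlap heuristic; this is illustration only.

```python
import re

def toy_groundedness(response: str, context: str) -> float:
    """Fraction of response words that also appear in the retrieved context.
    (Toy heuristic only -- Bedrock's Groundedness uses an LLM judge.)"""
    resp = set(re.findall(r"[a-z0-9]+", response.lower()))
    ctx = set(re.findall(r"[a-z0-9]+", context.lower()))
    return len(resp & ctx) / len(resp) if resp else 0.0

context = "Paris is the capital of France."
grounded = toy_groundedness("The capital of France is Paris.", context)     # fully supported
hallucinated = toy_groundedness("The capital of France is Lyon.", context)  # "Lyon" not in context
```

Here `hallucinated` scores lower than `grounded` because "Lyon" never appears in the retrieved context, which is exactly the signal a groundedness check is after.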

Automated vs. Human Evaluation

| | Automated Evaluation | Human Evaluation |
| --- | --- | --- |
| How | Algorithm-scored using metrics or automated judges | Real humans rate responses |
| Speed | Fast, scalable | Slow, expensive |
| Best for | Regression testing, large dataset comparisons | Subjective quality, tone, nuance |

Use human evaluation when:

  • Evaluating subjective qualities (tone, brand voice, empathy)
  • Validating safety content decisions that require human judgment
  • Ground truth labels are unavailable or difficult to define algorithmically

5.2 Setting Up a Model Evaluation Job

Steps

  1. Select models: Choose one or more foundation models or fine-tuned variants to compare
  2. Provide a dataset: Upload a prompt dataset to S3 (JSONL format)
  3. Select metrics: Choose which automated metrics to compute
  4. Run the job: Bedrock runs each prompt against the selected models and scores each response
  5. Review results: Compare metric scores across models in the Bedrock console
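The steps above could also be driven programmatically. The sketch below assembles a request body for the Bedrock CreateEvaluationJob API; all field names, metric identifiers (e.g. `Builtin.Accuracy`), model IDs, bucket names, and ARNs here are assumptions for illustration, so verify them against the current API reference before use.

```python
def build_eval_job_request(job_name, model_id, dataset_s3_uri,
                           output_s3_uri, role_arn):
    """Assemble a CreateEvaluationJob request body (field names are
    assumptions based on the Bedrock API -- verify before use)."""
    return {
        "jobName": job_name,
        "roleArn": role_arn,  # IAM role Bedrock assumes to read/write S3
        "evaluationConfig": {
            "automated": {
                "datasetMetricConfigs": [{
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "my-prompts",
                        "datasetLocation": {"s3Uri": dataset_s3_uri},
                    },
                    # Built-in metric identifiers are assumptions here
                    "metricNames": ["Builtin.Accuracy", "Builtin.Robustness"],
                }]
            }
        },
        "inferenceConfig": {
            "models": [{"bedrockModel": {"modelIdentifier": model_id}}]
        },
        "outputDataConfig": {"s3Uri": output_s3_uri},
    }

request = build_eval_job_request(
    "domain5-eval",
    "anthropic.claude-3-haiku-20240307-v1:0",   # example model ID
    "s3://my-bucket/prompts.jsonl",
    "s3://my-bucket/eval-results/",
    "arn:aws:iam::123456789012:role/BedrockEvalRole",
)
# To submit (requires AWS credentials and boto3):
# boto3.client("bedrock").create_evaluation_job(**request)
```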

Prompt Dataset Format

  • Stored in Amazon S3 as a JSONL file
  • Each line contains a prompt (and optionally a reference/expected answer for accuracy scoring)
  • Example line: {"prompt": "What is the capital of France?", "referenceResponse": "Paris"}
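A minimal sketch of producing such a dataset file (the file name and second record are hypothetical; the record keys follow the example line above):

```python
import json

# Hypothetical prompt dataset; one JSON object per line (JSONL).
prompts = [
    {"prompt": "What is the capital of France?", "referenceResponse": "Paris"},
    {"prompt": "What is the largest planet?", "referenceResponse": "Jupiter"},
]

with open("prompts.jsonl", "w") as f:
    for record in prompts:
        f.write(json.dumps(record) + "\n")  # one record per line

# Then upload to S3, e.g.:
# boto3.client("s3").upload_file("prompts.jsonl", "my-bucket", "prompts.jsonl")
```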

5.3 CloudWatch Monitoring

Key Bedrock CloudWatch Metrics

| Metric | Description |
| --- | --- |
| InvocationLatency | P50/P90/P99 latency for model calls (end-to-end) |
| InputTokenCount | Total input tokens consumed in the period |
| OutputTokenCount | Total output tokens generated in the period |
| ThrottledRequests | Count of requests throttled (on-demand TPS exceeded) |
| InvocationClientErrors | 4xx errors (bad request format, invalid model ID) |
| InvocationServerErrors | 5xx errors (Bedrock service-side failures) |
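As one way to pull these numbers programmatically, the kwargs below target CloudWatch's `get_metric_statistics` call; the `ModelId` dimension value and the exact metric names are assumptions to check against the `AWS/Bedrock` namespace in your account.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical query: last hour of Bedrock latency percentiles.
# Pass these kwargs to boto3.client("cloudwatch").get_metric_statistics.
end = datetime.now(timezone.utc)
latency_query = {
    "Namespace": "AWS/Bedrock",
    "MetricName": "InvocationLatency",
    "Dimensions": [{"Name": "ModelId",            # per-model breakdown
                    "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],
    "StartTime": end - timedelta(hours=1),
    "EndTime": end,
    "Period": 300,                                # 5-minute buckets
    "ExtendedStatistics": ["p50", "p90", "p99"],  # percentile statistics
}
# datapoints = boto3.client("cloudwatch").get_metric_statistics(**latency_query)
```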

Setting Up CloudWatch Alarms

Use CloudWatch Alarms to proactively catch issues:

  • ThrottledRequests > 0 → Consider switching to Provisioned Throughput
  • InvocationLatency > [threshold] → Investigate prompt length or model choice
  • InvocationServerErrors > 0 → Investigate Bedrock service health or retry configuration
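The first alarm above might be expressed as parameters for CloudWatch's `put_metric_alarm` call; the metric name and SNS topic ARN below are assumptions for illustration.

```python
# Hypothetical alarm: fire whenever any Bedrock request is throttled.
# Metric name and SNS topic ARN are illustrative assumptions.
throttle_alarm = {
    "AlarmName": "bedrock-throttled-requests",
    "Namespace": "AWS/Bedrock",
    "MetricName": "ThrottledRequests",
    "Statistic": "Sum",
    "Period": 60,                                  # evaluate per minute
    "EvaluationPeriods": 1,
    "Threshold": 0,
    "ComparisonOperator": "GreaterThanThreshold",  # i.e. fires when > 0
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
}
# boto3.client("cloudwatch").put_metric_alarm(**throttle_alarm)
```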

5.4 Troubleshooting Common Issues

Problem → Cause → Solution

| Problem | Likely Cause | Solution |
| --- | --- | --- |
| ThrottlingException | Exceeded on-demand TPS limit | Implement exponential backoff; use Provisioned Throughput |
| High latency | Long prompts or large retrieved context | Reduce input tokens; use streaming to improve perceived latency |
| Poor RAG answer quality | Wrong chunks returned (low retrieval relevance) | Tune chunking strategy, adjust top-K, improve embedding model |
| Guardrail blocking valid content | Content filter sensitivity too high | Lower filter strength; review denied-topics configuration |
| Hallucinated response | FM ignoring retrieved context | Strengthen system prompt to emphasize context use; reduce temperature; check Groundedness metric |
| Agent not calling the right action | Action Group schema unclear or description imprecise | Improve OpenAPI schema descriptions; ensure action names are unambiguous |
| High cost | Verbose system prompts or large top-K retrieval | Shorten system prompt; reduce top-K; set maxTokens; use a smaller model |
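The exponential-backoff fix in the first row can be sketched generically. In practice the AWS SDKs implement this for you via their retry modes (e.g. botocore's `Config(retries={"mode": "adaptive"})`); the helper and simulated error below are illustrative.

```python
import random
import time

def call_with_backoff(fn, retryable=(RuntimeError,), max_attempts=5, base=0.5):
    """Retry fn() on retryable errors, doubling the delay each attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error
            delay = min(base * 2 ** attempt, 8.0)        # exponential, capped
            time.sleep(delay * (0.5 + random.random() / 2))  # add jitter

# Simulated throttled call: fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("ThrottlingException (simulated)")
    return "ok"

result = call_with_backoff(flaky, base=0.001)
```

Jitter matters here: without it, many throttled clients retry in lockstep and re-trigger the same throttling.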

5.5 Debugging Bedrock Agents

Using the Orchestration Trace

Set enableTrace: true in the InvokeAgent request to inspect:

  • The Agent's step-by-step reasoning
  • Which Action Group it decided to call and why
  • What Knowledge Base query it ran
  • The final synthesis of results
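Consuming that trace might look like the sketch below. The event keys (`trace`, `chunk`) follow the bedrock-agent-runtime InvokeAgent response stream, but treat the exact payload shapes as assumptions to verify against the API reference; the simulated stream is illustrative.

```python
def split_agent_events(completion):
    """Separate orchestration trace events from answer chunks in an
    InvokeAgent event stream (event shapes are assumptions to verify)."""
    traces, answer_parts = [], []
    for event in completion:
        if "trace" in event:
            traces.append(event["trace"])  # step-by-step reasoning payload
        elif "chunk" in event:
            answer_parts.append(event["chunk"]["bytes"].decode("utf-8"))
    return traces, "".join(answer_parts)

# Simulated stream: one trace event, two answer chunks.
fake_stream = [
    {"trace": {"orchestrationTrace": {"rationale": "call lookup action"}}},
    {"chunk": {"bytes": b"The answer "}},
    {"chunk": {"bytes": b"is Paris."}},
]
traces, answer = split_agent_events(fake_stream)

# Real call (requires AWS credentials and a deployed agent):
# response = boto3.client("bedrock-agent-runtime").invoke_agent(
#     agentId="AGENT_ID", agentAliasId="ALIAS_ID", sessionId="s-1",
#     inputText="What is the capital of France?", enableTrace=True)
# traces, answer = split_agent_events(response["completion"])
```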

Common Agent Issues

  • Agent loops: Agent calls tools repeatedly without progressing → improve action descriptions and add clearer stopping conditions
  • Wrong action called: Agent misidentifies the right tool → improve the Action Group's name and description in the OpenAPI schema
  • Knowledge Base returns irrelevant chunks: Poor chunking or embedding quality → re-index with a better chunking strategy or embedding model

Flashcards

Q: Which model evaluation metric detects hallucinations in a RAG application?

A: Groundedness: it measures whether the model's response is supported by the retrieved context. A low Groundedness score means the model is generating information not found in the knowledge base (hallucinating).

โ† Domain 4 ยท โ† Back to Overview

Happy Studying! ๐Ÿš€ โ€ข Privacy-friendly analytics โ€” no cookies, no personal data
Privacy Policy โ€ข AI Disclaimer โ€ข Report an issue