Domain 5: Testing, Validation, and Troubleshooting (11%)
Exam Tip
This domain is 11% of the exam. Know the five automated evaluation metrics cold, especially Groundedness (detects hallucinations in RAG). Know when human evaluation is required vs. automated, and always remember: CloudWatch = operational metrics, CloudTrail = audit trail.
5.1 Model Evaluation
What Is Model Evaluation in Bedrock?
Amazon Bedrock includes a built-in Model Evaluation feature that lets you evaluate and compare FM performance using standardized metrics, both automated and human-based.
Automated Evaluation Metrics
| Metric | What It Measures | RAG-Specific? |
|---|---|---|
| Groundedness | Is the response factually supported by the retrieved context? | Yes (detects hallucinations) |
| Relevance | Does the response actually answer the user's question? | No |
| Accuracy | Is the factual content of the response correct? | No |
| Fluency | Is the response natural, readable, and well-written? | No |
| Robustness | Does the model perform consistently across varied prompt phrasings? | No |
Key Distinction
Groundedness is the most RAG-specific metric: it checks whether the model's answer is supported by the retrieved documents, not invented. A low Groundedness score means the model is hallucinating (generating information not in the retrieved context).
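To build intuition for what a Groundedness check does, here is a toy word-overlap proxy: it scores the fraction of response sentences whose content words appear in the retrieved context. This is not Bedrock's actual scoring algorithm (Bedrock uses model-based judges); it is only a minimal sketch of the idea.

```python
import re

def naive_groundedness(response: str, context: str, threshold: float = 0.5) -> float:
    """Toy proxy: fraction of response sentences 'supported' by the context.

    A sentence counts as supported when at least `threshold` of its
    content words (length > 3) also occur in the retrieved context.
    """
    context_words = set(re.findall(r"[a-z']+", context.lower()))
    sentences = [s.strip() for s in re.split(r"[.!?]", response) if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = [w for w in re.findall(r"[a-z']+", sentence.lower()) if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap >= threshold:
            supported += 1
    return supported / len(sentences)

context = "Paris is the capital of France. It lies on the Seine river."
grounded = naive_groundedness("Paris is the capital of France", context)
hallucinated = naive_groundedness("Berlin is the capital of Germany", context)
print(grounded, hallucinated)  # high score = grounded, low score = hallucinating
```

The second response scores low because "Berlin" and "Germany" never appear in the retrieved context, which is exactly the signal a real Groundedness metric surfaces.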
Automated vs. Human Evaluation
| | Automated Evaluation | Human Evaluation |
|---|---|---|
| How | Algorithm-scored using metrics or automated judges | Real humans rate responses |
| Speed | Fast, scalable | Slow, expensive |
| Best for | Regression testing, large dataset comparisons | Subjective quality, tone, nuance |
Use human evaluation when:
- Evaluating subjective qualities (tone, brand voice, empathy)
- Validating safety content decisions that require human judgment
- Ground truth labels are unavailable or difficult to define algorithmically
5.2 Setting Up a Model Evaluation Job
Steps
- Select models: Choose one or more FMs or fine-tuned variants to compare
- Provide a dataset: Upload a prompt dataset to S3 (JSONL format)
- Select metrics: Choose which automated metrics to compute
- Run the job: Bedrock runs each prompt against the selected models and scores each response
- Review results: Compare metric scores across models in the Bedrock console
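The steps above can also be driven through the API. The sketch below assembles the request for boto3's `create_evaluation_job`; the nested field names follow that API's documented shape but should be verified against the current API reference, and the job name, role ARN, bucket names, and model ID are all placeholders.

```python
# Hedged sketch: parameters for a Bedrock automated evaluation job.
# All ARNs, bucket names, and the model ID are placeholders to adapt.
import json

params = {
    "jobName": "qa-eval-demo",  # placeholder
    "roleArn": "arn:aws:iam::123456789012:role/BedrockEvalRole",  # placeholder
    "evaluationConfig": {
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "QuestionAndAnswer",
                "dataset": {
                    "name": "qa-prompts",
                    "datasetLocation": {"s3Uri": "s3://my-eval-bucket/prompts.jsonl"},
                },
                "metricNames": ["Builtin.Accuracy", "Builtin.Robustness"],
            }]
        }
    },
    "inferenceConfig": {
        "models": [{"bedrockModel": {
            "modelIdentifier": "anthropic.claude-3-haiku-20240307-v1:0"  # placeholder
        }}]
    },
    "outputDataConfig": {"s3Uri": "s3://my-eval-bucket/results/"},
}

# With AWS credentials configured, the job would be started like this:
# import boto3
# bedrock = boto3.client("bedrock")
# response = bedrock.create_evaluation_job(**params)
print(json.dumps(params["evaluationConfig"], indent=2))
```

Results land in the `outputDataConfig` S3 prefix and are also browsable in the Bedrock console, as described in the last step above.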
Prompt Dataset Format
- Stored in Amazon S3 as a JSONL file
- Each line contains a prompt (and optionally a reference/expected answer for accuracy scoring)
- Example line:

```json
{"prompt": "What is the capital of France?", "referenceResponse": "Paris"}
```
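A minimal sketch of building such a dataset: each record is serialized as one JSON object per line (the field names follow the example above), and the S3 upload step is left as a comment since it needs credentials. The filename and bucket are placeholders.

```python
import json

# Prompt/reference pairs in the format shown above.
rows = [
    {"prompt": "What is the capital of France?", "referenceResponse": "Paris"},
    {"prompt": "What is 2 + 2?", "referenceResponse": "4"},
]

# JSONL: one standalone JSON object per line, no enclosing array.
with open("prompts.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Upload to S3 (requires AWS credentials; bucket name is a placeholder):
# import boto3
# boto3.client("s3").upload_file("prompts.jsonl", "my-eval-bucket", "prompts.jsonl")

# Sanity check: every line must parse on its own.
parsed = [json.loads(line) for line in open("prompts.jsonl").read().splitlines()]
print(len(parsed))
```

The per-line sanity check matters: a trailing comma or a wrapped JSON array is a common reason an evaluation job rejects the dataset.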
5.3 CloudWatch Monitoring
Key Bedrock CloudWatch Metrics
| Metric | Description |
|---|---|
| InvocationLatency | P50/P90/P99 latency for model calls (end-to-end) |
| InputTokenCount | Total input tokens consumed in the period |
| OutputTokenCount | Total output tokens generated in the period |
| ThrottledRequests | Count of requests throttled (on-demand TPS exceeded) |
| InvocationClientErrors | 4xx errors (bad request format, invalid model ID) |
| InvocationServerErrors | 5xx errors (Bedrock service-side failures) |
Setting Up CloudWatch Alarms
Use CloudWatch Alarms to proactively catch issues:
- ThrottledRequests > 0 → consider switching to Provisioned Throughput
- InvocationLatency > [threshold] → investigate prompt length or model choice
- InvocationServerErrors > 0 → investigate Bedrock service health or retry configuration
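The first alarm above can be sketched with CloudWatch's `put_metric_alarm`. The `AWS/Bedrock` namespace and `ModelId` dimension are real; the metric name follows the table above, and the model ID, SNS topic ARN, and threshold are placeholders to adapt.

```python
# Sketch: alarm whenever any request is throttled in a 5-minute window.
alarm_params = {
    "AlarmName": "bedrock-throttling",
    "Namespace": "AWS/Bedrock",
    "MetricName": "ThrottledRequests",  # metric name from the table above
    "Dimensions": [
        {"Name": "ModelId",
         "Value": "anthropic.claude-3-haiku-20240307-v1:0"}  # placeholder model
    ],
    "Statistic": "Sum",
    "Period": 300,              # 5-minute evaluation window
    "EvaluationPeriods": 1,
    "Threshold": 0,
    "ComparisonOperator": "GreaterThanThreshold",  # i.e. ThrottledRequests > 0
    "AlarmActions": [
        "arn:aws:sns:us-east-1:123456789012:ops-alerts"  # placeholder SNS topic
    ],
}

# With AWS credentials configured:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
print(alarm_params["AlarmName"])
```

`Sum` over a short period with `Threshold: 0` fires on the very first throttle, which is the signal to consider Provisioned Throughput.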
5.4 Troubleshooting Common Issues
| Problem | Likely Cause | Solution |
|---|---|---|
| ThrottlingException | Exceeded on-demand TPS limit | Implement exponential backoff; use Provisioned Throughput |
| High latency | Long prompts or large retrieved context | Reduce input tokens; use streaming for UX improvement |
| Poor RAG answer quality | Wrong chunks returned (low retrieval relevance) | Tune chunking strategy, adjust top-K, improve embedding model |
| Guardrail blocking valid content | Content filter sensitivity too high | Lower filter strength; review denied topics configuration |
| Hallucinated response | FM ignoring retrieved context | Strengthen system prompt to emphasize context use; reduce temperature; check Groundedness metric |
| Agent not calling the right action | Action Group schema unclear or description imprecise | Improve OpenAPI schema descriptions; ensure action names are unambiguous |
| High cost | Verbose system prompts or large top-K retrieval | Shorten system prompt; reduce top-K; set maxTokens; use a smaller model |
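The first remedy in the table, exponential backoff, can be sketched as below. The `flaky_invoke` stub simulates a model call that is throttled twice before succeeding; in a real client you would catch botocore's `ClientError` with error code `ThrottlingException` instead of the stand-in `RuntimeError`.

```python
import random
import time

def invoke_with_backoff(fn, max_retries=5, base_delay=0.01):
    """Retry fn with exponentially growing, jittered delays."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError:  # stand-in for ThrottlingException
            if attempt == max_retries - 1:
                raise  # exhausted retries: surface the error
            # Wait 1x, 2x, 4x, ... the base delay, with +/-50% jitter
            # so concurrent clients don't retry in lockstep.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.5)
            time.sleep(delay)

calls = {"n": 0}
def flaky_invoke():
    calls["n"] += 1
    if calls["n"] < 3:  # simulate throttling on the first two attempts
        raise RuntimeError("ThrottlingException")
    return "model response"

result = invoke_with_backoff(flaky_invoke)
print(result, calls["n"])
```

If throttling persists even with backoff, that is the cue from the table to move the workload to Provisioned Throughput.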
5.5 Debugging Bedrock Agents
Using the Orchestration Trace
Set enableTrace: true in the InvokeAgent request to inspect:
- The Agent's step-by-step reasoning
- Which Action Group it decided to call and why
- What Knowledge Base query it ran
- The final synthesis of results
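A small sketch of working with that trace output: the sample event below mimics the nested shape of InvokeAgent trace chunks (`orchestrationTrace` → `rationale`), but treat the exact key names as an assumption and check them against a real trace payload from your agent.

```python
# Illustrative trace events: one reasoning step and one response chunk.
sample_events = [
    {"trace": {"trace": {"orchestrationTrace": {"rationale": {
        "text": "User asks for order status; call the lookup action."}}}}},
    {"chunk": {"bytes": b"Your order shipped yesterday."}},
]

def extract_rationales(events):
    """Collect the agent's step-by-step reasoning text from trace events."""
    rationales = []
    for event in events:
        orchestration = (
            event.get("trace", {}).get("trace", {}).get("orchestrationTrace", {})
        )
        text = orchestration.get("rationale", {}).get("text")
        if text:
            rationales.append(text)
    return rationales

print(extract_rationales(sample_events))
```

Logging these rationale strings per request is often enough to see why the agent picked (or skipped) a particular Action Group.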
Common Agent Issues
- Agent loops: the agent calls tools repeatedly without progressing → improve action descriptions and add clearer stopping conditions
- Wrong action called: the agent misidentifies the right tool → improve the Action Group's name and description in the OpenAPI schema
- Knowledge Base returns irrelevant chunks: poor chunking or embedding quality → re-index with a better chunking strategy or embedding model
Flashcards
Which model evaluation metric detects hallucinations in a RAG application?
Answer: Groundedness.