Domain 5: Testing, Validation, and Troubleshooting (11%)
Exam Tip
This domain is 11% of the exam. Know the five automated evaluation metrics cold, especially Groundedness (it detects hallucinations in RAG). Know when human evaluation is required vs. automated, and always remember: CloudWatch = operational metrics, CloudTrail = audit trail.
5.1 Model Evaluation
What Is Model Evaluation in Bedrock?
Amazon Bedrock includes a built-in Model Evaluation feature that lets you evaluate and compare FM performance using standardized metrics, both automated and human-based.
Automated Evaluation Metrics
| Metric | What It Measures | RAG-Specific? |
|---|---|---|
| Groundedness | Is the response factually supported by the retrieved context? | Yes (detects hallucinations) |
| Relevance | Does the response actually answer the user's question? | No |
| Accuracy | Is the factual content of the response correct? | No |
| Fluency | Is the response natural, readable, and well-written? | No |
| Robustness | Does the model perform consistently across varied prompt phrasings? | No |
Key Distinction
Groundedness is the most RAG-specific metric: it checks whether the model's answer is supported by the retrieved documents, not invented.
Hallucination Detection Techniques
Hallucination detection compares generated content against the retrieved source documents to verify that each claim is grounded in them:
| Technique | How It Works |
|---|---|
| Semantic similarity scoring | Embed both the generated claim and the source chunks; high cosine similarity means the claim is grounded |
| Fact verification | Use the FM itself to check: "Does this source document support this specific statement?" |
| Confidence scoring | Score the model's certainty for each generated statement; flag low-confidence claims for review |
| Guardrails Contextual Grounding Check | Bedrock-native: automatically blocks responses that are not supported by retrieved context |
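As a concrete illustration of semantic similarity scoring, here is a minimal sketch: embed the generated claim and each retrieved chunk, then treat the claim as grounded if any cosine similarity clears a threshold. The Titan model ID and the 0.75 cutoff are assumptions to tune, not prescribed values.

```python
import json

import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> np.ndarray:
    """Embed text with Titan Text Embeddings (model ID is an assumption)."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(resp["body"].read())["embedding"])

def is_grounded(claim: str, chunks: list[str], threshold: float = 0.75) -> bool:
    """Grounded = the claim is semantically close to at least one retrieved chunk."""
    claim_vec = embed(claim)
    for chunk in chunks:
        chunk_vec = embed(chunk)
        cosine = claim_vec @ chunk_vec / (np.linalg.norm(claim_vec) * np.linalg.norm(chunk_vec))
        if cosine >= threshold:  # illustrative cutoff -- tune on labeled examples
            return True
    return False
```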
Knowledge Base Citations
When using Amazon Bedrock Knowledge Bases, the retrieve-and-generate response includes citation metadata: references to the specific source documents and chunks that supported each part of the answer.
- Citations show which S3 object and which chunk backed each claim
- Enables downstream validation: compare the response against cited sources to verify grounding
- If a statement has no citation, it may be hallucinated
TIP
Knowledge Base citations are the native Bedrock mechanism for source attribution. Surfacing citations to end users also builds trust; they can click through to the original document.
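A minimal retrieve-and-generate sketch that surfaces those citations (Knowledge Base ID, model ARN, and query are placeholders for your own resources):

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.retrieve_and_generate(
    input={"text": "What is our refund policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
)

print(response["output"]["text"])

# Each citation ties a span of the answer back to the chunk(s) that supported it.
for citation in response["citations"]:
    for ref in citation["retrievedReferences"]:
        print(ref["location"]["s3Location"]["uri"])  # the S3 object behind the claim
```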
Automated vs. Human Evaluation
| | Automated Evaluation | Human Evaluation |
|---|---|---|
| How | Algorithm-scored using metrics or automated judges | Real humans rate responses |
| Speed | Fast, scalable | Slow, expensive |
| Best for | Regression testing, large dataset comparisons | Subjective quality, tone, nuance |
Use human evaluation when:
- Evaluating subjective qualities (tone, brand voice, empathy)
- Validating safety content decisions that require human judgment
- Ground truth labels are unavailable or difficult to define algorithmically
LLM-as-a-Judge
An LLM-as-a-Judge pattern uses one model to evaluate the quality of another model's response against a rubric.
Typical use:
- Score answers for relevance, helpfulness, or rubric compliance
- Compare multiple candidate responses at scale
- Support regression testing when manual review is too slow
WARNING
LLM-as-a-Judge is useful for scalable evaluation, but it is still an automated evaluation technique, not a substitute for human review in subjective or high-stakes cases.
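A minimal LLM-as-a-Judge sketch using the Bedrock Converse API; the judge model, rubric, and JSON output contract are illustrative choices rather than a prescribed setup:

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

RUBRIC = (
    "Rate the candidate answer from 1-5 for relevance to the question. "
    'Reply with JSON only: {"score": <int>, "reason": "<one sentence>"}'
)

def judge(question: str, answer: str) -> dict:
    """Ask a judge model to score another model's answer against the rubric."""
    resp = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # illustrative judge model
        system=[{"text": RUBRIC}],
        messages=[{
            "role": "user",
            "content": [{"text": f"Question: {question}\nCandidate answer: {answer}"}],
        }],
        inferenceConfig={"temperature": 0.0},  # deterministic judging
    )
    return json.loads(resp["output"]["message"]["content"][0]["text"])
```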
5.2 Setting Up a Model Evaluation Job
Steps
- Select models: Choose one or more foundation models or fine-tuned variants to compare
- Provide a dataset: Upload a prompt dataset to S3 (JSONL format)
- Select metrics: Choose which automated metrics to compute
- Run the job: Bedrock runs each prompt against the selected models and scores each response
- Review results: Compare metric scores across models in the Bedrock console
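The same job can be created programmatically with boto3's create_evaluation_job. The sketch below is written from memory of that API's shape, so treat the nested field and metric names as approximations to verify against the current API reference; all ARNs and S3 URIs are placeholders.

```python
import boto3

bedrock = boto3.client("bedrock")

bedrock.create_evaluation_job(
    jobName="compare-candidate-models",
    roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",  # placeholder
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "QuestionAndAnswer",
                "dataset": {
                    "name": "qa-prompts",
                    "datasetLocation": {"s3Uri": "s3://my-bucket/eval/prompts.jsonl"},
                },
                "metricNames": ["Builtin.Accuracy", "Builtin.Robustness"],
            }]
        }
    },
    inferenceConfig={
        "models": [{"bedrockModel": {"modelIdentifier": "anthropic.claude-3-haiku-20240307-v1:0"}}]
    },
    outputDataConfig={"s3Uri": "s3://my-bucket/eval/results/"},
)
```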
Prompt Dataset Format
- Stored in Amazon S3 as a JSONL file
- Each line contains a prompt (and optionally a reference/expected answer for accuracy scoring)
- Example line:
{"prompt": "What is the capital of France?", "referenceResponse": "Paris"}
Evaluation Workflow Patterns
When a company wants to replace a production model with a new one, the evaluation process should be treated as a gated workflow, not an ad hoc test run.
Typical sequence:
- Define evaluation metrics such as relevance, accuracy, fluency, and groundedness
- Create a test dataset with realistic scenarios, difficult prompts, and edge cases
- Run controlled comparisons between candidate models using the same dataset
- Apply quality gates so weak results do not progress to the next stage
- Analyze results and produce a decision report before promotion to production
Important exam ideas:
- Do not compare models before defining the metrics and dataset
- A high-quality evaluation dataset should include diverse scenarios and edge cases
- A/B testing is a valid comparison pattern when you need to compare a new model against an existing production model
- AWS Step Functions is a strong answer when the workflow requires sequential stages, approvals, branching, retries, and state tracking between evaluation steps
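To make the Step Functions point concrete, here is a minimal sketch of a gated evaluation workflow: a Task state runs the evaluation and a Choice state promotes the candidate only if its score clears a threshold. The Lambda ARN, role ARN, score field, and 0.8 gate are all hypothetical.

```python
import json

import boto3

definition = {
    "StartAt": "RunEvaluation",
    "States": {
        "RunEvaluation": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:run-eval-job",
            "Next": "QualityGate",
        },
        "QualityGate": {
            "Type": "Choice",
            "Choices": [{
                "Variable": "$.groundednessScore",  # hypothetical output field
                "NumericGreaterThanEquals": 0.8,    # quality gate threshold
                "Next": "PromoteCandidate",
            }],
            "Default": "RejectCandidate",
        },
        "PromoteCandidate": {"Type": "Succeed"},
        "RejectCandidate": {"Type": "Fail", "Error": "QualityGateFailed"},
    },
}

boto3.client("stepfunctions").create_state_machine(
    name="model-evaluation-gate",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsEvalRole",  # placeholder
)
```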
TIP
If the question emphasizes sequential validation, approval gates, or promotion only after passing review, think in terms of a workflow orchestration pattern rather than a single evaluation job.
5.3 CloudWatch Monitoring
Key Bedrock CloudWatch Metrics
| Metric | Description |
|---|---|
| InvocationLatency | P50/P90/P99 latency for model calls (end-to-end) |
| InputTokenCount | Total input tokens consumed in the period |
| OutputTokenCount | Total output tokens generated in the period |
| ThrottledRequests | Count of requests throttled |
| InvocationClientErrors | 4xx errors |
| InvocationServerErrors | 5xx errors |
Setting Up CloudWatch Alarms
Use CloudWatch Alarms to proactively catch issues:
- ThrottledRequests > 0 → Consider switching to Provisioned Throughput
- InvocationLatency > [threshold] → Investigate prompt length or model choice
- InvocationServerErrors > 0 → Investigate Bedrock service health or retry configuration
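A sketch of the first alarm via boto3; the SNS topic and model ID are placeholders, and the exact metric and dimension names should be confirmed in the AWS/Bedrock namespace for your account:

```python
import boto3

boto3.client("cloudwatch").put_metric_alarm(
    AlarmName="bedrock-throttled-requests",
    Namespace="AWS/Bedrock",
    MetricName="ThrottledRequests",  # verify the exact metric name in your account
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],
    Statistic="Sum",
    Period=300,  # 5-minute window
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",  # fires on any throttling
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
)
```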
5.4 Troubleshooting Common Issues
| Problem | Likely Cause | Solution |
|---|---|---|
| ThrottlingException | Exceeded on-demand TPS limit | Implement exponential backoff; use Provisioned Throughput |
| High latency | Long prompts or large retrieved context | Reduce input tokens; use streaming for UX improvement |
| Poor RAG answer quality | Wrong chunks returned | Tune chunking strategy, adjust top-K, improve embedding model |
| Guardrail blocking valid content | Filter sensitivity too high | Lower filter strength; review denied topics configuration |
| Hallucinated response | FM ignoring retrieved context | Strengthen system prompt; reduce temperature; check Groundedness |
| Agent not calling the right action | Schema unclear | Improve OpenAPI descriptions |
| High cost | Verbose prompts or large top-K retrieval | Shorten prompt; reduce top-K; set maxTokens; use a smaller model |
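For the first row, "implement exponential backoff" typically looks like the sketch below: catch ThrottlingException and retry with exponentially growing, jittered delays. The retry count and delays are illustrative, and botocore's built-in retry modes are a simpler alternative.

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

bedrock = boto3.client("bedrock-runtime")

def invoke_with_backoff(max_attempts: int = 5, **kwargs):
    """Retry throttled invocations with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return bedrock.invoke_model(**kwargs)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise  # only retry throttling; surface everything else
            if attempt == max_attempts - 1:
                raise  # out of retries
            time.sleep((2 ** attempt) + random.random())  # 1s, 2s, 4s, ... + jitter
```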
5.5 Debugging Bedrock Agents
Using the Orchestration Trace
Enable enableTrace: true in InvokeAgent to inspect:
- The Agent's step-by-step reasoning
- Which Action Group it decided to call and why
- What Knowledge Base query it ran
- The final synthesis of results
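A minimal InvokeAgent sketch with tracing enabled (agent, alias, and session IDs are placeholders). The response is an event stream that interleaves trace events with answer chunks:

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.invoke_agent(
    agentId="AGENT123456",       # placeholder
    agentAliasId="ALIAS123456",  # placeholder
    sessionId="debug-session-1",
    inputText="Open a support ticket for order 42.",
    enableTrace=True,
)

# Trace events expose the agent's reasoning, action-group calls, and KB queries.
for event in response["completion"]:
    if "trace" in event:
        print(event["trace"]["trace"])
    elif "chunk" in event:
        print(event["chunk"]["bytes"].decode("utf-8"))
```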
AWS X-Ray for Distributed Tracing
Use AWS X-Ray when the problem is tracing a request across multiple services in a GenAI application stack.
Good fit:
- API Gateway → Lambda → Bedrock
- Lambda → external APIs → Bedrock
- Multi-service workflows where latency attribution matters
How to think about it:
- CloudWatch = metrics, logs, alarms
- X-Ray = end-to-end request tracing across services
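A minimal instrumentation sketch using the aws-xray-sdk package, assuming the code runs where an X-Ray segment is already active (for example, a Lambda function with active tracing enabled):

```python
from aws_xray_sdk.core import patch_all, xray_recorder

patch_all()  # patch botocore so downstream Bedrock calls appear as subsegments

import boto3  # imported after patching so new clients are instrumented

bedrock = boto3.client("bedrock-runtime")

# Optional: a custom subsegment to attribute latency to your own logic.
with xray_recorder.in_subsegment("build-prompt"):
    prompt = "Summarize the incident report."
```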
Common Agent Issues
- Agent loops: improve action descriptions and stopping conditions
- Wrong action called: improve Action Group names and descriptions
- Knowledge Base returns irrelevant chunks: re-index with better chunking or embeddings