Domain 5: Testing, Validation, and Troubleshooting (11%)
Exam Tip
This domain is 11% of the exam. Know the five automated evaluation metrics cold, especially Groundedness (it detects hallucinations in RAG). Know when human evaluation is required vs. automated, and always remember: CloudWatch = operational metrics, CloudTrail = audit trail.
5.1 Model Evaluation
What Is Model Evaluation in Bedrock?
Amazon Bedrock includes a built-in Model Evaluation feature that lets you evaluate and compare FM performance using standardized metrics, both automated and human-based.
Automated Evaluation Metrics
| Metric | What It Measures | RAG-Specific? |
|---|---|---|
| Groundedness | Is the response factually supported by the retrieved context? | Yes (detects hallucinations) |
| Relevance | Does the response actually answer the user's question? | No |
| Accuracy | Is the factual content of the response correct? | No |
| Fluency | Is the response natural, readable, and well-written? | No |
| Robustness | Does the model perform consistently across varied prompt phrasings? | No |
Key Distinction
Groundedness is the most RAG-specific metric: it checks whether the model's answer is supported by the retrieved documents, not invented.
Hallucination Detection Techniques
Hallucination detection compares generated content against the retrieved source documents to verify that each claim is grounded in them:
| Technique | How It Works |
|---|---|
| Semantic similarity scoring | Embed both the generated claim and the source chunks; high cosine similarity means the claim is grounded |
| Fact verification | Use the FM itself to check: "Does this source document support this specific statement?" |
| Confidence scoring | Score the model's certainty for each generated statement; flag low-confidence claims for review |
| Guardrails Contextual Grounding Check | Bedrock-native: automatically blocks responses that are not supported by retrieved context |
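As a concrete illustration of semantic similarity scoring, here is a minimal sketch: embed the generated claim and each retrieved chunk, then treat the claim as grounded if any cosine similarity clears a threshold. The Titan model ID and the 0.75 cutoff are assumptions to tune, not prescribed values.

```python
import json

import boto3
import numpy as np

bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> np.ndarray:
    """Embed text with Titan Text Embeddings (model ID is an assumption)."""
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": text}),
    )
    return np.array(json.loads(resp["body"].read())["embedding"])

def is_grounded(claim: str, chunks: list[str], threshold: float = 0.75) -> bool:
    """Grounded = the claim is semantically close to at least one retrieved chunk."""
    claim_vec = embed(claim)
    for chunk in chunks:
        chunk_vec = embed(chunk)
        cosine = claim_vec @ chunk_vec / (np.linalg.norm(claim_vec) * np.linalg.norm(chunk_vec))
        if cosine >= threshold:  # illustrative cutoff -- tune on labeled examples
            return True
    return False
```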
Knowledge Base Citations
When using Amazon Bedrock Knowledge Bases, the retrieve-and-generate response includes citation metadata: references to the specific source documents and chunks that supported each part of the answer.
- Citations show which S3 object and which chunk backed each claim
- Enables downstream validation: compare the response against cited sources to verify grounding
- If a statement has no citation, it may be hallucinated
TIP
Knowledge Base citations are the native Bedrock mechanism for source attribution. Surfacing citations to end users also builds trust; they can click through to the original document.
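A minimal retrieve-and-generate sketch that surfaces those citations (Knowledge Base ID, model ARN, and query are placeholders for your own resources):

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.retrieve_and_generate(
    input={"text": "What is our refund policy?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB1234567890",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0",
        },
    },
)

print(response["output"]["text"])

# Each citation ties a span of the answer back to the chunk(s) that supported it.
for citation in response["citations"]:
    for ref in citation["retrievedReferences"]:
        print(ref["location"]["s3Location"]["uri"])  # the S3 object behind the claim
```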
Automated vs. Human Evaluation
| | Automated Evaluation | Human Evaluation |
|---|---|---|
| How | Algorithm-scored using metrics or automated judges | Real humans rate responses |
| Speed | Fast, scalable | Slow, expensive |
| Best for | Regression testing, large dataset comparisons | Subjective quality, tone, nuance |
Use human evaluation when:
- Evaluating subjective qualities (tone, brand voice, empathy)
- Validating safety content decisions that require human judgment
- Ground truth labels are unavailable or difficult to define algorithmically
LLM-as-a-Judge
An LLM-as-a-Judge pattern uses one model to evaluate the quality of another model's response against a rubric.
Typical use:
- Score answers for relevance, helpfulness, or rubric compliance
- Compare multiple candidate responses at scale
- Support regression testing when manual review is too slow
WARNING
LLM-as-a-Judge is useful for scalable evaluation, but it is still an automated evaluation technique, not a substitute for human review in subjective or high-stakes cases.
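A minimal LLM-as-a-Judge sketch using the Bedrock Converse API; the judge model, rubric, and JSON output contract are illustrative choices rather than a prescribed setup:

```python
import json

import boto3

bedrock = boto3.client("bedrock-runtime")

RUBRIC = (
    "Rate the candidate answer from 1-5 for relevance to the question. "
    'Reply with JSON only: {"score": <int>, "reason": "<one sentence>"}'
)

def judge(question: str, answer: str) -> dict:
    """Ask a judge model to score another model's answer against the rubric."""
    resp = bedrock.converse(
        modelId="anthropic.claude-3-sonnet-20240229-v1:0",  # illustrative judge model
        system=[{"text": RUBRIC}],
        messages=[{
            "role": "user",
            "content": [{"text": f"Question: {question}\nCandidate answer: {answer}"}],
        }],
        inferenceConfig={"temperature": 0.0},  # deterministic judging
    )
    return json.loads(resp["output"]["message"]["content"][0]["text"])
```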
5.2 Setting Up a Model Evaluation Job
Steps
- Select models: Choose one or more foundation models or fine-tuned variants to compare
- Provide a dataset: Upload a prompt dataset to S3 (JSONL format)
- Select metrics: Choose which automated metrics to compute
- Run the job: Bedrock runs each prompt against the selected models and scores each response
- Review results: Compare metric scores across models in the Bedrock console
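The same job can be created programmatically with boto3's create_evaluation_job. The sketch below is written from memory of that API's shape, so treat the nested field and metric names as approximations to verify against the current API reference; all ARNs and S3 URIs are placeholders.

```python
import boto3

bedrock = boto3.client("bedrock")

bedrock.create_evaluation_job(
    jobName="compare-candidate-models",
    roleArn="arn:aws:iam::123456789012:role/BedrockEvalRole",  # placeholder
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "QuestionAndAnswer",
                "dataset": {
                    "name": "qa-prompts",
                    "datasetLocation": {"s3Uri": "s3://my-bucket/eval/prompts.jsonl"},
                },
                "metricNames": ["Builtin.Accuracy", "Builtin.Robustness"],
            }]
        }
    },
    inferenceConfig={
        "models": [{"bedrockModel": {"modelIdentifier": "anthropic.claude-3-haiku-20240307-v1:0"}}]
    },
    outputDataConfig={"s3Uri": "s3://my-bucket/eval/results/"},
)
```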
Prompt Dataset Format
- Stored in Amazon S3 as a JSONL file
- Each line contains a prompt (and optionally a reference/expected answer for accuracy scoring)
- Example line:
{"prompt": "What is the capital of France?", "referenceResponse": "Paris"}
Evaluation Workflow Patterns
When a company wants to replace a production model with a new one, the evaluation process should be treated as a gated workflow, not an ad hoc test run.
Typical sequence:
- Define evaluation metrics such as relevance, accuracy, fluency, and groundedness
- Create a test dataset with realistic scenarios, difficult prompts, and edge cases
- Run controlled comparisons between candidate models using the same dataset
- Apply quality gates so weak results do not progress to the next stage
- Analyze results and produce a decision report before promotion to production
Important exam ideas:
- Do not compare models before defining the metrics and dataset
- A high-quality evaluation dataset should include diverse scenarios and edge cases
- A/B testing is a valid comparison pattern when you need to compare a new model against an existing production model
- AWS Step Functions is a strong answer when the workflow requires sequential stages, approvals, branching, retries, and state tracking between evaluation steps
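To make the Step Functions point concrete, here is a minimal sketch of a gated evaluation workflow: a Task state runs the evaluation and a Choice state promotes the candidate only if its score clears a threshold. The Lambda ARN, role ARN, score field, and 0.8 gate are all hypothetical.

```python
import json

import boto3

definition = {
    "StartAt": "RunEvaluation",
    "States": {
        "RunEvaluation": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:run-eval-job",
            "Next": "QualityGate",
        },
        "QualityGate": {
            "Type": "Choice",
            "Choices": [{
                "Variable": "$.groundednessScore",  # hypothetical output field
                "NumericGreaterThanEquals": 0.8,    # quality gate threshold
                "Next": "PromoteCandidate",
            }],
            "Default": "RejectCandidate",
        },
        "PromoteCandidate": {"Type": "Succeed"},
        "RejectCandidate": {"Type": "Fail", "Error": "QualityGateFailed"},
    },
}

boto3.client("stepfunctions").create_state_machine(
    name="model-evaluation-gate",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctionsEvalRole",  # placeholder
)
```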
TIP
If the question emphasizes sequential validation, approval gates, or promotion only after passing review, think in terms of a workflow orchestration pattern rather than a single evaluation job.
5.3 CloudWatch Monitoring
Key Bedrock CloudWatch Metrics
| Metric | Description |
|---|---|
| InvocationLatency | P50/P90/P99 latency for model calls (end-to-end) |
| InputTokenCount | Total input tokens consumed in the period |
| OutputTokenCount | Total output tokens generated in the period |
| ThrottledRequests | Count of requests throttled |
| InvocationClientErrors | 4xx errors |
| InvocationServerErrors | 5xx errors |
Setting Up CloudWatch Alarms
Use CloudWatch Alarms to proactively catch issues:
- ThrottledRequests > 0 → Consider switching to Provisioned Throughput
- InvocationLatency > [threshold] → Investigate prompt length or model choice
- InvocationServerErrors > 0 → Investigate Bedrock service health or retry configuration
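A sketch of the first alarm via boto3; the SNS topic and model ID are placeholders, and the exact metric and dimension names should be confirmed in the AWS/Bedrock namespace for your account:

```python
import boto3

boto3.client("cloudwatch").put_metric_alarm(
    AlarmName="bedrock-throttled-requests",
    Namespace="AWS/Bedrock",
    MetricName="ThrottledRequests",  # verify the exact metric name in your account
    Dimensions=[{"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}],
    Statistic="Sum",
    Period=300,  # 5-minute window
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",  # fires on any throttling
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder
)
```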
5.4 Troubleshooting Common Issues
| Problem | Likely Cause | Solution |
|---|---|---|
| ThrottlingException | Exceeded on-demand TPS limit | Implement exponential backoff; use Provisioned Throughput |
| High latency | Long prompts or large retrieved context | Reduce input tokens; use streaming for UX improvement |
| Poor RAG answer quality | Wrong chunks returned | Tune chunking strategy, adjust top-K, improve embedding model |
| Guardrail blocking valid content | Filter sensitivity too high | Lower filter strength; review denied topics configuration |
| Hallucinated response | FM ignoring retrieved context | Strengthen system prompt; reduce temperature; check Groundedness |
| Agent not calling the right action | Schema unclear | Improve OpenAPI descriptions |
| High cost | Verbose prompts or large top-K retrieval | Shorten prompt; reduce top-K; set maxTokens; use a smaller model |
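For the first row, "implement exponential backoff" typically looks like the sketch below: catch ThrottlingException and retry with exponentially growing, jittered delays. The retry count and delays are illustrative, and botocore's built-in retry modes are a simpler alternative.

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

bedrock = boto3.client("bedrock-runtime")

def invoke_with_backoff(max_attempts: int = 5, **kwargs):
    """Retry throttled invocations with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return bedrock.invoke_model(**kwargs)
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise  # only retry throttling; surface everything else
            if attempt == max_attempts - 1:
                raise  # out of retries
            time.sleep((2 ** attempt) + random.random())  # 1s, 2s, 4s, ... + jitter
```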
5.5 Debugging Bedrock Agents
Using the Orchestration Trace
Enable enableTrace: true in InvokeAgent to inspect:
- The Agent's step-by-step reasoning
- Which Action Group it decided to call and why
- What Knowledge Base query it ran
- The final synthesis of results
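A minimal InvokeAgent sketch with tracing enabled (agent, alias, and session IDs are placeholders). The response is an event stream that interleaves trace events with answer chunks:

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

response = agent_runtime.invoke_agent(
    agentId="AGENT123456",       # placeholder
    agentAliasId="ALIAS123456",  # placeholder
    sessionId="debug-session-1",
    inputText="Open a support ticket for order 42.",
    enableTrace=True,
)

# Trace events expose the agent's reasoning, action-group calls, and KB queries.
for event in response["completion"]:
    if "trace" in event:
        print(event["trace"]["trace"])
    elif "chunk" in event:
        print(event["chunk"]["bytes"].decode("utf-8"))
```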
AWS X-Ray for Distributed Tracing
Use AWS X-Ray when the problem is tracing a request across multiple services in a GenAI application stack.
Good fit:
- API Gateway → Lambda → Bedrock
- Lambda → external APIs → Bedrock
- Multi-service workflows where latency attribution matters
How to think about it:
- CloudWatch = metrics, logs, alarms
- X-Ray = end-to-end request tracing across services
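A minimal instrumentation sketch using the aws-xray-sdk package, assuming the code runs where an X-Ray segment is already active (for example, a Lambda function with active tracing enabled):

```python
from aws_xray_sdk.core import patch_all, xray_recorder

patch_all()  # patch botocore so downstream Bedrock calls appear as subsegments

import boto3  # imported after patching so new clients are instrumented

bedrock = boto3.client("bedrock-runtime")

# Optional: a custom subsegment to attribute latency to your own logic.
with xray_recorder.in_subsegment("build-prompt"):
    prompt = "Summarize the incident report."
```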
Common Agent Issues
- Agent loops: improve action descriptions and stopping conditions
- Wrong action called: improve Action Group names and descriptions
- Knowledge Base returns irrelevant chunks: re-index with better chunking or embeddings