
Domain 5: Testing, Validation, and Troubleshooting (11%)

โ† Domain 4 ยท โ† Back to Overview

Exam Tip

This domain is 11% of the exam. Know the automated evaluation metrics cold, especially Groundedness (which detects hallucinations in RAG). Know when human evaluation is required versus automated, and always remember: CloudWatch = operational metrics, CloudTrail = audit trail.


5.1 Model Evaluation

What Is Model Evaluation in Bedrock?

Amazon Bedrock includes a built-in Model Evaluation feature that lets you evaluate and compare foundation model (FM) performance using standardized metrics, both automated and human-based.

Automated Evaluation Metrics

| Metric | What It Measures | RAG-Specific? |
| --- | --- | --- |
| Groundedness | Is the response factually supported by the retrieved context? | Yes (detects hallucinations) |
| Relevance | Does the response actually answer the user's question? | No |
| Accuracy | Is the factual content of the response correct? | No |
| Fluency | Is the response natural, readable, and well-written? | No |
| Robustness | Does the model perform consistently across varied prompt phrasings? | No |

Key Distinction

Groundedness is the most RAG-specific metric: it checks whether the model's answer is supported by the retrieved documents, not invented.

Hallucination Detection Techniques

Hallucination detection compares generated content against retrieved source documents to verify claims are grounded in source material:

| Technique | How It Works |
| --- | --- |
| Semantic similarity scoring | Embed both the generated claim and the source chunks; high cosine similarity = grounded |
| Fact verification | Use the FM itself to check: "Does this source document support this specific statement?" |
| Confidence scoring | Score the model's certainty for each generated statement; flag low-confidence claims for review |
| Guardrails Contextual Grounding Check | Bedrock-native; automatically blocks responses that are not supported by retrieved context |
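The semantic similarity technique from the table can be sketched in a few lines. The vectors below are toy stand-ins: a real pipeline would embed the claim and the source chunks with an embedding model (for example, Amazon Titan Embeddings), and the 0.8 threshold is an assumed value you would tune.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_grounded(claim_vec, source_vecs, threshold=0.8):
    """A claim counts as grounded if it is close to at least one source chunk."""
    return max(cosine_similarity(claim_vec, v) for v in source_vecs) >= threshold

# Toy vectors standing in for real embeddings
claim = [0.9, 0.1, 0.0]
sources = [[0.88, 0.12, 0.01], [0.0, 0.0, 1.0]]
print(is_grounded(claim, sources))  # True: the first chunk supports the claim
```

In production you would run this per generated claim rather than per response, so a single ungrounded sentence can be flagged even when the rest of the answer is well supported.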

Knowledge Base Citations

When using Amazon Bedrock Knowledge Bases, the retrieve-and-generate response includes citation metadata: references to the specific source documents and chunks that supported each part of the answer.

  • Citations show which S3 object and which chunk backed each claim
  • Enables downstream validation: compare the response against cited sources to verify grounding
  • If a statement has no citation, it may be hallucinated

TIP

Knowledge Base citations are the native Bedrock mechanism for source attribution. Surfacing citations to end users also builds trust: they can click through to the original document.
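For downstream validation, the citation metadata can be walked programmatically. The response dict below only illustrates the shape of a RetrieveAndGenerate result (the exact field nesting should be verified against the current API reference); the helper pulls out the S3 URIs that back each cited span.

```python
# Illustrative response shaped like a Knowledge Bases RetrieveAndGenerate result;
# the real object would come from bedrock-agent-runtime's retrieve_and_generate().
response = {
    "output": {"text": "Paris is the capital of France."},
    "citations": [
        {
            "generatedResponsePart": {
                "textResponsePart": {"text": "Paris is the capital of France."}
            },
            "retrievedReferences": [
                {
                    "content": {"text": "France's capital city is Paris."},
                    "location": {"s3Location": {"uri": "s3://my-kb-bucket/geo/france.txt"}},
                }
            ],
        }
    ],
}

def extract_sources(resp):
    """Collect the S3 URIs that back each cited part of the answer."""
    uris = []
    for citation in resp.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            uris.append(ref["location"]["s3Location"]["uri"])
    return uris

print(extract_sources(response))  # ['s3://my-kb-bucket/geo/france.txt']
```

A response part with an empty `retrievedReferences` list is exactly the "statement with no citation" case above, and is worth flagging for review.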

Automated vs. Human Evaluation

|  | Automated Evaluation | Human Evaluation |
| --- | --- | --- |
| How | Algorithm-scored using metrics or automated judges | Real humans rate responses |
| Speed | Fast, scalable | Slow, expensive |
| Best for | Regression testing, large dataset comparisons | Subjective quality, tone, nuance |

Use human evaluation when:

  • Evaluating subjective qualities (tone, brand voice, empathy)
  • Validating safety content decisions that require human judgment
  • Ground truth labels are unavailable or difficult to define algorithmically

LLM-as-a-Judge

An LLM-as-a-Judge pattern uses one model to evaluate the quality of another model's response against a rubric.

Typical use:

  • Score answers for relevance, helpfulness, or rubric compliance
  • Compare multiple candidate responses at scale
  • Support regression testing when manual review is too slow

WARNING

LLM-as-a-Judge is useful for scalable evaluation, but it is still an automated evaluation technique, not a substitute for human review in subjective or high-stakes cases.
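A minimal judge harness might look like the sketch below, assuming a simple 1-5 rubric. The actual judge-model invocation is omitted; the prompt template and the `SCORE:` output convention are illustrative choices for this sketch, not a Bedrock API.

```python
import re

# Hypothetical rubric template; a real system would tune this wording carefully.
RUBRIC = """Rate the ANSWER to the QUESTION on a 1-5 scale for relevance.
Respond with a single line: SCORE: <n>.

QUESTION: {question}
ANSWER: {answer}"""

def build_judge_prompt(question, answer):
    """Prompt to send to the judge model (e.g. via bedrock-runtime invoke_model)."""
    return RUBRIC.format(question=question, answer=answer)

def parse_score(judge_output):
    """Pull the numeric score out of the judge model's reply, or None if absent."""
    match = re.search(r"SCORE:\s*([1-5])", judge_output)
    return int(match.group(1)) if match else None

prompt = build_judge_prompt("What is the capital of France?", "Paris.")
print(parse_score("SCORE: 5"))  # 5
```

Parsing defensively (returning `None` on a malformed reply) matters here, because the judge is itself an LLM and will occasionally ignore the output format.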


5.2 Setting Up a Model Evaluation Job

Steps

  1. Select models: Choose one or more foundation models or fine-tuned variants to compare
  2. Provide a dataset: Upload a prompt dataset to S3 (JSONL format)
  3. Select metrics: Choose which automated metrics to compute
  4. Run the job: Bedrock runs each prompt against the selected models and scores each response
  5. Review results: Compare metric scores across models in the Bedrock console
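The steps above map onto a CreateEvaluationJob request roughly like this sketch. Field names approximate the Bedrock control-plane API, and the job name, ARNs, bucket names, and model ID are all placeholders; verify the exact request shape against the current boto3 reference before use.

```python
# Sketch of a CreateEvaluationJob request body; the call would be
# boto3.client("bedrock").create_evaluation_job(**request).
# All names and ARNs below are placeholders.
request = {
    "jobName": "compare-candidate-models",
    "roleArn": "arn:aws:iam::123456789012:role/BedrockEvalRole",
    "evaluationConfig": {
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "QuestionAndAnswer",
                    "dataset": {
                        "name": "my-prompts",
                        "datasetLocation": {"s3Uri": "s3://my-bucket/prompts.jsonl"},
                    },
                    "metricNames": ["Builtin.Accuracy", "Builtin.Robustness"],
                }
            ]
        }
    },
    "inferenceConfig": {
        "models": [
            {"bedrockModel": {"modelIdentifier": "anthropic.claude-3-haiku-20240307-v1:0"}}
        ]
    },
    "outputDataConfig": {"s3Uri": "s3://my-bucket/eval-results/"},
}
print(sorted(request.keys()))
```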

Prompt Dataset Format

  • Stored in Amazon S3 as a JSONL file
  • Each line contains a prompt (and optionally a reference/expected answer for accuracy scoring)
  • Example line: {"prompt": "What is the capital of France?", "referenceResponse": "Paris"}
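A quick way to produce such a dataset locally before uploading it to S3 (the file path and rows here are illustrative):

```python
import json
import os
import tempfile

# Tiny evaluation dataset in the JSONL layout described above;
# in practice you would upload the resulting file to S3.
rows = [
    {"prompt": "What is the capital of France?", "referenceResponse": "Paris"},
    {"prompt": "What is 2 + 2?", "referenceResponse": "4"},
]

path = os.path.join(tempfile.gettempdir(), "prompts.jsonl")
with open(path, "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read it back: one JSON object per line
with open(path) as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["referenceResponse"])  # Paris
```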

Evaluation Workflow Patterns

When a company wants to replace a production model with a new one, the evaluation process should be treated as a gated workflow, not an ad hoc test run.

Typical sequence:

  1. Define evaluation metrics such as relevance, accuracy, fluency, and groundedness
  2. Create a test dataset with realistic scenarios, difficult prompts, and edge cases
  3. Run controlled comparisons between candidate models using the same dataset
  4. Apply quality gates so weak results do not progress to the next stage
  5. Analyze results and produce a decision report before promotion to production

Important exam ideas:

  • Do not compare models before defining the metrics and dataset
  • A high-quality evaluation dataset should include diverse scenarios and edge cases
  • A/B testing is a valid comparison pattern when you need to compare a new model against an existing production model
  • AWS Step Functions is a strong answer when the workflow requires sequential stages, approvals, branching, retries, and state tracking between evaluation steps

TIP

If the question emphasizes sequential validation, approval gates, or promotion only after passing review, think in terms of a workflow orchestration pattern rather than a single evaluation job.
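As a sketch of that orchestration pattern, here is a minimal Amazon States Language definition with an explicit quality gate. The state names, Lambda ARN, and 0.9 accuracy threshold are all hypothetical.

```python
import json

# Minimal gated-workflow state machine: run the evaluation, check a
# quality gate, and only then promote. ARNs and thresholds are placeholders.
definition = {
    "StartAt": "RunEvaluation",
    "States": {
        "RunEvaluation": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:run-eval",
            "Next": "QualityGate",
        },
        "QualityGate": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.accuracy",
                    "NumericGreaterThanEquals": 0.9,
                    "Next": "PromoteModel",
                }
            ],
            "Default": "RejectModel",
        },
        "PromoteModel": {"Type": "Succeed"},
        "RejectModel": {"Type": "Fail", "Error": "QualityGateFailed"},
    },
}
print(json.dumps(definition, indent=2)[:60])
```

The `Choice` state is the quality gate: weak results fall through to `RejectModel` and never reach promotion, which is exactly the "gated workflow" idea the exam is testing.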


5.3 CloudWatch Monitoring

Key Bedrock CloudWatch Metrics

| Metric | Description |
| --- | --- |
| InvocationLatency | End-to-end latency for model calls (view P50/P90/P99 statistics) |
| InputTokenCount | Total input tokens consumed in the period |
| OutputTokenCount | Total output tokens generated in the period |
| InvocationThrottles | Count of requests throttled |
| InvocationClientErrors | 4xx errors |
| InvocationServerErrors | 5xx errors |

Setting Up CloudWatch Alarms

Use CloudWatch Alarms to proactively catch issues:

  • InvocationThrottles > 0 → Consider switching to Provisioned Throughput
  • InvocationLatency > [threshold] → Investigate prompt length or model choice
  • InvocationServerErrors > 0 → Investigate Bedrock service health or retry configuration
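The first alarm above could be configured roughly as follows. The parameters feed CloudWatch's `put_metric_alarm`; the alarm name, model ID, and SNS topic ARN are placeholders, and the metric and dimension names should be confirmed against the current Bedrock documentation.

```python
# Parameters for an alarm on Bedrock throttling; the call would be
# boto3.client("cloudwatch").put_metric_alarm(**alarm).
# Alarm name, model ID, and SNS ARN are placeholders.
alarm = {
    "AlarmName": "bedrock-invocation-throttles",
    "Namespace": "AWS/Bedrock",
    "MetricName": "InvocationThrottles",
    "Dimensions": [
        {"Name": "ModelId", "Value": "anthropic.claude-3-haiku-20240307-v1:0"}
    ],
    "Statistic": "Sum",
    "Period": 300,                                 # 5-minute windows
    "EvaluationPeriods": 1,
    "Threshold": 0,
    "ComparisonOperator": "GreaterThanThreshold",  # any throttle trips the alarm
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
}
print(alarm["MetricName"])
```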

5.4 Troubleshooting Common Issues

| Problem | Likely Cause | Solution |
| --- | --- | --- |
| ThrottlingException | Exceeded on-demand TPS limit | Implement exponential backoff; use Provisioned Throughput |
| High latency | Long prompts or large retrieved context | Reduce input tokens; use streaming for UX improvement |
| Poor RAG answer quality | Wrong chunks returned | Tune chunking strategy, adjust top-K, improve embedding model |
| Guardrail blocking valid content | Filter sensitivity too high | Lower filter strength; review denied-topics configuration |
| Hallucinated response | FM ignoring retrieved context | Strengthen system prompt; reduce temperature; check Groundedness |
| Agent not calling the right action | Schema unclear | Improve OpenAPI descriptions |
| High cost | Verbose prompts or large top-K retrieval | Shorten prompt; reduce top-K; set maxTokens; use a smaller model |
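The exponential-backoff fix for ThrottlingException can be sketched generically. Here `RuntimeError` stands in for botocore's throttling error so the sketch stays self-contained; a real wrapper would catch the botocore exception around `invoke_model`.

```python
import random
import time

def invoke_with_backoff(call, max_retries=5, base_delay=0.5):
    """Retry a throttled call with exponential backoff and jitter.

    `call` is any zero-argument function; a real wrapper would invoke
    bedrock-runtime's invoke_model and catch ThrottlingException.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except RuntimeError:  # stand-in for botocore's ThrottlingException
            if attempt == max_retries - 1:
                raise
            # Delay doubles each attempt; jitter avoids synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Simulate: fail twice with throttling, then succeed
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("ThrottlingException")
    return "ok"

print(invoke_with_backoff(flaky, base_delay=0.05))  # ok
```

Note that the AWS SDKs already retry throttled calls by default; application-level backoff like this is for when you need more attempts or custom delay behavior than the SDK config provides.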

5.5 Debugging Bedrock Agents

Using the Orchestration Trace

Set enableTrace: true in the InvokeAgent request to inspect:

  • The Agent's step-by-step reasoning
  • Which Action Group it decided to call and why
  • What Knowledge Base query it ran
  • The final synthesis of results
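Trace events arrive interleaved with answer chunks on the response's completion stream. The sample below only approximates their nesting (verify the exact structure against the API reference); the helper collects the agent's rationale text from the orchestration trace.

```python
# Illustrative shape of InvokeAgent streaming events when enableTrace is on;
# real events come from the response's completion EventStream. Field nesting
# here approximates the orchestration trace.
events = [
    {"trace": {"trace": {"orchestrationTrace": {
        "rationale": {"text": "The user asks for an order status; call the orders API."}
    }}}},
    {"chunk": {"bytes": b"Your order shipped yesterday."}},
]

def extract_rationales(stream):
    """Pull the agent's step-by-step reasoning out of trace events."""
    out = []
    for event in stream:
        trace = event.get("trace", {}).get("trace", {})
        rationale = trace.get("orchestrationTrace", {}).get("rationale")
        if rationale:
            out.append(rationale["text"])
    return out

print(extract_rationales(events))
```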

AWS X-Ray for Distributed Tracing

Use AWS X-Ray when the problem is tracing a request across multiple services in a GenAI application stack.

Good fit:

  • API Gateway → Lambda → Bedrock
  • Lambda → external APIs → Bedrock
  • Multi-service workflows where latency attribution matters

How to think about it:

  • CloudWatch = metrics, logs, alarms
  • X-Ray = end-to-end request tracing across services

Common Agent Issues

  • Agent loops: improve action descriptions and stopping conditions
  • Wrong action called: improve Action Group names and descriptions
  • Knowledge Base returns irrelevant chunks: re-index with better chunking or embeddings

โ† Domain 4 ยท โ† Back to Overview

Happy Studying! ๐Ÿš€ โ€ข Privacy-friendly analytics โ€” no cookies, no personal data
Privacy Policy โ€ข AI Disclaimer โ€ข Report an issue