AIP-C01 · 5-Minute Cheat Sheet

D1 · Amazon Bedrock Essentials

Bedrock — managed API to invoke 20+ FMs, no infrastructure to manage
FMs available: Anthropic Claude, Meta Llama, Cohere, Amazon Titan, Stability AI, Mistral
Inference params: temperature (creativity 0→1) · top-P/K (sampling) · max_tokens · stop_sequences
API calls: InvokeModel (sync) · InvokeModelWithResponseStream (streaming)
Pricing: On-demand = pay per token · Provisioned = reserved capacity, required for custom models · Batch = async, −50% cost
Embeddings: Amazon Titan Embeddings, Cohere Embed — convert text → vectors for RAG
Cross-region inference: route across AWS regions for high availability

D1 · RAG Pipeline

Docs / S3

source

→

Chunk

split

→

Embed

vectorise

→

Vector DB

store

→

Retrieve

kNN

→

Augment

prompt

→

generate

Knowledge Bases — fully managed RAG, zero code
Vector stores: OpenSearch Serverless (default) · Aurora pgvector · Pinecone · Redis
Chunking: fixed-size (simple) · semantic (quality) · hierarchical (docs)
Hybrid search = semantic + keyword (BM25) → better recall
Metadata filters narrow retrieved chunks before similarity search
OpenSearch Neural Plugin — semantic + BM25 hybrid in one query
Re-ranking — cross-encoder second pass for precision

D1 · Customisation Decision

Scenario	Use
Need fresh / dynamic data	RAG
Teach model a new output format or tone	Fine-tune
Domain-specific vocabulary adaptation	Cont. Pre-train
Quick, no training data available	RAG
Consistent JSON / structured output	Fine-tune
Latency-critical, no retrieval lag	Fine-tune

Fine-tune data format: JSONL — prompt / completion pairs
Custom models require Provisioned Throughput (not on-demand)
Distillation: large teacher → small student, same quality cheaper
LoRA — updates <1% of params; adapters load on top of base model

D2 · Prompt Engineering

Zero-shot — no examples; rely on model knowledge
Few-shot — 3–5 examples in prompt; controls format & style precisely
Chain-of-Thought (CoT) — "think step by step"; improves logic & maths
ReAct = Reason + Act → basis of Bedrock Agents loop
Prompt chaining — output of step N becomes input to step N+1
System prompt sets persona/constraints · Human/Assistant turns = conversation
Prompt injection — attacker embeds malicious instructions → mitigate with Guardrails
Bedrock Prompt Management — versioned templates with variable placeholders

D2 · Bedrock Agents (ReAct Loop)

Agent Orchestration Cycle

User
Input

→

Reason
(Thought)

→

Act
(Tool)

→

Observe
(Result)

→

Answer /
Loop

Action Group = Lambda fn + OpenAPI schema → what the agent can do
Knowledge Base attached to agent → automatic RAG retrieval
Return of Control (RoC) — agent pauses, application handles human approval
Multi-agent: Supervisor → delegates to Specialist Sub-agents
Agent memory: session (within convo) · cross-session (persisted)
Bedrock Flows — low-code visual pipeline builder

D2 · AWS Service Selector

Custom FM app, full control

Amazon Bedrock

No-code chatbot over company docs

Amazon Q Business

Enterprise keyword + semantic search

Amazon Kendra

Sentiment, entities, PII from text

Amazon Comprehend

Extract data from PDFs / forms

Amazon Textract

Train / deploy custom ML on GPU

Amazon SageMaker

Image / video analysis

Amazon Rekognition

Multi-step pipeline orchestration

AWS Step Functions

Audio → text (multimodal pipelines)

Amazon Transcribe

Validate data before FM ingestion

Glue Data Quality

D2 · SageMaker LMI & DJL Serving

Parameter	Controls	Key insight
`tensor_parallel_degree=N`	GPUs per replica	8 GPUs ÷ 4 = 2 replicas
`max_sequence_length`	KV-cache size	↓ length → ↑ batch → ↑ throughput
`max_rolling_batch_size`	Concurrent requests	Continuous batching fills GPU dynamically

Endpoint types: Real-time (<60s) · Async (long jobs) · Serverless (sporadic) · Batch Transform (offline)
Model Registry — Pending → Approved → CI/CD deploy (governance gate)
Model Cards — programmatic Responsible AI documentation via SDK
Glue Data Catalog — data lineage & source attribution for compliance

D3 · Guardrails & Responsible AI

Content Filters

hate · violence · sexual · insults — configurable severity threshold

Topic Denial

block competitor mentions, legal advice, investment topics, etc.

PII Redaction

SSN · credit card · email · phone → ANONYMIZE or BLOCK

Grounding Check

verify answer is supported by retrieved context (anti-hallucination)

Word Filters

custom deny-list · profanity filter

Responsible AI pillars: Fairness · Explainability · Privacy · Robustness · Transparency
Invocation logging → S3 / CloudWatch (required for audit; captures full prompt + response)
Encryption: KMS at rest · TLS in transit · VPC endpoints for private access
IAM: identity-based + resource-based policies · least-privilege per Lambda
SCP — deny bedrock:InvokeModel without GuardrailIdentifier org-wide

D4 · Cost & Performance Optimization

Strategy

Cost ↓

Latency ↓

Batch Inference API

✓ −50%

✗ async

Prompt Caching

✓

Streaming responses

✓ perceived

Smaller / distilled model

✓

Provisioned Throughput

~ high-vol

✓ no throttle

Reduce max_sequence_length

✓

✓ ↑ batch

Prompt caching caches the static system prompt portion — subsequent calls skip re-compute
Batch inference = async jobs for nightly/bulk processing, not real-time
CloudWatch metrics: InputTokenCount · OutputTokenCount · InvocationLatency · ThrottledRequests
ThrottlingException → exponential backoff + jitter, or upgrade to provisioned throughput

D5 · Testing, Evaluation & Troubleshooting

Model Evaluation jobs (Bedrock built-in): auto metrics or human review
Auto metrics: ROUGE (summarisation) · BERTScore (semantic) · accuracy
RAG-specific (RAGAS):
Context Precision · Context Recall · Answer Relevancy · Faithfulness
Faithfulness = is answer grounded in retrieved context? Low = hallucination
Hallucination → use grounding check in Guardrails + citations in response
Agent debugging: enable Trace in Bedrock console → shows each ReAct step (Thought / Action / Observation)
X-Ray tracing — end-to-end latency across Lambda + Bedrock calls
LLM-as-a-Judge — use an FM to evaluate another FM's output quality
Common failures: retrieval miss (bad chunking) · context window overflow · prompt injection · format errors · throttling