Inference parameters: temperature (creativity, 0→1) · top-P / top-K (sampling) · max_tokens · stop_sequences

Invocation APIs: InvokeModel (sync) · InvokeModelWithResponseStream (streaming)

| Scenario | Use |
|---|---|
| Need fresh / dynamic data | RAG |
| Teach model a new output format or tone | Fine-tune |
| Domain-specific vocabulary adaptation | Cont. Pre-train |
| Quick, no training data available | RAG |
| Consistent JSON / structured output | Fine-tune |
| Latency-critical, no retrieval lag | Fine-tune |
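A minimal sketch of an InvokeModel request carrying the inference parameters above. The body schema shown is the Anthropic Messages format used on Bedrock; other model families expect different fields, and the model ID in the commented call is an illustrative assumption.

```python
import json

def build_body(prompt, temperature=0.2, top_p=0.9, max_tokens=512, stop=None):
    """Build an InvokeModel request body (Anthropic Messages schema on Bedrock)."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,        # hard cap on generated tokens
        "temperature": temperature,      # 0 = deterministic, 1 = most creative
        "top_p": top_p,                  # nucleus-sampling cutoff
        "stop_sequences": stop or [],    # generation halts when one appears
        "messages": [{"role": "user", "content": prompt}],
    })

# Hypothetical call (needs AWS credentials and model access):
# import boto3
# client = boto3.client("bedrock-runtime")
# resp = client.invoke_model(
#     modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model ID
#     body=build_body("Summarize RAG in one line."),
# )
# For token-by-token output, call invoke_model_with_response_stream instead.
```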
Fine-tuning data format: JSONL — prompt / completion pairs

| Parameter | Controls | Key insight |
|---|---|---|
| tensor_parallel_degree=N | GPUs per replica | 8 GPUs ÷ 4 = 2 replicas |
| max_sequence_length | KV-cache size | ↓ length → ↑ batch → ↑ throughput |
| max_rolling_batch_size | Concurrent requests | Continuous batching fills GPU dynamically |
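The table's parameters map onto a DJL-LMI `serving.properties` file. A sketch under assumptions: the model ID and backend choice are illustrative, and property names should be checked against the LMI docs for the backend in use.

```properties
# Illustrative DJL-LMI serving.properties — values are assumptions, not a
# tested deployment.
engine=MPI
option.model_id=meta-llama/Llama-2-13b-hf   # assumed model
option.rolling_batch=lmi-dist               # continuous-batching backend
option.tensor_parallel_degree=4             # 8 GPUs / 4 = 2 replicas
option.max_rolling_batch_size=64            # concurrent requests per replica
# Max sequence length caps KV-cache per request; shorter → bigger batches.
```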
Guardrail enforcement: deny bedrock:InvokeModel calls made without a GuardrailIdentifier, org-wide

CloudWatch metrics: InputTokenCount · OutputTokenCount · InvocationLatency · ThrottledRequests
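A sketch of querying one of those CloudWatch metrics. The `AWS/Bedrock` namespace and metric names come from the list above; the dimension, period, and statistics choices are illustrative assumptions.

```python
import datetime as dt

def latency_query(model_id, hours=1):
    """Build get_metric_statistics parameters for Bedrock invocation latency."""
    now = dt.datetime.now(dt.timezone.utc)
    return {
        "Namespace": "AWS/Bedrock",
        "MetricName": "InvocationLatency",
        "Dimensions": [{"Name": "ModelId", "Value": model_id}],
        "StartTime": now - dt.timedelta(hours=hours),
        "EndTime": now,
        "Period": 300,                       # 5-minute buckets (assumed)
        "Statistics": ["Average", "Maximum"],
    }

# Hypothetical call (needs AWS credentials):
# import boto3
# cw = boto3.client("cloudwatch")
# stats = cw.get_metric_statistics(
#     **latency_query("anthropic.claude-3-haiku-20240307-v1:0"))
```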