Microsoft Certification · SQL AI Developer Associate
DP-800 · 5-Minute Cheat Sheet
D1 · Design & Develop 35–40% D2 · Secure, Optimize & Deploy 35–40% D3 · AI Capabilities 25–30%
📝 D1 · T-SQL Power Features
FeatureKey Syntax / PatternWhen
CTEWITH cte AS (SELECT …) SELECT …Recursive hierarchies, readability
WindowROW_NUMBER() OVER (PARTITION BY … ORDER BY …)Ranking, running totals, lag/lead
JSONJSON_VALUE / OPENJSON / FOR JSON PATHSemi-structured data, LLM payloads
REGEXPREGEXP_LIKE(col, pattern)Pattern matching (Fabric SQL)
FuzzyEDIT_DISTANCE(a, b)Typo tolerance, similarity search
GraphMATCH (node1-(edge)->node2)Social, recommendation graphs
ErrorBEGIN TRY … END TRY BEGIN CATCH THROW END CATCHStored procedures, RAG callers
🗄️ D1 · Specialized Table Types
Temporal TableSYSTEM_TIME history. FOR SYSTEM_TIME AS OF '2024-01-01'. Point-in-time queries. Separate history table auto-managed.
In-Memory OLTPMEMORY_OPTIMIZED=ON. No locks — version-based concurrency. Natively compiled SPs. Best for high-throughput OLTP.
Columnstore IndexClustered (replaces row store) vs Non-clustered. Delta store → compressed. Best for analytics / aggregations on large tables.
Ledger TableAppend-only or updatable. Cryptographic hash chain. Tamper-evident audit trail.
Graph TableNODE table (entities) + EDGE table (relationships). MATCH clause for pattern traversal. SHORTEST_PATH for path queries.
External TablePolyBase / Fabric. Reads from blob, ADLS, or other DB. No data moved. Push-down filtering.
⚙️ D1 · Programmability Objects
ObjectKey Facts
ViewWITH SCHEMABINDING for indexed views. Updatable if single-table, no aggregates. NOT WITH CHECK OPTION trap.
Inline TVFReturns table, single SELECT. Optimised like a view — prefer over multi-statement TVF.
Multi-stmt TVFUses table variable. Optimizer can't see row count — cardinality issues. Avoid if performance matters.
Scalar UDFInhibits parallelism. Use inline TVF or computed column instead where possible.
Stored ProcInput/output params. sp_executesql for parameterized dynamic SQL. Core RAG executor.
AFTER TriggerFires after DML. Uses inserted/deleted pseudo-tables. Good for audit, embedding update.
INSTEAD OFReplaces DML. Needed for views. Risky — easy to break insert logic.
SEQUENCENEXT VALUE FOR seq. Can gap-skip on rollback. Useful across tables (unlike IDENTITY).
🔒 D2 · Security Quick Reference
FeatureSyntax / ConfigNotes
RLSCREATE SECURITY POLICY
FILTER PREDICATE fn(col)
FILTER = SELECT. BLOCK = DML. Inline TVF predicate is mandatory.
DDMMASKED WITH (FUNCTION='default()')DEFAULT/EMAIL/PARTIAL/CUSTOM. UNMASK permission bypasses. Not encryption — app-layer only.
Always EncryptedCMK (stored in Key Vault) → CEK → ColumnDeterministic = can filter/join. Randomized = more secure, no search. Driver-side encryption.
Column EncryptionENCRYPTBYKEY / DECRYPTBYKEYSymmetric key. Server-side. Weaker than Always Encrypted.
RLS PredicateCREATE FUNCTION … RETURNS TABLE WITH (NOEXPAND)Must use SCHEMABINDING. One policy per table max.
PasswordlessManaged Identity → Azure AD tokenNo password in connection string. System-assigned vs user-assigned MI.
AuditingServer-level > DB-level. Destinations: Blob, Log Analytics, Event Hub.Action groups: SUCCESSFUL_DATABASE_AUTHENTICATION_GROUP etc.
D2 · Query Performance Toolkit
Execution Plan Red FlagsKey Lookup → add covering index. Sort operator → add ORDER BY to index. Clustered Index Scan on large table → check WHERE clause selectivity. Nested Loops on large sets → index or join hint.
Query Store WorkflowRegressed Queries → find plan change → Force Plan. Top Resource Consumers → spot recurring expensive queries. Query Performance Insight (Azure) → cloud UI overlay.
Top DMVssys.dm_exec_query_stats → CPU/reads per query
sys.dm_exec_requests → active sessions
sys.dm_os_wait_stats → bottleneck type
sys.dm_exec_query_plan → XML plan
Isolation Levels (low → high lock)READ UNCOMMITTED → READ COMMITTED → REPEATABLE READ → SERIALIZABLE → SNAPSHOT (optimistic, no locks on reads)
Blocking / DeadlockDeadlock = circular wait. Graph in Event Session / Extended Events. Resolution: retry logic, lock timeout, index to reduce hold time. RCSI eliminates most read blocks.
🔄 D2 · SQL DB Projects & CI/CD
.sqlproj (SDK-style)MSBuild project. Declares SQL objects. No scripts — declarative model.
dacpacCompiled artifact. Contains schema snapshot. Input to SqlPackage.
publish profile (.publish.xml)Target server, auth, deployment options (IgnorePermissions etc.)
Schema DriftSqlPackage /Action:Report compares dacpac → target. Returns drift XML.
tSQLtUnit test framework. FakeTable, AssertEquals, ExpectException. Runs in SQL.
🔁 Pipeline Stages
PR → Branch Policy
Build dacpac
Unit Tests (tSQLt)
Schema Drift Report
Deploy to Staging
Approval Gate
Prod
Secrets ManagementAzure Key Vault → pipeline variable injection. Never hardcode in .sqlproj.
Code OwnersCODEOWNERS file. Protects main/prod branches. Enforces PR review by owner.
🌐 D2 · Data API Builder (DAB)
// dab-config.json skeleton { "data-source": { "database-type": "mssql", "connection-string": "@env('CONN')" }, "entities": { "Product": { "source": "dbo.Products", "permissions": [{"role":"authenticated","actions":["read"]}], "rest": { "enabled": true, "path": "/products" }, "graphql": { "enabled": true, "type": "Product" } } } }
Config KeyPurpose
rest.pathURL path override. Default = entity name.
graphql.typeGraphQL type name. Enables relationships.
permissions.roleanonymous | authenticated | custom role
source.type = stored-procedureExpose SP as REST/GraphQL endpoint
cache.enabled = trueIn-memory cache, TTL configurable
pagination.$limitMax page size. Use $after cursor for keyset.
📡 D2 · Change Detection & Integration
MethodGranularityBest ForOverhead
Change TrackingRow (changed/deleted flag)Sync, lightweight embedding triggersLow
CDCRow + before/after column valuesAudit, ETL, embedding pipelinesMedium
CESReal-time event streamStreaming analytics, event-driven AIHigher
Az Functions SQL TriggerRow-level, event-drivenServerless embedding refresh on changeLow
Logic AppsConnector-based pollingLow-code integration pipelinesMedium
📊 Azure Monitor Targets
Application InsightsDAB & app-layer telemetry. Request traces, dependency calls, exceptions.
Log Analytics WorkspaceSQL audit logs, diagnostics. KQL queries across resources.
SQL InsightsDMV-based monitoring. CPU, blocking, wait stats time series.
🧮 D3 · Vector Data & Embeddings
-- Declare vector column (up to 1998 dims) CREATE TABLE docs ( id INT, content NVARCHAR(MAX), embedding VECTOR(1536) -- text-embedding-3-small dims ); -- Cosine similarity search SELECT TOP 5 id, content, VECTOR_DISTANCE('cosine', embedding, @queryVec) AS dist FROM docs ORDER BY dist;
FunctionPurpose
VECTOR_DISTANCE(metric, v1, v2)cosine | dot_product | euclidean
VECTOR_NORMALIZE(v)Unit vector — required before dot product
VECTORPROPERTY(col, 'Dimensions')Returns dimension count of a VECTOR column
VECTOR_SEARCH()ANN top-K via index (DiskANN). Fast at scale.
ANN (Approximate)
Uses vector index (DiskANN)
Fast — O(log n)
May miss some matches
Production scale
ENN (Exact)
Full table scan
Slow — O(n)
100% recall
Small datasets / testing
🔍 D3 · Intelligent Search Strategies
TypeMechanismUse When
Full-TextBM25 TF-IDF scoring. CONTAINS / FREETEXT / FREETEXTTABLE. Requires full-text index.Keyword search, exact phrase matching, stemming
Semantic VectorVECTOR_DISTANCE on embeddings. Meaning-based retrieval.Conceptual match, multilingual, paraphrase detection
Hybrid SearchFull-text score + vector score merged via RRF.Best of both — production RAG retrieval
🎯 Reciprocal Rank Fusion (RRF)
-- RRF score formula score = 1 / (k + rank_in_list) -- k=60 (constant to smooth outliers) -- Combine: full_text_score + vector_score -- ORDER BY combined_rrf_score DESC
k=60 prevents top-rank results from dominating. Higher k = more uniform distribution.
CONTAINS(col, 'term')Exact / inflectional / thesaurus / proximity
FREETEXT(col, 'phrase')NLP tokenization — more forgiving than CONTAINS
FREETEXTTABLEReturns RANK score — joinable for RRF hybrid
🤖 D3 · External Models & Embedding Maintenance
-- Create external model (Fabric SQL) CREATE EXTERNAL MODEL myEmbedModel WITH (LOCATION = 'https://…/openai', CREDENTIAL = oaiCred, TASK = 'EMBEDDINGS', MODEL = 'text-embedding-3-small');
Maintenance MethodTrade-off
AFTER INSERT/UPDATE TriggerSynchronous — blocks write. Simple. Risk: trigger failure = write failure.
Change Tracking + batch jobAsynchronous — no write overhead. Slight lag. Good for bulk tables.
CDC + Azure FunctionEvent-driven. Near real-time. Serverless scale. Recommended for prod.
CESStreaming. Lowest latency. Most complex to manage.
🗂️ Chunking Strategies
Fixed-sizeSimple. Risk: breaks mid-sentence. Use overlap (10–15%).
SentenceNLP sentence boundaries. Better semantic units.
SemanticTopic-based splits. Best quality. Most complex.
Include which cols?Content cols + context cols (category, date). Exclude IDs, foreign keys.
D3 · RAG Pattern in SQL
User Query
Embed Query
Hybrid Search
RRF Rank
Build Prompt (FOR JSON)
sp_invoke LLM
OPENJSON Response
-- Core RAG stored procedure pattern EXEC sp_invoke_external_rest_endpoint @url = N'https://<openai>/chat/completions', @method = N'POST', @credential = [myCredential], @payload = @jsonPayload, -- built via FOR JSON PATH @response = @result OUTPUT; -- Extract answer from response SELECT JSON_VALUE(@result, '$.result.choices[0].message.content');
FOR JSON PATHConverts SQL rows → JSON context for LLM prompt
FOR JSON AUTOAuto-infers structure — less control, quick prototype
OPENJSON(@resp,'$.choices')Parse LLM response array back to rows
@credentialDATABASE SCOPED CREDENTIAL — stores API key securely
🎯 Scenario Speed-Decide · Exam Pattern Recognition
Security Scenarios
"Users see only their own rows"→ Row-Level Security (RLS) with FILTER predicate
"Hide last 4 digits of SSN"→ Dynamic Data Masking, PARTIAL function
"Encrypt SSN — app must query it"→ Always Encrypted DETERMINISTIC (allows equality filter)
"Encrypt but no search needed"→ Always Encrypted RANDOMIZED (stronger)
"No passwords in code"→ Managed Identity + Azure AD authentication
Performance Scenarios
"Slow aggregation on 500M rows"→ Add clustered columnstore index
"Query plan regressed after stats update"→ Query Store → Force previous plan
"Readers blocking writers"→ Enable RCSI (READ_COMMITTED_SNAPSHOT)
"Need point-in-time data audit"→ Temporal table + FOR SYSTEM_TIME AS OF
"Deadlock on two tables"→ Enforce consistent lock order + add index
AI / Search Scenarios
"Find semantically similar documents"→ VECTOR_DISTANCE cosine on embedding column
"Best search across keyword + meaning"→ Hybrid search + RRF ranking
"Generate answer from DB content"→ RAG: hybrid search → FOR JSON → sp_invoke → OPENJSON
"Embeddings must stay current"→ Azure Function SQL trigger (CDC-based, async)
"100% recall needed on small set"→ ENN (exact). Large set → ANN + DiskANN index
Deployment / Integration
"Expose table as REST + GraphQL"→ Data API Builder (DAB) with entity config
"DB schema out of sync with code"→ SQL DB Projects + SqlPackage schema drift report
"Real-time row change triggers AI update"→ Azure Functions SQL trigger binding
"Need Copilot to understand SQL conventions"→ .github/copilot-instructions.md instruction file
"GraphQL endpoint must use Managed Identity"→ DAB + Managed Identity auth, no API key