Domain 6: Implement knowledge mining and information extraction (15-20%)
This domain covers extracting insights from large volumes of unstructured data using Azure AI Search, Document Intelligence, and the newer Content Understanding service. At 15–20% weight, it is one of the highest-value domains on the exam.
6.1 Azure Content Understanding
A newer service in AI Foundry for building automated, multimodal extraction pipelines. It goes beyond Document Intelligence by handling images, video, and audio alongside text documents.
Key Capabilities
| Capability | Description |
|---|---|
| OCR Pipeline | Extracts text and layout from complex multi-page documents and images |
| Summarization & Classification | Uses generative AI to categorize and summarize content during ingestion |
| Entity & Table Extraction | Identifies structured data (tables, key-value pairs) within unstructured documents |
| Multimodal Processing | Ingests and analyzes video and audio alongside traditional documents |
Content Understanding vs Document Intelligence
Content Understanding = multimodal pipeline (docs + images + video + audio) with generative AI summarization; it's the newer, broader service. Document Intelligence = focused on forms and structured document extraction (invoices, receipts, ID documents) using prebuilt or custom models.
The exam phrase "new multimodal document pipeline" → Content Understanding. "Extract fields from invoices/receipts" → Document Intelligence.
6.2 Document Intelligence
Extracts structured data from forms, invoices, receipts, and other documents.
Model Types
| Model Type | Use Case | Training Requirement |
|---|---|---|
| Prebuilt | Invoice, Receipt, ID, W-2, Business Card | None (use as-is) |
| Custom Template | Documents with fixed layout (forms, tables in same position) | Minimum 5 labeled documents |
| Custom Neural | Unstructured documents (contracts, letters; layout varies) | 100+ labeled documents; higher accuracy |
| Composed | Routes a document to the best matching model from a collection | Multiple custom models trained separately |
Custom Template vs Custom Neural
Template model = fixed layout (form fields always in the same place). Train with 5+ docs. Neural model = variable layout (contracts, unstructured text). Needs 100+ docs. Better accuracy for complex docs.
Training & Evaluation
- Document Intelligence Studio: Visual labeling and training tool; drag to label fields.
- Accuracy metric: Percentage of correctly identified fields in test documents.
- Confidence scores: Each extracted value returns a 0–1 confidence score. Filter low-confidence extractions downstream.
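The downstream confidence filtering can be sketched in plain Python. The dict shape below loosely mirrors what an analysis result exposes per field; the field names and the 0.8 threshold are illustrative, not SDK output.

```python
CONFIDENCE_THRESHOLD = 0.8  # hypothetical cutoff: route anything below to human review

def triage_fields(fields: dict) -> tuple[dict, dict]:
    """Split extracted fields into auto-accepted and needs-review buckets."""
    accepted, review = {}, {}
    for name, field in fields.items():
        bucket = accepted if field["confidence"] >= CONFIDENCE_THRESHOLD else review
        bucket[name] = field["value"]
    return accepted, review

# Example fields as a prebuilt invoice model might score them (values made up):
fields = {
    "InvoiceTotal": {"value": "1,250.00", "confidence": 0.97},
    "VendorName":   {"value": "Contoso",  "confidence": 0.62},
}
accepted, review = triage_fields(fields)
# accepted -> {"InvoiceTotal": "1,250.00"}; review -> {"VendorName": "Contoso"}
```

In production the same split usually drives two paths: accepted fields flow straight into the database, review fields go to a human-in-the-loop queue.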
6.3 Azure AI Search
The core platform for building full-text and vector search solutions over large document corpora.
Enrichment Pipeline (Indexing)
Data Source → Indexer → Skillset → Index

| Stage | Description |
|---|---|
| Data Source | Where documents live: Azure Blob Storage, Azure SQL, Cosmos DB |
| Indexer | Schedules and runs the crawl, processing new and changed documents |
| Skillset | Chain of AI skills applied to each document during indexing |
| Index | The final searchable JSON structure queried at runtime |
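The data flow through those four stages can be illustrated with a toy sketch (this is conceptual Python, not the Azure API; the "key phrase" logic is a stand-in for a real cognitive skill):

```python
# Toy model of the enrichment pipeline: data source -> indexer -> skillset -> index.
data_source = [{"id": "1", "content": "Contoso invoice for Azure services"}]

def key_phrase_skill(doc):
    # Stand-in for a built-in cognitive skill: "key phrases" = capitalized words.
    doc["keyPhrases"] = [w for w in doc["content"].split() if w[0].isupper()]
    return doc

skillset = [key_phrase_skill]  # a skillset is an ordered chain of skills

def run_indexer(source, skills):
    """Crawl the source, enrich each document with every skill, emit the index."""
    index = []
    for doc in source:
        for skill in skills:
            doc = skill(doc)
        index.append(doc)
    return index

index = run_indexer(data_source, skillset)
# index[0]["keyPhrases"] -> ["Contoso", "Azure"]
```

The real service does the same enrichment declaratively: you define the data source, skillset, and index as JSON resources, and the indexer executes the chain on a schedule.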
Built-in Skills vs Custom Skills
| Skill Type | Examples | Key Detail |
|---|---|---|
| Built-in Cognitive Skills | OCR, Sentiment, Entity Recognition, Key Phrase, Image Analysis | Powered by Azure AI services; just reference them in skillset JSON |
| Custom Skill | Your own logic in an Azure Function or web API | Must follow exact values[] input/output schema |
Custom Skill Schema
A Custom Skill receives a values array and must return a values array whose entries echo the same recordId values. Forgetting this contract is the most common Custom Skill mistake on the exam.
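A minimal handler honoring this contract can be sketched in Python (Azure Functions-style; the uppercase transform and the output field name `myOutput` are placeholders, not part of the required schema):

```python
import json

def run_custom_skill(request_body: str) -> str:
    """Process a custom skill request: one result record per input record."""
    payload = json.loads(request_body)
    results = []
    for record in payload["values"]:
        text = record["data"].get("text", "")
        results.append({
            "recordId": record["recordId"],        # must echo the incoming recordId
            "data": {"myOutput": text.upper()},    # placeholder enrichment logic
            "errors": None,
            "warnings": None,
        })
    return json.dumps({"values": results})

response = run_custom_skill(
    '{"values": [{"recordId": "1", "data": {"text": "azure"}}]}'
)
# json.loads(response)["values"][0]["data"]["myOutput"] -> "AZURE"
```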
```json
// Input to your function:
{ "values": [{ "recordId": "1", "data": { "text": "..." } }] }

// Your function must return:
{ "values": [{ "recordId": "1", "data": { "myOutput": "..." } }] }
```

Knowledge Store
Persists enriched data as projections outside the search index for non-search use cases:
| Projection Type | Use Case |
|---|---|
| Table Projections | Power BI analytics, SQL queries |
| Object Projections | Secondary AI processing (JSON blobs in Blob Storage) |
| File Projections | Save extracted images or normalized document files |
Shaper Skill
Before projecting complex nested data, use the Shaper Skill in your skillset to reshape and flatten the enriched document into the schema required by your projection. The exam often asks which skill structures data for the Knowledge Store.
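A Shaper skill definition looks roughly like the sketch below (the skillset JSON expressed as a Python dict; the input names and source paths are illustrative, only the @odata.type and the inputs/outputs shape follow the skill's documented structure):

```python
# Sketch: a Shaper skill that gathers scattered enrichments into one
# "tableRow" shape that a table projection in the Knowledge Store can consume.
shaper_skill = {
    "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
    "context": "/document",
    "inputs": [
        {"name": "title",      "source": "/document/metadata_title"},  # illustrative paths
        {"name": "sentiment",  "source": "/document/sentimentScore"},
        {"name": "keyPhrases", "source": "/document/keyPhrases/*"},
    ],
    "outputs": [
        {"name": "output", "targetName": "tableRow"}
    ],
}
```

The table projection then references `tableRow` as its source, so each enriched document lands as one flat row.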
Query Syntax & Parameters
| Syntax / Parameter | Purpose | Example |
|---|---|---|
| Simple | Basic keyword search | azure search |
| Lucene (Full) | Wildcards, regex, fuzzy, proximity | azur~, "azure search"~3 |
| $filter | Boolean OData filter | Category eq 'Finance' and Year gt 2020 |
| $select | Return specific fields only | $select=title,summary |
| $top / $skip | Pagination | $top=10&$skip=20 |
| $orderby | Sort results | $orderby=search.score() desc |
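These parameters compose into the query string of a Search REST call, which can be sketched as below (the helper function is hypothetical; the $-prefixed parameter names are the real OData names, and 2023-11-01 is one of the stable REST api-versions):

```python
from urllib.parse import urlencode

def build_search_query(search_text, **odata):
    """Assemble a query string for a GET .../docs?... search request (sketch)."""
    params = {"api-version": "2023-11-01", "search": search_text}
    for key, value in odata.items():
        params[f"${key}"] = value  # OData parameters are $-prefixed in the URL
    return urlencode(params)

qs = build_search_query(
    "azure search",
    filter="Category eq 'Finance' and Year gt 2020",
    select="title,summary",
    top=10,
    skip=20,
)
# qs contains "%24filter=...", "%24select=title%2Csummary", "%24top=10", "%24skip=20"
```

The SDKs expose the same options as method arguments (e.g. filter, select, top), so you rarely build the URL by hand; the exam, however, tends to show the raw $-prefixed form.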
6.4 Vector and Semantic Search
Search Techniques Compared
| Technique | How It Works | Best For |
|---|---|---|
| Keyword Search | BM25 exact/fuzzy text matching | Precise term lookups |
| Vector Search | Embeddings + HNSW algorithm find semantically similar docs | Conceptual similarity, paraphrases |
| Hybrid Search | Combines keyword + vector scores using RRF (Reciprocal Rank Fusion) | Best overall recall |
| Semantic Ranking | Deep-learning (L2) re-ranker applied after retrieval | Surfacing the single best answer |
Semantic Ranking vs Vector Search
Vector search finds semantically similar documents using embeddings; it is a retrieval technique. Semantic ranking re-ranks already retrieved results using deep learning models to surface the single best answer; it is a post-retrieval step.
They are not the same. The exam tests this distinction directly.
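The RRF fusion behind hybrid search is simple enough to sketch in a few lines. Each document earns 1/(k + rank) per ranked list it appears in, summed across lists; k=60 is the commonly cited constant from the RRF literature, and Azure's internal implementation details may differ:

```python
def rrf(ranked_lists, k=60):
    """Fuse several rankings with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc2", "doc1", "doc5"]   # BM25 ranking (made-up ids)
vector_results  = ["doc1", "doc3", "doc2"]   # embedding-similarity ranking
fused = rrf([keyword_results, vector_results])
# fused -> ["doc1", "doc2", "doc3", "doc5"]
```

Note how doc1 and doc2, which appear near the top of both lists, outrank doc3 and doc5, which each appear in only one; that reward for cross-ranker agreement is the point of RRF.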
Integrated Vectorization
AI Search can generate embeddings during indexing automatically using a built-in skillset, so no separate embedding step is needed in your pipeline code.
HNSW Algorithm
Hierarchical Navigable Small World, the approximate nearest-neighbor algorithm used internally by AI Search for vector similarity. The exam may reference it as the algorithm behind vector search.
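What HNSW approximates is the exact brute-force scan below: score every document vector against the query by cosine similarity and keep the top k. The toy 3-dimensional vectors stand in for real embeddings (which typically have hundreds or thousands of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query, docs, k=2):
    """Exact k-nearest-neighbor search: the baseline HNSW approximates."""
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]

docs = {
    "contract": [0.9, 0.1, 0.0],   # made-up embeddings
    "invoice":  [0.8, 0.2, 0.1],
    "vacation": [0.0, 0.1, 0.9],
}
nearest = knn([1.0, 0.0, 0.0], docs)
# nearest -> ["contract", "invoice"]
```

HNSW avoids this full scan by navigating a layered proximity graph, trading a small amount of recall for search time that stays fast as the corpus grows to millions of vectors.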