Domain 6: Implement knowledge mining and information extraction (15-20%)
This domain covers extracting insights from large volumes of unstructured data using Azure AI Search, Document Intelligence, and the newer Content Understanding service. At 15–20% weight, it is one of the highest-value domains on the exam.
6.1 Azure Content Understanding
A newer service in AI Foundry for building automated, multimodal extraction pipelines. It goes beyond Document Intelligence by handling images, video, and audio alongside text documents.
Key Capabilities
| Capability | Description |
|---|---|
| OCR Pipeline | Extracts text and layout from complex multi-page documents and images |
| Summarization & Classification | Uses generative AI to categorize and summarize content during ingestion |
| Entity & Table Extraction | Identifies structured data (tables, key-value pairs) within unstructured documents |
| Multimodal Processing | Ingests and analyzes video and audio alongside traditional documents |
Content Understanding vs Document Intelligence
Content Understanding = multimodal pipeline (docs + images + video + audio) with generative AI summarization; it's the newer, broader service. Document Intelligence = focused on forms and structured document extraction (invoices, receipts, ID documents) using prebuilt or custom models.
The exam phrase "new multimodal document pipeline" → Content Understanding. "Extract fields from invoices/receipts" → Document Intelligence.
6.2 Document Intelligence
Extracts structured data from forms, invoices, receipts, and other documents.
Model Types
| Model Type | Use Case | Training Requirement |
|---|---|---|
| Prebuilt | Invoice, Receipt, ID, W-2, Business Card | None (use as-is) |
| Custom Template | Documents with fixed layout (forms, tables in same position) | Minimum 5 labeled documents |
| Custom Neural | Unstructured documents (contracts, letters; layout varies) | 100+ labeled documents; higher accuracy |
| Composed | Routes a document to the best matching model from a collection | Multiple custom models trained separately |
Custom Template vs Custom Neural
Template model = fixed layout (form fields always in the same place). Train with 5+ docs. Neural model = variable layout (contracts, unstructured text). Needs 100+ docs. Better accuracy for complex docs.
Training & Evaluation
- Document Intelligence Studio: Visual labeling and training tool; drag to label fields.
- Accuracy metric: Percentage of correctly identified fields in test documents.
- Confidence scores: Each extracted value returns a 0–1 confidence score. Filter low-confidence extractions downstream.
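The downstream confidence filtering can be sketched in plain Python. The dict shape below loosely mirrors what an analysis result exposes per field; the field names and the 0.8 threshold are illustrative, not SDK output.

```python
CONFIDENCE_THRESHOLD = 0.8  # hypothetical cutoff: route anything below to human review

def triage_fields(fields: dict) -> tuple[dict, dict]:
    """Split extracted fields into auto-accepted and needs-review buckets."""
    accepted, review = {}, {}
    for name, field in fields.items():
        bucket = accepted if field["confidence"] >= CONFIDENCE_THRESHOLD else review
        bucket[name] = field["value"]
    return accepted, review

# Example fields as a prebuilt invoice model might score them (values made up):
fields = {
    "InvoiceTotal": {"value": "1,250.00", "confidence": 0.97},
    "VendorName":   {"value": "Contoso",  "confidence": 0.62},
}
accepted, review = triage_fields(fields)
# accepted -> {"InvoiceTotal": "1,250.00"}; review -> {"VendorName": "Contoso"}
```

In production the same split usually drives two paths: accepted fields flow straight into the database, review fields go to a human-in-the-loop queue.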
6.3 Azure AI Search
The core platform for building full-text and vector search solutions over large document corpora.
Enrichment Pipeline (Indexing)
Data Source → Indexer → Skillset → Index

| Stage | Description |
|---|---|
| Data Source | Where documents live: Azure Blob Storage, Azure SQL, Cosmos DB |
| Indexer | Schedules and runs the crawl, processing new and changed documents |
| Skillset | Chain of AI skills applied to each document during indexing |
| Index | The final searchable JSON structure queried at runtime |
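The data flow through those four stages can be illustrated with a toy sketch (this is conceptual Python, not the Azure API; the "key phrase" logic is a stand-in for a real cognitive skill):

```python
# Toy model of the enrichment pipeline: data source -> indexer -> skillset -> index.
data_source = [{"id": "1", "content": "Contoso invoice for Azure services"}]

def key_phrase_skill(doc):
    # Stand-in for a built-in cognitive skill: "key phrases" = capitalized words.
    doc["keyPhrases"] = [w for w in doc["content"].split() if w[0].isupper()]
    return doc

skillset = [key_phrase_skill]  # a skillset is an ordered chain of skills

def run_indexer(source, skills):
    """Crawl the source, enrich each document with every skill, emit the index."""
    index = []
    for doc in source:
        for skill in skills:
            doc = skill(doc)
        index.append(doc)
    return index

index = run_indexer(data_source, skillset)
# index[0]["keyPhrases"] -> ["Contoso", "Azure"]
```

The real service does the same enrichment declaratively: you define the data source, skillset, and index as JSON resources, and the indexer executes the chain on a schedule.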
Built-in Skills vs Custom Skills
| Skill Type | Examples | Key Detail |
|---|---|---|
| Built-in Cognitive Skills | OCR, Sentiment, Entity Recognition, Key Phrase, Image Analysis | Powered by Azure AI services; just reference them in skillset JSON |
| Custom Skill | Your own logic in an Azure Function or web API | Must follow exact values[] input/output schema |
Custom Skill Schema
A Custom Skill receives a values array and must return a values array whose entries echo the same recordId values. Forgetting this contract is the most common Custom Skill mistake on the exam.
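A minimal handler honoring this contract can be sketched in Python (Azure Functions-style; the uppercase transform and the output field name `myOutput` are placeholders, not part of the required schema):

```python
import json

def run_custom_skill(request_body: str) -> str:
    """Process a custom skill request: one result record per input record."""
    payload = json.loads(request_body)
    results = []
    for record in payload["values"]:
        text = record["data"].get("text", "")
        results.append({
            "recordId": record["recordId"],        # must echo the incoming recordId
            "data": {"myOutput": text.upper()},    # placeholder enrichment logic
            "errors": None,
            "warnings": None,
        })
    return json.dumps({"values": results})

response = run_custom_skill(
    '{"values": [{"recordId": "1", "data": {"text": "azure"}}]}'
)
# json.loads(response)["values"][0]["data"]["myOutput"] -> "AZURE"
```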
```json
// Input to your function:
{ "values": [{ "recordId": "1", "data": { "text": "..." } }] }

// Your function must return:
{ "values": [{ "recordId": "1", "data": { "myOutput": "..." } }] }
```

Knowledge Store
Persists enriched data as projections outside the search index for non-search use cases:
| Projection Type | Use Case |
|---|---|
| Table Projections | Power BI analytics, SQL queries |
| Object Projections | Secondary AI processing (JSON blobs in Blob Storage) |
| File Projections | Save extracted images or normalized document files |
Shaper Skill
Before projecting complex nested data, use the Shaper Skill in your skillset to reshape and flatten the enriched document into the schema required by your projection. The exam often asks which skill structures data for the Knowledge Store.
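A Shaper skill definition looks roughly like the sketch below (the skillset JSON expressed as a Python dict; the input names and source paths are illustrative, only the @odata.type and the inputs/outputs shape follow the skill's documented structure):

```python
# Sketch: a Shaper skill that gathers scattered enrichments into one
# "tableRow" shape that a table projection in the Knowledge Store can consume.
shaper_skill = {
    "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
    "context": "/document",
    "inputs": [
        {"name": "title",      "source": "/document/metadata_title"},  # illustrative paths
        {"name": "sentiment",  "source": "/document/sentimentScore"},
        {"name": "keyPhrases", "source": "/document/keyPhrases/*"},
    ],
    "outputs": [
        {"name": "output", "targetName": "tableRow"}
    ],
}
```

The table projection then references `tableRow` as its source, so each enriched document lands as one flat row.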
Query Syntax & Parameters
| Syntax / Parameter | Purpose | Example |
|---|---|---|
| Simple | Basic keyword search | azure search |
| Lucene (Full) | Wildcards, regex, fuzzy, proximity | azur~, "azure search"~3 |
| $filter | Boolean OData filter | Category eq 'Finance' and Year gt 2020 |
| $select | Return specific fields only | $select=title,summary |
| $top / $skip | Pagination | $top=10&$skip=20 |
| $orderby | Sort results | $orderby=search.score() desc |
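These parameters compose into the query string of a Search REST call, which can be sketched as below (the helper function is hypothetical; the $-prefixed parameter names are the real OData names, and 2023-11-01 is one of the stable REST api-versions):

```python
from urllib.parse import urlencode

def build_search_query(search_text, **odata):
    """Assemble a query string for a GET .../docs?... search request (sketch)."""
    params = {"api-version": "2023-11-01", "search": search_text}
    for key, value in odata.items():
        params[f"${key}"] = value  # OData parameters are $-prefixed in the URL
    return urlencode(params)

qs = build_search_query(
    "azure search",
    filter="Category eq 'Finance' and Year gt 2020",
    select="title,summary",
    top=10,
    skip=20,
)
# qs contains "%24filter=...", "%24select=title%2Csummary", "%24top=10", "%24skip=20"
```

The SDKs expose the same options as method arguments (e.g. filter, select, top), so you rarely build the URL by hand; the exam, however, tends to show the raw $-prefixed form.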
6.4 Vector and Semantic Search
Search Techniques Compared
| Technique | How It Works | Best For |
|---|---|---|
| Keyword Search | BM25 exact/fuzzy text matching | Precise term lookups |
| Vector Search | Embeddings + HNSW algorithm find semantically similar docs | Conceptual similarity, paraphrases |
| Hybrid Search | Combines keyword + vector scores using RRF (Reciprocal Rank Fusion) | Best overall recall |
| Semantic Ranking | Deep-learning (L2) re-ranker applied after retrieval | Surfacing the single best answer |
Semantic Ranking vs Vector Search
Vector search finds semantically similar documents using embeddings; it is a retrieval technique. Semantic ranking re-ranks already retrieved results using deep learning models to surface the single best answer; it is a post-retrieval step.
They are not the same. The exam tests this distinction directly.
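The RRF fusion behind hybrid search is simple enough to sketch in a few lines. Each document earns 1/(k + rank) per ranked list it appears in, summed across lists; k=60 is the commonly cited constant from the RRF literature, and Azure's internal implementation details may differ:

```python
def rrf(ranked_lists, k=60):
    """Fuse several rankings with Reciprocal Rank Fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["doc2", "doc1", "doc5"]   # BM25 ranking (made-up ids)
vector_results  = ["doc1", "doc3", "doc2"]   # embedding-similarity ranking
fused = rrf([keyword_results, vector_results])
# fused -> ["doc1", "doc2", "doc3", "doc5"]
```

Note how doc1 and doc2, which appear near the top of both lists, outrank doc3 and doc5, which each appear in only one; that reward for cross-ranker agreement is the point of RRF.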
Integrated Vectorization
AI Search can generate embeddings during indexing automatically using a built-in skillset, so no separate embedding step is needed in your pipeline code.
HNSW Algorithm
Hierarchical Navigable Small World, the approximate nearest-neighbor algorithm used internally by AI Search for vector similarity. The exam may reference it as the algorithm behind vector search.
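What HNSW approximates is the exact brute-force scan below: score every document vector against the query by cosine similarity and keep the top k. The toy 3-dimensional vectors stand in for real embeddings (which typically have hundreds or thousands of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query, docs, k=2):
    """Exact k-nearest-neighbor search: the baseline HNSW approximates."""
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]

docs = {
    "contract": [0.9, 0.1, 0.0],   # made-up embeddings
    "invoice":  [0.8, 0.2, 0.1],
    "vacation": [0.0, 0.1, 0.9],
}
nearest = knn([1.0, 0.0, 0.0], docs)
# nearest -> ["contract", "invoice"]
```

HNSW avoids this full scan by navigating a layered proximity graph, trading a small amount of recall for search time that stays fast as the corpus grows to millions of vectors.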