
Domain 6: Implement knowledge mining and information extraction (15-20%)



This domain covers extracting insights from large volumes of unstructured data using Azure AI Search, Document Intelligence, and the newer Content Understanding service. At 15-20% of the exam weight, it is one of the highest-value domains.

6.1 Azure Content Understanding

A newer service in AI Foundry for building automated, multimodal extraction pipelines. It goes beyond Document Intelligence by handling images, video, and audio alongside text documents.

Key Capabilities

| Capability | Description |
| --- | --- |
| OCR Pipeline | Extracts text and layout from complex multi-page documents and images |
| Summarization & Classification | Uses generative AI to categorize and summarize content during ingestion |
| Entity & Table Extraction | Identifies structured data (tables, key-value pairs) within unstructured documents |
| Multimodal Processing | Ingests and analyzes video and audio alongside traditional documents |

Content Understanding vs Document Intelligence

Content Understanding → multimodal pipeline (docs + images + video + audio) with generative AI summarization. It's the newer, broader service.
Document Intelligence → focused on forms and structured document extraction (invoices, receipts, ID documents) using prebuilt or custom models.

The exam phrase "new multimodal document pipeline" → Content Understanding. "Extract fields from invoices/receipts" → Document Intelligence.


6.2 Document Intelligence

Extracts structured data from forms, invoices, receipts, and other documents.

Model Types

| Model Type | Use Case | Training Requirement |
| --- | --- | --- |
| Prebuilt | Invoice, Receipt, ID, W-2, Business Card | None; use as-is |
| Custom Template | Documents with a fixed layout (forms, tables in the same position) | Minimum 5 labeled documents |
| Custom Neural | Unstructured documents (contracts, letters; layout varies) | Minimum 5 labeled documents; many more varied samples (100+ is a common target) for higher accuracy |
| Composed | Routes a document to the best matching model from a collection | Multiple custom models trained separately |

Custom Template vs Custom Neural

Template model = fixed layout (form fields always in the same place). Train with 5+ docs.
Neural model = variable layout (contracts, unstructured text). It can also start from 5 labeled docs, but plan for far more varied samples (100+ is a common target). Better accuracy for complex docs.

Training & Evaluation

  • Document Intelligence Studio: Visual labeling and training tool; drag to label fields.
  • Accuracy metric: Percentage of correctly identified fields in test documents.
  • Confidence scores: Each extracted value returns a confidence score between 0 and 1. Filter low-confidence extractions downstream.
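The downstream filtering step mentioned in the last bullet can be sketched as follows. This is a minimal illustration, not the SDK's response type: the dictionary shape, the field names, and the 0.8 threshold are all assumptions chosen for the example.

```python
from typing import Any

# Hypothetical extraction result: field name -> value and confidence.
# This shape is an assumption for illustration, not the service's response model.
extracted_fields: dict[str, dict[str, Any]] = {
    "VendorName": {"value": "Contoso", "confidence": 0.98},
    "InvoiceTotal": {"value": "1,250.00", "confidence": 0.91},
    "DueDate": {"value": "2024-13-01", "confidence": 0.42},
}


def split_by_confidence(fields: dict[str, dict[str, Any]], threshold: float = 0.8):
    """Separate fields that meet the threshold from those needing human review."""
    accepted = {name: f for name, f in fields.items() if f["confidence"] >= threshold}
    review = {name: f for name, f in fields.items() if f["confidence"] < threshold}
    return accepted, review


accepted, needs_review = split_by_confidence(extracted_fields)
print("accepted:", sorted(accepted))
print("needs review:", sorted(needs_review))
```

The right threshold depends on the cost of a wrong value versus the cost of manual review, so treat 0.8 as a placeholder.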

6.3 Azure AI Search

Azure AI Search is the core platform for building full-text and vector search solutions over large document corpora.

Enrichment Pipeline (Indexing)

Data Source → Indexer → Skillset → Index

| Stage | Description |
| --- | --- |
| Data Source | Where documents live: Azure Blob Storage, Azure SQL, Cosmos DB |
| Indexer | Schedules and runs the crawl; processes new and changed documents |
| Skillset | Chain of AI skills applied to each document during indexing |
| Index | The final searchable JSON structure queried at runtime |
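To make the relationship between the four stages concrete, the sketch below assembles an indexer definition that points at a data source, a skillset, and a target index. The property names follow the indexer REST payload as I understand it, and every resource name and the two-hour schedule are hypothetical; check the current API reference before relying on the exact shape.

```python
import json


def build_indexer_definition(name: str, data_source: str, skillset: str,
                             index: str, interval: str = "PT2H") -> dict:
    """Return an indexer definition that wires the pipeline stages together.

    The interval is an ISO 8601 duration (PT2H means every two hours).
    """
    return {
        "name": name,
        "dataSourceName": data_source,   # where documents live
        "skillsetName": skillset,        # enrichment applied per document
        "targetIndexName": index,        # where enriched output is stored
        "schedule": {"interval": interval},
    }


# All resource names below are made up for illustration.
indexer = build_indexer_definition(
    name="contracts-indexer",
    data_source="contracts-blob-ds",
    skillset="contracts-skillset",
    index="contracts-index",
)
print(json.dumps(indexer, indent=2))
```

The indexer is the only object that references the other three, which is why it is the piece you run or schedule.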

Built-in Skills vs Custom Skills

| Skill Type | Examples | Key Detail |
| --- | --- | --- |
| Built-in Cognitive Skills | OCR, Sentiment, Entity Recognition, Key Phrase, Image Analysis | Powered by Azure AI services; reference them in the skillset JSON |
| Custom Skill | Your own logic in an Azure Function or web API | Must follow the exact values[] input/output schema |

Custom Skill Schema

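A Custom Skill receives a values array and must return a values array in which every record carries the same recordId it arrived with. The recordId is how the indexer matches each output back to its input document. Forgetting this contract is the most common Custom Skill mistake on the exam.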

Input to your function:

```json
{ "values": [{ "recordId": "1", "data": { "text": "..." } }] }
```

Your function must return:

```json
{ "values": [{ "recordId": "1", "data": { "myOutput": "..." } }] }
```
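The contract above can be sketched as a plain Python function that could sit behind an Azure Function or any web API. The word-count logic and the function name are placeholders; only the values / recordId / data envelope comes from the schema shown in this section.

```python
def run_custom_skill(request_body: dict) -> dict:
    """Process a custom-skill request and echo each recordId in the response."""
    results = []
    for record in request_body.get("values", []):
        text = record.get("data", {}).get("text", "")
        results.append({
            # The recordId must be returned unchanged so outputs map back to inputs.
            "recordId": record["recordId"],
            # Placeholder enrichment: count whitespace-separated words.
            "data": {"myOutput": len(text.split())},
        })
    return {"values": results}


request = {
    "values": [
        {"recordId": "1", "data": {"text": "knowledge mining with skills"}},
        {"recordId": "2", "data": {"text": ""}},
    ]
}
response = run_custom_skill(request)
print(response)
```

Returning one output record per input record, even for empty text, keeps the batch aligned.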

Knowledge Store

Persists enriched data as projections outside the search index for non-search use cases:

| Projection Type | Use Case |
| --- | --- |
| Table Projections | Power BI analytics, SQL queries |
| Object Projections | Secondary AI processing (JSON blobs in Blob Storage) |
| File Projections | Save extracted images or normalized document files |

Shaper Skill

Before projecting complex nested data, use the Shaper Skill in your skillset to reshape and flatten the enriched document into the schema required by your projection. The exam often asks which skill structures data for the Knowledge Store.
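As a rough illustration, a Shaper Skill entry in a skillset might look like the definition below, written here as a Python dictionary. The `@odata.type` string and the inputs/outputs layout reflect the skill definition as I understand it; the source paths and the `projectionShape` target name are hypothetical and depend on what earlier skills in your skillset emit.

```python
import json

# Sketch of a Shaper Skill definition. Source paths and the target name are
# assumptions for illustration; adjust them to your own enrichment tree.
shaper_skill = {
    "@odata.type": "#Microsoft.Skills.Util.ShaperSkill",
    "context": "/document",
    "inputs": [
        {"name": "fileName", "source": "/document/metadata_storage_name"},
        {"name": "keyPhrases", "source": "/document/keyPhrases"},
        {"name": "sentiment", "source": "/document/sentiment"},
    ],
    "outputs": [
        # The single output collects the inputs into one shaped object.
        {"name": "output", "targetName": "projectionShape"},
    ],
}

print(json.dumps(shaper_skill, indent=2))
```

A projection would then reference the shaped node (here `/document/projectionShape`) as its source rather than the individual enrichment paths.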

Query Syntax & Parameters

| Syntax / Parameter | Purpose | Example |
| --- | --- | --- |
| Simple | Basic keyword search (default queryType) | azure search |
| Lucene (Full) | Wildcards, regex, fuzzy, proximity (requires queryType=full) | azur~, "azure search"~3 |
| $filter | Boolean OData filter | Category eq 'Finance' and Year gt 2020 |
| $select | Return specific fields only | $select=title,summary |
| $top / $skip | Pagination | $top=10&$skip=20 |
| $orderby | Sort results | $orderby=search.score() desc |
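The parameters in the table combine into a single query. The sketch below builds a POST-style request body and derives skip from a page number; the helper name, the field names (Category, Year, title, summary), and the page size are assumptions for illustration, and the body keys mirror the parameters above without the leading `$`.

```python
from typing import Optional


def build_search_body(search_text: str, page: int = 1, page_size: int = 10,
                      filter_expr: Optional[str] = None,
                      select: Optional[list[str]] = None,
                      order_by: Optional[str] = None,
                      full_lucene: bool = False) -> dict:
    """Assemble a search request body with pagination, filter, and projection."""
    body = {
        "search": search_text,
        "queryType": "full" if full_lucene else "simple",
        "top": page_size,
        "skip": (page - 1) * page_size,  # page 3 of size 10 skips 20 results
    }
    if filter_expr:
        body["filter"] = filter_expr
    if select:
        body["select"] = ",".join(select)
    if order_by:
        body["orderby"] = order_by
    return body


body = build_search_body(
    "azur~",                       # fuzzy term, so full Lucene syntax is needed
    page=3,
    filter_expr="Category eq 'Finance' and Year gt 2020",
    select=["title", "summary"],
    order_by="Year desc",
    full_lucene=True,
)
print(body)
```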

Search Techniques Compared

| Technique | How It Works | Best For |
| --- | --- | --- |
| Keyword Search | BM25-ranked exact/fuzzy text matching | Precise term lookups |
| Vector Search | Embeddings + HNSW algorithm; finds semantically similar docs | Conceptual similarity, paraphrases |
| Hybrid Search | Merges the keyword and vector result rankings using RRF (Reciprocal Rank Fusion) | Best overall recall |
| Semantic Ranking | Language-model-based L2 re-ranker applied after retrieval | Surfacing the single best answer |
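Reciprocal Rank Fusion, the merge step behind hybrid search, is simple enough to sketch directly: each document earns 1 / (k + rank) from every result list it appears in, and the sums decide the final order. This is a generic illustration of the technique, not the service's internal code; the constant k = 60 is the commonly cited default and is an assumption here, as are the document IDs.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_results = ["doc_a", "doc_b", "doc_c"]   # e.g. from BM25
vector_results = ["doc_c", "doc_a", "doc_d"]    # e.g. from vector similarity
fused = reciprocal_rank_fusion([keyword_results, vector_results])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Because only ranks are used, RRF needs no normalization between BM25 scores and vector similarity scores, which live on different scales.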

Semantic Ranking vs Vector Search

Vector search finds semantically similar documents using embeddings; it is a retrieval technique. Semantic ranking re-ranks results that have already been retrieved, using a language model to surface the single best answer; it is a post-retrieval step.

They are not the same. The exam tests this distinction directly.

Integrated Vectorization

AI Search can generate embeddings automatically during indexing by running built-in skills (for example, text chunking followed by an embedding skill) in a skillset, so your pipeline code needs no separate embedding step.

HNSW Algorithm

Hierarchical Navigable Small World: the approximate nearest-neighbor algorithm used internally by AI Search for vector similarity. The exam may reference it as the algorithm behind vector search.
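To see what "approximate nearest neighbor" is approximating, the sketch below runs the exact, exhaustive version: score every stored vector against the query and keep the top k. This is deliberately not HNSW. HNSW aims to return nearly the same neighbors while visiting only a small part of the collection through a layered graph. The two-dimensional vectors and document IDs are toy values.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exhaustive_knn(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[str]:
    """Exact k-nearest neighbors: compares the query against every vector."""
    ranked = sorted(
        vectors,
        key=lambda doc_id: cosine_similarity(query, vectors[doc_id]),
        reverse=True,
    )
    return ranked[:k]


vectors = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.8, 0.6],
    "doc_c": [0.0, 1.0],
}
nearest = exhaustive_knn([0.9, 0.1], vectors)
print(nearest)  # ['doc_a', 'doc_b']
```

The exhaustive scan costs one comparison per stored vector, which is the cost an approximate index like HNSW is designed to avoid at scale.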

