
Domain 4: Implement computer vision solutions (10-15%)

โ† Domain 3 ยท Domain 5 โ†’


This domain covers analyzing images and videos using Azure AI Vision, Custom Vision, Video Indexer, Spatial Analysis, and the Face API.

4.1 Image Analysis 4.0

Core Features

| Feature | What It Returns | Exam Signal |
|---|---|---|
| Captioning | One sentence describing the whole image | "describe the image" |
| Dense Captioning | Captions for multiple regions/objects in the image | "describe each object in the image" |
| Tagging | List of objects, scenes, and actions detected | "identify what is in the image" |
| Smart Cropping | Thumbnail coordinates that preserve the important region | "generate thumbnail keeping subject in frame" |
| People Detection | Bounding boxes around people (no identification) | "count people in the image" |
| Background Removal | Separates foreground subject from background | "remove background from product photo" |
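Several of these features can be requested in one Image Analysis 4.0 REST call via a comma-separated `features` query parameter. A minimal URL-building sketch; the resource name is a placeholder and the API version shown is an assumption (check your resource for the current one):

```python
from urllib.parse import urlencode

# Placeholder -- substitute your own Azure AI Vision resource endpoint.
ENDPOINT = "https://<resource>.cognitiveservices.azure.com"

def build_analyze_url(features, api_version="2023-10-01"):
    """Compose an Image Analysis 4.0 request URL.

    Multiple features are combined in a single call through the
    comma-separated `features` query parameter.
    """
    query = urlencode({
        "api-version": api_version,
        "features": ",".join(features),
    })
    return f"{ENDPOINT}/computervision/imageanalysis:analyze?{query}"

url = build_analyze_url(["caption", "denseCaptions", "tags", "people"])
print(url)
```

The image itself goes in the POST body (binary or a JSON `{"url": ...}`), with the key in the `Ocp-Apim-Subscription-Key` header.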

OCR: Read API (async)

The Read API is the primary OCR tool for dense text, handwriting, and multi-page documents.

Async pattern (same as Document Translation and batch operations):

```
POST /vision/v3.2/read/analyze
  -> 202 Accepted + Operation-Location header

GET <Operation-Location URL>
  -> poll until { "status": "succeeded" }
  -> read "analyzeResult.readResults" in the response body
```

Async OCR Pattern

The initial POST returns 202 Accepted; its body does not contain the extracted text. You must GET the Operation-Location URL and poll until the status is "succeeded". A common trap: trying to parse the extracted text out of the initial POST response.
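The polling loop can be sketched as follows. The HTTP GET is injected as a callable so the pattern stands on its own without an HTTP library, and the response shape assumed here is the v3.2 Read result (`analyzeResult.readResults`):

```python
import time

def poll_read_result(operation_url, get, interval=1.0, max_tries=30):
    """Poll the Operation-Location URL until the Read operation finishes.

    `get` is any callable that performs an HTTP GET on a URL and returns
    the decoded JSON body; injecting it keeps the pattern testable offline.
    """
    for _ in range(max_tries):
        body = get(operation_url)
        status = body.get("status")
        if status == "succeeded":
            # The extracted text lives here, not in the initial POST response.
            return body["analyzeResult"]["readResults"]
        if status == "failed":
            raise RuntimeError("Read operation failed")
        time.sleep(interval)  # status is still "notStarted" or "running"
    raise TimeoutError("Read operation did not complete in time")
```

With the `requests` library, `get` would be something like `lambda url: requests.get(url, headers={"Ocp-Apim-Subscription-Key": key}).json()`.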


4.2 Custom Vision

Use Custom Vision when the pre-built Image Analysis features are not specific enough for your domain.

Model Types

| Type | Output | Use Case |
|---|---|---|
| Classification (Multiclass) | One tag per image | "Is this a cat, dog, or bird?" |
| Classification (Multilabel) | Multiple tags per image | "What objects are in this image?" |
| Object Detection | Tag + bounding box coordinates | "Where is the defect on the product?" |

Exam Signal

"Locate objects with bounding boxes" โ†’ Custom Vision โ€” Object Detection (not Image Analysis tagging, which doesn't return coordinates).

Training Workflow

Create project → Upload images → Tag images → Train → Evaluate → Publish → Test endpoint

Evaluation Metrics

| Metric | Definition | For |
|---|---|---|
| Precision | Of all predicted tags, how many were correct? | Classification |
| Recall | Of all actual tags, how many did the model find? | Classification |
| mAP (mean Average Precision) | Overall detection accuracy across all classes | Object Detection |
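Precision and recall are easy to compute by hand for a single multilabel prediction; a small illustrative sketch (the tag sets are made-up examples):

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for one image's predicted tag set.

    precision = correct predictions / all predictions
    recall    = correct predictions / all actual tags
    """
    predicted, actual = set(predicted), set(actual)
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall

# Model predicts 3 tags, 2 are right; the image really has 4 tags.
p, r = precision_recall({"cat", "dog", "car"}, {"cat", "dog", "tree", "grass"})
print(p, r)  # precision = 2/3, recall = 2/4 = 0.5
```

Note the trade-off: predicting more tags tends to raise recall and lower precision, which is why the Custom Vision portal lets you move the probability threshold and watch both metrics shift.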

4.3 Video Indexer

Extracts deep insights from video files or live streams without writing model training code.

Key Capabilities

| Insight Type | Description |
|---|---|
| Facial Recognition | Identifies and groups people across the video timeline |
| Topic Inference | High-level themes extracted from transcript + visuals |
| Brand Detection | Recognizes company logos and brand names in frames |
| OCR in Video | Extracts text from on-screen displays and captions |
| Sentiment Analysis | Sentiment shifts across video segments |
| Scene Segmentation | Splits video into scenes based on content changes |
| Audio Transcription | Full transcript with speaker diarization |

Video Indexer vs Spatial Analysis

Video Indexer = deep semantic insights from pre-recorded video (topics, faces, brands, transcripts); cloud-based, no real-time requirement. Spatial Analysis = real-time movement and presence detection in a live video feed; edge container, measures people count, distance, dwell time.


4.4 Spatial Analysis

Runs on edge devices via Docker containers, analyzing a live camera feed without sending raw video to the cloud.

Key Measurements

| Metric | Description |
|---|---|
| People Counting | Count of people present in a defined zone at any moment |
| Zone Entry/Exit | Detect when a person enters or leaves a zone |
| Distance Monitoring | Measure proximity between people (e.g., social distancing) |
| Dwell Time | How long a person stays within a zone |

Spatial Analysis Container

Spatial Analysis runs as an IoT Edge module or Docker container on an NVIDIA GPU-enabled device. It reports events to Azure IoT Hub or Event Hubs, not as direct HTTP responses.


4.5 Face API

Core Capabilities

| Capability | Operation | Exam Trigger |
|---|---|---|
| Detection | Locate faces + return attributes (blur, exposure, age estimate, occlusion) | "detect faces in an image" |
| Verification | 1:1: are these two face images the same person? | "compare two faces", "1:1 match" |
| Identification | 1:N: who is this person from a known group? | "identify from employee database", "1:N match" |
| Find Similar | 1:N: find faces that look similar (no identity required) | "find similar faces" |

PersonGroup vs FaceList

| Structure | Used For | Training Required? |
|---|---|---|
| PersonGroup / LargePersonGroup | Identification (1:N, "who is this?") | Yes, must call Train after adding faces |
| FaceList / LargeFaceList | Find Similar (1:N, "what faces look like this?") | No, add faces and query directly |

PersonGroup vs FaceList

PersonGroup = structured by people: each person has an ID and multiple face images. Used for Identification. Must be trained after adding faces. FaceList = unstructured list of face images with no person concept. Used for Find Similar. No training step.

The exam tests this: "identify from known employees" → PersonGroup + Identify. "find faces that look like this image" → FaceList + FindSimilar.

PersonGroup Workflow

Create PersonGroup → Add Person objects → Add face images to each Person
  → Train PersonGroup → Identify a new face against the group
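As a sketch, the steps above map onto Face REST API v1.0 routes roughly like this. The resource name is a placeholder, and `<personId>`/`<faceId>` stand in for IDs the service returns at runtime; the function only builds the call plan rather than sending requests:

```python
# Placeholder -- substitute your own Azure AI Face resource endpoint.
ENDPOINT = "https://<resource>.cognitiveservices.azure.com/face/v1.0"

def persongroup_workflow(group_id, people):
    """Return the ordered (method, url, body) calls for the PersonGroup
    workflow: create group -> add persons -> add faces -> train -> identify.

    `people` maps a person's name to a list of face-image URLs.
    """
    calls = [("PUT", f"{ENDPOINT}/persongroups/{group_id}",
              {"name": group_id})]
    for name, face_urls in people.items():
        calls.append(("POST", f"{ENDPOINT}/persongroups/{group_id}/persons",
                      {"name": name}))
        for face_url in face_urls:
            # <personId> comes back from the create-person call at runtime.
            calls.append(("POST",
                          f"{ENDPOINT}/persongroups/{group_id}"
                          f"/persons/<personId>/persistedFaces",
                          {"url": face_url}))
    # Training is mandatory before Identify will work.
    calls.append(("POST", f"{ENDPOINT}/persongroups/{group_id}/train", None))
    # <faceId> comes from a prior Detect call on the probe image.
    calls.append(("POST", f"{ENDPOINT}/identify",
                  {"personGroupId": group_id, "faceIds": ["<faceId>"]}))
    return calls

for method, url, body in persongroup_workflow(
        "employees", {"alex": ["https://example.com/alex1.jpg"]}):
    print(method, url)
```

The ordering is the exam point: Train sits between adding faces and Identify, and Identify takes a faceId produced by Detect, not a raw image.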

Limited Access

Many Face API capabilities (identification, verification, emotion detection) require Limited Access approval from Microsoft due to privacy and responsible AI policies. The exam may mention this constraint.



โ† Domain 3 ยท Domain 5 โ†’
