# Domain 4: Implement computer vision solutions (10-15%)
This domain covers analyzing images and videos using Azure AI Vision, Custom Vision, Video Indexer, Spatial Analysis, and the Face API.
## 4.1 Image Analysis 4.0

### Core Features
| Feature | What It Returns | Exam Signal |
|---|---|---|
| Captioning | One sentence describing the whole image | "describe the image" |
| Dense Captioning | Captions for multiple regions/objects in the image | "describe each object in the image" |
| Tagging | List of objects, scenes, and actions detected | "identify what is in the image" |
| Smart Cropping | Thumbnail coordinates that preserve the important region | "generate thumbnail keeping subject in frame" |
| People Detection | Bounding boxes around people (no identification) | "count people in the image" |
| Background Removal | Separates foreground subject from background | "remove background from product photo" |
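The features above come back as named sections of one JSON response. A minimal sketch of pulling out the caption and high-confidence tags, assuming the `captionResult`/`tagsResult` field names from the Image Analysis 4.0 REST reference (the sample dict below is illustrative, not a real API response):

```python
def summarize_analysis(result: dict, min_confidence: float = 0.5):
    """Extract the caption and high-confidence tags from an
    Image Analysis 4.0 response body (field names assumed from
    the REST reference: captionResult / tagsResult)."""
    caption = result.get("captionResult", {}).get("text")
    tags = [
        t["name"]
        for t in result.get("tagsResult", {}).get("values", [])
        if t["confidence"] >= min_confidence
    ]
    return caption, tags
```

Filtering on the per-tag confidence score is a common step, since Tagging returns everything it detects, including low-confidence guesses.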
### OCR – Read API (async)
The Read API is the primary OCR tool for dense text, handwriting, and multi-page documents.
Async pattern (the same pattern used by Document Translation and batch operations):

```text
POST /vision/v3.2/read/analyze
  → 202 Accepted + Operation-Location header
GET <Operation-Location URL>
  → poll until { "status": "succeeded" }
  → read "analyzeResult" in the response body
```

**Async OCR Pattern:** The initial POST returns 202 Accepted; it does not contain the extracted text. You must GET the Operation-Location URL and poll until `status` is `succeeded`. Many candidates mistakenly try to parse the initial POST response.
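The polling loop can be sketched as a small helper. Here `fetch_status` stands in for an HTTP GET on the Operation-Location URL (injected so the pattern is testable without a live endpoint), and the `analyzeResult`/`readResults` result shape is the one documented for the v3.2 Read API:

```python
import time

def poll_read_result(fetch_status, interval=1.0, max_attempts=30):
    """Poll an async Read operation until it completes.

    fetch_status: callable returning the JSON body of a GET on the
    Operation-Location URL. Statuses cycle through notStarted /
    running before reaching succeeded or failed.
    """
    for _ in range(max_attempts):
        body = fetch_status()
        status = body.get("status")
        if status == "succeeded":
            # Extracted text lives under analyzeResult -> readResults
            return body["analyzeResult"]
        if status == "failed":
            raise RuntimeError("Read operation failed")
        time.sleep(interval)
    raise TimeoutError("Read operation did not complete in time")
```

Injecting the fetch callable rather than hard-coding an HTTP client keeps the retry logic separate from transport concerns.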
## 4.2 Custom Vision
Use Custom Vision when the pre-built Image Analysis features are not specific enough for your domain.
### Model Types
| Type | Output | Use Case |
|---|---|---|
| Classification – Multiclass | One tag per image | "Is this a cat, dog, or bird?" |
| Classification – Multilabel | Multiple tags per image | "What objects are in this image?" |
| Object Detection | Tag + bounding box coordinates | "Where is the defect on the product?" |
**Exam Signal:** "Locate objects with bounding boxes" → Custom Vision Object Detection (not Image Analysis tagging, which doesn't return coordinates).
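A Custom Vision object detection prediction returns a `predictions` array where each entry carries a `probability`, a `tagName`, and a normalized `boundingBox`. A minimal sketch of filtering detections by confidence (the sample response below is illustrative):

```python
def detected_objects(prediction: dict, threshold: float = 0.5):
    """Return (tagName, boundingBox) pairs for detections above the
    probability threshold, from a Custom Vision object detection
    prediction response."""
    return [
        (p["tagName"], p["boundingBox"])
        for p in prediction.get("predictions", [])
        if p["probability"] >= threshold
    ]
```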
### Training Workflow

Create project → Upload images → Tag images → Train → Evaluate → Publish → Test endpoint

### Evaluation Metrics
| Metric | Definition | For |
|---|---|---|
| Precision | Of all predicted tags, how many were correct? | Classification |
| Recall | Of all actual tags, how many did the model find? | Classification |
| mAP (mean Average Precision) | Overall detection accuracy across all classes | Object Detection |
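Precision and recall are straightforward to compute from true positive, false positive, and false negative counts, which is how the exam typically frames them:

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision = TP / (TP + FP): of all predicted tags, how many
    were correct. Recall = TP / (TP + FN): of all actual tags, how
    many the model found."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

A model that predicts very conservatively tends toward high precision and low recall; one that tags everything tends the other way.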
## 4.3 Video Indexer
Extracts deep insights from video files or live streams without writing model training code.
### Key Capabilities
| Insight Type | Description |
|---|---|
| Facial Recognition | Identifies and groups people across the video timeline |
| Topic Inference | High-level themes extracted from transcript + visuals |
| Brand Detection | Recognizes company logos and brand names in frames |
| OCR in Video | Extracts text from on-screen displays and captions |
| Sentiment Analysis | Sentiment shifts across video segments |
| Scene Segmentation | Splits video into scenes based on content changes |
| Audio Transcription | Full transcript with speaker diarization |
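These insights arrive in one large index JSON. As a sketch, joining the transcript segments into plain text, assuming the `videos[0].insights.transcript` layout from the Video Indexer API (the nested shape here is an assumption; a real index result carries many more fields):

```python
def transcript_text(index_result: dict) -> str:
    """Join transcript segment text from a Video Indexer index
    result. Assumed shape: videos[0].insights.transcript is a list
    of segments, each with a 'text' field."""
    video = index_result.get("videos", [{}])[0]
    segments = video.get("insights", {}).get("transcript", [])
    return " ".join(s["text"] for s in segments)
```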
**Video Indexer vs Spatial Analysis:** Video Indexer = deep semantic insights from pre-recorded video (topics, faces, brands, transcripts); cloud-based, no real-time requirement. Spatial Analysis = real-time movement and presence detection in a live video feed; edge container that measures people count, distance, and dwell time.
## 4.4 Spatial Analysis
Runs on edge devices via Docker containers, analyzing a live camera feed without sending raw video to the cloud.
### Key Measurements
| Metric | Description |
|---|---|
| People Counting | Count of people present in a defined zone at any moment |
| Zone Entry/Exit | Detect when a person enters or leaves a zone |
| Distance Monitoring | Measure proximity between people (e.g., social distancing) |
| Dwell Time | How long a person stays within a zone |
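Dwell time falls out of pairing zone entry and exit events per person. A simplified, hypothetical stand-in for processing the enter/exit events Spatial Analysis emits to IoT Hub (the `(person_id, event, timestamp)` tuple format is an assumption for illustration, not the real event schema):

```python
def dwell_times(events):
    """Compute per-person dwell time (seconds) in a zone from a
    stream of (person_id, event, timestamp) tuples, where event is
    'enter' or 'exit'. Exits without a matching enter are ignored."""
    entered = {}   # person_id -> timestamp of last enter
    totals = {}    # person_id -> accumulated seconds in zone
    for person, event, ts in events:
        if event == "enter":
            entered[person] = ts
        elif event == "exit" and person in entered:
            totals[person] = totals.get(person, 0) + (ts - entered.pop(person))
    return totals
```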
**Spatial Analysis Container:** Spatial Analysis runs as an IoT Edge module or Docker container on an NVIDIA GPU-enabled device. It reports events to Azure IoT Hub or Event Hubs, not as direct HTTP responses.
## 4.5 Face API

### Core Capabilities
| Capability | Operation | Exam Trigger |
|---|---|---|
| Detection | Locate faces + return attributes (blur, exposure, age estimate, occlusion) | "detect faces in an image" |
| Verification | 1:1 – are these two face images the same person? | "compare two faces", "1:1 match" |
| Identification | 1:N – who is this person from a known group? | "identify from employee database", "1:N match" |
| Find Similar | 1:N – find faces that look similar (no identity required) | "find similar faces" |
### PersonGroup vs FaceList
| Structure | Used For | Training Required? |
|---|---|---|
| PersonGroup / LargePersonGroup | Identification (1:N – "who is this?") | Yes – must call Train after adding faces |
| FaceList / LargeFaceList | Find Similar (1:N – "what faces look like this?") | No – add faces and query directly |
**PersonGroup vs FaceList:** PersonGroup = structured by people; each person has an ID and multiple face images. Used for Identification; must be trained after adding faces. FaceList = unstructured list of face images with no person concept. Used for Find Similar; no training step.

The exam tests this distinction: "identify from known employees" → PersonGroup + Identify; "find faces that look like this image" → FaceList + FindSimilar.
### PersonGroup Workflow

Create PersonGroup → Add Person objects → Add face images to each Person → Train PersonGroup → Identify a new face against the group

**Limited Access:** Many Face API capabilities (identification, verification, emotion detection) require Limited Access approval from Microsoft due to privacy and responsible AI policies. The exam may mention this constraint.
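The Identify call returns, for each detected face, a list of candidate persons with confidence scores. A sketch of picking the best match per face, assuming the `faceId`/`candidates` response layout documented for the Face API (candidates sorted by confidence, the sample data below is illustrative):

```python
def best_candidates(identify_results, threshold: float = 0.7):
    """Map each detected faceId to its top candidate personId from a
    Face API Identify response, keeping only matches whose confidence
    meets the threshold."""
    matches = {}
    for result in identify_results:
        candidates = result.get("candidates", [])
        if candidates and candidates[0]["confidence"] >= threshold:
            matches[result["faceId"]] = candidates[0]["personId"]
    return matches
```

Choosing the confidence threshold is an application decision: higher values reduce false identifications at the cost of more "unknown" results.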