What is a Content Understanding analyzer?

An analyzer in Azure Content Understanding in Foundry Tools is a configurable processing unit that defines how your content should be analyzed and what information should be extracted. Think of an analyzer as a recipe that tells the service:

  • What type of content to process (documents, images, audio, or video)
  • What elements to extract (text, layout, tables, fields, transcripts)
  • How to structure the output (markdown, JSON fields, segments)
  • Which AI models to use for processing

Analyzers are the core building blocks of Content Understanding. They combine content extraction, AI-powered analysis, and structured data output into a single, reusable configuration. You can use prebuilt analyzers for common scenarios or create custom analyzers tailored to your specific needs.

Analyzer types

Content Understanding provides several types of analyzers:

  • Base analyzers: Foundational analyzers that provide core processing capabilities for each content type (prebuilt-document, prebuilt-audio, prebuilt-video, prebuilt-image). These analyzers are typically used as building blocks for custom analyzers.
  • RAG analyzers: Optimized for retrieval-augmented generation scenarios, extracting content with semantic understanding for search and AI applications (for example, prebuilt-documentSearch, prebuilt-videoSearch).
  • Domain-specific analyzers: Preconfigured for specific document types and industries, like invoices, receipts, ID documents, and contracts (for example, prebuilt-invoice, prebuilt-receipt, prebuilt-idDocument).
  • Custom analyzers: Analyzers you create by extending base analyzers with custom field schemas and configurations to meet your specific requirements.

For more information and a complete list of available domain-specific analyzers, see Prebuilt analyzers.

Analyzer configuration structure

An analyzer configuration is defined as a JSON object with several top-level properties: analyzer properties (identifiers and descriptions), model configuration, processing configuration, and field configuration. The sections that follow describe each of these components.

Here's a condensed example showing the overall structure of an analyzer configuration:

{
  "analyzerId": "my-custom-invoice-analyzer",
  "description": "Extracts vendor information, line items, and totals from commercial invoices",
  "baseAnalyzerId": "prebuilt-document",
  "config": {
    ...
    "enableOcr": true
    ...
  },
  "fieldSchema": {...}
    }
  },
  "models": {
    "completion": "gpt-4.1",
    "embedding": "text-embedding-3-large"
  }
}

Analyzer properties

These properties uniquely identify and describe your analyzer:

analyzerId

  • Description: Unique identifier for the analyzer. This identifier is how you reference the analyzer in API calls.
  • Example: "prebuilt-invoice", "my-custom-analyzer"
  • Guidelines:
    • Use descriptive names that indicate the analyzer's purpose
    • For custom analyzers, choose names that don't conflict with prebuilt analyzer names
    • Use lowercase with hyphens for consistency

name

  • Description: Human-readable display name shown in user interfaces and documentation
  • Example: "Invoice document understanding", "Custom receipt processor"

description

  • Description: Brief explanation of what the analyzer does and what content it processes. This description is used as context by the AI model during field extraction, so clear descriptions improve extraction accuracy.
  • Example: "Analyzes invoice documents to extract line items, totals, vendor information, and payment terms"
  • Guidelines:
    • Be specific about what the analyzer extracts
    • Mention the content types it supports
    • Keep it concise but informative
    • Write clear descriptions as they guide the AI model's understanding

baseAnalyzerId

  • Description: References a parent analyzer from which this analyzer inherits configuration
  • Supported base analyzers:
    • "prebuilt-document" - for document-based custom analyzers
    • "prebuilt-audio" - for audio-based custom analyzers
    • "prebuilt-video" - for video-based custom analyzers
    • "prebuilt-image" - for image-based custom analyzers
  • Example: "baseAnalyzerId": "prebuilt-document"

Note

When you specify a base analyzer, your custom analyzer inherits all default configurations and can override specific settings.
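
For example, the following minimal sketch (the analyzer ID is illustrative) inherits everything from prebuilt-document and overrides a single setting:

{
  "analyzerId": "my-layout-focused-analyzer",
  "baseAnalyzerId": "prebuilt-document",
  "config": {
    "enableFormula": false  // override one inherited default; all other settings stay inherited
  }
}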

Model configuration

models

  • Description: Specifies which Foundry model names to use when processing with this analyzer. These are the model names (not deployment names) that the service uses, and they must match one of the supportedModels from the base analyzer. For the full list of models supported by Content Understanding, see supported models.
  • Properties:
    • completion - Model name for completion tasks (field extraction, segmentation, figure analysis, and so on)
    • embedding - Model name for embedding tasks (for example, when using a knowledge base)
  • Important: These are model names from the Foundry catalog, not deployment names. At runtime, the service maps these model names to the actual model deployments you configure at the resource level.
  • Example:
    {
      "completion": "gpt-4o",
      "embedding": "text-embedding-3-large"
    }
    

See Connect your Content Understanding resource with Foundry models for more details on how to configure connected models.

Processing configuration

The config object contains all processing options that control how content is analyzed. These options are divided into categories based on functionality:

Config object properties

General options

returnDetails
  • Default: false (varies by analyzer)
  • Description: Controls whether to include detailed information in the response (confidence scores, bounding boxes, text spans, metadata)
  • When to use:
    • Set to true when debugging extraction issues
    • When you need location information for extracted data
    • When confidence scores are required for validation
    • For quality assurance and testing
  • Impact on response: Significantly increases response size with more metadata

Document content extraction options

enableOcr
  • Default: true
  • Description: Enables Optical Character Recognition to extract text from images and scanned documents
  • When to use:
    • Enable for scanned documents, photos, and image-based PDFs
    • Disable for native digital PDFs to improve performance
  • Supported by: Document analyzers
enableLayout
  • Default: true
  • Description: Extracts layout information including paragraphs, lines, words, reading order, and structural elements
  • When to use:
    • Required for understanding document structure and hierarchy
    • Needed for accurate paragraph and section extraction
    • Disable if only raw text extraction is needed
  • Supported by: Document-based analyzers
enableFormula
  • Default: true
  • Description: Detects and extracts mathematical formulas and equations in LaTeX format
  • When to use:
    • Enable for scientific papers, research documents, technical documentation
    • Disable for general business documents to improve performance
  • Supported by: Document-based analyzers
enableBarcode
  • Default: true
  • Description: Detects and extracts barcodes and QR codes, returning the decoded values
  • When to use:
    • Enable for inventory documents, shipping labels, product documentation
    • Disable when barcodes aren't present to improve performance
  • Supported by: Document-based analyzers
  • Supported barcode types: QR Code, PDF417, UPC-A, UPC-E, Code 39, Code 128, EAN-8, EAN-13, DataBar, Code 93, Codabar, ITF, Micro QR Code, Aztec, Data Matrix, MaxiCode
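
As an example of combining the options above, a configuration tuned for scanned business documents might keep OCR and layout enabled while turning off formula and barcode detection for performance (a sketch; adjust the flags to your content):

{
  "config": {
    "enableOcr": true,
    "enableLayout": true,
    "enableFormula": false,
    "enableBarcode": false
  }
}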

Table and chart options

tableFormat
  • Default: "html"
  • Supported values: "html", "markdown"
  • Description: Specifies the output format for extracted tables
  • When to use:
    • Use "html" for web rendering or when complex table structures need preservation
    • Use "markdown" for simple tables in documentation or text-based processing
  • Supported by: Document-based analyzers
chartFormat
  • Default: "chartjs"
  • Supported values: "chartjs"
  • Description: Specifies the format for extracted chart and graph data (compatible with Chart.js library)
  • When to use:
    • When extracting data from bar charts, line graphs, pie charts
    • Converting visual charts to structured data for re-rendering
  • Supported by: Document-based analyzers

Figure and image analysis options

enableFigureDescription
  • Default: false
  • Description: Generates natural language text descriptions for figures, diagrams, images, and illustrations
  • When to use:
    • For accessibility requirements (alt text generation)
    • Understanding diagrams and flowcharts
    • Extracting insights from infographics
  • Supported by: Document-based analyzers
enableFigureAnalysis
  • Default: false
  • Description: Performs deeper analysis of figures including chart data extraction and diagram component identification
  • When to use:
    • Extracting structured data from charts embedded in documents
    • Understanding complex diagrams
    • Detailed figure classification
  • Supported by: Document-based analyzers
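
For example, here's a sketch that enables both figure options, for generating alt text and pulling structured chart data out of documents:

{
  "config": {
    "enableFigureDescription": true,
    "enableFigureAnalysis": true
  }
}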

Annotation options

annotationFormat
  • Default: "markdown"
  • Supported values: "markdown"
  • Description: Specifies the format for returned annotations
  • Supported by: Document-based analyzers

Field extraction options

estimateFieldSourceAndConfidence
  • Default: false (varies by analyzer)
  • Description: Returns source location (page number, bounding box) and confidence score for each extracted field value.
  • When to use:
    • Validation and quality assurance workflows
    • Understanding extraction accuracy
    • Debugging extraction issues
    • Highlighting source text in user interfaces
  • Supported by: Document analyzers (invoice, receipt, ID documents, tax forms)

Audio and video options

locales
  • Default: [] (empty array)
  • Description: List of locale/language codes for language-specific processing (primarily for transcription)
  • Supported values: BCP-47 language codes (for example, ["en-US", "es-ES", "fr-FR", "de-DE"])
  • When to use:
    • Multi-language audio transcription
    • Specifying expected language for better accuracy
    • Processing content in specific regional variants
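  • Example - Bilingual transcription (a sketch; the locales are illustrative):
    {
      "config": {
        "locales": ["en-US", "es-ES"]
      }
    }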
  • Supported by: prebuilt-audio, prebuilt-video, prebuilt-callCenter

Note

For a complete list of supported languages and locales, see Language and region support.

disableFaceBlurring
  • Default: false
  • Description: Controls whether faces in images and videos should be blurred for privacy protection
  • When to use:
    • Set to true when face visibility is required for analysis
    • Set to false when de-identification of individuals in shared content is desired
  • Supported by: prebuilt-image, prebuilt-video

Important

The Face capabilities feature in Content Understanding is a Limited Access service, and registration is required for access. The face grouping and identification features in Content Understanding are limited based on eligibility and usage criteria. The Face service is available only to Microsoft managed customers and partners. Use the Face Recognition intake form to apply for access. For more information, see the Responsible AI investments and safeguards for facial recognition.

Classification options

contentCategories
  • Default: Not set
  • Description: Defines categories or content types for automatic classification and routing to specialized handlers. When used with enableSegment set to false is currently only supported for documents. It classifies the entire file. When used with enableSegment=true, the file is broken into chunks based on these categories, with each segment classified and optionally processed by a category-specific analyzer. Always selects a single option from the list of available categories.
  • Structure: Each category contains:
    • description - (Required) Detailed description of the category/document type. This description acts as a prompt that guides the AI model in determining segment boundaries and classification. Include distinguishing characteristics to help identify where one category ends and another begins.
    • analyzerId - (Optional) Reference to another analyzer to use for this category. The referenced analyzer is linked, not copied, ensuring consistent behavior. If omitted, only categorization is performed without more processing (split-only scenario).
  • Model usage: The models specified in the parent analyzer's models property are used only for segmentation and classification. Each subanalyzer uses its own model configuration for extraction.
  • Behavior with enableSegment:
    • enableSegment: true: Content is split into segments based on the category descriptions. Each segment is classified into one of the defined categories. Returns segment metadata in the original content object, plus more content objects for segments with analyzerId specified.
    • enableSegment: false: The entire content is classified as a whole into one category and routed accordingly. Useful for hierarchical classification without splitting.
  • Category matching: If an "other" or "default" category isn't defined, content is forced to classify into one of the listed categories. Include an "other" category to handle unmatched content gracefully.
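  • Example - Category definitions (a sketch; the category names and descriptions are illustrative):
    {
      "config": {
        "enableSegment": true,
        "contentCategories": {
          "invoice": {
            "description": "Commercial invoice with vendor details, line items, and an amount due",
            "analyzerId": "prebuilt-invoice"
          },
          "receipt": {
            "description": "Point-of-sale receipt listing purchased items and payment",
            "analyzerId": "prebuilt-receipt"
          },
          "other": {
            "description": "Any content that is not an invoice or a receipt"
          }  // no analyzerId: categorize only, don't process further
        }
      }
    }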
  • Supported by: Document and video analyzers. For video, you can define only one content category.
enableSegment
  • Default: false
  • Description: Enables content segmentation, breaking the file into chunks based on the categories specified in contentCategories. Each segment is then classified into one of the defined categories for selective processing.
  • Segmentation behavior: The service divides content into logical units by analyzing the content against the category descriptions. Segment boundaries are determined using:
    • Documents: Category descriptions combined with content structure (pages, sections, formatting changes)
    • Videos: Category descriptions combined with visual cues (shot changes, scene transitions, temporal boundaries). Only one contentCategory is supported.
  • When to use:
    • Processing mixed-content batches where different parts need different handling (for example, a PDF containing both invoices and receipts)
    • Splitting long documents into categorized chunks for selective analysis
    • Analyzing videos by content type (for example, separate ads from main content)
  • Output structure:
    • Returns a segments array in the content object containing metadata for each segment (ID, boundaries, category)
    • Each segment includes its classified category from contentCategories
    • More content objects are returned for segments with category analyzerId specified
  • Hierarchical segmentation: If a category's analyzer also has enableSegment: true, segments can be recursively split, enabling multi-level content breakdown
  • Performance impact: Increases processing time for large files, especially with many segments
  • Supported by: Document and video analyzers
segmentPerPage
  • Default: false
  • Description: When segmentation is enabled, forces one segment per page instead of using logical content boundaries. This option replaces the need for separate "perPage" split modes.
  • When to use:
    • Page-by-page processing workflows
    • Each page should be treated as an independent unit
    • Parallel processing of individual pages
    • Page-level field extraction in multi-page documents
    • Mixed document batches where each page is a different document type
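  • Example - One segment per page (a minimal sketch; add contentCategories if each page should also be classified):
    {
      "config": {
        "enableSegment": true,
        "segmentPerPage": true
      }
    }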
  • Supported by: Document-based analyzers
omitContent
  • Default: false
  • Description: When true, excludes the original content object from the response, returning only structured field data or content objects from subanalyzers (when using contentCategories)
  • When to use:
    • When you only need extracted field values
    • In composed analyzers with contentCategories to return only categorized results
    • In hierarchical classification chains, to return only leaf analyzer results
  • Example - Selective analysis:
    {
      "config": {
        "enableSegment": true,
        "contentCategories": {
          "invoice": { "analyzerId": "prebuilt-invoice" },
          "other": { }  // Categorize but don't process
        },
        "omitContent": true  // Only return invoice analysis results
      }
    }
    
  • Supported by: Document analyzers

Field configuration

The fieldSchema property defines what structured data your analyzer extracts from content. It specifies the fields, their types, and how they should be extracted.

Design intent: Structured extraction

Field schemas transform unstructured content into structured, queryable data. The schema acts as both:

  • A contract defining what data is extracted
  • A guide for the AI model on what to look for and how to interpret it

Field schema structure

{
  "fieldSchema": {
    "name": "InvoiceAnalysis",
    "fields": {
      "VendorName": {
        "type": "string",
        "description": "Name of the vendor or supplier",
        "method": "extract"
      },
      "InvoiceTotal": {
        "type": "number",
        "description": "Total amount due on the invoice",
        "method": "extract"
      },
      "LineItems": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "Description": { "type": "string" },
            "Quantity": { "type": "number" },
            "UnitPrice": { "type": "number" },
            "Amount": { "type": "number" }
          }
        },
        "description": "List of items on the invoice, typically in a table format",
        "method": "generative"
      }
    }
  }
}

Field schema properties

name

  • Description: Name of the schema, typically describing the content type or use case
  • Example: "InvoiceAnalysis", "ReceiptExtraction", "ContractFields"

fields

  • Description: Object defining each field to extract, with field names as keys. Empty object {} indicates no structured fields are extracted (for example, layout-only analyzers).
  • Hierarchical support: Supports nested fields through object and array types for representing complex data structures
  • Best practice: Avoid deep nesting (more than 2-3 levels) as it can reduce performance and extraction accuracy

Field definition properties

Each field in the fields object has the following properties:

type

  • Supported values: "string", "number", "boolean", "date", "object", "array"
  • Description: Data type of the field value. Choose the type that best matches your data semantics for optimal extraction.

description

  • Description: Clear explanation of what the field contains and where to find it. This description is processed by the AI model as a mini-prompt to guide field extraction, so specificity and clarity directly improve extraction accuracy.

For information about writing effective field descriptions, see Best practices for field extraction.

method

  • Supported values: "generate", "extract", "classify"
  • Description: Extraction method to use for this field. When not specified, the system automatically determines the best method based on the field type and description.
  • Method types:
    • "generate" - Values are generated freely based on the content using AI models (best for complex or variable fields requiring interpretation)
    • "extract" - Values are extracted as they appear in the content (best for literal text extraction from specific locations). Extract requires enableSourceGroundingAndConfidence to be set to true for this field.
    • "classify" - Values are classified against a predefined set of categories (best when using enum with a fixed set of possible values)
estimateSourceAndConfidence
  • Default: false
  • Description: Returns source location (page number, bounding box) and confidence score for this field value. Must be true for fields with method = extract. This property will override the analyzer level estimateFieldSourceAndConfidence property.
  • When to use:
    • Validation and quality assurance workflows
    • Understanding extraction accuracy
    • Debugging extraction issues
    • Highlighting source text in user interfaces
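  • Example - Grounded extract field (a sketch; the field name and description are illustrative):
    {
      "InvoiceNumber": {
        "type": "string",
        "description": "Unique invoice number from the document header",
        "method": "extract",
        "estimateSourceAndConfidence": true
      }
    }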
  • Supported by: Document analyzers (invoice, receipt, ID documents, tax forms)

items (for array types)

  • Description: Defines the structure of items in the array
  • Properties:
    • type - Type of array items ("string", "number", "object")
    • properties - For object items, defines the nested field structure

properties (for object types)

  • Description: Defines the structure of nested fields within the object
  • Example:
    {
      "Address": {
        "type": "object",
        "properties": {
          "Street": { "type": "string" },
          "City": { "type": "string" },
          "State": { "type": "string" },
          "ZipCode": { "type": "string" }
        },
        "description": "Complete mailing address"
      }
    }
    

Complete analyzer example

Here's a comprehensive example of a custom invoice analyzer configuration that demonstrates the key concepts discussed in this reference:

{
  "analyzerId": "my-custom-invoice-analyzer",
  "name": "Custom Invoice Analyzer",
  "description": "Extracts vendor information, line items, and totals from commercial invoices",
  "baseAnalyzerId": "prebuilt-document",
  "config": {
    "returnDetails": true,
    "enableOcr": true,
    "enableLayout": true,
    "tableFormat": "html",
    "estimateFieldSourceAndConfidence": true,
    "omitContent": false
  },
  "fieldSchema": {
    "name": "InvoiceFields",
    "fields": {
      "VendorName": {
        "type": "string",
        "description": "Name of the vendor or supplier, typically found in the header section",
        "method": "extract"
      },
      "VendorAddress": {
        "type": "object",
        "properties": {
          "Street": { "type": "string" },
          "City": { "type": "string" },
          "State": { "type": "string" },
          "ZipCode": { "type": "string" }
        },
        "description": "Complete vendor mailing address"
      },
      "InvoiceNumber": {
        "type": "string",
        "description": "Unique invoice number, often labeled as 'Invoice #' or 'Invoice No.'",
        "method": "extract"
      },
      "InvoiceDate": {
        "type": "date",
        "description": "Date the invoice was issued, in format MM/DD/YYYY",
        "method": "extract"
      },
      "DueDate": {
        "type": "date",
        "description": "Payment due date",
        "method": "extract"
      },
      "LineItems": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "Description": {
              "type": "string",
              "description": "Item or service description"
            },
            "Quantity": {
              "type": "number",
              "description": "Quantity ordered"
            },
            "UnitPrice": {
              "type": "number",
              "description": "Price per unit"
            },
            "Amount": {
              "type": "number",
              "description": "Line total (Quantity × UnitPrice)"
            }
          }
        },
        "description": "List of items or services on the invoice, typically in a table format",
        "method": "generative"
      },
      "Subtotal": {
        "type": "number",
        "description": "Sum of all line items before tax",
        "method": "extract"
      },
      "Tax": {
        "type": "number",
        "description": "Tax amount",
      },
      "Total": {
        "type": "number",
        "description": "Total amount due (Subtotal + Tax)",
      },
      "PaymentTerms": {
        "type": "string",
        "description": "Payment terms and conditions (e.g., 'Net 30', 'Due upon receipt')",
        "method": "generative"
      }
    }
  },
  "supportedModels": {
    "completion": ["gpt-4o", "gpt-4o-mini", "gpt-4.1"],
    "embedding": ["text-embedding-3-large", "text-embedding-3-small"]
  },
  "models": {
    "completion": "gpt-4.1",
    "embedding": "text-embedding-3-large"
  }
}

Creating a custom analyzer

To create a custom analyzer based on the configuration structure described in this document, use the Content Understanding REST API to submit your analyzer definition.

API endpoint

Use the following curl command to create a custom analyzer by submitting your analyzer configuration from a JSON file:

curl -X PUT "https://{endpoint}/contentunderstanding/analyzers/{analyzerId}?api-version=2025-11-01-preview" \
  -H "Content-Type: application/json" \
  -H "Ocp-Apim-Subscription-Key: {key}" \
  -d @analyzer-definition.json

Replace the following placeholders:

  • {endpoint} - Your Content Understanding resource endpoint
  • {analyzerId} - Unique identifier for your analyzer
  • {key} - Your Content Understanding subscription key
  • analyzer-definition.json - Path to your analyzer configuration file

Request body

The analyzer configuration file should be a JSON object containing the properties described in this reference. For a complete example, see the Create Custom Analyzer tutorial.

Response

The API returns a 201 Created response with an Operation-Location header that you can use to track the status of the analyzer creation operation.
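
For example, you can poll that URL with a GET request until the operation completes (a sketch; {operationLocation} is the value returned in the Operation-Location header):

curl -X GET "{operationLocation}" \
  -H "Ocp-Apim-Subscription-Key: {key}"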

Next steps

For a complete walkthrough with examples for different content types (documents, images, audio, video), see Create a custom analyzer.

Configuration by content type

Different content types support different configuration options. Here's a quick reference:

Document analyzers

Base analyzer: prebuilt-document

Supported configuration options:

  • returnDetails
  • omitContent
  • enableOcr
  • enableLayout
  • enableFormula
  • enableBarcode
  • tableFormat
  • chartFormat
  • enableFigureDescription
  • enableFigureAnalysis
  • enableAnnotations
  • annotationFormat
  • enableSegment
  • segmentPerPage
  • estimateFieldSourceAndConfidence (structured analyzers)
  • contentCategories (multi-variant analyzers)

Audio analyzers

Base analyzer: prebuilt-audio

Supported configuration options:

  • returnDetails
  • locales

Video analyzers

Base analyzer: prebuilt-video

Supported configuration options:

  • returnDetails
  • locales
  • contentCategories
  • enableSegment
  • omitContent
  • disableFaceBlurring

Image analyzers

Base analyzer: prebuilt-image

Supported configuration options:

  • returnDetails
  • disableFaceBlurring