SharePoint as data source in Azure AI Search: error reading .docx

One_employee 0 Reputation points
2025-05-14T17:23:34.7533333+00:00

Hi, I followed this tutorial to add SharePoint as a data source and index it: https://learn.microsoft.com/en-us/azure/search/search-howto-index-sharepoint-online

It works well with .txt documents.
But with .docx documents, I get a string that describes the binary content of the files, and this content is corrupted.
The string the indexer produces for the source "/document/content" looks like this: (screenshot: Capture d’écran 2025-05-14 191839)

In "text", it should be the "/document/content" value.

It is not usable. What is wrong?
Can someone help me?
Thanks

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.

3 answers

  1. One_employee 0 Reputation points
    2025-05-16T09:30:51.62+00:00

    I tried to update my data source, but:

    Response content: {'error': {'code': '', 'message': "The request is invalid. Details: The property 'dataToExtract' does not exist on type 'Microsoft.Azure.Search.V2024_05_01_Preview.DataSource' or is not present in the API version '2024-05-01-preview'. Make sure to only use property names that are defined by the type."}}

    And

    Response content: {'error': {'code': '', 'message': "The request is invalid. Details: The property 'dataToExtract' does not exist on type 'Microsoft.Azure.Search.V2024_05_01_Preview.DataSource' or is not present in the API version '2024-05-01-preview'. Make sure to only use property names that are defined by the type."}}

    So how do I update it? I also tried with api-version=2020-06-30 and got the same error.

    If you were talking about the indexer definition, I already have this configuration for my indexer:

    "configuration": {
        "dataToExtract": "contentAndMetadata",
        "parsingMode": "text",
        "indexedFileNameExtensions": ".pdf, .docx, .txt",
        "excludedFileNameExtensions": ".png, .jpg"
    }
    

    I can't add the block

    "fileDataSettings": { "dataFormat": "application/vnd.openxmlformats-officedocument.wordprocessingml.document" }

    It returns

    <Response [400]>
    ❌ Error 400: {"error":{"code":"","message":"The request is invalid. Details: Value cannot be null. (Parameter 'edmType')"}}

    at indexer creation.


  2. Suresh Chikkam 2,215 Reputation points Microsoft External Staff Moderator
    2025-05-23T10:01:12.9633333+00:00

    Hi One_employee,

    You did everything right by switching your parsing mode to "default"; that’s exactly how you get readable text from .docx files in SharePoint using Azure AI Search. Unfortunately, as you’ve noticed, this approach only gives you plain, unstructured text. Formatting, headings, and markdown-style details are lost, and that’s a built-in limitation of using SharePoint as a data source. There’s no setting or skillset that changes this in the native indexer.
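
For reference, a minimal sketch of what an indexer definition with "parsingMode": "default" might look like, reusing the configuration values from the question. The indexer, data source, and index names are hypothetical placeholders, and the overall body shape is assumed to follow the indexer request body in the SharePoint indexer tutorial linked in the question.

```python
import json


def sharepoint_indexer_definition(indexer_name, data_source_name, index_name):
    """Build an indexer request body that uses the default parsing mode.

    All three names are hypothetical placeholders for your own resources.
    """
    return {
        "name": indexer_name,
        "dataSourceName": data_source_name,
        "targetIndexName": index_name,
        "parameters": {
            # dataToExtract and parsingMode belong to the indexer
            # configuration, not to the data source definition.
            "configuration": {
                "dataToExtract": "contentAndMetadata",
                "parsingMode": "default",
                "indexedFileNameExtensions": ".pdf, .docx, .txt",
                "excludedFileNameExtensions": ".png, .jpg",
            }
        },
    }


definition = sharepoint_indexer_definition(
    "sharepoint-indexer", "sharepoint-datasource", "sharepoint-index"
)
print(json.dumps(definition, indent=2))
```

Note that "dataToExtract" sits under the indexer's parameters.configuration. That is consistent with the error reported above, which says the property does not exist on the DataSource type.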

    If keeping structure like headings, lists, or markdown is important for your use case, you’ll need to build a custom extraction pipeline outside the default Azure AI Search indexer.

    • Download the .docx files from SharePoint. Use the Microsoft Graph API to fetch the documents programmatically. This can be automated with a script, an Azure Function, or a Logic App.
    • Extract text and structure from each document. Use a tool like python-docx (Python) or mammoth.js (Node.js) to process the .docx files. These libraries let you identify and separate headings, paragraphs, lists, and other formatting, which you can then serialize as JSON or markdown.
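    import json

    from docx import Document  # provided by the python-docx package

    doc = Document("yourfile.docx")
    structured_content = []
    for para in doc.paragraphs:
        # Built-in heading styles are named "Heading 1", "Heading 2", ...
        if para.style.name.startswith("Heading"):
            structured_content.append({"type": "heading", "text": para.text})
        else:
            structured_content.append({"type": "paragraph", "text": para.text})

    # Save structured_content as JSON for indexing
    with open("yourfile.json", "w", encoding="utf-8") as f:
        json.dump(structured_content, f, ensure_ascii=False, indent=2)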
    
    • Upload the structured data to Azure Blob Storage. Store the output (whether it’s JSON, markdown, or another structure) in a Blob Storage container. This becomes your new source for Azure AI Search.
    • Point Azure AI Search to index from Blob Storage. Set up the indexer to use this blob container. Now you have full control over what gets indexed, including any structure or formatting you want to preserve.
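
The first and last steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names, resource names, and sample site ID are hypothetical, the download URL assumes the file lives in the site's default document library, and the "jsonArray" parsing mode assumes each blob holds a JSON array of objects (as a list of heading/paragraph entries would) and that the target index defines matching fields. Authentication, the HTTP calls, and the upload to Blob Storage are left out.

```python
from urllib.parse import quote

GRAPH_ROOT = "https://graph.microsoft.com/v1.0"


def graph_download_url(site_id, file_path):
    """Path-based Microsoft Graph URL that returns a file's raw content.

    Assumes the file is in the site's default document library; a GET on
    this URL needs an "Authorization: Bearer <token>" header.
    """
    encoded_path = quote(file_path.strip("/"))  # keeps "/" and escapes spaces
    return f"{GRAPH_ROOT}/sites/{site_id}/drive/root:/{encoded_path}:/content"


def blob_data_source(name, connection_string, container):
    """Data source body pointing Azure AI Search at a blob container."""
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }


def blob_indexer(name, data_source_name, index_name):
    """Indexer body that treats each element of a JSON array as a document."""
    return {
        "name": name,
        "dataSourceName": data_source_name,
        "targetIndexName": index_name,
        "parameters": {"configuration": {"parsingMode": "jsonArray"}},
    }


# Hypothetical values, for illustration only.
url = graph_download_url("contoso.sharepoint.com,abc,def", "/Docs/My File.docx")
source = blob_data_source("docx-json-source", "<connection-string>", "docx-json")
indexer = blob_indexer("docx-json-indexer", source["name"], "docx-json-index")
print(url)
```

Keeping these helpers as plain dictionary and string builders makes it easy to inspect the request bodies before sending them to the REST APIs.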

    Hope it helps!


    Please do not forget to click "Accept the answer" and "Yes" wherever the information provided helps you; this can be beneficial to other community members.

    If you have any other questions or are still running into issues, let me know in the "comments" and I would be happy to help you.


  3. TianChang Liu 0 Reputation points
    2025-11-25T12:30:24.7266667+00:00

    User's image

    Where should I insert the skill for proper indexing?

