SharePoint as data source in Azure AI Search, error when reading .docx

Question


One_employee 0

Hi, I followed this tutorial to add SharePoint as a data source and index it: https://learn.microsoft.com/en-us/azure/search/search-howto-index-sharepoint-online

It works well with .txt documents.
But with .docx documents, I get a string that is supposed to represent the binary content of the file, and this content is corrupted.
The string coming from the source field "/document/content" (an indexer output) looks like this: [screenshot: Capture d’écran 2025-05-14 191839]

In the screenshot, "text" should hold the value of "/document/content".

It is not usable. What is wrong?
Can someone help me?
Thanks

Anonymous

2025-05-14T18:23:00.48+00:00
Hi One_employee,

This problem occurs because, by default, the raw file content (/document/content) obtained from SharePoint is in binary format, not plain text. When working with file types such as .docx, .pdf, or images, Azure AI Search needs the content to be extracted and processed using skills, especially the built-in text skills (cognitive skills).

Without a skillset, the indexer pulls in only the binary content, which looks like garbage or base64-encoded data. This is why you are seeing unreadable content in the /document/content field.

To properly parse and search .docx files, you need to define and attach a skillset that includes a text extraction skill. This will process the binary content and extract readable text that can be indexed in a searchable field (e.g., text).

Create a skillset with a text extraction skill. In your skillset definition, make sure /document/content is mapped to the input of this skill. Map the skill output (e.g., text) to a field in your index that is marked as searchable.
https://learn.microsoft.com/en-us/azure/search/search-howto-index-sharepoint-online
https://learn.microsoft.com/en-us/azure/search/cognitive-search-defining-skillset
If you have any further concerns or queries, please feel free to reach out to us.

One_employee 0

Thanks for your answer, but that's what I did. I tried with this skillset:

{
    "name": "large3-embedder",
    "description": "Skillset to chunk documents and generate embeddings",
    "skills": [
      {
        "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
        "name": "#1",
        "description": "Split skill to chunk documents",
        "context": "/document",
        "defaultLanguageCode": "fr",
        "textSplitMode": "pages",
        "maximumPageLength": 10000,
        "pageOverlapLength": 700,
        "maximumPagesToTake": 0,
        "unit": "characters",
        "inputs": [
          {
            "name": "text",
            "source": "/document/content",
            "inputs": []
          }
        ],
        "outputs": [
          {
            "name": "textItems",
            "targetName": "pages"
          }
        ]
      },
      {
        "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
        "name": "#2",
        "context": "/document/pages/*",
        "resourceUri": "MyURI",
        "apiKey": "MyKey",
        "deploymentId": "text-embedding-3-large",
        "dimensions": 3072,
        "modelName": "text-embedding-3-large",
        "inputs": [
          {
            "name": "text",
            "source": "/document/pages/*",
            "inputs": []
          }
        ],
        "outputs": [
          {
            "name": "embedding",
            "targetName": "text_vector"
          }
        ]
      }
    ],
    "indexProjections": {
      "selectors": [
        {
          "targetIndexName": "new-index",
          "parentKeyFieldName": "parent_id",
          "sourceContext": "/document/pages/*",
          "mappings": [
            {
              "name": "text_vector",
              "source": "/document/pages/*/text_vector",
              "inputs": []
            },
            {
              "name": "chunk",
              "source": "/document/pages/*",
              "inputs": []
            },
            {
              "name": "title",
              "source": "/document/title",
              "sourceContext": null,
              "inputs": []
            },
            {
              "name": "metadata_spo_item_path",
              "source": "/document/metadata_spo_item_path",
              "sourceContext": null,
              "inputs": []
            },
            {
              "name": "content",
              "source": "/document/content",
              "sourceContext": null,
              "inputs": []
            },
            {
              "name": "metadata_spo_item_name",
              "source": "/document/metadata_spo_item_name",
              "sourceContext": null,
              "inputs": []
            }
          ]
        }
      ],
      "parameters": {
        "projectionMode": "skipIndexingParentDocuments"
      }
    }
}

It does not work.

I also tried a custom skill using an Azure Function (to read the Word binary with python-docx) instead of the SplitSkill. It does not work either, because the binary content in /document/content is corrupted, as shown in the screenshot above (it is a screenshot from my Azure Function log).
Here is the code of my Azure Function:


import logging
from io import BytesIO

import azure.functions as func
from docx import Document  # python-docx

app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)


@app.route(route="process_documents_docx", methods=[func.HttpMethod.POST])
def process_documents_docx(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')

    # Read the input payload sent by the indexer
    req_body = req.get_json()
    values = req_body.get('values')

    # List in which the response will be built
    response_values = []

    # Process the files one by one
    for record in values:
        logging.info(f"values type : {type(record)}")
        logging.info(f"values  : {record}")

        # Get the record ID
        record_id = record.get("recordId")

        # Get the file content (the original comment says "file URI")
        input_data = record.get("data").get("content")

        # This is the line that raises BadZipFile
        doc = Document(BytesIO(input_data))

It returns this error:

Result: Failure
Exception: BadZipFile: File is not a zip file
Stack:
  File "/azure-functions-host/workers/python/3.12/LINUX/X64/azure_functions_worker/dispatcher.py", line 676, in _handle__invocation_request
    call_result = await self._loop.run_in_executor(
  File "/opt/python/3.12/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/azure-functions-host/workers/python/3.12/LINUX/X64/azure_functions_worker/dispatcher.py", line 1006, in _run_sync_func
    return ExtensionManager.get_sync_invocation_wrapper(context,
  File "/azure-functions-host/workers/python/3.12/LINUX/X64/azure_functions_worker/extension.py", line 211, in _raw_invocation_wrapper
    result = function(**args)
  File "/home/site/wwwroot/function_app.py", line 37, in process_documents_docx
    doc = Document(BytesIO(input_data))
  File "/home/site/wwwroot/.python_packages/lib/site-packages/docx/api.py", line 27, in Document
    document_part = cast("DocumentPart", Package.open(docx).main_document_part)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/docx/opc/package.py", line 127, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/docx/opc/pkgreader.py", line 22, in from_file
    phys_reader = PhysPkgReader(pkg_file)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/docx/opc/phys_pkg.py", line 76, in __init__
    self._zipf = ZipFile(pkg_file, "r")
  File "/opt/python/3.12/lib/python3.12/zipfile/__init__.py", line 1341, in __init__
    self._RealGetContents()
  File "/opt/python/3.12/lib/python3.12/zipfile/__init__.py", line 1408, in _RealGetContents
    raise BadZipFile("File is not a zip file")

This is why I say it is corrupted: because of the "BadZipFile" error.
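
A .docx file is a ZIP container, so one way to narrow down a BadZipFile error is to check whether the bytes reaching the function form a ZIP archive at all. The sketch below is only a diagnostic aid: the helper name `describe_payload` is hypothetical, and the base64 attempt is an assumption about how the content might have been encoded, not something confirmed in this thread.

```python
import base64
import binascii
import io
import zipfile


def describe_payload(payload):
    """Classify a payload as 'zip', 'base64-zip', or 'not-zip' (hypothetical helper)."""
    # Accept both str and bytes, since the type received is not known for sure.
    raw = payload.encode("utf-8", errors="replace") if isinstance(payload, str) else payload
    # A .docx is a ZIP container, so valid .docx bytes must pass this check.
    if zipfile.is_zipfile(io.BytesIO(raw)):
        return "zip"
    # Assumption: the content may have been base64-encoded in transit.
    try:
        decoded = base64.b64decode(raw, validate=True)
    except (binascii.Error, ValueError):
        return "not-zip"
    if zipfile.is_zipfile(io.BytesIO(decoded)):
        return "base64-zip"
    return "not-zip"
```

If the result is "not-zip", the function is probably receiving text that has already been extracted (or mangled) upstream rather than the original file bytes, and python-docx cannot open it.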

Anonymous

2025-05-15T18:15:12.2266667+00:00
Hi One_employee,
To solve this, we strongly recommend using the built-in document extraction capability available through skillsets in Azure Cognitive Search, which is specifically designed for handling .docx, .pdf, and other binary formats. This feature automatically extracts readable text from supported formats (including .docx) before you apply any custom or AI enrichment skills such as SplitSkill or the embedding skill.

Update your data source to set dataToExtract to contentAndMetadata, for example:

"dataToExtract": "contentAndMetadata", "fileDataSettings": { "dataFormat": "application/vnd.openxmlformats-officedocument.wordprocessingml.document" }

Once the text is extracted by Azure, your skills (split, embedding, etc.) will receive clean plain text. You do not need to write a custom Azure Function to decode or parse the document; Azure handles the heavy lifting.

If you want to keep enriching the content (e.g., splitting it into pages or generating embeddings), your current skillset is completely valid after text extraction.
Index SharePoint Online
Using Skillsets in Cognitive Search
If you have any further concerns or queries, please feel free to reach out to us.


Answer 1

One_employee 0

I tried to update my data source, but got:

Response content: {'error': {'code': '', 'message': "The request is invalid. Details: The property 'dataToExtract' does not exist on type 'Microsoft.Azure.Search.V2024_05_01_Preview.DataSource' or is not present in the API version '2024-05-01-preview'. Make sure to only use property names that are defined by the type."}}

So how do I update it? I also tried with api-version=2020-06-30 and got the same error.

If you were talking about the indexer definition, I already have this configuration for my indexer:

 "configuration": {
        "dataToExtract": "contentAndMetadata",
        "parsingMode": "text",
        "indexedFileNameExtensions" : ".pdf, .docx, .txt",
        "excludedFileNameExtensions" : ".png, .jpg"
      }

I can't add the block

"fileDataSettings": { "dataFormat": "application/vnd.openxmlformats-officedocument.wordprocessingml.document" }

It returns

<Response [400]>
❌ Error 400: {"error":{"code":"","message":"The request is invalid. Details: Value cannot be null. (Parameter 'edmType')"}}

when I create the indexer.

Anonymous

2025-05-16T18:07:57.8666667+00:00
Hi One_employee,
The dataToExtract property is not supported on the data source in the current API version (2024-05-01-preview). Similarly, the fileDataSettings block currently applies only to Blob Storage and is not supported for SharePoint as a data source.

The property 'dataToExtract' does not exist on type 'Microsoft.Azure.Search.V2024_05_01_Preview.DataSource'

This limitation is not very clearly documented, and I honestly appreciate your effort in trying various API versions while troubleshooting.

You are doing everything right: the dataToExtract setting belongs in the indexer configuration, and as you have already confirmed, you are using it correctly:

"configuration": { "dataToExtract": "contentAndMetadata", "parsingMode": "text", "indexedFileNameExtensions": ".pdf, .docx, .txt", "excludedFileNameExtensions": ".png, .jpg" }

When using SharePoint as a data source, this is the correct and only supported way to tell Azure AI Search to extract both metadata and readable content from .docx and other supported formats. There is no need to add fileDataSettings; this property applies only to Blob Storage sources.

With your current indexer configuration, Azure should extract readable content from .docx files and pass it to the enrichment pipeline. If you are still receiving binary data in /document/content, there are two possible reasons:

The cognitive skills are referencing the wrong field (i.e., reading the raw content instead of the enriched outputs).

File extraction is failing silently due to an unexpected file structure or a parsing error.

Use /document/merged_content as your skill input instead of /document/content. When enrichment is enabled with dataToExtract: contentAndMetadata, Azure automatically populates merged_content with clean, extracted text.

Example (for your split skill):

"inputs": [ { "name": "text", "source": "/document/merged_content" } ]

Run the indexer and check the raw output to verify that merged_content includes the text extracted from your .docx files.

Azure Cognitive Search supports .docx documents generated by Word (OOXML format). If these files were exported or generated by a third-party tool, they may not follow the expected structure.
Anonymous

2025-05-19T17:20:25.6633333+00:00

Hi One_employee,
We haven’t heard from you since the last response and are just checking back to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, reply with more details and we will try to help.
One_employee 0 Reputation points

2025-05-20T14:21:23.64+00:00

The solution is to change the parsing mode from "parsingMode": "text" to "parsingMode": "default". With that change it works well.

But it retrieves only the text, with no other information such as Markdown-style formatting, which is quite useful for chunking.
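
For reference, the change described above amounts to a one-value edit in the indexer configuration. The sketch below builds that block in Python; `build_indexer_configuration` is a hypothetical helper, the other settings are simply copied from the configuration posted earlier in this thread, and the `parameters` wrapper in the printed output is an assumption about where the block sits in the full indexer definition.

```python
import json


def build_indexer_configuration(parsing_mode="default"):
    """Return the indexer configuration block discussed in the thread (hypothetical helper)."""
    return {
        "dataToExtract": "contentAndMetadata",
        # The thread reports unreadable .docx content with "text" and readable text with "default".
        "parsingMode": parsing_mode,
        "indexedFileNameExtensions": ".pdf, .docx, .txt",
        "excludedFileNameExtensions": ".png, .jpg",
    }


if __name__ == "__main__":
    # Print the block as JSON, ready to paste into an indexer definition.
    print(json.dumps({"parameters": {"configuration": build_indexer_configuration()}}, indent=2))
```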
Anonymous

2025-05-20T18:19:31.92+00:00
Hi One_employee,
As you have seen, while the "default" mode successfully extracts the text content, it does not currently preserve formatting information such as headings, lists, or Markdown-style structures.

Unfortunately, this is a current platform limitation: the document extraction engine flattens the output to plain text during enrichment. Although this works well for vectorization and semantic search scenarios, it can make chunking less accurate if you rely on structural cues.

Use SplitSkill to chunk based on size (characters or sentences), which works effectively with flattened content:

"textSplitMode": "pages", "maximumPageLength": 10000, "pageOverlapLength": 500

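To make the two size parameters concrete, here is a rough Python sketch of fixed-size chunking with overlap. It is not the SplitSkill implementation; `chunk_text` is a hypothetical helper that only illustrates how a maximum length and an overlap interact on flattened text.

```python
def chunk_text(text, max_length=10000, overlap=500):
    """Split text into chunks of at most max_length characters, each sharing
    `overlap` characters with the previous chunk (illustrative only)."""
    if max_length <= 0 or not 0 <= overlap < max_length:
        raise ValueError("need max_length > 0 and 0 <= overlap < max_length")
    chunks = []
    start = 0
    step = max_length - overlap  # how far the window advances each time
    while start < len(text):
        chunks.append(text[start:start + max_length])
        if start + max_length >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of producing more chunks (and more embeddings) per document.
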
If structural fidelity (e.g., Markdown parsing) is important for your use case, a custom Azure Function can still be used, but only after the content has been extracted and is available in merged_content. This avoids decoding the raw binary and lets you post-process the clean text to reintroduce a light structure (e.g., regex for the headings).
Parsing Modes – Azure Indexers
If you have any further concerns or queries, please feel free to reach out to us.
One_employee 0 Reputation points

2025-05-21T07:36:30.9066667+00:00

I don't understand your answer. How do you expect me to use an Azure Function as a custom skill (so after the data extraction), when the only thing I can extract from my data source is the raw text? I can't get the binary content that contains the formatting information.
Anonymous

2025-05-21T19:17:24.65+00:00

Hi One_employee,
Use a preprocessing pipeline (e.g., an Azure Function or Azure Logic App) that:

Pulls the .docx files directly from SharePoint (using the Microsoft Graph API),

Parses them with python-docx or mammoth.js, and writes structured JSON (with formatting preserved) to Azure Blob Storage.

Then, point Azure AI Search at Blob Storage instead of SharePoint. This approach gives you complete control over the document structure before it enters the indexing pipeline.
Read .docx structure using python-docx
Ingest data from Blob Storage

If you continue with SharePoint as a data source, consider chunking the flattened content using:

Length-based rules (SplitSkill)

Pattern matching (for example, detecting lines in all caps as titles)

This is not an ideal substitute for Markdown parsing, but it can still produce meaningful sections for vector search and QA experiences.
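
The pattern-matching idea above could look something like the following sketch. It treats any line whose letters are all uppercase as a title, which is a crude heuristic and only an assumption about how the source documents are written; `split_on_caps_titles` is a hypothetical helper, not part of Azure AI Search.

```python
def split_on_caps_titles(text):
    """Group lines into sections, starting a new section at each all-caps line."""
    sections = []
    current = {"title": None, "lines": []}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        # str.isupper() is True only if the line has letters and none are lowercase.
        if stripped.isupper():
            if current["title"] is not None or current["lines"]:
                sections.append(current)
            current = {"title": stripped, "lines": []}
        else:
            current["lines"].append(stripped)
    if current["title"] is not None or current["lines"]:
        sections.append(current)
    return sections
```

Text that appears before the first detected title ends up in a section whose title is None, so nothing is dropped.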
Anonymous

2025-05-22T17:03:58.04+00:00

Hi One_employee, We haven’t heard from you since the last response and are just checking back to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, reply with more details and we will try to help.

Answer 2

Hi One_employee,

You did everything right by switching your parsing mode to "default". That’s exactly how you get readable text from .docx files in SharePoint using Azure AI Search. Unfortunately, as you’ve noticed, this approach only gives you plain, unstructured text. Formatting, headings, and Markdown-style details are lost, and that’s a built-in limitation of using SharePoint as a data source. There’s no setting or skillset that changes this in the native indexer.

If keeping structure like headings, lists, or Markdown is important for your use case, you’ll need to build a custom extraction pipeline outside the default Azure AI Search indexer.

Download the .docx files from SharePoint with Microsoft Graph API. Use Microsoft Graph API to programmatically fetch the documents from SharePoint. This can be automated using a script, an Azure Function, or Logic App.
Extract text and structure from each document. Use a tool like python-docx (Python) or mammoth.js (Node.js) to process the .docx files. These libraries let you identify and separate out headings, paragraphs, lists, and other formatting, which you can then serialize as JSON or markdown.

from docx import Document
doc = Document("yourfile.docx")
structured_content = []
for para in doc.paragraphs:
    if para.style.name.startswith("Heading"):
        structured_content.append({"type": "heading", "text": para.text})
    else:
        structured_content.append({"type": "paragraph", "text": para.text})
# Save structured_content as JSON for indexing

Upload the structured data to Azure Blob Storage. Store the output (whether it’s JSON, markdown, or another structure) in a Blob Storage container. This becomes your new source for Azure AI Search.
Point Azure AI Search to index from Blob Storage. Set up the indexer to use this blob container. Now you have full control over what gets indexed—including any structure or formatting you want to preserve.
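
As one possible serialization before the upload to Blob Storage, the heading/paragraph split could be written out as Markdown so that heading levels survive. The sketch assumes Word's built-in English style names ("Heading 1", "Heading 2", and so on), which python-docx exposes as `para.style.name`; `to_markdown` is a hypothetical helper and takes plain `(style_name, text)` pairs so that it runs without a .docx file.

```python
import re


def to_markdown(paragraphs):
    """Convert (style_name, text) pairs into a Markdown string."""
    lines = []
    for style_name, text in paragraphs:
        text = text.strip()
        if not text:
            continue  # skip empty paragraphs
        match = re.match(r"Heading (\d+)$", style_name)
        if match:
            level = min(int(match.group(1)), 6)  # Markdown has six heading levels
            lines.append("#" * level + " " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)
```

With python-docx, the input could be built as `[(p.style.name, p.text) for p in doc.paragraphs]`. Documents that use custom or localized style names would need a different mapping.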

Hope it helps!

Please do not forget to click "Accept the answer" and "Yes" wherever the information provided helps you, as this can be beneficial to other community members.


If you have any other questions or are still running into issues, let me know in the comments and I would be happy to help.

Answer 3

TianChang Liu 0


Where should I insert the skill for proper indexing?
