Unable to parse input document file

Question

Jacquelin Martineau-Rousseau 135 Reputation points

Hello,

I am having a problem detecting a SIT within a file. Here are some details:

- docx file
- contains text + one image
- the text contains multiple occurrences that are detected by the SIT when used in a .txt file
- a manual test on the SIT directly gives the following error:

|Microsoft.Exchange.Management.Tasks.ErrorFileHasNoTextContentException|Unable to parse input document file. Make sure the document contains text and isn't encrypted by a password.

This is a regular docx file, no password, no label.

Any ideas?

VRISHABHANATH PATIL 1,820 Reputation points Microsoft External Staff Moderator

2025-11-29T03:33:09.2466667+00:00

Hi @Jacquelin Martineau-Rousseau

I hope you had a chance to review the information shared earlier, and I hope this information has been helpful! If you still have questions, please let us know what is needed in the comments so the question can be answered.
VRISHABHANATH PATIL 1,820 Reputation points Microsoft External Staff Moderator

2025-12-01T09:31:42.3366667+00:00

Hi @Jacquelin Martineau-Rousseau

Just checking in to see if you had any questions. We are happy to assist and continue the discussion whenever you are ready.
Jacquelin Martineau-Rousseau 135 Reputation points

2025-12-01T15:42:55.3333333+00:00

I did have time to review it, but it doesn't solve my problem. Thanks.
VRISHABHANATH PATIL 1,820 Reputation points Microsoft External Staff Moderator

2025-12-02T03:17:28.0866667+00:00

Hi @Jacquelin Martineau-Rousseau

Please share your additional questions here in detail, as providing more elaboration will help us investigate the issue further.
Jacquelin Martineau-Rousseau 135 Reputation points

2025-12-02T14:08:41.9333333+00:00

I understand that rebuilding or modifying the file may solve that problem.

However, since this is a DLP tool that I use with thousands of users, I do not see how this is a valid solution to the problem. I would expect the tool to be able to perform the detection in my context, since there is no encryption.

This currently leaves us with a significant blind spot in our DLP setup.
VRISHABHANATH PATIL 1,820 Reputation points Microsoft External Staff Moderator

2025-12-03T04:36:11.05+00:00

Hi @Jacquelin Martineau-Rousseau

Thanks for the feedback. Below are a few mitigation steps that should help reduce the blind spot:

Catch unscannable files explicitly (compensating control). Add DLP conditions for parsing failures so the system doesn’t silently miss them. In Endpoint DLP and Purview you can use predicates like “Document couldn’t be scanned” / “Document didn’t complete scanning.”

Action: Block or require user override + policy tip and log the event.

Why: Even when a file’s structure breaks text extraction, you’ll still get an alert and can prevent risky exfiltration.

Normalize risky content at the source (authoring guidance you can enforce). Publish a simple guideline: use PNG/JPEG for images, avoid EMF/WMF/SVG/OLE embeds, and keep sensitive text in normal paragraphs (not shapes/text boxes).

Action: Share this as a one‑pager; include it in templates/macros.

Ref: Sensitive info types overview & testing – https://learn.microsoft.com/en-us/microsoft-365/compliance/sensitive-information-type-learn-about

Monitor for failures and tune policies, centrally. Use Activity/Alert/Incident Explorer to track where parsing fails and which locations/apps are involved; tighten rules for those channels.

Action: Set an analytic rule/alert for parsing‑failure predicates and review weekly.

Keep EXO/Purview DLP current (backend improvements help). Exchange Online DLP continues to improve true file‑type and content extraction. Make sure your tenant uses the latest DLP capabilities and test your SITs in the Purview test portal when you update.

Action: Re‑test with the SIT Test tool whenever policies change.

No silent misses: Files that can’t be parsed are still controlled and visible (blocked/overridden + audited).

Low user friction: Most users won’t notice anything; only edge cases trigger a clear tip.

Fast remediation path: If a business‑critical file hits the predicate, your team can quickly normalize the image or move text out of shapes—without relying on every user to “fix” their doc.
VRISHABHANATH PATIL 1,820 Reputation points Microsoft External Staff Moderator

2025-12-04T06:23:21.6566667+00:00

Hi @Jacquelin Martineau-Rousseau

I hope you had a chance to review the information shared earlier, and I hope this information has been helpful! If you still have questions, please let us know what is needed in the comments so the question can be answered.
VRISHABHANATH PATIL 1,820 Reputation points Microsoft External Staff Moderator

2025-12-05T03:46:54.2666667+00:00

Hi @Jacquelin Martineau-Rousseau

If this solution helped resolve your issue, please consider clicking ‘Accept Answer’ or giving it an upvote to help others find it easily.


Answer 1

The error message you are encountering, Microsoft.Exchange.Management.Tasks.ErrorFileHasNoTextContentException, indicates that the document may not be recognized as containing text content, even though it is a regular .docx file. Here are some potential reasons and troubleshooting steps:

Check Document Content: Ensure that the text in the document is not embedded in a way that makes it unreadable by the parser (e.g., as an image or within text boxes).
Image Content: Since your document contains an image, verify that the presence of the image is not interfering with the text extraction process. Sometimes, if the text is not in the main body but rather in headers, footers, or text boxes, it may not be detected.
File Integrity: Ensure that the .docx file is not corrupted. Try opening it in Microsoft Word and saving it again to create a fresh copy.
Test with Different Formats: Since you mentioned that the SIT works with a .txt file, consider converting the .docx file to a .txt format and testing it again to see if the issue persists.
Document Size and Complexity: If the document is large or complex, consider simplifying it to see if that resolves the issue.

If these steps do not resolve the issue, you may want to consult further documentation on sensitive information types and their requirements for processing.

Jacquelin Martineau-Rousseau 135 Reputation points

2025-11-27T16:00:36.9533333+00:00

I have done a few tests:

copying the text to the text file -> good detection with SIT

copying the text to another docx file (without the image) -> good detection

copying the entire content (text + image) -> same error

Answer 2

Hi @Jacquelin Martineau-Rousseau

Thank you for contacting Microsoft Q&A. Please find below the detailed steps to address the reported issue.

The error you reported is:

Microsoft.Exchange.Management.Tasks.ErrorFileHasNoTextContentException

Unable to parse input document file. Make sure the document contains text and isn't encrypted by a password.

This error does not mean that your Sensitive Information Type (SIT) is wrong. The SIT works fine: you proved that by testing with a .txt file and with a clean .docx without the image. The problem is that the original Word file contains something (most likely the embedded image or an object) that breaks Purview’s text extraction process. When the parser can’t read the text, it throws this error.

Why this happens

Corrupt or complex OOXML parts: Certain embedded objects (like EMF/WMF/SVG images or OLE packages) can make the document structure invalid for the extractor.
Text inside shapes or controls: If your text is inside text boxes or grouped shapes, Purview may not treat it as normal text.
File integrity issues: Large or malformed files sometimes hit parsing limits.
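To check whether a given file actually contains the kinds of parts listed above, the package can be inspected directly, since a .docx is a ZIP archive. The following is a minimal sketch, not a Purview tool: the function name `find_risky_parts` is hypothetical, the extension list simply mirrors the formats named in this answer, and it assumes the images and embedded objects sit in `word/media/` and `word/embeddings/`, which is where Word conventionally stores them. A clean result does not prove that Purview will parse the file.

```python
import zipfile
from pathlib import PurePosixPath

# Assumption: this set mirrors the image formats named in the answer above.
# It is not an official list of formats that Purview cannot extract.
RISKY_IMAGE_EXTENSIONS = {".emf", ".wmf", ".svg"}


def find_risky_parts(docx_path):
    """List package parts that match the risky categories named in the answer."""
    findings = {"risky_images": [], "embedded_objects": []}
    with zipfile.ZipFile(docx_path) as package:
        for name in package.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            suffix = PurePosixPath(name).suffix.lower()
            if name.startswith("word/media/") and suffix in RISKY_IMAGE_EXTENSIONS:
                # Vector or metafile image stored in Word's usual media folder.
                findings["risky_images"].append(name)
            elif name.startswith("word/embeddings/"):
                # Embedded OLE objects or packages are usually stored here.
                findings["embedded_objects"].append(name)
    return findings
```

Calling `find_risky_parts("sample.docx")` on a copy of the failing file (the file name is a placeholder) returns two lists of part names. If both are empty, the image format is probably not the cause and the other explanations are worth checking first.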

Microsoft has acknowledged similar behaviors in their forums and guidance:

https://learn.microsoft.com/en-us/microsoft-365/compliance/sensitive-information-type-learn-about

How to fix it

Rebuild the Word file

-- Open the file in Word → File > Info > Convert (if it shows Compatibility Mode).

-- Then Save As a fresh .docx.

-- If Word prompts for repair, choose Open and Repair.

-- Test again in Purview.

Normalize the image

-- Replace the current image with a PNG or JPEG.

-- Avoid EMF/WMF/SVG or embedded OLE objects.

-- Reinsert the image and retest.

Move text out of shapes

-- If text is inside text boxes or grouped shapes, copy it into the main body of the document.

Reduce complexity

-- Remove unnecessary headers/footers, tracked changes, or embedded items.

-- Compress large images.
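Before editing a document by hand, it can help to see where its text actually lives. The sketch below reads `word/document.xml` with the Python standard library and separates text found inside text boxes (`w:txbxContent` elements) from text in normal body paragraphs. The function name `text_by_location` is hypothetical, and this is only a rough diagnostic: it does not reproduce Purview's extractor, and it ignores headers and footers, which are stored in separate parts.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, written in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def text_by_location(docx_path):
    """Split the main document's text into body text and text-box text."""
    with zipfile.ZipFile(docx_path) as package:
        root = ET.fromstring(package.read("word/document.xml"))

    # Remember every <w:t> run that sits inside a <w:txbxContent> element.
    in_text_boxes = set()
    for box in root.iter(W + "txbxContent"):
        in_text_boxes.update(box.iter(W + "t"))

    body, boxes = [], []
    for run_text in root.iter(W + "t"):  # document order
        target = boxes if run_text in in_text_boxes else body
        target.append(run_text.text or "")
    return {"body": "".join(body), "text_boxes": "".join(boxes)}
```

If the sensitive values only show up under `text_boxes`, moving them into the main body is the relevant fix. Note that Word often stores a text box twice (a modern version and a legacy fallback), so text-box content may appear duplicated in the output.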

Avoid this in the future

Insert images as PNG/JPEG, not EMF/SVG/OLE.
Keep text in normal paragraphs, not shapes or controls.
Enable DLP predicates like “Document couldn’t be scanned” so users get clear feedback when parsing fails.

Suggested customer-facing explanation

The issue wasn’t with your SIT — it was with the document structure. The embedded image caused Purview to fail when extracting text. We rebuilt the file and replaced the image with a standard format, and now SIT detection works as expected. Going forward, use PNG/JPEG for images and keep text in the main body to avoid similar issues.

Jacquelin Martineau-Rousseau 135 Reputation points

2025-12-01T15:40:54.3766667+00:00

I understand that this may be the problem, but rebuilding the file is not a solution, since the idea is to detect the SIT within real-world business processes.
