Hello Christian Schwerdtfeger,
Welcome to Microsoft Q&A, and thanks for sharing the details.
Yes, the behavior you're seeing, where PDF files give noticeably lower accuracy than PNGs, is something we’ve observed with Azure Document Intelligence, especially when working with custom models.
Document Intelligence handles PDFs through an internal PDF-rendering pipeline. Depending on how the PDF was generated (embedded fonts, vector layers, low-DPI scans inside the PDF, compression, or mixed content), the rendered image quality can drop significantly before OCR even begins. When you convert the same PDF to a high-resolution PNG, you are essentially giving the service a clean, flattened image, which is why accuracy improves immediately.
Below are the most effective ways to improve accuracy for PDF inputs:
1. Check the underlying PDF quality
A PDF that looks fine can still contain:
• low-resolution pages
• compressed or downsampled images
• scanned images wrapped inside a PDF container
• special/embedded fonts
All of these can reduce OCR quality. If possible, ensure that PDFs are generated/exported at 300 DPI or higher.
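As a rough way to check the 300 DPI guideline, the effective resolution of a scanned page can be estimated from the pixel width of the embedded image and the page width in PDF points (1 point = 1/72 inch). This is only a minimal sketch: the function name and the sample numbers are illustrative assumptions, not part of any Document Intelligence API.

```python
POINTS_PER_INCH = 72.0  # PDF page sizes are expressed in points


def effective_dpi(image_width_px: int, page_width_pt: float) -> float:
    """Estimate the DPI of an image that spans the full page width."""
    if page_width_pt <= 0:
        raise ValueError("page width must be positive")
    page_width_in = page_width_pt / POINTS_PER_INCH
    return image_width_px / page_width_in


# Hypothetical example: a 1240 px wide scan placed on a 595 pt wide (A4) page
dpi = effective_dpi(1240, 595.0)
print(f"effective resolution: {dpi:.0f} DPI")
```

In this hypothetical case the scan works out to roughly 150 DPI, so the page would fall well short of the 300 DPI recommendation even though it may look fine on screen.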
2. Preprocess PDFs before calling the service
Many customers see better results by converting each PDF page to a 300–400 DPI PNG before sending it to the model. This aligns with what you already observed: the conversion improves accuracy because the OCR receives a consistent, clean image.
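One possible way to do this conversion is sketched below. It assumes the third-party PyMuPDF package (imported as fitz) is installed; the function names, the output file naming scheme, and the 300 DPI default are illustrative choices, not something Document Intelligence requires.

```python
from pathlib import Path

PDF_BASE_DPI = 72.0  # PDF user space is 72 units per inch


def zoom_for_dpi(dpi: float) -> float:
    """Scale factor that renders a PDF page at the requested DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / PDF_BASE_DPI


def pdf_to_pngs(pdf_path: str, out_dir: str, dpi: float = 300) -> list[Path]:
    """Render every page of pdf_path to a PNG in out_dir and return the paths."""
    import fitz  # PyMuPDF; imported lazily so zoom_for_dpi works without it

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    zoom = zoom_for_dpi(dpi)
    written = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Render the page to an RGB bitmap at the requested resolution
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
            target = out / f"{Path(pdf_path).stem}_p{page.number + 1:03d}.png"
            pix.save(str(target))
            written.append(target)
    return written
```

Calling pdf_to_pngs("invoice.pdf", "rendered") (both names are placeholders) would write one PNG per page, which can then be sent to the model instead of the original PDF.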
3. Flatten or optimize PDFs
PDFs coming from different systems (printers, ERPs, scanners) may include layered content or vector text that renders poorly. Flattening or optimizing the file (e.g., via Ghostscript or other PDF processing tools) often produces more stable OCR results.
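As an illustration of the Ghostscript route, the sketch below re-writes a PDF through Ghostscript's pdfwrite device. It assumes Ghostscript is installed and available as gs on PATH (on Windows the binary is usually named differently, for example gswin64c); the function names are hypothetical. Whether re-writing helps for a particular file is something to verify on your own documents.

```python
import shutil
import subprocess


def ghostscript_rewrite_cmd(src: str, dst: str, gs_binary: str = "gs") -> list[str]:
    """Build the argument list that re-writes src into dst via pdfwrite."""
    return [
        gs_binary,
        "-dSAFER",                   # restrict file access during processing
        "-dBATCH",                   # exit when the input has been processed
        "-dNOPAUSE",                 # do not prompt between pages
        "-sDEVICE=pdfwrite",         # emit a newly written PDF
        "-dCompatibilityLevel=1.4",
        f"-sOutputFile={dst}",
        src,
    ]


def rewrite_pdf(src: str, dst: str, gs_binary: str = "gs") -> None:
    """Run Ghostscript; raises if the binary is missing or the run fails."""
    if shutil.which(gs_binary) is None:
        raise RuntimeError(f"Ghostscript binary '{gs_binary}' was not found on PATH")
    subprocess.run(ghostscript_rewrite_cmd(src, dst, gs_binary), check=True)
```

Note that pdfwrite produces a new PDF rather than an image, so vector text stays vector text; if the goal is a fully flattened input, rendering the pages to PNG as described in step 2 is the more direct option.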
4. Retrain your custom model with image-based PDFs or PNGs
Even though you have many PDFs in your training set, the visual appearance of your training PDFs and your runtime PDFs may differ. Training the model with either of the following usually improves generalization and accuracy:
• PDF pages rendered as images
• PNGs that match your preprocessing flow
5. Check whether PDFs are scanned or digitally generated
Digitally created PDFs (exported from software) usually perform well. Scanned PDFs vary widely in quality, and if the scanning resolution is low, accuracy will drop.
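A simple heuristic for telling the two kinds apart is to look at whether each page carries a text layer or only images. The sketch below assumes the third-party PyMuPDF package (imported as fitz); the function names, labels, and the 20-character threshold are arbitrary assumptions for illustration. A scanned PDF that already has an OCR text layer will be reported as "text", so treat the result as a hint only.

```python
def classify_page(char_count: int, image_count: int, min_chars: int = 20) -> str:
    """Label a page from its extracted character count and image count."""
    if char_count >= min_chars:
        return "text"        # page has a usable text layer
    if image_count > 0:
        return "image_only"  # likely a scan wrapped in a PDF container
    return "empty"


def classify_pdf(pdf_path: str) -> list[str]:
    """Return one label per page of the given PDF."""
    import fitz  # PyMuPDF; imported lazily so classify_page works without it

    with fitz.open(pdf_path) as doc:
        return [
            classify_page(len(page.get_text().strip()), len(page.get_images()))
            for page in doc
        ]
```

Pages labeled "image_only" are the ones where the scan resolution is worth checking first.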
6. Use Layout OCR + custom extraction
In some cases, running the Layout model first and using its text/regions in your custom model provides better consistency across formats.
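As a starting point, the sketch below runs the prebuilt Layout model on a file and reports the mean word confidence, which can be compared between the PDF and its PNG rendering to see whether the OCR stage is where accuracy is lost. It assumes the azure-ai-formrecognizer Python package (DocumentAnalysisClient); newer SDK packages expose a different client, so adjust accordingly. The function names are hypothetical, and the endpoint and key are placeholders you supply.

```python
from statistics import mean


def mean_confidence(confidences: list[float]) -> float:
    """Average of the given confidences, or 0.0 when there are none."""
    return mean(confidences) if confidences else 0.0


def layout_word_confidence(path: str, endpoint: str, key: str) -> float:
    """Analyze a file with prebuilt-layout and return its mean word confidence."""
    # Imported lazily so mean_confidence can be used without the Azure SDK
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key))
    with open(path, "rb") as f:
        poller = client.begin_analyze_document("prebuilt-layout", document=f)
        result = poller.result()
    # Collect the per-word OCR confidence across all pages
    return mean_confidence([w.confidence for p in result.pages for w in p.words])
```

If the PNG scores clearly higher than the PDF for the same page, that points to the rendering of the PDF rather than to the custom model itself.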
I hope this helps. Let me know if you have any further questions.
Thank you!