How can I make sure the indexer indexes Chinese words correctly instead of showing garbled characters?

Hub=-For-=Dev 80 Reputation points
2025-11-20T10:03:42.5866667+00:00

Current setup:

  • I have an AI Search service set up.
  • I have defined an index (let's say we have a field named title).
    • In the title field of our index, I have already configured the analyzer as zh-Hant.microsoft.
  • I have configured a data source and an indexer.
    • I have defined a fieldMapping as below. Please note that metadata_title is the content metadata of the HTML file.
              {
                "sourceFieldName": "metadata_title",
                "targetFieldName": "title",
                "mappingFunction": null
              },
      

The flow is basically as below:

  1. We put an HTML page whose title contains Chinese characters in Azure Blob Storage.
  2. We manually run the indexer to crawl the title of the HTML page.
  3. However, when I search in Search Explorer, I see garbled characters.


Can I have some advice on that?

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.

Answer accepted by question author
  1. RAMAMURTHY MAKARAPU 1,125 Reputation points Microsoft External Staff Moderator
    2025-11-20T11:27:07.1633333+00:00

    Hi @Hub=-For-=Dev,

    Thanks for posting your question in the Microsoft Q&A forum.

    You're experiencing garbled characters because the indexer is pulling the title from a source that either doesn't properly preserve UTF-8 encoding or doesn't support non-ASCII characters at all. This commonly happens for two reasons, and each reason has a reliable fix.

    If the title comes from inside the HTML file: The HTML file may not declare UTF-8 encoding correctly, or the blob may not have the correct Content-Type or Content-Encoding headers. When this happens, the indexer reads the text using the wrong encoding and the Chinese characters become corrupted.
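To see this failure mode concretely, here is a small, Azure-independent Python sketch: UTF-8 bytes decoded with the wrong codec produce exactly this kind of mojibake (Latin-1 stands in for whatever fallback encoding a reader might assume).

```python
title = "繁體中文標題示例"
utf8_bytes = title.encode("utf-8")

# A reader that wrongly assumes Latin-1 turns each UTF-8 byte into a
# separate accented character, i.e. the garbled text you are seeing:
garbled = utf8_bytes.decode("latin-1")
assert garbled != title

# Reading with the declared/actual encoding recovers the original text:
assert utf8_bytes.decode("utf-8") == title
```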

What you can do is:

    The indexer will parse the HTML and extract <title>, <meta name="description">, etc. If the HTML file declares the encoding and the blob reports UTF-8, the text will be indexed correctly.

In the <head> of the HTML, include a charset declaration, for example:

    <meta charset="utf-8">
    <title>繁體中文標題示例</title>
    

Ensure the blob's Content-Type header includes UTF-8, for example:

    Content-Type: text/html; charset=utf-8
    

You can set this in Azure Storage Explorer, the Azure Portal, or when uploading via SDK/CLI. If the blob's Content-Type is missing or set incorrectly, the indexer may assume another encoding.
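If you upload via the raw REST API rather than an SDK, the Content-Type is pinned at upload time through the x-ms-blob-content-type header of the Put Blob operation. A minimal sketch of just the relevant headers (put_blob_headers is a hypothetical helper; authentication headers and the HTTP request itself are omitted):

```python
def put_blob_headers(content_type: str = "text/html; charset=utf-8") -> dict:
    """Headers for the Blob service 'Put Blob' REST operation that pin the
    blob's Content-Type (auth headers omitted for brevity)."""
    return {
        "x-ms-blob-type": "BlockBlob",
        # Stored with the blob and returned as Content-Type on download,
        # so the indexer sees an explicit UTF-8 declaration:
        "x-ms-blob-content-type": content_type,
    }

headers = put_blob_headers()
assert "charset=utf-8" in headers["x-ms-blob-content-type"]
```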

Ensure your indexer parsing mode is appropriate, so that the indexer extracts the <title> into the content or into the metadata properties it reports. Then map that extracted property to your title index field, or extract it from the content using a skillset. Check the indexer run output / document preview for the raw text the indexer extracted.

    You can use the above when you control the HTML files.

If the title comes from blob metadata (metadata_title): Azure Blob metadata values follow HTTP header rules, which traditionally do not allow raw non-ASCII characters. Because of this, storing Chinese characters directly in metadata causes them to be corrupted or appear garbled during indexing. To work around this limitation, developers typically base64-encode metadata values before storing them and then decode them during indexing to preserve the original characters.

When uploading the blob, base64-encode the title and put that string into metadata, e.g. (pseudo):

metadata_title = base64("繁體中文標題示例")  => "57mB6auU5Lit5paH5qiZ6aGM56S65L6L"
    
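The encoding step above can be sketched with the Python standard library (the encoded string is simply the standard base64 of the sample title's UTF-8 bytes; note that the base64Decode mapping function also accepts a useHttpServerUtilityUrlTokenDecode parameter controlling which base64 variant it expects, so check the field-mappings docs if the decoded value comes out wrong):

```python
import base64

title = "繁體中文標題示例"

# Blob metadata values must be ASCII, so store the UTF-8 bytes base64-encoded:
encoded = base64.b64encode(title.encode("utf-8")).decode("ascii")
print(encoded)  # 57mB6auU5Lit5paH5qiZ6aGM56S65L6L

# The indexer's base64Decode mapping function reverses this at indexing time;
# locally the round trip looks like:
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == title
```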

    In your indexer fieldMapping set mappingFunction to base64Decode. Example fieldMapping:

    {
      "sourceFieldName": "metadata_title",
      "targetFieldName": "title",
      "mappingFunction": { "name": "base64Decode" }
    }

Make sure your index field title is of type Edm.String, searchable, and uses your zh-Hant analyzer (you already set the analyzer to zh-Hant.microsoft). After the indexer runs, it will decode the base64 metadata and write the proper Unicode string into the index. See the examples in the docs, and similar Q&A threads where base64Decode is used.

    Reference :

    https://stackoverflow.com/questions/71233714/ansi-encoded-text-file-in-blob-storage-has-corrupted-characters-when-reading-it

    https://learn.microsoft.com/en-us/azure/search/search-blob-storage-integration

    https://learn.microsoft.com/en-us/azure/search/search-blob-metadata-properties

    Kindly let us know if the above comment helps or you need further assistance on this issue.

    Please "upvote" if the information helped you. This will help us and others in the community as well.

    1 person found this answer helpful.

Answer accepted by question author
  1. Jerald Felix 9,835 Reputation points
    2025-11-20T10:57:56.6533333+00:00

    Hello Hub=-For-=Dev,

    Thanks for raising this question in Q&A forum.

    I understand you are encountering garbled characters (mojibake) when indexing Chinese text from HTML files in Azure Blob Storage, despite configuring the zh-Hant.microsoft analyzer.

    This issue is typically not related to the search analyzer (which tokenizes text for querying) but rather the encoding detection during the document cracking phase by the indexer. If the source HTML file does not declare its encoding or is saved in a format the indexer misinterprets (e.g., GB2312 read as UTF-8 or vice versa), the characters will be garbled before they ever reach your index.

    Here are the steps to resolve this:

    1. Enforce Encoding: You can force the indexer to use a specific encoding configuration if you know your source files are consistent (e.g., UTF-8). In your indexer definition, under parameters, set the configuration property:

          "parameters": {
            "configuration": {
              "encoding": "utf-8" // or "gb18030" if your source is a legacy Chinese encoding
            }
          }

    2. Verify Source File Meta Tags: Ensure your HTML files actually contain the correct charset meta tag. The Azure AI Search blob indexer tries to detect encoding from the Byte Order Mark (BOM) first, then the Content-Type header, and then HTML meta tags. Example: <meta charset="UTF-8">
    3. Base64 Encoding (if passing content manually): If you were pushing data (which you aren't, but for reference), ensure you aren't double-encoding strings. Since you are using the blob indexer, this is less likely, but ensure the blob's Content-Type in Azure Storage is set correctly (e.g., text/html).
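As a quick local check of the BOM-based detection mentioned in step 2, Python's utf-8-sig codec writes the UTF-8 byte order mark for you. A stdlib-only sketch, not Azure-specific:

```python
import codecs
import os
import tempfile

html = '<meta charset="utf-8"><title>繁體中文標題示例</title>'

# "utf-8-sig" prepends the UTF-8 byte order mark, which encoding sniffers
# check before falling back to headers or meta tags:
path = os.path.join(tempfile.mkdtemp(), "page.html")
with open(path, "w", encoding="utf-8-sig") as f:
    f.write(html)

with open(path, "rb") as f:
    raw = f.read()

assert raw.startswith(codecs.BOM_UTF8)   # b'\xef\xbb\xbf'
assert raw.decode("utf-8-sig") == html   # BOM is stripped on decode
```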

If this helps, please approve the answer.

    Best Regards,

    Jerald Felix


1 additional answer

Sort by: Most helpful
  1. Hub=-For-=Dev 80 Reputation points
    2025-11-20T13:51:16.7433333+00:00

Actually, both methods helped, but RAMAMURTHY asked me to upvote his answer, so I did.

    Thanks all!

