How can I make sure the indexer indexes Chinese words correctly instead of showing garbled characters?

Hub=-For-=Dev 80 Reputation points
2025-11-20T10:03:42.5866667+00:00

Current setup:

  • I have an AI Search service set up.
  • I have defined an index (let's say we have a field named title).
    • In the title field of our index, I have already configured the analyzer as zh-Hant.microsoft.
  • I have configured a data source and an indexer.
    • I have defined a fieldMapping as below. Please note that metadata_title is the content metadata of the HTML file.
              {
                "sourceFieldName": "metadata_title",
                "targetFieldName": "title",
                "mappingFunction": null
              },
      

The flow is basically as below:

  1. We put an HTML page whose title contains Chinese characters in Azure Blob Storage.
  2. We manually run the indexer to crawl the title of the HTML page.
  3. However, when I search in Search Explorer, I see garbled characters.


Can I have some advice on that?

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.

Answer accepted by question author
  1. RAMAMURTHY MAKARAPU 1,125 Reputation points Microsoft External Staff Moderator
    2025-11-20T11:27:07.1633333+00:00

    Hi @Hub=-For-=Dev,

    Thanks for posting your question in the Microsoft Q&A forum.

    You're experiencing garbled characters because the indexer is pulling the title from a source that either doesn't properly preserve UTF-8 encoding or doesn't support non-ASCII characters at all. This commonly happens for two reasons, and each reason has a reliable fix.

    If the title comes from inside the HTML file: The HTML file may not declare UTF-8 encoding correctly, or the blob may not have the correct Content-Type or Content-Encoding headers. When this happens, the indexer reads the text using the wrong encoding and the Chinese characters become corrupted.
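To see this failure mode concretely, here is a small, Azure-independent Python sketch: UTF-8 bytes decoded with the wrong codec produce exactly this kind of mojibake (Latin-1 stands in for whatever fallback encoding a reader might assume).

```python
title = "繁體中文標題示例"
utf8_bytes = title.encode("utf-8")

# A reader that wrongly assumes Latin-1 turns each UTF-8 byte into a
# separate accented character, i.e. the garbled text you are seeing:
garbled = utf8_bytes.decode("latin-1")
assert garbled != title

# Reading with the declared/actual encoding recovers the original text:
assert utf8_bytes.decode("utf-8") == title
```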

What you can do is:

    The indexer will parse the HTML and extract <title>, <meta name="description">, etc. If the HTML file declares the encoding and the blob reports UTF-8, the text will be indexed correctly.

In the <head> of the HTML, include a charset declaration, for example:

    <meta charset="utf-8">
    <title>繁體中文標題示例</title>
    

Ensure the blob's Content-Type header includes UTF-8, for example:

    Content-Type: text/html; charset=utf-8
    

You can set this in Azure Storage Explorer, the Azure Portal, or when uploading via SDK/CLI. If the blob's Content-Type is missing or set incorrectly, the indexer may assume another encoding.
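If you upload via the raw REST API rather than an SDK, the Content-Type is pinned at upload time through the x-ms-blob-content-type header of the Put Blob operation. A minimal sketch of just the relevant headers (put_blob_headers is a hypothetical helper; authentication headers and the HTTP request itself are omitted):

```python
def put_blob_headers(content_type: str = "text/html; charset=utf-8") -> dict:
    """Headers for the Blob service 'Put Blob' REST operation that pin the
    blob's Content-Type (auth headers omitted for brevity)."""
    return {
        "x-ms-blob-type": "BlockBlob",
        # Stored with the blob and returned as Content-Type on download,
        # so the indexer sees an explicit UTF-8 declaration:
        "x-ms-blob-content-type": content_type,
    }

headers = put_blob_headers()
assert "charset=utf-8" in headers["x-ms-blob-content-type"]
```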

Ensure your indexer parsing mode is appropriate, so that the indexer extracts the <title> into the content or into the metadata properties it reports. Then map that extracted property to your title index field, or extract it from the content using a skillset. Check the indexer run output / document preview for the raw text the indexer extracted.

    You can use the above when you control the HTML files.

If the title comes from blob metadata (metadata_title): Azure Blob metadata values follow HTTP header rules, which traditionally do not allow raw non-ASCII characters. Because of this, storing Chinese characters directly in metadata causes them to be corrupted or appear garbled during indexing. To work around this limitation, developers typically base64-encode metadata values before storing them and then decode them during indexing to preserve the original characters.

When uploading the blob, base64-encode the title and put that string into metadata, e.g. (pseudo):

metadata_title = base64("繁體中文標題示例")  => "57mB6auU5Lit5paH5qiZ6aGM56S65L6L"
    
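The encoding step above can be sketched with the Python standard library (the encoded string is simply the standard base64 of the sample title's UTF-8 bytes; note that the base64Decode mapping function also accepts a useHttpServerUtilityUrlTokenDecode parameter controlling which base64 variant it expects, so check the field-mappings docs if the decoded value comes out wrong):

```python
import base64

title = "繁體中文標題示例"

# Blob metadata values must be ASCII, so store the UTF-8 bytes base64-encoded:
encoded = base64.b64encode(title.encode("utf-8")).decode("ascii")
print(encoded)  # 57mB6auU5Lit5paH5qiZ6aGM56S65L6L

# The indexer's base64Decode mapping function reverses this at indexing time;
# locally the round trip looks like:
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == title
```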

    In your indexer fieldMapping set mappingFunction to base64Decode. Example fieldMapping:

    {
      "sourceFieldName": "metadata_title",
      "targetFieldName": "title",
      "mappingFunction": { "name": "base64Decode" }
    }

Make sure your index field title is of type Edm.String, searchable, and uses your zh-Hant analyzer (you already set the analyzer to zh-Hant.microsoft). After the indexer runs, it will decode the base64 metadata and write the proper Unicode string into the index. See the examples in the docs, and similar Q&A threads where base64Decode is used.

    Reference :

    https://stackoverflow.com/questions/71233714/ansi-encoded-text-file-in-blob-storage-has-corrupted-characters-when-reading-it

    https://learn.microsoft.com/en-us/azure/search/search-blob-storage-integration

    https://learn.microsoft.com/en-us/azure/search/search-blob-metadata-properties

    Kindly let us know if the above comment helps or you need further assistance on this issue.

    Please "upvote" if the information helped you. This will help us and others in the community as well.

    1 person found this answer helpful.

Answer accepted by question author
  1. Jerald Felix 9,835 Reputation points
    2025-11-20T10:57:56.6533333+00:00

    Hello Hub=-For-=Dev,

    Thanks for raising this question in Q&A forum.

    I understand you are encountering garbled characters (mojibake) when indexing Chinese text from HTML files in Azure Blob Storage, despite configuring the zh-Hant.microsoft analyzer.

    This issue is typically not related to the search analyzer (which tokenizes text for querying) but rather the encoding detection during the document cracking phase by the indexer. If the source HTML file does not declare its encoding or is saved in a format the indexer misinterprets (e.g., GB2312 read as UTF-8 or vice versa), the characters will be garbled before they ever reach your index.

    Here are the steps to resolve this:

    1. Enforce Encoding: You can force the indexer to use a specific encoding configuration if you know your source files are consistent (e.g., UTF-8). In your indexer definition, under parameters, set the configuration property:

          "parameters": {
            "configuration": {
              "encoding": "utf-8" // or "gb18030" if your source is a legacy Chinese encoding
            }
          }

    2. Verify Source File Meta Tags: Ensure your HTML files actually contain the correct charset meta tag. The Azure AI Search blob indexer tries to detect encoding from the Byte Order Mark (BOM) first, then the Content-Type header, and then HTML meta tags. Example: <meta charset="UTF-8">
    3. Base64 Encoding (if passing content manually): If you were pushing data (which you aren't, but for reference), ensure you aren't double-encoding strings. Since you are using the blob indexer, this is less likely, but ensure the blob's Content-Type in Azure Storage is set correctly (e.g., text/html).
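As a quick local check of the BOM-based detection mentioned in step 2, Python's utf-8-sig codec writes the UTF-8 byte order mark for you. A stdlib-only sketch, not Azure-specific:

```python
import codecs
import os
import tempfile

html = '<meta charset="utf-8"><title>繁體中文標題示例</title>'

# "utf-8-sig" prepends the UTF-8 byte order mark, which encoding sniffers
# check before falling back to headers or meta tags:
path = os.path.join(tempfile.mkdtemp(), "page.html")
with open(path, "w", encoding="utf-8-sig") as f:
    f.write(html)

with open(path, "rb") as f:
    raw = f.read()

assert raw.startswith(codecs.BOM_UTF8)   # b'\xef\xbb\xbf'
assert raw.decode("utf-8-sig") == html   # BOM is stripped on decode
```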

If this helps, please approve the answer.

    Best Regards,

    Jerald Felix


1 additional answer

Sort by: Most helpful
  1. Hub=-For-=Dev 80 Reputation points
    2025-11-20T13:51:16.7433333+00:00

Actually, both methods helped, but RAMAMURTHY asked me to upvote his answer, so I did.

    Thanks all!

