Hi @Hub=-For-=Dev,
Thanks for posting your question in the Microsoft Q&A forum.
You're experiencing garbled characters because the indexer is pulling the title from a source that either doesn't properly preserve UTF-8 encoding or doesn't support non-ASCII characters at all. This commonly happens for two reasons, and each reason has a reliable fix.
If the title comes from inside the HTML file: The HTML file may not declare UTF-8 encoding, or the blob's Content-Type header may be missing or carry the wrong charset. (Content-Encoding describes compression such as gzip, not the character set, so the charset belongs in Content-Type.) When this happens, the indexer can read the text using the wrong encoding and the Chinese characters become corrupted.
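To make the failure mode concrete, here is a minimal Python sketch of UTF-8 bytes being read with the wrong decoder. The choice of Latin-1 as the "wrong" encoding is only an assumption for illustration; the encoding an indexer actually falls back to may differ.

```python
# Sketch: the same bytes decoded two ways.
title = "繁體中文標題示例"
utf8_bytes = title.encode("utf-8")  # 8 characters -> 24 bytes

# Assumption for illustration: the reader guesses a single-byte encoding.
garbled = utf8_bytes.decode("latin-1")

# Reading the bytes as UTF-8 gives back the original title.
correct = utf8_bytes.decode("utf-8")

print(repr(garbled))  # one junk character per byte
print(correct)
```

The bytes in storage are identical in both cases; only the declared (or guessed) encoding changes what ends up in the index.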
Here is what you can do:
The indexer will parse the HTML and extract <title>, <meta name="description">, etc. If the HTML file declares the encoding and the blob reports UTF-8, the text will be indexed correctly.
In the <head> of the HTML, include a charset declaration, for example:
<meta charset="utf-8">
<title>繁體中文標題示例</title>
Ensure the blob's Content-Type header includes UTF-8, for example:
Content-Type: text/html; charset=utf-8
You can set this in Azure Storage Explorer, the Azure portal, or when uploading via the SDK or CLI. If the blob's Content-Type is missing or set incorrectly, the indexer may assume another encoding.
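Before uploading, both declarations can be checked locally. The following is a sketch using only the Python standard library; the helper names (html_declared_charset, header_charset, ready_for_upload) are hypothetical, and it only looks for the <meta charset="..."> form shown above.

```python
from email.message import Message
from html.parser import HTMLParser


class _CharsetFinder(HTMLParser):
    """Records the first <meta charset="..."> value it sees."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta" and self.charset is None:
            value = dict(attrs).get("charset")
            if value:
                self.charset = value.strip().lower()


def html_declared_charset(html_text):
    """Return the charset declared in the HTML, or None if absent."""
    finder = _CharsetFinder()
    finder.feed(html_text)
    return finder.charset


def header_charset(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def ready_for_upload(html_text, content_type):
    """True only if both the HTML and the header say UTF-8."""
    return (html_declared_charset(html_text) == "utf-8"
            and header_charset(content_type) == "utf-8")


page = '<html><head><meta charset="utf-8"><title>繁體中文標題示例</title></head></html>'
print(ready_for_upload(page, "text/html; charset=utf-8"))
print(ready_for_upload(page, "text/html"))
```

If you upload with the azure-storage-blob Python package, the header value is normally supplied through ContentSettings(content_type=...) passed as content_settings to upload_blob; check the SDK version you use for the exact call.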
Ensure your indexer's parsing mode is appropriate so the indexer extracts the <title> into the content or into the metadata properties it reports. Then map that extracted property to your title index field, or extract it from the content using a skillset. To confirm, check the indexer run output or document preview for the raw text the indexer extracted.
You can use the above when you control the HTML files.
If the title comes from blob metadata (metadata_title): Azure Blob metadata values follow HTTP header rules, which traditionally do not allow raw non-ASCII characters. Because of this, Chinese characters stored directly in metadata become corrupted or appear garbled during indexing. To work around this limitation, developers typically base64-encode metadata values before storing them and then decode them during indexing to preserve the original characters.
When uploading the blob, base64-encode the UTF-8 bytes of the title and put that string into the metadata, for example (pseudocode):
metadata_title = base64("繁體中文標題示例") => "57mB6auU5Lit5paH5qiZ6aGM56S65L6L"
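The encoding step might look like this in Python. This is a sketch with hypothetical helper names; it uses the URL-safe Base64 alphabet on the assumption that this is what the decode step expects, which you should verify against the field mapping documentation.

```python
import base64


def encode_metadata_value(text):
    """Return an ASCII-only, URL-safe Base64 string for a Unicode value."""
    return base64.urlsafe_b64encode(text.encode("utf-8")).decode("ascii")


def decode_metadata_value(encoded):
    """Inverse of encode_metadata_value, for local verification."""
    return base64.urlsafe_b64decode(encoded.encode("ascii")).decode("utf-8")


title = "繁體中文標題示例"

# The key name follows the pseudocode above; use whatever key your
# indexer field mapping reads from.
metadata = {"metadata_title": encode_metadata_value(title)}

print(metadata["metadata_title"])
print(decode_metadata_value(metadata["metadata_title"]))
```

This title is 24 bytes in UTF-8, so the encoded value has no "=" padding. Other lengths will produce padding, and whether the decoder wants it kept or stripped depends on the decode settings.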
In your indexer's fieldMappings, set mappingFunction to base64Decode. Example field mapping:
{
  "sourceFieldName": "metadata_title",
  "targetFieldName": "title",
  "mappingFunction": { "name": "base64Decode" }
}
Make sure your index field title is of type Edm.String, is searchable, and uses your zh-Hant analyzer (you already set the analyzer to zh-Hant.microsoft). After the indexer runs, it will decode the base64 metadata and write the proper Unicode string into the index. The docs and similar Q&A threads include examples where base64Decode is used this way.
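If you generate the indexer definition from code, the mapping can be built and sanity-checked locally. The sketch below uses hypothetical helper names, and preview_decoded is only a local approximation of the decode step, not the service's implementation. As far as I recall, the field mapping functions reference also documents an optional useHttpServerUtilityUrlTokenDecode parameter that controls which Base64 variant is expected; it is left out here to match the mapping above, so check that page against how you encoded the value.

```python
import base64
import json


def build_title_mapping(source_field="metadata_title", target_field="title"):
    """Build one fieldMappings entry that decodes Base64 into the title field."""
    return {
        "sourceFieldName": source_field,
        "targetFieldName": target_field,
        "mappingFunction": {"name": "base64Decode"},
    }


def preview_decoded(encoded):
    """Local approximation of the decode step, tolerant of stripped padding."""
    padded = encoded + "=" * (-len(encoded) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


indexer_fragment = {"fieldMappings": [build_title_mapping()]}

print(json.dumps(indexer_fragment, indent=2))
print(preview_decoded("5Lit5paH"))  # prints 中文
```

Comparing preview_decoded on a stored metadata value with the title that shows up in the index is a quick way to tell an encoding problem at upload time from a decoding problem in the mapping.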
References:
https://learn.microsoft.com/en-us/azure/search/search-blob-storage-integration
https://learn.microsoft.com/en-us/azure/search/search-blob-metadata-properties
Kindly let us know if the above comment helps or you need further assistance on this issue.
Please "upvote" if the information helped you. This will help us and others in the community as well.