PreTokenizer.CreateWordOrNonWord Method

Definition

Creates a new instance of the PreTokenizer class that splits the text at word or non-word boundaries. A word is a sequence of alphabetic, numeric, and underscore characters.

public static Microsoft.ML.Tokenizers.PreTokenizer CreateWordOrNonWord(System.Collections.Generic.IReadOnlyDictionary<string,int>? specialTokens = default);
static member CreateWordOrNonWord : System.Collections.Generic.IReadOnlyDictionary<string, int> -> Microsoft.ML.Tokenizers.PreTokenizer
Public Shared Function CreateWordOrNonWord (Optional specialTokens As IReadOnlyDictionary(Of String, Integer) = Nothing) As PreTokenizer

Parameters

specialTokens
IReadOnlyDictionary<String,Int32>

The dictionary containing the special tokens and their corresponding ids.

Returns

The pre-tokenizer that splits the text at word or non-word boundaries.

Remarks

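This pre-tokenizer uses the regex pattern "\w+|[^\w\s]+" to split the text into tokens. The first alternative matches a run of word characters; the second matches a run of characters that are neither word characters nor whitespace. Whitespace matches neither alternative, so it is not part of any token.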
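The effect of this pattern can be illustrated without the tokenizer library. The following is a minimal sketch that applies the same regex through System.Text.RegularExpressions only. The class name WordOrNonWordSketch and the method Split are hypothetical names introduced for this illustration, and the sketch does not handle specialTokens. It is not the library's implementation.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of Microsoft.ML.Tokenizers.
public static class WordOrNonWordSketch
{
    // The pattern quoted in the Remarks section.
    private static readonly Regex Pattern = new Regex(@"\w+|[^\w\s]+");

    // Returns the offset, length, and text of every match, in order of appearance.
    // Assumption: special tokens are ignored here.
    public static List<(int Offset, int Length, string Text)> Split(string text)
    {
        var pieces = new List<(int Offset, int Length, string Text)>();
        foreach (Match match in Pattern.Matches(text))
        {
            pieces.Add((match.Index, match.Length, match.Value));
        }
        return pieces;
    }
}
```

For the input "Hello, world_1!!", Split yields four pieces: "Hello" at offset 0, "," at offset 5, "world_1" at offset 7, and "!!" at offset 14. The space at offset 6 is skipped, and the underscore and digit stay inside the word.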

Applies to