PreTokenizer.CreateWordOrNonWord Method
Definition
Important
Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.
Creates a new instance of the PreTokenizer class that splits text at word and non-word boundaries. A word is a sequence of alphabetic, numeric, and underscore characters.
public static Microsoft.ML.Tokenizers.PreTokenizer CreateWordOrNonWord(System.Collections.Generic.IReadOnlyDictionary<string,int>? specialTokens = default);
static member CreateWordOrNonWord : System.Collections.Generic.IReadOnlyDictionary<string, int> -> Microsoft.ML.Tokenizers.PreTokenizer
Public Shared Function CreateWordOrNonWord (Optional specialTokens As IReadOnlyDictionary(Of String, Integer) = Nothing) As PreTokenizer
Parameters
- specialTokens
- IReadOnlyDictionary<String,Int32>
A dictionary that maps special tokens to their corresponding IDs. This parameter is optional and defaults to null.
Returns
A pre-tokenizer that splits text at word and non-word boundaries.
Remarks
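This pre-tokenizer uses the regex pattern "\w+|[^\w\s]+" to split the text into tokens. The first alternative matches a run of word characters; the second matches a run of characters that are neither word characters nor whitespace. Whitespace matches neither alternative, so it separates tokens but is not part of any token.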
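The effect of this pattern can be seen with a minimal sketch that uses only System.Text.RegularExpressions. It is not the library's implementation: the class name WordOrNonWordSketch and the helper Split are hypothetical names introduced for illustration, and the handling of specialTokens is not modeled. The sketch applies the documented pattern to a string and reports each match as an offset, a length, and the matched text.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Microsoft.ML.Tokenizers.
public static class WordOrNonWordSketch
{
    // The pattern quoted in Remarks: a run of word characters,
    // or a run of characters that are neither word characters nor whitespace.
    private static readonly Regex Pattern = new Regex(@"\w+|[^\w\s]+");

    // Returns every match as (offset, length, text), in order of appearance.
    public static List<(int Offset, int Length, string Text)> Split(string text)
    {
        var pieces = new List<(int Offset, int Length, string Text)>();
        foreach (Match m in Pattern.Matches(text))
        {
            pieces.Add((m.Index, m.Length, m.Value));
        }
        return pieces;
    }

    public static void Main()
    {
        foreach (var piece in Split("Hello, world! foo_bar 42"))
        {
            Console.WriteLine($"{piece.Offset}\t{piece.Length}\t{piece.Text}");
        }
    }
}
```

For the input "Hello, world! foo_bar 42", the sketch yields six pieces: "Hello", ",", "world", "!", "foo_bar", and "42". The underscore stays inside "foo_bar" because \w includes it, and the spaces appear in no piece.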