PreTokenizer.CreateWordOrNonWord Method

Definition

Namespace:: Microsoft.ML.Tokenizers

Assembly:: Microsoft.ML.Tokenizers.dll

Package:: Microsoft.ML.Tokenizers v1.0.1

Package:: Microsoft.ML.Tokenizers v0.22.0

Package:: Microsoft.ML.Tokenizers v2.0.0-preview.1.25125.4

Source:: PreTokenizer.cs


Important

Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.

Creates a new instance of the PreTokenizer class that splits the text at word and non-word boundaries. A word is a run of alphabetic, numeric, and underscore characters.

public static Microsoft.ML.Tokenizers.PreTokenizer CreateWordOrNonWord(System.Collections.Generic.IReadOnlyDictionary<string,int>? specialTokens = default);

static member CreateWordOrNonWord : System.Collections.Generic.IReadOnlyDictionary<string, int> -> Microsoft.ML.Tokenizers.PreTokenizer

Public Shared Function CreateWordOrNonWord (Optional specialTokens As IReadOnlyDictionary(Of String, Integer) = Nothing) As PreTokenizer

Parameters

specialTokens: IReadOnlyDictionary<String,Int32>

An optional dictionary that maps the special tokens to their corresponding IDs. Defaults to null.

Returns

PreTokenizer

A pre-tokenizer that splits the text at word and non-word boundaries.

Remarks

This pre-tokenizer uses the regex pattern "\w+|[^\w\s]+" to split the text into tokens.
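
To make the effect of this pattern concrete, the following sketch applies the same regular expression with System.Text.RegularExpressions directly. It does not call the pre-tokenizer itself and does not model any handling of specialTokens; the sample text and variable names are illustrative assumptions. Whitespace matches neither alternative of the pattern, so it does not appear in any piece.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Pattern quoted in the remarks: a run of word characters, or a run of
// characters that are neither word characters nor whitespace.
var pattern = new Regex(@"\w+|[^\w\s]+");

// Illustrative input (an assumption, not taken from the library documentation).
string text = "Hello, world_1!!";

// Record each match together with its offset and length in the input.
var pieces = pattern.Matches(text)
    .Cast<Match>()
    .Select(m => (Offset: m.Index, Length: m.Length, Value: m.Value))
    .ToArray();

foreach (var piece in pieces)
{
    Console.WriteLine($"{piece.Offset} {piece.Length} '{piece.Value}'");
}

// Prints:
// 0 5 'Hello'
// 5 1 ','
// 7 7 'world_1'
// 14 2 '!!'
```

The underscore and digit stay inside the word piece because both belong to \w, while the consecutive punctuation characters are grouped into a single non-word piece.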
