WordPieceTokenizer Class

Definition

Namespace: Microsoft.ML.Tokenizers

Assembly: Microsoft.ML.Tokenizers.dll

Package: Microsoft.ML.Tokenizers v1.0.1

Package: Microsoft.ML.Tokenizers v0.22.0

Package: Microsoft.ML.Tokenizers v2.0.0-preview.1.25125.4

Source: WordPieceTokenizer.cs

Important

Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.

Represents the WordPiece tokenizer.

public class WordPieceTokenizer : Microsoft.ML.Tokenizers.Tokenizer

type WordPieceTokenizer = class
    inherit Tokenizer

Public Class WordPieceTokenizer
Inherits Tokenizer

Inheritance: Object → Tokenizer → WordPieceTokenizer

Derived: Microsoft.ML.Tokenizers.BertTokenizer

Remarks

The WordPiece tokenizer is a sub-word tokenizer used in BERT and other transformer models. The implementation is based on the Hugging Face WordPiece tokenizer (see https://huggingface.co/docs/tokenizers/api/models#tokenizers.models.WordPiece).

Properties

ContinuingSubwordPrefix	Gets the prefix to use for sub-words that are not the first part of a word.
MaxInputCharsPerWord	Gets the maximum number of characters allowed in a single word.
Normalizer	Gets the Normalizer in use by the Tokenizer.
PreTokenizer	Gets the PreTokenizer used by the Tokenizer.
SpecialTokens	Gets the special tokens and their corresponding ids.
UnknownToken	Gets the unknown token. A token that is not in the vocabulary cannot be converted to an ID and is replaced by this token instead.
UnknownTokenId	Gets the unknown token ID. A token that is not in the vocabulary cannot be converted to an ID and is assigned this ID instead.
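
To illustrate how ContinuingSubwordPrefix, MaxInputCharsPerWord, and UnknownToken typically interact, here is a minimal sketch of the greedy longest-match-first splitting commonly used by WordPiece tokenizers. It is not the library's source code: the `WordPieceSketch` class and `SplitWord` method are hypothetical names, the default values shown ("[UNK]", "##", 100) are assumptions borrowed from common BERT conventions, and the real implementation may differ in its details.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration only; not part of Microsoft.ML.Tokenizers.
public static class WordPieceSketch
{
    // Splits one pre-tokenized word into sub-word pieces using
    // greedy longest-match-first lookup against the vocabulary.
    public static List<string> SplitWord(
        string word,
        ISet<string> vocabulary,
        string unknownToken = "[UNK]",          // assumed default
        string continuingSubwordPrefix = "##",  // assumed default
        int maxInputCharsPerWord = 100)         // assumed default
    {
        // A word longer than the limit is not split; it becomes the unknown token.
        if (word.Length > maxInputCharsPerWord)
        {
            return new List<string> { unknownToken };
        }

        var pieces = new List<string>();
        int start = 0;
        while (start < word.Length)
        {
            int end = word.Length;
            string piece = "";
            bool found = false;

            // Try the longest remaining substring first, then shrink from the right.
            while (end > start)
            {
                string candidate = word.Substring(start, end - start);
                if (start > 0)
                {
                    // Pieces that do not start the word carry the continuing prefix.
                    candidate = continuingSubwordPrefix + candidate;
                }
                if (vocabulary.Contains(candidate))
                {
                    piece = candidate;
                    found = true;
                    break;
                }
                end--;
            }

            // If any part of the word has no match, the whole word becomes the unknown token.
            if (!found)
            {
                return new List<string> { unknownToken };
            }

            pieces.Add(piece);
            start = end;
        }
        return pieces;
    }
}
```

With a vocabulary containing `[UNK]`, `un`, `##aff`, and `##able`, calling `WordPieceSketch.SplitWord("unaffable", vocabulary)` returns the pieces `un`, `##aff`, `##able`, while a word with no matching piece, such as `xyz`, returns the single piece `[UNK]`.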

Methods

CountTokens(ReadOnlySpan<Char>, Boolean, Boolean)	Get the number of tokens that the input text will be encoded to. (Inherited from Tokenizer)
CountTokens(String, Boolean, Boolean)	Get the number of tokens that the input text will be encoded to. (Inherited from Tokenizer)
CountTokens(String, ReadOnlySpan<Char>, EncodeSettings)	Get the number of tokens that the input text will be encoded to.
Create(Stream, WordPieceOptions)	Create a new instance of the WordPieceTokenizer class.
Create(String, WordPieceOptions)	Create a new instance of the WordPieceTokenizer class.
CreateAsync(Stream, WordPieceOptions, CancellationToken)	Create a new instance of the WordPieceTokenizer class asynchronously.
CreateAsync(String, WordPieceOptions, CancellationToken)	Create a new instance of the WordPieceTokenizer class asynchronously.
Decode(IEnumerable<Int32>, Boolean)	Decode the given ids back to a String.
Decode(IEnumerable<Int32>, Span<Char>, Boolean, Int32, Int32)	Decode the given ids back to text and store the result in the `destination` span.
Decode(IEnumerable<Int32>, Span<Char>, Int32, Int32)	Decode the given ids back to text and store the result in the `destination` span.
Decode(IEnumerable<Int32>)	Decode the given ids back to a String.
EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean)	Encodes input text to token Ids. (Inherited from Tokenizer)
EncodeToIds(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)	Encodes input text to token Ids up to maximum number of tokens. (Inherited from Tokenizer)
EncodeToIds(String, Boolean, Boolean)	Encodes input text to token Ids. (Inherited from Tokenizer)
EncodeToIds(String, Int32, String, Int32, Boolean, Boolean)	Encodes input text to token Ids up to maximum number of tokens. (Inherited from Tokenizer)
EncodeToIds(String, ReadOnlySpan<Char>, EncodeSettings)	Encodes input text to token Ids.
EncodeToTokens(ReadOnlySpan<Char>, String, Boolean, Boolean)	Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer)
EncodeToTokens(String, ReadOnlySpan<Char>, EncodeSettings)	Encodes input text to a list of EncodedTokens.
EncodeToTokens(String, String, Boolean, Boolean)	Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer)
GetIndexByTokenCount(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCount(String, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCount(String, ReadOnlySpan<Char>, EncodeSettings, Boolean, String, Int32)	Find the index of the maximum encoding capacity without surpassing the token limit.
GetIndexByTokenCountFromEnd(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCountFromEnd(String, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
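
The following sketch shows one way the members listed above might be combined: create a tokenizer from a vocabulary file, encode text to IDs, count tokens, and decode the IDs back to text. It uses only overloads named in the table (Create(String, WordPieceOptions), EncodeToIds(String, Boolean, Boolean), CountTokens(String, Boolean, Boolean), and Decode(IEnumerable<Int32>)). Several details are assumptions rather than documented facts: that the vocabulary file holds one token per line, that a default-constructed WordPieceOptions uses "[UNK]" as the unknown token, and that the two Boolean arguments control pre-tokenization and normalization. The sample file name and vocabulary entries are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.ML.Tokenizers;

public static class WordPieceUsage
{
    public static void Main()
    {
        // Assumption: the vocabulary file lists one token per line.
        // The file name and tokens below are illustrative only.
        string vocabPath = Path.Combine(Path.GetTempPath(), "wordpiece-vocab-sample.txt");
        File.WriteAllLines(vocabPath, new[] { "[UNK]", "un", "##aff", "##able", "hello", "world" });

        // Assumption: default options use "[UNK]" as the unknown token,
        // so it is included in the sample vocabulary above.
        WordPieceTokenizer tokenizer = WordPieceTokenizer.Create(vocabPath, new WordPieceOptions());

        string text = "hello unaffable world";

        // The two Boolean arguments are assumed to enable pre-tokenization and normalization.
        IReadOnlyList<int> ids = tokenizer.EncodeToIds(text, true, true);
        int count = tokenizer.CountTokens(text, true, true);

        // Decode the IDs back to a string.
        string decoded = tokenizer.Decode(ids);

        Console.WriteLine(string.Join(", ", ids));
        Console.WriteLine(count);
        Console.WriteLine(decoded);
    }
}
```

CountTokens is worth preferring over `EncodeToIds(...).Count` when only the length is needed, since the table describes it as returning the token count directly.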

Applies to