WordPieceTokenizer Class
Definition
Important
Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.
Represents the WordPiece tokenizer.
public class WordPieceTokenizer : Microsoft.ML.Tokenizers.Tokenizer
type WordPieceTokenizer = class
inherit Tokenizer
Public Class WordPieceTokenizer
Inherits Tokenizer
Remarks
The WordPiece tokenizer is a sub-word tokenizer used in BERT and other transformer models. The implementation is based on the Hugging Face WordPiece tokenizer: https://huggingface.co/docs/tokenizers/api/models#tokenizers.models.WordPiece.
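To make the sub-word splitting concrete, the following is a minimal sketch of the greedy longest-match-first strategy commonly associated with WordPiece. It is not the Microsoft.ML.Tokenizers implementation. The class `WordPieceSketch`, the method `TokenizeWord`, and the default values (`[UNK]`, `##`, 100) are hypothetical and follow BERT conventions; the parameter names merely mirror the UnknownToken, ContinuingSubwordPrefix, and MaxInputCharsPerWord properties listed on this page. The sketch also measures word length in UTF-16 code units, which may differ from how the library counts characters.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; NOT the Microsoft.ML.Tokenizers implementation.
public static class WordPieceSketch
{
    // Splits a single pre-tokenized word into vocabulary pieces using
    // greedy longest-match-first. Default values are assumptions based on
    // BERT conventions, not values confirmed by this page.
    public static List<string> TokenizeWord(
        string word,
        ISet<string> vocabulary,
        string unknownToken = "[UNK]",
        string continuingSubwordPrefix = "##",
        int maxInputCharsPerWord = 100)
    {
        var pieces = new List<string>();

        // A word longer than the limit is mapped to the unknown token as a whole.
        if (word.Length > maxInputCharsPerWord)
        {
            pieces.Add(unknownToken);
            return pieces;
        }

        int start = 0;
        while (start < word.Length)
        {
            int end = word.Length;
            string match = string.Empty;
            bool found = false;

            // Try the longest remaining substring first, then shrink from the right.
            while (start < end)
            {
                string candidate = word.Substring(start, end - start);
                if (start > 0)
                {
                    // Pieces that do not begin the word carry the continuing prefix.
                    candidate = continuingSubwordPrefix + candidate;
                }

                if (vocabulary.Contains(candidate))
                {
                    match = candidate;
                    found = true;
                    break;
                }

                end--;
            }

            // If nothing matches at this position, the whole word becomes unknown.
            if (!found)
            {
                return new List<string> { unknownToken };
            }

            pieces.Add(match);
            start = end;
        }

        return pieces;
    }
}
```

With a vocabulary containing only `un`, `##aff`, and `##able`, calling `WordPieceSketch.TokenizeWord("unaffable", vocabulary)` yields the pieces `un`, `##aff`, `##able`, while a word with no matching piece, such as `xyz`, yields the single unknown token.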
Properties
| ContinuingSubwordPrefix |
Gets the prefix to use for sub-words that are not the first part of a word. |
| MaxInputCharsPerWord |
Gets the maximum number of characters allowed in a single word. |
| Normalizer |
Gets the Normalizer in use by the Tokenizer. |
| PreTokenizer |
Gets the PreTokenizer used by the Tokenizer. |
| SpecialTokens |
Gets the special tokens and their corresponding ids. |
| UnknownToken |
Gets the unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. |
| UnknownTokenId |
Gets the unknown token ID. A token that is not in the vocabulary cannot be converted to an ID and is assigned this ID instead. |
Methods
| CountTokens(ReadOnlySpan<Char>, Boolean, Boolean) |
Get the number of tokens that the input text will be encoded to. (Inherited from Tokenizer) |
| CountTokens(String, Boolean, Boolean) |
Get the number of tokens that the input text will be encoded to. (Inherited from Tokenizer) |
| CountTokens(String, ReadOnlySpan<Char>, EncodeSettings) |
Get the number of tokens that the input text will be encoded to. |
| Create(Stream, WordPieceOptions) |
Create a new instance of the WordPieceTokenizer class. |
| Create(String, WordPieceOptions) |
Create a new instance of the WordPieceTokenizer class. |
| CreateAsync(Stream, WordPieceOptions, CancellationToken) |
Create a new instance of the WordPieceTokenizer class asynchronously. |
| CreateAsync(String, WordPieceOptions, CancellationToken) |
Create a new instance of the WordPieceTokenizer class asynchronously. |
| Decode(IEnumerable<Int32>, Boolean) |
Decodes the given IDs back to a String. |
| Decode(IEnumerable<Int32>, Span<Char>, Boolean, Int32, Int32) |
Decodes the given IDs back to text and stores the result in the provided Span<Char> buffer. |
| Decode(IEnumerable<Int32>, Span<Char>, Int32, Int32) |
Decodes the given IDs back to text and stores the result in the provided Span<Char> buffer. |
| Decode(IEnumerable<Int32>) |
Decodes the given IDs back to a String. |
| EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean) |
Encodes input text to token Ids. (Inherited from Tokenizer) |
| EncodeToIds(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean) |
Encodes input text to token Ids up to maximum number of tokens. (Inherited from Tokenizer) |
| EncodeToIds(String, Boolean, Boolean) |
Encodes input text to token Ids. (Inherited from Tokenizer) |
| EncodeToIds(String, Int32, String, Int32, Boolean, Boolean) |
Encodes input text to token Ids up to maximum number of tokens. (Inherited from Tokenizer) |
| EncodeToIds(String, ReadOnlySpan<Char>, EncodeSettings) |
Encodes input text to token Ids. |
| EncodeToTokens(ReadOnlySpan<Char>, String, Boolean, Boolean) |
Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer) |
| EncodeToTokens(String, ReadOnlySpan<Char>, EncodeSettings) |
Encodes input text to a list of EncodedTokens. |
| EncodeToTokens(String, String, Boolean, Boolean) |
Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer) |
| GetIndexByTokenCount(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean) |
Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer) |
| GetIndexByTokenCount(String, Int32, String, Int32, Boolean, Boolean) |
Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer) |
| GetIndexByTokenCount(String, ReadOnlySpan<Char>, EncodeSettings, Boolean, String, Int32) |
Find the index of the maximum encoding capacity without surpassing the token limit. |
| GetIndexByTokenCountFromEnd(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean) |
Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer) |
| GetIndexByTokenCountFromEnd(String, Int32, String, Int32, Boolean, Boolean) |
Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer) |