WordPieceTokenizer Class

Definition

Represents the WordPiece tokenizer.

public class WordPieceTokenizer : Microsoft.ML.Tokenizers.Tokenizer
type WordPieceTokenizer = class
    inherit Tokenizer
Public Class WordPieceTokenizer
Inherits Tokenizer
Inheritance
Object → Tokenizer → WordPieceTokenizer
Derived
Microsoft.ML.Tokenizers.BertTokenizer

Remarks

The WordPiece tokenizer is a sub-word tokenizer used in BERT and other transformer models. The implementation is based on the Hugging Face WordPiece tokenizer: https://huggingface.co/docs/tokenizers/api/models#tokenizers.models.WordPiece
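
To make the sub-word splitting concrete, the following is a minimal sketch of the greedy longest-match-first lookup commonly described for WordPiece. It is an illustration only, not the code of this class: the helper name `SplitWord`, the tiny vocabulary, and the default values `"[UNK]"`, `"##"`, and `100` are assumptions chosen for the example. It handles a single, already pre-tokenized word.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Microsoft.ML.Tokenizers): splits ONE word
// into sub-word pieces using greedy longest-match-first vocabulary lookup.
static List<string> SplitWord(
    string word,
    ISet<string> vocabulary,
    string unknownToken = "[UNK]",          // assumed value for this sketch
    string continuingSubwordPrefix = "##",  // assumed value for this sketch
    int maxInputCharsPerWord = 100)         // assumed value for this sketch
{
    // A word longer than the limit is mapped to the unknown token as a whole.
    if (word.Length > maxInputCharsPerWord)
    {
        return new List<string> { unknownToken };
    }

    var pieces = new List<string>();
    int start = 0;
    while (start < word.Length)
    {
        int end = word.Length;
        string piece = "";
        bool found = false;

        // Try the longest remaining substring first, then shrink from the right.
        while (start < end)
        {
            string candidate = word.Substring(start, end - start);
            if (start > 0)
            {
                // Pieces that do not start the word carry the continuing prefix.
                candidate = continuingSubwordPrefix + candidate;
            }
            if (vocabulary.Contains(candidate))
            {
                piece = candidate;
                found = true;
                break;
            }
            end--;
        }

        // If nothing matches at this position, the whole word becomes unknown.
        if (!found)
        {
            return new List<string> { unknownToken };
        }

        pieces.Add(piece);
        start = end;
    }
    return pieces;
}

// Toy vocabulary, invented for the example.
var demoVocabulary = new HashSet<string> { "[UNK]", "un", "##aff", "##able", "play", "##ing" };

Console.WriteLine(string.Join(" ", SplitWord("unaffable", demoVocabulary))); // un ##aff ##able
Console.WriteLine(string.Join(" ", SplitWord("playing", demoVocabulary)));   // play ##ing
Console.WriteLine(string.Join(" ", SplitWord("xyz", demoVocabulary)));       // [UNK]
```

The three parameters in the sketch play the roles described by the ContinuingSubwordPrefix, MaxInputCharsPerWord, and UnknownToken properties below.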

Properties

ContinuingSubwordPrefix

Gets the prefix to use for sub-words that are not the first part of a word.

MaxInputCharsPerWord

Gets the maximum number of characters allowed in a single word.

Normalizer

Gets the Normalizer in use by the Tokenizer.

PreTokenizer

Gets the PreTokenizer used by the Tokenizer.

SpecialTokens

Gets the special tokens and their corresponding ids.

UnknownToken

Gets the unknown token. A token that is not in the vocabulary cannot be converted to an ID and is replaced by this token.

UnknownTokenId

Gets the unknown token ID. A token that is not in the vocabulary cannot be converted to an ID and is assigned this ID instead.

Methods

CountTokens(ReadOnlySpan<Char>, Boolean, Boolean)

Get the number of tokens that the input text will be encoded to.

(Inherited from Tokenizer)
CountTokens(String, Boolean, Boolean)

Get the number of tokens that the input text will be encoded to.

(Inherited from Tokenizer)
CountTokens(String, ReadOnlySpan<Char>, EncodeSettings)

Get the number of tokens that the input text will be encoded to.

Create(Stream, WordPieceOptions)

Create a new instance of the WordPieceTokenizer class.

Create(String, WordPieceOptions)

Create a new instance of the WordPieceTokenizer class.

CreateAsync(Stream, WordPieceOptions, CancellationToken)

Create a new instance of the WordPieceTokenizer class asynchronously.

CreateAsync(String, WordPieceOptions, CancellationToken)

Create a new instance of the WordPieceTokenizer class asynchronously.

Decode(IEnumerable<Int32>, Boolean)

Decode the given ids back to a String.

Decode(IEnumerable<Int32>, Span<Char>, Boolean, Int32, Int32)

Decode the given ids back to text and store the result in the destination span.

Decode(IEnumerable<Int32>, Span<Char>, Int32, Int32)

Decode the given ids back to text and store the result in the destination span.

Decode(IEnumerable<Int32>)

Decode the given ids back to a String.

EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean)

Encodes input text to token Ids.

(Inherited from Tokenizer)
EncodeToIds(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)

Encodes input text to token Ids up to maximum number of tokens.

(Inherited from Tokenizer)
EncodeToIds(String, Boolean, Boolean)

Encodes input text to token Ids.

(Inherited from Tokenizer)
EncodeToIds(String, Int32, String, Int32, Boolean, Boolean)

Encodes input text to token Ids up to maximum number of tokens.

(Inherited from Tokenizer)
EncodeToIds(String, ReadOnlySpan<Char>, EncodeSettings)

Encodes input text to token Ids.

EncodeToTokens(ReadOnlySpan<Char>, String, Boolean, Boolean)

Encodes input text to a list of EncodedTokens.

(Inherited from Tokenizer)
EncodeToTokens(String, ReadOnlySpan<Char>, EncodeSettings)

Encodes input text to a list of EncodedTokens.

EncodeToTokens(String, String, Boolean, Boolean)

Encodes input text to a list of EncodedTokens.

(Inherited from Tokenizer)
GetIndexByTokenCount(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)

Find the index of the maximum encoding capacity without surpassing the token limit.

(Inherited from Tokenizer)
GetIndexByTokenCount(String, Int32, String, Int32, Boolean, Boolean)

Find the index of the maximum encoding capacity without surpassing the token limit.

(Inherited from Tokenizer)
GetIndexByTokenCount(String, ReadOnlySpan<Char>, EncodeSettings, Boolean, String, Int32)

Find the index of the maximum encoding capacity without surpassing the token limit.

GetIndexByTokenCountFromEnd(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)

Find the index of the maximum encoding capacity, counting from the end of the text, without surpassing the token limit.

(Inherited from Tokenizer)
GetIndexByTokenCountFromEnd(String, Int32, String, Int32, Boolean, Boolean)

Find the index of the maximum encoding capacity, counting from the end of the text, without surpassing the token limit.

(Inherited from Tokenizer)
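
As a rough usage sketch of the members listed above, the example below builds a tokenizer from a small vocabulary file, encodes a string, counts its tokens, and decodes the IDs again. It assumes the Microsoft.ML.Tokenizers NuGet package is referenced, that the vocabulary file holds one token per line, and that passing `null` for the options argument selects default options. The vocabulary contents and the sample text are invented for the illustration, and no particular output is claimed.

```csharp
using System;
using System.IO;
using Microsoft.ML.Tokenizers;

// Write a tiny vocabulary file (assumed format: one token per line).
// The unknown token is included so that out-of-vocabulary words can be mapped to it.
string vocabPath = Path.GetTempFileName();
File.WriteAllLines(vocabPath, new[] { "[UNK]", "hello", "world", "##s" });

// Create(String, WordPieceOptions): null is assumed here to mean "use default options".
WordPieceTokenizer tokenizer = WordPieceTokenizer.Create(vocabPath, null);

string text = "hello worlds";

// EncodeToIds(String, Boolean, Boolean), relying on the optional Boolean arguments.
var ids = tokenizer.EncodeToIds(text);
Console.WriteLine("IDs: " + string.Join(", ", ids));

// CountTokens(String, Boolean, Boolean) reports how many tokens the text encodes to.
Console.WriteLine("Token count: " + tokenizer.CountTokens(text));

// Decode(IEnumerable<Int32>) turns the IDs back into text.
Console.WriteLine("Decoded: " + tokenizer.Decode(ids));

// The properties expose the settings the tokenizer is using.
Console.WriteLine("Unknown token: " + tokenizer.UnknownToken + " (ID " + tokenizer.UnknownTokenId + ")");
Console.WriteLine("Continuing prefix: " + tokenizer.ContinuingSubwordPrefix);
Console.WriteLine("Max chars per word: " + tokenizer.MaxInputCharsPerWord);

// Remove the temporary vocabulary file once the tokenizer has been created.
File.Delete(vocabPath);
```

CountTokens is the cheaper choice when only the length is needed, for example to check whether a text fits a model's input limit before encoding it.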

Applies to