SentencePieceTokenizer Class
Definition
Important
Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.
SentencePieceTokenizer is a tokenizer that splits the input into tokens using the SentencePiece BPE model.
public class SentencePieceTokenizer : Microsoft.ML.Tokenizers.Tokenizer
type SentencePieceTokenizer = class
inherit Tokenizer
Public Class SentencePieceTokenizer
Inherits Tokenizer
- Inheritance: Object → Tokenizer → SentencePieceTokenizer
- Derived: LlamaTokenizer
Properties
| AddBeginningOfSentence |
Indicates whether to emit the beginning-of-sentence token during encoding. |
| AddDummyPrefix |
Indicates whether to emit the prefix character U+2581 at the beginning of the sentence during normalization and encoding. |
| AddEndOfSentence |
Indicates whether to emit the end-of-sentence token during encoding. |
| BeginningOfSentenceId |
The id of the beginning of sentence token. |
| BeginningOfSentenceToken |
The beginning of sentence token. |
| ByteFallback |
Specifies whether the model will do a byte fallback when it encounters unknown tokens during the encoding process. |
| EndOfSentenceId |
The id of the end of sentence token. |
| EndOfSentenceToken |
The end of sentence token. |
| EscapeWhiteSpaces |
Indicates whether spaces are replaced with the character U+2581 during normalization and encoding. |
| Normalizer |
Gets the Normalizer in use by the Tokenizer. |
| PreTokenizer |
Gets the PreTokenizer used by the Tokenizer. |
| SpecialTokens |
The special tokens. |
| TreatWhitespaceAsSuffix |
Indicates whether to emit the character U+2581 at the end of the last sentence token instead of at the beginning of the sentence during normalization and encoding. |
| UnknownId |
The id of the unknown token. |
| UnknownToken |
The unknown token. |
| Vocabulary |
The vocabulary of the model. |
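The interaction of AddDummyPrefix, EscapeWhiteSpaces, and TreatWhitespaceAsSuffix can be illustrated with a small standalone sketch. The class `WhitespaceMarkerSketch` and its `Apply` method are hypothetical names introduced here for illustration; they are not part of Microsoft.ML.Tokenizers, and the sketch deliberately ignores other normalization steps the real normalizer may perform (for example, collapsing repeated whitespace). It assumes the dummy prefix is always the U+2581 character, as the property descriptions above state.

```csharp
using System;

// Hypothetical illustration only; not the library's normalizer.
public static class WhitespaceMarkerSketch
{
    // U+2581 (LOWER ONE EIGHTH BLOCK), the marker SentencePiece uses for whitespace.
    private const char Marker = '\u2581';

    // Mimics, in simplified form, what the three flags describe:
    //  - escapeWhiteSpaces: replace each space with U+2581.
    //  - addDummyPrefix: add one U+2581 to the text.
    //  - treatWhitespaceAsSuffix: put that extra U+2581 at the end instead of the start.
    public static string Apply(
        string text,
        bool addDummyPrefix,
        bool escapeWhiteSpaces,
        bool treatWhitespaceAsSuffix)
    {
        string body = escapeWhiteSpaces ? text.Replace(' ', Marker) : text;

        if (!addDummyPrefix)
        {
            return body;
        }

        return treatWhitespaceAsSuffix ? body + Marker : Marker + body;
    }
}
```

Under these assumptions, `WhitespaceMarkerSketch.Apply("Hello world", true, true, false)` returns `"\u2581Hello\u2581world"`, and passing `true` for the last argument returns `"Hello\u2581world\u2581"` instead.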
Methods
| CountTokens(ReadOnlySpan<Char>, Boolean, Boolean, Boolean, Boolean, String, Int32, Int32) |
Get the number of tokens that the input text will be encoded to. |
| CountTokens(ReadOnlySpan<Char>, Boolean, Boolean, Boolean, Boolean) |
Get the number of tokens that the input text will be encoded to. |
| CountTokens(ReadOnlySpan<Char>, Boolean, Boolean) |
Get the number of tokens that the input text will be encoded to. (Inherited from Tokenizer) |
| CountTokens(String, Boolean, Boolean, Boolean, Boolean, String, Int32, Int32) |
Get the number of tokens that the input text will be encoded to. |
| CountTokens(String, Boolean, Boolean, Boolean, Boolean) |
Get the number of tokens that the input text will be encoded to. |
| CountTokens(String, Boolean, Boolean) |
Get the number of tokens that the input text will be encoded to. (Inherited from Tokenizer) |
| CountTokens(String, ReadOnlySpan<Char>, EncodeSettings) |
Get the number of tokens that the input text will be encoded to. |
| Create(Stream, Boolean, Boolean, IReadOnlyDictionary<String,Int32>) |
Creates an instance of SentencePieceTokenizer. The model stream should contain a SentencePiece model as specified in the following documentation: https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto. |
| Decode(IEnumerable<Int32>, Boolean) |
Decodes the given ids back to a String. |
| Decode(IEnumerable<Int32>, Span<Char>, Boolean, Int32, Int32) |
Decodes the given ids back to text and stores the result in the provided Span<Char> buffer. |
| Decode(IEnumerable<Int32>, Span<Char>, Int32, Int32) |
Decodes the given ids back to text and stores the result in the provided Span<Char> buffer. |
| Decode(IEnumerable<Int32>) |
Decodes the given ids back to a String. |
| EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean, Boolean, Boolean) |
Encodes input text to token Ids. |
| EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean, Int32, String, Int32, Boolean, Boolean) |
Encodes input text to token Ids, up to a maximum number of tokens. |
| EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean) |
Encodes input text to token Ids. (Inherited from Tokenizer) |
| EncodeToIds(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean) |
Encodes input text to token Ids, up to a maximum number of tokens. (Inherited from Tokenizer) |
| EncodeToIds(String, Boolean, Boolean, Boolean, Boolean) |
Encodes input text to token Ids. |
| EncodeToIds(String, Boolean, Boolean, Int32, String, Int32, Boolean, Boolean) |
Encodes input text to token Ids, up to a maximum number of tokens. |
| EncodeToIds(String, Boolean, Boolean) |
Encodes input text to token Ids. (Inherited from Tokenizer) |
| EncodeToIds(String, Int32, String, Int32, Boolean, Boolean) |
Encodes input text to token Ids, up to a maximum number of tokens. (Inherited from Tokenizer) |
| EncodeToIds(String, ReadOnlySpan<Char>, EncodeSettings) |
Encodes input text to token Ids. |
| EncodeToTokens(ReadOnlySpan<Char>, String, Boolean, Boolean, Boolean, Boolean) |
Encodes input text to a list of EncodedTokens, each with the token's string value, id, and offset. |
| EncodeToTokens(ReadOnlySpan<Char>, String, Boolean, Boolean) |
Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer) |
| EncodeToTokens(String, ReadOnlySpan<Char>, EncodeSettings) |
Encodes input text to a list of EncodedTokens. |
| EncodeToTokens(String, String, Boolean, Boolean, Boolean, Boolean) |
Encodes input text to a list of EncodedTokens, each with the token's string value, id, and offset. |
| EncodeToTokens(String, String, Boolean, Boolean) |
Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer) |
| GetIndexByTokenCount(ReadOnlySpan<Char>, Boolean, Boolean, Int32, String, Int32, Boolean, Boolean) |
Find the index of the maximum encoding capacity from the start within the text without surpassing the token limit. |
| GetIndexByTokenCount(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean) |
Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer) |
| GetIndexByTokenCount(String, Boolean, Boolean, Int32, String, Int32, Boolean, Boolean) |
Find the index of the maximum encoding capacity from the start within the text without surpassing the token limit. |
| GetIndexByTokenCount(String, Int32, String, Int32, Boolean, Boolean) |
Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer) |
| GetIndexByTokenCount(String, ReadOnlySpan<Char>, EncodeSettings, Boolean, String, Int32) |
Find the index of the maximum encoding capacity without surpassing the token limit. |
| GetIndexByTokenCountFromEnd(ReadOnlySpan<Char>, Boolean, Boolean, Int32, Boolean, String, Int32) |
Find the index of the maximum encoding capacity from the end within the text without surpassing the token limit. |
| GetIndexByTokenCountFromEnd(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean) |
Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer) |
| GetIndexByTokenCountFromEnd(String, Boolean, Boolean, Int32, Boolean, String, Int32) |
Find the index of the maximum encoding capacity from the end within the text without surpassing the token limit. |
| GetIndexByTokenCountFromEnd(String, Int32, String, Int32, Boolean, Boolean) |
Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer) |
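A minimal usage sketch follows, showing how the Create, EncodeToIds, CountTokens, and Decode members listed above might fit together. It assumes the Microsoft.ML.Tokenizers package is referenced and that a SentencePiece model file is available locally; the file name `tokenizer.model` and the class name `SentencePieceUsageSketch` are placeholders chosen for illustration. Arguments are passed positionally to match the signatures in the table, and the meaning given to the two Boolean arguments of EncodeToIds and CountTokens (pre-tokenization and normalization) is an assumption. Because this API is prerelease, signatures may change.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.ML.Tokenizers;

public static class SentencePieceUsageSketch
{
    public static void Main(string[] args)
    {
        // Assumption: the path to a SentencePiece model file is passed as the
        // first argument; "tokenizer.model" is only a placeholder default.
        string modelPath = args.Length > 0 ? args[0] : "tokenizer.model";

        using Stream modelStream = File.OpenRead(modelPath);

        // Create(Stream, Boolean, Boolean, IReadOnlyDictionary<String,Int32>):
        // add a beginning-of-sentence token, no end-of-sentence token,
        // and no extra special tokens.
        SentencePieceTokenizer tokenizer =
            SentencePieceTokenizer.Create(modelStream, true, false, null);

        string text = "Hello world";

        // EncodeToIds(String, Boolean, Boolean), inherited from Tokenizer.
        // Assumed meaning of the flags: apply pre-tokenization and normalization.
        IReadOnlyList<int> ids = tokenizer.EncodeToIds(text, true, true);

        // CountTokens(String, Boolean, Boolean), inherited from Tokenizer.
        int count = tokenizer.CountTokens(text, true, true);

        // Decode(IEnumerable<Int32>) turns the ids back into text.
        string roundTrip = tokenizer.Decode(ids);

        Console.WriteLine($"Ids: {string.Join(", ", ids)}");
        Console.WriteLine($"Token count: {count}");
        Console.WriteLine($"Decoded: {roundTrip}");
    }
}
```

The actual ids and token count depend entirely on the model file supplied, so no expected output is shown.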