SentencePieceTokenizer Class

Definition

Namespace: Microsoft.ML.Tokenizers

Assembly: Microsoft.ML.Tokenizers.dll

Package: Microsoft.ML.Tokenizers v1.0.1

Package: Microsoft.ML.Tokenizers v0.22.0

Package: Microsoft.ML.Tokenizers v2.0.0-preview.1.25125.4

Source: SentencePieceTokenizer.cs

Important

Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.

SentencePieceTokenizer is a tokenizer that splits the input into tokens using the SentencePiece BPE model.

public class SentencePieceTokenizer : Microsoft.ML.Tokenizers.Tokenizer

type SentencePieceTokenizer = class
    inherit Tokenizer

Public Class SentencePieceTokenizer
Inherits Tokenizer

Inheritance: Object → Tokenizer → SentencePieceTokenizer

Derived: Microsoft.ML.Tokenizers.LlamaTokenizer

Properties

AddBeginningOfSentence	Indicates whether to emit the beginning-of-sentence token during encoding.
AddDummyPrefix	Indicates whether to emit the prefix character U+2581 at the beginning of the sentence during normalization and encoding.
AddEndOfSentence	Indicates whether to emit the end-of-sentence token during encoding.
BeginningOfSentenceId	The id of the beginning of sentence token.
BeginningOfSentenceToken	The beginning of sentence token.
ByteFallback	Specifies whether the model will do a byte fallback when it encounters unknown tokens during the encoding process.
EndOfSentenceId	The id of the end of sentence token.
EndOfSentenceToken	The end of sentence token.
EscapeWhiteSpaces	Indicates whether spaces are replaced with the character U+2581 during normalization and encoding.
Normalizer	Gets the Normalizer in use by the Tokenizer.
PreTokenizer	Gets the PreTokenizer used by the Tokenizer.
SpecialTokens	The special tokens.
TreatWhitespaceAsSuffix	Indicates whether to emit the character U+2581 at the end of the sentence instead of at the beginning during normalization and encoding.
UnknownId	The id of the unknown token.
UnknownToken	The unknown token.
Vocabulary	The vocabulary of the model.
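
The whitespace-related properties above (AddDummyPrefix, EscapeWhiteSpaces, TreatWhitespaceAsSuffix) are easier to follow with a concrete string. The following is a minimal, self-contained sketch of one reading of those descriptions. It does not call the library and is not its implementation; the class name WhitespaceEscapeSketch and the method Escape are hypothetical, and real normalization may apply further steps that are not shown here.

```csharp
using System;

// Hypothetical illustration only: mimics how the U+2581 marker could be
// placed according to the property descriptions. Not the library's code.
public static class WhitespaceEscapeSketch
{
    private const char Marker = '\u2581'; // the SentencePiece whitespace marker

    public static string Escape(
        string text,
        bool addDummyPrefix,
        bool escapeWhiteSpaces,
        bool treatWhitespaceAsSuffix)
    {
        string result = text;

        // Assumption: the dummy prefix is a single space that moves to the
        // end of the text when whitespace is treated as a suffix.
        if (addDummyPrefix)
        {
            result = treatWhitespaceAsSuffix ? result + " " : " " + result;
        }

        // Assumption: escaping replaces every space with U+2581.
        if (escapeWhiteSpaces)
        {
            result = result.Replace(' ', Marker);
        }

        return result;
    }

    public static void Main()
    {
        // Returns "\u2581Hello\u2581world" (marker used as a prefix).
        Console.WriteLine(Escape("Hello world", true, true, false));

        // Returns "Hello\u2581world\u2581" (marker used as a suffix).
        Console.WriteLine(Escape("Hello world", true, true, true));
    }
}
```

Under these assumptions, the prefix form marks where each word begins, while the suffix form marks where each word ends. To see what a loaded model actually does, inspect the corresponding properties on the tokenizer instance.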

Methods

CountTokens(ReadOnlySpan<Char>, Boolean, Boolean, Boolean, Boolean, String, Int32, Int32)	Get the number of tokens that the input text will be encoded to.
CountTokens(ReadOnlySpan<Char>, Boolean, Boolean, Boolean, Boolean)	Get the number of tokens that the input text will be encoded to.
CountTokens(ReadOnlySpan<Char>, Boolean, Boolean)	Get the number of tokens that the input text will be encoded to. (Inherited from Tokenizer)
CountTokens(String, Boolean, Boolean, Boolean, Boolean, String, Int32, Int32)	Get the number of tokens that the input text will be encoded to.
CountTokens(String, Boolean, Boolean, Boolean, Boolean)	Get the number of tokens that the input text will be encoded to.
CountTokens(String, Boolean, Boolean)	Get the number of tokens that the input text will be encoded to. (Inherited from Tokenizer)
CountTokens(String, ReadOnlySpan<Char>, EncodeSettings)	Get the number of tokens that the input text will be encoded to.
Create(Stream, Boolean, Boolean, IReadOnlyDictionary<String,Int32>)	Creates an instance of SentencePieceTokenizer. The model stream should contain a SentencePiece model as specified in the following documentation: https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto.
Decode(IEnumerable<Int32>, Boolean)	Decodes the given ids back to a String.
Decode(IEnumerable<Int32>, Span<Char>, Boolean, Int32, Int32)	Decodes the given ids back to text and stores the result in the `destination` span.
Decode(IEnumerable<Int32>, Span<Char>, Int32, Int32)	Decodes the given ids back to text and stores the result in the `destination` span.
Decode(IEnumerable<Int32>)	Decodes the given ids back to a String.
EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean, Boolean, Boolean)	Encodes input text to token Ids.
EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean, Int32, String, Int32, Boolean, Boolean)	Encodes input text to token Ids up to maximum number of tokens.
EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean)	Encodes input text to token Ids. (Inherited from Tokenizer)
EncodeToIds(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)	Encodes input text to token Ids up to maximum number of tokens. (Inherited from Tokenizer)
EncodeToIds(String, Boolean, Boolean, Boolean, Boolean)	Encodes input text to token Ids.
EncodeToIds(String, Boolean, Boolean, Int32, String, Int32, Boolean, Boolean)	Encodes input text to token Ids up to maximum number of tokens.
EncodeToIds(String, Boolean, Boolean)	Encodes input text to token Ids. (Inherited from Tokenizer)
EncodeToIds(String, Int32, String, Int32, Boolean, Boolean)	Encodes input text to token Ids up to maximum number of tokens. (Inherited from Tokenizer)
EncodeToIds(String, ReadOnlySpan<Char>, EncodeSettings)	Encodes input text to token Ids.
EncodeToTokens(ReadOnlySpan<Char>, String, Boolean, Boolean, Boolean, Boolean)	Encodes input text to a list of EncodedTokens with string value of the token, id, and offset.
EncodeToTokens(ReadOnlySpan<Char>, String, Boolean, Boolean)	Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer)
EncodeToTokens(String, ReadOnlySpan<Char>, EncodeSettings)	Encodes input text to a list of EncodedTokens.
EncodeToTokens(String, String, Boolean, Boolean, Boolean, Boolean)	Encodes input text to a list of EncodedTokens with string value of the token, id, and offset.
EncodeToTokens(String, String, Boolean, Boolean)	Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer)
GetIndexByTokenCount(ReadOnlySpan<Char>, Boolean, Boolean, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity from the start within the text without surpassing the token limit.
GetIndexByTokenCount(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCount(String, Boolean, Boolean, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity from the start within the text without surpassing the token limit.
GetIndexByTokenCount(String, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCount(String, ReadOnlySpan<Char>, EncodeSettings, Boolean, String, Int32)	Find the index of the maximum encoding capacity without surpassing the token limit.
GetIndexByTokenCountFromEnd(ReadOnlySpan<Char>, Boolean, Boolean, Int32, Boolean, String, Int32)	Find the index of the maximum encoding capacity from the end within the text without surpassing the token limit.
GetIndexByTokenCountFromEnd(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCountFromEnd(String, Boolean, Boolean, Int32, Boolean, String, Int32)	Find the index of the maximum encoding capacity from the end within the text without surpassing the token limit.
GetIndexByTokenCountFromEnd(String, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
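
As a rough usage sketch of the members listed above, the following creates a tokenizer from a model stream, encodes a string to ids, counts its tokens, and decodes the ids back to text. Several details are assumptions rather than documented facts. The file path tokenizer.model is hypothetical. The two Boolean arguments of Create are assumed to control the beginning-of-sentence and end-of-sentence tokens. Passing null for the special-tokens dictionary is assumed to be accepted. The considerPreTokenization and considerNormalization parameters of the inherited overloads are assumed to be optional. The Create overload may not exist in every package version listed on this page.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.ML.Tokenizers;

public static class SentencePieceUsageSketch
{
    public static void Main()
    {
        // Hypothetical path: point this at a real SentencePiece model file.
        using Stream modelStream = File.OpenRead("tokenizer.model");

        // Matches the Create(Stream, Boolean, Boolean, IReadOnlyDictionary<String,Int32>) entry.
        // Assumed meaning of the arguments: add beginning-of-sentence token,
        // do not add end-of-sentence token, no extra special tokens.
        Tokenizer tokenizer = SentencePieceTokenizer.Create(modelStream, true, false, null);

        string text = "Hello, world!";

        // Overloads inherited from Tokenizer, called with their assumed defaults.
        IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);
        int count = tokenizer.CountTokens(text);

        // Decode(IEnumerable<Int32>) turns the ids back into text.
        var decoded = tokenizer.Decode(ids);

        Console.WriteLine($"Token count: {count}");
        Console.WriteLine($"Ids: {string.Join(", ", ids)}");
        Console.WriteLine($"Decoded: {decoded}");
    }
}
```

The variable is typed as Tokenizer so that the calls bind to the inherited overloads. The SentencePieceTokenizer-specific overloads take additional Boolean arguments for the beginning-of-sentence and end-of-sentence tokens and can be used when those need to be set per call.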

Applies to
