SentencePieceTokenizer Class

Definition

SentencePieceTokenizer is a tokenizer that splits the input into tokens using the SentencePiece BPE model.

public class SentencePieceTokenizer : Microsoft.ML.Tokenizers.Tokenizer
type SentencePieceTokenizer = class
    inherit Tokenizer
Public Class SentencePieceTokenizer
Inherits Tokenizer
Inheritance
SentencePieceTokenizer
Derived

Properties

AddBeginningOfSentence

Indicates whether the beginning-of-sentence token is emitted during encoding.

AddDummyPrefix

Indicates whether the prefix character U+2581 is emitted at the beginning of the sentence during normalization and encoding.

AddEndOfSentence

Indicates whether the end-of-sentence token is emitted during encoding.

BeginningOfSentenceId

The id of the beginning of sentence token.

BeginningOfSentenceToken

The beginning of sentence token.

ByteFallback

Specifies whether the model will do a byte fallback when it encounters unknown tokens during the encoding process.

EndOfSentenceId

The id of the end of sentence token.

EndOfSentenceToken

The end of sentence token.

EscapeWhiteSpaces

Indicates whether spaces are replaced with the character U+2581 during normalization and encoding.

Normalizer

Gets the Normalizer in use by the Tokenizer.

PreTokenizer

Gets the PreTokenizer used by the Tokenizer.

SpecialTokens

The special tokens.

TreatWhitespaceAsSuffix

Indicates whether the character U+2581 is emitted at the end of the last sentence token instead of at the beginning of the sentence during normalization and encoding.

UnknownId

The id of the unknown token.

UnknownToken

The unknown token.

Vocabulary

The vocabulary of the model.
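
The whitespace-related properties above (AddDummyPrefix, EscapeWhiteSpaces, TreatWhitespaceAsSuffix) and ByteFallback are easier to picture with a concrete string. The following is a simplified sketch of the behavior those descriptions imply, not the library's implementation. The helpers `EscapeSpaces` and `ToByteFallbackTokens` are hypothetical names introduced here for illustration, and the `<0xNN>` token naming is the convention used by SentencePiece models trained with byte fallback.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not part of Microsoft.ML.Tokenizers): mimics the
// whitespace handling that the three properties describe.
static string EscapeSpaces(string text, bool addDummyPrefix, bool treatWhitespaceAsSuffix)
{
    const char Marker = '\u2581'; // the U+2581 character named in the property descriptions

    // EscapeWhiteSpaces: every space becomes the U+2581 marker.
    string escaped = text.Replace(' ', Marker);

    if (!addDummyPrefix)
    {
        return escaped;
    }

    // AddDummyPrefix adds one extra marker; TreatWhitespaceAsSuffix moves it
    // from the start of the text to the end.
    return treatWhitespaceAsSuffix ? escaped + Marker : Marker + escaped;
}

// Hypothetical helper: shows how a piece missing from the vocabulary could be
// represented as one <0xNN> token per UTF-8 byte instead of the unknown token.
static List<string> ToByteFallbackTokens(string piece)
{
    var tokens = new List<string>();
    foreach (byte b in Encoding.UTF8.GetBytes(piece))
    {
        tokens.Add($"<0x{b:X2}>");
    }
    return tokens;
}

Console.WriteLine(EscapeSpaces("Hello world", true, false));   // marker at the start
Console.WriteLine(EscapeSpaces("Hello world", true, true));    // marker at the end
Console.WriteLine(string.Join(" ", ToByteFallbackTokens("é"))); // <0xC3> <0xA9>
```

The real tokenizer applies these steps inside its normalizer and model, so actual output for a given model may differ in details such as handling of repeated spaces.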

Methods

CountTokens(ReadOnlySpan<Char>, Boolean, Boolean, Boolean, Boolean, String, Int32, Int32)

Get the number of tokens that the input text will be encoded to.

CountTokens(ReadOnlySpan<Char>, Boolean, Boolean, Boolean, Boolean)

Get the number of tokens that the input text will be encoded to.

CountTokens(ReadOnlySpan<Char>, Boolean, Boolean)

Get the number of tokens that the input text will be encoded to.

(Inherited from Tokenizer)
CountTokens(String, Boolean, Boolean, Boolean, Boolean, String, Int32, Int32)

Get the number of tokens that the input text will be encoded to.

CountTokens(String, Boolean, Boolean, Boolean, Boolean)

Get the number of tokens that the input text will be encoded to.

CountTokens(String, Boolean, Boolean)

Get the number of tokens that the input text will be encoded to.

(Inherited from Tokenizer)
CountTokens(String, ReadOnlySpan<Char>, EncodeSettings)

Get the number of tokens that the input text will be encoded to.

Create(Stream, Boolean, Boolean, IReadOnlyDictionary<String,Int32>)

Creates an instance of SentencePieceTokenizer. The model stream should contain a SentencePiece model as specified in the following documentation: https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto.

Decode(IEnumerable<Int32>, Boolean)

Decodes the given ids back to a String.

Decode(IEnumerable<Int32>, Span<Char>, Boolean, Int32, Int32)

Decodes the given ids back to text and stores the result in the destination span.

Decode(IEnumerable<Int32>, Span<Char>, Int32, Int32)

Decodes the given ids back to text and stores the result in the destination span.

Decode(IEnumerable<Int32>)

Decodes the given ids back to a String.

EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean, Boolean, Boolean)

Encodes input text to token Ids.

EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean, Int32, String, Int32, Boolean, Boolean)

Encodes input text to token Ids, up to a maximum number of tokens.

EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean)

Encodes input text to token Ids.

(Inherited from Tokenizer)
EncodeToIds(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)

Encodes input text to token Ids, up to a maximum number of tokens.

(Inherited from Tokenizer)
EncodeToIds(String, Boolean, Boolean, Boolean, Boolean)

Encodes input text to token Ids.

EncodeToIds(String, Boolean, Boolean, Int32, String, Int32, Boolean, Boolean)

Encodes input text to token Ids, up to a maximum number of tokens.

EncodeToIds(String, Boolean, Boolean)

Encodes input text to token Ids.

(Inherited from Tokenizer)
EncodeToIds(String, Int32, String, Int32, Boolean, Boolean)

Encodes input text to token Ids, up to a maximum number of tokens.

(Inherited from Tokenizer)
EncodeToIds(String, ReadOnlySpan<Char>, EncodeSettings)

Encodes input text to token Ids.

EncodeToTokens(ReadOnlySpan<Char>, String, Boolean, Boolean, Boolean, Boolean)

Encodes input text to a list of EncodedTokens, each with the string value of the token, its id, and its offset.

EncodeToTokens(ReadOnlySpan<Char>, String, Boolean, Boolean)

Encodes input text to a list of EncodedTokens.

(Inherited from Tokenizer)
EncodeToTokens(String, ReadOnlySpan<Char>, EncodeSettings)

Encodes input text to a list of EncodedTokens.

EncodeToTokens(String, String, Boolean, Boolean, Boolean, Boolean)

Encodes input text to a list of EncodedTokens, each with the string value of the token, its id, and its offset.

EncodeToTokens(String, String, Boolean, Boolean)

Encodes input text to a list of EncodedTokens.

(Inherited from Tokenizer)
GetIndexByTokenCount(ReadOnlySpan<Char>, Boolean, Boolean, Int32, String, Int32, Boolean, Boolean)

Find the index of the maximum encoding capacity from the start within the text without surpassing the token limit.

GetIndexByTokenCount(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)

Find the index of the maximum encoding capacity without surpassing the token limit.

(Inherited from Tokenizer)
GetIndexByTokenCount(String, Boolean, Boolean, Int32, String, Int32, Boolean, Boolean)

Find the index of the maximum encoding capacity from the start within the text without surpassing the token limit.

GetIndexByTokenCount(String, Int32, String, Int32, Boolean, Boolean)

Find the index of the maximum encoding capacity without surpassing the token limit.

(Inherited from Tokenizer)
GetIndexByTokenCount(String, ReadOnlySpan<Char>, EncodeSettings, Boolean, String, Int32)

Find the index of the maximum encoding capacity without surpassing the token limit.

GetIndexByTokenCountFromEnd(ReadOnlySpan<Char>, Boolean, Boolean, Int32, Boolean, String, Int32)

Find the index of the maximum encoding capacity from the end within the text without surpassing the token limit.

GetIndexByTokenCountFromEnd(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)

Find the index of the maximum encoding capacity without surpassing the token limit.

(Inherited from Tokenizer)
GetIndexByTokenCountFromEnd(String, Boolean, Boolean, Int32, Boolean, String, Int32)

Find the index of the maximum encoding capacity from the end within the text without surpassing the token limit.

GetIndexByTokenCountFromEnd(String, Int32, String, Int32, Boolean, Boolean)

Find the index of the maximum encoding capacity without surpassing the token limit.

(Inherited from Tokenizer)
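
The methods above can be combined into a short round trip: create the tokenizer from a model stream, encode text to ids, count tokens, and decode back. This is a minimal sketch rather than official sample code. It assumes the Microsoft.ML.Tokenizers package is referenced, that the first command-line argument is the path to a local SentencePiece model file (a hypothetical input you supply), and that the single-argument calls bind to the overloads inherited from Tokenizer with their default settings.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.ML.Tokenizers;

// Assumption: args[0] is the path to a SentencePiece model file on disk.
if (args.Length == 0)
{
    Console.Error.WriteLine("usage: <program> <path-to-sentencepiece-model>");
    return;
}

using Stream modelStream = File.OpenRead(args[0]);

// Create(Stream, Boolean, Boolean, IReadOnlyDictionary<String,Int32>):
// emit the beginning-of-sentence token, skip end-of-sentence, no extra special tokens.
SentencePieceTokenizer tokenizer = SentencePieceTokenizer.Create(modelStream, true, false, null);

string text = "Hello, world!";

// Encode the text to token ids.
IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);

// Count tokens without materializing the id list.
int count = tokenizer.CountTokens(text);

// Decode the ids back to text.
string roundTrip = tokenizer.Decode(ids);

Console.WriteLine($"ids:     {string.Join(", ", ids)}");
Console.WriteLine($"count:   {count}");
Console.WriteLine($"decoded: {roundTrip}");
```

Note that a call such as `EncodeToIds(text, true, true)` on a SentencePieceTokenizer variable would be resolved against the class's own five-parameter overload, where the first two Boolean values control the beginning-of-sentence and end-of-sentence tokens, so passing only the text is the simpler way to get the default behavior.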
