SentencePieceTokenizer.GetIndexByTokenCount Method

Definition

Overloads

GetIndexByTokenCount(ReadOnlySpan<Char>, Boolean, Boolean, Int32, String, Int32, Boolean, Boolean)

Finds the index, counted from the start of the text, up to which the text can be encoded without exceeding the token limit.

GetIndexByTokenCount(String, Boolean, Boolean, Int32, String, Int32, Boolean, Boolean)

Finds the index, counted from the start of the text, up to which the text can be encoded without exceeding the token limit.

GetIndexByTokenCount(String, ReadOnlySpan<Char>, EncodeSettings, Boolean, String, Int32)

Finds the index up to which (or, when searching from the end, from which) the text can be encoded without exceeding the token limit.

GetIndexByTokenCount(ReadOnlySpan<Char>, Boolean, Boolean, Int32, String, Int32, Boolean, Boolean)

Source:
SentencePieceTokenizer.cs

Finds the index, counted from the start of the text, up to which the text can be encoded without exceeding the token limit.

public int GetIndexByTokenCount(ReadOnlySpan<char> text, bool addBeginningOfSentence, bool addEndOfSentence, int maxTokenCount, out string? normalizedText, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true);
override this.GetIndexByTokenCount : ReadOnlySpan<char> * bool * bool * int * string * int * bool * bool -> int
Public Function GetIndexByTokenCount (text As ReadOnlySpan(Of Char), addBeginningOfSentence As Boolean, addEndOfSentence As Boolean, maxTokenCount As Integer, ByRef normalizedText As String, ByRef tokenCount As Integer, Optional considerPreTokenization As Boolean = true, Optional considerNormalization As Boolean = true) As Integer

Parameters

text
ReadOnlySpan<Char>

The text to encode.

addBeginningOfSentence
Boolean

Indicates whether to emit the beginning-of-sentence token during encoding.

addEndOfSentence
Boolean

Indicates whether to emit the end-of-sentence token during encoding.

maxTokenCount
Int32

The maximum number of tokens the encoding may produce.

normalizedText
String

If the tokenizer's normalization is enabled and considerNormalization is true, this is set to text in its normalized form; otherwise, it is set to null.

tokenCount
Int32

The number of tokens generated, which does not exceed maxTokenCount.

considerPreTokenization
Boolean

Indicates whether to apply pre-tokenization before tokenization.

considerNormalization
Boolean

Indicates whether to apply normalization before tokenization.

Returns

The index in the processed text up to which the text can be encoded without exceeding the token limit. This is the index immediately following the last character to be included. If no tokens fit, the result is 0; if all tokens fit, the result is the length of the text, or the length of normalizedText if normalization is enabled.
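The return-value contract above can be illustrated with a toy stand-in. The sketch below does not use SentencePiece at all: it assumes a hypothetical tokenizer that emits one token per whitespace-separated word, and the class and method names are invented for illustration. It only shows what "index immediately following the last character to be included" means at the boundaries (nothing fits, some fits, everything fits).

```csharp
using System;

public static class TokenBudgetSketch
{
    // Hypothetical stand-in tokenizer: one token per whitespace-separated word.
    // This is NOT how SentencePiece segments text; it only mirrors the index contract.
    public static int IndexByWordCount(string text, int maxTokenCount, out int tokenCount)
    {
        tokenCount = 0;
        int index = 0; // index just past the last character that fits
        int i = 0;
        while (i < text.Length)
        {
            // Skip separators before the next word.
            while (i < text.Length && char.IsWhiteSpace(text[i])) i++;
            if (i == text.Length) break;

            // Another word exists, but the budget is already used up.
            if (tokenCount == maxTokenCount) return index;

            // Consume the word and count it as one token.
            while (i < text.Length && !char.IsWhiteSpace(text[i])) i++;
            tokenCount++;
            index = i;
        }
        return text.Length; // everything fit
    }

    public static void Main()
    {
        const string text = "the quick brown fox";
        int index = IndexByWordCount(text, 2, out int count);
        // prints: 9 2 'the quick'
        Console.WriteLine($"{index} {count} '{text.Substring(0, index)}'");
    }
}
```

With a budget of 0 the sketch returns 0, and with a budget of 4 or more it returns the full text length (19), matching the two boundary cases described above.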


GetIndexByTokenCount(String, Boolean, Boolean, Int32, String, Int32, Boolean, Boolean)

Source:
SentencePieceTokenizer.cs

Finds the index, counted from the start of the text, up to which the text can be encoded without exceeding the token limit.

public int GetIndexByTokenCount(string text, bool addBeginningOfSentence, bool addEndOfSentence, int maxTokenCount, out string? normalizedText, out int tokenCount, bool considerPreTokenization = true, bool considerNormalization = true);
override this.GetIndexByTokenCount : string * bool * bool * int * string * int * bool * bool -> int
Public Function GetIndexByTokenCount (text As String, addBeginningOfSentence As Boolean, addEndOfSentence As Boolean, maxTokenCount As Integer, ByRef normalizedText As String, ByRef tokenCount As Integer, Optional considerPreTokenization As Boolean = true, Optional considerNormalization As Boolean = true) As Integer

Parameters

text
String

The text to encode.

addBeginningOfSentence
Boolean

Indicates whether to emit the beginning-of-sentence token during encoding.

addEndOfSentence
Boolean

Indicates whether to emit the end-of-sentence token during encoding.

maxTokenCount
Int32

The maximum number of tokens the encoding may produce.

normalizedText
String

If the tokenizer's normalization is enabled and considerNormalization is true, this is set to text in its normalized form; otherwise, it is set to null.

tokenCount
Int32

The number of tokens generated, which does not exceed maxTokenCount.

considerPreTokenization
Boolean

Indicates whether to apply pre-tokenization before tokenization.

considerNormalization
Boolean

Indicates whether to apply normalization before tokenization.

Returns

The index in the processed text up to which the text can be encoded without exceeding the token limit. This is the index immediately following the last character to be included. If no tokens fit, the result is 0; if all tokens fit, the result is the length of the text, or the length of normalizedText if normalization is enabled.
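A common use of the returned index is trimming text to a token budget. The following is a minimal sketch, assuming a SentencePieceTokenizer instance has already been created elsewhere; PromptTrimmer and TakePrefixWithinBudget are hypothetical names, not part of the library. Because the index refers to the processed text, the sketch slices normalizedText when it is non-null and the original text otherwise.

```csharp
#nullable enable
using System;
using Microsoft.ML.Tokenizers;

public static class PromptTrimmer
{
    // Hypothetical helper (not part of Microsoft.ML.Tokenizers):
    // returns the longest prefix of the processed text that fits in maxTokenCount tokens.
    public static string TakePrefixWithinBudget(
        SentencePieceTokenizer tokenizer, string text, int maxTokenCount)
    {
        int index = tokenizer.GetIndexByTokenCount(
            text,
            addBeginningOfSentence: true,
            addEndOfSentence: false,
            maxTokenCount: maxTokenCount,
            normalizedText: out string? normalizedText,
            tokenCount: out int tokenCount);

        // The index is relative to the normalized text when one was produced.
        string processed = normalizedText ?? text;

        Console.WriteLine($"Kept {tokenCount} token(s), {index} character(s).");
        return processed.Substring(0, index);
    }
}
```

The choice of true and false for the sentence-marker flags here is only an example; pass whatever matches how the text will later be encoded.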


GetIndexByTokenCount(String, ReadOnlySpan<Char>, EncodeSettings, Boolean, String, Int32)

Source:
SentencePieceTokenizer.cs

Finds the index up to which (or, when searching from the end, from which) the text can be encoded without exceeding the token limit.

protected override int GetIndexByTokenCount(string? text, ReadOnlySpan<char> textSpan, Microsoft.ML.Tokenizers.EncodeSettings settings, bool fromEnd, out string? normalizedText, out int tokenCount);
override this.GetIndexByTokenCount : string * ReadOnlySpan<char> * Microsoft.ML.Tokenizers.EncodeSettings * bool * string * int -> int
Protected Overrides Function GetIndexByTokenCount (text As String, textSpan As ReadOnlySpan(Of Char), settings As EncodeSettings, fromEnd As Boolean, ByRef normalizedText As String, ByRef tokenCount As Integer) As Integer

Parameters

text
String

The text to encode.

textSpan
ReadOnlySpan<Char>

The span of the text to encode, which is used when text is null.

settings
EncodeSettings

The settings used to encode the text.

fromEnd
Boolean

Indicates whether to find the index from the end of the text.

normalizedText
String

If the tokenizer's normalization is enabled and the ConsiderNormalization property of settings is true, this is set to text in its normalized form; otherwise, it is set to null.

tokenCount
Int32

The number of tokens generated, which does not exceed the maximum token count.

Returns

The index in the processed text that marks how much of the text can be encoded without exceeding the token limit. If fromEnd is false, this is the index immediately following the last character to be included: if no tokens fit, the result is 0; if all tokens fit, the result is the length of the input text, or the length of normalizedText if normalization is enabled. If fromEnd is true, this is the index of the first character to be included: if no tokens fit, the result is the text length; if all tokens fit, the result is 0.
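The fromEnd case mirrors the from-start case, which is easy to get backwards. The sketch below again uses a hypothetical one-token-per-word stand-in rather than SentencePiece, with invented names, purely to show that the result is the index of the first character to keep, so the retained text is the suffix starting at that index.

```csharp
using System;

public static class FromEndSketch
{
    // Hypothetical stand-in tokenizer: one token per whitespace-separated word,
    // scanned backward from the end of the text. Not the SentencePiece algorithm.
    public static int IndexByWordCountFromEnd(string text, int maxTokenCount, out int tokenCount)
    {
        tokenCount = 0;
        int index = text.Length; // index of the first character that fits
        int i = text.Length;
        while (i > 0)
        {
            // Skip separators after the previous word (moving backward).
            while (i > 0 && char.IsWhiteSpace(text[i - 1])) i--;
            if (i == 0) break;

            // Another word exists, but the budget is already used up.
            if (tokenCount == maxTokenCount) return index;

            // Consume the word backward and count it as one token.
            while (i > 0 && !char.IsWhiteSpace(text[i - 1])) i--;
            tokenCount++;
            index = i;
        }
        return 0; // everything fit
    }

    public static void Main()
    {
        const string text = "the quick brown fox";
        int index = IndexByWordCountFromEnd(text, 2, out int count);
        // prints: 10 2 'brown fox'
        Console.WriteLine($"{index} {count} '{text.Substring(index)}'");
    }
}
```

With a budget of 0 the sketch returns the text length (19, an empty suffix), and with a budget of 4 or more it returns 0 (the whole text), matching the boundary cases described for fromEnd set to true.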
