WordPieceTokenizer.GetIndexByTokenCount Method
Definition
Important
Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.
Finds the index up to which the text can be encoded without exceeding the token limit.
protected override int GetIndexByTokenCount(string? text, ReadOnlySpan<char> textSpan, Microsoft.ML.Tokenizers.EncodeSettings settings, bool fromEnd, out string? normalizedText, out int tokenCount);
override this.GetIndexByTokenCount : string * ReadOnlySpan<char> * Microsoft.ML.Tokenizers.EncodeSettings * bool * string * int -> int
Protected Overrides Function GetIndexByTokenCount (text As String, textSpan As ReadOnlySpan(Of Char), settings As EncodeSettings, fromEnd As Boolean, ByRef normalizedText As String, ByRef tokenCount As Integer) As Integer
Parameters
- text
- String
The text to encode.
- textSpan
- ReadOnlySpan<Char>
The span of the text to encode, used if text is null.
- settings
- EncodeSettings
The settings used to encode the text.
- fromEnd
- Boolean
Indicates whether to find the index starting from the end of the text.
- normalizedText
- String
If the tokenizer's normalization is enabled and settings.ConsiderNormalization is true, this is set to text in its normalized form; otherwise, it is set to null.
- tokenCount
- Int32
The number of tokens generated, which will not exceed the maximum token count.
Returns
The index of the maximum encoding capacity within the processed text without surpassing the token limit.
If fromEnd is false, it represents the index immediately following the last character to be included. If no tokens fit, the result is 0; conversely, if all tokens fit, the result is the length of the input text, or of normalizedText if normalization is enabled.
If fromEnd is true, it represents the index of the first character to be included. If no tokens fit, the result is the text length; conversely, if all tokens fit, the result is 0.
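The return-value rules above can be illustrated with a minimal, self-contained sketch. It does not call Microsoft.ML.Tokenizers: `IndexSketch.GetIndex` is a hypothetical helper that assumes every run of non-space characters is exactly one token and that no normalization is applied. It only mirrors the index semantics described here, not the WordPiece algorithm.

```csharp
using System;

// Hypothetical helper, not part of Microsoft.ML.Tokenizers.
// Assumption: each run of non-space characters counts as exactly one token.
static class IndexSketch
{
    public static int GetIndex(string text, int maxTokenCount, bool fromEnd, out int tokenCount)
    {
        tokenCount = 0;

        if (!fromEnd)
        {
            int end = 0; // index immediately following the last included character
            int i = 0;
            while (i < text.Length)
            {
                while (i < text.Length && text[i] == ' ') i++;   // skip separators
                if (i == text.Length) break;                     // only spaces remained
                if (tokenCount == maxTokenCount) return end;     // next token would not fit
                while (i < text.Length && text[i] != ' ') i++;   // consume one token
                tokenCount++;
                end = i;
            }
            return text.Length; // all tokens fit
        }

        int start = text.Length; // index of the first included character
        int j = text.Length;
        while (j > 0)
        {
            while (j > 0 && text[j - 1] == ' ') j--;             // skip separators
            if (j == 0) break;                                   // only spaces remained
            if (tokenCount == maxTokenCount) return start;       // next token would not fit
            while (j > 0 && text[j - 1] != ' ') j--;             // consume one token
            tokenCount++;
            start = j;
        }
        return 0; // all tokens fit
    }

    static void Main()
    {
        const string text = "hello big world";

        int fromStart = GetIndex(text, 2, fromEnd: false, out int count);
        Console.WriteLine($"{fromStart} {count} '{text.Substring(0, fromStart)}'"); // 9 2 'hello big'

        int fromEndIndex = GetIndex(text, 2, fromEnd: true, out count);
        Console.WriteLine($"{fromEndIndex} {count} '{text.Substring(fromEndIndex)}'"); // 6 2 'big world'

        Console.WriteLine(GetIndex(text, 0, fromEnd: false, out _)); // 0  (no tokens fit)
        Console.WriteLine(GetIndex(text, 0, fromEnd: true, out _));  // 15 (no tokens fit)
    }
}
```

In both directions the returned index can be used directly to slice the text: `text.Substring(0, index)` when searching from the start, and `text.Substring(index)` when searching from the end.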