EnglishRobertaTokenizer Class

Definition

Namespace: Microsoft.ML.Tokenizers

Assembly: Microsoft.ML.Tokenizers.dll

Package: Microsoft.ML.Tokenizers v0.22.0, v1.0.1, v2.0.0-preview.1.25125.4

Source: EnglishRobertaTokenizer.cs

Important

Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.

Represents the Byte Pair Encoding (BPE) model used by the English RoBERTa tokenizer.

public sealed class EnglishRobertaTokenizer : Microsoft.ML.Tokenizers.Tokenizer

type EnglishRobertaTokenizer = class
    inherit Tokenizer

Public NotInheritable Class EnglishRobertaTokenizer
Inherits Tokenizer

Inheritance: Object → Tokenizer → EnglishRobertaTokenizer

Properties

FilterUnsupportedChars	Indicates whether unsupported characters are filtered out during decoding.
Normalizer	Gets the Normalizer in use by the Tokenizer.
PadIndex	Gets the index of the pad symbol inside the symbols list.
PreTokenizer	Gets the PreTokenizer used by the Tokenizer.
SymbolsCount	Gets the symbols list length.
Vocabulary	Gets the dictionary mapping tokens to Ids.

Methods

AddMaskSymbol(String)	Add the mask symbol to the symbols list.
ConvertIdsToOccurrenceRanks(IReadOnlyList<Int32>)	Convert a list of token Ids to highest occurrence rankings.
ConvertIdsToOccurrenceValues(IReadOnlyList<Int32>)	Convert a list of token Ids to highest occurrence values.
ConvertOccurrenceRanksToIds(IReadOnlyList<Int32>)	Convert a list of highest occurrence rankings to a list of token Ids.
CountTokens(ReadOnlySpan<Char>, Boolean, Boolean)	Get the number of tokens that the input text will be encoded to. (Inherited from Tokenizer)
CountTokens(String, Boolean, Boolean)	Get the number of tokens that the input text will be encoded to. (Inherited from Tokenizer)
CountTokens(String, ReadOnlySpan<Char>, EncodeSettings)	Get the number of tokens that the input text will be encoded to. (Inherited from Tokenizer)
Create(Stream, Stream, Stream, PreTokenizer, Normalizer, Boolean)	Create the tokenizer's model object for use with the English RoBERTa model.
Create(Stream, Stream, Stream)	Create the tokenizer's model object for use with the English RoBERTa model.
Create(String, String, String, PreTokenizer, Normalizer, Boolean)	Create the tokenizer's model object for use with the English RoBERTa model.
Create(String, String, String)	Create the tokenizer's model object for use with the English RoBERTa model.
Decode(IEnumerable<Int32>, Span<Char>, Int32, Int32)	Decode the given ids back to text and store the result in the `destination` span.
Decode(IEnumerable<Int32>)	Decode the given ids back to a String.
EncodeToIds(ReadOnlySpan<Char>, Boolean, Boolean)	Encodes input text to token Ids. (Inherited from Tokenizer)
EncodeToIds(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)	Encodes input text to token Ids up to maximum number of tokens. (Inherited from Tokenizer)
EncodeToIds(String, Boolean, Boolean)	Encodes input text to token Ids. (Inherited from Tokenizer)
EncodeToIds(String, Int32, String, Int32, Boolean, Boolean)	Encodes input text to token Ids up to maximum number of tokens. (Inherited from Tokenizer)
EncodeToIds(String, ReadOnlySpan<Char>, EncodeSettings)	Encodes input text to token Ids. (Inherited from Tokenizer)
EncodeToTokens(ReadOnlySpan<Char>, String, Boolean, Boolean)	Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer)
EncodeToTokens(String, ReadOnlySpan<Char>, EncodeSettings)	Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer)
EncodeToTokens(String, String, Boolean, Boolean)	Encodes input text to a list of EncodedTokens. (Inherited from Tokenizer)
GetIndexByTokenCount(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCount(String, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCount(String, ReadOnlySpan<Char>, EncodeSettings, Boolean, String, Int32)	Find the index of the maximum encoding capacity without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCountFromEnd(ReadOnlySpan<Char>, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity from the end of the text without surpassing the token limit. (Inherited from Tokenizer)
GetIndexByTokenCountFromEnd(String, Int32, String, Int32, Boolean, Boolean)	Find the index of the maximum encoding capacity from the end of the text without surpassing the token limit. (Inherited from Tokenizer)
IsSupportedChar(Char)	Check if the character is supported by the tokenizer's model.
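
Example

The following is a minimal sketch of how the members listed above might be combined: it creates a tokenizer with Create(String, String, String), encodes a string with EncodeToIds, counts tokens with CountTokens, and decodes the Ids with Decode. The three file names are hypothetical placeholders, not files shipped with the package; they stand in for the vocabulary, merges, and highest-occurrence mapping files of whichever English RoBERTa model you use. The sample text is also only illustrative.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.ML.Tokenizers;

// Hypothetical local paths (assumptions for this sketch). Replace them with
// the vocabulary, merges, and highest-occurrence mapping files of your model.
string vocabularyPath = "encoder.json";
string mergePath = "vocab.bpe";
string occurrenceMappingPath = "dict.txt";

// Create(String, String, String) builds the tokenizer from the three files.
EnglishRobertaTokenizer tokenizer =
    EnglishRobertaTokenizer.Create(vocabularyPath, mergePath, occurrenceMappingPath);

string text = "Hello, how are you?";

// Encode the text to token Ids using the default pre-tokenization
// and normalization settings.
IReadOnlyList<int> ids = tokenizer.EncodeToIds(text);

// Count the tokens without materializing the Id list.
int count = tokenizer.CountTokens(text);

// Decode the Ids back to a string.
var decoded = tokenizer.Decode(ids);

Console.WriteLine($"Token count: {count}");
Console.WriteLine($"Ids: {string.Join(", ", ids)}");
Console.WriteLine($"Decoded: {decoded}");
```

When only the number of tokens matters, for example to check text against a model's input limit, CountTokens is the natural choice because it does not require building the list of Ids first.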

Applies to
