SentencePieceTokenizer.Create Method
Definition
Important
Some information relates to prerelease product that may be substantially modified before it’s released. Microsoft makes no warranties, express or implied, with respect to the information provided here.
Creates an instance of SentencePieceTokenizer. The model stream should contain a SentencePiece model as specified in the following documentation: https://github.com/google/sentencepiece/blob/master/src/sentencepiece_model.proto.
public static Microsoft.ML.Tokenizers.SentencePieceTokenizer Create(System.IO.Stream modelStream, bool addBeginOfSentence = true, bool addEndOfSentence = false, System.Collections.Generic.IReadOnlyDictionary<string,int>? specialTokens = default);
static member Create : System.IO.Stream * bool * bool * System.Collections.Generic.IReadOnlyDictionary<string, int> -> Microsoft.ML.Tokenizers.SentencePieceTokenizer
Public Shared Function Create (modelStream As Stream, Optional addBeginOfSentence As Boolean = true, Optional addEndOfSentence As Boolean = false, Optional specialTokens As IReadOnlyDictionary(Of String, Integer) = Nothing) As SentencePieceTokenizer
Parameters
- modelStream
- Stream
The stream containing the SentencePiece BPE or Unigram model.
- addBeginOfSentence
- Boolean
Indicates whether to emit the beginning-of-sentence token during encoding.
- addEndOfSentence
- Boolean
Indicates whether to emit the end-of-sentence token during encoding.
- specialTokens
- IReadOnlyDictionary<String,Int32>
The additional tokens to add to the vocabulary.
Returns
- SentencePieceTokenizer
Remarks
When creating the tokenizer, ensure that the model stream comes from a trusted source.
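As a minimal sketch of how the signature above might be called, the following helper opens a model file and passes the stream to Create with the default flag values spelled out. The class name SentencePieceTokenizerExample, the Load method, and the modelPath parameter are hypothetical names introduced for illustration. The sketch also assumes that Create finishes reading the model before it returns, so the stream can be disposed afterwards.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.ML.Tokenizers;

// Hypothetical helper class, not part of Microsoft.ML.Tokenizers.
public static class SentencePieceTokenizerExample
{
    // Opens the file at modelPath (a hypothetical, caller-supplied path) and
    // passes the resulting stream to SentencePieceTokenizer.Create.
    // Only load model files that come from a trusted source.
    public static SentencePieceTokenizer Load(
        string modelPath,
        IReadOnlyDictionary<string, int>? specialTokens = null)
    {
        // Assumption: Create reads the whole model before returning,
        // so disposing the stream at the end of this method is safe.
        using Stream modelStream = File.OpenRead(modelPath);

        return SentencePieceTokenizer.Create(
            modelStream,
            addBeginOfSentence: true,   // same as the documented default
            addEndOfSentence: false,    // same as the documented default
            specialTokens: specialTokens);
    }
}
```

A caller would pass the path to a SentencePiece model file, for example SentencePieceTokenizerExample.Load(pathToModel). Passing null for specialTokens matches the documented default for that parameter.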