Spaza AI - Simple BPE Tokenizer

This package provides a set of tools for processing text using the GPT family of models. The GPT models process text using tokens, which are common sequences of characters found in text. The models understand the statistical relationships between these tokens, and excel at producing the next token in a sequence of tokens.

The getTokenCountForString function that takes an input string and tokenizes it. It returns the number of tokens in the resulting SPTokenContainer object.

The SPTokenContainer model is a simple data class that represents a container for the results of tokenizing a text input. It has three properties:

tokens: An list of integers that represents the encoded text. Each integer in the list corresponds to a unique token in the original text.
tokenCount: An integer that represents the number of tokens in the encoded text.
characterCount: An integer that represents the number of characters in the original text.

The SPTokenContainer model is used in the getTokenCountForString function to store the results of the tokenization process. It can be used to access the encoded tokens and their counts, as well as the original character count of the input text.

Features

The package provides a tool for tokenizing text inputs, which can be used to convert a piece of text into a sequence of tokens.

Getting started

Add the package to your project by adding the following line to your pubspec.yaml file:

dependencies:
  sp_ai_simple_bpe_tokenizer: ^0.0.1+1

Usage

void main() async {
  WidgetsFlutterBinding.ensureInitialized();
  
  String text = 'The quick brown fox jumps over the lazy dog';
  final tokenizedText = await SPAiSimpleBpeTokenizer().encodeString(text);
  print(tokenizedText);
}

Additional information

The package includes a basic vocabulary file that is used for the tokenization process. This file is located in the assets folder of the package, and is loaded into memory when the package is initialized. The vocabulary file contains 50,257 tokens, and is based on the GPT2 model vocabulary file.

Because of the limited vocabulary size, the tokenizer might miss some unique words or phrases that's not included in the vocabulary file. This can be mitigated by adding the missing tokens to the vocabulary file, and recompiling the package.

License

The getTokenCountForString function is a part of the gpt_flutter package, which is licensed under the MIT License.

Authors

www.spaza.com

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
assets/files		assets/files
example		example
ios		ios
lib		lib
.gitignore		.gitignore
.metadata		.metadata
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
analysis_options.yaml		analysis_options.yaml
pubspec.yaml		pubspec.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spaza AI - Simple BPE Tokenizer

Features

Getting started

Usage

Additional information

License

Authors

About

Releases

Packages

Contributors 2

Languages

License

Theunodb/Spaza-AI-Simple-BPE-Tokenizer

Folders and files

Latest commit

History

Repository files navigation

Spaza AI - Simple BPE Tokenizer

Features

Getting started

Usage

Additional information

License

Authors

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages