Large memory usage? #21

Vimiso · 2024-10-20T17:51:57Z

Take the given test:

$usage = memory()[1];

$provider = new \Yethee\Tiktoken\EncoderProvider;
$provider->setVocabCache(storage_path('app'));
$encoder = $provider->getForModel('gpt-4o-mini');

dd(memory()[1]-$usage); // 26mb!

26 MB seems like a lot, no? Especially considering the cached vocab file is only 3.6 MB.
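The `memory()` helper in the test above is not shown, so for anyone who wants to reproduce the number, here is a minimal sketch that takes the same kind of measurement with PHP's built-in `memory_get_usage()`. The `memoryDelta()` function name and the stand-in workload are hypothetical and not part of the library.

```php
<?php

declare(strict_types=1);

/**
 * Run $fn and return how much the memory allocated by PHP grew, in bytes.
 * The return value of $fn is kept alive until after the second reading,
 * so whatever $fn builds and returns is included in the result.
 */
function memoryDelta(callable $fn): int
{
    gc_collect_cycles();
    $before = memory_get_usage();
    $result = $fn();
    $after = memory_get_usage();
    unset($result);

    return $after - $before;
}

// Stand-in workload (an assumption for illustration): 100,000 string => int entries.
$delta = memoryDelta(static function (): array {
    $map = [];
    for ($i = 0; $i < 100_000; $i++) {
        $map['token_' . $i] = $i;
    }

    return $map;
});

printf("%.1f MB\n", $delta / 1024 / 1024);
```

To measure the encoder instead, the closure body could be replaced with the three `EncoderProvider` lines from the test, returning `$encoder` so it stays alive during the second reading.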


yethee · 2024-11-10T21:51:37Z

The token dictionary accounts for most of the allocated memory. The entire dictionary has to stay in memory so that encoding text into tokens, and decoding tokens back into text, is efficient. It is currently stored in PHP's built-in array type, and I don't know of a way to reduce its memory consumption.
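To illustrate why a small vocab file can grow several times over once it is loaded into an array, here is a rough sketch that compares the raw bytes of a set of keys with the memory PHP allocates to hold them in a string-keyed array. The generated tokens are a hypothetical stand-in, not the library's data or its loading code, and the exact ratio depends on the PHP version and key lengths.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for a vocabulary: 100,000 distinct short strings.
$count = 100_000;

$before = memory_get_usage();

$tokenToRank = [];   // string token => integer rank
$payloadBytes = 0;   // raw bytes of the tokens themselves
for ($rank = 0; $rank < $count; $rank++) {
    $token = 'tok' . $rank;
    $tokenToRank[$token] = $rank;
    $payloadBytes += strlen($token);
}

$used = memory_get_usage() - $before;

printf("raw token bytes : %d\n", $payloadBytes);
printf("array memory    : %d\n", $used);
printf("bytes per entry : %.1f\n", $used / $count);
```

Each entry costs a hash table slot plus a separately allocated string for the key, so the in-memory size is dominated by per-entry overhead rather than by the token bytes.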

Profile

<?php

use Yethee\Tiktoken\EncoderProvider;

require_once 'vendor/autoload.php';

$provider = new EncoderProvider();
$encoder = $provider->get('<encoding>');

Top memory consumer: Vocab::fromStream()

Encoding: cl100k_base

*** SPX Report ***

Global stats:

  Called functions    :       81
  Distinct functions  :       50

  Wall time           :  161.9ms
  ZE memory usage     :   11.8MB

Flat profile:

 Wall time           | ZE memory usage     |
 Inc.     | *Exc.    | Inc.     | Exc.     | Called   | Function
----------+----------+----------+----------+----------+----------
   70.2ms |   59.0ms |  432.2KB |  418.5KB |       12 | {closure}
   42.1ms |   38.2ms |   10.8MB |    8.8MB |        1 | Yethee\Tiktoken\Vocab\Vocab::fromStream
   78.5ms |    5.9ms |  839.8KB |  363.7KB |        1 | ComposerAutoloaderInitac9bfb1d4166aeecccdb5d5dfb6f6537::getLoader
    5.0ms |    5.0ms |     120B |     120B |        1 | Yethee\Tiktoken\Vocab\Loader\DefaultVocabLoader::checkHash
    4.0ms |    4.0ms |    2.0MB |    2.0MB |        1 | Yethee\Tiktoken\Vocab\Vocab::__construct
    2.4ms |    2.4ms |   43.0KB |   43.0KB |        1 | ComposerAutoloaderInitac9bfb1d4166aeecccdb5d5dfb6f6537::loadClassLoader
   29.9us |   29.9us |       0B |       0B |        1 | /var/src/tiktoken/vendor/phpunit/phpunit/src/Framework/Assert/Functions.php
   42.1ms |   19.4us |   10.8MB |   -8.0KB |        1 | Yethee\Tiktoken\Vocab\Vocab::fromFile
   15.4us |   15.4us |     424B |     424B |        1 | Composer\Autoload\ClassLoader::initializeIncludeClosure
    5.7ms |   11.7us |     592B |       0B |        6 | Composer\Autoload\ClassLoader::findFile

Encoding: o200k_base

*** SPX Report ***

Global stats:

  Called functions    :       81
  Distinct functions  :       50

  Wall time           :  202.1ms
  ZE memory usage     :   22.7MB

Flat profile:

 Wall time           | ZE memory usage     |
 Inc.     | *Exc.    | Inc.     | Exc.     | Called   | Function
----------+----------+----------+----------+----------+----------
   84.6ms |   76.1ms |   21.8MB |   17.8MB |        1 | Yethee\Tiktoken\Vocab\Vocab::fromStream
   16.4ms |   14.6ms |   64.9KB |   65.1KB |        6 | 1@Composer\Autoload\{closure}
   10.8ms |   10.8ms |     120B |     120B |        1 | Yethee\Tiktoken\Vocab\Loader\DefaultVocabLoader::checkHash
    8.5ms |    8.5ms |    4.0MB |    4.0MB |        1 | Yethee\Tiktoken\Vocab\Vocab::__construct
    2.0ms |    2.0ms |   43.0KB |   43.0KB |        1 | ComposerAutoloaderInitac9bfb1d4166aeecccdb5d5dfb6f6537::loadClassLoader
   31.9us |   31.9us |       0B |       0B |        1 | /var/src/tiktoken/vendor/phpunit/phpunit/src/Framework/Assert/Functions.php
   84.7ms |   23.8us |   21.8MB |   -8.0KB |        1 | Yethee\Tiktoken\Vocab\Vocab::fromFile
    5.5ms |   10.6us |     592B |       0B |        6 | Composer\Autoload\ClassLoader::findFile
    6.8us |    6.8us |      48B |      48B |        1 | Yethee\Tiktoken\EncoderProvider::__construct
  106.4ms |    6.1us |   21.8MB |     432B |        1 | Yethee\Tiktoken\EncoderProvider::getVocab
