Fast encode #1560

ArthurZucker · 2024-06-20T14:54:42Z

Try to make our code faster :)

From the initial bench for GPT2:

20% of the time is spent in the pre_tokenizer when doing batch encoding
8% for no cache
xx% for added tokens (not 100% sure, gotta remove them and add them again, add other tokens as well)
removing ïng" reduces performance by 700% lol
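
For readers who want to reproduce this kind of per-stage breakdown without a profiler, here is a minimal sketch using only `std::time::Instant`. It is not the code used to obtain the numbers above: `pre_tokenize`, `encode_words`, `share_percent` and `profile_batch` are hypothetical stand-ins, not functions of the tokenizers crate, and a whitespace split is assumed in place of the real GPT2 pre-tokenizer.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a pre-tokenizer: splits on whitespace.
fn pre_tokenize(text: &str) -> Vec<&str> {
    text.split_whitespace().collect()
}

/// Hypothetical stand-in for the model step: maps each word to its byte length.
fn encode_words(words: &[&str]) -> Vec<usize> {
    words.iter().map(|w| w.len()).collect()
}

/// Percentage of `total` taken by `part` (0.0 when `total` is zero).
fn share_percent(part: Duration, total: Duration) -> f64 {
    if total.is_zero() {
        0.0
    } else {
        100.0 * part.as_secs_f64() / total.as_secs_f64()
    }
}

/// Returns (time spent in pre-tokenization, time spent in the whole encode).
fn profile_batch(batch: &[&str]) -> (Duration, Duration) {
    let mut pre = Duration::ZERO;
    let mut total = Duration::ZERO;
    for text in batch {
        let start = Instant::now();
        let words = pre_tokenize(text);
        let after_pre = Instant::now();
        let ids = encode_words(&words);
        // Keep the optimizer from discarding the work being timed.
        std::hint::black_box(ids);
        let end = Instant::now();
        pre += after_pre - start;
        total += end - start;
    }
    (pre, total)
}
```

Calling `profile_batch(&batch)` and passing the two durations to `share_percent` gives a rough percentage for the pre-tokenization stage. Wall-clock timing of such tiny stages is noisy, so a sampling profiler or Criterion remains the better tool for real measurements.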

Initial bench results:

    Finished `bench` profile [optimized] target(s) in 32.40s
     Running benches/bert_benchmark.rs (target/release/deps/bert_benchmark-978096f5c7d2a77c)
Gnuplot not found, using plotters backend
Benchmarking WordPiece BERT encode
Benchmarking WordPiece BERT encode: Warming up for 3.0000 s
Benchmarking WordPiece BERT encode: Collecting 20 samples in estimated 5.0031 s (284970 iterations)
Benchmarking WordPiece BERT encode: Analyzing
WordPiece BERT encode   time:   [17.399 µs 17.406 µs 17.416 µs]
                        change: [-2.1128% -1.9745% -1.8658%] (p = 0.00 < 0.05)
                        Performance has improved.
slope  [17.399 µs 17.416 µs] R^2            [0.9999587 0.9999530]
mean   [17.413 µs 17.439 µs] std. dev.      [21.867 ns 38.768 ns]
median [17.403 µs 17.448 µs] med. abs. dev. [9.7665 ns 47.692 ns]

Benchmarking WordPiece BERT encode batch
Benchmarking WordPiece BERT encode batch: Warming up for 3.0000 s
Benchmarking WordPiece BERT encode batch: Collecting 20 samples in estimated 5.5509 s (1890 iterations)
Benchmarking WordPiece BERT encode batch: Analyzing
WordPiece BERT encode batch
                        time:   [2.8891 ms 2.8920 ms 2.8945 ms]
                        change: [-19.384% -19.126% -18.887%] (p = 0.00 < 0.05)
                        Performance has improved.
slope  [2.8891 ms 2.8945 ms] R^2            [0.9998221 0.9998317]
mean   [2.8851 ms 2.8940 ms] std. dev.      [7.3858 µs 12.625 µs]
median [2.8833 ms 2.8963 ms] med. abs. dev. [4.4561 µs 16.018 µs]

Benchmarking WordPiece Train vocabulary (small)
Benchmarking WordPiece Train vocabulary (small): Warming up for 3.0000 s
Benchmarking WordPiece Train vocabulary (small): Collecting 10 samples in estimated 5.7655 s (220 iterations)
Benchmarking WordPiece Train vocabulary (small): Analyzing
WordPiece Train vocabulary (small)
                        time:   [25.873 ms 25.988 ms 26.085 ms]
                        change: [-1.5674% -0.8238% -0.0549%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
slope  [25.873 ms 26.085 ms] R^2            [0.9990557 0.9991575]
mean   [25.924 ms 26.231 ms] std. dev.      [114.50 µs 334.27 µs]
median [25.868 ms 26.258 ms] med. abs. dev. [64.904 µs 427.20 µs]

Benchmarking WordPiece Train vocabulary (big)
Benchmarking WordPiece Train vocabulary (big): Warming up for 3.0000 s

Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 7.8s.
Benchmarking WordPiece Train vocabulary (big): Collecting 10 samples in estimated 7.8009 s (10 iterations)
Benchmarking WordPiece Train vocabulary (big): Analyzing
WordPiece Train vocabulary (big)
                        time:   [770.56 ms 775.31 ms 780.55 ms]
                        change: [-3.4898% -1.7298% -0.1796%] (p = 0.07 > 0.05)
                        No change in performance detected.
mean   [770.56 ms 780.55 ms] std. dev.      [3.8811 ms 11.788 ms]
median [769.56 ms 779.70 ms] med. abs. dev. [387.67 µs 13.909 ms]

     Running benches/bpe_benchmark.rs (target/release/deps/bpe_benchmark-4074fa6f48a53e0d)
Gnuplot not found, using plotters backend
Benchmarking BPE GPT2 encode
Benchmarking BPE GPT2 encode: Warming up for 3.0000 s
Benchmarking BPE GPT2 encode: Collecting 20 samples in estimated 5.0014 s (470190 iterations)
Benchmarking BPE GPT2 encode: Analyzing
BPE GPT2 encode         time:   [10.756 µs 10.764 µs 10.775 µs]
                        change: [-4.0388% -3.9526% -3.8664%] (p = 0.00 < 0.05)
                        Performance has improved.
slope  [10.756 µs 10.775 µs] R^2            [0.9999103 0.9998945]
mean   [10.757 µs 10.769 µs] std. dev.      [9.0797 ns 18.260 ns]
median [10.752 µs 10.767 µs] med. abs. dev. [4.5995 ns 20.446 ns]

Benchmarking BPE GPT2 encode batch
Benchmarking BPE GPT2 encode batch: Warming up for 3.0000 s
Benchmarking BPE GPT2 encode batch: Collecting 20 samples in estimated 5.0345 s (1470 iterations)
Benchmarking BPE GPT2 encode batch: Analyzing
BPE GPT2 encode batch   time:   [3.3300 ms 3.3363 ms 3.3424 ms]
                        change: [-7.9889% -7.7761% -7.5698%] (p = 0.00 < 0.05)
                        Performance has improved.
slope  [3.3300 ms 3.3424 ms] R^2            [0.9995406 0.9995482]
mean   [3.3300 ms 3.3407 ms] std. dev.      [8.6931 µs 15.591 µs]
median [3.3271 ms 3.3445 ms] med. abs. dev. [6.2592 µs 19.177 µs]

Benchmarking BPE GPT2 encode, no cache
Benchmarking BPE GPT2 encode, no cache: Warming up for 3.0000 s
Benchmarking BPE GPT2 encode, no cache: Collecting 20 samples in estimated 5.0026 s (273420 iterations)
Benchmarking BPE GPT2 encode, no cache: Analyzing
BPE GPT2 encode, no cache
                        time:   [18.453 µs 18.462 µs 18.469 µs]
                        change: [-1.5596% -1.1521% -0.8580%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high severe
slope  [18.453 µs 18.469 µs] R^2            [0.9999700 0.9999731]
mean   [18.453 µs 18.494 µs] std. dev.      [13.041 ns 83.137 ns]
median [18.450 µs 18.470 µs] med. abs. dev. [9.6415 ns 29.639 ns]

Benchmarking BPE GPT2 encode batch, no cache
Benchmarking BPE GPT2 encode batch, no cache: Warming up for 3.0000 s
Benchmarking BPE GPT2 encode batch, no cache: Collecting 20 samples in estimated 5.5411 s (1680 iterations)
Benchmarking BPE GPT2 encode batch, no cache: Analyzing
BPE GPT2 encode batch, no cache
                        time:   [3.2316 ms 3.2393 ms 3.2461 ms]
                        change: [-17.699% -17.461% -17.213%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
slope  [3.2316 ms 3.2461 ms] R^2            [0.9993385 0.9993742]
mean   [3.2317 ms 3.2450 ms] std. dev.      [10.441 µs 19.642 µs]
median [3.2313 ms 3.2424 ms] med. abs. dev. [4.9099 µs 24.161 µs]

Benchmarking BPE Train vocabulary (small)
Benchmarking BPE Train vocabulary (small): Warming up for 3.0000 s
Benchmarking BPE Train vocabulary (small): Collecting 10 samples in estimated 5.3267 s (220 iterations)
Benchmarking BPE Train vocabulary (small): Analyzing
BPE Train vocabulary (small)
                        time:   [24.407 ms 24.447 ms 24.481 ms]
                        change: [+1.0549% +1.5442% +1.9649%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
slope  [24.407 ms 24.481 ms] R^2            [0.9998512 0.9998621]
mean   [24.336 ms 24.488 ms] std. dev.      [47.237 µs 181.39 µs]
median [24.375 ms 24.509 ms] med. abs. dev. [9.3173 µs 202.16 µs]

Benchmarking BPE Train vocabulary (big)
Benchmarking BPE Train vocabulary (big): Warming up for 3.0000 s

Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 7.8s.
Benchmarking BPE Train vocabulary (big): Collecting 10 samples in estimated 7.7637 s (10 iterations)
Benchmarking BPE Train vocabulary (big): Analyzing
BPE Train vocabulary (big)
                        time:   [774.49 ms 794.30 ms 815.15 ms]
                        change: [-0.4729% +2.2009% +4.8075%] (p = 0.14 > 0.05)
                        No change in performance detected.
mean   [774.49 ms 815.15 ms] std. dev.      [21.389 ms 38.322 ms]
median [765.31 ms 832.69 ms] med. abs. dev. [2.5689 ms 52.865 ms]

     Running benches/layout_benchmark.rs (target/release/deps/layout_benchmark-5c3c3bf9f881b17f)
Gnuplot not found, using plotters backend
Benchmarking TemplateProcessing single encode
Benchmarking TemplateProcessing single encode: Warming up for 3.0000 s
Benchmarking TemplateProcessing single encode: Collecting 20 samples in estimated 5.0002 s (5480580 iterations)
Benchmarking TemplateProcessing single encode: Analyzing
TemplateProcessing single encode
                        time:   [609.74 ns 610.92 ns 613.16 ns]
                        change: [-33.129% -31.103% -29.605%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 20 measurements (15.00%)
  3 (15.00%) high severe
slope  [609.74 ns 613.16 ns] R^2            [0.9983154 0.9980014]
mean   [611.30 ns 618.26 ns] std. dev.      [2.5256 ns 11.079 ns]
median [609.62 ns 613.20 ns] med. abs. dev. [857.98 ps 5.5359 ns]

Benchmarking TemplateProcessing pair encode
Benchmarking TemplateProcessing pair encode: Warming up for 3.0000 s
Benchmarking TemplateProcessing pair encode: Collecting 20 samples in estimated 5.0000 s (2875110 iterations)
Benchmarking TemplateProcessing pair encode: Analyzing
TemplateProcessing pair encode
                        time:   [1.3108 µs 1.3141 µs 1.3181 µs]
                        change: [-40.953% -38.634% -36.768%] (p = 0.00 < 0.05)
                        Performance has improved.
slope  [1.3108 µs 1.3181 µs] R^2            [0.9986700 0.9985652]
mean   [1.3154 µs 1.3266 µs] std. dev.      [9.0906 ns 15.799 ns]
median [1.3126 µs 1.3249 µs] med. abs. dev. [5.7862 ns 20.688 ns]

     Running benches/unigram_benchmark.rs (target/release/deps/unigram_benchmark-b1d455b46edaf1cb)
Gnuplot not found, using plotters backend
Benchmarking Unigram Train vocabulary (small)
Benchmarking Unigram Train vocabulary (small): Warming up for 3.0000 s
Benchmarking Unigram Train vocabulary (small): Collecting 10 samples in estimated 5.0702 s (770 iterations)
Benchmarking Unigram Train vocabulary (small): Analyzing
Unigram Train vocabulary (small)
                        time:   [6.4149 ms 6.4314 ms 6.4432 ms]
                        change: [-2.6275% -1.9549% -1.2429%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild
slope  [6.4149 ms 6.4432 ms] R^2            [0.9996525 0.9997142]
mean   [6.4168 ms 6.4718 ms] std. dev.      [20.996 µs 60.882 µs]
median [6.4078 ms 6.4779 ms] med. abs. dev. [6.2033 µs 81.149 µs]

Benchmarking Unigram Train vocabulary (medium)
Benchmarking Unigram Train vocabulary (medium): Warming up for 3.0000 s

Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 6.3s.
Benchmarking Unigram Train vocabulary (medium): Collecting 10 samples in estimated 6.3170 s (10 iterations)
Benchmarking Unigram Train vocabulary (medium): Analyzing
Unigram Train vocabulary (medium)
                        time:   [632.07 ms 634.25 ms 636.24 ms]
                        change: [+1.3224% +1.7548% +2.1929%] (p = 0.00 < 0.05)
                        Performance has regressed.
mean   [632.07 ms 636.24 ms] std. dev.      [1.7710 ms 4.3467 ms]
median [631.05 ms 636.90 ms] med. abs. dev. [469.35 µs 5.9250 ms]

I am checking whether the pre_tokenizer takes a lot of time or not, but mostly I am seeing that our merging algorithm is the bottleneck now (apart from this current fix, which gains ~20%). Will dive!
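
To make the "merging algorithm" cost concrete, here is a minimal sketch of a naive BPE merge loop. It is an illustration under stated assumptions, not the implementation in this repository: `ranks_from` and `naive_bpe` are hypothetical names, and the merge table is a plain `HashMap` keyed by string pairs. Each merge step rescans every adjacent pair, so a word of n symbols can cost on the order of n² lookups, which is why this step tends to dominate once the surrounding code gets faster.

```rust
use std::collections::HashMap;

/// Build a rank table from an ordered merge list (earlier merge = lower rank).
fn ranks_from(merges: &[(&str, &str)]) -> HashMap<(String, String), usize> {
    merges
        .iter()
        .enumerate()
        .map(|(rank, (a, b))| ((a.to_string(), b.to_string()), rank))
        .collect()
}

/// Naive BPE: repeatedly merge the adjacent pair with the lowest rank.
fn naive_bpe(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    // Start from single characters.
    let mut symbols: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Full rescan of all adjacent pairs on every iteration.
        let mut best: Option<(usize, usize)> = None; // (rank, index)
        for i in 0..symbols.len().saturating_sub(1) {
            let key = (symbols[i].clone(), symbols[i + 1].clone());
            if let Some(&rank) = ranks.get(&key) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            // Merge the winning pair in place, then scan again.
            Some((_, i)) => {
                let right = symbols.remove(i + 1);
                symbols[i].push_str(&right);
            }
            // No known pair left: the word is fully merged.
            None => break,
        }
    }
    symbols
}
```

With the merges `("l", "o")`, `("lo", "w")`, `("e", "r")`, the word `lower` ends up as `["low", "er"]`. Faster variants usually avoid the full rescan, for example by keeping candidate pairs in a priority queue, and avoid allocating a key for every lookup.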

HuggingFaceDocBuilderDev · 2024-06-20T14:57:15Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker marked this pull request as ready for review July 24, 2024 18:17

ArthurZucker added 17 commits July 24, 2024 20:18

initial commit

8bd9f33

sounds fun

9b5f433

what I hope for

079040e

nit

fddbbf8

oiups

0e61735

nit

085b068

Is this what's expected?

459fe62

just testing some sutff

5b62103

add test

6499eb2

important bench?

d50ee79

update

c2b3655

push fast path

7038928

better bench

11dc00a

nit

7708bfc

push

214e117

revert and cleanup

b6bdcb8

revert

86f08f6

ArthurZucker force-pushed the fast-encode branch from 08a9ecf to 86f08f6 Compare July 24, 2024 18:21

This was referenced Jul 24, 2024

return pytorch tensors like in transformers? #1578

Closed

Why the tokenizer is slower than tiktoken? #1519

Open

ArthurZucker closed this Aug 8, 2024
