
Fast encode #1560

Closed
wants to merge 17 commits into from

Conversation

ArthurZucker (Collaborator) commented Jun 20, 2024

Try to make our code faster :)

From the initial bench for GPT2:

  • 20% of the time is spent in the pre_tokenizer when doing batch encoding (see the benchmark sketch after this list)
  • 8% for no cache
  • xx% for added tokens (not 100% sure; gotta remove them and add them again, and add other tokens as well)
  • removing "ing" reduces performance by 700% lol
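
For reference, the numbers below come from the crate's Criterion benches (run with `cargo bench`). Here is a minimal sketch of what such an encode benchmark looks like; the tokenizer path and sample sentence are placeholders I made up, not the actual bench fixtures.

```rust
// Minimal Criterion sketch of a GPT-2 encode benchmark, similar in shape to
// benches/bpe_benchmark.rs. The tokenizer file path and the sample sentence
// are placeholders (assumptions), not the crate's actual fixtures.
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use tokenizers::Tokenizer;

fn bench_gpt2_encode(c: &mut Criterion) {
    // Load a serialized GPT-2 tokenizer (normalizer + pre_tokenizer + BPE model).
    let tokenizer = Tokenizer::from_file("data/gpt2-tokenizer.json").unwrap();

    c.bench_function("BPE GPT2 encode", |b| {
        b.iter(|| {
            // encode() runs the whole pipeline, so pre_tokenizer time shows up here.
            tokenizer
                .encode(black_box("Hello there, how are you doing today?"), false)
                .unwrap()
        })
    });
}

criterion_group!(benches, bench_gpt2_encode);
criterion_main!(benches);
```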

Initial bench results:

    Finished `bench` profile [optimized] target(s) in 32.40s
     Running benches/bert_benchmark.rs (target/release/deps/bert_benchmark-978096f5c7d2a77c)
Gnuplot not found, using plotters backend
Benchmarking WordPiece BERT encode
Benchmarking WordPiece BERT encode: Warming up for 3.0000 s
Benchmarking WordPiece BERT encode: Collecting 20 samples in estimated 5.0031 s (284970 iterations)
Benchmarking WordPiece BERT encode: Analyzing
WordPiece BERT encode   time:   [17.399 µs 17.406 µs 17.416 µs]
                        change: [-2.1128% -1.9745% -1.8658%] (p = 0.00 < 0.05)
                        Performance has improved.
slope  [17.399 µs 17.416 µs] R^2            [0.9999587 0.9999530]
mean   [17.413 µs 17.439 µs] std. dev.      [21.867 ns 38.768 ns]
median [17.403 µs 17.448 µs] med. abs. dev. [9.7665 ns 47.692 ns]

Benchmarking WordPiece BERT encode batch
Benchmarking WordPiece BERT encode batch: Warming up for 3.0000 s
Benchmarking WordPiece BERT encode batch: Collecting 20 samples in estimated 5.5509 s (1890 iterations)
Benchmarking WordPiece BERT encode batch: Analyzing
WordPiece BERT encode batch
                        time:   [2.8891 ms 2.8920 ms 2.8945 ms]
                        change: [-19.384% -19.126% -18.887%] (p = 0.00 < 0.05)
                        Performance has improved.
slope  [2.8891 ms 2.8945 ms] R^2            [0.9998221 0.9998317]
mean   [2.8851 ms 2.8940 ms] std. dev.      [7.3858 µs 12.625 µs]
median [2.8833 ms 2.8963 ms] med. abs. dev. [4.4561 µs 16.018 µs]

Benchmarking WordPiece Train vocabulary (small)
Benchmarking WordPiece Train vocabulary (small): Warming up for 3.0000 s
Benchmarking WordPiece Train vocabulary (small): Collecting 10 samples in estimated 5.7655 s (220 iterations)
Benchmarking WordPiece Train vocabulary (small): Analyzing
WordPiece Train vocabulary (small)
                        time:   [25.873 ms 25.988 ms 26.085 ms]
                        change: [-1.5674% -0.8238% -0.0549%] (p = 0.06 > 0.05)
                        No change in performance detected.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild
slope  [25.873 ms 26.085 ms] R^2            [0.9990557 0.9991575]
mean   [25.924 ms 26.231 ms] std. dev.      [114.50 µs 334.27 µs]
median [25.868 ms 26.258 ms] med. abs. dev. [64.904 µs 427.20 µs]

Benchmarking WordPiece Train vocabulary (big)
Benchmarking WordPiece Train vocabulary (big): Warming up for 3.0000 s

Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 7.8s.
Benchmarking WordPiece Train vocabulary (big): Collecting 10 samples in estimated 7.8009 s (10 iterations)
Benchmarking WordPiece Train vocabulary (big): Analyzing
WordPiece Train vocabulary (big)
                        time:   [770.56 ms 775.31 ms 780.55 ms]
                        change: [-3.4898% -1.7298% -0.1796%] (p = 0.07 > 0.05)
                        No change in performance detected.
mean   [770.56 ms 780.55 ms] std. dev.      [3.8811 ms 11.788 ms]
median [769.56 ms 779.70 ms] med. abs. dev. [387.67 µs 13.909 ms]

     Running benches/bpe_benchmark.rs (target/release/deps/bpe_benchmark-4074fa6f48a53e0d)
Gnuplot not found, using plotters backend
Benchmarking BPE GPT2 encode
Benchmarking BPE GPT2 encode: Warming up for 3.0000 s
Benchmarking BPE GPT2 encode: Collecting 20 samples in estimated 5.0014 s (470190 iterations)
Benchmarking BPE GPT2 encode: Analyzing
BPE GPT2 encode         time:   [10.756 µs 10.764 µs 10.775 µs]
                        change: [-4.0388% -3.9526% -3.8664%] (p = 0.00 < 0.05)
                        Performance has improved.
slope  [10.756 µs 10.775 µs] R^2            [0.9999103 0.9998945]
mean   [10.757 µs 10.769 µs] std. dev.      [9.0797 ns 18.260 ns]
median [10.752 µs 10.767 µs] med. abs. dev. [4.5995 ns 20.446 ns]

Benchmarking BPE GPT2 encode batch
Benchmarking BPE GPT2 encode batch: Warming up for 3.0000 s
Benchmarking BPE GPT2 encode batch: Collecting 20 samples in estimated 5.0345 s (1470 iterations)
Benchmarking BPE GPT2 encode batch: Analyzing
BPE GPT2 encode batch   time:   [3.3300 ms 3.3363 ms 3.3424 ms]
                        change: [-7.9889% -7.7761% -7.5698%] (p = 0.00 < 0.05)
                        Performance has improved.
slope  [3.3300 ms 3.3424 ms] R^2            [0.9995406 0.9995482]
mean   [3.3300 ms 3.3407 ms] std. dev.      [8.6931 µs 15.591 µs]
median [3.3271 ms 3.3445 ms] med. abs. dev. [6.2592 µs 19.177 µs]

Benchmarking BPE GPT2 encode, no cache
Benchmarking BPE GPT2 encode, no cache: Warming up for 3.0000 s
Benchmarking BPE GPT2 encode, no cache: Collecting 20 samples in estimated 5.0026 s (273420 iterations)
Benchmarking BPE GPT2 encode, no cache: Analyzing
BPE GPT2 encode, no cache
                        time:   [18.453 µs 18.462 µs 18.469 µs]
                        change: [-1.5596% -1.1521% -0.8580%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high severe
slope  [18.453 µs 18.469 µs] R^2            [0.9999700 0.9999731]
mean   [18.453 µs 18.494 µs] std. dev.      [13.041 ns 83.137 ns]
median [18.450 µs 18.470 µs] med. abs. dev. [9.6415 ns 29.639 ns]

Benchmarking BPE GPT2 encode batch, no cache
Benchmarking BPE GPT2 encode batch, no cache: Warming up for 3.0000 s
Benchmarking BPE GPT2 encode batch, no cache: Collecting 20 samples in estimated 5.5411 s (1680 iterations)
Benchmarking BPE GPT2 encode batch, no cache: Analyzing
BPE GPT2 encode batch, no cache
                        time:   [3.2316 ms 3.2393 ms 3.2461 ms]
                        change: [-17.699% -17.461% -17.213%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
slope  [3.2316 ms 3.2461 ms] R^2            [0.9993385 0.9993742]
mean   [3.2317 ms 3.2450 ms] std. dev.      [10.441 µs 19.642 µs]
median [3.2313 ms 3.2424 ms] med. abs. dev. [4.9099 µs 24.161 µs]

Benchmarking BPE Train vocabulary (small)
Benchmarking BPE Train vocabulary (small): Warming up for 3.0000 s
Benchmarking BPE Train vocabulary (small): Collecting 10 samples in estimated 5.3267 s (220 iterations)
Benchmarking BPE Train vocabulary (small): Analyzing
BPE Train vocabulary (small)
                        time:   [24.407 ms 24.447 ms 24.481 ms]
                        change: [+1.0549% +1.5442% +1.9649%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) low mild
slope  [24.407 ms 24.481 ms] R^2            [0.9998512 0.9998621]
mean   [24.336 ms 24.488 ms] std. dev.      [47.237 µs 181.39 µs]
median [24.375 ms 24.509 ms] med. abs. dev. [9.3173 µs 202.16 µs]

Benchmarking BPE Train vocabulary (big)
Benchmarking BPE Train vocabulary (big): Warming up for 3.0000 s

Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 7.8s.
Benchmarking BPE Train vocabulary (big): Collecting 10 samples in estimated 7.7637 s (10 iterations)
Benchmarking BPE Train vocabulary (big): Analyzing
BPE Train vocabulary (big)
                        time:   [774.49 ms 794.30 ms 815.15 ms]
                        change: [-0.4729% +2.2009% +4.8075%] (p = 0.14 > 0.05)
                        No change in performance detected.
mean   [774.49 ms 815.15 ms] std. dev.      [21.389 ms 38.322 ms]
median [765.31 ms 832.69 ms] med. abs. dev. [2.5689 ms 52.865 ms]

     Running benches/layout_benchmark.rs (target/release/deps/layout_benchmark-5c3c3bf9f881b17f)
Gnuplot not found, using plotters backend
Benchmarking TemplateProcessing single encode
Benchmarking TemplateProcessing single encode: Warming up for 3.0000 s
Benchmarking TemplateProcessing single encode: Collecting 20 samples in estimated 5.0002 s (5480580 iterations)
Benchmarking TemplateProcessing single encode: Analyzing
TemplateProcessing single encode
                        time:   [609.74 ns 610.92 ns 613.16 ns]
                        change: [-33.129% -31.103% -29.605%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 20 measurements (15.00%)
  3 (15.00%) high severe
slope  [609.74 ns 613.16 ns] R^2            [0.9983154 0.9980014]
mean   [611.30 ns 618.26 ns] std. dev.      [2.5256 ns 11.079 ns]
median [609.62 ns 613.20 ns] med. abs. dev. [857.98 ps 5.5359 ns]

Benchmarking TemplateProcessing pair encode
Benchmarking TemplateProcessing pair encode: Warming up for 3.0000 s
Benchmarking TemplateProcessing pair encode: Collecting 20 samples in estimated 5.0000 s (2875110 iterations)
Benchmarking TemplateProcessing pair encode: Analyzing
TemplateProcessing pair encode
                        time:   [1.3108 µs 1.3141 µs 1.3181 µs]
                        change: [-40.953% -38.634% -36.768%] (p = 0.00 < 0.05)
                        Performance has improved.
slope  [1.3108 µs 1.3181 µs] R^2            [0.9986700 0.9985652]
mean   [1.3154 µs 1.3266 µs] std. dev.      [9.0906 ns 15.799 ns]
median [1.3126 µs 1.3249 µs] med. abs. dev. [5.7862 ns 20.688 ns]

     Running benches/unigram_benchmark.rs (target/release/deps/unigram_benchmark-b1d455b46edaf1cb)
Gnuplot not found, using plotters backend
Benchmarking Unigram Train vocabulary (small)
Benchmarking Unigram Train vocabulary (small): Warming up for 3.0000 s
Benchmarking Unigram Train vocabulary (small): Collecting 10 samples in estimated 5.0702 s (770 iterations)
Benchmarking Unigram Train vocabulary (small): Analyzing
Unigram Train vocabulary (small)
                        time:   [6.4149 ms 6.4314 ms 6.4432 ms]
                        change: [-2.6275% -1.9549% -1.2429%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 10 measurements (20.00%)
  2 (20.00%) high mild
slope  [6.4149 ms 6.4432 ms] R^2            [0.9996525 0.9997142]
mean   [6.4168 ms 6.4718 ms] std. dev.      [20.996 µs 60.882 µs]
median [6.4078 ms 6.4779 ms] med. abs. dev. [6.2033 µs 81.149 µs]

Benchmarking Unigram Train vocabulary (medium)
Benchmarking Unigram Train vocabulary (medium): Warming up for 3.0000 s

Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 6.3s.
Benchmarking Unigram Train vocabulary (medium): Collecting 10 samples in estimated 6.3170 s (10 iterations)
Benchmarking Unigram Train vocabulary (medium): Analyzing
Unigram Train vocabulary (medium)
                        time:   [632.07 ms 634.25 ms 636.24 ms]
                        change: [+1.3224% +1.7548% +2.1929%] (p = 0.00 < 0.05)
                        Performance has regressed.
mean   [632.07 ms 636.24 ms] std. dev.      [1.7710 ms 4.3467 ms]
median [631.05 ms 636.90 ms] med. abs. dev. [469.35 µs 5.9250 ms]

I am checking whether the pre_tokenizer takes a lot of time or not, but mostly I am seeing that our merging algorithm is the bottleneck now (apart from this current fix, which gains ~20%). Will dive in!
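
To make the "no cache" and merging-bottleneck remarks concrete, here is a generic, simplified sketch of a greedy BPE merge loop with a word-level cache. This is illustrative only, not the crate's implementation, and all names in it are made up.

```rust
// Generic illustration of why BPE merging dominates encode time and why the
// cache matters: for each pre-tokenized word, the model repeatedly scans for
// the best-ranked adjacent pair and merges it. A word -> tokens cache skips
// this loop for repeated words; the "no cache" benches above pay it every time.
use std::collections::HashMap;

fn bpe_encode_word(word: &str, ranks: &HashMap<(String, String), usize>) -> Vec<String> {
    // Start from single characters (real byte-level BPE starts from bytes).
    let mut parts: Vec<String> = word.chars().map(|c| c.to_string()).collect();
    loop {
        // Find the adjacent pair with the lowest merge rank (highest priority).
        let best = parts
            .windows(2)
            .enumerate()
            .filter_map(|(i, w)| ranks.get(&(w[0].clone(), w[1].clone())).map(|&r| (r, i)))
            .min();
        match best {
            Some((_, i)) => {
                // Merge the pair in place and rescan: this loop is the hot path.
                let merged = format!("{}{}", parts[i], parts[i + 1]);
                parts[i] = merged;
                parts.remove(i + 1);
            }
            None => return parts,
        }
    }
}

fn encode_cached(
    word: &str,
    ranks: &HashMap<(String, String), usize>,
    cache: &mut HashMap<String, Vec<String>>,
) -> Vec<String> {
    // With the cache, a repeated word costs one hash lookup instead of the merge loop.
    cache
        .entry(word.to_string())
        .or_insert_with(|| bpe_encode_word(word, ranks))
        .clone()
}
```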

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker marked this pull request as ready for review July 24, 2024 18:17