Decode regression #1564
Open · daulet opened this issue Jul 10, 2024 · 11 comments
@daulet commented Jul 10, 2024

This is not a recent regression, and perhaps it won't be fixed for that reason, but I thought I'd file it anyway.

I maintain Go bindings for this library, and by sheer luck I had benchmarks in place from the start. At some point I noticed a regression in decoding, but only now got around to investigating it. Long story short, I bisected this repo and root-caused it to this PR. Below is the benchmark I used to find it.

Regression details:

decode                  time:   [3.9277 µs 3.9409 µs 3.9558 µs]
                        change: [+241.37% +242.64% +244.06%] (p = 0.00 < 0.05)
                        Performance has regressed.

While decode is still fast in absolute terms (on the order of microseconds), a +240% slowdown is substantial, and I wonder if we can win back that performance.

Benchmark code (tokenizers/benches/decode_benchmark.rs):

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use tokenizers::tokenizer::Tokenizer;

// Decode a fixed id sequence back into text, panicking loudly on failure.
fn decode(tokenizer: &Tokenizer, ids_slice: Vec<u32>, skip_special_tokens: bool) -> String {
    tokenizer
        .decode(ids_slice, skip_special_tokens)
        .expect("failed to decode input")
}

fn criterion_benchmark(c: &mut Criterion) {
    let tokenizer = Tokenizer::from_file("./test/data/bert-base-uncased.json")
        .expect("failed to create tokenizer");
    c.bench_function("decode", |b| {
        b.iter(|| {
            decode(
                &tokenizer,
                black_box([2829, 4419, 14523, 2058, 1996, 13971, 3899].to_vec()),
                black_box(true),
            )
        })
    });
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Add this to Cargo.toml and run with cargo bench decode.

[[bench]]
name = "decode_benchmark"
harness = false

The tokenizer file is copied from here.
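
In case it's useful for reproducing, here is a minimal sketch of the round trip that produces ids like the hard-coded ones above (the sample sentence is arbitrary; any input works):

use tokenizers::tokenizer::Tokenizer;

fn main() {
    // Same tokenizer file as the benchmark.
    let tokenizer = Tokenizer::from_file("./test/data/bert-base-uncased.json")
        .expect("failed to create tokenizer");

    // Encode an arbitrary sentence (no special tokens) to obtain token ids.
    let encoding = tokenizer
        .encode("brown fox jumps over the lazy dog", false)
        .expect("failed to encode input");
    let ids = encoding.get_ids().to_vec();

    // Decode the ids back to text -- the exact path the benchmark measures.
    let text = tokenizer
        .decode(ids, true)
        .expect("failed to decode input");
    println!("{:?} -> {}", encoding.get_ids(), text);
}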

@ArthurZucker (Collaborator)

Oh wow, thanks a mile for the clear report!
Yeah, 240% is really not good! Okay, I'll investigate, but in the meantime, if you have a PR with a fix, that would be super good!

@ArthurZucker (Collaborator)

I'll add this to benches! 🆙

@ArthurZucker (Collaborator)

I checked out the commit you mention. I had to create a custom branch to test: test-old-decode, cherry-picking the test I added in fast-decode-fix so it has just the benchmark. Will post the results here!

@ArthurZucker (Collaborator)

[benchmark screenshot] For now, it seems like the issue was fixed in between.

@daulet (Author) commented Jul 13, 2024

"for now seems like the issue was fixed in between"

You mean you don't see a regression at head? I just synced up to confirm the performance drop. Don't look at the change %, since it reports the change relative to the previous run of the benchmark; just compare absolute values. For example, here is what I got on two consecutive runs at head:

decode                  time:   [3.8542 µs 3.8795 µs 3.9159 µs]
                        change: [+234.61% +236.83% +240.08%] (p = 0.00 < 0.05)
                        Performance has regressed.
decode                  time:   [3.7643 µs 3.7762 µs 3.7891 µs]
                        change: [-3.8222% -2.8277% -2.1576%] (p = 0.00 < 0.05)
                        Performance has improved.
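
(Side note: to make runs directly comparable, criterion can pin a named reference point; a sketch using criterion's standard baseline flags:

cargo bench -- decode --save-baseline before    # on the commit preceding #938
cargo bench -- decode --baseline before         # at head, compared against "before"

That way the change % is measured against a fixed baseline instead of whatever ran last.)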

@ArthurZucker (Collaborator)

I'll check again! I was getting 630 ns for both branches, but let me make sure of that!

@daulet (Author) commented Jul 13, 2024

Oh, I just checked the test-old-decode branch, and it contains the commit (#938) that introduced the regression. You want to branch one commit earlier than that :)
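
Something like this should do it, where <merge-sha> is a placeholder for the merge commit of #938 and pre-938 is just an illustrative branch name:

git checkout -b pre-938 <merge-sha>^    # ^ selects the merge commit's first parent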

@ArthurZucker (Collaborator)

Yeah, I cherry-picked it, I think, to run before/after. I'll check again!

@daulet (Author) commented Aug 9, 2024

Since encode perf was improved in the 0.20 release, any plans to look at this?

@ArthurZucker (Collaborator)

Yep, for sure 😉 As you've seen, I was more focused on encode, but I will pick this back up.

@daulet (Author) commented Nov 5, 2024

@ArthurZucker any updates? :)
