Tokenizer Training Errors: pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: TryFromIntError(()) #1698

Chimaco37 · 2024-12-10T19:47:32Z

Hi, I am trying to train a Unigram tokenizer on DNA sequence data. This is the training code:

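from os.path import join
from tokenizers import Tokenizer
from tokenizers.models import Unigram
from tokenizers.trainers import UnigramTrainer

# all_seqs, Unigram_vocab_size, vocab_size and output_folder are defined earlier in the script (not shown)
print("Unigram tokenizer")
tokenizer = Tokenizer(Unigram())
special_tokens = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>", "<s>", "</s>"]
trainer = UnigramTrainer(special_tokens=special_tokens, vocab_size=Unigram_vocab_size, unk_token="<unk>", show_progress=True)
tokenizer.train_from_iterator(iterator=all_seqs, trainer=trainer)
tokenizer.save(join(output_folder, "SentencePiece_" + str(vocab_size) + ".json"))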

At first, I encountered this error:
(base) [[email protected] slurm_stderr]$ cat slurm-12831381.out
thread '<unnamed>' panicked at /home/runner/work/tokenizers/tokenizers/tokenizers/src/models/unigram/trainer.rs:228:53:
called `Result::unwrap()` on an `Err` value: Internal
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/shfa523g/Chi_Internship/tokenizers_scripts/generate_tokenizers_parallel.py", line 90, in <module>
    tokenizer.train_from_iterator(iterator=all_seqs, trainer=trainer)
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Internal

Then I found a similar issue and followed the suggestion given there:
#821 (comment)

After trying that, I get a different error:
(base) [[email protected] slurm_stderr]$ cat slurm-12861633.out
thread '<unnamed>' panicked at /home/shfa523g/.cargo/registry/src/index.crates.io-6f17d22bba15001f/esaxx-rs-0.1.10/src/esa.rs:70:50:
called `Result::unwrap()` on an `Err` value: TryFromIntError(())
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/shfa523g/Chi_Internship/tokenizers_scripts/generate_tokenizers_parallel.py", line 90, in <module>
    tokenizer.train_from_iterator(iterator=all_seqs, trainer=trainer)
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: TryFromIntError(())

I asked ChatGPT, which suggested that the input sequences might be too long. I then reduced the length to a very small value, but the same error keeps happening.
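
One way to test the "too long" hypothesis is to measure the data before training. `TryFromIntError` is the error Rust's standard library returns when an integer does not fit into the target type, so both the longest single sequence and the total number of characters are worth checking. The sketch below is only a diagnostic aid: the helper name `summarize_lengths` is hypothetical, and the signed 32-bit threshold is an assumption, not something confirmed by the traceback.

```python
from typing import Iterable, Tuple

# Assumption: the failing conversion targets a signed 32-bit integer.
I32_MAX = 2**31 - 1


def summarize_lengths(seqs: Iterable[str]) -> Tuple[int, int, int]:
    """Return (number of sequences, longest sequence length, total characters)."""
    count = longest = total = 0
    for s in seqs:
        n = len(s)
        count += 1
        longest = max(longest, n)
        total += n
    return count, longest, total


if __name__ == "__main__":
    # Tiny stand-in for the real `all_seqs` iterable.
    demo = ["ACGT" * 3, "GATTACA", "N" * 20]
    count, longest, total = summarize_lengths(demo)
    print(f"{count} sequences, longest={longest}, total={total}")
    if total > I32_MAX:
        print("total character count exceeds the signed 32-bit range")
    else:
        print("total character count fits in the signed 32-bit range")
```

If the total is above the threshold while every individual sequence is well below it, shrinking the per-sequence length alone would not be expected to change the outcome.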

Please take a look, thank you very much!


Chimaco37 · 2024-12-11T09:57:57Z

Update:
What I reduced was the chunk size I pass in; the total length of the data did not change.
I mention this because when I reduce the dataset from all 24 chromosomes to a single chromosome, training works and produces correct output.
So the question now is: how can I train the Unigram tokenizer on a large-scale dataset (all 24 chromosomes)? Please help.
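
Since a single chromosome trains fine while all 24 do not, one possible workaround is to train on a random sample of fixed-length windows whose combined size stays under a chosen character budget. This is a minimal sketch, not a confirmed fix: `sample_windows` is a hypothetical helper, the window size and budget are illustrative values, and it assumes the failure depends on the total amount of text handed to the trainer.

```python
import random
from typing import Iterator, List


def sample_windows(seqs: List[str], window: int, budget: int, seed: int = 0) -> Iterator[str]:
    """Yield random fixed-length windows whose combined length stays within `budget`."""
    if window <= 0:
        return
    rng = random.Random(seed)
    # Only sequences at least one window long can be sampled from.
    usable = [s for s in seqs if len(s) >= window]
    if not usable:
        return
    # Weight each sequence by its number of possible window start positions,
    # so longer chromosomes contribute proportionally more windows.
    weights = [len(s) - window + 1 for s in usable]
    for _ in range(budget // window):
        s = rng.choices(usable, weights=weights, k=1)[0]
        start = rng.randrange(len(s) - window + 1)
        yield s[start:start + window]


if __name__ == "__main__":
    demo = ["ACGTACGTAC", "GG", "TTTTTTTT"]
    for w in sample_windows(demo, window=4, budget=20, seed=1):
        print(w)
```

The generator could then be passed in place of the full data, for example `tokenizer.train_from_iterator(sample_windows(all_seqs, 1000, 1_000_000_000), trainer=trainer)`. Training on a sample trades coverage for size, so the resulting vocabulary may differ from one trained on the full genome.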
