
tokenize_document() doesn't work if the model does not support fast tokenizer #29

Open
Bhuvanesh-Verma opened this issue Jan 9, 2024 · 0 comments · May be fixed by #34
Labels: bug (Something isn't working)

Comments


Bhuvanesh-Verma commented Jan 9, 2024

Model: cl-tohoku/bert-base-japanese
Package Requirements: ipadic, fugashi (pip install ipadic fugashi)

Code to replicate the issue:

from pie_modules.document.processing import tokenize_document
from pytorch_ie.documents import TextBasedDocument, TokenBasedDocument
from transformers import AutoTokenizer

# cl-tohoku/bert-base-japanese only ships a slow (Python) tokenizer
tokenizer = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
text_document = TextBasedDocument(text="東北大学で")

tokenized_docs = tokenize_document(
    text_document,
    tokenizer=tokenizer,
    result_document_type=TokenBasedDocument,
)

Error Message:

Traceback (most recent call last):
  ....
    tokenized_docs = tokenize_document(
  File "..../python3.9/site-packages/pie_modules/document/processing/tokenization.py", line 314, in tokenize_document
    for batch_encoding in tokenized_text.encodings:
TypeError: 'NoneType' object is not iterable
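
For context, the attribute that fails here is BatchEncoding.encodings: fast (Rust-backed) tokenizers populate it, while slow Python tokenizers leave it as None, which is what the loop at tokenization.py line 314 trips over. A quick sketch of the difference (bert-base-cased is just an arbitrary model that ships a fast tokenizer, chosen for contrast):

from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("cl-tohoku/bert-base-japanese")
print(slow.is_fast)                  # False: Python-only tokenizer
print(slow("東北大学で").encodings)  # None -> "'NoneType' object is not iterable"

fast = AutoTokenizer.from_pretrained("bert-base-cased")
print(fast.is_fast)                  # True: Rust-backed tokenizer
print(fast("Tohoku University").encodings)  # list of Encoding objects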

Update: for now, working around this using huggingface/transformers#12381.
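
Until the linked fix lands, callers can also fail fast with a clearer message. A minimal caller-side sketch (require_fast_tokenizer is a hypothetical helper, not part of pie_modules):

from transformers import PreTrainedTokenizerBase

def require_fast_tokenizer(tokenizer: PreTrainedTokenizerBase) -> None:
    # Hypothetical guard: tokenize_document() iterates BatchEncoding.encodings,
    # which only fast (Rust) tokenizers populate, so raise a readable error
    # instead of the opaque TypeError above.
    if not tokenizer.is_fast:
        raise TypeError(
            f"{type(tokenizer).__name__} is a slow (Python-only) tokenizer; "
            "tokenize_document() needs a fast tokenizer."
        )

require_fast_tokenizer(tokenizer)  # call this before tokenize_document(...)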

ArneBinder added the bug label on Jan 9, 2024
ArneBinder linked pull request #34 on Jan 11, 2024 (will close this issue)
ArneBinder changed the title from "tokenize_document() doesn't work if the model do not support fast tokenizer" to "tokenize_document() doesn't work if the model does not support fast tokenizer" on Apr 24, 2024