
Extractors that generate transformers.tokenization_utils_base.BatchEncoding will cause an error before training #143

Open
ali-abz opened this issue Mar 29, 2021 · 1 comment

Comments

@ali-abz
Contributor

ali-abz commented Mar 29, 2021

Hi there.
I had an Extractor that was more or less a copy of passagebert and textbert, and I thought that instead of unpacking the tokenizer's result and embedding the pieces into a new dictionary like:

{
'positive_ids': tensor([1,2,3,...]),
'positive_mask': tensor([1,1,1,...]),
'positive_segments': tensor([1,1,1,...]),
}

it would be much better to pass the tokenizer's result along without any unpacking or reshaping. The BERT tokenizer yields a transformers.tokenization_utils_base.BatchEncoding object, which is a dictionary-like structure and can be passed to the model as bert_model(**tokens), as you already know.
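(For concreteness, a minimal sketch of what I mean; the model name and variable names here are just examples, not what my Extractor actually uses:)

from transformers import AutoTokenizer, AutoModel

# illustrative only: any BERT-style checkpoint would show the same behavior
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

tokens = tokenizer("This is a test sentence", return_tensors="pt")
print(type(tokens))        # <class 'transformers.tokenization_utils_base.BatchEncoding'>
outputs = model(**tokens)  # a BatchEncoding unpacks like a dict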
I assumed I could just pass this object type along and the code would run with no problem, something like this:

{
'positive_ids_and_mask': self.my_tokenizer('This is a test sentence'),
}

But that was not the case. In the pytorch trainer, line 93 raises an error:

batch = {k: v.to(self.device) if not isinstance(v, list) else v for k, v in batch.items()}

AttributeError: 'dict' object has no attribute 'to'

Here, v has become a plain dict rather than a transformers.tokenization_utils_base.BatchEncoding, so it has no to attribute.
I investigated a little and I'm pretty sure the problem is caused by this line:

train_dataloader = torch.utils.data.DataLoader(

PyTorch's DataLoader will accept a transformers.tokenization_utils_base.BatchEncoding but will yield a plain dictionary. Here is a demonstration:

>>> data = transformers.tokenization_utils_base.BatchEncoding({"test": [1,2,3]})
>>> type(data)
transformers.tokenization_utils_base.BatchEncoding
>>> for x in torch.utils.data.DataLoader([data]):
...     print(x)
...     print(type(x))
{'test': [tensor([1]), tensor([2]), tensor([3])]}
<class 'dict'>

I manually changed the pytorch trainer code so it converts the dict back into a transformers.tokenization_utils_base.BatchEncoding, roughly along the lines of the sketch below, but this is just a fix for my own task and would cause problems for other non-BERT models.
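(Rough sketch of that workaround; the helper name is mine, and it assumes the collated values inside the nested dict are tensors, which is what BatchEncoding.to() expects:)

from transformers.tokenization_utils_base import BatchEncoding

def move_batch_to_device(batch, device):
    # re-wrap plain dicts produced by the DataLoader's collation so .to() exists again
    moved = {}
    for k, v in batch.items():
        if isinstance(v, dict):
            moved[k] = BatchEncoding(v).to(device)
        elif isinstance(v, list):
            moved[k] = v
        else:
            moved[k] = v.to(device)
    return moved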

@andrewyates
Member

Thanks for pointing this out. I don't remember why we avoid using the dict from hgf's tokenizer class directly, but this is something we should look into in the future when upgrading the version of transformers. It may not be necessary to call to on the tensors directly if this is happening inside the DataLoader.
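One way that could look (a hypothetical helper, not something in the codebase): recurse over whatever structure the collated batch has and only call to on actual tensors, so the trainer doesn't care whether it gets a BatchEncoding or a plain dict.

import torch

def to_device(obj, device):
    # hypothetical: move tensors to the device wherever they sit in the batch,
    # leaving the dict/list structure and non-tensor values untouched
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, dict):
        return {k: to_device(v, device) for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_device(v, device) for v in obj]
    return obj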
