Hi there.
I had an Extractor that was more or less a copy of passagebert and textbert, and I thought that instead of decompressing the tokenizer's result and embedding the values into a new dictionary, it would be much better to pass the tokenizer's result along without any decompressing or reshaping. The BERT tokenizer yields a transformers.tokenization_utils_base.BatchEncoding object, which is a dictionary-like structure and can be passed to the model as bert_model(**tokens), as you already know.
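For reference, this is the standard usage I'm relying on (a minimal sketch; the checkpoint name is just for illustration):

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# The tokenizer returns a BatchEncoding, which unpacks like a dict
tokens = tokenizer("This is a test sentence", return_tensors="pt")
print(type(tokens))  # <class 'transformers.tokenization_utils_base.BatchEncoding'>
outputs = model(**tokens)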
I assumed I could simply return this object and the code would run without problems, something like this:

{
    'positive_ids_and_mask': self.my_tokenizer('This is a test sentence'),
}
But that was not the case. In the PyTorch trainer (capreolus/capreolus/trainer/pytorch.py, line 93 at 0121f6e), an error is raised:

AttributeError: 'dict' object has no attribute 'to'
v here has become a plain dict and is no longer a transformers.tokenization_utils_base.BatchEncoding, so there is no to attribute.
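The failing pattern is presumably along these lines (a hypothetical reconstruction of the loop around line 93, not the verbatim trainer code):

# Hypothetical reconstruction, not the verbatim trainer code:
for k, v in batch.items():
    batch[k] = v.to(device)  # AttributeError: 'dict' object has no attribute 'to'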
I investigated a little and I'm pretty sure the problem is caused by this line (capreolus/capreolus/trainer/pytorch.py, line 223 at 0121f6e): PyTorch's DataLoader will accept a transformers.tokenization_utils_base.BatchEncoding but will yield a plain dictionary. Here is a demonstration:
>>> import torch
>>> import transformers
>>> data = transformers.tokenization_utils_base.BatchEncoding({"test": [1, 2, 3]})
>>> type(data)
<class 'transformers.tokenization_utils_base.BatchEncoding'>
>>> for x in torch.utils.data.DataLoader([data]):
...     print(x)
...     print(type(x))
...
{'test': [tensor([1]), tensor([2]), tensor([3])]}
<class 'dict'>
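As far as I can tell, this comes from PyTorch's default_collate, which matches any Mapping (and BatchEncoding is one) and rebuilds the batch as a plain dict. Paraphrasing the relevant branch (not the verbatim torch source):

import collections.abc

def default_collate_sketch(batch):
    elem = batch[0]
    if isinstance(elem, collections.abc.Mapping):
        # The container type is not preserved: a plain dict comes out,
        # regardless of whether a dict or a BatchEncoding went in
        return {key: default_collate_sketch([d[key] for d in batch]) for key in elem}
    return batch  # other branches (tensor stacking, sequences, ...) omitted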
I manually changed the PyTorch trainer code so that it converts the dict back to a transformers.tokenization_utils_base.BatchEncoding, but that is just a workaround for my task and would cause problems for other non-BERT models.
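Concretely, my change amounts to something like this (a sketch with hypothetical variable names, not the exact patch):

from transformers.tokenization_utils_base import BatchEncoding

for k, v in batch.items():
    if isinstance(v, dict):
        v = BatchEncoding(v)  # restore the .to() method lost during collation
    batch[k] = v.to(device)

A custom collate_fn on the DataLoader that rewraps the collated dict would probably be a cleaner place for this, but it has the same BERT-specific flavor.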
Thanks for pointing this out. I don't remember why we avoid using the dict from huggingface's tokenizer class directly, but this is something we should look into when upgrading the version of transformers. It may not be necessary to call .to() on the tensors directly if this is happening inside the DataLoader.
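One model-agnostic direction might be to move batches to the device recursively instead of assuming every value has a .to() method (just a sketch, not a committed fix):

def to_device(obj, device):
    # Tensors and BatchEncodings both expose .to()
    if hasattr(obj, "to"):
        return obj.to(device)
    # Plain dicts produced by the DataLoader's default collation
    if isinstance(obj, dict):
        return {k: to_device(v, device) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_device(v, device) for v in obj)
    return obj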