
Extractors that generate transformers.tokenization_utils_base.BatchEncoding will cause an error before training #143

Open
ali-abz opened this issue Mar 29, 2021 · 1 comment

Comments

@ali-abz
Contributor

ali-abz commented Mar 29, 2021

Hi there.
I had an Extractor that was more or less a copy of passagebert and textbert, and I thought that instead of unpacking the tokenizer's result and embedding the pieces into a new dictionary like:

{
'positive_ids': tensor([1,2,3,...]),
'positive_mask': tensor([1,1,1,...]),
'positive_segments': tensor([1,1,1,...]),
}

it would be much better to pass the tokenizer's result along without any unpacking or reshaping. The BERT tokenizer yields a transformers.tokenization_utils_base.BatchEncoding object, which is a dictionary-like structure and can be passed to the model as bert_model(**tokens), as you already know.
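(For concreteness, a minimal sketch of what I mean; the model name and variable names here are just examples, not what my Extractor actually uses:)

from transformers import AutoTokenizer, AutoModel

# illustrative only: any BERT-style checkpoint would show the same behavior
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

tokens = tokenizer("This is a test sentence", return_tensors="pt")
print(type(tokens))        # <class 'transformers.tokenization_utils_base.BatchEncoding'>
outputs = model(**tokens)  # a BatchEncoding unpacks like a dict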
I assumed I could just pass this object type along and the code would run with no problem, something like this:

{
'positive_ids_and_mask': self.my_tokenizer('This is a test sentence'),
}

But that was not the case. In the pytorch trainer, line 93 raises an error:

batch = {k: v.to(self.device) if not isinstance(v, list) else v for k, v in batch.items()}

AttributeError: 'dict' object has no attribute 'to'

Here, v has become a plain dict rather than a transformers.tokenization_utils_base.BatchEncoding, so it has no to attribute.
I investigated a little and I'm pretty sure the problem is caused by this line:

train_dataloader = torch.utils.data.DataLoader(

PyTorch's DataLoader will accept a transformers.tokenization_utils_base.BatchEncoding but will yield a plain dictionary. Here is a demonstration:

>>> data = transformers.tokenization_utils_base.BatchEncoding({"test": [1,2,3]})
>>> type(data)
transformers.tokenization_utils_base.BatchEncoding
>>> for x in torch.utils.data.DataLoader([data]):
...     print(x)
...     print(type(x))
{'test': [tensor([1]), tensor([2]), tensor([3])]}
<class 'dict'>

I manually changed the pytorch trainer code so it converts the dict back into a transformers.tokenization_utils_base.BatchEncoding, roughly along the lines of the sketch below, but this is just a fix for my own task and would cause problems for other non-BERT models.
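(Rough sketch of that workaround; the helper name is mine, and it assumes the collated values inside the nested dict are tensors, which is what BatchEncoding.to() expects:)

from transformers.tokenization_utils_base import BatchEncoding

def move_batch_to_device(batch, device):
    # re-wrap plain dicts produced by the DataLoader's collation so .to() exists again
    moved = {}
    for k, v in batch.items():
        if isinstance(v, dict):
            moved[k] = BatchEncoding(v).to(device)
        elif isinstance(v, list):
            moved[k] = v
        else:
            moved[k] = v.to(device)
    return moved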

@andrewyates
Member

Thanks for pointing this out. I don't remember why we avoid using the dict from hgf's tokenizer class directly, but this is something we should look into in the future when upgrading the version of transformers. It may not be necessary to call to on the tensors directly if this is happening inside the DataLoader.
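One way that could look (a hypothetical helper, not something in the codebase): recurse over whatever structure the collated batch has and only call to on actual tensors, so the trainer doesn't care whether it gets a BatchEncoding or a plain dict.

import torch

def to_device(obj, device):
    # hypothetical: move tensors to the device wherever they sit in the batch,
    # leaving the dict/list structure and non-tensor values untouched
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, dict):
        return {k: to_device(v, device) for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_device(v, device) for v in obj]
    return obj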
