Clarification regarding the implementation and training of LaTr #3
Hi @furkanbiten, although I have put together the script for VQA, there are a few questions I wanted to clear up.

Q.1: For the embedding, did you define the embedding layer separately, or did you go with T5's encoder and decoder embedding? (I went with separate embeddings.)

Q.2: Since you described this as a classification task, how did you formulate it? Did you treat each answer sentence as a class, or did you convert each word of the answer into a token and then pad the sequence to an appropriate length?

Thoughts about Q.2: If we go with the second approach, there are a few words at validation time that are not present in the training set, and I could not figure out how to deal with them. Also, if we tokenize each word and pad the answers to a desired length (512 in my case), the number of class labels becomes very large, similar to MLM (Masked Language Modeling) pre-training. For me it came to around 37k classes (I took all the words from the training and validation answers and assigned an id to each of them), and I ran into an out-of-memory error (you can reproduce it in the examples/textvqa part 4 notebook). What I was thinking is to keep some top K words (following the second approach) and map all the remaining words to a single catch-all token, as in the sketch after this comment. I am not sure whether that would work, but it might at least let the model train and let me observe the results.

Thanks,
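For reference, a minimal sketch of the top-K vocabulary idea described above (the token name, the cutoff K, and the answer length are placeholder assumptions, not the approach used in the paper):

```python
from collections import Counter

UNK_TOKEN = "<unk>"  # placeholder name for the catch-all token

def build_answer_vocab(train_answers, k=5000):
    """Keep the K most frequent answer words; everything else maps to UNK_TOKEN."""
    counts = Counter(w for ans in train_answers for w in ans.lower().split())
    vocab = {UNK_TOKEN: 0}
    for word, _ in counts.most_common(k):
        vocab[word] = len(vocab)
    return vocab

def encode_answer(answer, vocab, max_len=12):
    """Convert an answer string into a fixed-length list of class ids."""
    unk_id = vocab[UNK_TOKEN]
    ids = [vocab.get(w, unk_id) for w in answer.lower().split()][:max_len]
    return ids + [unk_id] * (max_len - len(ids))
```

As the reply below points out, this hand-built vocabulary becomes unnecessary once the HuggingFace T5 tokenizer is used.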
Hey,

A.1: We went with T5's word embedding layer.

A.2: First of all, the answer can be a sentence, a word, or simply a number. You simply use the tokenizer from HuggingFace. Since T5 uses a SentencePiece tokenizer, each answer will be tokenized accordingly. So you use one tokenizer or the other, depending on which model weight you fine-tune. With the tokenizer provided by HuggingFace, you don't need to construct a vocabulary or anything.
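A minimal sketch of that approach with the HuggingFace tokenizer (the `t5-base` checkpoint name and the `max_length` value are assumptions; the original comment linked to the specific tokenizers to use):

```python
from transformers import T5Tokenizer

# Checkpoint name is an assumption; use whichever T5 weight you fine-tune.
tokenizer = T5Tokenizer.from_pretrained("t5-base")

answer = "yes"  # an answer can be a sentence, a word, or simply a number
encoded = tokenizer(
    answer,
    padding="max_length",
    truncation=True,
    max_length=32,  # placeholder target length
    return_tensors="pt",
)
labels = encoded.input_ids  # decoder target token ids for T5
```

When computing the seq2seq loss, the padding positions in `labels` are commonly replaced with -100 so that the cross-entropy loss ignores them.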
Thanks for the clarification. I have tried to train the model, but currently I am getting 23% validation accuracy.

However, there are a few differences between your paper's implementation and mine (referring to Pg. 12, Fine-tuning section):

All the other steps, including the warm-up and linear decay (see the scheduler sketch just after this comment), have been taken care of. Can you suggest something that could improve the performance? The configuration for the run is the same in V4 and V6; you can see it here. It has been a great learning experience for me, so thanks to you and the other authors for this paper. Future Step:

Update 1: I did the above steps again and got almost the same validation accuracy.
Update 2: I was able to reach 28% accuracy, and I hope that continuing training will increase it further.
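For completeness, a minimal sketch of a warm-up plus linear decay schedule using the HuggingFace helper (the stand-in model, optimizer, learning rate, and step counts are placeholder values, not the paper's settings):

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in for the actual LaTr/T5 model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder lr

num_warmup_steps = 1_000      # placeholder
num_training_steps = 100_000  # placeholder

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

# In the training loop, step the scheduler after each optimizer update:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```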
Hey @uakarsh, sorry for the late reply, I have been really busy lately. The first thing to realize is that if you use batch size 1, roughly speaking you "need" to train for 25 times more iterations. That said, you have to be careful with batch size 1: in my experience, it is usually harder for the model to converge with smaller batch sizes. I understand the resource limitations, but you can accumulate the gradients over several batches before applying the update. You will need more time to train, but at least it will be easier to make the model converge.
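A minimal sketch of the gradient accumulation idea suggested here (the accumulation factor of 25 mirrors the batch-size comment above; `compute_loss`, the dataloader, and the model are placeholders supplied by the caller):

```python
def train_with_grad_accumulation(model, dataloader, optimizer, compute_loss,
                                 accumulation_steps=25):
    """Accumulate gradients over several small batches before each update,
    emulating a larger effective batch size (25 is a placeholder factor)."""
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        loss = compute_loss(model, batch)
        # Scale the loss so the accumulated gradient matches a large-batch average.
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```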
Not an issue, take your time; I understand that your time is also valuable. Accumulating the gradients is a really great idea (I had not thought about it earlier). Also, I made a demo of LaTr: link. It is amazing, at least, to observe the performance of the model on an unknown dataset. I will update here as soon as I have some new findings. Cheers
Cool! I will check out the demo. Thanks.
Hi @furkanbiten, for the pre-training of LaTr, can you give me an idea of how it was set up? By that I mean: did you take a subset of the IDL dataset and fit the model on that entire subset, or did you follow the usual procedure of splitting the data into train and validation sets and saving the checkpoints where the validation loss is minimum? (I have implemented the pre-training task, but I am not sure how to train the model in such a setup.)
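For reference, a minimal sketch of the second setup described above, saving a checkpoint whenever the validation loss improves (the checkpoint path, loss computation, and data loaders are placeholder assumptions):

```python
import torch

def pretrain_with_validation(model, train_loader, val_loader, optimizer,
                             compute_loss, num_epochs=10,
                             ckpt_path="latr_pretrain_best.pt"):
    """Train/validation split, saving a checkpoint at the minimum validation loss."""
    best_val_loss = float("inf")
    for epoch in range(num_epochs):
        model.train()
        for batch in train_loader:
            loss = compute_loss(model, batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Evaluate on the held-out split and keep only the best checkpoint.
        model.eval()
        with torch.no_grad():
            val_loss = sum(compute_loss(model, b).item() for b in val_loader) / len(val_loader)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), ckpt_path)
```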
This thread contains a discussion of the implementation of LaTr with one of the authors of the paper.
The earlier discussion with the first author can be found here.