Clarification regarding the implementation and training of LaTr #3
Hi @furkanbiten, although I have put together the script for VQA, there are a few questions I wanted to clear up.

Q.1: For the embedding, did you define the embedding layer separately, or did you go with T5's encoder and decoder embedding? (I went with separate embeddings.)

Q.2: Since you described this as a classification task, how did you formulate it? Did you treat each answer sentence as a class, or did you convert each word of the answer into a token and then pad the sequence to an appropriate length?

Thoughts about Q.2: If we go with the second approach, there are a few words at validation time that are not present in the training set, and I could not figure out how to deal with them. Also, if we tokenize each word and pad the answers to a desired length (512 in my case), the number of class labels becomes very large, similar to MLM (Masked Language Modeling) pre-training. For me it came to around 37k classes (I took all the words from the training and validation answers and assigned an id to each of them), and I ran into an out-of-memory error (you can reproduce it in the examples/textvqa part 4 notebook). What I was thinking is to keep some top K words (following the second approach) and map all the remaining words to a single catch-all token, as in the sketch after this comment. I am not sure whether that would work, but it might at least let the model train and let me observe the results.

Thanks,
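For reference, a minimal sketch of the top-K vocabulary idea described above (the token name, the cutoff K, and the answer length are placeholder assumptions, not the approach used in the paper):

```python
from collections import Counter

UNK_TOKEN = "<unk>"  # placeholder name for the catch-all token

def build_answer_vocab(train_answers, k=5000):
    """Keep the K most frequent answer words; everything else maps to UNK_TOKEN."""
    counts = Counter(w for ans in train_answers for w in ans.lower().split())
    vocab = {UNK_TOKEN: 0}
    for word, _ in counts.most_common(k):
        vocab[word] = len(vocab)
    return vocab

def encode_answer(answer, vocab, max_len=12):
    """Convert an answer string into a fixed-length list of class ids."""
    unk_id = vocab[UNK_TOKEN]
    ids = [vocab.get(w, unk_id) for w in answer.lower().split()][:max_len]
    return ids + [unk_id] * (max_len - len(ids))
```

As the reply below points out, this hand-built vocabulary becomes unnecessary once the HuggingFace T5 tokenizer is used.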
Hey,

A.1: We went with T5's word embedding layer.

A.2: First of all, the answer can be a sentence, a word, or simply a number. You simply use the tokenizer from HuggingFace. Since T5 uses a SentencePiece tokenizer, each answer will be tokenized accordingly. So you use one tokenizer or the other, depending on which model weight you fine-tune. With the tokenizer provided by HuggingFace, you don't need to construct a vocabulary or anything.
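A minimal sketch of that approach with the HuggingFace tokenizer (the `t5-base` checkpoint name and the `max_length` value are assumptions; the original comment linked to the specific tokenizers to use):

```python
from transformers import T5Tokenizer

# Checkpoint name is an assumption; use whichever T5 weight you fine-tune.
tokenizer = T5Tokenizer.from_pretrained("t5-base")

answer = "yes"  # an answer can be a sentence, a word, or simply a number
encoded = tokenizer(
    answer,
    padding="max_length",
    truncation=True,
    max_length=32,  # placeholder target length
    return_tensors="pt",
)
labels = encoded.input_ids  # decoder target token ids for T5
```

When computing the seq2seq loss, the padding positions in `labels` are commonly replaced with -100 so that the cross-entropy loss ignores them.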
Thanks for the clarification. I have tried to train the model, but currently I am getting 23% validation accuracy.

However, there are a few differences between your paper's implementation and mine (referring to Pg. 12, Fine-tuning section):

All the other steps, including the warm-up and linear decay (see the scheduler sketch just after this comment), have been taken care of. Can you suggest something that could improve the performance? The configuration for the run is the same in V4 and V6; you can see it here. It has been a great learning experience for me, so thanks to you and the other authors for this paper. Future Step:

Update 1: I did the above steps again and got almost the same validation accuracy.
Update 2: I was able to reach 28% accuracy, and I hope that continuing training will increase it further.
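For completeness, a minimal sketch of a warm-up plus linear decay schedule using the HuggingFace helper (the stand-in model, optimizer, learning rate, and step counts are placeholder values, not the paper's settings):

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in for the actual LaTr/T5 model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder lr

num_warmup_steps = 1_000      # placeholder
num_training_steps = 100_000  # placeholder

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=num_warmup_steps,
    num_training_steps=num_training_steps,
)

# In the training loop, step the scheduler after each optimizer update:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```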
Hey @uakarsh, sorry for the late reply, I have been really busy lately. The first thing to realize is that if you use batch size 1, roughly speaking you "need" to train for 25 times more iterations. That said, you have to be careful with batch size 1: in my experience, it is usually harder for the model to converge with smaller batch sizes. I understand the resource limitations, but you can accumulate the gradients over several batches before applying the update. You will need more time to train, but at least it will be easier to make the model converge.
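A minimal sketch of the gradient accumulation idea suggested here (the accumulation factor of 25 mirrors the batch-size comment above; `compute_loss`, the dataloader, and the model are placeholders supplied by the caller):

```python
def train_with_grad_accumulation(model, dataloader, optimizer, compute_loss,
                                 accumulation_steps=25):
    """Accumulate gradients over several small batches before each update,
    emulating a larger effective batch size (25 is a placeholder factor)."""
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        loss = compute_loss(model, batch)
        # Scale the loss so the accumulated gradient matches a large-batch average.
        (loss / accumulation_steps).backward()
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```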
Not an issue, take your time; I understand that your time is also valuable. Accumulating the gradients is a really great idea (I had not thought about it earlier). Also, I made a demo of LaTr: link. It is amazing, at least, to observe the performance of the model on an unknown dataset. I will update here as soon as I have some new findings. Cheers
Cool! I will check out the demo. Thanks.
Hi @furkanbiten, for the pre-training of LaTr, can you give me an idea of how it was set up? By that I mean: did you take a subset of the IDL dataset and fit the model on that entire subset, or did you follow the usual procedure of splitting the data into train and validation sets and saving the checkpoints where the validation loss is minimum? (I have implemented the pre-training task, but I am not sure how to train the model in such a setup.)
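For reference, a minimal sketch of the second setup described above, saving a checkpoint whenever the validation loss improves (the checkpoint path, loss computation, and data loaders are placeholder assumptions):

```python
import torch

def pretrain_with_validation(model, train_loader, val_loader, optimizer,
                             compute_loss, num_epochs=10,
                             ckpt_path="latr_pretrain_best.pt"):
    """Train/validation split, saving a checkpoint at the minimum validation loss."""
    best_val_loss = float("inf")
    for epoch in range(num_epochs):
        model.train()
        for batch in train_loader:
            loss = compute_loss(model, batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Evaluate on the held-out split and keep only the best checkpoint.
        model.eval()
        with torch.no_grad():
            val_loss = sum(compute_loss(model, b).item() for b in val_loader) / len(val_loader)
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), ckpt_path)
```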
This thread contains a discussion of the implementation of LaTr with one of the authors of the paper.
The earlier discussion with the first author can be found here.