The results without pre-training #2
Actually, I want to try to reproduce the results of the paper, but the problems are:
However, since I have added the pre-training portion of the script, I will soon add a script for training on any custom dataset, which I hope will help the community as well. And I will surely update the repo and add weights if I find a way to pre-train and fine-tune the model to achieve the results mentioned in the paper. Regards,
Thanks for your wonderful work. Looking forward to the training scripts for downstream TextVQA datasets!
Hey @uakarsh, I am the first author of LaTr. Thank you for the implementation, since I couldn't publish the code myself (well, it is Amazon's code). Here are some things I can offer, though:
Sorry I can't do much more, since it is mostly out of my hands. PS: I will try to get the Amazon-OCR results on TextVQA and ST-VQA sometime soon, hopefully.
Hi @furkanbiten, thanks for your reply and for appreciating the work. Looking forward to having a great conversation with you about these doubts. Regarding your points:
I have two open questions, so I thought I would ask:
Regards,
Hi again. If you would like, we can move the discussion to a pinned issue so that it gets more visibility. Your call. Here are the answers to the questions:
I hope this makes it a bit clearer.
Thank you for your detailed answer; a lot of things became clear, and yes, you are right. I will open a new issue and link it to this issue's discussion. I am also working right now on a step-by-step walkthrough of training LaTr on TextVQA, and hopefully, as we proceed, a lot more will become clear along the way! Regards,
Glad it helped! You can contact me at any time to ask about anything that is unclear or needs clarification.
Thank you for your responses and contributions. @uakarsh
Hey @furkanbiten, thank you for your excellent work and detailed suggestions. May I ask when you will release the Amazon-OCR results for the TextVQA and ST-VQA datasets? I would like to give them a try.
Hey @Gyann-z, I am actually trying to write my thesis, and in the meantime I am trying to run Amazon-OCR on TextVQA and ST-VQA at my university, since I couldn't get the data out of Amazon. I will also ask @uakarsh to link to the repo so that more people know about it.
Thanks! Looking forward to getting Amazon-OCR results soon. |
I have some good news. I finally had the time to run Amazon-OCR on ST-VQA and TextVQA. I have created a repo where you can find a small code snippet and the raw JSON files returned from the Amazon-OCR pipeline. Here is the repo: https://github.com/furkanbiten/stvqa_amazon_ocr Let me know if you have any problems.
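For anyone wanting to consume those raw JSON files, here is a minimal sketch of pulling out word tokens and bounding boxes. The exact schema in the stvqa_amazon_ocr repo is not shown in this thread, so this assumes the standard Amazon Textract `DetectDocumentText` response layout (`Blocks` / `BlockType` / `Geometry.BoundingBox`); adjust the keys if the repo's dumps differ.

```python
import json  # in practice: response = json.load(open("some_image.json"))

def extract_words(response):
    """Return (word, (left, top, width, height)) pairs from a
    Textract-style OCR response. Coordinates are normalized to [0, 1],
    as in Textract's BoundingBox convention (an assumption here)."""
    words = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "WORD":
            box = block["Geometry"]["BoundingBox"]
            words.append((block["Text"],
                          (box["Left"], box["Top"],
                           box["Width"], box["Height"])))
    return words

# Tiny hand-made example in the assumed schema:
sample = {
    "Blocks": [
        {"BlockType": "LINE", "Text": "STOP"},
        {"BlockType": "WORD", "Text": "STOP",
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.2,
                                      "Width": 0.3, "Height": 0.1}}},
    ]
}
print(extract_words(sample))  # [('STOP', (0.1, 0.2, 0.3, 0.1))]
```

The (word, box) pairs are exactly what layout-aware models like LaTr expect as OCR input, so a loader along these lines is a natural first step before tokenization.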
Thank you very much! @furkanbiten That's really good news for me. |
Thanks for your implementation. Have you tried TextVQA training without the layout-aware pre-training? Can you reproduce the results of the paper? E.g., LaTr-base achieves 44.06 on Rosetta-en and 52.29 on Amazon-OCR.
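For context on comparing against those numbers: TextVQA scores like 44.06 are soft VQA accuracies computed against the 10 human answers per question. A minimal sketch of the commonly used simplification of that metric (the official evaluator also applies answer normalization such as lowercasing and article removal, which is omitted here):

```python
def vqa_accuracy(prediction, human_answers):
    # Soft VQA accuracy: a prediction scores min(#matching answers / 3, 1),
    # so agreeing with at least 3 of the 10 annotators counts as full credit.
    # (The official metric averages over leave-one-annotator-out subsets;
    # this min(n/3, 1) form is the widely used simplification.)
    matches = sum(a == prediction for a in human_answers)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("yes", ["yes"] * 10))                             # 1.0
print(round(vqa_accuracy("red", ["red", "red"] + ["blue"] * 8), 3))  # 0.667
```

A dataset-level score such as 52.29 is then just this per-question accuracy averaged over the validation or test set.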