
Question on the quality of table annotations #19

Open
julianyulu opened this issue Dec 18, 2019 · 1 comment
@julianyulu

Hi,

The Problem

I'd like to thank you for releasing this nice dataset. However, I found that the annotation quality is actually not very high, mainly due to two issues:

  1. Missing labels: no annotation is found for an existing table.
  2. Inaccurate annotations: some bounding boxes do not cover the whole table region.

Issue 1 was already mentioned in #9, where the author answered:

some error may cause a little table unlabeled

However, I plotted the first 100 image IDs and their annotations in /Detection_data/Word and found 21 images out of 100 with missing annotations (from 1 up to 3 tables missing per image). Unless I was extremely lucky to catch these problematic annotations in the first 100 plots, this issue affects more than 'a little table'.

To be specific, here are the imgIds of those 21 images:

3, 9, 10, 27, 32, 33, 39, 47, 51, 56, 57, 58, 59, 60, 61, 62, 73, 76, 77, 87, 95

As for issue 2, I found 3 images (out of the same 100 tested images) with inaccurate bounding boxes:

18, 62, 83

I understand from the paper that these annotations were generated by parsing the PDF/Word documents, and that the parsing code could not catch all the tables. I post this here only to provide researchers with some information they might care about.

Possible Fix

Issue 1 is actually not hard to fix. I have a table detection model (trained on other datasets) with decent performance. I'd like to run it once over all the data provided here and hopefully spot a large number of the missing annotations, then fix those manually. I'd be happy to share and discuss further; a minimal sketch of the flagging step is below.
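
For what it's worth, here is a minimal sketch of that flagging pass, assuming a hypothetical `detect_tables(image_id)` callable standing in for the trained model and COCO-style `[x, y, w, h]` boxes: any predicted box that overlaps no ground-truth annotation becomes a candidate missing table.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x, y, w, h] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def flag_missing(coco, image_id, detect_tables, iou_thresh=0.5):
    """Return predicted boxes that match no ground-truth annotation.

    `coco` is a pycocotools COCO instance; `detect_tables` is a
    hypothetical stand-in for whatever detection model is used.
    """
    anns = coco.loadAnns(coco.getAnnIds(imgIds=image_id))
    gt_boxes = [ann['bbox'] for ann in anns]
    preds = detect_tables(image_id)  # hypothetical detector call
    return [p for p in preds
            if all(iou(p, gt) < iou_thresh for gt in gt_boxes)]
```

Any image where `flag_missing` returns a non-empty list would then go to the manual-fix queue.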

FYI

I load the data with pycocotools and get the annotations for each image using:

```python
img_ann = coco.loadAnns(coco.getAnnIds(imgIds=image_id))
```

and plot the annotations on a matplotlib figure using:

```python
coco.showAnns(img_ann)
```

The missing/incorrect annotations were then spotted by eye.
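
Putting the above together, a minimal self-contained version of the check might look like the following; the annotation file name and image directory are placeholders for wherever the dataset is stored locally.

```python
# Minimal sketch of the visual check; paths below are placeholders.
import matplotlib.pyplot as plt
from PIL import Image
from pycocotools.coco import COCO

coco = COCO('/Detection_data/Word/annotations.json')  # assumed file name

for image_id in coco.getImgIds()[:100]:  # first 100 images
    img_info = coco.loadImgs(image_id)[0]
    img = Image.open('/Detection_data/Word/' + img_info['file_name'])
    img_ann = coco.loadAnns(coco.getAnnIds(imgIds=image_id))

    plt.imshow(img)
    # draw_bbox=True is needed for bbox-only annotations in recent pycocotools
    coco.showAnns(img_ann, draw_bbox=True)
    plt.title(f'imgId = {image_id}')
    plt.show()  # missing/shifted boxes are then spotted by eye
```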

I'd be happy to discuss further and provide the testing notebook (.ipynb) if wanted.

Best,
Julian

@charmichokshi

Hi @julianyulu

Did you manually check all the samples after running a table detection model to flag the 'possible wrong annotations'? Or did you also use a loss value or some other metric to detect them automatically?
