
Question on the quality of table annotations #19

Open
julianyulu opened this issue Dec 18, 2019 · 1 comment
@julianyulu

Hi,

The Problem

I'd like to thank you for releasing this nice dataset. However, I found that the annotation quality is actually not very high, mainly due to two issues:

  1. Missing labels: no annotation is found for an existing table.
  2. Inaccurate annotations: some bounding boxes do not cover the whole table region.

Issue 1 was already mentioned in #9, where the author answered:

some error may cause a little table unlabeled

However, I plotted the first 100 image IDs and their annotations in /Detection_data/Word and found 21 images out of 100 with missing annotations (from 1 up to 3 tables missing per image). Unless I was extremely lucky to catch these problematic annotations in the first 100 plots, this issue affects more than 'a little table'.

To be specific, here are the imgIds of those 21 images:

3, 9, 10, 27, 32, 33, 39, 47, 51, 56, 57, 58, 59, 60, 61, 62, 73, 76, 77, 87, 95

As for issue 2, I found 3 images (out of the same 100 tested images) with inaccurate bounding boxes:

18, 62, 83

I understand from the paper that these annotations were generated by parsing the PDF/Word documents, and that the parsing code could not catch all the tables. I post this here only to provide researchers with some information they might care about.

Possible Fix

Issue 1 is actually not hard to fix. I have a table detection model (trained on other datasets) with decent performance. I'd like to run it once over all the data provided here and hopefully spot a large number of the missing annotations, then fix those manually. I'd be happy to share and discuss further; a minimal sketch of the flagging step is below.
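
For what it's worth, here is a minimal sketch of that flagging pass, assuming a hypothetical `detect_tables(image_id)` callable standing in for the trained model and COCO-style `[x, y, w, h]` boxes: any predicted box that overlaps no ground-truth annotation becomes a candidate missing table.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two [x, y, w, h] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def flag_missing(coco, image_id, detect_tables, iou_thresh=0.5):
    """Return predicted boxes that match no ground-truth annotation.

    `coco` is a pycocotools COCO instance; `detect_tables` is a
    hypothetical stand-in for whatever detection model is used.
    """
    anns = coco.loadAnns(coco.getAnnIds(imgIds=image_id))
    gt_boxes = [ann['bbox'] for ann in anns]
    preds = detect_tables(image_id)  # hypothetical detector call
    return [p for p in preds
            if all(iou(p, gt) < iou_thresh for gt in gt_boxes)]
```

Any image where `flag_missing` returns a non-empty list would then go to the manual-fix queue.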

FYI

I load the data with pycocotools and get the annotations for each image using:

```python
img_ann = coco.loadAnns(coco.getAnnIds(imgIds=image_id))
```

and plot the annotations on a matplotlib figure using:

```python
coco.showAnns(img_ann)
```

The missing/incorrect annotations were then spotted by eye.
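
Putting the above together, a minimal self-contained version of the check might look like the following; the annotation file name and image directory are placeholders for wherever the dataset is stored locally.

```python
# Minimal sketch of the visual check; paths below are placeholders.
import matplotlib.pyplot as plt
from PIL import Image
from pycocotools.coco import COCO

coco = COCO('/Detection_data/Word/annotations.json')  # assumed file name

for image_id in coco.getImgIds()[:100]:  # first 100 images
    img_info = coco.loadImgs(image_id)[0]
    img = Image.open('/Detection_data/Word/' + img_info['file_name'])
    img_ann = coco.loadAnns(coco.getAnnIds(imgIds=image_id))

    plt.imshow(img)
    # draw_bbox=True is needed for bbox-only annotations in recent pycocotools
    coco.showAnns(img_ann, draw_bbox=True)
    plt.title(f'imgId = {image_id}')
    plt.show()  # missing/shifted boxes are then spotted by eye
```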

I'd be happy to discuss further and provide the testing notebook (.ipynb) if wanted.

Best,
Julian

@charmichokshi

Hi @julianyulu

Did you manually check all the samples after running a table detection model to flag the 'possible wrong annotations'? Or did you also use a loss value or some other metric to detect them automatically?
