Hi,
The Problem
I'd like to thank you for releasing this nice dataset. However, I found that the annotation quality is not very high, mainly due to two issues:
Missing labels: no annotation found for an existing table
Inaccurate annotations: some bounding boxes do not cover the whole table region
Issue 1 has already been mentioned in #9, where the author answered:
"some error may cause a little table unlabeled"
However, I plotted the first 100 image IDs and their annotations in /Detection_data/Word and found that 21 out of the 100 images had missing annotations (between 1 and 3 tables missing per image). Unless I was extremely lucky to catch these problematic annotations within the first 100 plots, this issue affects far more than 'a little table'.
To be specific, here are the imgIds of those 21 images:
3, 9, 10, 27, 32, 33, 39, 47, 51, 56, 57, 58, 59, 60, 61, 62, 73, 76, 77, 87, 95
As for Issue 2, I found 3 images (out of the 100 tested) with incorrect annotations:
18, 62, 83
I understand from the paper that these annotations were generated by parsing the PDF/Word documents, and that the parsing code could not catch every table. I'm posting this here only to give researchers some information they might care about.
Possible Fix
Issue 1 is actually not hard to fix. I have trained a table detection model (on other datasets) with decent performance, and I'd like to run it in one pass over all the data provided here, hopefully spotting a large number of the missing annotations, which could then be fixed manually. A sketch of this pass is below. I'd be happy to share and discuss more.
FYI
I load the data with pycocotools and get the annotations for each image using:
img_ann = coco.loadAnns(coco.getAnnIds(imgIds=image_id))
and plot the annotations on a matplotlib figure using
coco.showAnns(img_ann)
The missing/incorrect annotations were then spotted by eye.
I'd be happy to discuss more and provide the testing .ipynb if wanted.
Best,
Julian
Do you manually check all the samples after running a table detection model to flag the 'possible wrong annotations'? Or do you also use the loss or some other metric to detect them automatically?