Modification/filtering of the bounding boxes before OCR #1445

Closed
dchaplinsky opened this issue Jan 31, 2024 · 14 comments · Fixed by #1449

@dchaplinsky

🚀 The feature

It would be great to have an opportunity to intervene in the middle of the pipeline and adjust/remove some of the bboxes found before running OCR on them. For example, small bboxes could be padded a bit or removed. Or, if we expect bboxes only in particular parts of the frame, we could filter out the rest.

I've dug into the code, and it seems that all the magic happens in the forward of OCRPredictor and things are tightly coupled, but one might add an optional callback to send the bboxes for additional transformation/filtering midway.

Motivation, pitch

We experienced some issues when OCRing text from video. While the position of the text is mostly static on the screen, the text detection model sometimes fails to detect proper boundaries on some frames (while doing perfectly fine on others). Closer inspection showed that those poor results happened because the identified bboxes were slightly smaller than needed, which as a result cut one digit from the recognised text. It could be fixed by padding those incorrect bboxes, especially given that I know the correct bbox from the previous frame of the video.

Alternatives

I can of course subclass OCRPredictor and replace the forward method completely, but then I'll also need to replace all the mechanics that happen in zoo.py.

Additional context

No response

dchaplinsky added the type: enhancement label on Jan 31, 2024
@felixdittrich92
Contributor

Hi @dchaplinsky 👋

Have you already tried lowering the binarization threshold?

predictor.det_predictor.model.postprocessor.bin_thresh = 0.1  # default is 0.3
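
For completeness, here is what that tweak could look like in a full, runnable snippet - a minimal sketch, with the file path as a placeholder:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

predictor = ocr_predictor(pretrained=True)
# Lower the binarization threshold so fainter text regions
# survive the detection postprocessing (default is 0.3)
predictor.det_predictor.model.postprocessor.bin_thresh = 0.1

doc = DocumentFile.from_images("path/to/frame.png")  # placeholder path
result = predictor(doc)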

I think it will be hard to implement something in the middle of the pipeline, because normally you don't know the coordinates beforehand.
Could you explain in a bit more detail how you think this would be a useful feature, maybe with a short description of how you would like to use it from a user's point of view?

What we could do is also open the postprocessor's unclip_ratio and box_thresh to users for more control.

@dchaplinsky
Author

Well, one can pass an optional callback which takes the list of bboxes and returns a modified list of bboxes.
Use cases:

  1. I loosely know where my bounding boxes should be. I'll delete bounding boxes that are not located there (for example, in the middle of the frame on video).
  2. I more or less know the size of the bbox I want to OCR. I might delete smaller or bigger bboxes, saving some compute on OCR (use cases 2 and 3 are sketched after this list).
  3. I'd like to OCR only the biggest bbox, so I can discard all the rest.
  4. I know that the detection model might sometimes produce a smaller bbox than I need; I can pad (or shrink) it as I wish before passing it to the OCR.
  5. From previous frames, pages or other documents I know where the bbox has to be on the screen. I might even add it manually if the detection model missed it (for example, if I'm using a very lean detection model for performance reasons, or I have a noisy video with OSD text).

Does it make sense?
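
For illustration, a rough sketch of use cases 2 and 3 as such a callback - the coordinate format is an assumption (relative (xmin, ymin, xmax, ymax) tuples), not an existing interface:

def filter_bboxes(bboxes: list[tuple[float, float, float, float]]) -> list[tuple[float, float, float, float]]:
    # Hypothetical callback: assumes relative (xmin, ymin, xmax, ymax) boxes
    def area(box: tuple[float, float, float, float]) -> float:
        xmin, ymin, xmax, ymax = box
        return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)

    # Use case 2: drop boxes too small to hold the expected text
    candidates = [box for box in bboxes if area(box) > 1e-4]
    # Use case 3: keep only the biggest remaining box
    return [max(candidates, key=area)] if candidates else []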

@felixdittrich92
Contributor

Hi @dchaplinsky
Did you have something like #1449 in mind?
@odulcy-mindee What do you think? (Maybe also a good way to make #988 possible -> the callback could be applied up to the logit vector (recognition models), where you could manipulate it on your own - this would indeed be for advanced users)
Only a short and dirty prototype yet 😅
CC @frgfm

@dchaplinsky
Author

@felixdittrich92, yes!

@felixdittrich92
Contributor

However, we need to be careful not to add too much complexity to the design, so I'd like to clear that up with the other two before we move on :)

@frgfm
Collaborator

frgfm commented Feb 2, 2024

Hey there @dchaplinsky 👋

Thanks for the suggestion! In my experience, the best "interface" decisions we've made for docTR were the ones where we considered all the things that should (vs. could) be customized, and in which form factor.

There are multiple ways of doing this, some that require a bit more work on internals, others that don't. So since you mentioned the need and motivation, could you try to come up with a short snippet of what you'd like to use?

e.g.

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Configure
model = ocr_predictor(pretrained=True)
model.det_predictor.update_filter(....)
# model.det_predictor.add_hook(lambda boxes: ...)
# etc
# Inference
doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
result = model(doc)

That will make it easier to come up with an interface suggestion and see if it could fit in the roadmap :)

@dchaplinsky
Author

Yep, callbacks are the rabbit hole. Maybe we can stick to something similar to transformers' callbacks?

class AbstractCallback:
    def on_detection(self, bboxes: list[dict]) -> list[dict]:
        pass

    def on_recognition(self, pages):
        pass

And then hook them into the wrapper model: model.add_hook(MyCallback()).
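
To make that concrete, a hypothetical subclass for use case 4 (padding detected boxes) against the interface sketched above - the dict keys are assumptions:

class PadBoxesCallback(AbstractCallback):
    def __init__(self, pad: float = 0.01):
        self.pad = pad

    def on_detection(self, bboxes: list[dict]) -> list[dict]:
        # Grow each box by a small margin, clamped to the page
        # (assumes relative corner coordinates under these keys)
        for box in bboxes:
            box["xmin"] = max(0.0, box["xmin"] - self.pad)
            box["ymin"] = max(0.0, box["ymin"] - self.pad)
            box["xmax"] = min(1.0, box["xmax"] + self.pad)
            box["ymax"] = min(1.0, box["ymax"] + self.pad)
        return bboxes

    def on_recognition(self, pages):
        return pages

model.add_hook(PadBoxesCallback(pad=0.02))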

felixdittrich92 linked pull request #1449 on Feb 5, 2024 that will close this issue
@felixdittrich92
Contributor

felixdittrich92 commented Feb 5, 2024

@dchaplinsky @frgfm CC @odulcy-mindee
Cleaned up the prototype PR a bit - for detection this should be fine, wdyt?

Adding this hook to the det_predictor / reco_predictor makes no sense in my mind, because there are two places where it makes sense to apply modifications or manipulate things:

  1. in the pipeline, one step before we crop and one step after removing the padding from the loc_preds (should we add a check that the coords are still inside the page and the hook return is correct, or leave this to the users for this advanced option? a minimal clipping sketch follows this list)
  2. in the recognition post_processor, where we have access to logits, embeddings (if they exist) and the vocab -> not yet included, that's something for another PR
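
For the in-page check mentioned in point 1, a minimal sketch (assuming loc_preds arrives as a numpy array of relative coordinates) could be as simple as:

import numpy as np

def clip_to_page(loc_preds: np.ndarray) -> np.ndarray:
    # Clamp user-modified coordinates back into the relative [0, 1] range
    return np.clip(loc_preds, 0.0, 1.0)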

@dchaplinsky
Author

dchaplinsky commented Feb 7, 2024

Thanks, I'll take a look asap.

Here is one example of why one might want to interfere between the det and reco stages:

[Screenshot 2024-02-06 at 22:09:49: a video frame with a 7-digit clock overlay in the top-left corner]

The 7-digit clock at the top left of the screen is recognisable by many of the models (parseq for sure), but the problem is that it is not detected (I've tried different architectures for the detection stage and also different binarization thresholds).
Here is one of the best detections I got:
[Screenshot 2024-02-06 at 22:12:12: detection output on the same frame]

Obviously, tasks like this could be handled with some extra pretraining for such fonts; I'll address that in a separate issue or PR.

@felixdittrich92
Contributor

Oh, I think mmocr will perform better for some scene-text-in-the-wild images!?
Because, as the name says, docTR is for document text recognition: our models are all pretrained on a mindee-internal dataset which contains invoices, receipts and other text-rich documents, but no (or at least fewer) wild text scenes 🤔

@dchaplinsky
Author

Unfortunately we cannot switch the OCR framework at this moment. On the other hand, the 7-digit indicator here is not recorded in the video itself; rather, it's overlaid on the live video.

@felixdittrich92
Contributor

Mh, yeah, what you could do is train your own model, for example on TotalText / ICDAR or any other dataset made for wild scene text detection.

@dchaplinsky
Author

@felixdittrich92 I just started to dig into it. What is the best place to ask questions and have discussions?

Thanks again for adding the callbacks!

@felixT2K
Contributor

felixT2K commented Feb 8, 2024

https://github.com/mindee/doctr/discussions :)

Such ideas from a user perspective are always helpful 🤗
