Modification/filtering of the bounding boxes before OCR #1445

Closed
dchaplinsky opened this issue Jan 31, 2024 · 14 comments · Fixed by #1449

@dchaplinsky

🚀 The feature

It would be great to have an opportunity to intervene in the middle of the pipeline and adjust/remove some of the bboxes found before running OCR on them. For example, small bboxes could be padded a bit or removed. Or, if we expect bboxes only in particular parts of the frame, we could filter out the rest.

I've dug into the code, and it seems that all the magic happens in the forward of OCRPredictor and things are tightly coupled, but one might add an optional callback to send the bboxes for additional transformation/filtering midway.

Motivation, pitch

We experienced some issues when OCRing text from video. While the position of the text is mostly static on the screen, the text detection model sometimes fails to detect proper boundaries on some frames (while doing perfectly fine on others). Closer inspection showed that those poor results happened because the identified bboxes were slightly smaller than needed, which as a result cut one digit from the recognised text. It could be fixed by padding those incorrect bboxes, especially given that I know the correct bbox from the previous frame of the video.

Alternatives

I can of course subclass OCRPredictor and replace the forward method completely, but then I'll also need to replace all the mechanics that happen in zoo.py.

Additional context

No response

dchaplinsky added the type: enhancement label on Jan 31, 2024
@felixdittrich92
Contributor

Hi @dchaplinsky 👋

Have you already tried lowering the binarization threshold?

predictor.det_predictor.model.postprocessor.bin_thresh = 0.1  # default is 0.3
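
For completeness, here is what that tweak could look like in a full, runnable snippet - a minimal sketch, with the file path as a placeholder:

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

predictor = ocr_predictor(pretrained=True)
# Lower the binarization threshold so fainter text regions
# survive the detection postprocessing (default is 0.3)
predictor.det_predictor.model.postprocessor.bin_thresh = 0.1

doc = DocumentFile.from_images("path/to/frame.png")  # placeholder path
result = predictor(doc)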

I think it will be hard to implement something in the middle of the pipeline, because normally you don't know the coordinates beforehand.
Could you explain in a bit more detail how you think this would be a useful feature, maybe with a short description of how you would like to use it from a user's point of view?

What we could do is also open the postprocessor's unclip_ratio and box_thresh to users for more control.

@dchaplinsky
Author

Well, one can pass an optional callback which takes the list of bboxes and returns a modified list of bboxes.
Use cases:

  1. I loosely know where my bounding boxes should be. I'll delete bounding boxes that are not located there (for example, in the middle of the frame on video).
  2. I more or less know the size of the bbox I want to OCR. I might delete smaller or bigger bboxes, saving some compute on OCR (use cases 2 and 3 are sketched after this list).
  3. I'd like to OCR only the biggest bbox, so I can discard all the rest.
  4. I know that the detection model might sometimes produce a smaller bbox than I need; I can pad (or shrink) it as I wish before passing it to the OCR.
  5. From previous frames, pages or other documents I know where the bbox has to be on the screen. I might even add it manually if the detection model missed it (for example, if I'm using a very lean detection model for performance reasons, or I have a noisy video with OSD text).

Does it make sense?
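
For illustration, a rough sketch of use cases 2 and 3 as such a callback - the coordinate format is an assumption (relative (xmin, ymin, xmax, ymax) tuples), not an existing interface:

def filter_bboxes(bboxes: list[tuple[float, float, float, float]]) -> list[tuple[float, float, float, float]]:
    # Hypothetical callback: assumes relative (xmin, ymin, xmax, ymax) boxes
    def area(box: tuple[float, float, float, float]) -> float:
        xmin, ymin, xmax, ymax = box
        return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)

    # Use case 2: drop boxes too small to hold the expected text
    candidates = [box for box in bboxes if area(box) > 1e-4]
    # Use case 3: keep only the biggest remaining box
    return [max(candidates, key=area)] if candidates else []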

@felixdittrich92
Contributor

Hi @dchaplinsky
Did you have something like #1449 in mind?
@odulcy-mindee What do you think? (Maybe also a good way to make #988 possible -> the callback could be applied up to the logit vector (recognition models), where you could manipulate it on your own - this would indeed be for advanced users)
Only a short and dirty prototype yet 😅
CC @frgfm

@dchaplinsky
Author

@felixdittrich92, yes!

@felixdittrich92
Contributor

However, we need to be careful not to add too much complexity to the design, so I'd like to clear that up with the other two before we move on :)

@frgfm
Collaborator

frgfm commented Feb 2, 2024

Hey there @dchaplinsky 👋

Thanks for the suggestion! In my experience, the best "interface" decisions we've made for docTR were the ones where we considered all the things that should (vs. could) be customized, and in which form factor.

There are multiple ways of doing this, some that require a bit more work on internals, others that don't. So since you mentioned the need and motivation, could you try to come up with a short snippet of what you'd like to use?

e.g.

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# Configure
model = ocr_predictor(pretrained=True)
model.det_predictor.update_filter(....)
# model.det_predictor.add_hook(lambda boxes: ...)
# etc
# Inference
doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
result = model(doc)

That will make it easier to come up with an interface suggestion and see if it could fit in the roadmap :)

@dchaplinsky
Author

Yep, callbacks are the rabbit hole. Maybe we can stick to something similar to transformers' callbacks?

class AbstractCallback:
    def on_detection(self, bboxes: list[dict]) -> list[dict]:
        pass

    def on_recognition(self, pages):
        pass

And then hook them into the wrapper model: model.add_hook(MyCallback()).
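
To make that concrete, a hypothetical subclass for use case 4 (padding detected boxes) against the interface sketched above - the dict keys are assumptions:

class PadBoxesCallback(AbstractCallback):
    def __init__(self, pad: float = 0.01):
        self.pad = pad

    def on_detection(self, bboxes: list[dict]) -> list[dict]:
        # Grow each box by a small margin, clamped to the page
        # (assumes relative corner coordinates under these keys)
        for box in bboxes:
            box["xmin"] = max(0.0, box["xmin"] - self.pad)
            box["ymin"] = max(0.0, box["ymin"] - self.pad)
            box["xmax"] = min(1.0, box["xmax"] + self.pad)
            box["ymax"] = min(1.0, box["ymax"] + self.pad)
        return bboxes

    def on_recognition(self, pages):
        return pages

model.add_hook(PadBoxesCallback(pad=0.02))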

felixdittrich92 linked pull request #1449 on Feb 5, 2024 that will close this issue
@felixdittrich92
Contributor

felixdittrich92 commented Feb 5, 2024

@dchaplinsky @frgfm CC @odulcy-mindee
Cleaned up the prototype PR a bit - for detection this should be fine, wdyt?

Adding this hook to the det_predictor / reco_predictor makes no sense in my mind, because there are two places where it makes sense to apply modifications or manipulate things:

  1. in the pipeline, one step before we crop and one step after removing the padding from the loc_preds (should we add a check that the coords are still inside the page and the hook return is correct, or leave this to the users for this advanced option? a minimal clipping sketch follows this list)
  2. in the recognition post_processor, where we have access to logits, embeddings (if they exist) and the vocab -> not yet included, that's something for another PR
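
For the in-page check mentioned in point 1, a minimal sketch (assuming loc_preds arrives as a numpy array of relative coordinates) could be as simple as:

import numpy as np

def clip_to_page(loc_preds: np.ndarray) -> np.ndarray:
    # Clamp user-modified coordinates back into the relative [0, 1] range
    return np.clip(loc_preds, 0.0, 1.0)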

@dchaplinsky
Author

dchaplinsky commented Feb 7, 2024

Thanks, I'll take a look asap.

Here is one example of why one might want to interfere between the det and reco stages:

[Screenshot 2024-02-06 at 22:09:49: a video frame with a 7-digit clock overlay in the top-left corner]

The 7-digit clock at the top left of the screen is recognisable by many of the models (parseq for sure), but the problem is that it is not detected (I've tried different architectures for the detection stage and also different binarization thresholds).
Here is one of the best detections I got:
[Screenshot 2024-02-06 at 22:12:12: detection output on the same frame]

Obviously, tasks like this could be handled with some extra pretraining for such fonts; I'll address that in a separate issue or PR.

@felixdittrich92
Contributor

Oh, I think mmocr will perform better for some scene-text-in-the-wild images!?
Because, as the name says, docTR is for document text recognition: our models are all pretrained on a mindee-internal dataset which contains invoices, receipts and other text-rich documents, but no (or at least fewer) wild text scenes 🤔

@dchaplinsky
Author

Unfortunately we cannot switch the OCR framework at this moment. On the other hand, the 7-digit indicator here is not recorded in the video itself; rather, it's overlaid on the live video.

@felixdittrich92
Contributor

Mh, yeah, what you could do is train your own model, for example on TotalText / ICDAR or any other dataset made for wild scene text detection.

@dchaplinsky
Author

@felixdittrich92 I just started to dig into it. What is the best place to ask questions and have discussions?

Thanks again for adding the callbacks!

@felixT2K
Contributor

felixT2K commented Feb 8, 2024

https://github.com/mindee/doctr/discussions :)

Such ideas from a user perspective are always helpful 🤗
