Modification/filtering of the bounding boxes before OCR #1445
Comments
Hi @dchaplinsky 👋 Have you already tried lowering the binarization threshold?
I think it will be hard to implement something in the middle of the pipeline because normally you don't know the coordinates beforehand. What we could do is to open the postprocessor's |
Well, one can pass an optional callback which takes the list of bboxes and returns a modified list of bboxes.
Does it make sense? |
Hi @dchaplinsky |
@felixdittrich92, yes! |
However, we need to be careful not to add too much complex logic when designing, so I'd like to clear that up with the other two before we move on :) |
Hey there @dchaplinsky 👋 Thanks for the suggestion! In my experience, the best "interface" decisions we've made for docTR were the ones where we considered everything that should (vs. could) be customized, and in which form factor. There are multiple ways of doing this: some require a bit more work on the internals, others don't. Since you mentioned that need and motivation, could you come up with a short snippet showing what you'd like to use? E.g.:
from doctr.io import DocumentFile
from doctr.models import ocr_predictor
# Configure
model = ocr_predictor(pretrained=True)
model.det_predictor.update_filter(....)
# model.det_predictor.add_hook(lambda boxes: ...)
# etc
# Inference
doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
result = model(doc)
That will make it easier to come up with an interface suggestion and see whether it could fit in the roadmap :) |
Yep, callbacks are a rabbit hole. Probably we can stick to something similar to transformers callbacks?
class AbstractCallback:
    def on_detection(self, bboxes: list[dict]) -> list[dict]:
        pass
    def on_recognition(self, pages):
        pass
And then hook them to the wrapper model: |
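The snippet that followed in the original comment is not preserved here. As a rough, library-independent sketch of the idea (not docTR's actual API), a wrapper could run each callback between the detection and recognition stages. The names `Pipeline`, `DropTinyBoxes`, `fake_detect`, and `fake_recognize` are hypothetical, and boxes are assumed to be dicts with relative `xmin`/`ymin`/`xmax`/`ymax` keys.

```python
from typing import Callable, Sequence


class AbstractCallback:
    """Base class: each method returns its input unchanged by default."""

    def on_detection(self, bboxes: list[dict]) -> list[dict]:
        return bboxes

    def on_recognition(self, pages: list[dict]) -> list[dict]:
        return pages


class DropTinyBoxes(AbstractCallback):
    """Hypothetical callback: discard boxes narrower than `min_width`."""

    def __init__(self, min_width: float = 0.02) -> None:
        self.min_width = min_width

    def on_detection(self, bboxes: list[dict]) -> list[dict]:
        return [b for b in bboxes if b["xmax"] - b["xmin"] >= self.min_width]


class Pipeline:
    """Hypothetical wrapper that invokes the callbacks between the two stages."""

    def __init__(
        self,
        detect: Callable[[object], list[dict]],
        recognize: Callable[[object, list[dict]], list[dict]],
        callbacks: Sequence[AbstractCallback] = (),
    ) -> None:
        self.detect = detect
        self.recognize = recognize
        self.callbacks = list(callbacks)

    def __call__(self, image: object) -> list[dict]:
        bboxes = self.detect(image)
        for cb in self.callbacks:
            bboxes = cb.on_detection(bboxes)  # adjust boxes before recognition
        pages = self.recognize(image, bboxes)
        for cb in self.callbacks:
            pages = cb.on_recognition(pages)  # adjust the final output
        return pages


# Stand-in stages so the sketch runs without any OCR model (assumptions).
def fake_detect(image: object) -> list[dict]:
    return [
        {"xmin": 0.10, "ymin": 0.10, "xmax": 0.40, "ymax": 0.20},
        {"xmin": 0.50, "ymin": 0.50, "xmax": 0.51, "ymax": 0.55},
    ]


def fake_recognize(image: object, bboxes: list[dict]) -> list[dict]:
    return [{"bbox": b, "text": "?"} for b in bboxes]


pipeline = Pipeline(fake_detect, fake_recognize, callbacks=[DropTinyBoxes()])
pages = pipeline(None)
```

Giving the base-class methods an identity default means a callback only needs to override the stage it cares about.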
@dchaplinsky @frgfm CC @odulcy-mindee Adding this hook to the
|
Oh, I think for some scene-text-in-the-wild images mmocr will perform better!? |
Unfortunately we cannot switch the OCR framework at this moment. On the other hand, the 7-digit indicator here is not recorded on video; rather, it's overlaid on the live video. |
Mh yeah, what you could do is train your own model, for example on TotalText / ICDAR or any other dataset made for scene text detection in the wild |
@felixdittrich92 just started to dig into it. What is the best place for questions and discussions? Thanks again for adding the callbacks! |
https://github.com/mindee/doctr/discussions :) Such ideas from a user perspective are always helpful 🤗 |
🚀 The feature
It would be great to have an opportunity to intervene in the middle of the pipeline and adjust/remove some bboxes found before running OCR on it. For example, small bboxes can be padded a bit or removed. Or, if we expect some bboxes in particular parts of the frame we might filter out the rest.
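To make the request concrete, here is a minimal sketch of such an adjustment step, written independently of docTR. It assumes boxes are `(xmin, ymin, xmax, ymax)` tuples in relative coordinates in [0, 1]; the function names, the region of interest, and the thresholds are all hypothetical.

```python
Box = tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax), relative coords


def pad_box(box: Box, pad: float) -> Box:
    """Grow a box by `pad` on every side, clamped to the [0, 1] frame."""
    xmin, ymin, xmax, ymax = box
    return (
        max(0.0, xmin - pad),
        max(0.0, ymin - pad),
        min(1.0, xmax + pad),
        min(1.0, ymax + pad),
    )


def center_in_region(box: Box, region: Box) -> bool:
    """True if the box center lies inside the region of interest."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    return region[0] <= cx <= region[2] and region[1] <= cy <= region[3]


def adjust_boxes(
    boxes: list[Box], region: Box, min_height: float = 0.03, pad: float = 0.01
) -> list[Box]:
    """Drop boxes outside `region`; pad the ones shorter than `min_height`."""
    kept = []
    for box in boxes:
        if not center_in_region(box, region):
            continue  # text is not expected in this part of the frame
        if box[3] - box[1] < min_height:
            box = pad_box(box, pad)  # small box: give it some margin
        kept.append(box)
    return kept


boxes = [
    (0.10, 0.80, 0.30, 0.82),  # thin box inside the region: gets padded
    (0.50, 0.10, 0.70, 0.20),  # outside the region: removed
    (0.40, 0.85, 0.60, 0.95),  # inside the region, tall enough: unchanged
]
bottom_strip = (0.0, 0.75, 1.0, 1.0)
result = adjust_boxes(boxes, bottom_strip)
```

Whether small boxes should be padded or dropped depends on the footage; the sketch pads them, and swapping the padding for another `continue` would remove them instead.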
I've dug into the code, and it seems that all the magic happens in the forward of OCRPredictor and things are tightly coupled, but one might pass an optional callback to send the bboxes for additional transformation/filtering midway.
Motivation, pitch
We experienced some issues when OCRing text from video. While the position of the text is mostly static on the screen, sometimes the text detection model fails to detect proper boundaries on some frames (while doing perfectly fine on others). Closer inspection showed that those poor results happened because the identified bboxes were slightly smaller than needed, which as a result cut one digit from the recognised text. This can be fixed by padding those incorrect bboxes, especially given that I know the correct bbox from the previous frame of the video.
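One possible way to use the previous frame, sketched here under the assumption that boxes are `(xmin, ymin, xmax, ymax)` tuples and that the text really is static: match each current box to the previous-frame box it overlaps most, and if the overlap is high enough, expand it to the union of the two. The helper names and the `min_iou` threshold are hypothetical.

```python
Box = tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def union_box(a: Box, b: Box) -> Box:
    """Smallest box that contains both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def stabilize(current: list[Box], previous: list[Box], min_iou: float = 0.5) -> list[Box]:
    """Expand each current box to cover its best-matching previous box."""
    out = []
    for box in current:
        best = max(previous, key=lambda p: iou(box, p), default=None)
        if best is not None and iou(box, best) >= min_iou:
            box = union_box(box, best)  # recover the part that was cut off
        out.append(box)
    return out


previous = [(0.10, 0.80, 0.40, 0.90)]
current = [
    (0.10, 0.80, 0.36, 0.90),  # same text, but the right edge was cut short
    (0.60, 0.10, 0.80, 0.20),  # new box with no match in the previous frame
]
stabilized = stabilize(current, previous)
```

Because the union can only grow, comparing against a trusted reference box (rather than the previous frame's already-stabilized output) would avoid boxes drifting larger over many frames.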
Alternatives
I can of course subclass OCRPredictor and replace the forward method completely, but then I'll also need to replace all the mechanics that happen in zoo.py.
Additional context
No response