-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
initial creation of ManipulatePreds class + tests #184
base: main
Are you sure you want to change the base?
Conversation
|
||
class ManipulatePreds: | ||
""" | ||
Class to help expand prediction text and indexing via OCR data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add an example usage here?
|
||
def expand_predictions(self, pred_start: int, pred_end: int) -> dict: | ||
""" | ||
Expand predictions and boundaries to match that of OCR data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
provide an example here of what you mean 100 -> $100.00 if $100.00 is the full token
|
||
# Find index value of bounded prediction | ||
for index, pred in enumerate(self.predictions): | ||
if pred["start"] == pred_start or pred["end"] == pred_end: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder here whether you should be looking for any overlap rather than an exact match on start/end-- I believe there is already a helper function for this somewhere
self.ocr_tokens = ocr_tokens | ||
self.predictions = preds | ||
|
||
def expand_predictions(self, pred_start: int, pred_end: int) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is how I'd imagine using this function-- wouldn't you have a specific prediction in mind where you want this to run? So the function could be static and take list of ocr_tokens and pred and then expands then returns the pred (expanded if that's relevant-- updating the start/end indexes as well as the text value)? Let me know what your rationale was for pred start/end instead I might be missing something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"alks Sco"-- > ["talks", "Scott"] - > "talks Scott"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def expand_pred(some_pred: dict, tokens: List[dict]) -> expanded_pred (dict)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-> dict, bool (bool = True if updated)
pred_index = None | ||
|
||
# Find index value of bounded prediction | ||
for index, pred in enumerate(self.predictions): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could eliminate this? I'm not sure how you'd have pred start/end in the first place without already knowing the pred you want to operate on
return self.predictions[pred_index] | ||
|
||
# Use overlapping boundaries and expand text / boundaries to match OCR data | ||
for token in self.ocr_tokens: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might be able to reuse some token matching functionality from other classes-- and then expand from there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. see if you can reuse this class (or if it needs some small tweaks, could include that?) https://github.com/IndicoDataSolutions/Indico-Solutions-Toolkit/blob/main/indico_toolkit/association/extracted_tokens.py
|
||
expanded_text = self._get_ocr_text(expanded_start, expanded_end) | ||
if expanded_text != ocr_text_initial: | ||
raise ValueError("Expanded text does not match the OCR text.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how could this be the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would feel like if you match, then you're good
|
||
if expanded_text == ocr_text_initial: | ||
# Update prediction | ||
self.predictions[pred_index]["start"] = expanded_start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we instantiate this class with all of the preds / tokens, then maybe this method woul dmake more sense operating against a particular label? (i.e. expand for all "Insured Name")-- to me, probably makes more sense to have this be a static method as described above-
if expanded_text != ocr_text_initial: | ||
raise ValueError("Expanded text does not match the OCR text.") | ||
|
||
if expanded_text == ocr_text_initial: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need this check given condition above (I would always assume that these have to be equal given that you set them)
return self.predictions[pred_index] | ||
|
||
def is_token_nearby( | ||
self, ocr_start: int, ocr_end: int, search_tokens: List[str], distance: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand what distance means here or when this function would be useful (also don't like the idea of entering in the ocr_start ocr_end (how would you know what those values should be?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the more relevant functionality would be "is SOME_TEXT contained within a token within X distance"
No description provided.