Auto detect "bad" transcription results and shortcut the aggregation code #347
Comments
I'm currently trying to use the Poly line text reducer for the ACLS project. I was wondering if this problem could be related to the execution never completing. I think it would be helpful to add a verbose option to the code to show where it gets stuck. I've attached a sample of my log file to show that the reduction seems to work up until 99%.
It has been stuck at 99% for about an hour or so. In addition, using the
Update: After leaving it to run for a further 6 hours, it completed.
Yeah, 6 hours for 1% sounds like this bug. Adding a new verbose level that prints the current subject ID might be a good way to figure out where it gets stuck, so you at least know which subject ID will give back junk (and that could be used to figure out which classification from the extract file is messing it up).
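A minimal sketch of what such a verbose level could look like, assuming the reduction is a loop over subjects. The names `reduce_all`, `extracts_by_subject`, and `reducer` are hypothetical and are not taken from the panoptes_aggregation code; the point is only that logging the subject ID before each reduction shows which subject the run is stuck on.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger("reduce_progress")


def reduce_all(
    extracts_by_subject: Dict[Any, List[Any]],
    reducer: Callable[[List[Any]], Any],
    verbose: bool = False,
) -> Dict[Any, Any]:
    """Run `reducer` on each subject's extracts (hypothetical driver loop)."""
    results = {}
    for subject_id, extracts in extracts_by_subject.items():
        if verbose:
            # Log *before* reducing, so the last logged ID is the one in progress.
            logger.info(
                "Reducing subject %s (%d extracts)", subject_id, len(extracts)
            )
        results[subject_id] = reducer(extracts)
    return results


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # `len` stands in for a real reducer here.
    print(reduce_all({101: ["a", "b"], 102: ["c"]}, len, verbose=True))
```

If a run stalls, the last subject ID in the log is the one to look up in the extract file.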
When the transcription aggregation parameters are not tuned correctly, a large number of lines of text sometimes end up being seen as one line of text (and in the worst cases the entire page is seen as one line). When this happens the aggregation code attempts to align several hundred unique transcriptions, which slows it down and, if it reaches 2 minutes, causes a timeout.
As this case will return gibberish anyway, there is no harm in stopping the aggregation early. This could potentially be detected by looking at how many witnesses are added to collatex (https://github.com/zooniverse/aggregation-for-caesar/blob/master/panoptes_aggregation/reducers/optics_line_text_reducer.py#L201 and https://github.com/zooniverse/aggregation-for-caesar/blob/master/panoptes_aggregation/reducers/text_utils.py#L314), and putting a cap on that number (e.g. only let the first 25 be added for alignment).
Impacts of this cap:
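For reference, the cap itself (as opposed to its impacts) might look roughly like the sketch below. This is only an illustration under assumptions: `cap_witnesses` and `MAX_WITNESSES` are hypothetical names, the value 25 is the example figure from the text above, and the actual witness-building code in the linked files may be structured differently.

```python
from typing import List, Tuple

MAX_WITNESSES = 25  # example value from the proposal; not an existing setting


def cap_witnesses(
    transcriptions: List[str], max_witnesses: int = MAX_WITNESSES
) -> Tuple[List[str], bool]:
    """Keep only the first `max_witnesses` transcriptions.

    Returns the capped list and a flag saying whether anything was dropped,
    so the caller can mark the result as a likely "bad" transcription.
    """
    capped = transcriptions[:max_witnesses]
    was_capped = len(transcriptions) > max_witnesses
    return capped, was_capped


if __name__ == "__main__":
    # Simulate a badly clustered "line" with several hundred transcriptions.
    many = [f"transcription {i}" for i in range(300)]
    witnesses, flagged = cap_witnesses(many)
    # Only `witnesses` would then be handed to the alignment step.
    print(len(witnesses), flagged)
```

Returning the flag alongside the capped list lets the reducer report that a line hit the cap instead of silently truncating it.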