Auto detect "bad" transcription results and shortcut the aggregation code #347

Open
CKrawczyk opened this issue Aug 19, 2020 · 2 comments

@CKrawczyk
Collaborator

When the transcription aggregation parameters are not tuned correctly, a large number of lines of text can end up being treated as a single line (in the worst cases, the entire page is seen as one line). When this happens, the aggregation code attempts to align several hundred unique transcriptions, which slows it down and, if it reaches 2 minutes, causes a timeout.

As this case will return gibberish anyway, there is no harm in stopping the aggregation early. This could potentially be detected by looking at how many witnesses are added to collatex (https://github.com/zooniverse/aggregation-for-caesar/blob/master/panoptes_aggregation/reducers/optics_line_text_reducer.py#L201 and https://github.com/zooniverse/aggregation-for-caesar/blob/master/panoptes_aggregation/reducers/text_utils.py#L314), and putting a cap on that number (e.g. only let the first 25 be added for alignment).

Impacts of this cap:

  • There is a limit on how many people can transcribe a line and be included in the aggregation. It might be hard to track down issues when there are actually 25+ transcriptions for a line (although at that point the research teams are not listening to our best practices).
  • The text task reducer uses the same code, so it will have the same limitations placed upon it.
  • When gibberish (or a timeout) would have been returned, slightly less bad gibberish will still be returned, but much faster.
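
A rough sketch of the cap idea, in case it helps the discussion. This is not the reducer's actual code: `cap_witnesses` and `MAX_WITNESSES` are hypothetical names, and the sketch only trims the list of texts before they would be handed to the alignment step. It also returns a flag so a capped line could be marked in the output, which might make the "hard to track down" case above a bit easier.

```python
from typing import List, Sequence, Tuple

# Hypothetical default, taken from the "first 25" suggestion above.
MAX_WITNESSES = 25


def cap_witnesses(
    texts: Sequence[str], max_witnesses: int = MAX_WITNESSES
) -> Tuple[List[str], bool]:
    """Return at most `max_witnesses` texts and whether any were dropped."""
    if max_witnesses < 1:
        raise ValueError("max_witnesses must be at least 1")
    kept = list(texts[:max_witnesses])
    was_capped = len(texts) > max_witnesses
    return kept, was_capped


# A mis-tuned clustering step might lump a whole page into one "line".
page_as_one_line = [f"transcription {i}" for i in range(300)]
witnesses, was_capped = cap_witnesses(page_as_one_line)
print(len(witnesses), was_capped)  # 25 True
```

Only the kept texts would then be added as witnesses, so the alignment cost is bounded no matter how badly the lines were clustered.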
@alnah005
Contributor

alnah005 commented Aug 25, 2021

I'm currently trying to use the Poly line text reducer for the ACLS project. I was wondering if this problem could be related to the execution never completing. I think it would be helpful to add a verbose option to the code to show where it gets stuck. I've attached a sample of my log file, which shows the reduction working up until 99%.

Reducing: N/A% |                                               | ETA:  --:--:--
Reducing:   0% |                                               | ETA:   1:31:43
Reducing:   0% |                                               | ETA:   1:56:51
Reducing:   0% |                                               | ETA:   1:34:34
Reducing:   0% |                                               | ETA:   1:06:23
Reducing:   0% |                                               | ETA:   0:58:09
Reducing:   0% |                                               | ETA:   0:50:04
Reducing:   0% |                                               | ETA:   0:37:12

.....

Reducing:   7% |###                                            | ETA:   0:09:25
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:23
Reducing:   7% |###                                            | ETA:   0:09:22
Reducing:   7% |###                                            | ETA:   0:09:21
Reducing:   7% |###                                            | ETA:   0:09:20
Reducing:   7% |###                                            | ETA:   0:09:20
Reducing:   7% |###                                            | ETA:   0:09:21

.....

Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:37
Reducing:  46% |#####################                          | ETA:   0:07:37
Reducing:  46% |#####################                          | ETA:   0:07:37
Reducing:  46% |#####################                          | ETA:   0:07:37
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |######################                         | ETA:   0:07:37
Reducing:  46% |######################                         | ETA:   0:07:36
Reducing:  46% |######################                         | ETA:   0:07:36
Reducing:  46% |######################                         | ETA:   0:07:36
Reducing:  46% |######################                         | ETA:   0:07:36
Reducing:  46% |######################                         | ETA:   0:07:35
Reducing:  47% |######################                         | ETA:   0:07:35
Reducing:  47% |######################                         | ETA:   0:07:35
Reducing:  47% |######################                         | ETA:   0:07:35
Reducing:  47% |######################                         | ETA:   0:07:35
Reducing:  47% |######################                         | ETA:   0:07:34

.....

Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00

It has been stuck at 99% for about an hour or so. In addition, using the -s option I was able to get a final output; I'm just not sure how confident I can be that the final result is complete.

Update:

After letting it run for a further 6 hours, it completed.

@CKrawczyk
Collaborator Author

Yeah, 6 hours for 1% sounds like this bug.

Adding a new verbose level that prints the current subject ID might be a good way to figure out where it gets stuck. At the least you would know which subject ID will give back junk, and that could be used to figure out which classification in the extract file is messing it up.
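
Something along these lines could work as a stopgap outside the package. It is only a sketch under assumptions: `reduce_with_trace`, `groups`, and `reduce_fn` are made-up names, not part of panoptes_aggregation, and it assumes the extracts have already been grouped by subject ID. The subject ID is logged before the reduction starts, so if the run hangs, the last logged line names the culprit.

```python
import logging
import time
from typing import Any, Callable, Dict, Iterable, List, Sequence, Tuple

logger = logging.getLogger("reduce_trace")


def reduce_with_trace(
    groups: Iterable[Tuple[Any, Sequence[Any]]],
    reduce_fn: Callable[[Sequence[Any]], Any],
    slow_seconds: float = 60.0,
) -> Tuple[Dict[Any, Any], List[Any]]:
    """Reduce each (subject_id, extracts) pair, logging the ID before starting."""
    results: Dict[Any, Any] = {}
    slow_subjects: List[Any] = []
    for subject_id, extracts in groups:
        # Logged up front: a hung subject is the last one printed.
        logger.info("reducing subject %s (%d extracts)", subject_id, len(extracts))
        start = time.monotonic()
        results[subject_id] = reduce_fn(extracts)
        elapsed = time.monotonic() - start
        if elapsed > slow_seconds:
            logger.warning("subject %s took %.1f s", subject_id, elapsed)
            slow_subjects.append(subject_id)
    return results, slow_subjects


logging.basicConfig(level=logging.INFO)
# `len` stands in for the real reducer here.
demo_results, demo_slow = reduce_with_trace([(101, ["a", "b"]), (102, ["c"])], len)
```

Any subject that lands in the slow list is a candidate for the bug described in this issue, and its extracts are the ones to inspect.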
