Auto detect "bad" transcription results and shortcut the aggregation code #347

CKrawczyk · 2020-08-19T08:49:42Z

When the transcription aggregation parameters are not tuned correctly, a large number of lines of text can end up being treated as a single line (in the worst case, the entire page is treated as one line). When this happens, the aggregation code attempts to align several hundred unique transcriptions, which slows the reduction down and, if it reaches 2 minutes, causes a timeout.

As this case will return gibberish anyway, there is no harm in stopping the aggregation early. This could potentially be detected by looking at how many witnesses are added to collatex (https://github.com/zooniverse/aggregation-for-caesar/blob/master/panoptes_aggregation/reducers/optics_line_text_reducer.py#L201 and https://github.com/zooniverse/aggregation-for-caesar/blob/master/panoptes_aggregation/reducers/text_utils.py#L314) and putting a cap on that number (e.g. only let the first 25 be added for alignment).

Impacts of this cap:

There is a limit to how many people can transcribe a line and be included in the aggregation. This might make issues hard to track down when there are actually 25+ transcriptions for a line (although at that point the research team is not following our best practices).
The text task reducer uses the same code, so it will have the same limitation placed upon it.
Where gibberish (or a timeout) would have been returned, slightly less bad gibberish will still be returned, but much faster.
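
A minimal sketch of the proposed cap, written as a standalone helper rather than against the real reducer code. The names `cap_witnesses` and `MAX_WITNESSES` are hypothetical and do not exist in the repository; the idea is that only the list returned here would be passed on as witnesses for alignment.

```python
from typing import List, Sequence, Tuple

# Hypothetical cap, taken from the "e.g. 25" suggestion above.
MAX_WITNESSES = 25


def cap_witnesses(
    transcriptions: Sequence[str],
    max_witnesses: int = MAX_WITNESSES,
) -> Tuple[List[str], bool]:
    """Keep at most `max_witnesses` transcriptions, in their original order.

    Returns the kept transcriptions and a flag that is True when some
    transcriptions were dropped, so the caller can mark the line as suspect.
    """
    if max_witnesses < 1:
        raise ValueError("max_witnesses must be at least 1")
    kept = list(transcriptions[:max_witnesses])
    truncated = len(transcriptions) > max_witnesses
    return kept, truncated


# Example: a badly tuned clustering step has merged 300 transcriptions
# into a single "line"; only the first 25 would go on to alignment.
cluster = [f"transcription {i}" for i in range(300)]
kept, truncated = cap_witnesses(cluster)
```

Returning the `truncated` flag alongside the kept list is one way to soften the first impact listed above: the reducer could record it in its output so that capped lines are easy to find later.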

alnah005 · 2021-08-25T19:42:41Z

I'm currently trying to use the Poly line text reducer for the ACLS project. I was wondering if this problem could be related to the execution never completing. I think it would be helpful to use/add a verbose option in the code to know where the code gets stuck. I've attached a sample of my log file to show that the reduction seems to work up until 99%.

Reducing: N/A% |                                               | ETA:  --:--:--
Reducing:   0% |                                               | ETA:   1:31:43
Reducing:   0% |                                               | ETA:   1:56:51
Reducing:   0% |                                               | ETA:   1:34:34
Reducing:   0% |                                               | ETA:   1:06:23
Reducing:   0% |                                               | ETA:   0:58:09
Reducing:   0% |                                               | ETA:   0:50:04
Reducing:   0% |                                               | ETA:   0:37:12

.....

Reducing:   7% |###                                            | ETA:   0:09:25
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:24
Reducing:   7% |###                                            | ETA:   0:09:23
Reducing:   7% |###                                            | ETA:   0:09:22
Reducing:   7% |###                                            | ETA:   0:09:21
Reducing:   7% |###                                            | ETA:   0:09:20
Reducing:   7% |###                                            | ETA:   0:09:20
Reducing:   7% |###                                            | ETA:   0:09:21

.....

Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |#####################                          | ETA:   0:07:37
Reducing:  46% |#####################                          | ETA:   0:07:37
Reducing:  46% |#####################                          | ETA:   0:07:37
Reducing:  46% |#####################                          | ETA:   0:07:37
Reducing:  46% |#####################                          | ETA:   0:07:38
Reducing:  46% |######################                         | ETA:   0:07:37
Reducing:  46% |######################                         | ETA:   0:07:36
Reducing:  46% |######################                         | ETA:   0:07:36
Reducing:  46% |######################                         | ETA:   0:07:36
Reducing:  46% |######################                         | ETA:   0:07:36
Reducing:  46% |######################                         | ETA:   0:07:35
Reducing:  47% |######################                         | ETA:   0:07:35
Reducing:  47% |######################                         | ETA:   0:07:35
Reducing:  47% |######################                         | ETA:   0:07:35
Reducing:  47% |######################                         | ETA:   0:07:35
Reducing:  47% |######################                         | ETA:   0:07:34

.....

Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00
Reducing:  99% |############################################## | ETA:   0:00:00

It has been stuck at 99% for about an hour or so. In addition, using the -s option I was able to get a final output; I'm just not sure how confident I can be that the final result is complete.

Update:

After leaving it to run for a further 6 hours, it completed.

CKrawczyk · 2021-09-06T09:52:39Z

Yeah, 6 hours for 1% sounds like this bug.

Adding a new verbose level that prints the current subject ID might be a good way to figure out where it gets stuck. That way you at least know which subject ID will give back junk, and the ID could be used to figure out which classification in the extract file is messing it up.
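
A rough sketch of what such a verbose level could look like, using only the standard library. Everything here is an assumption for illustration: `reduce_with_logging`, the `reduce_one` callable, and the logger name are hypothetical and are not part of the existing command line tool. The subject ID is logged before the reduction starts, so if the run hangs, the last debug line names the subject it is stuck on.

```python
import logging
import time
from typing import Any, Callable, Dict, Iterable, List, Sequence, Tuple

logger = logging.getLogger("reduce_debug")  # hypothetical logger name


def reduce_with_logging(
    subjects: Iterable[Tuple[Any, Sequence[Any]]],
    reduce_one: Callable[[Sequence[Any]], Any],
    slow_seconds: float = 120.0,
) -> Tuple[Dict[Any, Any], List[Any]]:
    """Reduce each subject, logging its ID first and flagging slow ones.

    `subjects` yields (subject_id, extracts) pairs and `reduce_one` stands in
    for whatever reducer is being run. Returns the results keyed by subject ID
    and the list of subject IDs that took at least `slow_seconds`.
    """
    results: Dict[Any, Any] = {}
    slow: List[Any] = []
    for subject_id, extracts in subjects:
        # Logged before the work starts, so a hang is attributable.
        logger.debug("reducing subject %s (%d extracts)", subject_id, len(extracts))
        start = time.monotonic()
        results[subject_id] = reduce_one(extracts)
        elapsed = time.monotonic() - start
        if elapsed >= slow_seconds:
            logger.warning("subject %s took %.1f s", subject_id, elapsed)
            slow.append(subject_id)
    return results, slow
```

With `logging.basicConfig(level=logging.DEBUG)` enabled, the slow subject IDs can then be looked up in the extract file to find the classifications involved.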
