Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ISSUE-3: OCR specific Processor and new features/processing option #11

Merged
merged 24 commits into from
Dec 15, 2020

Conversation

DiegoPino
Copy link
Member

See #3 but really a lot more.

  • This changes the way the execute so we can timeout (proc_open, etc)
  • Move from PreSave to PostSave so we have the already persistent Entity and its ID
  • Better check on SBF JSON type when deciding what to ru
  • changes in annotations and some functions.Still complex but once we get the exact mix I will close some option and make the UI simpler
  • Hierarchical Processors via Drag and Drop
  • Queu Worker (Indexing) got a lot of Love, checks, cleanups, passes more data, gets more data, checks more
    -OCR specific Processor. Its Atomic so does only one page per Queue entry. May be Connected to one that passes the pages (parent) and can spawn number_of_pages number of children (the ocrs are children). But for testing directly it will just OCR the first page

All this is WIP so we can test @giancarlobi and will clean up tomorrow. HOCR to miniOCR does not work for me. I tried to adapt your code but failed. I may have missed something since I went to a faster xmlwriter in memory method. Too tired. Will fix tomorrow

Depends on esmero/strawberryfield#113

- This adds the ability to kill processes that are taking more time
//@todo
each processor needs to also pass back a $io->output->garbage so we can clean up any left over files
because we need the node->id()... i always testing with existing ones. This now happens all when all is already saved. So we do not push data back into the node.
We may want to have a PreSave one too to read flags like "force process" and then delete them before all gets saved.
…lex still)

And enforces also time based kill -9
@giancarlobi you may need a drush cr before trying this out
Many conditionals and double false/sanity checks before even trying to run
We pass more data, we get more data back we push more data

This depends @giancarlobi on esmero/strawberryfield#113
miniOCR code i tried to adapt is not working yet. This needs anther processor in chain that passes (invokes this as child) once per page. The idea is that each processor is Atomic. So this one only deals with a single page. Too tired to keep coding. Let's see if i figure out what is missing from the XML transformation. Probably something stupid on my side.
I'm also having now with this trouble reading the value back. Its a string, but PHP on decode things its an object.. gosh
@DiegoPino DiegoPino self-assigned this Nov 23, 2020
$coos = explode(" ", substr($page['title'], 5));
if (count($coos)) {
$w->startElement("p");
$w->writeAttribute("id", $pageid);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DiegoPino If I remember well, here we need xml:id as state in a not documented issue in plugin

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. Is that a bug of the plugin? Still on the xmlwriter code I wrote I don't seem to be able to output the 'text' of the elements out. This whole part needs 1-2 hours more of code. Will see if I can add a namespace id and will let you know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not a bug, I understood only not well documented. Now appears:
image

@giancarlobi
Copy link
Contributor

@DiegoPino If needed, here the script version to make output relative:

<?php
$val = getopt("i:p:");

$xml = simplexml_load_file($val['i']);

echo "<?xml version='1.0' encoding='UTF-8'?>" . "\n";
echo '<ocr>' . "\n";
foreach ($xml->body->children() as $page){
  $coos = explode(" ", substr($page['title'], 5));
  $pwidth = (float) $coos[2];
  $pheight = (float) $coos[3];
  echo '<p xml:id="' . $val['p'] . '" wh="' . $pwidth . " " . $pheight . '">' . "\n";
  echo '<b>' . "\n";
  foreach ($page->children() as $line){
        echo '<l>';
    foreach ($line->children() as $word){
        $wcoos = explode(" ", $word['title']);
        $x0 = (float) $wcoos[1];
        $y0 = (float) $wcoos[2];
        $x1 = (float) $wcoos[3];
        $y1 = (float) $wcoos[4];

        $l = round(($x0 / $pwidth) ,3);
        $t = round(($y0 / $pheight) ,3);
        $w = round((($x1 - $x0) / $pwidth) ,3);
        $h = round((($y1 - $y0) / $pheight) ,3);

        echo '<w x="' . $l . ' ' . $t . ' ' . $w . ' ' . $h . '">' . $word . '</w> ';
    }
    echo '</l>' . "\n";
  }
  echo '</b>' . "\n";
  echo '</p>' . "\n";
}
echo '</ocr>' . "\n";
?>

@giancarlobi
Copy link
Contributor

@DiegoPino Also, probably you have already noted, hOCR bbox has x0,y0,x1,y1 while MiniOCR uses l (= x0), t (= y0), w (= x1-x0) and h (= y1-y0)

$wcoos = explode(" ", $word['title']);
if (count($wcoos)) {
$w->startElement("w");
$w->writeAttribute("x", $wcoos[1] . ' ' . $wcoos[2] . ' ' . $wcoos[3] . ' ' . $wcoos[4]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of $wcoos[3] we have to put $wcoos[3]-$wcoos[1]
Instead of $wcoos[4] we have to put $wcoos[4]-$wcoos[2]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok will do that

It was a namespace (ns)
@giancarlobi ready to test
Move from // To be used by miniOCR as id in the form of {nodeuuid}/canvas/{fileuuid}/p{pagenumber}
to pagenumber
If so, skipps processing.
@giancarlobi to test this you need to reindex and make sure you have a checksum field for your strawberryfield_flavor_datasource data source in Solr.
Then run this. Should fill the first time
If you save the node again, second time/ infinite you should get a message saying its already in the index,
Now go to raw edit and change on letter in your checksum. Run again.
Should process again

//@todo
clear old solr documents. Checksum is great. Also on reindex read from Node stored HOCR, etc.

This is working well!
@giancarlobi
Copy link
Contributor

@DiegoPino An issue with MiniOCR, as reported here dbmdz/solr-ocrhighlighting#49 (comment) the decimal values have to start with . and not 0. due to plugin logic, if not Solr indexing doesn't work.

@giancarlobi thanks! addressing the code review
@giancarlobi there is a lot of not needed code i can optimize here
but it works. Will share my config!
@giancarlobi thanks for finding the BUG!
Now its fixed. I update SBF ISSUE-112 too to match this. Tested quickly and all pages are different
Running the same again now also works correctly, no reindex.
Thanks!!
…want to...

... make this a service with a method i can invoke and share. Simpler.
Added file update capabitilies. Processor generates a File entity, the worker adds it to the Node. @giancarlobi need to document and explain how we avoid double processing here. We mass more data around, but it saves us from overprocessing so that is good?
Gets some constructor assignments so we can get the temp:// folder defined by the system
If output as a file is desired a new %output file can be given. A lot of string manipulation to get the file names right back and forth and we may want to clean/generalized and also make some long code into methods we can reuse. But it actually works! WARC into WACZ without any issues!
…ral DCS cleanup

Still need to remove all the error_logs and make the whole reporting on issues/missing properties more consistent.

Tested and this is working on searchapi indexing, File adding and processor chaining.
@giancarlobi
Copy link
Contributor

@DiegoPino I'd remember you to update strawberry_runners.info.yml to make it compliant with Drupal 9, I don't know if this PR is the right place, only a gently reminder.

@giancarlobi
Copy link
Contributor

Great, thanks for that @DiegoPino

@DiegoPino
Copy link
Member Author

@giancarlobi will merge to make testing easier

@DiegoPino DiegoPino merged commit 92e4b41 into master Dec 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants