ISSUE-3: OCR specific Processor and new features/processing option #11

DiegoPino · 2020-11-23T05:26:46Z

See #3 but really a lot more.

This changes the way the execute so we can timeout (proc_open, etc)
Move from PreSave to PostSave so we have the already persistent Entity and its ID
Better check on SBF JSON type when deciding what to ru
changes in annotations and some functions.Still complex but once we get the exact mix I will close some option and make the UI simpler
Hierarchical Processors via Drag and Drop
Queu Worker (Indexing) got a lot of Love, checks, cleanups, passes more data, gets more data, checks more
-OCR specific Processor. Its Atomic so does only one page per Queue entry. May be Connected to one that passes the pages (parent) and can spawn number_of_pages number of children (the ocrs are children). But for testing directly it will just OCR the first page

All this is WIP so we can test @giancarlobi and will clean up tomorrow. HOCR to miniOCR does not work for me. I tried to adapt your code but failed. I may have missed something since I went to a faster xmlwriter in memory method. Too tired. Will fix tomorrow

Depends on esmero/strawberryfield#113

@todo

- This adds the ability to kill processes that are taking more time //@todo each processor needs to also pass back a $io->output->garbage so we can clean up any left over files

because we need the node->id()... i always testing with existing ones. This now happens all when all is already saved. So we do not push data back into the node. We may want to have a PreSave one too to read flags like "force process" and then delete them before all gets saved.

…lex still) And enforces also time based kill -9

@giancarlobi

@giancarlobi you may need a drush cr before trying this out

@giancarlobi

Many conditionals and double false/sanity checks before even trying to run We pass more data, we get more data back we push more data This depends @giancarlobi on esmero/strawberryfield#113

miniOCR code i tried to adapt is not working yet. This needs anther processor in chain that passes (invokes this as child) once per page. The idea is that each processor is Atomic. So this one only deals with a single page. Too tired to keep coding. Let's see if i figure out what is missing from the XML transformation. Probably something stupid on my side. I'm also having now with this trouble reading the value back. Its a string, but PHP on decode things its an object.. gosh

giancarlobi · 2020-11-23T10:09:50Z

src/Plugin/StrawberryRunnersPostProcessor/OcrPostProcessor.php

+      $coos = explode(" ", substr($page['title'], 5));
+      if (count($coos)) {
+        $w->startElement("p");
+        $w->writeAttribute("id", $pageid);


@DiegoPino If I remember well, here we need xml:id as state in a not documented issue in plugin

OK. Is that a bug of the plugin? Still on the xmlwriter code I wrote I don't seem to be able to output the 'text' of the elements out. This whole part needs 1-2 hours more of code. Will see if I can add a namespace id and will let you know.

It is not a bug, I understood only not well documented. Now appears:

giancarlobi · 2020-11-23T10:16:56Z

@DiegoPino If needed, here the script version to make output relative:

<?php
$val = getopt("i:p:");

$xml = simplexml_load_file($val['i']);

echo "<?xml version='1.0' encoding='UTF-8'?>" . "\n";
echo '<ocr>' . "\n";
foreach ($xml->body->children() as $page){
  $coos = explode(" ", substr($page['title'], 5));
  $pwidth = (float) $coos[2];
  $pheight = (float) $coos[3];
  echo '<p xml:id="' . $val['p'] . '" wh="' . $pwidth . " " . $pheight . '">' . "\n";
  echo '<b>' . "\n";
  foreach ($page->children() as $line){
        echo '<l>';
    foreach ($line->children() as $word){
        $wcoos = explode(" ", $word['title']);
        $x0 = (float) $wcoos[1];
        $y0 = (float) $wcoos[2];
        $x1 = (float) $wcoos[3];
        $y1 = (float) $wcoos[4];

        $l = round(($x0 / $pwidth) ,3);
        $t = round(($y0 / $pheight) ,3);
        $w = round((($x1 - $x0) / $pwidth) ,3);
        $h = round((($y1 - $y0) / $pheight) ,3);

        echo '<w x="' . $l . ' ' . $t . ' ' . $w . ' ' . $h . '">' . $word . '</w> ';
    }
    echo '</l>' . "\n";
  }
  echo '</b>' . "\n";
  echo '</p>' . "\n";
}
echo '</ocr>' . "\n";
?>

giancarlobi · 2020-11-23T11:08:25Z

@DiegoPino Also, probably you have already noted, hOCR bbox has x0,y0,x1,y1 while MiniOCR uses l (= x0), t (= y0), w (= x1-x0) and h (= y1-y0)

giancarlobi · 2020-11-23T11:14:17Z

src/Plugin/StrawberryRunnersPostProcessor/OcrPostProcessor.php

+            $wcoos = explode(" ", $word['title']);
+            if (count($wcoos)) {
+              $w->startElement("w");
+              $w->writeAttribute("x", $wcoos[1] . ' ' . $wcoos[2] . ' ' . $wcoos[3] . ' ' . $wcoos[4]);


Instead of $wcoos[3] we have to put $wcoos[3]-$wcoos[1]
Instead of $wcoos[4] we have to put $wcoos[4]-$wcoos[2]

Ok will do that

@giancarlobi

It was a namespace (ns) @giancarlobi ready to test

Move from // To be used by miniOCR as id in the form of {nodeuuid}/canvas/{fileuuid}/p{pagenumber} to pagenumber

removed all data too

@giancarlobi

If so, skipps processing. @giancarlobi to test this you need to reindex and make sure you have a checksum field for your strawberryfield_flavor_datasource data source in Solr. Then run this. Should fill the first time If you save the node again, second time/ infinite you should get a message saying its already in the index, Now go to raw edit and change on letter in your checksum. Run again. Should process again //@todo clear old solr documents. Checksum is great. Also on reindex read from Node stored HOCR, etc. This is working well!

giancarlobi · 2020-11-24T10:18:24Z

@DiegoPino An issue with MiniOCR, as reported here dbmdz/solr-ocrhighlighting#49 (comment) the decimal values have to start with . and not 0. due to plugin logic, if not Solr indexing doesn't work.

@giancarlobi

@giancarlobi thanks! addressing the code review

@giancarlobi

@giancarlobi there is a lot of not needed code i can optimize here but it works. Will share my config!

@giancarlobi

@giancarlobi thanks for finding the BUG! Now its fixed. I update SBF ISSUE-112 too to match this. Tested quickly and all pages are different Running the same again now also works correctly, no reindex. Thanks!!

…want to... ... make this a service with a method i can invoke and share. Simpler.

@giancarlobi

Added file update capabitilies. Processor generates a File entity, the worker adds it to the Node. @giancarlobi need to document and explain how we avoid double processing here. We mass more data around, but it saves us from overprocessing so that is good?

Gets some constructor assignments so we can get the temp:// folder defined by the system

If output as a file is desired a new %output file can be given. A lot of string manipulation to get the file names right back and forth and we may want to clean/generalized and also make some long code into methods we can reuse. But it actually works! WARC into WACZ without any issues!

@giancarlobi

Nice catch @giancarlobi

src/Plugin/QueueWorker/AbstractPostProcessorQueueWorker.php

…ral DCS cleanup Still need to remove all the error_logs and make the whole reporting on issues/missing properties more consistent. Tested and this is working on searchapi indexing, File adding and processor chaining.

giancarlobi · 2020-12-07T10:19:27Z

@DiegoPino I'd remember you to update strawberry_runners.info.yml to make it compliant with Drupal 9, I don't know if this PR is the right place, only a gently reminder.

@giancarlobi

@giancarlobi as requested

giancarlobi · 2020-12-07T14:08:59Z

Great, thanks for that @DiegoPino

@giancarlobi

@giancarlobi for you friend

DiegoPino · 2020-12-15T19:21:06Z

@giancarlobi will merge to make testing easier

DiegoPino added 6 commits November 22, 2020 23:22

proc_open and killing of processes taking more than X seconds

d0d4a6d

- This adds the ability to kill processes that are taking more time //@todo each processor needs to also pass back a $io->output->garbage so we can clean up any left over files

Updates processor annotations (this may change, i feel its a bit comp…

b332c59

…lex still) And enforces also time based kill -9

Updates Service to run on PostSave

618a56c

@giancarlobi you may need a drush cr before trying this out

So many updates on our Queue Worker

e144f71

Many conditionals and double false/sanity checks before even trying to run We pass more data, we get more data back we push more data This depends @giancarlobi on esmero/strawberryfield#113

DiegoPino requested a review from giancarlobi November 23, 2020 05:26

DiegoPino self-assigned this Nov 23, 2020

giancarlobi reviewed Nov 23, 2020

View reviewed changes

DiegoPino added 6 commits November 23, 2020 11:23

OK. I got XPath is working

391101b

It was a namespace (ns) @giancarlobi ready to test

Route Fix for D9

224c440

Address page id question from giancarlo

617b64a

Move from // To be used by miniOCR as id in the form of {nodeuuid}/canvas/{fileuuid}/p{pagenumber} to pagenumber

Add search_api_solr to composer dependency

f1073bd

Correctly process page ratio and parse things out for miniCOR

0442676

removed all data too

DiegoPino added 8 commits November 24, 2020 09:14

Remove leading 0s from miniOCR dimensions

2726210

@giancarlobi thanks! addressing the code review

Chained processors working

eaf0a47

@giancarlobi there is a lot of not needed code i can optimize here but it works. Will share my config!

Fixed generic key store key, now all pages are actually different

5c0d688

@giancarlobi thanks for finding the BUG! Now its fixed. I update SBF ISSUE-112 too to match this. Tested quickly and all pages are different Running the same again now also works correctly, no reindex. Thanks!!

Basically a LOT: For now 2 events subs, share the same code so i may …

f6247d7

…want to... ... make this a service with a method i can invoke and share. Simpler.

Update StrawberryRunnersPostProcessorPluginBase.php

10e8631

Gets some constructor assignments so we can get the temp:// folder defined by the system

remove deprecated D9 for temp storage

04e5f11

Nice catch @giancarlobi

giancarlobi requested changes Dec 5, 2020

View reviewed changes

src/Plugin/QueueWorker/AbstractPostProcessorQueueWorker.php Outdated Show resolved Hide resolved

DiegoPino added 2 commits December 6, 2020 15:33

Address Code review from @giancarlobi

02abe54

giancarlobi approved these changes Dec 7, 2020

View reviewed changes

Drupal 9 in the .info

4c54c94

@giancarlobi as requested

Update hook for missing entity

91ccc23

@giancarlobi for you friend

DiegoPino merged commit 92e4b41 into master Dec 15, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ISSUE-3: OCR specific Processor and new features/processing option #11

ISSUE-3: OCR specific Processor and new features/processing option #11

DiegoPino commented Nov 23, 2020

giancarlobi Nov 23, 2020

DiegoPino Nov 23, 2020

giancarlobi Nov 23, 2020

giancarlobi commented Nov 23, 2020

giancarlobi commented Nov 23, 2020

giancarlobi Nov 23, 2020

DiegoPino Nov 23, 2020

giancarlobi commented Nov 24, 2020

giancarlobi commented Dec 7, 2020

giancarlobi commented Dec 7, 2020

DiegoPino commented Dec 15, 2020

ISSUE-3: OCR specific Processor and new features/processing option #11

ISSUE-3: OCR specific Processor and new features/processing option #11

Conversation

DiegoPino commented Nov 23, 2020

giancarlobi Nov 23, 2020

Choose a reason for hiding this comment

DiegoPino Nov 23, 2020

Choose a reason for hiding this comment

giancarlobi Nov 23, 2020

Choose a reason for hiding this comment

giancarlobi commented Nov 23, 2020

giancarlobi commented Nov 23, 2020

giancarlobi Nov 23, 2020

Choose a reason for hiding this comment

DiegoPino Nov 23, 2020

Choose a reason for hiding this comment

giancarlobi commented Nov 24, 2020

giancarlobi commented Dec 7, 2020

giancarlobi commented Dec 7, 2020

DiegoPino commented Dec 15, 2020