-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ISSUE-3: OCR specific Processor and new features/processing option #11
Conversation
- This adds the ability to kill processes that are taking more time //@todo each processor needs to also pass back a $io->output->garbage so we can clean up any left over files
because we need the node->id()... i always testing with existing ones. This now happens all when all is already saved. So we do not push data back into the node. We may want to have a PreSave one too to read flags like "force process" and then delete them before all gets saved.
…lex still) And enforces also time based kill -9
@giancarlobi you may need a drush cr before trying this out
Many conditionals and double false/sanity checks before even trying to run We pass more data, we get more data back we push more data This depends @giancarlobi on esmero/strawberryfield#113
miniOCR code i tried to adapt is not working yet. This needs anther processor in chain that passes (invokes this as child) once per page. The idea is that each processor is Atomic. So this one only deals with a single page. Too tired to keep coding. Let's see if i figure out what is missing from the XML transformation. Probably something stupid on my side. I'm also having now with this trouble reading the value back. Its a string, but PHP on decode things its an object.. gosh
$coos = explode(" ", substr($page['title'], 5)); | ||
if (count($coos)) { | ||
$w->startElement("p"); | ||
$w->writeAttribute("id", $pageid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DiegoPino If I remember well, here we need xml:id as state in a not documented issue in plugin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Is that a bug of the plugin? Still on the xmlwriter code I wrote I don't seem to be able to output the 'text' of the elements out. This whole part needs 1-2 hours more of code. Will see if I can add a namespace id and will let you know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DiegoPino If needed, here the script version to make output relative: <?php
$val = getopt("i:p:");
$xml = simplexml_load_file($val['i']);
echo "<?xml version='1.0' encoding='UTF-8'?>" . "\n";
echo '<ocr>' . "\n";
foreach ($xml->body->children() as $page){
$coos = explode(" ", substr($page['title'], 5));
$pwidth = (float) $coos[2];
$pheight = (float) $coos[3];
echo '<p xml:id="' . $val['p'] . '" wh="' . $pwidth . " " . $pheight . '">' . "\n";
echo '<b>' . "\n";
foreach ($page->children() as $line){
echo '<l>';
foreach ($line->children() as $word){
$wcoos = explode(" ", $word['title']);
$x0 = (float) $wcoos[1];
$y0 = (float) $wcoos[2];
$x1 = (float) $wcoos[3];
$y1 = (float) $wcoos[4];
$l = round(($x0 / $pwidth) ,3);
$t = round(($y0 / $pheight) ,3);
$w = round((($x1 - $x0) / $pwidth) ,3);
$h = round((($y1 - $y0) / $pheight) ,3);
echo '<w x="' . $l . ' ' . $t . ' ' . $w . ' ' . $h . '">' . $word . '</w> ';
}
echo '</l>' . "\n";
}
echo '</b>' . "\n";
echo '</p>' . "\n";
}
echo '</ocr>' . "\n";
?> |
@DiegoPino Also, probably you have already noted, hOCR bbox has x0,y0,x1,y1 while MiniOCR uses l (= x0), t (= y0), w (= x1-x0) and h (= y1-y0) |
$wcoos = explode(" ", $word['title']); | ||
if (count($wcoos)) { | ||
$w->startElement("w"); | ||
$w->writeAttribute("x", $wcoos[1] . ' ' . $wcoos[2] . ' ' . $wcoos[3] . ' ' . $wcoos[4]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of $wcoos[3] we have to put $wcoos[3]-$wcoos[1]
Instead of $wcoos[4] we have to put $wcoos[4]-$wcoos[2]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok will do that
It was a namespace (ns) @giancarlobi ready to test
Move from // To be used by miniOCR as id in the form of {nodeuuid}/canvas/{fileuuid}/p{pagenumber} to pagenumber
removed all data too
If so, skipps processing. @giancarlobi to test this you need to reindex and make sure you have a checksum field for your strawberryfield_flavor_datasource data source in Solr. Then run this. Should fill the first time If you save the node again, second time/ infinite you should get a message saying its already in the index, Now go to raw edit and change on letter in your checksum. Run again. Should process again //@todo clear old solr documents. Checksum is great. Also on reindex read from Node stored HOCR, etc. This is working well!
@DiegoPino An issue with MiniOCR, as reported here dbmdz/solr-ocrhighlighting#49 (comment) the decimal values have to start with . and not 0. due to plugin logic, if not Solr indexing doesn't work. |
@giancarlobi thanks! addressing the code review
@giancarlobi there is a lot of not needed code i can optimize here but it works. Will share my config!
@giancarlobi thanks for finding the BUG! Now its fixed. I update SBF ISSUE-112 too to match this. Tested quickly and all pages are different Running the same again now also works correctly, no reindex. Thanks!!
…want to... ... make this a service with a method i can invoke and share. Simpler.
Added file update capabitilies. Processor generates a File entity, the worker adds it to the Node. @giancarlobi need to document and explain how we avoid double processing here. We mass more data around, but it saves us from overprocessing so that is good?
Gets some constructor assignments so we can get the temp:// folder defined by the system
If output as a file is desired a new %output file can be given. A lot of string manipulation to get the file names right back and forth and we may want to clean/generalized and also make some long code into methods we can reuse. But it actually works! WARC into WACZ without any issues!
…ral DCS cleanup Still need to remove all the error_logs and make the whole reporting on issues/missing properties more consistent. Tested and this is working on searchapi indexing, File adding and processor chaining.
@DiegoPino I'd remember you to update strawberry_runners.info.yml to make it compliant with Drupal 9, I don't know if this PR is the right place, only a gently reminder. |
@giancarlobi as requested
Great, thanks for that @DiegoPino |
@giancarlobi for you friend
@giancarlobi will merge to make testing easier |
See #3 but really a lot more.
-OCR specific Processor. Its Atomic so does only one page per Queue entry. May be Connected to one that passes the pages (parent) and can spawn number_of_pages number of children (the ocrs are children). But for testing directly it will just OCR the first page
All this is WIP so we can test @giancarlobi and will clean up tomorrow. HOCR to miniOCR does not work for me. I tried to adapt your code but failed. I may have missed something since I went to a faster
xmlwriter
in memory method. Too tired. Will fix tomorrowDepends on esmero/strawberryfield#113