HOCR my old friend: enable full HOCR pipeline for IAbookreader #105

Open
DiegoPino opened this issue Nov 19, 2020 · 23 comments
Labels
enhancement New feature or request IIIF Specs/Manifests/Implementations Javascript Favourite language of a PHP developer

Comments

@DiegoPino
Member

DiegoPino commented Nov 19, 2020

See #11 related to this and all the work that Giancarlo has been doing in the last week

The plan

  1. Open an issue for all this ✔️
  2. Code an endpoint with configuration (because I need to know which search_api / Solr core I need to call)
  3. Have the endpoint return a fake response for now
  4. Add a config/override for the IAbookreader with the endpoint so it uses t
  5. Override the search callback and log its output.
    I may also want the list of all the binary commands we need to call to go from 1) PDF to 2) multiple MiniOCR
  6. Make the sbr processor for this (strawberry_runners)
  7. Deploy https://github.com/dbmdz/solr-ocrhighlighting with Giancarlo's schema, posted in "Loading OCR fragments from S3", dbmdz/solr-ocrhighlighting#49 (comment). I may want to ask what the best way is: get it from GitHub? We may also need documentation.
@DiegoPino DiegoPino added enhancement New feature or request IIIF Specs/Manifests/Implementations Javascript Favourite language of a PHP developer labels Nov 19, 2020
@giancarlobi
Collaborator

About 4: I added some more options here to make IAB use my endpoint:

                            maxWidth: 800,
                            imagesBaseURL: 'https://cdn.jsdelivr.net/gh/internetarchive/[email protected]/BookReader/images/',
+                            server: 'archipelago.byterfly.eu',
+                            bookId: 'TheBookID',
+                            searchInsideUrl: '/endpoint.php',

@giancarlobi
Collaborator

About 7: I prefer the field type "text_ocr_stored".
To install the plugin:

  1. download the latest jar from https://github.com/dbmdz/solr-ocrhighlighting/releases
  2. copy it to /opt/solr/contrib/ocrsearch/lib/
  3. add to solrconfig.xml:
<lib dir="${solr.install.dir:../../../..}/contrib/ocrsearch/lib" regex=".*\.jar" />

<searchComponent class="de.digitalcollections.solrocr.solr.OcrHighlightComponent" name="ocrHighlight" />
  4. edit solrconfig_extra.xml and set the right order of the highlighters in the /select handler:
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">lucene</str>
    <str name="df">id</str>
    <str name="echoParams">explicit</str>
    <str name="omitHeader">true</str>
    <str name="timeAllowed">${solr.selectSearchHandler.timeAllowed:-1}</str>
    <str name="spellcheck">false</str>
  </lst>
  <arr name="last-components">
    <str>ocrHighlight</str>
    <str>highlight</str>
    <str>spellcheck</str>
    <str>elevator</str>
  </arr>
</requestHandler>
  5. edit schema_extra_types.xml and add the new type (NB: this is for inline storage of hOCR/MiniOCR):
    <fieldtype name="text_ocr_stored" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
      <analyzer type="index">
        <charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory" />
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.StopFilterFactory"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldtype>
  6. edit schema_extra_fields.xml and add:
<field name="ocr_text_stored" type="text_ocr_stored" multiValued="false" indexed="true" stored="true" />

@giancarlobi
Collaborator

About 7: to check that the Solr plugin works, you can index a test document with the Solr post tool (/opt/solr/bin/post) and this JSON file:

{
    "id": "ocrdoc-1-stored",
    "ocr_text_stored": "<?xml version='1.0' encoding='UTF-8'?>
<ocr>
<p xml:id=\"0\" wh=\"1836 2596\">
<b>
<l><w x=\"385 631 566 666\">ISTITUTO<\/w> <w x=\"583 631 621 666\">DI<\/w> <w x=\"639 631 820 666\">RICERCA<\/w> <w x=\"837 631 972 666\">SULLA<\/w> <w x=\"989 631 1190 666\">CRESCITA<\/w> <w x=\"1205 631 1459 666\">ECONOMICA<\/w> <w x=\"1477 631 1738 666\">SOSTENIBILE<\/w> <\/l>
<l><w x=\"451 683 675 718\">RESEARCH<\/w> <w x=\"693 683 903 718\">INSTITUTE<\/w> <w x=\"922 683 980 718\">ON<\/w> <w x=\"999 683 1288 718\">SUSTAINABLE<\/w> <w x=\"1304 683 1528 718\">ECONOMIC<\/w> <w x=\"1546 683 1736 718\">GROWTH<\/w> <\/l>
<l><w x=\"633 1532 1000 1603\">Numero<\/w> <w x=\"1032 1531 1104 1618\">6,<\/w> <w x=\"1140 1528 1486 1622\">maggio<\/w> <w x=\"1515 1531 1740 1603\">2018<\/w> <\/l>
<l><w x=\"1371 1980 1482 2009\">Follow<\/w> <w x=\"1494 1980 1549 2009\">the<\/w> <w x=\"1565 1979 1697 2017\">Byterfly<\/w> <\/l>
<l><w x=\"1226 2041 1287 2070\">and<\/w> <w x=\"1302 2042 1396 2078\">enjoy<\/w> <w x=\"1408 2049 1493 2078\">open<\/w> <w x=\"1508 2041 1695 2078\">knowledge<\/w> <\/l>
<l><w x=\"1082 2155 1293 2183\">GIANCARLO<\/w> <w x=\"1304 2155 1457 2189\">BIRELLO,<\/w> <w x=\"1469 2156 1577 2183\">ANNA<\/w> <w x=\"1590 2156 1698 2183\">PERIN<\/w> <\/l>
<l><w x=\"1323 128 1402 156\">ISSN<\/w> <w x=\"1413 126 1536 164\">(print):<\/w> <w x=\"1546 128 1734 156\">2421-5783<\/w> <\/l>
<l><w x=\"1288 181 1368 209\">ISSN<\/w> <w x=\"1379 179 1434 216\">(on<\/w> <w x=\"1447 179 1536 216\">line):<\/w> <w x=\"1546 181 1734 209\">2421-5562<\/w> <\/l>
<l><w x=\"548 878 1734 1128\">Rapporto<\/w> <\/l>
<l><w x=\"805 1151 1734 1358\">Tecnico<\/w> <\/l>
<\/b>
<\/p>
<\/ocr>"
}

Then query with this select request:
..../select?hl.ocr.fl=ocr_text_stored&hl=true&q=ocr_text_stored%3Amaggio&hl.ocr.absoluteHighlights=on
You should see something like this as a result:

{
  "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
      {
        "id":"ocrdoc-1-stored",
        "ocr_text_stored":"<?xml version='1.0' encoding='UTF-8'?>\n<ocr>\n<p xml:id=\"0\" wh=\"1836 2596\">\n<b>\n<l><w x=\"385 631 566 666\">ISTITUTO</w> <w x=\"583 631 621 666\">DI</w> <w x=\"639 631 820 666\">RICERCA</w> <w x=\"837 631 972 666\">SULLA</w> <w x=\"989 631 1190 666\">CRESCITA</w> <w x=\"1205 631 1459 666\">ECONOMICA</w> <w x=\"1477 631 1738 666\">SOSTENIBILE</w> </l>\n<l><w x=\"451 683 675 718\">RESEARCH</w> <w x=\"693 683 903 718\">INSTITUTE</w> <w x=\"922 683 980 718\">ON</w> <w x=\"999 683 1288 718\">SUSTAINABLE</w> <w x=\"1304 683 1528 718\">ECONOMIC</w> <w x=\"1546 683 1736 718\">GROWTH</w> </l>\n<l><w x=\"633 1532 1000 1603\">Numero</w> <w x=\"1032 1531 1104 1618\">6,</w> <w x=\"1140 1528 1486 1622\">maggio</w> <w x=\"1515 1531 1740 1603\">2018</w> </l>\n<l><w x=\"1371 1980 1482 2009\">Follow</w> <w x=\"1494 1980 1549 2009\">the</w> <w x=\"1565 1979 1697 2017\">Byterfly</w> </l>\n<l><w x=\"1226 2041 1287 2070\">and</w> <w x=\"1302 2042 1396 2078\">enjoy</w> <w x=\"1408 2049 1493 2078\">open</w> <w x=\"1508 2041 1695 2078\">knowledge</w> </l>\n<l><w x=\"1082 2155 1293 2183\">GIANCARLO</w> <w x=\"1304 2155 1457 2189\">BIRELLO,</w> <w x=\"1469 2156 1577 2183\">ANNA</w> <w x=\"1590 2156 1698 2183\">PERIN</w> </l>\n<l><w x=\"1323 128 1402 156\">ISSN</w> <w x=\"1413 126 1536 164\">(print):</w> <w x=\"1546 128 1734 156\">2421-5783</w> </l>\n<l><w x=\"1288 181 1368 209\">ISSN</w> <w x=\"1379 179 1434 216\">(on</w> <w x=\"1447 179 1536 216\">line):</w> <w x=\"1546 181 1734 209\">2421-5562</w> </l>\n<l><w x=\"548 878 1734 1128\">Rapporto</w> </l>\n<l><w x=\"805 1151 1734 1358\">Tecnico</w> </l>\n</b>\n</p>\n</ocr>",
        "timestamp":"2020-11-17T17:05:21.248Z",
        "_version_":1683627936321634304}]
  },
  "highlighting":{
    "ocrdoc-1-stored":{
      "id":["<em>ocrdoc-1-stored</em>"]}},
  "ocrHighlighting":{
    "ocrdoc-1-stored":{
      "ocr_text_stored":{
        "snippets":[{
            "text":"ISTITUTO DI RICERCA SULLA CRESCITA ECONOMICA SOSTENIBILE RESEARCH INSTITUTE ON SUSTAINABLE ECONOMIC GROWTH Numero 6, <em>maggio</em> 2018 Follow the Byterfly and enjoy open knowledge",
            "score":42.31104,
            "pages":[{
                "id":"0",
                "width":1836,
                "height":2596}],
            "regions":[{
                "ulx":385,
                "uly":631,
                "lrx":3282,
                "lry":4127,
                "text":"ISTITUTO DI RICERCA SULLA CRESCITA ECONOMICA SOSTENIBILE RESEARCH INSTITUTE ON SUSTAINABLE ECONOMIC GROWTH Numero 6, <em>maggio</em> 2018 Follow the Byterfly and enjoy open knowledge",
                "pageIdx":0}],
            "highlights":[[{
                  "ulx":1140,
                  "uly":1528,
                  "lrx":2626,
                  "lry":3150,
                  "text":"maggio",
                  "parentRegionIdx":0}]]}],
        "numTotal":1}}},
  "highlighting":{}}

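For the search endpoint, the interesting part of the response above is the `ocrHighlighting` block. The following is a minimal Python sketch, not the plugin's own client code, of how the highlight boxes could be pulled out of such a response. The function name `extract_highlight_boxes` and the tuple layout it returns are assumptions made for illustration; the key names (`snippets`, `pages`, `regions`, `highlights`, `parentRegionIdx`, `pageIdx`) are taken from the response shown above.

```python
import json


def extract_highlight_boxes(solr_response, field="ocr_text_stored"):
    """Return a list of (doc_id, page_id, text, (ulx, uly, lrx, lry)) tuples.

    Hypothetical helper: only the JSON key names come from the response
    posted in this thread.
    """
    boxes = []
    for doc_id, fields in solr_response.get("ocrHighlighting", {}).items():
        for snippet in fields.get(field, {}).get("snippets", []):
            pages = snippet.get("pages", [])
            regions = snippet.get("regions", [])
            # "highlights" is a list of groups; each group is a list of boxes.
            for group in snippet.get("highlights", []):
                for hl in group:
                    region = regions[hl["parentRegionIdx"]]
                    page_id = pages[region["pageIdx"]]["id"]
                    box = (hl["ulx"], hl["uly"], hl["lrx"], hl["lry"])
                    boxes.append((doc_id, page_id, hl["text"], box))
    return boxes


# Trimmed copy of the response above, keeping only the keys the helper reads.
sample = json.loads("""
{"ocrHighlighting": {"ocrdoc-1-stored": {"ocr_text_stored": {
  "snippets": [{
    "pages": [{"id": "0", "width": 1836, "height": 2596}],
    "regions": [{"ulx": 385, "uly": 631, "lrx": 3282, "lry": 4127, "pageIdx": 0}],
    "highlights": [[{"ulx": 1140, "uly": 1528, "lrx": 2626, "lry": 3150,
                     "text": "maggio", "parentRegionIdx": 0}]]}],
  "numTotal": 1}}}}
""")

for row in extract_highlight_boxes(sample):
    print(row)
```

Each tuple carries the page id alongside the box, so a viewer-side callback can place the highlight on the right page.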
@giancarlobi
Collaborator

giancarlobi commented Nov 21, 2020

@DiegoPino some thoughts about PDF/IIIF/resolution. Starting from my conclusion: I think that by using identify/pdfinfo for the image WxH we are losing the resolution of the original image.
I checked this with 2 PDFs: one generated from DOCX (https://archipelago.byterfly.eu/node/29) and one generated by abbey from TIFF (300 dpi) (https://archipelago.byterfly.eu/do/750aeedb-9a86-4bdd-bf93-4a5377e149af).
For pdf_docx identify reports 595x842 pts (793x1123 px) and for pdf_tiff 325x491 pts (439x651 px).
If I query Cantaloupe for the first page at full resolution (/full/full/0/default.jpg?page=1) I get:
pdf_docx (1240x1753 px) and pdf_tiff (686x1016 px).
These are the same values I get from Cantaloupe's info.json.
So my conclusion is that we have to use the info.json width and height (the first ones returned in the main array) and not the ones returned by identify.
Just a note, to discuss what is better and simpler to manage.

@giancarlobi
Collaborator

And a related note: do we really need to store WxH for each page in the SBF JSON? What about a PDF with 1000 pages? I think we can save that space.

@DiegoPino
Member Author

DiegoPino commented Nov 21, 2020 via email

@giancarlobi
Collaborator

If we use the Cantaloupe info.json response, that also works with TIFF and not only with PDF, and we don't lose resolution.
If we define a default resolution we can lose the TIFF dpi, and we also have to manage portrait/landscape page orientation.
I haven't found any better tool to get PDF dimensions; those are not the pixel dimensions of the included image and have nothing to do with the Cantaloupe response.
Well, something to think about and discuss, friend. Have a nice day.

@giancarlobi
Collaborator

In addition, as you already noted, for a book made of TIFFs (e.g. https://archipelago.byterfly.eu/node/18) Cantaloupe's info.json returns 2481x3508 px, which are the original TIFF dimensions, so limiting to a fixed value (e.g. 1200 width) means losing resolution with respect to the stored TIFF.

@giancarlobi
Collaborator

giancarlobi commented Nov 21, 2020

And also this: the size of the image Cantaloupe rasterizes from a PDF depends on this configuration parameter
processor.dpi = 150
as documented here: https://cantaloupe-project.github.io/manual/4.1/processors.html#PdfBoxProcessor
So, for PDF, the right WxH that we have to use also depends on the Cantaloupe configuration.
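As a worked check of the numbers in this thread: PDF page sizes are expressed in points (1/72 inch), so the rasterized size should be roughly points × dpi / 72. The sketch below is only illustrative; the function name `points_to_pixels` is hypothetical, and rounding to the nearest integer is an assumption. Cantaloupe/PDFBox may round or truncate differently, which would be consistent with the one-pixel difference in height (1753 vs. 1754) reported in this thread.

```python
def points_to_pixels(width_pt, height_pt, dpi=150):
    """Convert a PDF page size in points (1/72 inch) to pixels at `dpi`.

    Rounding to the nearest integer is an assumption, not necessarily
    what Cantaloupe or PDFBox do internally.
    """
    scale = dpi / 72.0
    return round(width_pt * scale), round(height_pt * scale)


# pdf_docx from the comment above: identify reports 595x842 pts.
print(points_to_pixels(595, 842))  # (1240, 1754)
```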

@giancarlobi
Collaborator

@DiegoPino I was thinking more about how Archipelago should manage paged ADOs. Considering how viewers manage images (IAB first of all, but this also holds for Mirador), the importance of IIIF and the manifest, the performance of Solr indexing and querying, the availability of what seems to be a very good hOCR/MiniOCR plugin for Solr, and some personal feelings, I came up with this (new) idea for managing paged ADOs:

  • The Solr doc has to store the ADO reference + page reference + width and height (as returned by Cantaloupe's info.json) + MiniOCR

  • We don't have to store any of the above, or almost none of it, in the SBF-JSON (think of a book with a very large number of pages)

  • The IIIF manifest has to be "hardcoded", that is, produced by a service with very few settings: given an ADO reference, the service returns the manifest by querying Solr for each page's WxH. It could be a publicly available IIIF manifest endpoint

  • Solr doc updates have to be managed by a dedicated (at the beginning, not customizable) service/flavour executed after ADO creation. This can support a yes/no option for the user, or be something the user decides to execute later or right after ADO ingest

  • Managing hOCR as a zip is a good choice, but since we store everything in Solr docs, storing the zip may not really be needed; at most we can store a checksum in Solr to detect whether something changed

  • Regarding IAB, it uses the manifest WxH as default settings (the manifest returned by the service based on the Solr query), so on search the IAB search endpoint has to A) query Solr for the searched term, filtering by ADO reference, and B) transform the coordinates by multiplying the relative values by the returned width (or height).
    We can choose to store absolute coordinate values in Solr instead. This saves a calculation in the IAB search endpoint, but it is not clear to me whether this is also good for Mirador; to be checked
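Step B in the last point is a simple scaling. A minimal sketch, assuming the relative box is given as four fractions of the page size in the order (left, top, right, bottom); both that ordering and the function name `relative_to_absolute` are hypothetical and would need to be matched to what is actually stored in Solr:

```python
def relative_to_absolute(rel_box, page_width, page_height):
    """Scale a relative box (fractions in [0, 1]) to absolute pixels.

    rel_box is assumed to be (left, top, right, bottom); horizontal
    values scale with the page width, vertical ones with the height.
    """
    left, top, right, bottom = rel_box
    return (
        round(left * page_width),
        round(top * page_height),
        round(right * page_width),
        round(bottom * page_height),
    )


# Page size taken from the MiniOCR sample earlier in the thread (wh="1836 2596").
print(relative_to_absolute((0.5, 0.25, 0.75, 0.5), 1836, 2596))
```

Because the stored values stay relative, the same Solr document can serve any WxH the manifest reports, which is the consistency argument made in this thread.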

Well, more things to discuss ... have a nice Sunday, amigo

@DiegoPino
Member Author

DiegoPino commented Nov 22, 2020 via email

@giancarlobi
Collaborator

@DiegoPino A call tomorrow or Tuesday would be great; I can probably explain it better then. For example, I don't want something hardcoded and specific to one use case; I mean something that works with any kind of viewer.
Take care, amigo

@DiegoPino
Member Author

@giancarlobi tomorrow, Tuesday, 9:00 AM EST, 3:00 PM Milan, does that work? Thx!

@giancarlobi
Collaborator

> @giancarlobi tomorrow Tuesday, 9:AM EST, 3:00 PM Milan, does that work? Thx!

Perfect, amigo!

@giancarlobi
Collaborator

@DiegoPino I was thinking about IAB and the WxH it uses as reference, which is the same one we have to use to calculate the highlighting boxes.
You already updated the twig template for the manifest, which now returns the WxH included in the SBF-JSON flv:identify width and height. But the identify WxH doesn't correspond to the one returned by the IIIF info.json, so why not store the info.json WxH in the SBF-JSON instead of the one returned by identify? I think that all images (JPG, TIFF, PDF) are mainly managed by Cantaloupe, so that makes sense ... or not?
In addition, we can also put that WxH (info.json) into the MiniOCR instead of the one returned by tesseract; we can do that because the coordinate values are stored as relative values. This makes every WxH (info.json, SBF-JSON, MiniOCR, ...) consistent.
Just one more idea, friend.

@giancarlobi
Collaborator

@DiegoPino An addition here: I tried a useful tool, Apache PDFBox, the same one that Cantaloupe uses to convert PDF to JPG.
I was able to run it from the command line and, using the same dpi that Cantaloupe uses (see processor.dpi in the Cantaloupe configuration), identify gives me the SAME WxH that info.json returns, without querying Cantaloupe.
To test:

  • download the package: wget https://downloads.apache.org/pdfbox/2.0.21/pdfbox-app-2.0.21.jar (you need openjdk-8-jdk)
  • convert a PDF page to JPG with
    java -jar pdfbox-app-2.0.21.jar PDFToImage -imageType jpg -page 1 -dpi 150 application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf
  • identify the image
    identify application-test-139e32dc-4339-47db-ad95-a16112a7666d1.jpg
    application-test-139e32dc-4339-47db-ad95-a16112a7666d1.jpg JPEG 1240x1753 1240x1753+0+0 ...

These are the same dimensions (within one pixel in height: 1753 vs. 1754) as those returned by querying Cantaloupe for page 1 of the PDF:

https://archipelago.byterfly.eu/iiif-server/iiif/2/90a%2Fapplication-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf/info.json?page=1

Key | Value
-- | --
@context | "http://iiif.io/api/image/2/context.json"
@id | "https://archipelago.byte…db-ad95-a16112a7666d.pdf"
protocol | "http://iiif.io/api/image"
width | 1240
height | 1754

@DiegoPino
Member Author

@giancarlobi Thanks, give me a day or two to think about the consequences of this. There are a few use cases where this may not be true (e.g. a Cantaloupe where the max size is restricted, which one can do) but yes, in general this applies and you are right. But I would prefer to keep the decimal notation in the OCR for now, until we get at least one solution working completely; then we can refine it, make it better and test with new code. I totally understand what you say and I agree. I just feel I'm too tired right now (really!) to have a decent argument or apply changes until I at least have the search endpoints working correctly. Hope that makes sense. I will follow up once I have more code to share, but I won't forget this, no worries.

@giancarlobi
Collaborator

@DiegoPino No rush, I wrote it here because this is the right place for my (nightly) thoughts.
I don't want to change the MiniOCR notation from relative (decimal) to absolute, not at all; sorry if I explained it with the wrong words.
And get some rest, amigo, please.

@giancarlobi
Collaborator

@DiegoPino I discovered that we don't need Apache PDFBox. It is simpler, in line with the Archipelago philosophy: we only need to add one parameter to identify to get the same dimensions as Cantaloupe, -density NNNxNNN, where NNN is the value of processor.dpi in the Cantaloupe configuration. Obviously, this applies only to PDF files.
For example:
identify -density 150x150 application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf

application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[0] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.010u 0:00.009
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[1] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.010u 0:00.009
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[2] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.010u 0:00.009
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[3] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.010u 0:00.009
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[4] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.010u 0:00.009
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[5] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.010u 0:00.009
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[6] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.000u 0:00.009
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[7] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.000u 0:00.000
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[8] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.000u 0:00.000
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[9] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.000u 0:00.000
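To turn output like the above into per-page WxH values, a small parser is enough. This is a sketch only: the regular expression is written against the line shape shown in this comment (`name[page] FORMAT WxH ...`), and the helper name `parse_identify_output` is hypothetical. Other ImageMagick versions or formats may print lines differently.

```python
import re

# Matches "...[page] FORMAT WIDTHxHEIGHT " as printed in the output above.
IDENTIFY_LINE = re.compile(r"\[(?P<page>\d+)\]\s+\S+\s+(?P<w>\d+)x(?P<h>\d+)\s")


def parse_identify_output(text):
    """Return {page_index: (width, height)} parsed from identify output."""
    sizes = {}
    for line in text.splitlines():
        match = IDENTIFY_LINE.search(line)
        if match:
            page = int(match.group("page"))
            sizes[page] = (int(match.group("w")), int(match.group("h")))
    return sizes


# First two lines of the output posted above.
sample = """\
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[0] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.010u 0:00.009
application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf[1] PDF 1240x1754 1240x1754+0+0 16-bit sRGB 405KB 0.010u 0:00.009
"""

print(parse_identify_output(sample))
```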

@giancarlobi
Collaborator

Also, with a pipe to identify a single page:
qpdf --empty --pages application-test-139e32dc-4339-47db-ad95-a16112a7666d.pdf 1 -- - | identify -density 150x150 -

@giancarlobi
Collaborator

Also, since MiniOCR can optionally carry a wh attribute with the {width} {height} values for the page, it would be useful to put there the same WxH as in the SBF-JSON identify data (which is also the info.json one), so that calculating absolute bbox values for an IAB search result becomes simpler.

@DiegoPino
Member Author

DiegoPino commented Jan 29, 2021

@giancarlobi @pcambra this is almost done.

We still need 3 tasks to get the full pipeline:

  1. Modify https://github.com/esmero/archipelago-docker-images/blob/main/esmero-php-fpm/Dockerfile to include the missing tools @giancarlobi added for direct text extraction from PDFs instead of HOCRing them as images (the default when those tools are not around). These tools are pdf2djvu, djvudump and djvu2hocr. Some of these are Python tools and need to be compiled.
  2. Persist our temporary Key Values into a frictionless datapackage and attach it to the source NODE/ADO once all HOCR pages are processed. This may need to go into Strawberryfield as a generic Frictionless datapackage processor, with the ability to add and extract files. That module already has the required dependencies to deal with https://github.com/frictionlessdata/datapackage-php. Why generic? Because a WACZ file is also a datapackage, and for preservation needs we will want to put data that is heavy to process and rarely accessed inside a single file.
  3. Make sure Books made of single images can be processed, which also means changing our Pager Plugin in Strawberry_runners.

I'm mentioning you both because I may need help figuring out, testing and implementing some of these things.
Should I open individual issues and then make this a macro issue linked to those?

Asking for a friend

@giancarlobi
Collaborator

@DiegoPino I was offline for a couple of days to solve hardware issues and close some reports. I am starting to read everything you have done and will answer asap. Take care, friend
