HOCR my old friend: enable full HOCR pipeline for IAbookreader #105
Comments
About 4. I added here
maxWidth: 800,
imagesBaseURL: 'https://cdn.jsdelivr.net/gh/internetarchive/[email protected]/BookReader/images/',
+ server: 'archipelago.byterfly.eu',
+ bookId: 'TheBookID',
+ searchInsideUrl: '/endpoint.php', |
About 7. I prefer the field type "text_ocr_stored"
<lib dir="${solr.install.dir:../../../..}/contrib/ocrsearch/lib" regex=".*\.jar" />
<searchComponent class="de.digitalcollections.solrocr.solr.OcrHighlightComponent" name="ocrHighlight" />
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">lucene</str>
<str name="df">id</str>
<str name="echoParams">explicit</str>
<str name="omitHeader">true</str>
<str name="timeAllowed">${solr.selectSearchHandler.timeAllowed:-1}</str>
<str name="spellcheck">false</str>
</lst>
<arr name="last-components">
<str>ocrHighlight</str>
<str>highlight</str>
<str>spellcheck</str>
<str>elevator</str>
</arr>
</requestHandler>
<fieldtype name="text_ocr_stored" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
<analyzer type="index">
<charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory" />
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldtype>
<field name="ocr_text_stored" type="text_ocr_stored" multiValued="false" indexed="true" stored="true" /> |
About 7. To check that the Solr plugin works, you can post a test doc with the Solr post tool:
{
"id": "ocrdoc-1-stored",
"ocr_text_stored": "<?xml version='1.0' encoding='UTF-8'?>
<ocr>
<p xml:id=\"0\" wh=\"1836 2596\">
<b>
<l><w x=\"385 631 566 666\">ISTITUTO<\/w> <w x=\"583 631 621 666\">DI<\/w> <w x=\"639 631 820 666\">RICERCA<\/w> <w x=\"837 631 972 666\">SULLA<\/w> <w x=\"989 631 1190 666\">CRESCITA<\/w> <w x=\"1205 631 1459 666\">ECONOMICA<\/w> <w x=\"1477 631 1738 666\">SOSTENIBILE<\/w> <\/l>
<l><w x=\"451 683 675 718\">RESEARCH<\/w> <w x=\"693 683 903 718\">INSTITUTE<\/w> <w x=\"922 683 980 718\">ON<\/w> <w x=\"999 683 1288 718\">SUSTAINABLE<\/w> <w x=\"1304 683 1528 718\">ECONOMIC<\/w> <w x=\"1546 683 1736 718\">GROWTH<\/w> <\/l>
<l><w x=\"633 1532 1000 1603\">Numero<\/w> <w x=\"1032 1531 1104 1618\">6,<\/w> <w x=\"1140 1528 1486 1622\">maggio<\/w> <w x=\"1515 1531 1740 1603\">2018<\/w> <\/l>
<l><w x=\"1371 1980 1482 2009\">Follow<\/w> <w x=\"1494 1980 1549 2009\">the<\/w> <w x=\"1565 1979 1697 2017\">Byterfly<\/w> <\/l>
<l><w x=\"1226 2041 1287 2070\">and<\/w> <w x=\"1302 2042 1396 2078\">enjoy<\/w> <w x=\"1408 2049 1493 2078\">open<\/w> <w x=\"1508 2041 1695 2078\">knowledge<\/w> <\/l>
<l><w x=\"1082 2155 1293 2183\">GIANCARLO<\/w> <w x=\"1304 2155 1457 2189\">BIRELLO,<\/w> <w x=\"1469 2156 1577 2183\">ANNA<\/w> <w x=\"1590 2156 1698 2183\">PERIN<\/w> <\/l>
<l><w x=\"1323 128 1402 156\">ISSN<\/w> <w x=\"1413 126 1536 164\">(print):<\/w> <w x=\"1546 128 1734 156\">2421-5783<\/w> <\/l>
<l><w x=\"1288 181 1368 209\">ISSN<\/w> <w x=\"1379 179 1434 216\">(on<\/w> <w x=\"1447 179 1536 216\">line):<\/w> <w x=\"1546 181 1734 209\">2421-5562<\/w> <\/l>
<l><w x=\"548 878 1734 1128\">Rapporto<\/w> <\/l>
<l><w x=\"805 1151 1734 1358\">Tecnico<\/w> <\/l>
<\/b>
<\/p>
<\/ocr>"
}
Then query it with this select:
{
"response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
{
"id":"ocrdoc-1-stored",
"ocr_text_stored":"<?xml version='1.0' encoding='UTF-8'?>\n<ocr>\n<p xml:id=\"0\" wh=\"1836 2596\">\n<b>\n<l><w x=\"385 631 566 666\">ISTITUTO</w> <w x=\"583 631 621 666\">DI</w> <w x=\"639 631 820 666\">RICERCA</w> <w x=\"837 631 972 666\">SULLA</w> <w x=\"989 631 1190 666\">CRESCITA</w> <w x=\"1205 631 1459 666\">ECONOMICA</w> <w x=\"1477 631 1738 666\">SOSTENIBILE</w> </l>\n<l><w x=\"451 683 675 718\">RESEARCH</w> <w x=\"693 683 903 718\">INSTITUTE</w> <w x=\"922 683 980 718\">ON</w> <w x=\"999 683 1288 718\">SUSTAINABLE</w> <w x=\"1304 683 1528 718\">ECONOMIC</w> <w x=\"1546 683 1736 718\">GROWTH</w> </l>\n<l><w x=\"633 1532 1000 1603\">Numero</w> <w x=\"1032 1531 1104 1618\">6,</w> <w x=\"1140 1528 1486 1622\">maggio</w> <w x=\"1515 1531 1740 1603\">2018</w> </l>\n<l><w x=\"1371 1980 1482 2009\">Follow</w> <w x=\"1494 1980 1549 2009\">the</w> <w x=\"1565 1979 1697 2017\">Byterfly</w> </l>\n<l><w x=\"1226 2041 1287 2070\">and</w> <w x=\"1302 2042 1396 2078\">enjoy</w> <w x=\"1408 2049 1493 2078\">open</w> <w x=\"1508 2041 1695 2078\">knowledge</w> </l>\n<l><w x=\"1082 2155 1293 2183\">GIANCARLO</w> <w x=\"1304 2155 1457 2189\">BIRELLO,</w> <w x=\"1469 2156 1577 2183\">ANNA</w> <w x=\"1590 2156 1698 2183\">PERIN</w> </l>\n<l><w x=\"1323 128 1402 156\">ISSN</w> <w x=\"1413 126 1536 164\">(print):</w> <w x=\"1546 128 1734 156\">2421-5783</w> </l>\n<l><w x=\"1288 181 1368 209\">ISSN</w> <w x=\"1379 179 1434 216\">(on</w> <w x=\"1447 179 1536 216\">line):</w> <w x=\"1546 181 1734 209\">2421-5562</w> </l>\n<l><w x=\"548 878 1734 1128\">Rapporto</w> </l>\n<l><w x=\"805 1151 1734 1358\">Tecnico</w> </l>\n</b>\n</p>\n</ocr>",
"timestamp":"2020-11-17T17:05:21.248Z",
"_version_":1683627936321634304}]
},
"highlighting":{
"ocrdoc-1-stored":{
"id":["<em>ocrdoc-1-stored</em>"]}},
"ocrHighlighting":{
"ocrdoc-1-stored":{
"ocr_text_stored":{
"snippets":[{
"text":"ISTITUTO DI RICERCA SULLA CRESCITA ECONOMICA SOSTENIBILE RESEARCH INSTITUTE ON SUSTAINABLE ECONOMIC GROWTH Numero 6, <em>maggio</em> 2018 Follow the Byterfly and enjoy open knowledge",
"score":42.31104,
"pages":[{
"id":"0",
"width":1836,
"height":2596}],
"regions":[{
"ulx":385,
"uly":631,
"lrx":3282,
"lry":4127,
"text":"ISTITUTO DI RICERCA SULLA CRESCITA ECONOMICA SOSTENIBILE RESEARCH INSTITUTE ON SUSTAINABLE ECONOMIC GROWTH Numero 6, <em>maggio</em> 2018 Follow the Byterfly and enjoy open knowledge",
"pageIdx":0}],
"highlights":[[{
"ulx":1140,
"uly":1528,
"lrx":2626,
"lry":3150,
"text":"maggio",
"parentRegionIdx":0}]]}],
"numTotal":1}}},
"highlighting":{}} |
@DiegoPino some thoughts about PDF/IIIF/resolution. I start from the final note: I think that by using identify/pdfinfo for the image WxH we are losing the resolution of the original image. |
And a related note: do we really need to store WxH for each page in the SBF JSON? What about a PDF with 1000 pages? I think we can save that space. |
Hi, yes. We can discuss this, and I'm fine with asking Cantaloupe if that works
for you. The Cantaloupe value really depends on the rasterizing resolution
set in the Cantaloupe properties file, so it is also variable. The best way may be to define
a common resolution value and apply it everywhere (so, a setting), and
we just multiply. Let's talk about this; it is not really different from exporting TIFFs
manually, except that we need to be consistent everywhere here, and
with TIFFs we can make a mistake (wrong dpi) and have to live with the
small TIFF forever.
Are there better id tools for PDF?
On Sat, Nov 21, 2020 at 09:05, Giancarlo <[email protected]> wrote:
@DiegoPino <https://github.com/DiegoPino> some thoughts about
PDF/IIIF/resolution. I start from the final note: I think that by using
identify/pdfinfo for the image WxH we are losing the resolution of the original image.
I checked this with 2 PDFs: one generated from DOCX (
https://archipelago.byterfly.eu/node/29) and one generated by abbey from
TIFF (300 dpi) (
https://archipelago.byterfly.eu/do/750aeedb-9a86-4bdd-bf93-4a5377e149af).
For pdf_docx, identify reports 595x842 pts (793x1123 px), and for pdf_tiff
325x491 pts (439x651 px).
If I query Cantaloupe for the first page at full resolution with
/full/full/0/default.jpg?page=1, I get:
pdf_docx (1240x1753 px) and pdf_tiff (686x1016 px).
These are the same values I get from Cantaloupe's info.json.
So my conclusion is that we have to use the info.json width and height (the
first ones returned in the main array) and not the ones returned by identify.
Just a note, to discuss what is better and simpler to manage.
--
Diego Pino Navarro
Digital Repositories Developer
Metropolitan New York Library Council (METRO)
|
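The point and pixel figures quoted above for pdf_docx can be related with a simple conversion, since a PDF point is 1/72 inch. The sketch below is only an illustration: the 96 dpi and 150 dpi values are assumptions inferred from the numbers in the thread, not settings confirmed by anyone here, and the function name is hypothetical.

```python
def points_to_pixels(points: float, dpi: float) -> int:
    """Convert a length in PDF points (1/72 inch) to pixels at the given DPI."""
    return round(points * dpi / 72)


# Page size reported by identify for pdf_docx, in points.
docx_page_pts = (595, 842)

# Assumed 96 dpi: matches the 793x1123 px reported next to the point size.
at_96 = tuple(points_to_pixels(p, 96) for p in docx_page_pts)

# Assumed 150 dpi: close to what Cantaloupe returned for the same page.
at_150 = tuple(points_to_pixels(p, 150) for p in docx_page_pts)

print(at_96)
print(at_150)
```

Depending on how each tool rounds and on the fractional page size, results can differ by a pixel, which may explain why both 1753 and 1754 appear in the thread for the same page height.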
If we use the Cantaloupe info.json response, that also works with TIFF and not only with PDF, and we don't lose resolution. |
In addition, as you already pointed out, for a book made from TIFFs (e.g. https://archipelago.byterfly.eu/node/18) Cantaloupe info.json returns 2481x3508 px, the dimensions of the original TIFF, so limiting to a fixed value (e.g. 1200 width) means losing resolution with respect to the stored TIFF. |
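Reading the full-size dimensions from an info.json body could look like the sketch below. It assumes an IIIF Image API 2 response that has already been fetched as a string; the `@id` URL and the `sizes` entry are made up for illustration, and only the 2481x3508 figures come from the thread. The point is to read the top-level `width` and `height`, not the entries of the `sizes` list.

```python
import json
from typing import Tuple


def full_size_from_info(info_json: str) -> Tuple[int, int]:
    """Return the top-level (width, height) of an IIIF info.json document."""
    info = json.loads(info_json)
    # The top-level values describe the full image; entries under
    # "sizes" are smaller derivatives and are ignored here.
    return int(info["width"]), int(info["height"])


# Hypothetical info.json body for one page of the TIFF book.
example = json.dumps({
    "@context": "http://iiif.io/api/image/2/context.json",
    "@id": "https://example.org/iiif/2/some-page-id",  # placeholder URL
    "protocol": "http://iiif.io/api/image",
    "width": 2481,
    "height": 3508,
    "sizes": [{"width": 1241, "height": 1754}],
})

size = full_size_from_info(example)
print(size)
```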
And also this: the size of the image that Cantaloupe rasterizes from a PDF depends on this conf param |
@DiegoPino I was thinking more about how Archipelago has to manage paged ADO objects. Evaluating how viewers (first of all IAB, but this is also valid for Mirador) manage images, the high importance of IIIF and the Manifest, the performance of Solr indexing/querying, the availability of a (so it seems) very good hOCR/MiniOCR plugin for Solr, and some personal feelings, I came up with this (new) idea for managing the paged ADO context:
Well, more things to discuss ... have a nice Sunday, amigo |
Hi, I will read this in detail and reply to each point tomorrow (I need
to test code to be sure what I say is correct), but even though I understand
your use case, I feel that having anything hardcoded is totally not the
Archipelago way. If hardcoded means you can make the exact manifest
that works for your use case and set up all the rest based on that one
(settings can even be saved automatically by parsing the manifest once
during setup), then great, but not in code. If viewers are able to adapt to
a manifest that is unknown and variable, why can't we too? If we go that way
we will totally deviate from what we are as a project just to serve a
single need. I agree about the hOCR sbr, no settings for that, too much
logic to make it configurable. As for what we store in the SBF, well, that is up
to each institution; we can probably make the postprocessor more configurable.
I have no personal issues yet with 1000+ pages, but we may have. We can also
store only a main width/height and then only the pages that deviate from it.
We may need to keep exploring what we need and where in the workflow the
lowest effort/complexity denominator is until we find the solution. Let's
have a call tomorrow or Tuesday and we will figure it out for sure.
Enjoy a peaceful Sunday!
On Sun, Nov 22, 2020 at 07:55, Giancarlo <[email protected]> wrote:
@DiegoPino <https://github.com/DiegoPino> I was thinking more about how
Archipelago has to manage paged ADO objects. Evaluating how viewers (first
of all IAB, but this is also valid for Mirador) manage images, the high importance
of IIIF and the Manifest, the performance of Solr indexing/querying, the
availability of a (so it seems) very good hOCR/MiniOCR plugin for Solr, and
some personal feelings, I came up with this (new) idea for managing the paged ADO context:
- The Solr doc has to store the ADO reference + page reference + width and
  height (as returned by Cantaloupe info.json) + MiniOCR.
- We don't have to store anything, or almost anything, of the above in
  SBF-JSON (e.g. think of a book with really many pages).
- The IIIF manifest has to be "hardcoded", that is, a service with really few
  settings: when passed an ADO ref, the service returns the manifest by
  querying Solr for the page WxH. It could be a publicly available IIIF
  manifest endpoint.
- The Solr doc update has to be managed by a dedicated (at the beginning,
  not customizable) service/flavour executed after ADO creation. This can
  support a yes/no option for the user, or be something the user decides to
  execute later or just after ADO ingest.
- Managing hOCR via zip is a good choice, but as we store everything in Solr
  docs, storing the zip may not really be needed; at most we can store a
  checksum in Solr to evaluate whether something has changed.
- Regarding IAB, it uses the manifest WxH as default settings (manifest
  returned by the service based on the Solr query), so when searching, the IAB search
  endpoint has to A) query Solr for the searched term, filtering by ADO ref, and B)
  transform coordinates by multiplying the relative values by the width (height)
  returned.
  We can choose to store absolute coordinate values in Solr; this saves
  a calculation in the IAB search endpoint, but it is not clear to me whether this is
  also good for Mirador, to be checked.
Well, more things to discuss ... have a nice Sunday, amigo
|
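The suggestion above to store only a main width/height plus the pages that deviate from it could be sketched as follows. This is just one possible shape under stated assumptions: the key names `default` and `overrides`, the 1-based page keys, and both function names are hypothetical and are not part of the SBF JSON as it exists.

```python
from collections import Counter
from typing import Dict, List, Tuple

Size = Tuple[int, int]


def compact_page_sizes(sizes: List[Size]) -> Dict:
    """Store the most common page size once, plus only the pages that differ."""
    if not sizes:
        raise ValueError("at least one page size is required")
    default = Counter(sizes).most_common(1)[0][0]
    # Pages are keyed by 1-based page number, as strings so the
    # structure survives a JSON round trip unchanged.
    overrides = {str(i + 1): list(s) for i, s in enumerate(sizes) if s != default}
    return {"default": list(default), "overrides": overrides}


def page_size(compact: Dict, page: int) -> Size:
    """Look up the size of a 1-based page, falling back to the default."""
    w, h = compact["overrides"].get(str(page), compact["default"])
    return (w, h)


# Three portrait pages and one landscape page.
pages = [(2481, 3508), (2481, 3508), (2481, 3508), (3508, 2481)]
compact = compact_page_sizes(pages)
print(compact)
```

For a 1000-page book with uniform pages this stores a single pair instead of 1000, at the cost of one extra lookup step when a page size is needed.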
@DiegoPino Great, a call tomorrow or Tuesday works; I can probably explain it better then, i.e. I don't want something hardcoded and specific to one use case; I mean something that works with any kind of viewer. |
@giancarlobi tomorrow, Tuesday, 9:00 AM EST, 3:00 PM Milan: does that work? Thx! |
Perfect, amigo! |
@DiegoPino I was thinking about IAB and the WxH that it uses as reference, the same one we have to use to calculate the highlighting boxes. |
@DiegoPino An addition here: I tried a useful tool, Apache PDFBox, the same one Cantaloupe uses to convert PDF to JPG.
The dimensions are the same as those returned by querying Cantaloupe for page 1 of the PDF:
@context | "http://iiif.io/api/image/2/context.json"
-- | --
@id | "https://archipelago.byte…db-ad95-a16112a7666d.pdf"
protocol | "http://iiif.io/api/image"
width | 1240
height | 1754 |
@giancarlobi Thanks, give me a day or two to think about the consequences of this. There are a few use cases where this may not be true (e.g. a Cantaloupe where the max size is restricted, which one can do), but yes, in general this applies and you are right. But I would prefer to keep the decimal notation in the OCR for now, until we get at least one solution working completely; then we can refine it, make it better, and test with new code. I understand totally what you say and I agree. I just feel I'm right now too tired (really!) to have a decent argument or apply changes until I at least have the search endpoints working correctly. Hope that makes sense. I will follow up once I have more code to share, but I won't forget this, no worries. |
@DiegoPino No rush, I wrote here as the right place for my (nightly) thoughts. |
@DiegoPino I discovered that we don't need Apache PDFBox. It is simpler, in line with the Archipelago philosophy: we only need to add a parameter to identify to get the same dimensions as Cantaloupe:
|
Also, with a pipe to identify a single page: |
Also, as MiniOCR can optionally have a wh attribute with the {width} {height} values for the page, it would be useful to include the same WxH as in the SBF-JSON identify output (the same as in info.json), so that calculating absolute bbox values is simpler in an IAB search result. |
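The calculation of absolute bbox values mentioned here is a multiplication by the page size. The sketch below assumes the OCR stores relative values as fractions between 0 and 1 in `x y w h` order (the "decimal notation" referred to earlier in the thread); that ordering and the function name are assumptions for illustration. The page size 1836x2596 is the `wh` value of the test document posted earlier.

```python
from typing import Tuple

RelBox = Tuple[float, float, float, float]  # x, y, w, h as fractions of the page
AbsBox = Tuple[int, int, int, int]          # x, y, w, h in pixels


def to_absolute(rel_box: RelBox, page_wh: Tuple[int, int]) -> AbsBox:
    """Scale a relative box to pixel values for a page of the given size."""
    x, y, w, h = rel_box
    page_w, page_h = page_wh
    # Horizontal values scale with the page width, vertical ones with the height.
    return (round(x * page_w), round(y * page_h),
            round(w * page_w), round(h * page_h))


page_wh = (1836, 2596)            # same WxH as the MiniOCR wh attribute
rel = (0.5, 0.25, 0.1, 0.05)      # made-up relative box
abs_box = to_absolute(rel, page_wh)
print(abs_box)
```

If the same WxH is used in the manifest, in the SBF JSON, and in the `wh` attribute, the boxes computed this way line up with the image the viewer displays.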
@giancarlobi @pcambra this is almost done: we still need 3 tasks to get the full pipeline
I'm mentioning you both because I may need help figuring out/testing and implementing some of these things. Asking for a friend |
@DiegoPino I was offline for a couple of days to solve hardware issues and close some reports. I'll start reading everything you have done and answer asap. Take care, friend |
See #11 related to this and all the work that Giancarlo has been doing in the last week
The plan
and I may want all the binary commands we need to call to go from 1) PDF to 2) multiple miniOCR
strawberry_runners