WorkspaceBagger: Use, in order of preference, f.basename, f.contentids and f.ID for filenames #1157

kba · 2023-12-14T19:06:16Z

As requested in #1154, this PR introduces a contentids attribute for OcrdFile, which delegates to OcrdMets.get_contentids_for_file, which looks up the CONTENTIDS attribute of the mets:div[@TYPE="page"] that a file belongs to.

The bagger uses this information to set the filenames of the bagged files.

E.g. for this mets:file

<mets:file ID="FILE_0009_DEFAULT" MIMETYPE="image/tiff">                                                                                                             
  <mets:FLocat xmlns:xlink="http://www.w3.org/1999/xlink" LOCTYPE="URL" xlink:href="http://content.staatsbibliothek-berlin.de/dms/PPN85249078X/800/0/00000010.tif"/> 
</mets:file>                                                                                                                                                         
...
<mets:div CONTENTIDS="http://resolver.staatsbibliothek-berlin.de/SBB0001CA7900000010" ID="PHYS_0010" ORDER="10" ORDERLABEL="2" TYPE="page">
  <mets:fptr FILEID="FILE_0009_DEFAULT"/>                                                                                                  
</mets:div>

This file will be bagged as DEFAULT/http_resolver_staatsbibliothek_berlin_de_SBB0001CA7900000010.tif

If there was no @CONTENTIDS for the corresponding mets:div[@TYPE="PAGE"], then the filename would be DEFAULT/FILE_0009_DEFAULT.tif.

A quick proof-of-concept to make sure this is the desired behavior, to be polished (e.g. adding setters for contentids and potentially also for @ORDER, @ORDERLABEL and make sure that we're consistent in all places where files are written out.

@M3ssman cannot use the "request review" feature because you're not in the OCR-D organization but would appreciate you providing one very much, thanks!

…ets:div

…s and f.ID for filenames, fix #1154

bertsky

A quick proof-of-concept to make sure this is the desired behavior, to be polished (e.g. adding setters for contentids and potentially also for @order, @ORDERLABEL and make sure that we're consistent in all places where files are written out.

Again, see #1063, and let's not forget about CLI (getters list-page and find, setters would be add-file, bulk-add and update-page).

bertsky · 2023-12-14T21:35:05Z

ocrd_models/ocrd_models/ocrd_mets.py

@@ -722,6 +722,25 @@ def get_physical_page_for_file(self, ocrd_file):
        if len(ret):
            return ret[0]

+    def get_contentids_for_file(self, ocrd_file):


In #1063, I added a more general solution (sans @CONTENTIDS but plus @ORDER, @ORDERLABEL and @LABEL): add another kwarg return_divs=True to get_physical_pages to get the full divs (which can further be queried). There's also an extra physical_pages_labels (I don't remember why this is independent though). Both extensions should be thrown together IMO.

That is a much better solution, I agree. As I said, this is just a proof-of-concept so we have a discussion basis for the naming. Implementation can be thrown away and replaced with a fine-tuned #1063.

It might be sensible to have an OcrdMetsPage(s) class similar to OcrdFile to provide a uniform interface. But that's best discussed in #1063.

bertsky · 2023-12-14T21:37:01Z

ocrd/ocrd/workspace_bagger.py

+                if f.local_filename and f.basename:
+                    basename = f.basename
+                else:
+                    basename = safe_filename(f.contentids if f.contentids else f.ID) + MIME_TO_EXT.get(f.mimetype, '.xml')


I would advise against direct use of @CONTENTIDS as file name. The URL prefix almost always is not what you want. How about stripping the host-name part (if in fact it is a URL), and then using makedirs for all remaining path prefixes?

Fair points. Agree that representing the URL path as directory is a neat way to do it, though it deviates from our general flat directory structure below the fileGrp dirs. Removing the host is also prettier.

But I'm wondering how that would work for @M3ssman's use case - how much info do you need to still be able to debug your workflows?

It is not intended to be used as-it-is, simply because an URN/URI contains chars that are not valid as part in local filenames. Therefore - at least in ULB and semantics workflows - colon is exchanged with a plus sign.

Consider a page container like this

<mets:div ID="phys1278993" TYPE="page" CONTENTIDS="urn:nbn:de:gbv:3:1-113129-p0007-8" ORDER="1"> <mets:fptr FILEID="IMG_MAX_1278993"/> <mets:fptr FILEID="OCR-D-GT-FULLTEXT-1"/> </mets:div>

with GT linked

<mets:file ID="OCR-D-GT-FULLTEXT-1" MIMETYPE="application/vnd.prima.page+xml"> <mets:FLocat xlink:href="GT-PAGE/urn+nbn+de+gbv+3+1-113129-p0007-8_ger.gt.xml" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/> </mets:file>

but the actual imsge via URL like this

<mets:file MIMETYPE="image/jpeg" ID="IMG_MAX_1278993"> <mets:FLocat xlink:href="https://opendata.uni-halle.de/retrieve/eeeee05d-c7cd-4e89-9607-5d1ac175afa1/00000007.jpg" LOCTYPE="URL"/> </mets:file>

The goal is to match both files, GT and Image, by their names alone without any extensions.
If this can be achieved, both can be used out-of-the-box for further GT-works in Tools like Transkribus or Larex.
(I have to handle 1.600 GT-ODEM-files, nearly 100 newspaper-GT-files, 101 GT arabic and about 400 pages GT persian. And if our next digi-project will be granted, the GT will increase further.)

Proposal: Instead of using the CONTENTIDS attribute it'd be sufficient to rename the images locally like the corresponding GT-file, whatever it's name was. My main concern is therefore to have equal names, not to name something like some attribute.

Maybe this way one hopefully avoids additional processing?
This would also avoid additional problems which may occur since even for 2 units (SBB, ULB) there are yet 2 different interpretations for this attribute, and who knows what else lurks out there.

(c.f. https://github.com/M3ssman/gt-test/blob/a370f3a691506f4eab1b91226a76a7c1f461ba10/data/corpus-odem-ger-256/mets.xml#L8905)

It is not intended to be used as-it-is, simply because an URN/URI contains chars that are not valid as part in local filenames. Therefore - at least in ULB and semantics workflows - colon is exchanged with a plus sign.

It is not used as-is but passed through safe_filename which does

def safe_filename(url): """ Sanitize input to be safely used as the basename of a local file. """ ret = re.sub(r'[^\w]+', '_', url) ret = re.sub(r'^\.*', '', ret) ret = re.sub(r'\.\.*', '.', ret) # print('safe filename: %s -> %s' % (url, ret)) return ret

Adapting this to make the replacement (_) configurable as + is not an issue. However

Proposal: Instead of using the CONTENTIDS attribute it'd be sufficient to rename the images locally like the corresponding GT-file, whatever it's name was. My main concern is therefore to have equal names, not to name something like some attribute.

I still don't understand how to achieve that. For your example

<mets:file ID="OCR-D-GT-FULLTEXT-1" MIMETYPE="application/vnd.prima.page+xml"> <mets:FLocat xlink:href="GT-PAGE/urn+nbn+de+gbv+3+1-113129-p0007-8_ger.gt.xml" LOCTYPE="OTHER" OTHERLOCTYPE="FILE"/> </mets:file> ... <mets:file MIMETYPE="image/jpeg" ID="IMG_MAX_1278993"> <mets:FLocat xlink:href="https://opendata.uni-halle.de/retrieve/eeeee05d-c7cd-4e89-9607-5d1ac175afa1/00000007.jpg" LOCTYPE="URL"/> </mets:file> ... <mets:div ID="phys1278993" TYPE="page" CONTENTIDS="urn:nbn:de:gbv:3:1-113129-p0007-8" ORDER="1"> <mets:fptr FILEID="IMG_MAX_1278993"/> <mets:fptr FILEID="OCR-D-GT-FULLTEXT-1"/> </mets:div>

What should the bagger write as the filenames of IMG_MAX_1278993 and OCR-D-GT-FULLTEXT-1?

I'd wish to save urn+nbn+de+gbv+3+1-113129-p0007-8_ger.gt.xml as filename for the GT, since this is the file which OCR-D-GT-FULLTEXT-1 points to, and for the image where container IMG_MAX_1278993 refers to, a corresponding name like urn+nbn+de+gbv+3+1-113129-p0007-8_ger.jpg , if it's possible.

The point at least in my understanding - ALIMU- concerning ULB is, that the GT-files shall be published (and probably edited further afterwards) but will be tied to the GT-repository. There is nothing about to change with the images, they only need to be referenced and resolvable, for example, for later generation of training data using the GT-files.

M3ssman · 2024-01-10T15:01:16Z

I'd like to provide additional testing (and test data as well) - is this possible in terms of this PR?

(And I'm uncertain where to add tests. There's actually one in tests/model/test_ocrd_mets.py but I thought this was about the bagger, which has test in tests/validator/test_workspace_bagger.py?)

kba added 2 commits December 14, 2023 19:43

OcrdMets/OcrdFile: access to @CONTENTIDS attribute of physical page m…

ae8c44b

…ets:div

WorkspaceBagger: Use, in order of preference, f.basename, f.contentid…

bd2d48f

…s and f.ID for filenames, fix #1154

kba requested review from MehmedGIT and bertsky December 14, 2023 19:06

bertsky reviewed Dec 14, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WorkspaceBagger: Use, in order of preference, f.basename, f.contentids and f.ID for filenames #1157

WorkspaceBagger: Use, in order of preference, f.basename, f.contentids and f.ID for filenames #1157

kba commented Dec 14, 2023

bertsky left a comment

bertsky Dec 14, 2023

kba Dec 15, 2023

bertsky Dec 14, 2023

kba Dec 15, 2023

M3ssman Dec 15, 2023 •

edited

Loading

kba Jan 8, 2024

M3ssman Jan 9, 2024

M3ssman Jan 9, 2024

M3ssman commented Jan 10, 2024

WorkspaceBagger: Use, in order of preference, f.basename, f.contentids and f.ID for filenames #1157

Are you sure you want to change the base?

WorkspaceBagger: Use, in order of preference, f.basename, f.contentids and f.ID for filenames #1157

Conversation

kba commented Dec 14, 2023

bertsky left a comment

Choose a reason for hiding this comment

bertsky Dec 14, 2023

Choose a reason for hiding this comment

kba Dec 15, 2023

Choose a reason for hiding this comment

bertsky Dec 14, 2023

Choose a reason for hiding this comment

kba Dec 15, 2023

Choose a reason for hiding this comment

M3ssman Dec 15, 2023 • edited Loading

Choose a reason for hiding this comment

kba Jan 8, 2024

Choose a reason for hiding this comment

M3ssman Jan 9, 2024

Choose a reason for hiding this comment

M3ssman Jan 9, 2024

Choose a reason for hiding this comment

M3ssman commented Jan 10, 2024

M3ssman Dec 15, 2023 •

edited

Loading