Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add METS metadata parsing experiment (DOROP-20) #7

Draft
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

ssciolla
Copy link
Contributor

@ssciolla ssciolla commented Nov 14, 2024

This PR aims to resolve DOROP-20.

It is still rough and could go in lots of directions, but the core parsing functionality is present I think. Happy to receive any feedback.

Just to be clear: I don't intend to merge this in its current form.

@ssciolla ssciolla added the enhancement New feature or request label Nov 18, 2024
Comment on lines 6 to 20
class ElementAdapter():

@classmethod
def from_string(cls, text: str, namespaces: dict[str, str]) -> "ElementAdapter":
return cls(ET.fromstring(text=text), namespaces)

def __init__(self, elem: Element, namespaces: dict[str, str]):
self.elem = elem
self.namespaces = namespaces

def find(self, path: str) -> "ElementAdapter":
result = self.elem.find(path, self.namespaces)
if result is None:
raise DataNotFoundError()
return ElementAdapter(result, self.namespaces)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did end up implementing a thin wrapper around the standard library xml.etree.ElementTree.Element so that I could raise errors when things were found to be missing. I could imagine this being modified slightly to accommodate None as a return value in some cases (maybe by adding a required=True parameter to find, get, and text).

Comment on lines +36 to +50
class CommonMetadataParser():

def __init__(self, metadata_path: Path):
self.metadata_path: Path = metadata_path
try:
self.text = self.metadata_path.read_text()
except FileNotFoundError as e:
raise MetadataFileNotFoundError from e

def get_metadata(self) -> CommonMetadata:
try:
return CommonMetadata.model_validate_json(self.text)
except ValidationError as e:
raise DataNotFoundError("Common metadata parsing failed") from e

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kinda a mini-experiment with Pydantic. Aside from my @schema difficulties, it seems pretty neat-o.

Comment on lines +181 to +187
asset = MetsAssetParser(asset_file_path).get_asset()
assets.append(asset)
struct_map_items.append(StructMapItem(
order=order_number,
label=label,
asset_id=asset.id
))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I've opted not to think about yet is what happens if the asset files are not present (as might happen during an update). Ideally speaking we just allow missing documents, but right now I'm relying on them for getting the asset IDs for the struct map -- so we'd have to figure that out.

def find(self, path: str) -> "ElementAdapter":
result = self.elem.find(path, self.namespaces)
if result is None:
raise DataNotFoundError(f"No element found for path {path}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ssciolla are these exceptions the reason for ElementAdapter?

Copy link
Contributor Author

@ssciolla ssciolla Nov 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much. The adapter provides a couple other advantages (stores namespaces and shares them with children, reduces and specifies the API surface of the element-like object we deal with), but the primary one is that we can raise errors if we don't find what we expect. See #7 (comment)

There are probably other ways to solve this type safety problem, but this is the one I chose for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants