-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add METS metadata parsing experiment (DOROP-20) #7
base: main
Are you sure you want to change the base?
Conversation
…emove get_asset_order
…n things are missing
class ElementAdapter(): | ||
|
||
@classmethod | ||
def from_string(cls, text: str, namespaces: dict[str, str]) -> "ElementAdapter": | ||
return cls(ET.fromstring(text=text), namespaces) | ||
|
||
def __init__(self, elem: Element, namespaces: dict[str, str]): | ||
self.elem = elem | ||
self.namespaces = namespaces | ||
|
||
def find(self, path: str) -> "ElementAdapter": | ||
result = self.elem.find(path, self.namespaces) | ||
if result is None: | ||
raise DataNotFoundError() | ||
return ElementAdapter(result, self.namespaces) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did end up implementing a thin wrapper around the standard library xml.etree.ElementTree.Element
so that I could raise errors when things were found to be missing. I could imagine this being modified slightly to accommodate None
as a return value in some cases (maybe by adding a required=True
parameter to find
, get
, and text
).
class CommonMetadataParser(): | ||
|
||
def __init__(self, metadata_path: Path): | ||
self.metadata_path: Path = metadata_path | ||
try: | ||
self.text = self.metadata_path.read_text() | ||
except FileNotFoundError as e: | ||
raise MetadataFileNotFoundError from e | ||
|
||
def get_metadata(self) -> CommonMetadata: | ||
try: | ||
return CommonMetadata.model_validate_json(self.text) | ||
except ValidationError as e: | ||
raise DataNotFoundError("Common metadata parsing failed") from e | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kinda a mini-experiment with Pydantic. Aside from my @schema
difficulties, it seems pretty neat-o.
…as PREMIS namespace
asset = MetsAssetParser(asset_file_path).get_asset() | ||
assets.append(asset) | ||
struct_map_items.append(StructMapItem( | ||
order=order_number, | ||
label=label, | ||
asset_id=asset.id | ||
)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I've opted not to think about yet is what happens if the asset files are not present (as might happen during an update). Ideally speaking we just allow missing documents, but right now I'm relying on them for getting the asset IDs for the struct map -- so we'd have to figure that out.
def find(self, path: str) -> "ElementAdapter": | ||
result = self.elem.find(path, self.namespaces) | ||
if result is None: | ||
raise DataNotFoundError(f"No element found for path {path}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ssciolla are these exceptions the reason for ElementAdapter
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty much. The adapter provides a couple other advantages (stores namespaces and shares them with children, reduces and specifies the API surface of the element-like object we deal with), but the primary one is that we can raise errors if we don't find what we expect. See #7 (comment)
There are probably other ways to solve this type safety problem, but this is the one I chose for now.
This PR aims to resolve DOROP-20.
It is still rough and could go in lots of directions, but the core parsing functionality is present I think. Happy to receive any feedback.
Just to be clear: I don't intend to merge this in its current form.