generated from mlibrary/python-starter
Add METS metadata parsing experiment (DOROP-20) #7
Draft
ssciolla wants to merge 17 commits into main from dorop-20-mets-metadata-2
Commits
89fa9d9 Add experiment directory setup; add first draft of MetsMetadataExtrac… (ssciolla)
f61fdec Extract MetsAssetExtractor as separate class (ssciolla)
3be1bc2 Extractor -> Parser (ssciolla)
0f97085 Introduce StructMap/StructMapItem; refactor get_repository_item and r… (ssciolla)
20f29e6 Introduce PreservationEvent; parse PREMIS events in item and asset me… (ssciolla)
5986e4a Add AssetFileUse (ssciolla)
bf72999 Fixes for mypy (ssciolla)
b6fa306 Add basic rights attribute; tweak get_record_status to return enum me… (ssciolla)
fcb63bf Fix(?) relative path handling (ssciolla)
00ef806 Add rough implementation of ElementAdapter that raises exceptions whe… (ssciolla)
3bc6768 Move ElementAdapter to its own file; add from_string class method; ad… (ssciolla)
9c63da0 Add common metadata parsing using pydantic, including test (ssciolla)
05846ce Remove lxml (ssciolla)
a398c64 MetsMetadataParser -> MetsItemParser; mets_metadata_parser -> parsers (ssciolla)
ca9dd49 Remove redundant tests (ssciolla)
f96b24c Make some minor spacing changes; add error messages to ElementAdapter (ssciolla)
6cb5788 Remove unused get_asset_file_paths; ensure PremisEventParser always h… (ssciolla)
@@ -0,0 +1,58 @@
FROM python:3.12-slim-bookworm AS base

ARG POETRY_VERSION=1.8.4
ARG UID=1000
ARG GID=1000

ENV PYTHONPATH="/app"

RUN groupadd -g ${GID} -o app
RUN useradd -m -d /app -u ${UID} -g ${GID} -o -s /bin/bash app

RUN apt-get update -yqq && apt-get install -yqq --no-install-recommends \
    python3-dev \
    build-essential \
    pkg-config \
    vim-tiny \
    curl \
    unzip

WORKDIR /app

ENV PYTHONPATH="/app"

CMD ["tail", "-f", "/dev/null"]

FROM base AS poetry

RUN pip install poetry==${POETRY_VERSION}

ENV PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=off \
    PIP_DISABLE_PIP_VERSION_CHECK=on \
    PIP_DEFAULT_TIMEOUT=100 \
    POETRY_NO_INTERACTION=1 \
    POETRY_VIRTUALENVS_CREATE=1 \
    POETRY_VIRTUALENVS_IN_PROJECT=1 \
    POETRY_CACHE_DIR=/tmp/poetry_cache

FROM poetry AS build

COPY pyproject.toml poetry.lock README.md ./

RUN poetry export --without dev -f requirements.txt --output requirements.txt

FROM poetry AS development
RUN apt-get update -yqq && apt-get install -yqq --no-install-recommends \
    git

USER app

FROM base AS production

COPY --chown=${UID}:${GID} . /app
COPY --chown=${UID}:${GID} --from=build "/app/requirements.txt" /app/requirements.txt

RUN pip install -r /app/requirements.txt

USER app
@@ -0,0 +1,7 @@
# Metadata extraction experiment

## Tests

```sh
docker compose run app poetry run pytest
```
Empty file.
@@ -0,0 +1,16 @@
services:
  app:
    build:
      context: .
      target: development
      dockerfile: Dockerfile
      args:
        UID: ${UID:-1000}
        GID: ${GID:-1000}
        DEV: ${DEV:-false}
        POETRY_VERSION: ${POETRY_VERSION:-1.8.4}
    volumes:
      - .:/app
      - ../../tests/fixtures/test_submission_package:/app/tests/fixtures/test_submission_package
    tty: true
    stdin_open: true
Empty file.
@@ -0,0 +1,46 @@
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element

from metadata.exceptions import DataNotFoundError

class ElementAdapter():

    @classmethod
    def from_string(cls, text: str, namespaces: dict[str, str]) -> "ElementAdapter":
        return cls(ET.fromstring(text=text), namespaces)

    def __init__(self, elem: Element, namespaces: dict[str, str]):
        self.elem = elem
        self.namespaces = namespaces

    def find(self, path: str) -> "ElementAdapter":
        result = self.elem.find(path, self.namespaces)
        if result is None:
            raise DataNotFoundError(f"No element found for path {path}")
        return ElementAdapter(result, self.namespaces)

    @property
    def text(self) -> str:
        result = self.elem.text
        if result is None:
            raise DataNotFoundError(f"No text found for {self.elem.tag}")
        return result

    def get(self, key: str) -> str:
        result = self.elem.get(key)
        if result is None:
            raise DataNotFoundError(f"No value for attribute {key} found for {self.elem.tag}")
        return result

    def findall(self, path: str) -> "list[ElementAdapter]":
        return [
            ElementAdapter(elem, self.namespaces)
            for elem in self.elem.findall(path, self.namespaces)
        ]

    @property
    def tag(self) -> str:
        return self.elem.tag

    def get_children(self) -> "list[ElementAdapter]":
        return [ElementAdapter(elem, self.namespaces) for elem in self.elem[:]]
@@ -0,0 +1,5 @@
class MetadataFileNotFoundError(Exception):
    pass

class DataNotFoundError(Exception):
    pass
@@ -0,0 +1,81 @@
from pathlib import Path
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

from pydantic import BaseModel

class RecordStatus(Enum):
    STAGE = "stage"
    STORE = "store"

@dataclass
class Actor():
    address: str
    role: str

@dataclass
class PreservationEvent():
    identifier: str
    type: str
    datetime: datetime
    detail: str
    actor: Actor

class CommonMetadata(BaseModel):
    title: str
    author: str
    publication_date: datetime
    subjects: list[str]

class AssetFileUse(Enum):
    ACCESS = "ACCESS"
    SOURCE = "SOURCE"

class FileMetadataFileType(Enum):
    TECHNICAL = "TECHNICAL"

@dataclass
class FileMetadataFile:
    id: str
    type: FileMetadataFileType
    path: Path

@dataclass
class AssetFile:
    id: str
    use: AssetFileUse
    path: Path
    metadata_file: FileMetadataFile

@dataclass
class Asset:
    id: str
    events: list[PreservationEvent]
    files: list[AssetFile]

class StructMapType(Enum):
    PHYSICAL = "PHYSICAL"
    LOGICAL = "LOGICAL"

@dataclass
class StructMapItem():
    order: int
    label: str
    asset_id: str

@dataclass
class StructMap():
    id: str
    type: StructMapType
    items: list[StructMapItem]

@dataclass
class RepositoryItem():
    id: str
    record_status: RecordStatus
    rights: str | None
    events: list[PreservationEvent]
    common_metadata: CommonMetadata
    struct_map: StructMap
    assets: list[Asset]
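To make the shape of these models more concrete, here is a rough sketch of how a `StructMap` might be filled from a METS `structMap` element. It is not the parser from this PR: `parse_struct_map`, the sample XML, and the choice of `ORDER`, `LABEL`, and `fptr/@FILEID` as the sources for `order`, `label`, and `asset_id` are all assumptions made for illustration. The three model classes are copied from the diff above so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from enum import Enum

# Standard METS namespace URI; the "METS" prefix is an arbitrary choice here.
NAMESPACES = {"METS": "http://www.loc.gov/METS/"}


class StructMapType(Enum):
    PHYSICAL = "PHYSICAL"
    LOGICAL = "LOGICAL"


@dataclass
class StructMapItem:
    order: int
    label: str
    asset_id: str


@dataclass
class StructMap:
    id: str
    type: StructMapType
    items: list[StructMapItem]


# Made-up sample document, not one of the PR's fixtures.
SAMPLE = """<METS:structMap xmlns:METS="http://www.loc.gov/METS/" ID="SM1" TYPE="PHYSICAL">
  <METS:div>
    <METS:div ORDER="2" LABEL="Page 2"><METS:fptr FILEID="asset-b"/></METS:div>
    <METS:div ORDER="1" LABEL="Page 1"><METS:fptr FILEID="asset-a"/></METS:div>
  </METS:div>
</METS:structMap>"""


def parse_struct_map(xml_text: str) -> StructMap:
    """Hypothetical helper: build a StructMap from one structMap element."""
    root = ET.fromstring(xml_text)
    items = []
    for div in root.findall("METS:div/METS:div", NAMESPACES):
        fptr = div.find("METS:fptr", NAMESPACES)
        if fptr is None:
            raise ValueError("div has no fptr child")
        items.append(
            StructMapItem(
                order=int(div.attrib["ORDER"]),
                label=div.attrib["LABEL"],
                asset_id=fptr.attrib["FILEID"],  # assumed source of asset_id
            )
        )
    return StructMap(
        id=root.attrib["ID"],
        # Enum lookup by value raises ValueError for an unknown TYPE.
        type=StructMapType(root.attrib["TYPE"]),
        items=sorted(items, key=lambda item: item.order),
    )


struct_map = parse_struct_map(SAMPLE)
```

Sorting by `order` is a choice made in this sketch so that callers do not depend on document order; the PR may handle ordering differently.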
@ssciolla are these exceptions the reason for `ElementAdapter`?
Pretty much. The adapter provides a couple other advantages (stores namespaces and shares them with children, reduces and specifies the API surface of the element-like object we deal with), but the primary one is that we can raise errors if we don't find what we expect. See #7 (comment)
There are probably other ways to solve this type safety problem, but this is the one I chose for now.
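For readers skimming the thread, the difference described above can be sketched as follows. The adapter here is a trimmed-down copy of the one in this PR (only `find` and `text`), and the XML document and `ex` namespace are invented for the example.

```python
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element


class DataNotFoundError(Exception):
    pass


class ElementAdapter:
    """Trimmed-down copy of the PR's adapter (find and text only)."""

    def __init__(self, elem: Element, namespaces: dict[str, str]):
        self.elem = elem
        self.namespaces = namespaces

    def find(self, path: str) -> "ElementAdapter":
        result = self.elem.find(path, self.namespaces)
        if result is None:
            raise DataNotFoundError(f"No element found for path {path}")
        # The child wrapper reuses the parent's namespace map.
        return ElementAdapter(result, self.namespaces)

    @property
    def text(self) -> str:
        result = self.elem.text
        if result is None:
            raise DataNotFoundError(f"No text found for {self.elem.tag}")
        return result


# Illustrative document and namespace, not taken from the PR's fixtures.
NAMESPACES = {"ex": "http://example.org/ns"}
XML = '<ex:record xmlns:ex="http://example.org/ns"><ex:title>Sample</ex:title></ex:record>'

root = ET.fromstring(XML)

# Plain ElementTree: find() returns None for a missing element, so the
# problem only surfaces later as an AttributeError on None.
missing = root.find("ex:author", NAMESPACES)
try:
    _ = missing.text
    plain_error = None
except AttributeError as error:
    plain_error = error

# Adapter: the return types are never None, and a missing element fails
# at the lookup itself with a message naming the path.
adapter = ElementAdapter(root, NAMESPACES)
title = adapter.find("ex:title").text

try:
    adapter.find("ex:author")
    adapter_error = None
except DataNotFoundError as error:
    adapter_error = error
```

Because `find` and `text` are annotated as returning `ElementAdapter` and `str` rather than optional types, callers do not need `None` checks to satisfy a type checker such as mypy.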