Add METS metadata parsing experiment (DOROP-20) #7

Draft
wants to merge 17 commits into
base: main
58 changes: 58 additions & 0 deletions experiments/dorop-20/Dockerfile
@@ -0,0 +1,58 @@
FROM python:3.12-slim-bookworm AS base

ARG POETRY_VERSION=1.8.4
ARG UID=1000
ARG GID=1000

ENV PYTHONPATH="/app"

RUN groupadd -g ${GID} -o app
RUN useradd -m -d /app -u ${UID} -g ${GID} -o -s /bin/bash app

RUN apt-get update -yqq && apt-get install -yqq --no-install-recommends \
    python3-dev \
    build-essential \
    pkg-config \
    vim-tiny \
    curl \
    unzip

WORKDIR /app

ENV PYTHONPATH="/app"

CMD ["tail", "-f", "/dev/null"]

FROM base AS poetry

RUN pip install poetry==${POETRY_VERSION}

ENV PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=off \
    PIP_DISABLE_PIP_VERSION_CHECK=on \
    PIP_DEFAULT_TIMEOUT=100 \
    POETRY_NO_INTERACTION=1 \
    POETRY_VIRTUALENVS_CREATE=1 \
    POETRY_VIRTUALENVS_IN_PROJECT=1 \
    POETRY_CACHE_DIR=/tmp/poetry_cache

FROM poetry AS build

COPY pyproject.toml poetry.lock README.md ./

RUN poetry export --without dev -f requirements.txt --output requirements.txt

FROM poetry AS development
RUN apt-get update -yqq && apt-get install -yqq --no-install-recommends \
    git

USER app

FROM base AS production

COPY --chown=${UID}:${GID} . /app
COPY --chown=${UID}:${GID} --from=build "/app/requirements.txt" /app/requirements.txt

RUN pip install -r /app/requirements.txt

USER app
7 changes: 7 additions & 0 deletions experiments/dorop-20/README.md
@@ -0,0 +1,7 @@
# Metadata extraction experiment

## Tests

```sh
docker compose run app poetry run pytest
```
Empty file.
16 changes: 16 additions & 0 deletions experiments/dorop-20/compose.yml
@@ -0,0 +1,16 @@
services:
  app:
    build:
      context: .
      target: development
      dockerfile: Dockerfile
      args:
        UID: ${UID:-1000}
        GID: ${GID:-1000}
        DEV: ${DEV:-false}
        POETRY_VERSION: ${POETRY_VERSION:-1.8.4}
    volumes:
      - .:/app
      - ../../tests/fixtures/test_submission_package:/app/tests/fixtures/test_submission_package
    tty: true
    stdin_open: true
Empty file.
46 changes: 46 additions & 0 deletions experiments/dorop-20/metadata/element_adapter.py
@@ -0,0 +1,46 @@
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element

from metadata.exceptions import DataNotFoundError

class ElementAdapter():

    @classmethod
    def from_string(cls, text: str, namespaces: dict[str, str]) -> "ElementAdapter":
        return cls(ET.fromstring(text=text), namespaces)

    def __init__(self, elem: Element, namespaces: dict[str, str]):
        self.elem = elem
        self.namespaces = namespaces

    def find(self, path: str) -> "ElementAdapter":
        result = self.elem.find(path, self.namespaces)
        if result is None:
            raise DataNotFoundError(f"No element found for path {path}")
Review comment (Collaborator):

@ssciolla are these exceptions the reason for ElementAdapter?

Reply from ssciolla (Contributor Author), Nov 19, 2024:

Pretty much. The adapter provides a couple of other advantages (it stores namespaces and shares them with children, and it reduces and specifies the API surface of the element-like object we deal with), but the primary one is that we can raise errors if we don't find what we expect. See #7 (comment)

There are probably other ways to solve this type safety problem, but this is the one I chose for now.

        return ElementAdapter(result, self.namespaces)

    @property
    def text(self) -> str:
        result = self.elem.text
        if result is None:
            raise DataNotFoundError(f"No text found for {self.elem.tag}")
        return result

    def get(self, key: str) -> str:
        result = self.elem.get(key)
        if result is None:
            raise DataNotFoundError(f"No value for attribute {key} found for {self.elem.tag}")
        return result

    def findall(self, path: str) -> "list[ElementAdapter]":
        return [
            ElementAdapter(elem, self.namespaces)
            for elem in self.elem.findall(path, self.namespaces)
        ]

    @property
    def tag(self) -> str:
        return self.elem.tag

    def get_children(self) -> "list[ElementAdapter]":
        return [ElementAdapter(elem, self.namespaces) for elem in self.elem[:]]
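
The review thread above says the main point of the adapter is to raise an error instead of handing back `None`. A minimal sketch of that behavior follows, using a condensed copy of the class (only `from_string`, `find`, `text`, and `get`) and a stand-in for `metadata.exceptions.DataNotFoundError` so the snippet runs on its own. The sample XML, its namespace URI, and its element names are invented for illustration and are not taken from the PR's fixtures.

```python
import xml.etree.ElementTree as ET


class DataNotFoundError(Exception):
    """Stand-in for metadata.exceptions.DataNotFoundError."""


class ElementAdapter:
    """Condensed copy of the adapter in this diff (find, text, and get only)."""

    def __init__(self, elem: ET.Element, namespaces: dict[str, str]):
        self.elem = elem
        self.namespaces = namespaces

    @classmethod
    def from_string(cls, text: str, namespaces: dict[str, str]) -> "ElementAdapter":
        return cls(ET.fromstring(text), namespaces)

    def find(self, path: str) -> "ElementAdapter":
        result = self.elem.find(path, self.namespaces)
        if result is None:
            raise DataNotFoundError(f"No element found for path {path}")
        # Children inherit the namespace map, so callers never pass it again.
        return ElementAdapter(result, self.namespaces)

    @property
    def text(self) -> str:
        result = self.elem.text
        if result is None:
            raise DataNotFoundError(f"No text found for {self.elem.tag}")
        return result

    def get(self, key: str) -> str:
        result = self.elem.get(key)
        if result is None:
            raise DataNotFoundError(
                f"No value for attribute {key} found for {self.elem.tag}"
            )
        return result


# Hypothetical sample document; the namespace URI and element names are made up.
NAMESPACES = {"ex": "http://example.org/ns"}
SAMPLE_XML = """
<ex:record xmlns:ex="http://example.org/ns" ID="rec-1">
  <ex:title>Sample title</ex:title>
  <ex:empty/>
</ex:record>
"""

root = ElementAdapter.from_string(SAMPLE_XML.strip(), NAMESPACES)

# Successful lookups return plain str values, never Optional[str].
print(root.find("ex:title").text)
print(root.get("ID"))

# A missing element and an element without text both raise instead of returning None.
for lookup in (lambda: root.find("ex:missing"), lambda: root.find("ex:empty").text):
    try:
        lookup()
    except DataNotFoundError as error:
        print(f"DataNotFoundError: {error}")
```

Because every accessor either returns a `str` or an `ElementAdapter` or raises, calling code can chain lookups without `None` checks, which appears to be the type safety concern the reply refers to.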
5 changes: 5 additions & 0 deletions experiments/dorop-20/metadata/exceptions.py
@@ -0,0 +1,5 @@
class MetadataFileNotFoundError(Exception):
    pass

class DataNotFoundError(Exception):
    pass
81 changes: 81 additions & 0 deletions experiments/dorop-20/metadata/models.py
@@ -0,0 +1,81 @@
from pathlib import Path
from dataclasses import dataclass
from datetime import datetime
from enum import Enum

from pydantic import BaseModel

class RecordStatus(Enum):
    STAGE = "stage"
    STORE = "store"

@dataclass
class Actor():
    address: str
    role: str

@dataclass
class PreservationEvent():
    identifier: str
    type: str
    datetime: datetime
    detail: str
    actor: Actor

class CommonMetadata(BaseModel):
    title: str
    author: str
    publication_date: datetime
    subjects: list[str]

class AssetFileUse(Enum):
    ACCESS = "ACCESS"
    SOURCE = "SOURCE"

class FileMetadataFileType(Enum):
    TECHNICAL = "TECHNICAL"

@dataclass
class FileMetadataFile:
    id: str
    type: FileMetadataFileType
    path: Path

@dataclass
class AssetFile:
    id: str
    use: AssetFileUse
    path: Path
    metadata_file: FileMetadataFile

@dataclass
class Asset:
    id: str
    events: list[PreservationEvent]
    files: list[AssetFile]

class StructMapType(Enum):
    PHYSICAL = "PHYSICAL"
    LOGICAL = "LOGICAL"

@dataclass
class StructMapItem():
    order: int
    label: str
    asset_id: str

@dataclass
class StructMap():
    id: str
    type: StructMapType
    items: list[StructMapItem]

@dataclass
class RepositoryItem():
    id: str
    record_status: RecordStatus
    rights: str | None
    events: list[PreservationEvent]
    common_metadata: CommonMetadata
    struct_map: StructMap
    assets: list[Asset]
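
As a rough illustration of how the structural-map models above might be populated, the sketch below copies `StructMapType`, `StructMapItem`, and `StructMap` so it runs standalone. The identifiers, labels, and the `ordered_asset_ids` helper are hypothetical and do not appear in the PR; the real parsing code may build and consume these objects differently.

```python
from dataclasses import dataclass
from enum import Enum


class StructMapType(Enum):
    PHYSICAL = "PHYSICAL"
    LOGICAL = "LOGICAL"


@dataclass
class StructMapItem:
    order: int
    label: str
    asset_id: str


@dataclass
class StructMap:
    id: str
    type: StructMapType
    items: list[StructMapItem]


def ordered_asset_ids(struct_map: StructMap) -> list[str]:
    """Hypothetical helper: return asset IDs sorted by their declared order."""
    return [
        item.asset_id
        for item in sorted(struct_map.items, key=lambda item: item.order)
    ]


# Looking the enum up by value mirrors reading a type attribute from XML as a string.
struct_map = StructMap(
    id="SM1",
    type=StructMapType("PHYSICAL"),
    items=[
        StructMapItem(order=2, label="Page 2", asset_id="asset-b"),
        StructMapItem(order=1, label="Page 1", asset_id="asset-a"),
    ],
)

print(struct_map.type)
print(ordered_asset_ids(struct_map))
```

Looking up the enum by value raises `ValueError` for an unrecognized type string, which is consistent with the fail-loudly approach the adapter takes for missing data.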