creates PREMIS CSV implementation scripts #205

Open
wants to merge 33 commits into
base: master
f21e898
creates premis CSV proof of concept scripts
kieranjol Jul 30, 2017
53aac7d
adds premis csv to xml draft script
kieranjol Jul 30, 2017
88b94be
premiscsv2xml - PEP-08 cleanup
kieranjol Jul 30, 2017
713d2af
premiscsv2xml - removes objectCategory, move information to attribute…
kieranjol Jul 30, 2017
de5b4b0
premiscsv2xml - performs recursive item search
kieranjol Jul 31, 2017
1986b32
premisobjects - remove debug statements
kieranjol Jul 31, 2017
2af6356
ififuncs - adds pronom/siegfied function
kieranjol Aug 2, 2017
55b6c86
premiscsv - ads siegfried/pronom format registry
kieranjol Aug 2, 2017
f4af73e
premiscsv - fix mediainfo eventDetail
kieranjol Aug 3, 2017
33b78d5
premisobjects - extract relative path of object for contentLocation
kieranjol Aug 3, 2017
7bc4538
premiscsv - extract linkingObjectIdentifier for metadata extraction
kieranjol Aug 3, 2017
0f6e048
premiscsv - ads contentLocation to xml transform
kieranjol Aug 4, 2017
4ab7049
adds linkingEventIdentifierValue to object descriptions
kieranjol Aug 5, 2017
494a167
premiscsv2xml - converts Events CSV to XML
kieranjol Aug 5, 2017
39c3235
premiscsv2xml - fixes element order, XML validates again against schema
kieranjol Aug 5, 2017
63cc919
premiscsv2xml - more PREMIS event info
kieranjol Aug 5, 2017
d079604
premiscsv2xml - adds linkingEventIdentifiers
kieranjol Aug 5, 2017
bd4259f
ififuncs/premiscsv - moves functions form premis scripts into ififuncs
kieranjol Aug 6, 2017
a455a45
premiscsv2xml - removes stupid add_value() function
kieranjol Aug 6, 2017
4d762b5
premiscsv2xml - cleanup and docstrings
kieranjol Aug 6, 2017
73f08d6
premisobjects/premiscsv2xml - more cleanup
kieranjol Aug 6, 2017
a6288cc
ififuncs/premiscsv - adds argparse and changes variables in all PREMI…
kieranjol Aug 6, 2017
8c45228
cleans up PREMIS csv scripts
kieranjol Aug 7, 2017
9bad84c
premisobjects - be less IFI folder structure specific
kieranjol Aug 7, 2017
f068310
logs2premis - renames premiscsv to logs2premis
kieranjol Aug 7, 2017
784665d
deletes renamed premiscsv
kieranjol Aug 7, 2017
951ae94
makepremis - makepremis - adds helper script that launches premisobje…
kieranjol Aug 7, 2017
ea1b8f9
Merge branch 'master' into premiscsv
kieranjol Aug 18, 2017
01a5254
makepremis - adds arguments for object/events csv filenames
kieranjol Aug 18, 2017
61fd02e
README.md - updates PREMIS scripts documentation
kieranjol Aug 19, 2017
377b630
Merge branch 'master' into premiscsv
kieranjol Jan 16, 2018
ebc6ef4
premisobjects/makepremis - clarifies argparse and fixes typo
kieranjol Jan 16, 2018
ec2e17b
premisobjects - adds placeholder relationships function
kieranjol Jan 17, 2018
47 changes: 39 additions & 8 deletions README.md
@@ -6,23 +6,28 @@ table of contents

1. [summary](https://github.com/kieranjol/IFIscripts#summary)
2. [Arrangement](https://github.com/kieranjol/IFIscripts#arrangement)
* [sipcreator.py](https://github.com/kieranjol/IFIscripts#sipcreatorpy)
3. [Transcodes](https://github.com/kieranjol/IFIscripts#transcodes)
* [sipcreator.py](https://github.com/kieranjol/IFIscripts#sipcreator)
3. [PREMIS](https://github.com/kieranjol/IFIscripts#PREMIS)
* [premisobjects.py](https://github.com/kieranjol/IFIscripts#premisobjectspy)
* [logs2premis.py](https://github.com/kieranjol/IFIscripts#logs2premispy)
* [makepremis.py](https://github.com/kieranjol/IFIscripts#makepremispy)
* [premiscsv2xml.py](https://github.com/kieranjol/IFIscripts#premiscsv2xmlpy)
4. [Transcodes](https://github.com/kieranjol/IFIscripts#transcodes)
* [makeffv1.py](https://github.com/kieranjol/IFIscripts#makeffv1py)
* [bitc.py](https://github.com/kieranjol/IFIscripts#bitcpy)
* [prores.py](https://github.com/kieranjol/IFIscripts#prorespy)
* [concat.py](https://github.com/kieranjol/IFIscripts#concatpy)
4. [Digital Cinema Package Scripts](https://github.com/kieranjol/IFIscripts#digital-cinema-package-scripts)
5. [Digital Cinema Package Scripts](https://github.com/kieranjol/IFIscripts#digital-cinema-package-scripts)
* [dcpaccess.py](https://github.com/kieranjol/IFIscripts#dcpaccesspy)
* [dcpfixity.py](https://github.com/kieranjol/IFIscripts#dcpfixitypy)
* [dcpsubs2srt.py](https://github.com/kieranjol/IFIscripts#dcpsubs2srtpy)
5. [Fixity Scripts](https://github.com/kieranjol/IFIscripts#fixity-scripts)
6. [Fixity Scripts](https://github.com/kieranjol/IFIscripts#fixity-scripts)
* [copyit.py](https://github.com/kieranjol/IFIscripts#copyitpy)
* [manifest.py](https://github.com/kieranjol/IFIscripts#manifestpy)
* [sha512deep.py](https://github.com/kieranjol/IFIscripts#sha512deeppy)
* [validate.py](https://github.com/kieranjol/IFIscripts#validatepy)
* [batchfixity.py](https://github.com/kieranjol/IFIscripts#batchfixitypy)
6. [Image Sequences](https://github.com/kieranjol/IFIscripts#image-sequences)
7. [Image Sequences](https://github.com/kieranjol/IFIscripts#image-sequences)
* [makedpx.py](https://github.com/kieranjol/IFIscripts#makedpxpy)
* [seq2ffv1.py](https://github.com/kieranjol/IFIscripts#seq2ffv1py)
* [seq2prores.py](https://github.com/kieranjol/IFIscripts#seq2prorespy)
@@ -33,10 +38,10 @@ table of contents
* [seq2dv.py](https://github.com/kieranjol/IFIscripts#seq2dvpy)
* [batchmetadata.py](https://github.com/kieranjol/IFIscripts#batchmetadata)
* [batchrename.py](https://github.com/kieranjol/IFIscripts#batchrename)
7. [Quality Control](https://github.com/kieranjol/IFIscripts#quality-control)
8. [Quality Control](https://github.com/kieranjol/IFIscripts#quality-control)
* [qctools.py](https://github.com/kieranjol/IFIscripts#qctoolspy)
9. [Specific Workflows](https://github.com/kieranjol/IFIscripts#specific-workflows)
* [ffv1mkvvalidate.py](https://github.com/kieranjol/IFIscripts#ffv1mkvvalidatespy)
8. [Specific Workflows](https://github.com/kieranjol/IFIscripts#specific-workflows)
* [mezzaninecheck.py](https://github.com/kieranjol/IFIscripts#mezzaninecheckpy)
* [loopline.py](https://github.com/kieranjol/IFIscripts#looplinepy)
* [masscopy.py](https://github.com/kieranjol/IFIscripts#masscopypy)
@@ -47,7 +52,7 @@ table of contents
* [giffer.py](https://github.com/kieranjol/IFIscripts#gifferpy)
* [makeuuid.py](https://github.com/kieranjol/IFIscripts#makeuuidpy)
* [durationcheck.py](https://github.com/kieranjol/IFIscripts#durationcheck.py)
10. [Experimental-Premis](https://github.com/kieranjol/IFIscripts#experimental-premis)
11. [Experimental-Premis](https://github.com/kieranjol/IFIscripts#experimental-premis)
* [premis.py](https://github.com/kieranjol/IFIscripts#premispy)
* [revtmd.py](https://github.com/kieranjol/IFIscripts#revtmdpy)
* [as11fixity.py](https://github.com/kieranjol/IFIscripts#as11fixitypy)
@@ -71,6 +76,32 @@ Note: Documentation template has been copied from [mediamicroservices](https://g
* Usage for more than one directory - `sipcreator.py -i /path/to/directory_name1 /path/to/directory_name2 -o /path/to/output_folder`
* Run `sipcreator.py -h` for all options.

## PREMIS ##

### makepremis.py ###
* Creates PREMIS CSV and XML descriptions by launching other IFIscripts, such as logs2premis.py, premisobjects.py and premiscsv2xml.py.
* Assumption for now: the representation UUID already exists as part of the SIP/AIP folder structure. TODO: find a way to supply this, probably via argparse.
* For more information, run `pydoc makepremis`
* Usage: `makepremis.py -event_csv path/to/events.csv -object_csv path/to/objects.csv`

### premisobjects.py ###
* Creates a somewhat PREMIS-compliant CSV file describing objects in a package. Use premiscsv2xml.py to transform these CSV files into XML.
* As the flat CSV structure prevents maintaining some of the relationships between units, some semantic units have been merged. For example, `relationship_structural_includes` is really a combination of the `relationshipType` and `relationshipSubType` units, which have the values `Structural` and `Includes` respectively.
* Assumption for now: the representation UUID already exists as part of the SIP/AIP folder structure. TODO: find a way to supply this, probably via argparse.
* For more information, run `pydoc premisobjects`
* Usage: `premisobjects.py -i path/to/SIP -m path/to/manifest.md5 -o path/to/output.csv`
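
To illustrate the merged column naming described above, here is a minimal sketch of how a later transform step might split a header such as `relationship_structural_includes` back into its two PREMIS units. `split_relationship_column` is a hypothetical helper, not a function in IFIscripts, and the capitalisation rule is an assumption based only on the `Structural` / `Includes` example.

```python
def split_relationship_column(column_name):
    """Split 'relationship_<type>_<subtype>' into (relationshipType, relationshipSubType)."""
    # Hypothetical helper: the three-part naming scheme is assumed from the README example.
    prefix, relationship_type, relationship_subtype = column_name.split('_', 2)
    if prefix != 'relationship':
        raise ValueError('not a merged relationship column: %s' % column_name)
    return relationship_type.capitalize(), relationship_subtype.capitalize()


print(split_relationship_column('relationship_structural_includes'))
```

Subtypes written as one lowercase word in the header (for example `isincludedin`) would need their own mapping, since simple capitalisation cannot recover word boundaries.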

### logs2premis.py ###
* Extracts preservation events from an IFI plain text log file and converts them to a CSV using the PREMIS data dictionary.
* For more information, run `pydoc logs2premis`
* Usage: `logs2premis.py -i path/to/logfile.log -o path/to/output.csv -object_csv path/to/objects.csv`

### premiscsv2xml.py ###
* Transforms PREMIS CSV files into XML.
* For more information, run `pydoc premiscsv2xml`
* Usage: `premiscsv2xml.py -ev path/to/events.csv -i path/to/objects.csv`


## Transcodes ##

### makeffv1.py ###
45 changes: 45 additions & 0 deletions ififuncs.py
@@ -855,6 +855,51 @@ def checksum_replace(manifest, logname):
for lines in updated_manifest:
fo.write(lines)

def get_pronom_format(filename):
    '''
    Uses siegfried to return a tuple that contains:
    pronom_id, authority, siegfried version
    '''
    siegfried_json = subprocess.check_output(
        ['sf', '-json', filename]
    )
    json_object = json.loads(siegfried_json)
    pronom_id = str(json_object['files'][0]['matches'][0]['id'])
    authority = str(json_object['files'][0]['matches'][0]['ns'])
    version = str(json_object['siegfried'])
    return (pronom_id, authority, version)
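
The JSON lookups in `get_pronom_format()` can be tried without siegfried installed. The sketch below runs the same key lookups over a hand-written sample shaped like `sf -json` output. The sample values are illustrative only, and `parse_siegfried_json` is a hypothetical name, not part of ififuncs.

```python
import json

# Hand-written sample shaped like `sf -json` output; the values are illustrative.
SAMPLE_SF_JSON = '''
{
  "siegfried": "1.7.3",
  "files": [
    {
      "filename": "example.mkv",
      "matches": [
        {"ns": "pronom", "id": "fmt/569", "format": "Matroska"}
      ]
    }
  ]
}
'''


def parse_siegfried_json(siegfried_json):
    """Return (pronom_id, authority, version) from siegfried JSON text."""
    json_object = json.loads(siegfried_json)
    # Only the first match of the first file is used, as in get_pronom_format().
    first_match = json_object['files'][0]['matches'][0]
    return (
        str(first_match['id']),
        str(first_match['ns']),
        str(json_object['siegfried']),
    )


print(parse_siegfried_json(SAMPLE_SF_JSON))
```

Because only `matches[0]` is read, a file that siegfried reports with several candidate matches would have the later ones ignored.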

def get_checksum(manifest, filename):
    '''
    Extracts the checksum and path within a manifest, returning both as a tuple.
    '''
    if os.path.isfile(manifest):
        with open(manifest, 'r') as manifest_object:
            manifest_lines = manifest_object.readlines()
            for md5 in manifest_lines:
                if 'objects' in md5:
                    if filename in md5:
                        return md5[:32], md5[34:].rstrip()
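
The slice offsets in `get_checksum()` rely on the manifest line layout: a 32-character MD5 hex digest, a two-character separator, then the path. A small sketch of that slicing on a made-up manifest line (`split_manifest_line` is a hypothetical helper used only for illustration):

```python
def split_manifest_line(line):
    """Return (checksum, path) from a line shaped '<32 hex chars>  <path>'."""
    # Offsets assume an MD5 digest; a longer digest would need different slices.
    return line[:32], line[34:].rstrip()


sample_line = 'd41d8cd98f00b204e9800998ecf8427e  objects/example.mkv\n'
checksum, path = split_manifest_line(sample_line)
print(checksum)
print(path)
```

A manifest built with a different algorithm, such as SHA-512, would not fit these offsets, so the function as written is tied to MD5 manifests.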

def find_representation_uuid(source):
    '''
    This extracts the representation UUID from a directory name.
    It returns the name of the parent of the first directory whose path
    contains 'objects'.
    '''
    for root, _, _ in os.walk(source):
        if 'objects' in root:
            return os.path.basename(os.path.dirname(root))
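
A quick way to see what `find_representation_uuid()` returns is to run it over a throwaway directory tree. The sketch below copies the function body and builds a made-up package layout with `tempfile`. The folder names (`oe1234`, the UUID) are invented for the example.

```python
import os
import shutil
import tempfile


def find_representation_uuid(source):
    """Return the name of the parent of the first directory whose path contains 'objects'."""
    for root, _, _ in os.walk(source):
        if 'objects' in root:
            return os.path.basename(os.path.dirname(root))


temp_dir = tempfile.mkdtemp()
try:
    uuid_dir = 'f47ac10b-58cc-4372-a567-0e02b2c3d479'  # made-up UUID
    os.makedirs(os.path.join(temp_dir, 'oe1234', uuid_dir, 'objects'))
    os.makedirs(os.path.join(temp_dir, 'oe1234', uuid_dir, 'metadata'))
    found = find_representation_uuid(temp_dir)
    print(found)
finally:
    shutil.rmtree(temp_dir)
```

Note that the check is a substring test on the whole path, so an ancestor directory with `objects` anywhere in its name would also match, and a source without any such directory yields `None`.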

def extract_metadata(csv_file):
    '''
    Read the PREMIS csv and store the metadata in a list of dictionaries.
    '''
    object_dictionaries = []
    # Use a context manager so the file handle is closed after reading.
    with open(csv_file) as csv_object:
        input_file = csv.DictReader(csv_object)
        for rows in input_file:
            object_dictionaries.append(rows)
    return object_dictionaries
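
For readers unfamiliar with `csv.DictReader`, this sketch shows the list-of-dictionaries shape that `extract_metadata()` produces. It reads from an in-memory string rather than a file, and the two rows are made-up values.

```python
import csv
import io

# In-memory stand-in for an events CSV; the values are invented.
sample_csv = io.StringIO(
    u'eventIdentifierValue,eventType\n'
    u'1111,metadata extraction\n'
    u'2222,Information Package Creation\n'
)
# Each row becomes a dictionary keyed by the header line.
rows = list(csv.DictReader(sample_csv))
print(rows[0]['eventType'])
print(len(rows))
```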

def img_seq_pixfmt(start_number, path):
'''
Determine the pixel format of an image sequence
226 changes: 226 additions & 0 deletions logs2premis.py
@@ -0,0 +1,226 @@
#!/usr/bin/env python
'''
Extracts preservation events from an IFI plain text log file and converts
to a CSV using the PREMIS data dictionary
'''
import os
import sys
import csv
import shutil
import argparse
# from lxml import etree
import ififuncs


def find_events(logfile, output):
'''
A very hacky attempt to extract the relevant preservation events from our
log files.
'''
sip_test = os.path.basename(logfile).replace('_sip_log.log', '')
if ififuncs.validate_uuid4(sip_test) != False:
linking_object_identifier_value = sip_test
with open(logfile, 'r') as logfile_object:
log_lines = logfile_object.readlines()
for event_test in log_lines:
if 'eventDetail=copyit.py' in event_test:
logsplit = event_test.split(',')
for line_fragment in logsplit:
manifest_event = line_fragment.replace(
'eventDetail', ''
).replace('\n', '').split('=')[1]
object_info = ififuncs.extract_metadata('objects.csv')
object_locations = {}
for i in object_info:
object_locations[
i['contentLocationValue']
] = i['objectIdentifier'].split(', ')[1].replace(']', '')
for log_entry in log_lines:
valid_entries = [
'eventType',
'eventDetail=sipcreator.py',
'eventDetail=Mediatrace',
'eventDetail=Technical',
'eventDetail=copyit.py'
]
for entry in valid_entries:
if entry in log_entry:
break_loop = ''
event_outcome = ''
event_detail = ''
event_outcome_detail_note = ''
event_type = ''
event_row = []
datetime = log_entry[:19]
logsplit = log_entry.split(',')
for line_fragment in logsplit:
if 'eventType' in line_fragment:
if 'EVENT =' in line_fragment:
line_fragment = line_fragment.split('EVENT =')[1]
event_type = line_fragment.replace(
' eventType=', ''
).replace('assignement', 'assignment')
if ' value' in line_fragment:
# this assumes that the value is the outcome of an identifier assigment.
event_outcome = line_fragment[7:].replace('\n', '')
# we are less concerned with events starting.
if 'status=started' in line_fragment:
break_loop = 'continue'
if 'Generating destination manifest:' in line_fragment:
break_loop = ''
event_detail = manifest_event
# ugh, this might run multiple times.
if 'eventDetail=sipcreator.py' in log_entry:
event_type = 'Information Package Creation'
event_detail = line_fragment.replace(
'eventDetail', ''
).replace('\n', '').split('=')[1]
event_outcome_detail_note = 'Submission Information Package'
if ('eventDetail=Mediatrace' in log_entry) or ('eventDetail=Technical' in log_entry):
event_type = 'metadata extraction'
event_detail = log_entry.split(
'eventDetail=', 1
)[1].split(',')[0]
event_outcome = log_entry.split(
'eventOutcome=', 1
)[1].replace(', agentName=mediainfo', '').replace('\n', '')
if 'eventDetail=Mediatrace' in log_entry:
event_outcome = event_outcome.replace('mediainfo.xml', 'mediatrace.xml')
for x in object_locations:
'''
This is trying to get the UUID of the source object
that relates to the mediainfo xmls. This is
achieved via a dictionary.
'''
if 'objects' in x:
a = os.path.basename(event_outcome).replace('_mediainfo.xml', '').replace('_mediatrace.xml', '')[:-1]
b = os.path.basename(x)
if a == b:
linking_object_identifier_value = object_locations[x].replace('\'', '')
if (break_loop == 'continue') or (event_type == ''):
continue
print event_type
event_row = [
'UUID', ififuncs.create_uuid(),
event_type, datetime, event_detail,
'',
event_outcome, '',
event_outcome_detail_note, '',
'', '',
'', 'UUID',
linking_object_identifier_value, ''
]
ififuncs.append_csv(output, event_row)
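
As a possible simplification of the fragment handling in `find_events()`, a log line can be split into a timestamp and a dictionary of `key=value` pairs in one pass. This is only a sketch: the exact IFI log format is not shown in this diff, so the sample line is a guess modelled on the strings the function searches for, and `parse_log_entry` is a hypothetical helper.

```python
def parse_log_entry(log_entry):
    """Return (timestamp, {key: value}) for one comma-separated log line."""
    timestamp = log_entry[:19]
    fields = {}
    for fragment in log_entry.rstrip('\n').split(','):
        if '=' not in fragment:
            continue  # e.g. the leading timestamp has no key=value pair
        key, value = fragment.split('=', 1)
        fields[key.strip()] = value.strip()
    return timestamp, fields


# Hypothetical log line, shaped after the strings find_events() looks for.
sample = (
    '2017-08-03T10:15:30, eventDetail=Technical metadata extraction via mediainfo, '
    'eventOutcome=/tmp/example_mediainfo.xml, agentName=mediainfo\n'
)
timestamp, fields = parse_log_entry(sample)
print(timestamp)
print(fields['eventDetail'])
```

Skipping fragments without an `=` avoids the `IndexError` that `split('=')[1]` raises on such fragments.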


def update_objects(output, objects_csv):
    '''
    Update the object description with the linkingEventIdentifiers
    '''
    link_dict = {}
    event_dicts = ififuncs.extract_metadata(output)
    for i in event_dicts:
        a = i['eventIdentifierValue']
        try:
            link_dict[i['linkingObjectIdentifierValue']] += a + '|'
        except KeyError:
            link_dict[i['linkingObjectIdentifierValue']] = a + '|'
    print link_dict
    object_dicts = ififuncs.extract_metadata(objects_csv)
    for x in object_dicts:
        for link in link_dict:
            if link == x['objectIdentifier'].split(', ')[1].replace(']', '').replace('\'', ''):
                x['linkingEventIdentifierValue'] = link_dict[link]
    premis_object_units = [
        'objectIdentifier',
        'objectCategory',
        'messageDigestAlgorithm', 'messageDigest', 'messageDigestOriginator',
        'size', 'formatName', 'formatVersion',
        'formatRegistryName', 'formatRegistryKey', 'formatRegistryRole',
        'objectCharacteristicsExtension', 'originalName',
        'contentLocationType', 'contentLocationValue',
        'relatedObjectIdentifierType', 'relatedObjectIdentifierValue',
        'relatedObjectSequence',
        'relatedEventIdentifierType', 'relatedEventIdentifierValue',
        'relatedEventSequence',
        'linkingEventIdentifierType', 'linkingEventIdentifierValue',
        'relationship_structural_includes',
        'relationship_structural_isincludedin',
        'relationship_structural_represents',
        'relationship_structural_hasroot',
        'relationship_derivation_hassource'
    ]
    with open('mycsvfile.csv', 'wb') as f:
        counter = 0
        for i in object_dicts:
            w = csv.DictWriter(f, fieldnames=premis_object_units)
            if counter == 0:
                w.writeheader()
                counter += 1
            w.writerow(i)
    shutil.move('mycsvfile.csv', objects_csv)
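
The grouping step at the top of `update_objects()` collects every event identifier that links to the same object. One alternative, sketched here with made-up identifiers, is to gather them in a `defaultdict` and join at the end. This is a different technique from the try/except version, and it produces no trailing `|`. Whether a downstream consumer expects that trailing separator is not shown in this diff, so treat the change as an assumption.

```python
from collections import defaultdict

# Made-up event rows; only the two columns used for linking are shown.
event_dicts = [
    {'eventIdentifierValue': 'event-1', 'linkingObjectIdentifierValue': 'object-a'},
    {'eventIdentifierValue': 'event-2', 'linkingObjectIdentifierValue': 'object-a'},
    {'eventIdentifierValue': 'event-3', 'linkingObjectIdentifierValue': 'object-b'},
]

links = defaultdict(list)
for event in event_dicts:
    links[event['linkingObjectIdentifierValue']].append(event['eventIdentifierValue'])

# Join once at the end, so values are pipe-separated without a trailing '|'.
link_dict = {object_id: '|'.join(event_ids) for object_id, event_ids in links.items()}
print(link_dict['object-a'])
```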


def make_events_csv(output):
    '''
    Generates a CSV with PREMIS-esque headings. Currently it's just called
    'bla.csv' but it will probably be called:
    UUID_premisevents.csv
    and sit in the metadata directory.
    '''
    premis_events = [
        'eventIdentifierType', 'eventIdentifierValue',
        'eventType', 'eventDateTime', 'eventDetail',
        'eventDetailExtension',
        'eventOutcome', 'eventOutcomeDetail',
        'eventOutcomeDetailNote', 'eventOutcomeDetailExtension',
        'linkingAgentIdentifierType', 'linkingAgentIdentifierValue',
        'linkingAgentIdentifierRole', 'linkingObjectIdentifierType',
        'linkingObjectIdentifierValue', 'linkingObjectRole'
    ]
    ififuncs.create_csv(output, premis_events)
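
Since `find_events()` appends rows as plain lists, each row only lines up with these headings by position. A short sketch that checks the alignment by zipping the heading list with a row laid out like `event_row`. The identifier and outcome values are placeholders.

```python
premis_events = [
    'eventIdentifierType', 'eventIdentifierValue',
    'eventType', 'eventDateTime', 'eventDetail',
    'eventDetailExtension',
    'eventOutcome', 'eventOutcomeDetail',
    'eventOutcomeDetailNote', 'eventOutcomeDetailExtension',
    'linkingAgentIdentifierType', 'linkingAgentIdentifierValue',
    'linkingAgentIdentifierRole', 'linkingObjectIdentifierType',
    'linkingObjectIdentifierValue', 'linkingObjectRole'
]

# Same positional layout as event_row in find_events(); values are placeholders.
event_row = [
    'UUID', 'made-up-event-id',
    'metadata extraction', '2017-08-03T10:15:30', 'example detail',
    '',
    'example outcome', '',
    '', '',
    '', '',
    '', 'UUID',
    'made-up-object-id', ''
]

# A row that is too short or too long would silently shift columns in the CSV.
assert len(event_row) == len(premis_events)
labelled = dict(zip(premis_events, event_row))
print(labelled['eventType'])
print(labelled['linkingObjectIdentifierValue'])
```

Building rows as dictionaries and writing them with `csv.DictWriter` would make this positional coupling unnecessary, at the cost of changing how `ififuncs.append_csv` is called.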


def parse_args(args_):
    '''
    Parse command line arguments.
    '''
    parser = argparse.ArgumentParser(
        description='Describes events using PREMIS data dictionary via CSV.'
        ' Written by Kieran O\'Leary.'
    )
    parser.add_argument(
        '-i',
        help='full path of a log textfile', required=True
    )
    parser.add_argument(
        '-o',
        help='full path of output csv', required=True
    )
    parser.add_argument(
        '-object_csv',
        help='full path of object description csv', required=True
    )
    parser.add_argument(
        '-user',
        help='Declare who you are. If this is not set, you will be prompted.'
    )
    parsed_args = parser.parse_args(args_)
    return parsed_args
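
Because `parse_args()` takes the argument list as a parameter, it can be exercised without a command line. The sketch below is a trimmed copy (the `-user` option is left out) with example filenames. It shows that the single-dash option `-object_csv` is read back as `args.object_csv`.

```python
import argparse


def parse_args(args_):
    """Trimmed copy of the logs2premis.py parser, for illustration only."""
    parser = argparse.ArgumentParser(description='Sketch of the logs2premis.py options')
    parser.add_argument('-i', required=True, help='full path of a log textfile')
    parser.add_argument('-o', required=True, help='full path of output csv')
    parser.add_argument('-object_csv', required=True, help='full path of object description csv')
    return parser.parse_args(args_)


# Example filenames; any list of strings shaped like sys.argv[1:] works.
args = parse_args(['-i', 'example.log', '-o', 'events.csv', '-object_csv', 'objects.csv'])
print(args.i, args.o, args.object_csv)
```

The same pattern lets a helper script such as makepremis.py call `main()` with a constructed list instead of spawning a subprocess.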


def main(args_):
    '''
    Launches all the other functions when run from the command line.
    '''
    args = parse_args(args_)
    logfile = args.i
    output = args.o
    objects_csv = args.object_csv
    make_events_csv(output)
    find_events(logfile, output)
    update_objects(output, objects_csv)


if __name__ == '__main__':
    main(sys.argv[1:])