Merge branch 'pipeline_uni'
uogbuji committed Jul 28, 2021
2 parents a3b5c4c + 45f9df8 commit c9b8a96
Showing 72 changed files with 4,530 additions and 1,523 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -3,6 +3,7 @@ scratch
.idea
.vscode
MANIFEST
prof

#----

104 changes: 100 additions & 4 deletions README.md
@@ -1,6 +1,102 @@
Versa
=====
# Versa

The Versa model for Web resources and relationships. Think of it as an evolution of Resource Description Framework (RDF)
that's at once simpler and more expressive.
Versa is a model for Web resources and relationships. It has a lot in common
with Resource Description Framework (RDF) or Property Graphs (PG). It is
a way to express and work with data on the Web, in direct terms of resources
and rich linking between these resources. This also makes it a good and
natural way to express Knowledge Graphs (KG).

This repository provides the specification as well as tools for using Versa in
practice, which also serve as reference implementations.

# Brief introduction to Versa

To get a simple idea of Versa, think about how you can express the relationship
between a Web page and its author in HTML5.

<a href="http://uche.ogbuji.net" rel="author">Uche Ogbuji</a>

Let's say the page being described is `http://uche.ogbuji.net/ndewo/`.
Versa makes it easy to pull together all these author link components into a single construct for easy understanding and manipulation.

http://uche.ogbuji.net/ndewo/ author http://uche.ogbuji.net (caption="Uche Ogbuji")

In Versa this is called a link, and a link has four basic components: an
origin, a relationship, a target and a set of attributes. Link relationships
(also known as link types) are critical because they place links in context,
and Versa expects relationships to be IRIs so the context (meaning, if you like)
is properly expressed and fully scoped. Since rel=author is defined in HTML5,
you can complete the above as follows (using a made-up IRI for sake of example):

http://uche.ogbuji.net/ndewo/ http://www.w3.org/TR/html5/link-type/author http://uche.ogbuji.net (caption="Uche Ogbuji")

You can express Versa links in JSON, for example:

["http://uche.ogbuji.net/ndewo/", "http://www.w3.org/TR/html5/link-type/author", "http://uche.ogbuji.net", {"caption": "Uche Ogbuji"}]
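As a rough illustration of the four-part structure, here is a minimal Python sketch that round-trips the link above through JSON using only the standard library. The `Link` class and both helper functions are hypothetical names made up for this example; they are not part of the Versa tools in this repository.

```python
import json
from typing import NamedTuple


class Link(NamedTuple):
    """Hypothetical container for the four components of a Versa link."""
    origin: str
    rel: str
    target: str
    attrs: dict


def link_to_json(link: Link) -> str:
    # Leave out the fourth item when there are no attributes
    items = [link.origin, link.rel, link.target]
    if link.attrs:
        items.append(link.attrs)
    return json.dumps(items)


def link_from_json(text: str) -> Link:
    items = json.loads(text)
    # A three-item array is read as a link with an empty attribute set
    attrs = items[3] if len(items) > 3 else {}
    return Link(items[0], items[1], items[2], attrs)


author_link = Link(
    'http://uche.ogbuji.net/ndewo/',
    'http://www.w3.org/TR/html5/link-type/author',
    'http://uche.ogbuji.net',
    {'caption': 'Uche Ogbuji'},
)
round_tripped = link_from_json(link_to_json(author_link))
```

A named tuple keeps the positional order (origin, relationship, target, attributes) that the JSON array form relies on, while still allowing access by name.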

Usually you think of links in groups, for example the many links from one page,
or all the various links across, out of and into a Web site. Versa is
designed for working with such collections of links. A collection of links
in Versa is called a linkset. Again you can express a linkset in JSON.

[
["http://uche.ogbuji.net/ndewo/", "http://www.w3.org/TR/html5/link-type/author", "http://uche.ogbuji.net", {"http://www.w3.org/TR/html5/link/caption": "Uche Ogbuji"}],
["http://uche.ogbuji.net/ndewo/", "http://www.w3.org/TR/html5/link-type/see-also", "http://www.goodreads.com/book/show/18714145-ndewo-colorado", {"http://www.w3.org/TR/html5/link/label": "Goodreads"}],
["http://uche.ogbuji.net/", "http://www.w3.org/TR/html5/link-type/see-also", "http://uche.ogbuji.net/ndewo/"]
]

Notice that the third link has no attributes. Attributes are optional. I
invented a `see-also` relationship to represent a simple HTML link with no
`rel` attribute. The second link captures the idea of an HTML `alt`
attribute with a label attribute. In fact, HTML defines a bunch of
attributes which can be used with links, and you can add your own using XML
namespaces or HTML5 data attributes. This is why attributes are a core part
of a link in Versa. A Web link ties together multiple bits of information
in an extensible way, and attributes provide the extensibility, ensuring you
can work with all these bits of information as a unit.
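To make the idea of working with a linkset concrete, here is a small Python sketch that filters the three links above by origin or relationship. The `match` function is an assumption made for illustration; it is plain Python over nested lists, not the query API of the Versa library.

```python
HTML5 = 'http://www.w3.org/TR/html5/'

# The linkset from the JSON example, as plain Python lists
linkset = [
    ['http://uche.ogbuji.net/ndewo/', HTML5 + 'link-type/author',
     'http://uche.ogbuji.net', {HTML5 + 'link/caption': 'Uche Ogbuji'}],
    ['http://uche.ogbuji.net/ndewo/', HTML5 + 'link-type/see-also',
     'http://www.goodreads.com/book/show/18714145-ndewo-colorado',
     {HTML5 + 'link/label': 'Goodreads'}],
    ['http://uche.ogbuji.net/', HTML5 + 'link-type/see-also',
     'http://uche.ogbuji.net/ndewo/'],
]


def match(links, origin=None, rel=None):
    """Yield (origin, rel, target, attrs) for each link matching the given parts."""
    for link in links:
        o, r, t = link[:3]
        # Attributes are optional, so default to an empty dict
        attrs = link[3] if len(link) > 3 else {}
        if origin is not None and o != origin:
            continue
        if rel is not None and r != rel:
            continue
        yield o, r, t, attrs


see_also_targets = [t for _, _, t, _ in match(linkset, rel=HTML5 + 'link-type/see-also')]
links_from_ndewo = list(match(linkset, origin='http://uche.ogbuji.net/ndewo/'))
```

Because every link is normalized to four parts on the way out, callers can treat the attribute set uniformly whether or not the original link carried one.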

If you think about data on the Web, links from one resource to another are
useful, but it's also useful to be able to express simple properties of a
resource. Versa supports this in the form of what's called a data link.
For example you could capture the title and other metadata about a resource.

["http://uche.ogbuji.net/ndewo/", "http://www.w3.org/TR/html5/title", "Ndewo, Colorado"]

The target of a data link is not a Web resource but rather a simple piece of
information. Technically, in Versa syntax you should always signal resources
as IRIs. In JSON form this looks as follows:

[
["<http://uche.ogbuji.net/ndewo/>", "<http://www.w3.org/TR/html5/link-type/author>", "<http://uche.ogbuji.net>", {"<http://www.w3.org/TR/html5/link/description>": "Uche Ogbuji"}],
["<http://uche.ogbuji.net/ndewo/>", "<http://www.w3.org/TR/html5/link-type/see-also>", "<http://www.goodreads.com/book/show/18714145-ndewo-colorado>", {"<http://www.w3.org/TR/html5/link/label>": "Goodreads"}],
["<http://uche.ogbuji.net/>", "<http://www.w3.org/TR/html5/link-type/see-also>", "<http://uche.ogbuji.net/ndewo/>"],
["<http://uche.ogbuji.net/ndewo/>", "<http://www.w3.org/TR/html5/title>", "Ndewo, Colorado"]
]

The angle brackets signal to Versa what should be treated as an IRI.
Versa origins and relationships are always IRIs, so you can omit the angle
brackets in those cases.

[
["http://uche.ogbuji.net/ndewo/", "http://www.w3.org/TR/html5/link-type/author", "<http://uche.ogbuji.net>", {"<http://www.w3.org/TR/html5/link/description>": "Uche Ogbuji"}],
["http://uche.ogbuji.net/ndewo/", "http://www.w3.org/TR/html5/link-type/see-also", "<http://www.goodreads.com/book/show/18714145-ndewo-colorado>", {"<http://www.w3.org/TR/html5/link/label>": "Goodreads"}],
["http://uche.ogbuji.net/", "http://www.w3.org/TR/html5/link-type/see-also", "<http://uche.ogbuji.net/ndewo/>"],
["http://uche.ogbuji.net/ndewo/", "http://www.w3.org/TR/html5/title", "Ndewo, Colorado"]
]
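The angle-bracket convention can be sketched in a few lines of Python. The `parse_target` function below is a hypothetical helper, not something taken from the Versa code; it also assumes that no plain data string is itself wrapped in angle brackets, a case this README does not address.

```python
def parse_target(target: str):
    """Return ('iri', value) for an angle-bracketed target, else ('data', value)."""
    if len(target) > 2 and target.startswith('<') and target.endswith('>'):
        # Strip the brackets to recover the IRI itself
        return 'iri', target[1:-1]
    return 'data', target


resource_kind = parse_target('<http://uche.ogbuji.net>')
data_kind = parse_target('Ndewo, Colorado')
```

Origins and relationships do not need this check, since they are always IRIs; only targets (and attribute keys or values) can be either kind.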

All Versa data link targets are represented as strings, but they can be
interpreted as e.g. numbers, dates or other data types. Attributes are
useful for signaling such interpretation.

["http://uche.ogbuji.net/ndewo/", "http://www.w3.org/TR/html5/created", "2013-09-01", {"<@type>": "<@datetime>"}]

Notice the syntax used in the attribute. Versa provides some common data
modeling primitives such as a way to express the interpreted type of a data
link target. `@type` is just a convenient abbreviation for referring
to this Versa built-in concept. You can write out this link in full as follows:

["http://uche.ogbuji.net/ndewo/", "http://www.w3.org/TR/html5/created", "2013-09-01", {"<http://purl.org/versa/type>": "<http://purl.org/versa/datetime>"}]
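Here is a hedged Python sketch of how a consumer might expand the `@` abbreviation and use the type attribute to interpret a string target. The names `expand`, `CONVERTERS` and `interpret` are invented for this example, and mapping the Versa datetime type onto `datetime.date.fromisoformat` is an assumption that only suits date-only values like the one shown.

```python
from datetime import date

VERSA_BASE = 'http://purl.org/versa/'


def expand(term: str) -> str:
    """Expand '<@name>' to '<http://purl.org/versa/name>'; leave other terms alone."""
    if term.startswith('<@') and term.endswith('>'):
        return '<' + VERSA_BASE + term[2:-1] + '>'
    return term


TYPE_ATTR = expand('<@type>')

# Hypothetical table from type IRI to a Python converter
CONVERTERS = {expand('<@datetime>'): date.fromisoformat}


def interpret(link):
    """Return the link target, converted if a known type attribute is present."""
    target = link[2]
    attrs = link[3] if len(link) > 3 else {}
    # Normalize abbreviated keys and values to their full IRI form
    attrs = {expand(k): expand(v) for k, v in attrs.items()}
    converter = CONVERTERS.get(attrs.get(TYPE_ATTR))
    return converter(target) if converter else target


created = ['http://uche.ogbuji.net/ndewo/', 'http://www.w3.org/TR/html5/created',
           '2013-09-01', {'<@type>': '<@datetime>'}]
created_value = interpret(created)
```

Targets without a recognized type attribute come back unchanged as strings, which matches the rule that all data link targets are represented as strings.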

# Developer notes

Docstring style: [Google](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings) + Markdown
4 changes: 4 additions & 0 deletions demo/ingest/books.csv
@@ -0,0 +1,4 @@
Title,Author,Author date,ISBN,Publisher,Pub date
Half of a Yellow Sun,Chimamanda Ngozi Adichie,1977,9780008205249,Fourth Estate,2006
Things Fall Apart,Chinụalụmọgụ Achebe,1930,9781841593272,William Heinemann Ltd.,1958
"Death and the King's Horseman ",Olúwolé Sóyíinká,1934,9780413333506,Eyre Methuen,1975
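The `VLITERATE_TEMPLATE` in `csv_to_bibframe.py` refers to columns with placeholders such as `{Author_date}`, while the CSV header reads `Author date`. The sketch below shows one way such a mapping could work with the standard library alone: replace spaces in header names with underscores, then apply `str.format`. This is an assumption about the general approach, not a description of what `versa.serial.csv.parse_iter` actually does, and `TEMPLATE` here is a cut-down stand-in for the real template.

```python
import csv
import io

# Two rows from demo/ingest/books.csv, inlined so the sketch is self-contained
CSV_TEXT = '''Title,Author,Author date,ISBN,Publisher,Pub date
Half of a Yellow Sun,Chimamanda Ngozi Adichie,1977,9780008205249,Fourth Estate,2006
"Death and the King's Horseman ",Olúwolé Sóyíinká,1934,9780413333506,Eyre Methuen,1975
'''

# Hypothetical, shortened template for illustration only
TEMPLATE = '# /{ISBN} [Book]\n* title: {Title}\n* author: {Author} ({Author_date})\n'


def rows_as_template_fields(fp):
    """Yield one dict per CSV row, with keys usable as str.format field names."""
    for row in csv.DictReader(fp):
        # 'Author date' becomes 'Author_date'; stray whitespace in values is trimmed
        yield {key.replace(' ', '_'): value.strip() for key, value in row.items()}


records = [TEMPLATE.format(**fields)
           for fields in rows_as_template_fields(io.StringIO(CSV_TEXT))]
```

Trimming values matters for this data: the third title in the file carries a trailing space inside its quotes.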
14 changes: 14 additions & 0 deletions demo/ingest/books.json
@@ -0,0 +1,14 @@
[
{"title": "Half of a Yellow Sun",
"author": {"name": "Chimamanda Ngozi Adichie", "date": "1977"},
"publication": {"name": "Fourth Estate", "date": "2006"},
"isbn": "9780008205249"},
{"title": "Things Fall Apart",
"author": {"name": "Chinụalụmọgụ Achebe", "date": "1930"},
"publication": {"name": "William Heinemann Ltd.", "date": "1958"},
"isbn": "9781841593272"},
{"title": "Death and the King's Horseman",
"author": {"name": "Olúwolé Sóyíinká", "date": "1934"},
"publication": {"name": "Eyre Methuen", "date": "1975"},
"isbn": "9780413333506"}
]
226 changes: 226 additions & 0 deletions demo/ingest/csv_to_bibframe.py
@@ -0,0 +1,226 @@
#!/usr/bin/env python
#-*- mode: python -*-
# csv_to_bibframe.py

'''
Demo of Versa Pipeline. Converts a CSV with book info into BIBFRAME Lite

You might first want to be familiar with dc_to_schemaorg.py
and csv_to_schemaorg.py

python demo/ingest/csv_to_bibframe.py demo/ingest/books.csv

http://bibfra.me/
'''

import sys
import random
import warnings
import functools
from pathlib import Path

import click # Cmdline processing tool. pip install click

from amara3 import iri

from versa import ORIGIN, RELATIONSHIP, TARGET
from versa import I, VERSA_BASEIRI, VTYPE_REL, VLABEL_REL
from versa import util
from versa.driver.memory import newmodel
from versa.serial import csv, literate, mermaid
from versa.pipeline import *
from versa.contrib.datachefids import idgen as default_idgen

BOOK_NS = I('https://example.org/')
IMPLICIT_NS = I('http://example.org/vocab/')
BF_NS = I('http://bibfra.me/')



FINGERPRINT_RULES = {
    # Fingerprint each input Book by ISBN; output resource will be a BIBFRAME Instance

    # Outermost parens here are not really needed, used for formatting.
    # You can use an actual tuple here, though, to trigger multiple
    # rules per matched type
    IMPLICIT_NS('Book'): (
        materialize(BF_NS('Instance'),
            fprint=[
                (BF_NS('isbn'), follow(IMPLICIT_NS('identifier'))),
            ],
            links=[
                (BF_NS('provenance'), var('provenance')),
                (BF_NS('instantiates'),
                    materialize(BF_NS('Work'),
                        fprint=[
                            (BF_NS('name'), follow(IMPLICIT_NS('title'))),
                        ],
                    ),
                )
            ]
        )
    )
}


# Data transformation rules. In general this is some sort of link from an
# input pattern being matched to output generated by Versa pipeline actions

# In this case we use a dict keyed by expected relationships from fingerprinted
# resources. Dict values are the action functions that update the output model
# by acting on the provided context (in this case just the triggered
# relationship in the input model)

# Work & instance types
WT = BF_NS('Work')
IT = BF_NS('Instance')


DC_TO_SCH_RULES = {
    # Rules that are the same regardless of matched output resource type
    IMPLICIT_NS('title'): link(rel=BF_NS('name')),

    # Rules differentiated by matched output resource type
    (IMPLICIT_NS('author'), WT): materialize(BF_NS('Person'),
        BF_NS('creator'),
        fprint=[
            (BF_NS('name'), attr(IMPLICIT_NS('name'))),
            (BF_NS('birthDate'), attr(IMPLICIT_NS('date'))),
        ],
        links=[
            (BF_NS('name'), attr(IMPLICIT_NS('name'))),
            (BF_NS('birthDate'), attr(IMPLICIT_NS('date'))),
        ]
    ),
}


LABELIZE_RULES = {
    # Labels come from input model's DC name rels
    BF_NS('Book'): follow(BF_NS('name'))
}


# Just use Python's built-in string.format()
# Could also use e.g. Jinja
VLITERATE_TEMPLATE = '''\
# @docheader
* @iri:
    * @base: https://example.org/
    * @schema: http://example.org/vocab/
# /{ISBN} [Book]
* title: {Title}
* author:
    * name: {Author}
    * date: {Author_date}
* publisher:
    * name: {Publisher}
    * date: {Pub_date}
* identifier: {ISBN}
    * type: isbn
'''


class csv_bibframe_pipeline(definition):
    def __init__(self):
        '''
        csv_bibframe_pipeline initializer
        '''
        self._provenance = I('http://example.com/SOME_CSV_FILE')
        super().__init__()

    @stage(1)
    def fingerprint(self):
        '''
        Generates fingerprints from the source model

        Result of the fingerprinting phase is that the output model shows
        the presence of each resource of primary interest expected to result
        from the transformation, with minimal detail such as the resource type
        '''
        # Prepare a root context
        ctx_vars = {'provenance': self._provenance}
        ctx = DUMMY_CONTEXT.copy(variables=ctx_vars)

        # Apply a common fingerprinting strategy using rules defined above
        new_rids = self.fingerprint_helper(FINGERPRINT_RULES, root_context=ctx)

        # In real code following lines could be simplified to: return bool(new_rids)
        if not new_rids:
            # Nothing found to process, so ret val set to False
            # This will abort pipeline processing of this input & move on to the next, if any
            return False

        # ret val True so pipeline run will continue for this input
        return True


    @stage(2)
    def main_transform(self):
        '''
        Executes the main transform rules to go from input to output model
        '''
        # Apply a common transform strategy using rules defined above
        #
        def missed_rel(link):
            '''
            Callback to handle cases where a transform wasn't found to match a link (by relationship) in the input model
            '''
            warnings.warn(f'Unknown, so unhandled link. Origin: {link[ORIGIN]}. Rel: {link[RELATIONSHIP]}')

        new_rids = self.transform_by_rel_helper(DC_TO_SCH_RULES, handle_misses=missed_rel)
        return True


    @stage(3)
    def labelize(self):
        '''
        Executes a utility rule to create labels in output model for new (fingerprinted) resources
        '''
        # XXX Check if there's already a label?
        # Apply a common transform strategy using rules defined above
        def missed_label(origin, type):
            '''
            Callback to handle cases where no label rule was found to match a resource (by type) in the output model
            '''
            warnings.warn(f'No label generated for: {origin}')
        labels = self.labelize_helper(LABELIZE_RULES, handle_misses=missed_label)
        return True


@click.command()
@click.argument('source')
def main(source):
    'Transform CSV SOURCE file to BF Lite in Versa'
    ppl = csv_bibframe_pipeline()
    input_model = newmodel()
    with open(source) as csvfp:
        for row_model in csv.parse_iter(csvfp, VLITERATE_TEMPLATE):
            if row_model: input_model.update(row_model)

    # Debug print of input model
    # literate.write([input_model], out=sys.stdout)
    output_model = ppl.run(input_model=input_model)
    print('Low level JSON dump of output data model: ')
    util.jsondump(output_model, sys.stdout)
    print('\n') # 2 CRs
    print('Versa literate form of output: ')
    literate.write(output_model, out=sys.stdout)

    print('Diagram from an extracted sample: ')
    out_resources = []
    for vs in ppl.fingerprints.values():
        out_resources.extend(vs)
    ITYPE = BF_NS('Instance')
    instances = [ r for r in out_resources if ITYPE in util.resourcetypes(output_model, r) ]
    zoomed, _ = util.zoom_in(output_model, random.choice(instances), depth=2)
    mermaid.write(zoomed)
    # literate.write(zoomed)


if __name__ == '__main__':
    main()
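
The fingerprint rules in this demo identify each output resource by a small set of values, for example a BIBFRAME Instance by its ISBN. The general idea, that equal fingerprint values should always lead to the same resource ID, can be sketched with a hash. The `fingerprint_id` function below is purely illustrative: it is not the `idgen` from `versa.contrib.datachefids`, and the choice of SHA-256 truncated to 16 hex characters is an arbitrary assumption.

```python
import hashlib

BF = 'http://bibfra.me/'


def fingerprint_id(resource_type: str, fields) -> str:
    """Derive a stable ID from a resource type plus (relationship, value) pairs."""
    digest = hashlib.sha256()
    digest.update(resource_type.encode('utf-8'))
    # Sort so that the order in which fields are listed does not change the ID
    for rel, value in sorted(fields):
        digest.update(b'\x00' + rel.encode('utf-8') + b'\x00' + value.encode('utf-8'))
    return digest.hexdigest()[:16]


first = fingerprint_id(BF + 'Instance', [(BF + 'isbn', '9781841593272')])
again = fingerprint_id(BF + 'Instance', [(BF + 'isbn', '9781841593272')])
other = fingerprint_id(BF + 'Instance', [(BF + 'isbn', '9780413333506')])
```

With IDs derived this way, two input rows describing the same ISBN would collapse into one Instance in the output model, while distinct ISBNs stay separate.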