Situating a FAqT Brick into csv2rdf4lod automation
See also: csv2rdf4lod-automation's SDV organization.
The FAqT Brick page talks about how to specify and execute an analysis with:
- A set of evaluation services
- A set of datasets to evaluate
- A set of epochs over time at which the evaluations are performed
Conceptually, this is exactly what we need to achieve our Linked Open __meta__Data goals. But practically, it has been difficult to work with, share, and replicate, since it spreads these three dimensions across a couple of hierarchies on the file system. In contrast, the "source-dataset-version" organization fostered by csv2rdf4lod-automation provides an approachable, sharable, and replicable structure. So, how can we combine the best of both worlds? That's what we tackle here.
The organization of a FAqT Brick was designed independently of csv2rdf4lod-automation, but the "source-dataset-version" (SDV) organization principles that csv2rdf4lod-automation fosters are compelling in practical applications, so this page explores how a FAqT Brick can live within a data conversion root. As a concrete example, we'll figure out how to use the https://github.com/timrdf/lodcloud project to reproduce http://oeg-dev.dia.fi.upm.es/licensius/blog/?q=lodlicenses.
First, we need to review the organizing schemes that csv2rdf4lod and DataFAQs use.
csv2rdf4lod organizes datasets by forming a hierarchy out of the following aspects:
- source
- dataset
- version
Using these aspects, we can create the URIs:
- http://datafaqs.tw.rpi.edu/source/epa-gov (a foaf:Organization)
- http://datafaqs.tw.rpi.edu/source/epa-gov/dataset/air-quality-system (an abstract dataset; union of versions)
- http://datafaqs.tw.rpi.edu/source/epa-gov/dataset/air-quality-system/version/2013-Jan-01 (a concrete dataset of triples created from one retrieval of EPA's data files)
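For orientation, here is a sketch of how those three aspects lay out on disk within a conversion root, using the example above (the data/source prefix matches the paths used later on this page):

```
data/
└── source/
    └── epa-gov/                    # source
        └── air-quality-system/     # dataset
            └── version/
                └── 2013-Jan-01/    # version
```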
A FAqT Brick in DataFAQs also uses three aspects, but they are different. Conceptually, they are also not strictly hierarchical like csv2rdf4lod's. Practically, the cube is decomposed into three hierarchies on the file system:
- epoch - dataset
- faqt - epoch
- faqt - dataset - epoch
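To make that decomposition concrete, here is a sketch of how the three hierarchies appear on disk, inferred from the find commands later on this page (the service and dataset names are illustrative placeholders):

```
faqt-brick/
├── __PIVOT_epoch/
│   └── 2013-06-12/                            # epoch - dataset
│       └── __PIVOT_dataset/<dataset-id>/
│           └── post.nt.rdf
└── __PIVOT_faqt/<service-id>/
    ├── __PIVOT_epoch/2013-06-12/              # faqt - epoch
    └── __PIVOT_dataset/<dataset-id>/
        └── __PIVOT_epoch/2013-06-12/          # faqt - dataset - epoch
            └── evaluation.rdf
```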
Using the prefixes df: and cr: to distinguish terminology scope, the df:epoch aspect is analogous to cr:version, since each time the FAqT Brick is run we have a new subset of data.
df:dataset is NOT like cr:dataset, since df:dataset is a multi-element dimension in DataFAQs while cr:dataset is just the name of the bucket of data being gathered (this is the distinction between metadata and data; DataFAQs handles the former and csv2rdf4lod the latter).
df:faqt does not have an analog in csv2rdf4lod. Like df:dataset, it is a multi-element dimension; it enumerates the evaluation services that provide metadata about each element in the df:dataset dimension.
cr:dataset is analogous to the fixed specification of the <df:dataset, df:faqt> pair.
csv2rdf4lod requires the following aspects, in this order:
- source identifier - this is a short string that names the person or organization that provided the dataset. Since <df:dataset, df:faqt> defines our cr:dataset, we are the source organization. For the lodcloud project, "us" is the source identifier that we use to name ourselves. So, we'll work within data/source/us.
- dataset identifier - this is a short string that the source organization uses to name the set of data that they provide. Since <df:dataset, df:faqt> defines our cr:dataset, we need to choose an identifier for it. In our example, we're reproducing the licensing survey, so we'll choose the string "how-o-is-lod" to create data/source/us/how-o-is-lod.
- version identifier - this is a short string that names the "update/revision/release" of the dataset identified above. Unfortunately, DataFAQs' notion of "version" (df:epoch) is spread throughout a FAqT Brick, so it isn't as easy to follow csv2rdf4lod here. Also, since DataFAQs currently requires that it work from a directory called "faqt-brick", we'll choose that as the "all versions" version identifier and create data/source/us/how-o-is-lod/version/faqt-brick (see the sketch after this list).
Automated creation of a new Versioned Dataset provides some conventions for where to situate triggers that csv2rdf4lod-automation can recognize to automate the reconstruction of a dataset. In DataFAQs, an "epoch.ttl" file sits at the root to specify which evaluations should be performed. This aligns with choosing "faqt-brick" as the version identifier above, resulting in its placement at data/source/us/how-o-is-lod/version/faqt-brick/epoch.ttl.
As we mentioned before, DataFAQs' epochs are spread throughout the faqt-brick directory, so it does not provide a direct way to "package up" each evaluation run as a "version". If we want to expose a single FAqT Brick epoch as a csv2rdf4lod version, we need a widget to walk the faqt-brick and output a "one-click" file following the csv2rdf4lod convention of data/source/us/how-o-is-lod/version/2013-Jun-15/publish/us-how-o-is-lod-2013-Jun-15.ttl.gz (assuming epoch->version "2013-Jun-15"; also, note the new directory version/2013-Jun-15). This might be as easy as building the right unix find command and feeding the file paths to aggregate-source-rdf.sh (we've implemented this with df-publish-epoch-via-cr.sh).
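A minimal sketch of that idea, run from the new version/2013-Jun-15 directory; we assume aggregate-source-rdf.sh accepts RDF file paths as arguments and that the epoch directory is named by date (df-publish-epoch-via-cr.sh packages this logic):

```bash
# Sketch: collect one epoch's RDF file paths from the sibling faqt-brick
# and feed them to aggregate-source-rdf.sh.
epoch=2013-06-15   # assumed epoch directory name for version 2013-Jun-15
{ find ../faqt-brick/__PIVOT_epoch/$epoch/__PIVOT_dataset -name post.nt.rdf
  find ../faqt-brick/__PIVOT_faqt -name evaluation.rdf | grep "__PIVOT_epoch/$epoch"
} | xargs aggregate-source-rdf.sh
```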
To find all RDF that is POSTed to the FAqT services in a given epoch:
find __PIVOT_epoch/2013-06-12/__PIVOT_dataset -name post.nt.rdf
Creating a version directory (mkdir -p ../2013-06-12/publish) and piping this list through xargs -n 1 rdf2nt.sh >> ../2013-06-12/publish/posted.nt will situate the aggregation into csv2rdf4lod-automation's publishing convention.
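Put together, that looks like the following (a sketch, run from inside faqt-brick; we assume rdf2nt.sh writes each file's triples as N-Triples to standard output):

```bash
# Aggregate the RDF POSTed to FAqT services during the 2013-06-12 epoch
# into a sibling csv2rdf4lod version directory.
mkdir -p ../2013-06-12/publish
find __PIVOT_epoch/2013-06-12/__PIVOT_dataset -name post.nt.rdf |
  xargs -n 1 rdf2nt.sh >> ../2013-06-12/publish/posted.nt
```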
To find all RDF that is returned by the FAqT services in a given epoch:
find __PIVOT_faqt -name evaluation.rdf | grep __PIVOT_epoch/2013-06-12
Creating a version directory (mkdir -p ../2013-06-12/publish) and piping this list through xargs -n 1 rdf2nt.sh >> ../2013-06-12/publish/evaluations.nt will situate the aggregation into csv2rdf4lod-automation's publishing convention.
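And the analogous sketch for the evaluation results (same assumptions as above):

```bash
# Aggregate the RDF returned by FAqT services during the 2013-06-12 epoch.
mkdir -p ../2013-06-12/publish
find __PIVOT_faqt -name evaluation.rdf | grep __PIVOT_epoch/2013-06-12 |
  xargs -n 1 rdf2nt.sh >> ../2013-06-12/publish/evaluations.nt
```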