Situating a FAqT Brick into csv2rdf4lod automation
The FAqT Brick page talks about how to specify and execute an analysis with:
- A set of evaluation services
- A set of datasets to evaluate
- Performed at different epochs over time
Conceptually, this is exactly what we need to achieve our Linked Open __meta__Data goals. But practically, it has been difficult to work with, share, and replicate.
In contrast, the "source-dataset-version" organization fostered by csv2rdf4lod-automation provides approachable, sharable, and replicable structure.
So, how can we combine the best of both worlds? That's what we try to tackle here.
The organization of a FAqT Brick was designed independently of csv2rdf4lod-automation, but the "SDV organization" principles that csv2rdf4lod-automation fosters are compelling within practical applications, so this page explores how a FAqT Brick can live within a data conversion root. As a concrete example, we'll figure out how to use the https://github.com/timrdf/lodcloud project to reproduce http://www.licensius.com/blog/lodlicenses.
First, we need to review the organizing schemes that csv2rdf4lod and DataFAQs use.
csv2rdf4lod organizes datasets by forming a hierarchy out of the following aspects:
- source
- dataset
- version
Using these aspects, we can create the URIs:
- http://datafaqs.tw.rpi.edu/source/epa-gov (a foaf:Organization)
- http://datafaqs.tw.rpi.edu/source/epa-gov/dataset/air-quality-system (an abstract dataset; union of versions)
- http://datafaqs.tw.rpi.edu/source/epa-gov/dataset/air-quality-system/version/2013-Jan-01 (a concrete dataset of triples created from one retrieval of EPA's data files)
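For orientation, here is how those same identifiers nest on disk; a minimal sketch, assuming the data/source directory pattern that this page uses later (e.g., data/source/us/how-o-is-lod/version/faqt-brick):

```bash
# csv2rdf4lod's source -> dataset -> version hierarchy on disk
# (same identifiers as the URIs above):
#
#   data/source/epa-gov/                                          source
#   data/source/epa-gov/air-quality-system/                       dataset
#   data/source/epa-gov/air-quality-system/version/2013-Jan-01/   version
mkdir -p data/source/epa-gov/air-quality-system/version/2013-Jan-01
```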
A FAqT Brick in DataFAQs also uses three aspects, but they are different, and they are not strictly hierarchical the way csv2rdf4lod's are; they combine in the following ways:
- epoch - dataset
- faqt - epoch
- faqt - dataset - epoch
Using df: and cr: to distinguish terminology scope, the df:epoch aspect is analogous to cr:version, since each time the FAqT Brick is run we have a new subset of data.
df:dataset is NOT like cr:dataset: df:dataset is a multi-element dimension in DataFAQs, while cr:dataset is just the name of the bucket of data being gathered (this is the distinction between metadata and data; DataFAQs handles the former and csv2rdf4lod the latter).
df:faqt has no analog in csv2rdf4lod. Like df:dataset, it is a multi-element dimension: it enumerates the evaluation services that provide metadata about each element of the df:dataset dimension.
cr:dataset is analogous to the fixed specification of the df:dataset and df:faqt dimensions taken together.
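To make the contrast concrete, here is a sketch of how those aspects appear as directories inside a FAqT Brick. This layout is inferred from the find commands used later on this page; the &lt;service&gt; and &lt;dataset&gt; placeholders are illustrative, not normative:

```
faqt-brick/
├── __PIVOT_epoch/2013-06-12/__PIVOT_dataset/<dataset>/post.nt.rdf
└── __PIVOT_faqt/<service>/__PIVOT_epoch/2013-06-12/.../evaluation.rdf
```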
csv2rdf4lod requires the following aspects, in this order:
- source identifier - a short string that names the agent (person or organization) that provided the dataset. Since <df:dataset, df:faqt> defines our cr:dataset, we are the source organization. For the lodcloud project, "us" is the source identifier that we use to name ourselves. So, we'll work within data/source/us.
- dataset identifier - a short string that the source organization uses to name the set of data that they provide. Since <df:dataset, df:faqt> defines our cr:dataset, we need to choose an identifier for it. In our example, we're reproducing the licensing survey, so we'll choose the string "how-o-is-lod" to create data/source/us/how-o-is-lod.
- version identifier - a short string that names the "update/revision/release" of the dataset identified above. Unfortunately, DataFAQs' notion of "version" (df:epoch) is spread throughout a FAqT Brick, so it isn't as easy to follow csv2rdf4lod here. Also, since DataFAQs currently requires that it work from a directory called "faqt-brick", we'll choose that as the "all versions" version identifier and create data/source/us/how-o-is-lod/version/faqt-brick (see the sketch after this list).
Automated creation of a new Versioned Dataset provides some conventions for where to situate triggers that csv2rdf4lod-automation can recognize to automate the reconstruction of a dataset. In DataFAQs, an "epoch.ttl" file sits at the root to specify what evaluations should be performed. This aligns with choosing "faqt-brick" as the version identifier above, resulting in its placement at data/source/us/how-o-is-lod/version/faqt-brick/epoch.ttl.
As we mentioned before, DataFAQs' epochs are spread throughout the faqt-brick directory root, so it does not provide a direct way to "package up" each evaluation-run "version". If we want to expose a single FAqT Brick epoch as a csv2rdf4lod version, we will need a widget that walks the faqt-brick and repackages one epoch into the csv2rdf4lod convention of data/source/us/how-o-is-lod/version/2013-Jun-15/publish/us-how-o-is-lod-2013-Jun-15.ttl.gz (assuming the epoch maps to version "2013-Jun-15"; also, note the new directory version/2013-Jun-15). This might be as easy as building the right unix find command and feeding the file paths to aggregate-source-rdf.sh, as sketched at the end of this page.
To find all RDF that is POSTed to the FAqT services in a given epoch:
find __PIVOT_epoch/2013-06-12/__PIVOT_dataset -name post.nt.rdf
Creating a version directory (mkdir -p ../2013-06-12/publish) and piping this list to xargs -n 1 rdf2nt.sh >> ../2013-06-12/publish/posted.nt will situate the aggregation into csv2rdf4lod-automation's publishing convention.
To find all RDF that is returned by the FAqT services in a given epoch:
find __PIVOT_faqt -name evaluation.rdf | grep __PIVOT_epoch/2013-06-12
Creating a version directory (mkdir -p ../2013-06-12/publish) and piping this list to xargs -n 1 rdf2nt.sh >> ../2013-06-12/publish/evaluations.nt will situate the aggregation into csv2rdf4lod-automation's publishing convention.
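Putting both steps together, here is a minimal sketch of the "widget" described above. It assumes it is run from inside the faqt-brick directory, that rdf2nt.sh is on the PATH, and that epoch directories are named by date; the same file lists could instead be fed to aggregate-source-rdf.sh:

```bash
#!/bin/bash
# Sketch: expose one FAqT Brick epoch as a csv2rdf4lod version.
epoch="2013-06-12"   # the epoch to package as a version

mkdir -p ../"$epoch"/publish

# RDF POSTed to the FAqT services during this epoch:
find __PIVOT_epoch/"$epoch"/__PIVOT_dataset -name post.nt.rdf |
  xargs -n 1 rdf2nt.sh >> ../"$epoch"/publish/posted.nt

# RDF returned by the FAqT services during this epoch:
find __PIVOT_faqt -name evaluation.rdf | grep __PIVOT_epoch/"$epoch" |
  xargs -n 1 rdf2nt.sh >> ../"$epoch"/publish/evaluations.nt
```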