Situating a FAqT Brick into csv2rdf4lod automation
FAqT Brick talks about how to specify and execute an analysis with:
- A set of evaluation services
- A set of datasets to evaluate
- Performed at different epochs over time
Conceptually, this is exactly what we need to achieve our Linked Open metaData goals. But practically, it has been difficult to work with, share, and replicate.
In contrast, the "source-dataset-version" organization fostered by csv2rdf4lod-automation provides approachable, sharable, and replicable structure.
So, how can we combine the best of both worlds? That's what we try to tackle here.
The organization of a FAqT Brick was designed independently of csv2rdf4lod-automation, but the "source-dataset-version" (SDV) organization principles that csv2rdf4lod-automation fosters are compelling in practical applications, so this page explores how a FAqT Brick can live within a data conversion root. As a concrete example, we'll figure out how to use the https://github.com/timrdf/lodcloud project to reproduce http://www.licensius.com/blog/lodlicenses.
First, we need to review the organizing schemes that csv2rdf4lod and DataFAQs use.
csv2rdf4lod organizes datasets by forming a hierarchy out of the following aspects:
- source
- dataset
- version
Using these aspects, we can create the URIs:
- http://datafaqs.tw.rpi.edu/source/epa-gov (a foaf:Organization)
- http://datafaqs.tw.rpi.edu/source/epa-gov/dataset/air-quality-system (an abstract dataset; union of versions)
- http://datafaqs.tw.rpi.edu/source/epa-gov/dataset/air-quality-system/version/2013-Jan-01 (a concrete dataset of triples created from one retrieval of EPA's data files)
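The URI pattern above can be sketched as a small helper. This is only an illustration: the function name `sdv_uris` is hypothetical and not part of csv2rdf4lod-automation, and the base URI and identifiers are simply the ones from the examples above.

```python
# Base URI taken from the example URIs above.
BASE = "http://datafaqs.tw.rpi.edu"


def sdv_uris(source, dataset, version, base=BASE):
    """Return the (source, abstract dataset, versioned dataset) URIs.

    Hypothetical helper: it only mirrors the source/dataset/version
    hierarchy shown in the examples above.
    """
    source_uri = f"{base}/source/{source}"
    dataset_uri = f"{source_uri}/dataset/{dataset}"
    version_uri = f"{dataset_uri}/version/{version}"
    return source_uri, dataset_uri, version_uri


uris = sdv_uris("epa-gov", "air-quality-system", "2013-Jan-01")
for uri in uris:
    print(uri)
```

Each URI extends the previous one, which is what makes the scheme strictly hierarchical.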
A FAqT Brick in DataFAQs also uses three aspects, but they are different and, unlike csv2rdf4lod's, not strictly hierarchical.
- epoch - dataset
- faqt - epoch
- faqt - dataset - epoch
Using df: and cr: to distinguish terminology scope, the df:epoch aspect is analogous to cr:version, since each time the FAqT Brick is run we have a new subset of data.
df:dataset is NOT like cr:dataset, since df:dataset is a multi-element dimension in DataFAQs while cr:dataset is just the name of the bucket of data that is being gathered (this is the distinction between metadata and data; DataFAQs does the former and csv2rdf4lod does the latter).
df:faqt does not have an analog in csv2rdf4lod. Like df:dataset, it is the multi-element dimension of the evaluation service that provides metadata about each of the elements in the df:dataset dimension.
cr:dataset is analogous to the fixed specification of df:dataset and df:faqt.
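One way to picture this correspondence is the sketch below. It is not how DataFAQs represents a FAqT Brick: the names `BrickSpec` and `evaluations` and all example values are hypothetical placeholders. It assumes only what is stated above, namely that the fixed pair of df:dataset and df:faqt dimensions plays the role of cr:dataset and that each df:epoch plays the role of a cr:version.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BrickSpec:
    """Fixed specification of a brick (hypothetical); analogous to one cr:dataset."""
    datasets: tuple  # elements of the df:dataset dimension
    faqts: tuple     # elements of the df:faqt dimension


def evaluations(spec, epoch):
    """List the faqt - dataset - epoch cells produced by one run of the brick.

    Each run (df:epoch) is analogous to a new cr:version of the same cr:dataset.
    """
    return [(faqt, dataset, epoch)
            for faqt, dataset in product(spec.faqts, spec.datasets)]


# Placeholder names, not real datasets or evaluation services.
spec = BrickSpec(datasets=("dataset-a", "dataset-b"), faqts=("faqt-x",))
cells = evaluations(spec, "2013-01-01")
for cell in cells:
    print(cell)
```

Running the same `spec` at a later epoch would yield a new set of cells while the specification itself stays unchanged, which is the sense in which the specification behaves like a cr:dataset.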
csv2rdf4lod requires the following aspects, in this order:
- source identifier - this is a short string that names the agent (person or organization) that provided the dataset. Since <df:dataset, df:faqt> defines our cr:dataset, we are the source organization. For the lodcloud project, "us" is the source identifier that we use to name ourselves. So, we'll work within https://github.com/timrdf/lodcloud/tree/master/data/source/us.
- dataset identifier - this is a short string that the source organization uses to name the set of data that they provide. Since <df:dataset, df:faqt> defines our cr:dataset, we need to choose an identifier for it. In our example, we're reproducing the licensing survey, so we'll choose the string "how-o-is-lod" to create https://github.com/timrdf/lodcloud/tree/master/data/source/us/how-o-is-lod.
- version identifier - this is a short string that names the "update/revision/release" of the dataset identified above. Unfortunately, DataFAQs' notion of "version" (df:epoch) is spread throughout a FAqT Brick, so it isn't as easy to follow csv2rdf4lod here. Also, since DataFAQs currently requires that it work from a directory called "faqt-brick", we'll choose that as the "all versions" version identifier and create https://github.com/timrdf/lodcloud/tree/master/data/source/us/how-o-is-lod/version/faqt-brick.
Automated creation of a new Versioned Dataset provides some conventions for where to situate triggers that csv2rdf4lod-automation can recognize to automate the reconstruction of a dataset. In DataFAQs, an "epoch.ttl" file sits at the root to specify what evaluations should be performed. This aligns with choosing "faqt-brick" as the version identifier above, resulting in its placement at https://github.com/timrdf/lodcloud/blob/master/data/source/us/how-o-is-lod/version/faqt-brick/epoch.ttl.