FAqT Brick

timrdf edited this page Jan 19, 2012 · 130 revisions

A FAqT brick contains the materials created when using FAqT Services to analyze a set of datasets. A FAqT brick starts as a directory structure whose contents are then loaded into a SPARQL endpoint. Each time an analysis is performed, a new slice is added to the brick for the current time frame, or epoch. The three dimensions of a FAqT brick are dataset, evaluation service, and epoch, as illustrated below.

The three dimensions of a FAqT brick are *dataset*, *evaluation service*, and *epoch*

Directory conventions

You can choose any location for a FAqT brick directory, and you can have many FAqT bricks for different purposes. A FAqT brick's root directory must be named faqt-brick, because the core services follow directory conventions rooted on this name. For example, we can create a FAqT brick directory with the following commands:

mkdir ~/lebo/Desktop/faqt-brick
cd ~/lebo/Desktop/faqt-brick
datafaqs-evaluate.sh --help

datafaqs-evaluate.sh is available after Installing DataFAQs and prints usage similar to the following:

 usage: datafaqs-evaluate.sh [-n] [--force-epoch | --reuse-epoch <existing-epoch>]
                                  [--faqts    <rdf-file> <service-uri>]
                                  [--datasets <rdf-file> <service-uri>]

            -n: perform dry run (not implemented yet).

       --faqts: override the service-uri and its input (to evaluate with a different set of FAqT evaluation services).

    --datasets: override the service-uri and its input (to evaluate a different set of datasets).

 --force-epoch: force new epoch; ignore 'once per day' convention.

 --reuse-epoch: reapply FAqT evaluation services to datasets in existing epoch. Takes precedence over --force-epoch.
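
Because the core services rely on the faqt-brick directory name, a wrapper script could check the name before doing anything else. The following is only a sketch: is_faqt_brick_root is a hypothetical helper, not part of DataFAQs.

```bash
# Hypothetical guard (not part of DataFAQs): succeed only when the given
# directory (default: the current directory) is named faqt-brick.
is_faqt_brick_root() {
  local dir="${1:-$PWD}"
  [ "$(basename "$dir")" = "faqt-brick" ]
}
```

For example, `is_faqt_brick_root || echo "please cd into a faqt-brick directory"` warns when run from anywhere else.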

Creating a slice

Running datafaqs-evaluate.sh will create a FAqT brick slice using a default configuration. Its output reports:

  • the name of the epoch it is going to create (e.g. 2012-01-13), then
  • the DataFAQs Core Service (e.g. via-sparql-query) that it will use to obtain a list of FAqT services to apply, then
  • the DataFAQs Core Service (e.g. by-ckan-group) that it will use to obtain a list of datasets to evaluate, and finally
  • the DataFAQs Core Service (e.g. with-preferred-uri-and-ckan-meta-void) that it will use to obtain descriptions for each dataset.

mkdir ~/lebo/Desktop/faqt-brick
cd ~/lebo/Desktop/faqt-brick
datafaqs-evaluate.sh

[INFO] Using datafaqs.localhost/epochs/2012-01-13 
[INFO] Requesting FAqT services from 
       http://sparql.tw.rpi.edu/services/datafaqs/core/select-faqts/via-sparql-query
[INFO] Requesting datasets from 
       http://sparql.tw.rpi.edu/services/datafaqs/core/select-datasets/by-ckan-group
[INFO] Requesting dataset descriptions from 
       http://sparql.tw.rpi.edu/services/datafaqs/core/augment-datasets/with-preferred-uri-and-ckan-meta-void

After datafaqs-evaluate.sh lists the FAqT Services and dataset URIs, it gathers RDF descriptions of the datasets. It shows the URIs that it requests to accumulate descriptions about each dataset, along with the first line of each response.

[INFO] 5 FAqT services will evaluate 3 datasets.

[INFO] FAqT Services:

[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/lodcloud/max-1-topic-tag
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/predicate-counter
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/redirect-loop
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/void-properties
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples

[INFO] CKAN Datasets:

[INFO] http://thedatahub.org/dataset/congresspeople
[INFO] http://thedatahub.org/dataset/farmers-markets-geographic-data-united-states
[INFO] http://thedatahub.org/dataset/white-house-visitor-access-records


[INFO] Gathering information about CKAN Datasets, for input to FAqT evaluation services.

thedatahub.org/dataset/congresspeople (1/3)
   <?xml version="1.0" encoding="utf-8"?>
   1: http://logd.tw.rpi.edu/source/contactingthecongress/dataset/directory-for-the-112th-congress
      <?xml version="1.0" encoding="utf-8" ?>

thedatahub.org/dataset/farmers-markets-geographic-data-united-states (2/3)
   <?xml version="1.0" encoding="utf-8"?>
   1: http://logd.tw.rpi.edu/source/data-gov/dataset/4383/version/2011-Nov-29
      <?xml version="1.0" encoding="utf-8" ?>
   2: http://logd.tw.rpi.edu/source/data-gov/file/4383/version/2011-Nov-29/conversion/data-gov-4383-2011-Nov-29.void.ttl
      @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

thedatahub.org/dataset/white-house-visitor-access-records (3/3)
   <?xml version="1.0" encoding="utf-8"?>

The accumulated dataset description responses are then submitted to each FAqT service, so that each service has some basic information to start with when performing its evaluation. The RDF that each FAqT service returns is stored, and its size and format are reported by datafaqs-evaluate.sh.

[INFO] Submitting CKAN dataset information to FAqT evaluation services.

[INFO] dataset 1/3, FAqT 1/5 (1/15 total)
[INFO] http://thedatahub.org/dataset/congresspeople
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/predicate-counter
[INFO] 32K of  results

[INFO] dataset 1/3, FAqT 2/5 (2/15 total)
[INFO] http://thedatahub.org/dataset/congresspeople
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/void-properties
[INFO] 32K of  results

[INFO] dataset 1/3, FAqT 3/5 (3/15 total)
[INFO] http://thedatahub.org/dataset/congresspeople
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/lodcloud/max-1-topic-tag
[INFO] 4.0K of text/turtle results

[INFO] dataset 1/3, FAqT 4/5 (4/15 total)
[INFO] http://thedatahub.org/dataset/congresspeople
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/redirect-loop
[INFO] 4.0K of text/turtle results

[INFO] dataset 1/3, FAqT 5/5 (5/15 total)
[INFO] http://thedatahub.org/dataset/congresspeople
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples
[INFO] 4.0K of text/turtle results

[INFO] dataset 2/3, FAqT 1/5 (6/15 total)
[INFO] http://thedatahub.org/dataset/white-house-visitor-access-records
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/redirect-loop
[INFO] 4.0K of  results

[INFO] dataset 2/3, FAqT 2/5 (7/15 total)
[INFO] http://thedatahub.org/dataset/white-house-visitor-access-records
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/lodcloud/max-1-topic-tag
[INFO] 4.0K of  results

[INFO] dataset 2/3, FAqT 3/5 (8/15 total)
[INFO] http://thedatahub.org/dataset/white-house-visitor-access-records
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/predicate-counter
[INFO] 4.0K of  results

[INFO] dataset 2/3, FAqT 4/5 (9/15 total)
[INFO] http://thedatahub.org/dataset/white-house-visitor-access-records
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples
[INFO] 4.0K of  results

[INFO] dataset 2/3, FAqT 5/5 (10/15 total)
[INFO] http://thedatahub.org/dataset/white-house-visitor-access-records
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/void-properties
[INFO] 4.0K of  results

[INFO] dataset 3/3, FAqT 1/5 (11/15 total)
[INFO] http://thedatahub.org/dataset/farmers-markets-geographic-data-united-states
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/void-properties
[INFO] 15M of  results

[INFO] dataset 3/3, FAqT 2/5 (12/15 total)
[INFO] http://thedatahub.org/dataset/farmers-markets-geographic-data-united-states
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/lodcloud/max-1-topic-tag
[INFO] 15M of  results

[INFO] dataset 3/3, FAqT 3/5 (13/15 total)
[INFO] http://thedatahub.org/dataset/farmers-markets-geographic-data-united-states
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/predicate-counter
[INFO] 15M of  results

[INFO] dataset 3/3, FAqT 4/5 (14/15 total)
[INFO] http://thedatahub.org/dataset/farmers-markets-geographic-data-united-states
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/redirect-loop
[INFO] 15M of  results

[INFO] dataset 3/3, FAqT 5/5 (15/15 total)
[INFO] http://thedatahub.org/dataset/farmers-markets-geographic-data-united-states
[INFO] http://sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples
[INFO] 15M of  results
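
The progress counters in the log above come from pairing every dataset with every FAqT service. A minimal sketch of that nested iteration follows. evaluation_progress is a hypothetical name, and the real script may order or format its loop differently.

```bash
# Hypothetical sketch: print one progress line per (dataset, FAqT service) pair.
# $1 = number of datasets, $2 = number of FAqT services.
evaluation_progress() {
  local datasets="$1" faqts="$2"
  local total=$(( datasets * faqts ))
  local n=0 d=1 f
  while [ "$d" -le "$datasets" ]; do
    f=1
    while [ "$f" -le "$faqts" ]; do
      n=$(( n + 1 ))   # running count across all pairs
      echo "[INFO] dataset $d/$datasets, FAqT $f/$faqts ($n/$total total)"
      f=$(( f + 1 ))
    done
    d=$(( d + 1 ))
  done
}
```

Calling `evaluation_progress 3 5` prints 15 lines, one for each of the 3 x 5 evaluations that make up the slice.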

The following illustrates the process of

  • (1) obtaining a dataset list from CKAN,
  • (2) obtaining a list of FAqT evaluation services from the SADI registry,
  • (3) obtaining descriptions of the dataset via URI dereference and VoID files,
  • (4) obtaining (via GET) a description of the FAqT evaluation service, and
  • (5) POSTing the dataset description to each FAqT evaluation service to obtain an evaluation described in RDF.

This process is done for each dataset and FAqT evaluation service to create a single slice of the FAqT brick.

dataset descriptions are collected before giving them to each FAqT service for evaluation
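
Steps (4) and (5) are plain HTTP requests, so they can be tried by hand with curl. This is a sketch under assumptions: the function names are hypothetical, and the text/turtle Accept and Content-Type headers are a guess at what a FAqT service expects, not something datafaqs-evaluate.sh is documented to send.

```bash
# Step (4): GET the RDF self-description of a FAqT evaluation service.
# The Accept header is an assumption.
get_faqt_description() {
  local service_uri="$1"
  curl -sL -H 'Accept: text/turtle' "$service_uri"
}

# Step (5): POST an accumulated dataset description (e.g. post.ttl) to a
# FAqT evaluation service and print the RDF evaluation it returns.
# Both headers are assumptions.
post_dataset_description() {
  local service_uri="$1" post_file="$2"
  curl -s \
       -H 'Content-Type: text/turtle' \
       -H 'Accept: text/turtle' \
       --data-binary "@$post_file" \
       "$service_uri"
}
```

For example, `post_dataset_description http://sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples post.ttl > evaluation.ttl` would capture one evaluation, assuming the service is reachable.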

Storing the FAqT evaluation service descriptions

FAqT evaluation services describe themselves upon HTTP GET requests

When their URI is requested, FAqT evaluation services provide RDF descriptions of themselves. These are stored in a file faqt-service.ttl that is nested by both the faqt and the epoch. For example, the RDF that was returned by requesting the FAqT service http://sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples during epoch 2012-01-13 is stored at:

faqt-brick/
   sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples/__PIVOT_epoch/
      2012-01-13/faqt-service.ttl
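
The path convention can be expressed as a small function: drop the URI scheme from the service URI and nest the epoch under __PIVOT_epoch. faqt_service_path is a hypothetical helper written for illustration. The output is relative to the faqt-brick root.

```bash
# Hypothetical helper: where a FAqT service's self-description is stored,
# relative to the faqt-brick root.
# $1 = FAqT service URI, $2 = epoch name.
faqt_service_path() {
  local service_uri="$1" epoch="$2"
  local rel="${service_uri#*://}"   # strip the "http://" scheme
  echo "$rel/__PIVOT_epoch/$epoch/faqt-service.ttl"
}
```

`faqt_service_path http://sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples 2012-01-13` prints sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples/__PIVOT_epoch/2012-01-13/faqt-service.ttl.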

Storing the CKAN dataset descriptions

dataset descriptions that will be POSTed to the FAqT evaluation service

The accumulated descriptions of the CKAN datasets are stored in a file post.ttl that is nested by both the epoch and the dataset. For example, the RDF that is POSTed to all FAqT services during epoch 2012-01-13 to evaluate dataset http://thedatahub.org/dataset/farmers-markets-geographic-data-united-states is stored at:

faqt-brick/
   datafaqs.localhost/epochs/2012-01-13/__PIVOT_dataset/
      thedatahub.org/dataset/farmers-markets-geographic-data-united-states/post.ttl

The content of post.ttl is the union of the files:

faqt-brick/
   datafaqs.localhost/epochs/2012-01-13/__PIVOT_dataset/
      thedatahub.org/dataset/farmers-markets-geographic-data-united-states/part-*.{ttl,rdf,nt}

part-0.{ttl,rdf,nt} is the result of dereferencing the URI, while remaining part- files come from other resources such as the VoID file or dereferencing the dataset's con:preferredURIs (as provided by an augment-dataset service; see DataFAQs Core Services).
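
To inspect what went into a given post.ttl, it helps to locate the dataset's directory within the epoch and list its part- files. The helpers below are a hypothetical sketch of the layout described above. They do not merge the RDF, since combining files in different serializations needs an RDF parser.

```bash
# Hypothetical helper: directory holding post.ttl and part-* for one dataset,
# relative to the faqt-brick root. $1 = dataset URI, $2 = epoch name.
dataset_dir() {
  local dataset_uri="$1" epoch="$2"
  echo "datafaqs.localhost/epochs/$epoch/__PIVOT_dataset/${dataset_uri#*://}"
}

# List the part-* files (Turtle, RDF/XML, N-Triples) found in a directory.
list_parts() {
  local dir="$1" f
  for f in "$dir"/part-*.ttl "$dir"/part-*.rdf "$dir"/part-*.nt; do
    [ -e "$f" ] && echo "$f"   # skip patterns that matched nothing
  done
  return 0
}
```

For example, `list_parts "$(dataset_dir http://thedatahub.org/dataset/congresspeople 2012-01-13)"` run from the brick root lists the responses that were accumulated for that dataset.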

Storing the FAqT evaluation results

POSTing a dataset description to a FAqT evaluation service will return an RDF description of its evaluation

When the RDF description of a dataset is POSTed to a FAqT evaluation service, the service returns an RDF evaluation of the dataset. The response from the FAqT evaluation service is stored in a file evaluation.ttl that is nested by the faqt, dataset, and epoch. For example, the RDF returned by the FAqT service http://sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples during epoch 2012-01-13 when evaluating dataset http://thedatahub.org/dataset/farmers-markets-geographic-data-united-states is stored at:

faqt-brick/
   sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples/__PIVOT_dataset/
      thedatahub.org/dataset/farmers-markets-geographic-data-united-states/__PIVOT_epoch/
         2012-01-13/evaluation.ttl
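
The three-way nesting can be sketched the same way: the FAqT service comes first, then the dataset under __PIVOT_dataset, then the epoch under __PIVOT_epoch. evaluation_path is a hypothetical helper, and its output is relative to the faqt-brick root.

```bash
# Hypothetical helper: where an evaluation result is stored, relative to the
# faqt-brick root. $1 = FAqT service URI, $2 = dataset URI, $3 = epoch name.
evaluation_path() {
  local faqt_uri="$1" dataset_uri="$2" epoch="$3"
  # "${var#*://}" strips the URI scheme from each URI.
  echo "${faqt_uri#*://}/__PIVOT_dataset/${dataset_uri#*://}/__PIVOT_epoch/$epoch/evaluation.ttl"
}
```

Looping this function over every service and dataset for one epoch enumerates all evaluation.ttl files in that slice.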

Forcing a new epoch slice

datafaqs-evaluate.sh assumes that you wouldn't want more than one epoch per day. If that's not the case, pass --force-epoch:

bash-3.2$ datafaqs-evaluate.sh 

An evaluation epoch has already been initiated today (2012-01-13).
Start one tomorrow, use --force-epoch to create another one today, or use --help.

bash-3.2$ datafaqs-evaluate.sh --force-epoch
[INFO] Using datafaqs.localhost/epochs/2012-01-13_17_49_46 
[INFO] Requesting FAqT services from http://sparql.tw.rpi.edu/services/datafaqs/core/select-faqts/via-sparql-query
...
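
Judging from the output above, a regular epoch is named by the date and a forced epoch appends the time of day. The sketch below reproduces that naming. The exact date formats are inferred from the examples on this page, and epoch_name is a hypothetical function.

```bash
# Hypothetical sketch of epoch naming, inferred from the example output:
#   default:        YYYY-MM-DD           (e.g. 2012-01-13)
#   --force-epoch:  YYYY-MM-DD_HH_MM_SS  (e.g. 2012-01-13_17_49_46)
epoch_name() {
  if [ "${1:-}" = "--force-epoch" ]; then
    date +%Y-%m-%d_%H_%M_%S
  else
    date +%Y-%m-%d
  fi
}
```

Because both forms begin with the date, sorting epoch names as plain strings keeps them in chronological order.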

Removing an epoch's slice

If you want to get rid of an epoch, first remove the epoch-specific materials from datafaqs.localhost/epochs and use datafaqs-purge-unlisted-epochs.sh to take care of the rest:

bash-3.2$ rm -rf datafaqs.localhost/epochs/2012-01-13_17_49_46/

bash-3.2$ datafaqs-purge-unlisted-epochs.sh 
usage: datafaqs-purge-unlisted-epochs.sh <-n | -w>

  -n: perform dry run; do not modify anything.
  -w: remove all epochs that are not listed in datafaqs.localhost/epochs/

bash-3.2$ datafaqs-purge-unlisted-epochs.sh -w
[INFO] Removing 2012-01-13_17_49_46
[INFO] Removing 2012-01-13_17_49_46
[INFO] Removing 2012-01-13_17_49_46
[INFO] Removing 2012-01-13_17_49_46
[INFO] Removing 2012-01-13_17_49_46
...

datafaqs-purge-unlisted-epochs.sh walks the rest of the FAqT brick and removes all materials created during epochs that are not listed in datafaqs.localhost/epochs. The example usage above removes the forced epoch that was created in the --force-epoch example earlier.
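
To preview what a purge would touch, the same walk can be approximated with find: every directory directly under a __PIVOT_epoch directory is an epoch, and it is unlisted if datafaqs.localhost/epochs has no directory of the same name. This is a sketch of the idea, not the script's actual implementation, and list_unlisted_epochs is a hypothetical name.

```bash
# Hypothetical sketch: print epoch directories under any __PIVOT_epoch/ that
# are not listed in <root>/datafaqs.localhost/epochs/. Nothing is removed.
# $1 = path to the faqt-brick root.
list_unlisted_epochs() {
  local root="$1" dir epoch
  # Match only the directories directly below a __PIVOT_epoch directory.
  find "$root" -type d -path '*/__PIVOT_epoch/*' ! -path '*/__PIVOT_epoch/*/*' |
  sort |
  while IFS= read -r dir; do
    epoch="$(basename "$dir")"
    [ -d "$root/datafaqs.localhost/epochs/$epoch" ] || echo "$dir"
  done
}
```

Running `list_unlisted_epochs .` from the brick root is comparable to the -n dry run in that it only reports and modifies nothing.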

Reapplying the FAqT service evaluations within an epoch

The CKAN dataset descriptions that were accumulated in an existing epoch can be reused to reapply the FAqT service evaluations within the same epoch. Because this replaces the results within the designated epoch, this should only be done for the latest epoch. The following usage shows that of the two epochs in the FAqT brick, the dataset listing and descriptions from the later one are reused.

ls datafaqs.localhost/epochs/
2012-01-12		2012-01-13

datafaqs-evaluate.sh --reuse-epoch datafaqs:latest
[INFO] Using datafaqs.localhost/epochs/2012-01-13  (datafaqs:latest)
[INFO] Requesting FAqT services from http://sparql.tw.rpi.edu/services/datafaqs/core/select-faqts/via-sparql-query
[INFO] Reusing dataset listing and descriptions from datafaqs.localhost/epochs/2012-01-13
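
One plausible way to resolve datafaqs:latest is to take the last epoch directory in sorted order, which works because epoch names start with an ISO date. Whether datafaqs-evaluate.sh resolves it this way is an assumption, and latest_epoch is a hypothetical helper.

```bash
# Hypothetical sketch: name of the most recent epoch in a FAqT brick,
# assuming epoch names sort chronologically as plain strings.
# $1 = path to the faqt-brick root.
latest_epoch() {
  local root="$1"
  ls "$root/datafaqs.localhost/epochs" | LC_ALL=C sort | tail -n 1
}
```

With the two epochs listed above, `latest_epoch .` prints 2012-01-13.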

Graph naming URI design and VoID hierarchy

For a given epoch, the following files contain graphs that are interesting for analysis. They need to be named and loaded into a triple store so that they can be available for SPARQL query.

(todo: the rdf config with provo describing core services)

datafaqs.localhost/epochs/2012-01-19/faqt-services.ttl      # the evaluation services that were used.
                                     datasets.ttl           # the datasets that were evaluated.
                                     dataset-references.ttl # rdfs:seeAlso to more descriptions

The following files contain the FAqT evaluation services' descriptions of themselves:

sparql.tw.rpi.edu/services/datafaqs/faqt/lodcloud/max-1-topic-tag/__PIVOT_epoch/2012-01-19/faqt-service.ttl
sparql.tw.rpi.edu/services/datafaqs/faqt/predicate-counter/__PIVOT_epoch/2012-01-19/faqt-service.ttl
sparql.tw.rpi.edu/services/datafaqs/faqt/redirect-loop/__PIVOT_epoch/2012-01-19/faqt-service.ttl
sparql.tw.rpi.edu/services/datafaqs/faqt/void-properties/__PIVOT_epoch/2012-01-19/faqt-service.ttl
sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples/__PIVOT_epoch/2012-01-19/faqt-service.ttl

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

 []
   a sd:NamedGraph;
   sd:name  <http://sparql.tw.rpi.edu/datafaqs/epoch/2012-01-19/faqt/1>;
   sd:graph [ 
      a prov:Account, sd:Graph, void:Graph;
      void:triples 17;
      prov:wasAttributedTo <http://sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples>;
      foaf:primaryTopic    <http://sparql.tw.rpi.edu/datafaqs/epoch/2012-01-19/faqt/1>;
      void:dataDump <http://sparql.tw.rpi.edu/datafaqs/dump/__PIVOT_faqt/sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples/__PIVOT_epoch/2012-01-19/faqt-service.ttl>;
   ]
.
<http://sparql.tw.rpi.edu/datafaqs/epoch/2012-01-19/faqt/1>
   a datafaqs:FAqTService;
   prov:specializationOf <http://sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples>;
   dcterms:date "2012-01-19"^^xsd:date;
.

The following files contain the dataset descriptions (including the additional references):

datafaqs.localhost/epochs/2012-01-19/__PIVOT_dataset/thedatahub.org/dataset/congresspeople/post.ttl
datafaqs.localhost/epochs/2012-01-19/__PIVOT_dataset/thedatahub.org/dataset/farmers-markets-geographic-data-united-states/post.ttl
datafaqs.localhost/epochs/2012-01-19/__PIVOT_dataset/thedatahub.org/dataset/white-house-visitor-access-records/post.ttl

 []
   a sd:NamedGraph;
   sd:name  <http://sparql.tw.rpi.edu/datafaqs/epoch/2012-01-19/dataset/1>;
   sd:graph [ 
      a prov:Account, sd:Graph, void:Graph;
      void:triples 14861;
      prov:wasDerivedFrom 
         <http://thedatahub.org/dataset/farmers-markets-geographic-data-united-states>,
         <http://logd.tw.rpi.edu/source/data-gov/dataset/4383/version/2011-Nov-29>,
         <http://logd.tw.rpi.edu/source/data-gov/file/4383/version/2011-Nov-29/conversion/data-gov-4383-2011-Nov-29.void.ttl>;
      foaf:primaryTopic    <http://sparql.tw.rpi.edu/datafaqs/epoch/2012-01-19/dataset/1>;
      void:dataDump <http://sparql.tw.rpi.edu/datafaqs/dump/__PIVOT_faqt/datafaqs.localhost/epochs/2012-01-19/__PIVOT_dataset/thedatahub.org/dataset/farmers-markets-geographic-data-united-states/post.ttl>;
   ]
.
<http://sparql.tw.rpi.edu/datafaqs/epoch/2012-01-19/dataset/1>
   a void:Dataset;
   prov:specializationOf <http://thedatahub.org/dataset/farmers-markets-geographic-data-united-states>;
   dcterms:date "2012-01-19"^^xsd:date;
.

The following files contain the evaluation of each dataset from each evaluation service:

sparql.tw.rpi.edu/services/datafaqs/faqt/lodcloud/max-1-topic-tag/__PIVOT_dataset/thedatahub.org/dataset/congresspeople/__PIVOT_epoch/2012-01-19/evaluation.ttl
sparql.tw.rpi.edu/services/datafaqs/faqt/lodcloud/max-1-topic-tag/__PIVOT_dataset/thedatahub.org/dataset/farmers-markets-geographic-data-united-states/__PIVOT_epoch/2012-01-19/evaluation.ttl
sparql.tw.rpi.edu/services/datafaqs/faqt/predicate-counter/__PIVOT_dataset/thedatahub.org/dataset/congresspeople/__PIVOT_epoch/2012-01-19/evaluation.ttl
sparql.tw.rpi.edu/services/datafaqs/faqt/predicate-counter/__PIVOT_dataset/thedatahub.org/dataset/farmers-markets-geographic-data-united-states/__PIVOT_epoch/2012-01-19/evaluation.ttl
sparql.tw.rpi.edu/services/datafaqs/faqt/redirect-loop/__PIVOT_dataset/thedatahub.org/dataset/congresspeople/__PIVOT_epoch/2012-01-19/evaluation.ttl
sparql.tw.rpi.edu/services/datafaqs/faqt/redirect-loop/__PIVOT_dataset/thedatahub.org/dataset/farmers-markets-geographic-data-united-states/__PIVOT_epoch/2012-01-19/evaluation.ttl
sparql.tw.rpi.edu/services/datafaqs/faqt/void-properties/__PIVOT_dataset/thedatahub.org/dataset/congresspeople/__PIVOT_epoch/2012-01-19/evaluation.ttl
sparql.tw.rpi.edu/services/datafaqs/faqt/void-properties/__PIVOT_dataset/thedatahub.org/dataset/farmers-markets-geographic-data-united-states/__PIVOT_epoch/2012-01-19/evaluation.ttl
sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples/__PIVOT_dataset/thedatahub.org/dataset/congresspeople/__PIVOT_epoch/2012-01-19/evaluation.ttl
sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples/__PIVOT_dataset/thedatahub.org/dataset/farmers-markets-geographic-data-united-states/__PIVOT_epoch/2012-01-19/evaluation.ttl

@prefix rdfs:     <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:      <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms:  <http://purl.org/dc/terms/> .
@prefix void:     <http://rdfs.org/ns/void#> .
@prefix sd:       <http://www.w3.org/ns/sparql-service-description#> .
@prefix formats:  <http://www.w3.org/ns/formats/> .
@prefix prov:     <http://www.w3.org/ns/prov-o/> .
@prefix datafaqs: <http://purl.org/twc/vocab/datafaqs#> .

 []
   a sd:NamedGraph;
   sd:name  <http://sparql.tw.rpi.edu/datafaqs/epoch/2012-01-19/faqt/1/dataset/1>;
   sd:graph <http://sparql.tw.rpi.edu/datafaqs/epoch/2012-01-19/faqt/1/dataset/1>;
 .
 <http://sparql.tw.rpi.edu/datafaqs/epoch/2012-01-19/faqt/1/dataset/1>
    a prov:Account, datafaqs:Evaluation;
    void:triples 14;
    void:dataDump <http://sparql.tw.rpi.edu/datafaqs/dump/__PIVOT_faqt/sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples/__PIVOT_dataset/thedatahub.org/dataset/farmers-markets-geographic-data-united-states/__PIVOT_epoch/2012-01-19/evaluation.ttl>;
 .
<http://sparql.tw.rpi.edu/datafaqs/dump/__PIVOT_faqt/sparql.tw.rpi.edu/services/datafaqs/faqt/void-triples/__PIVOT_dataset/thedatahub.org/dataset/farmers-markets-geographic-data-united-states/__PIVOT_epoch/2012-01-19/evaluation.ttl>
   formats:media_type <http://www.w3.org/ns/formats/Turtle>;
.
<http://www.w3.org/ns/formats/Turtle> 
   rdfs:label "Turtle"; 
   dcterms:identifier "text/turtle";
.
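
The graph names in the three examples share one pattern: a base URI, the epoch, and then an index for the FAqT service, the dataset, or both. The functions below sketch that pattern. The function names are hypothetical, the base URI is passed in rather than assumed, and how the numeric indexes are assigned is not specified on this page.

```bash
# Hypothetical helpers for the graph-naming pattern shown in the examples.
# $1 = base URI (e.g. http://sparql.tw.rpi.edu/datafaqs), $2 = epoch name,
# remaining arguments = numeric indexes.
faqt_graph_name() {
  echo "$1/epoch/$2/faqt/$3"
}

dataset_graph_name() {
  echo "$1/epoch/$2/dataset/$3"
}

# An evaluation graph is named by both the FAqT index and the dataset index.
evaluation_graph_name() {
  echo "$1/epoch/$2/faqt/$3/dataset/$4"
}
```

For example, `evaluation_graph_name http://sparql.tw.rpi.edu/datafaqs 2012-01-19 1 1` prints http://sparql.tw.rpi.edu/datafaqs/epoch/2012-01-19/faqt/1/dataset/1.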