Module 3B: Advanced Cleaning with Internet Archive Data (Cont'd)

Scotty Carlson edited this page Jul 31, 2019 · 13 revisions

SKOS & RDF

At this point, we need to talk a little bit about SKOS and RDF. Without getting too technical (which is kind of difficult for this topic), here's what the W3Schools site says about RDF:

RDF stands for Resource Description Framework. RDF is a framework for describing resources on the web. RDF is designed to be read and understood by computers. RDF is not designed for being displayed to people. RDF is written in XML. RDF is a part of the W3C's Semantic Web Activity. RDF is a W3C Recommendation from 10 February 2004.

At its most basic level, RDF is a way to express relationships between data, and it can be encoded in lots of different ways. A statement in RDF is called a "triple," since it contains a subject, a predicate, and an object in interaction. The Wikipedia page on RDF uses the statement "The sky has the color blue" as a way to express an RDF statement. This statement has:

a subject denoting "the sky", a predicate denoting "has", and an object denoting "the color blue".
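To make the subject/predicate/object idea concrete, here is a minimal Python sketch of that same statement. The `Triple` class and `describe` function are names invented for this illustration; they are not part of RDF or of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF-style statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str  # "object" is a Python builtin name, so we use "obj"


# The Wikipedia example, written as a single statement.
sky = Triple(subject="the sky", predicate="has", obj="the color blue")


def describe(t: Triple) -> str:
    """Read a triple back as a plain sentence."""
    return f"{t.subject} {t.predicate} {t.obj}"


print(describe(sky))
```

In real RDF, the subject and predicate would be URIs rather than plain strings; the plain strings here only show the three-part shape.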

The RDF framework is leveraged by using different vocabularies to express relationships. One such vocabulary is the Simple Knowledge Organization System (SKOS), used for representing Knowledge Organization Systems (KOSs) -- thesauri, taxonomies, classification schemes, and lists of discrete data. Its aim is to allow concepts to be ported to a shared space, enabling wider reuse.

The fundamental element of the SKOS vocabulary is the Concept. Concepts are the units of thought -- ideas, meanings, (categories of) objects, events, etc. -- which underlie many knowledge organization systems. SKOS introduces the class skos:Concept, which allows us to assert that a given resource is a concept by (1) finding (or creating) a URI to uniquely identify the concept, and (2) asserting that the resource connected to the URI is a skos:Concept.

"Wait, What Does Any of That Have to Do With Our Dead Data?"

Good question -- let's wrap it all up together: if we had structured data on Grateful Dead concert dates, we could take that data and convert it to an RDF file, which we could then use in Refine to reconcile against the known concert dates in the Internet Archive data download.

The problem is, while there are many web sites that contain detailed Grateful Dead setlist databases, to my knowledge, none of them are formally structured with unique resource identifiers AND openly available to acquire as a complete data dump (though Setlist.FM does have a complicated REST API search).

Luckily, I had the ear of political science professor and Grateful Dead researcher Joe Jupille. For his own research, Joe has amassed a remarkably detailed database on the significant events of the life of Jerry Garcia. Joe was kind enough to supply me with a partial export of that data that details the date, venue, and address of all known Grateful Dead live performances. (Thanks, Joe)

Thanks to Joe's data, we can create a SKOS-RDF file that defines a skos:Concept for each individual concert. Here are the parameters, based on the metadata provided:

  • skos:Concept: connected to a "dummy" URI (more about this in a second)
  • skos:prefLabel: short for "preferred label", this is what will be matched against the data; this will be the date of the concert (formatted as YYYY-MM-DD)
  • skos:scopeNote: here is where we will describe the concept -- venue, address (where available), and territory, written as a human-readable statement.

It's important to note here that we are using Refine's reconciliation process in a completely different way than it was intended. Our intention is not to extract URIs -- we already know that no URIs exist for these concerts, so there is no point in looking for them. For this purpose, URIs do not matter -- BUT they are a required part of the RDF schema, so they need to be there. For our purposes, then, the URIs linked to each concept are totally fake and useless. (This would change if we wanted to populate a publicly available database with unique, permanent URIs for each live performance -- THEN the URIs would matter.)

Here is what a single piece of SKOS-encoded data looks like:

<skos:Concept rdf:about="http://www.fakedomain.com/id/19650505">
    <skos:prefLabel>1965-05-05</skos:prefLabel>
    <skos:scopeNote>Magoo's Pizza Parlor, 635 Santa Cruz Avenue, Menlo Park, CA 94025</skos:scopeNote>
</skos:Concept>

Using the above template, the entire data file from Joe was encoded as a SKOS-RDF file. This file is available in the ZIP download of this tutorial, named dead-dates-skos.rdf. We will need it for the next part of the module.
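If you're curious how a spreadsheet of dates and venues could be turned into that template, here is a rough sketch using Python's standard library. It is not the script actually used to build dead-dates-skos.rdf; the function names are made up for illustration, and the element names and the "fakedomain" URI pattern simply follow the sample element above.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for SKOS and RDF.
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ET.register_namespace("skos", SKOS)
ET.register_namespace("rdf", RDF)


def concert_to_concept(date: str, description: str) -> ET.Element:
    """Build one skos:Concept element for a concert (hypothetical helper)."""
    concept = ET.Element(f"{{{SKOS}}}Concept")
    # Dummy URI: the date with its hyphens removed, as in the sample element.
    dummy_uri = "http://www.fakedomain.com/id/" + date.replace("-", "")
    concept.set(f"{{{RDF}}}about", dummy_uri)
    ET.SubElement(concept, f"{{{SKOS}}}prefLabel").text = date
    ET.SubElement(concept, f"{{{SKOS}}}scopeNote").text = description
    return concept


def build_rdf(rows) -> str:
    """Wrap one concept per (date, description) pair in an rdf:RDF root."""
    root = ET.Element(f"{{{RDF}}}RDF")
    for date, description in rows:
        root.append(concert_to_concept(date, description))
    return ET.tostring(root, encoding="unicode")


rows = [
    ("1965-05-05",
     "Magoo's Pizza Parlor, 635 Santa Cruz Avenue, Menlo Park, CA 94025"),
]
rdf_xml = build_rdf(rows)
print(rdf_xml)
```

Note that two concerts on the same date would get the same dummy URI under this pattern, so a real conversion would need some way to tell them apart.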

Adding the RDF Extension to Refine

To be able to use our new SKOS-RDF file, we'll need to add another open source extension to Refine. This one can be downloaded here. First download the extension for your specific Refine version, then follow the directions to install the extension from a compiled release.

When you restart Refine, you'll see that another button is now at the top right-hand corner. Click it. Select Add Reconciliation Service > Based on RDF File.

In the window that shows up, select the SKOS/RDF file found in this GitHub repo (dead-dates-skos.rdf) and use the following parameters:

  • Name: Grateful Data
  • File Format: RDF/XML
  • Label properties: skos:prefLabel (leave no other ones checked)

We're almost ready. Next, import your Internet Archive data (found in the file IA-Data.xlsx) to a new project in Refine.

Reconciling

The reconciliation process can take a long time, depending on your version of Refine, your computer specs, and the little magic elves shuttling data from the former to the latter. Because of this, we will test out our reconciliation process on a small amount of data.

Create a text facet on the Year column of your IA Data project. Click on the entry for 1977, isolating our data to the year that some people consider almost a perfect year for Dead performances. ("It's as close to a flawless Grateful Dead tour as I've ever heard," said archivist David Lemieux.)
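For anyone wondering what a text facet is doing behind the scenes, it amounts to counting the distinct values in a column and then filtering rows to the value you click. Here is a small Python sketch of that idea. The rows are made up for illustration (only the Year and Date column names come from the tutorial), and the function names are hypothetical, not anything from Refine itself.

```python
from collections import Counter

# Made-up rows standing in for the IA data; not real records.
rows = [
    {"Year": "1976", "Date": "1976-06-09"},
    {"Year": "1977", "Date": "1977-05-08"},
    {"Year": "1977", "Date": "1977-05-09"},
]


def text_facet(rows, column):
    """Count how many rows carry each distinct value in a column."""
    return Counter(row[column] for row in rows)


def select(rows, column, value):
    """Keep only the rows whose column equals the chosen facet value."""
    return [row for row in rows if row[column] == value]


facet = text_facet(rows, "Year")
subset = select(rows, "Year", "1977")
print(facet)
print(subset)
```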

Click on the dropdown menu for the Date column and select Reconcile > Start Reconciling. In the 'Reconcile' window, click on the Grateful Data reconciliation service we added. (Refine may take a minute or two to process this file.) When it's ready, be sure that Refine has chosen skos:Concept to reconcile against, and that the box for 'Auto-Match with High Confidence' is checked. Click Start Reconciling.
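Conceptually, this step compares each value in the Date column against the skos:prefLabel of every concept in the RDF file. The Python sketch below shows only the simplest version of that idea, an exact string match. Refine and the RDF extension do their own matching and scoring, so treat this as an illustration, not a description of their internals. The function names are hypothetical, and the inline sample reuses the single concept shown earlier.

```python
import xml.etree.ElementTree as ET

NS = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

# A one-concept stand-in for dead-dates-skos.rdf.
SAMPLE_RDF = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://www.fakedomain.com/id/19650505">
    <skos:prefLabel>1965-05-05</skos:prefLabel>
    <skos:scopeNote>Magoo's Pizza Parlor, 635 Santa Cruz Avenue, Menlo Park, CA 94025</skos:scopeNote>
  </skos:Concept>
</rdf:RDF>"""


def load_labels(rdf_xml: str) -> dict:
    """Map each prefLabel (a date) to its scopeNote text."""
    root = ET.fromstring(rdf_xml)
    index = {}
    for concept in root.findall("skos:Concept", NS):
        label = concept.findtext("skos:prefLabel", namespaces=NS)
        note = concept.findtext("skos:scopeNote", default="", namespaces=NS)
        if label:
            index[label.strip()] = note.strip()
    return index


def reconcile(dates, index) -> dict:
    """Exact-match each date cell against the label index (None = no match)."""
    return {d: index.get(d.strip()) for d in dates}


matches = reconcile(["1965-05-05", "1999-01-01"], load_labels(SAMPLE_RDF))
for date, venue in matches.items():
    print(date, "->", venue)
```

A date that is not in the file simply comes back as `None`, which is roughly what an unmatched cell looks like after reconciling.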

The reconciliation process may take a while. When it's finished, move onward.