rdelbru edited this page May 13, 2011 · 20 revisions

SIREn Build Instructions

Basic steps:

  • Install JDK 1.6 (or greater), Maven 2.0.9 (or greater)

  • Download SIREn and unpack it

  • Change to the top-level directory of your SIREn installation

  • Run maven

Step 1: Environment setup

Set up your development environment (JDK 1.6 or greater, Maven 2.0.9 or greater)

We'll assume that you know how to get and set up the JDK - if you don't, then we suggest starting at http://java.sun.com and learning more about Java, before returning to this README. SIREn runs with JDK 1.6 and later.

Like many open source Java projects, SIREn uses Apache Maven for build control. Specifically, you MUST use Maven version 2.0.9 or greater.

Step 2: Download SIREn

We'll assume you already did this, or you wouldn't be reading this file. However, you might have received this file by some alternate route, or you might have an incomplete copy of SIREn, so: SIREn releases are available for download from http://siren.sindice.com/ and snapshots from https://github.com/rdelbru/SIREn/archives/master.

Download the tarred/gzipped version of the archive, and uncompress it into a directory of your choice.

Step 3: SIREn installation

From the command line, change (cd) into the top-level directory of your SIREn installation.

SIREn's top-level directory contains the pom.xml file. By default, you do not need to change any of the settings in this file, but you do need to run maven from this location so it knows where to find pom.xml.

Step 4: Run maven

Assuming you have Maven in your PATH, typing "mvn package" at the shell prompt or command prompt should run Maven. By default, Maven will look for the "pom.xml" file in your current directory, compile SIREn and run the tests.

The SIREn jar file will be located at "./target/siren-#{version}.jar".

To generate the javadoc, type "mvn javadoc:javadoc" at the shell prompt. Maven will generate the javadoc API in the directory "./target/site/apidocs/".

Input Data Syntax: N-Tuples

SIREn extends Lucene with a new field type 'tuples'. The field accepts structured information in a special syntax called N-Tuples, which is derived from the N-Triples syntax (http://www.w3.org/TR/rdf-testcases/#ntriples). The N-Tuples syntax is a superset of the N-Triples syntax. N-Tuples is a line-based, plain text format for encoding semi-structured data such as RDF graphs or other data formats. The content of a field of type 'tuples' is an ordered list of tuples, each tuple being an ordered list of cells. The current syntax differentiates three types of cells:

  • URIs, or Uniform Resource Identifiers, are enclosed in '<' and '>';
  • Literals, or plain text, are written using double-quotes;
  • Blank nodes, or local identifiers (specific to the RDF data model), are written as '_:nodeID'.

A dot signifies the end of a tuple. In the following, we present various examples of semi-structured data encoded into N-Tuples. The possibilities are not restricted to these examples, and it is up to you to structure your data the way you want.
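To make the three cell types concrete, here is a minimal sketch of how one N-Tuples line could be split into typed cells. It is an illustration only and is not SIREn's parser: the class NTuplesSketch, the Cell type and the regular expression are assumptions made for this example, and the sketch does not validate the full N-Triples grammar (for instance, language tags and datatypes on literals are simply ignored).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: NOT SIREn's parser. Splits one N-Tuples line into typed cells. */
final class NTuplesSketch {

  enum CellType { URI, LITERAL, BLANK_NODE }

  static final class Cell {
    final CellType type;
    final String value;

    Cell(CellType type, String value) {
      this.type = type;
      this.value = value;
    }

    @Override
    public String toString() {
      return type + "(" + value + ")";
    }
  }

  // group 1: <URI>, group 2: "literal" (backslash escapes allowed), group 3: _:nodeID
  private static final Pattern CELL = Pattern.compile(
      "<([^>]*)>|\"((?:[^\"\\\\]|\\\\.)*)\"|_:([A-Za-z0-9]+)");

  /** Returns the ordered list of cells of a single tuple; the line must end with a dot. */
  static List<Cell> parseTuple(String line) {
    String s = line.trim();
    if (!s.endsWith(".")) {
      throw new IllegalArgumentException("tuple must end with a dot: " + line);
    }
    s = s.substring(0, s.length() - 1); // drop the terminating dot
    List<Cell> cells = new ArrayList<>();
    Matcher m = CELL.matcher(s);
    while (m.find()) {
      if (m.group(1) != null) {
        cells.add(new Cell(CellType.URI, m.group(1)));
      } else if (m.group(2) != null) {
        cells.add(new Cell(CellType.LITERAL, m.group(2)));
      } else {
        cells.add(new Cell(CellType.BLANK_NODE, m.group(3)));
      }
    }
    return cells;
  }

  public static void main(String[] args) {
    // A two-cell tuple: a predicate URI followed by a literal value.
    System.out.println(parseTuple("<http://xmlns.com/foaf/0.1/name> \"Renaud Delbru\" ."));
  }
}
```

Because a tuple is just an ordered list of cells, the same function handles a three-cell N-Triples statement and a longer multi-valued tuple alike.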

N-Triples

Here is a sample of a plain N-Triples document that encodes an RDF graph. The document describes itself, i.e., the FOAF file of Renaud Delbru, and the entity identified by the URI http://renaud.delbru.fr/rdf/foaf#me.

<http://renaud.delbru.fr/rdf/foaf> <http://www.w3.org/2000/01/rdf-schema#label> "FOAF file of Renaud Delbru" .
<http://renaud.delbru.fr/rdf/foaf> <http://xmlns.com/foaf/0.1/maker> <http://renaud.delbru.fr/rdf/foaf#me> .
<http://renaud.delbru.fr/rdf/foaf#me> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
<http://renaud.delbru.fr/rdf/foaf#me> <http://xmlns.com/foaf/0.1/name> "Renaud Delbru" .
<http://renaud.delbru.fr/rdf/foaf#me> <http://xmlns.com/foaf/0.1/givenname> "Renaud" .
<http://renaud.delbru.fr/rdf/foaf#me> <http://xmlns.com/foaf/0.1/family_name> "Delbru" .
<http://renaud.delbru.fr/rdf/foaf#me> <http://xmlns.com/foaf/0.1/homepage> <http://renaud.delbru.fr/> .

Entity-Centric

Here is a sample of an entity description using N-Tuples. Compared to the previous example, where the first cell was the identifier of an entity, the first cell of a tuple is now a predicate (or property name). The subsequent cells of a tuple are the values associated with the predicate.

As you can see, the syntax is flexible. In lines 1 and 3, we model a multi-valued predicate with a first cell representing the predicate and the following cells as values. You can also mix different cell types (URIs, Literals) in the same tuple.

<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> "A Person" .
<http://xmlns.com/foaf/0.1/name> "Renaud Delbru" .
<http://xmlns.com/foaf/0.1/knows> <http://g1o.net#me> <http://eyaloren.org/foaf.rdf#me> .

SIREn Tuple Analyzer

SIREn provides a generic analyzer, the TupleAnalyzer, for parsing a field containing N-Tuples data. The TupleAnalyzer is pre-configured to work with most use cases. By default, it integrates a StandardAnalyzer for tokenising the Literal cells, and additional filters for normalising the tokens.

  • URITrailingSlashFilter: It normalises URIs by removing trailing slashes.
"http://xmlns.com/foaf/0.1/" -> "http://xmlns.com/foaf/0.1"
  • URINormalisationFilter: It normalises URIs by breaking them down into subwords and by generating multiple variations.
"http://xmlns.com/foaf/0.1/name" ->
(position:token)
0:"http",
1:"xmlns.com",
2:"foaf",
3:"0.1",
4:"name",
5:"http://xmlns.com/foaf/0.1/name"
  • LowerCaseFilter: The original Lucene filter that normalises tokens (of type Literal, URIs, etc.) to lower case.
  • StopFilter: The original Lucene filter that removes stop words.
  • LengthFilter: The original Lucene filter that removes words shorter than a minimum length (2 by default) or longer than a maximum length (128 by default).
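The two URI filters can be pictured with a small standalone sketch. It only reproduces the two examples listed above and does not use SIREn or Lucene classes: the class UriNormalisationSketch and its methods are hypothetical, and the delimiters used for splitting ("://" and "/") are an assumption. The real filter may split on additional characters.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: mimics the two URI examples above, NOT SIREn's filters. */
final class UriNormalisationSketch {

  /** Removes a single trailing slash, as in the URITrailingSlashFilter example. */
  static String stripTrailingSlash(String uri) {
    return uri.endsWith("/") ? uri.substring(0, uri.length() - 1) : uri;
  }

  /**
   * Breaks a URI into subwords and appends the full URI as the last token,
   * as in the URINormalisationFilter example. The list index of a token
   * plays the role of its position.
   */
  static List<String> subwords(String uri) {
    List<String> tokens = new ArrayList<>();
    for (String part : uri.split("://|/")) { // assumed delimiters
      if (!part.isEmpty()) {
        tokens.add(part);
      }
    }
    tokens.add(uri); // keep the complete URI so exact lookups still work
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(stripTrailingSlash("http://xmlns.com/foaf/0.1/"));
    List<String> tokens = subwords("http://xmlns.com/foaf/0.1/name");
    for (int i = 0; i < tokens.size(); i++) {
      System.out.println(i + ":\"" + tokens.get(i) + "\"");
    }
  }
}
```

Keeping the full URI next to its subwords is what lets a query match either a keyword such as "foaf" or the complete identifier.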

The following example helps to visualise the effects of the TupleAnalyzer on one tuple:

Analysing "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> "A Person" ."
[http] [www] [w3] [org] [1999] [02] [22] [rdf] [syntax] [ns] [type] [http] [xmlns] [com] [foaf] [0.1] [person] [person]
| |
[http://www.w3.org/1999/02/22-rdf-syntax-ns#type] [http://xmlns.com/foaf/0.1/person]

SIREn Query Components

SIREn provides a set of query components for performing operations over the content and structure of the tuple table. These query components are the building blocks for writing semi-structured searches.

Searching Content using Primitive Queries

SIREn currently provides two primitive query operators to access (and search) the content of a tuple table. These query operators provide the following basic operations:

  • SirenTermQuery: performs a term lookup, similarly to the original Lucene TermQuery;
  • SirenPhraseQuery: performs a phrase query, similarly to the original Lucene PhraseQuery.

In future SIREn releases, more advanced primitive operators, such as fuzzy or prefix queries, will be available.

These operators can then be combined with higher level operators, such as SirenCellQuery and SirenTupleQuery presented next, in order to create semi-structured queries.

Restricting Search within a Cell

The SirenCellQuery allows combining the primitive query components, e.g. SirenTermQuery or SirenPhraseQuery, with boolean operations in order to restrict the search to a single cell of the tuple table. The interface is similar to Lucene BooleanQuery with the possibility of adding multiple clauses using the SirenCellQuery.add(SirenPrimitiveQuery query, Occur occur) method.

A SirenCellQuery provides a method, SirenCellQuery.setConstraint(int index), to add a cell index constraint. For example, in the N-Triples tuple table example, the cell index of a subject is always 0. When trying to match the subject cell, all matching cells with an index other than 0 should be discarded. This is illustrated in the example below. The index constraint does not have to be a single position: it can be expressed as an interval using SirenCellQuery.setConstraint(int start, int end) in order to search multiple cells at the same time.

// Create a cell query matching either the keyword "renaud" or the full URI "http://renaud.delbru.fr/rdf/foaf#me" at the subject position (cell 0)

SirenCellQuery cq = new SirenCellQuery();
cq.add(new SirenTermQuery(new Term(DEFAULT_FIELD, "renaud")), SirenCellClause.Occur.SHOULD);
cq.add(new SirenTermQuery(new Term(DEFAULT_FIELD, "http://renaud.delbru.fr/rdf/foaf#me")), SirenCellClause.Occur.SHOULD);
// Constrain the cell index to 0 (first column: subject position)
cq.setConstraint(0);
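The effect of a cell index constraint can be illustrated without SIREn at all. The sketch below is a toy in-memory model, not SIREn's implementation: a tuple is a plain list of strings, the helper matchesCell is hypothetical, and the tokeniser (lower-casing and splitting on non-alphanumeric characters) is a simplifying assumption that stands in for the TupleAnalyzer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy model of a cell index constraint; NOT SIREn code. */
final class CellConstraintSketch {

  /**
   * True if some cell whose index lies in [start, end] contains the term.
   * A single-index constraint is the special case start == end.
   */
  static boolean matchesCell(List<String> tuple, String term, int start, int end) {
    for (int i = Math.max(start, 0); i <= end && i < tuple.size(); i++) {
      // Assumed tokenisation: lower-case, split on anything that is not a letter or digit.
      List<String> tokens =
          Arrays.asList(tuple.get(i).toLowerCase(Locale.ROOT).split("[^a-z0-9]+"));
      if (tokens.contains(term)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    List<String> triple = Arrays.asList(
        "http://renaud.delbru.fr/rdf/foaf#me",
        "http://xmlns.com/foaf/0.1/name",
        "Renaud Delbru");
    System.out.println(matchesCell(triple, "renaud", 0, 0)); // subject cell only
    System.out.println(matchesCell(triple, "renaud", 1, 1)); // predicate cell only
    System.out.println(matchesCell(triple, "renaud", 0, 2)); // interval over all three cells
  }
}
```

The same term can occur in several cells of one tuple, so it is the index (or interval) that decides which occurrences count as matches.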

Combining Cells into Tuples

A SirenCellQuery allows one to express a search over the content of a cell. Multiple cell query components can be combined to form a "tuple query" using the SirenTupleQuery component. A tuple query retrieves tuples matching a boolean combination of the cell queries. The SirenTupleQuery provides a similar interface to BooleanQuery with the possibility to add multiple clauses using the SirenTupleQuery.add(SirenCellQuery query, Occur occur) method.

Since 0.2, the SirenTupleQuery provides a method, SirenTupleQuery.setConstraint(int index), to add a tuple index constraint. As for the SirenCellQuery, the index constraint does not have to be a single position: it can be expressed as an interval using SirenTupleQuery.setConstraint(int start, int end) in order to restrict the search to multiple tuples at the same time.

// Simple tuple query that looks up the triple pattern (*, name, "renaud delbru")

// Create a cell query matching "name"
SirenCellQuery cq1 = new SirenCellQuery();
cq1.add(new SirenTermQuery(new Term(DEFAULT_FIELD, "name")), SirenCellClause.Occur.MUST);
// Constrain the cell index to 1 (second column: predicate position)
cq1.setConstraint(1);

// Create a cell query matching the phrase "renaud delbru"
SirenCellQuery cq2 = new SirenCellQuery();
SirenPhraseQuery pq = new SirenPhraseQuery();
pq.add(new Term(DEFAULT_FIELD, "renaud"));
pq.add(new Term(DEFAULT_FIELD, "delbru"));
cq2.add(pq, SirenCellClause.Occur.MUST);
// Constrain the cell index to 2 (third column: object position)
cq2.setConstraint(2);

// Create a tuple query that combines the two cell queries
SirenTupleQuery tq = new SirenTupleQuery();
tq.add(cq1, SirenTupleClause.Occur.MUST);
tq.add(cq2, SirenTupleClause.Occur.MUST);
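What the triple pattern (*, name, "renaud delbru") selects can be shown with a toy in-memory tuple table. This is a sketch under stated assumptions, not SIREn's matching algorithm: the class TuplePatternSketch, its helpers and the tokeniser (lower-casing and splitting on non-alphanumeric characters) are all hypothetical. It models two MUST clauses: a term in cell 1 and a phrase in cell 2, both within the same tuple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Toy model of the pattern (*, predicate term, "phrase"); NOT SIREn code. */
final class TuplePatternSketch {

  /** Assumed tokenisation: lower-case, split on anything that is not a letter or digit. */
  static List<String> tokens(String cell) {
    List<String> out = new ArrayList<>();
    for (String t : cell.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
      if (!t.isEmpty()) {
        out.add(t);
      }
    }
    return out;
  }

  /** True if the cell at the given index contains the words as consecutive tokens. */
  static boolean cellHasPhrase(List<String> tuple, int index, String... words) {
    if (index < 0 || index >= tuple.size()) {
      return false;
    }
    return Collections.indexOfSubList(tokens(tuple.get(index)), Arrays.asList(words)) >= 0;
  }

  /** Indexes of the tuples where cell 1 has the term AND cell 2 has the phrase. */
  static List<Integer> match(List<List<String>> table, String predicateTerm, String... phrase) {
    List<Integer> hits = new ArrayList<>();
    for (int i = 0; i < table.size(); i++) {
      List<String> tuple = table.get(i);
      if (cellHasPhrase(tuple, 1, predicateTerm) && cellHasPhrase(tuple, 2, phrase)) {
        hits.add(i);
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    String me = "http://renaud.delbru.fr/rdf/foaf#me";
    List<List<String>> table = Arrays.asList(
        Arrays.asList(me, "http://xmlns.com/foaf/0.1/name", "Renaud Delbru"),
        Arrays.asList(me, "http://xmlns.com/foaf/0.1/givenname", "Renaud"),
        Arrays.asList(me, "http://xmlns.com/foaf/0.1/name", "Delbru Renaud"));
    // Only the first tuple has "name" in cell 1 and the exact phrase in cell 2.
    System.out.println(match(table, "name", "renaud", "delbru"));
  }
}
```

The third tuple shows why a phrase query is used for the object cell: both words are present, but not in the requested order.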

Combining Tuples, Cells and Primitive Queries with Lucene Operators

SirenTupleQuery, SirenCellQuery and SirenPrimitiveQuery can all be combined using Lucene's BooleanQuery, which allows one to express more advanced queries, e.g. for matching entities. The following query example retrieves all entities related to "DERI" and having a property labeled name or fullname with the value "Renaud Delbru".

// Complex query that matches: (*, name, "renaud delbru") AND (*, workplace, deri)

// Create a cell query matching "workplace"
SirenCellQuery cq1 = new SirenCellQuery();
cq1.add(new SirenTermQuery(new Term(DEFAULT_FIELD, "workplace")), SirenCellClause.Occur.MUST);
// Constrain the cell index to 1 (second column: predicate position)
cq1.setConstraint(1);

// Create a cell query matching "deri"
SirenCellQuery cq2 = new SirenCellQuery();
cq2.add(new SirenTermQuery(new Term(DEFAULT_FIELD, "deri")), SirenCellClause.Occur.MUST);
// Constrain the cell index to 2 (third column: object position)
cq2.setConstraint(2);

// Create a tuple query that combines the two cell queries
SirenTupleQuery tq = new SirenTupleQuery();
tq.add(cq1, SirenTupleClause.Occur.MUST);
tq.add(cq2, SirenTupleClause.Occur.MUST);

// Combine two tuple queries with a Lucene boolean query
BooleanQuery q = new BooleanQuery();
q.add(tq, Occur.MUST);

// Get the tuple query (*, name, "renaud delbru")
q.add(this.getQuery2(), Occur.MUST);