<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="en" lang="en">
<head>
<meta charset="utf-8"/>
<title>Linked Statistical Data Analysis</title>
<link rel="stylesheet" media="all" title="LNCS" href="media/css/lncs.css"/>
<link rel="stylesheet alternate" media="all" title="ACM" href="media/css/acm.css"/>
<link rel="stylesheet" media="all" href="media/css/lr.css"/>
<script src="http://code.jquery.com/jquery-2.1.3.min.js"></script>
<script src="scripts/html.sortable.min.js"></script>
<script src="scripts/lr.js"></script>
</head>
<body about="[this:]" typeof="schema:CreativeWork sioc:Post schema:ScholarlyArticle prov:Entity" class="h-feed" prefix="rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs: http://www.w3.org/2000/01/rdf-schema# owl: http://www.w3.org/2002/07/owl# xsd: http://www.w3.org/2001/XMLSchema# dcterms: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ v: http://www.w3.org/2006/vcard/ns# pimspace: http://www.w3.org/ns/pim/space# skos: http://www.w3.org/2004/02/skos/core# prov: http://www.w3.org/ns/prov# schema: http://schema.org/ sioc: http://rdfs.org/sioc/ns# rsa: http://www.w3.org/ns/auth/rsa# cert: http://www.w3.org/ns/auth/cert# cal: http://www.w3.org/2002/12/cal/ical# wgs: http://www.w3.org/2003/01/geo/wgs84_pos# bibo: http://purl.org/ontology/bibo/ dbr: http://dbpedia.org/resource/ dbp: http://dbpedia.org/property/ sio: http://semanticscience.org/resource/ opmw: http://www.opmw.org/ontology/ deo: http://purl.org/spar/deo/ doco: http://purl.org/spar/doco/ cito: http://purl.org/spar/cito/ fabio: http://purl.org/spar/fabio/ oa: http://www.w3.org/ns/oa# this: http://csarven.ca/linked-statistical-data-analysis">
<article class="h-entry">
<h1 class="p-name" property="schema:name">Linked Statistical Data Analysis</h1>
<div id="authors">
<dl id="author-name">
<dt>Authors</dt>
<dd id="author-1" rel="bibo:authorList" inlist="" resource="http://csarven.ca/#i"><span about="[this:]" rel="schema:contributor schema:creator schema:publisher schema:author"><a about="http://csarven.ca/#i" typeof="schema:Person" rel="schema:url" property="schema:name" href="http://csarven.ca/">Sarven Capadisli</a></span><sup><a about="http://csarven.ca/#i" rel="schema:memberOf" resource="http://dbpedia.org/resource/Leipzig_University" href="#author-org-1">1</a></sup><sup><a href="#author-email-1">✊</a></sup></dd>
<dd id="author-2" rel="bibo:authorList" inlist="" resource="[this:#SörenAuer]"><span about="[this:]" rel="schema:contributor"><a about="[this:#SörenAuer]" typeof="schema:Person" rel="schema:url" property="schema:name" href="http://www.iai.uni-bonn.de/~auer/">Sören Auer</a></span><sup><a about="[this:#SörenAuer]" rel="schema:memberOf" resource="http://dbpedia.org/resource/University_of_Bonn" href="#author-org-2">2</a></sup><sup><a href="#author-email-2">⚛</a></sup></dd>
<dd id="author-3" rel="bibo:authorList" inlist="" resource="[this:#ReinhardRiedl]"><span about="[this:]" rel="schema:contributor"><a about="[this:#ReinhardRiedl]" typeof="schema:Person" rel="schema:url" property="schema:name" href="http://www.wirtschaft.bfh.ch/de/ueber_uns/kontakt/detailseite.html?tx_bfhpersonalpages_p=rer2&tx_bfhpersonalpages_screen=data">Reinhard Riedl</a></span><sup><a about="[this:#ReinhardRiedl]" rel="schema:memberOf" resource="http://dbpedia.org/resource/Bern_University_of_Applied_Sciences" href="#author-org-3">3</a></sup><sup><a href="#author-email-3">𝄞</a></sup></dd>
</dl>
<ul id="author-org">
<li id="author-org-1"><sup>1</sup><a about="[dbr:Leipzig_University]" typeof="schema:Organization" property="schema:name" rel="schema:url" href="http://www.zv.uni-leipzig.de/">Leipzig University</a>, Institute of Computer Science, <a href="http://aksw.org/">AKSW</a>, Leipzig, Germany</li>
<li id="author-org-2"><sup>2</sup><a about="http://dbpedia.org/resource/University_of_Bonn" typeof="schema:Organization" property="schema:name" rel="schema:url" href="http://uni-bonn.de/">University of Bonn</a>, Enterprise Information Systems Department, Bonn, Germany</li>
<li id="author-org-3"><sup>3</sup><a about="http://dbpedia.org/resource/Bern_University_of_Applied_Sciences" typeof="schema:Organization" property="schema:name" rel="schema:url" href="http://bfh.ch/">Bern University of Applied Sciences</a>, E-Government-Institute, Bern, Switzerland</li>
</ul>
<ul id="author-email">
<li id="author-email-1"><sup>✊</sup><a about="http://csarven.ca/#i" rel="schema:email" href="mailto:[email protected]" class="author_email">[email protected]</a></li>
<li id="author-email-2"><sup>⚛</sup><a about="[this:#SörenAuer]" rel="schema:email" href="mailto:[email protected]">[email protected]</a></li>
<li id="author-email-3"><sup>𝄞</sup><a about="[this:#ReinhardRiedl]" rel="schema:email" href="mailto:[email protected]">[email protected]</a></li>
</ul>
</div>
<dl id="document-identifier">
<dt>Document ID</dt>
<dd><a href="http://csarven.ca/linked-statistical-data-analysis">http://csarven.ca/linked-statistical-data-analysis</a></dd>
</dl>
<dl id="document-published">
<dt>Published</dt>
<dd><time datetime="2015-03-21" property="schema:datePublished" content="2013-07-15T09:00:00Z" datatype="xsd:dateTime">2013-07-15</time></dd>
</dl>
<dl id="document-modified">
<dt>Modified</dt>
<dd><time datetime="2015-03-21" property="schema:dateModified" content="2013-07-15T09:00:00Z" datatype="xsd:dateTime">2014-07-21</time></dd>
</dl>
<dl id="document-license">
<dt>License</dt>
<dd><a about="[this:]" rel="license schema:license" href="http://creativecommons.org/licenses/by-sa/4.0/" title="Creative Commons Attribution-ShareAlike 4.0 International">CC BY-SA 4.0</a></dd>
</dl>
<dl id="document-in-reply-to">
<dt>In Reply To</dt>
<dd><a about="[this:]" rel="sioc:reply_of" href="http://semstats.org/2013/call-for-papers" class="u-in-reply-to">SemStats 2013 Call for Papers</a></dd>
</dl>
<dl id="document-appeared">
<dt>Appeared In</dt>
<dd about="[this:]" rel="bibo:citedBy" resource="http://ceur-ws.org/Vol-XXX/">
<span about="http://ceur-ws.org/Vol-XXX/" typeof="bibo:Document">
<span property="bibo:shortTitle">CEUR</span> (<span property="schema:name">Central Europe workshop proceedings</span>): <a rel="schema:url" href="http://ceur-ws.org/Vol-XXX/">Proceedings of the 1st International Workshop on Semantic Statistics</a>,
Volume <span property="bibo:volume" xml:lang="" lang="">XXX</span>,
<span property="bibo:uri" xml:lang="" lang="">urn:nbn:de:0074-XXX-X</span>
</span>
</dd>
</dl>
<dl id="document-purpose">
<dt>Purpose</dt>
<dd property="schema:purpose">A path to using federated queries, statistical analyses and reuse of statistical linked data.</dd>
</dl>
<div id="content" class="e-content">
<section id="abstract" about="[this:]">
<h2>Abstract</h2>
<div property="schema:abstract" class="p-summary">
<p>Linked Data design principles are increasingly employed to publish and consume high-fidelity, heterogeneous statistical datasets in a distributed fashion. While vast amounts of linked statistics are available, accessing and reusing the data requires expertise in the corresponding technologies. There are no user-centred interfaces for researchers, journalists and interested people to compare statistical data retrieved from different sources on the Web. Because the RDF Data Cube vocabulary is used to describe statistical data, statistical data artefacts can be discovered and identified in a uniform way. In this article, the design and implementation of a user-centric application and service are presented. Behind the scenes, the platform utilizes federated SPARQL queries to gather statistical data from distributed data stores. The R language for statistical computing is employed to perform statistical analyses and visualizations. The Shiny application and server bridge the front-end Web user interface with R on the server-side in order to compare statistical macrodata, and store analysis results in RDF for future research. As a result, distributed linked statistics with accompanying provenance data can be more easily explored and analysed by interested parties.</p>
</div>
</section>
<section id="keywords" about="[this:]">
<h2>Keywords</h2>
<div>
<ul rel="schema:about">
<li><a resource="http://dbpedia.org/resource/Linked_Data" href="http://en.wikipedia.org/wiki/Linked_Data">Linked Data</a></li>
<li><a resource="http://dbpedia.org/resource/SDMX" href="http://en.wikipedia.org/wiki/SDMX">SDMX</a></li>
<li><a resource="http://dbpedia.org/resource/Statistics" href="http://en.wikipedia.org/wiki/Statistics">Statistics</a></li>
<li><a resource="http://dbpedia.org/resource/Statistical_database" href="http://en.wikipedia.org/wiki/Statistical_database">Statistical database</a></li>
<li><a resource="http://dbpedia.org/resource/Data_integration" href="http://en.wikipedia.org/wiki/Data_integration">Data integration</a></li>
<li><a resource="http://dbpedia.org/resource/Regression_analysis" href="http://en.wikipedia.org/wiki/Regression_analysis">Regression analysis</a></li>
<li><a resource="http://dbpedia.org/resource/User_interface" href="http://en.wikipedia.org/wiki/User_interface">User interface</a></li>
</ul>
</div>
</section>
<section id="introduction" about="[this:]" rel="schema:hasPart">
<h2 about="[this:#introduction]" property="schema:name">Introduction</h2>
<div about="[this:#introduction]" property="schema:description" typeof="deo:Introduction">
<p>Statistical data artefacts and the analyses conducted on the data are fundamental for testing scientific theories about our society and the universe we live in. As statistics are often used to add credibility to an argument or advice, they influence the decisions we make. These decisions are, however, complex in their own right, involving multiple variables based on facts, cognitive processes, social demands, and maybe even factors that are unknown to us. For society to track and learn from its own vast knowledge about events and things, it needs to be able to gather statistical information from heterogeneous and distributed sources. This is to uncover insights, make predictions, or build smarter systems that society needs to progress.</p>
<p>Due to a range of technical challenges, development teams often face low-level, repetitive statistical data management tasks with only partial tooling at their disposal. On the surface, these challenges include data integration, synchronization, and uniform access. In addition, designing user-centric interfaces for data analysis that are functionally consistent (i.e., improving usability and learning), reasonably responsive, and provenance friendly (e.g., fact checkable) still requires much attention.</p>
<p>This brings us to the core of our research challenge: How do we reliably acquire statistical data in a uniform way and conduct well-formed analyses that are easily accessible and usable by citizens, while strengthening trust between the user and the system?</p>
<p>This article presents an approach, <em>Statistical Linked Data Analyses</em>, addressing this challenge. In a nutshell, it takes advantage of Linked Data design principles that are widely accepted as a way to publish and consume data without central coordination on the Web. The work herein offers a Web-based user interface for researchers, journalists, or interested people to compare statistical data from different sources against each other without requiring knowledge of the underlying technology or the expertise to develop such tools themselves. Our approach is based on performing decentralized (i.e., federated) structured queries to retrieve data from various SPARQL endpoints, conducting various data analyses, and providing analysis results back to the user. For future research, analyses are stored so that they can be discovered and reused.</p>
<p>We have an implementation of a statistical analyses service at <a href="http://stats.270a.info/">stats.270a.info</a> [<a href="#ref-1">1</a>] which addresses the challenge and realizes the approach. The service is intended to allow humans and machines to explore statistical analyses. There are two additional products of this service: first, the analysis results are stored for future discovery, and second, it creates valuable statistical artefacts which can be reused in a uniform way.</p>
<p>As a result, we demonstrate with this work how Linked Data principles can be applied to statistical data. We show in particular that federated SPARQL queries facilitate novel statistical analyses, which previously required cumbersome manual statistical data integration efforts. The automated integration and analysis workflow also enables provenance tracing from visualizations combining statistical data from various sources back to the original raw data.</p>
</div>
</section>
<section id="background" about="[this:]" rel="schema:hasPart">
<h2 about="[this:#background]" property="schema:name">Background and Related Work</h2>
<div about="[this:#background]" property="schema:description">
<p>As we already discussed in <a href="http://csarven.ca/statistical-linked-dataspaces">Statistical Linked Dataspaces</a> [<a href="#ref-2">2</a>], linked statistics enable queries across datasets: given that the dimension concepts are interlinked, one can learn from a certain observation's dimension value and automate cross-dataset queries.</p>
<p>The <a href="http://www.w3.org/TR/vocab-data-cube/">RDF Data Cube vocabulary</a> [<a href="#ref-3">3</a>] is used to describe multi-dimensional statistical data, along with SDMX-RDF as one of the statistical information models. It makes it possible to represent significant amounts of heterogeneous statistical data as Linked Data where they can be discovered and identified in a uniform way. The statistical artefacts that use this vocabulary are invaluable for statisticians, researchers, and developers.</p>
<p><a href="http://csarven.ca/linked-sdmx-data">Linked SDMX Data</a> [<a href="#ref-4">4</a>] provided templates and tooling to transform SDMX-ML data from statistical agencies to RDF/XML, resulting in linked statistical datasets at <a href="http://270a.info/">270a.info</a> [<a href="#ref-5">5</a>] using the RDF Data Cube vocabulary. In addition to semantically uplifting the original data, provenance information was tracked using the <a href="http://www.w3.org/TR/prov-o/">PROV Ontology</a> [<a href="#ref-6">6</a>] at transformation time, while incorporating retrieval time provenance data.</p>
<p><a href="http://dcevents.dublincore.org/IntConf/dc-2011/paper/download/27/16">Performing Statistical Methods on Linked Data</a> [<a href="#ref-7">7</a>] investigated simple statistical calculations, such as linear regression, and presented the results using R [<a href="#ref-8">8</a>] and SPARQL queries. It highlighted the importance of a wide range of typical data integration issues for heterogeneous statistical data. The other technical issues raised are SPARQL query performance and the use of a central SPARQL endpoint, which contained multiple data sources. As future work, it proposed a friendly user interface that allows the selection of datasets and statistical methods, and a visualization of the results.</p>
<p><a href="http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Workshops/COLD/cold2011_submission_13.pdf">Defining and Executing Assessment Tests on Linked Data for Statistical Analysis</a> [<a href="#ref-9">9</a>] names identification of data items, analysis of data characteristics, and data matching as key requirements to conduct statistical analysis on integrated Linked Data.</p>
<p><a href="http://www.few.vu.nl/~wrvhage/papers/LOP_JoDS_2012.pdf">Linked Open Piracy: A story about e-Science, Linked Data, and statistics</a> [<a href="#ref-10">10</a>] investigated analysis and visualization of piracy reports to answer domain questions through a <a href="http://cran.r-project.org/web/packages/SPARQL/index.html">SPARQL client for R</a> [<a href="#ref-11">11</a>].</p>
<p><a href="http://www.hicss.hawaii.edu/hicss_46/bp46/hc6.pdf">Towards Next Generation Health Data Exploration</a>: A Data Cube-based Investigation into Population Statistics for Tobacco [<a href="#ref-12">12</a>], presents the <a href="http://orion.tw.rpi.edu/~jimmccusker/qb.js/">qb.js</a> [<a href="#ref-13">13</a>] tool to explore data that is expressed as RDF Data Cubes. It is designed to formulate and explore hypotheses. Under the hood, it makes a SPARQL query to an endpoint which contains the data that it analyzes.</p>
<p><a href="http://svn.aksw.org/papers/2012/ESWC_PublishingStatisticData/public.pdf">Publishing Statistical Data on the Web</a> [<a href="#ref-14">14</a>] explains <a href="http://aksw.org/Projects/CubeViz">CubeViz</a> [<a href="#ref-15">15</a>], which was developed to visualize multidimensional statistical data. It is a faceted browser, which utilizes the RDF Data Cube vocabulary, with a chart visualization component. The inspection and results are for a single dataset.</p>
<p><a href="http://www.google.com/publicdata/">Google Public Data Explorer</a> [<a href="#ref-16">16</a>], derived from the <a href="http://www.gapminder.org/">Gapminder</a> [<a href="#ref-17">17</a>] tool, displays statistical data as line graphs, bar graphs, cross sectional plots or on maps. The process to display the data requires the data to be uploaded in CSV format, and accompanying <a href="https://developers.google.com/public-data/">Dataset Publishing Language</a> (DSPL) [<a href="#ref-18">18</a>] in XML to describe the data and metadata of the datasets. Its visualizations and comparisons are based on one dataset at a time.</p>
<p><a href="http://www.ke.tu-darmstadt.de/bibtex/attachments/single/310">Generating Possible Interpretations for Statistics from Linked Open Data</a> [<a href="#ref-19">19</a>] presents the <a href="http://www.ke.tu-darmstadt.de/resources/explain-a-lod">Explain-a-LOD</a> [<a href="#ref-20">20</a>] tool, which focuses on generating hypotheses that explain statistics. It has a configuration to compare two variables, and then provides possible interpretations of the correlation analysis for users to review.</p>
<p>Looking at the state of the art, we can see that the analyses are commonly conducted on central repositories. As statistical Linked Data is published by different parties independently from one another, it is only reasonable to work towards a solution that can gather, integrate and analyse the data without having to resort to centralization.</p>
</div>
</section>
<section id="lsd-analysis" about="[this:]" rel="schema:hasPart">
<h2 about="[this:#lsd-analysis]" property="schema:name">Analysis platform for Linked Statistical Data</h2>
<div about="[this:#lsd-analysis]" property="schema:description">
<p>Our analysis platform focuses on two goals: 1) a Web user interface for researchers to compare macrodata observations and to view plots and analysis results, 2) caching and storage of analyses for future research and reuse. Here, we describe the platform at <a href="http://stats.270a.info/">stats.270a.info</a>. Figure [<a href="#figure-linked-stats-analysis-architecture">1</a>] shows the architecture for Linked Stats Analysis.</p>
<figure id="figure-linked-stats-analysis-architecture">
<object type="image/svg+xml" width="640" height="415" data="linked-stats-analysis-architecture.svg"></object>
<figcaption>Linked Stats Analysis Architecture.</figcaption>
</figure>
<section id="requirements" about="[this:#lsd-analysis]" rel="schema:hasPart">
<h3 about="[this:#requirements]" property="schema:name">Functional Requirements</h3>
<div about="[this:#requirements]" property="schema:description">
<p>The requirements for functionality and performance are that Linked Data design principles are employed behind the scenes to pull in the statistical data that are needed to conduct analyses, and to make the results of the analyses available using the same methods for both humans and machines. While achieving this workflow includes many steps, the front-end interface for humans should require only the minimum interaction needed to accomplish this. The performance of the system should be reasonable for a Web user interface, as it needs to present the analysis and display visualizations. Additionally, essential parts of the analyses should be cached and stored for future use, both for application responsiveness and for data discovery. Finally, and most importantly, the interface needs to foster trust while presenting the analyses. Therefore, the interface should be accompanied by data provenance and provide sufficient detail for the user.</p>
</div>
</section>
<section id="user-interface" about="[this:#lsd-analysis]" rel="schema:hasPart">
<h3 about="[this:#user-interface]" property="schema:name">User interface</h3>
<div about="[this:#user-interface]" property="schema:description">
<p>A web application was created to provide users with a simple interface to conduct regression analysis and display scatter plots. In the case of regression analysis, the interface presents three drop-down selection areas for the user: an independent variable, a dependent variable, and a time series. Both the independent and dependent variables are composed of a list of datasets with observations, and time series are composed of reference periods of those observations. Upon selecting and submitting datasets to compare, the interface then presents a scatter plot with a line of best fit from a list of tested linear models. Figure [<a href="#figure-stats.270a.info.ui">2</a>] shows a screenshot of the user interface. The points in the scatter plot represent locations, in this case countries, which have a measure value for both variables as well as the reference period that was selected by the user. Below the scatter plot, a table of analysis results is presented.</p>
<figure id="figure-stats.270a.info.ui">
<img src="stats.270a.info.ui.png" width="480" height="440" alt="stats.270a.info user interface"/>
<figcaption>stats.270a.info analysis user interface.</figcaption>
</figure>
<p>The datasets are compiled by gathering <code>qb:DataSet</code>s (an RDF Data Cube class for datasets) from each statistical dataspace at 270a.info. Similarly, the reference periods are derived from calendar intervals e.g., <code>YYYY</code>, <code>YYYY-MM-DD</code> or <code>YYYY-QQ</code>.</p>
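<p>As a rough illustration of how reference period values of these shapes could be told apart, the following Python sketch classifies a period string. It is not the platform's code: the function and pattern names are hypothetical, and the lexical form of quarters (here <code>2013-Q1</code>) is an assumption about what <code>YYYY-QQ</code> stands for.</p>

```python
import re

# Hypothetical patterns for the three calendar-interval shapes named above.
# The quarter form ("2013-Q1") is an assumption, not taken from the datasets.
PERIOD_PATTERNS = {
    "year": re.compile(r"\d{4}"),
    "day": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "quarter": re.compile(r"\d{4}-Q[1-4]"),
}


def classify_period(value):
    """Return the name of the interval shape that matches value, or None."""
    for name, pattern in PERIOD_PATTERNS.items():
        # fullmatch requires the whole string to fit the pattern.
        if pattern.fullmatch(value):
            return name
    return None
```

<p>Under these assumptions, <code>classify_period("2013")</code> yields <code>"year"</code>, and a value that fits none of the shapes yields <code>None</code>.</p>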
</div>
</section>
<section id="provenance" about="[this:#lsd-analysis]" rel="schema:hasPart">
<h3 about="[this:#provenance]" property="schema:name">Provenance</h3>
<div about="[this:#provenance]" property="schema:description">
<p>In order to foster trust and confidence for the user, the human-centred interface as well as the machine-friendly representation of the data are accompanied by provenance data. On the analysis interface, an <q>Oh yeah?</q> link guides users to a page about the provenance activity for the information. These previously generated provenance activities provide links to all data sources which were used for the analysis, the query constructed for data aggregation, as well as metadata about the tools used, assigned license, production timestamps, and responsible agents for the generated analysis. Thus, in addition to analysis metadata, the user is able to track the data all the way back to its origins (at the statistical agencies), and reproduce or compare their results.</p>
</div>
</section>
<section id="comparability" about="[this:#lsd-analysis]" rel="schema:hasPart">
<h3 about="[this:#comparability]" property="schema:name">Comparability</h3>
<div about="[this:#comparability]" property="schema:description">
<p>At this time, the majority of the interlinks in Linked Open Data between statistical concepts (i.e., reference areas) are determined based on their notations and labels. In order to precisely measure the differences between statistical concepts, the following should be factored in: <em>temporality</em>, <em>geographic areas</em>, <em>domains</em>, and <em>drifts</em>, as mentioned in <a href="https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxib3N1bmRncmVufGd4OjJiY2M1OWMyMTBiNmNlZDg">Data quality in information systems</a> [<a href="#ref-21">21</a>], and <a href="http://link.springer.com/chapter/10.1007%2F978-3-642-16438-5_17">What Is Concept Drift and How to Measure It?</a> [<a href="#ref-22">22</a>]. In practice, this means that a reference area from a particular reference period is not necessarily the same concept as another one found elsewhere, unless some of these characteristics are taken into account. To take an example, if an observation has the reference area dimension value RU (Russia) with reference period 2013, the question is to what degree that particular observation can be compared or used with another observation with the same reference area value, but with a reference period between 1922 and 1991, given that the latter reference area historically corresponds to the USSR (Union of Soviet Socialist Republics) and is different from RU. If this sort of metadata is not provided by statistical services or is incorrectly represented in the data, it is worthwhile to account for it, either when interlinks are made across code lists or when the observations from different data sources are used together. Additionally, all assumptions and accuracies should be documented.</p>
</div>
</section>
<section id="data-requirements" about="[this:#lsd-analysis]" rel="schema:hasPart">
<h3 about="[this:#data-requirements]" property="schema:name">Data Requirements</h3>
<div about="[this:#data-requirements]" property="schema:description">
<p>Our expectation regarding the data is that it is modeled using the RDF Data Cube vocabulary and is <a href="http://www.w3.org/TR/vocab-data-cube/#wf">well-formed</a>. Specifically, it needs to pass some of the integrity constraints as outlined by the vocabulary specification. For our application, some of the essential checks are that: 1) a unique data structure definition (DSD) is used for a dataset, 2) the DSD includes a measure (value of each observation), 3) concept dimensions have code lists, and 4) codes are from the code lists.</p>
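<p>The four checks above can be sketched in Python over a simplified in-memory view of a dataset. This is only an illustration under stated assumptions: the dictionary layout (<code>structures</code>, <code>measures</code>, <code>dimensions</code>, <code>observations</code>) and the function name are hypothetical, and the vocabulary specification defines its integrity constraints over RDF rather than over such a structure.</p>

```python
def check_dataset(dataset):
    """Return a list of problems; an empty list means all four checks pass.

    The dataset layout is a hypothetical simplification:
      structures   -- list of DSDs, each with "measures" and "dimensions"
      dimensions   -- maps a dimension name to its code list (a set), or None
      observations -- list of dicts with a "dimensions" name-to-code mapping
    """
    problems = []

    # 1) A unique data structure definition (DSD) is used for the dataset.
    if len(dataset["structures"]) != 1:
        problems.append("dataset must have exactly one DSD")
        return problems
    dsd = dataset["structures"][0]

    # 2) The DSD includes a measure.
    if not dsd["measures"]:
        problems.append("DSD has no measure")

    # 3) Concept dimensions have code lists.
    for name, code_list in dsd["dimensions"].items():
        if not code_list:
            problems.append("dimension %s has no code list" % name)

    # 4) Codes used by observations come from the code lists.
    for obs in dataset["observations"]:
        for name, code in obs["dimensions"].items():
            code_list = dsd["dimensions"].get(name)
            if code_list and code not in code_list:
                problems.append("code %s not in code list of %s" % (code, name))

    return problems
```

<p>A dataset for which <code>check_dataset</code> returns an empty list would be a candidate for the drop-down selections; any other result lists the reasons it was rejected.</p>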
<p>In addition to well-formedness, to compare variables from two datasets, there needs to be an agreement on the concepts that are being matched in the respective observations. In the case of regression analysis, the primary concern is about reference areas (i.e., locations), and making sure that the comparisons made for the observations from dataset<sub>x</sub> (independent variable) and dataset<sub>y</sub> (dependent variable) use concepts that are interlinked (using the property <code>skos:exactMatch</code>). Practically, a concept, for example Switzerland, from at least one of the datasets' code lists should have an arc to the other dataset's concept. This ensures that there is a reliable degree of confidence that the particular concept is interchangeable. Hence, the measure corresponding to the phenomenon being observed is about the same location in both datasets. Concepts in the datasets were interlinked using the <a href="http://aksw.org/Projects/limes">LInk discovery framework for MEtric Spaces</a> (LIMES) [<a href="#ref-23">23</a>]. Figure [<a href="#figure-270a.info.interlinks">3</a>] shows outbound interlinks for the datasets at <a href="http://270a.info/">http://270a.info/</a>.</p>
<figure id="figure-270a.info.interlinks">
<object type="image/svg+xml" width="640" height="480" data="http://270a.info/media/images/270a.cloud.svg"></object>
<figcaption>Outbound interlinks for 270a.info datasets.</figcaption>
</figure>
<p>One additional requirement from the datasets is that the RDF Data Cube component properties (e.g., dimensions, measures) either use <code>sdmx-dimension:refArea</code>, <code>sdmx-dimension:refPeriod</code>, <code>sdmx-measure:obsValue</code> directly or use respective sub-properties (<code>rdfs:subPropertyOf</code>). Given decentralized mappings of the statistical datasets (published as SDMX-ML), their commonality is expected to be the use of, or a reference to, SDMX-RDF properties. This makes it possible to write generalized federated queries with knowledge of only the essential bits of the dataset structures rather than complete knowledge of them.</p>
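<p>To illustrate the sub-property requirement, the sketch below resolves a dataset-specific component property to the SDMX-RDF property it refines by walking <code>rdfs:subPropertyOf</code> links. It is a simplification and not the platform's query logic: the function name is hypothetical, the links are passed in as a plain dictionary, and each property is assumed to have at most one parent.</p>

```python
# The three SDMX-RDF properties the text requires, written as prefixed names.
SDMX_PROPERTIES = {
    "sdmx-dimension:refArea",
    "sdmx-dimension:refPeriod",
    "sdmx-measure:obsValue",
}


def sdmx_role(prop, sub_property_of):
    """Return the SDMX-RDF property that prop is, or refines, else None.

    sub_property_of maps a property to its parent (rdfs:subPropertyOf);
    a single parent per property is an assumption made for brevity.
    """
    seen = set()
    while prop is not None and prop not in seen:
        if prop in SDMX_PROPERTIES:
            return prop
        seen.add(prop)  # guards against cycles in the mapping
        prop = sub_property_of.get(prop)
    return None
```

<p>For example, with a hypothetical mapping <code>{"ex:country": "sdmx-dimension:refArea"}</code>, the property <code>ex:country</code> resolves to <code>sdmx-dimension:refArea</code>, so a generalized query can treat it as the reference area dimension.</p>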
<p>In order to proceed with the analysis, we use the selections made by the user (dataset<sub>x</sub>, dataset<sub>y</sub>, and reference period) and then gather all observations with corresponding reference areas and measures. Only the observations with reference areas which have interlinks are retained in the final result.</p>
</div>
</section>
<section id="application" about="[this:#lsd-analysis]" rel="schema:hasPart">
<h3 about="[this:#application]" property="schema:name">Application</h3>
<div about="[this:#application]" property="schema:description">
<p>The R package <a href="http://www.rstudio.com/shiny/">Shiny</a> [<a href="#ref-24">24</a>] along with <a href="https://github.com/rstudio/shiny-server">Shiny server</a> [<a href="#ref-25">25</a>] is used to build an interactive web application. A Shiny application was built to essentially allow an interaction between the front-end Web application and R. User inputs are set to trigger an event which is sent to the Shiny server and handled by the application written in R. While the application uses R for statistical analysis and visualizations, other statistical computing software could be used to achieve the goals of this research. R was chosen because it is popular open-source software for statistical analysis and is required by Shiny server.</p>
<p>The application assembles a SPARQL query using the input values and then sends it to the SPARQL endpoint at <a href="http://stats.270a.info/sparql">stats.270a.info/sparql</a>, which dispatches federated queries to the two SPARQL endpoints where the datasets are located. The SPARQL query request is handled by the <a href="http://cran.r-project.org/web/packages/SPARQL/index.html">SPARQL client for R</a>. The query results are retrieved and given to R for statistical data analysis. R generates a scatter plot containing the independent and dependent variables, where each point in the chart is a reference area (e.g., country) for that particular reference period selection. Regression analysis is then performed: the correlation, p-value, and line of best fit are determined after testing several linear models, and shown in the user interface.</p>
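<p>The regression step can be illustrated with a small Python sketch, although the platform itself performs it in R. The sketch fits a single ordinary least squares line and computes Pearson's correlation from paired measure values. The function name is hypothetical, the selection among several linear models is not reproduced, and the p-value is left out because it needs a t-distribution that the Python standard library does not provide.</p>

```python
import math


def linear_fit(xs, ys):
    """Fit y = intercept + slope * x by least squares.

    Returns (slope, intercept, r), where r is Pearson's correlation.
    xs and ys are the paired measure values, one pair per reference area.
    """
    n = len(xs)
    if n != len(ys) or n in (0, 1):
        raise ValueError("need at least two paired values")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("a variable with no variation cannot be fitted")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r
```

<p>Each point of the scatter plot contributes one <code>(x, y)</code> pair, and the returned slope and intercept describe the line of best fit drawn through those points.</p>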
</div>
</section>
<section id="federated-queries" about="[this:#lsd-analysis]" rel="schema:hasPart">
<h3 about="[this:#federated-queries]" property="schema:name">Federated Queries</h3>
<div about="[this:#federated-queries]" property="schema:description">
<p>During this research, establishing correct and reasonably performing federated queries was one of the most challenging steps. This was due in part to ensuring dataset integrity and to finding a balance between processing and filtering applicable observations at the remote endpoints and at the originating endpoint. The challenge was to compromise between what should be processed remotely and sent over the wire, and what part of that workload should be handled by the parent endpoint. Since one of the requirements was to ensure that the concepts are interlinked at either one of the endpoints (in which case, it is optional per endpoint), each endpoint had to include each observation's reference area as well as its interlinked concept. The results from both endpoints were first joined and then filtered in order to avoid false negatives. That is, either concept<sub>x</sub> has a <code>skos:exactMatch</code> relationship to concept<sub>y</sub>, or vice versa, or concept<sub>x</sub> and concept<sub>y</sub> are the same. One quick and simple way to minimize the number of results was to filter out exact matches at each endpoint which did not contain the other dataset's domain name, thereby minimizing the number of <em>join</em> operations which had to be handled by the parent endpoint.</p>
<p>To briefly put the cost of queries into perspective, i.e., the conducted tests and the sample sizes of the dataspaces that were used, the total number of triples (including observations and metadata) per endpoint is: 50 thousand (<a href="http://transparency.270a.info/">Transparency International</a> [<a href="#ref-26">26</a>]), 54 million (<a href="http://fao.270a.info/">Food and Agriculture Organization of the United Nations</a> [FAO] [<a href="#ref-27">27</a>]), 305 million (<a href="http://oecd.270a.info/">Organisation for Economic Co-operation and Development</a> [OECD] [<a href="#ref-28">28</a>]), 221 million (<a href="http://worldbank.270a.info/">World Bank</a> [<a href="#ref-29">29</a>]), 470 million (<a href="http://ecb.270a.info/">European Central Bank</a> [ECB] [<a href="#ref-30">30</a>]), and 36 million (<a href="http://imf.270a.info/">International Monetary Fund</a> [IMF] [<a href="#ref-31">31</a>]).</p>
<p>The anatomy of the query is shown in Figure [<a href="#federated-sparql-query">3</a>]. The SPARQL endpoint and the dataset URIs are the only requirements. The structure of the statements and operations tries to get the most out of <a href="http://jena.apache.org/">Apache Jena</a>'s [<a href="#ref-32">32</a>] <a href="http://incubator.apache.org/jena/documentation/tdb/">TDB</a> storage system [<a href="#ref-33">33</a>], <a href="http://jena.apache.org/documentation/tdb/optimizer.html">TDB Optimizer</a> [<a href="#ref-34">34</a>] and <a href="http://incubator.apache.org/jena/documentation/serving_data/index.html">Fuseki</a> [<a href="#ref-35">35</a>] SPARQL endpoints. Better performing queries can be achieved by knowing the predicate frequency upfront and ordering the predicates accordingly for a dataset, to avoid the processing of false negatives.</p>
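<p>As a rough illustration of ordering by predicate frequency, the sketch below sorts triple patterns so that those with less frequent predicates come first, on the assumption that they are the more selective ones. The frequency table, the pattern tuples, and the function name <code>order_patterns</code> are hypothetical; this is not a description of how the TDB Optimizer itself works.</p>

```python
def order_patterns(patterns, predicate_counts):
    """Sort (subject, predicate, object) patterns by ascending predicate count.

    Predicates missing from predicate_counts are treated as most frequent,
    so they are placed last.
    """
    unknown = float("inf")
    return sorted(patterns, key=lambda p: predicate_counts.get(p[1], unknown))

# Hypothetical predicate frequencies for one dataset.
counts = {"qb:dataSet": 1000, "skos:exactMatch": 200}
patterns = [
    ("?obs", "qb:dataSet", "?dataset"),
    ("?refArea", "skos:exactMatch", "?match"),
    ("?obs", "ex:unknown", "?value"),
]
ordered = order_patterns(patterns, counts)
print([p[1] for p in ordered])
```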
<figure id="federated-sparql-query" about="[this:#federated-queries]" rel="schema:hasPart" resource="[this:#federated-sparql-query]" class="listing">
<pre about="[this:#federated-sparql-query]" typeof="fabio:Script" property="schema:description">
<code>SELECT DISTINCT ?refAreaY ?x ?y ?identityX ?identityY</code>
<code>WHERE {</code>
<code>SERVICE <strong>&lt;http://example.org/sparql&gt;</strong> {</code>
<code>SELECT DISTINCT ?identityX ?refAreaX ?refAreaXExactMatch ?x</code>
<code>WHERE {</code>
<code> ?observationX qb:dataSet <strong>&lt;http://example.org/dataset/X&gt;</strong> .</code>
<code> ?observationX ?propertyRefPeriodX <strong>exampleRefPeriod:1234</strong> .</code>
<code> ?propertyRefAreaX rdfs:subPropertyOf* sdmx-dimension:refArea .</code>
<code> ?observationX ?propertyRefAreaX ?refAreaX .</code>
<code> ?propertyMeasureX rdfs:subPropertyOf* sdmx-measure:obsValue .</code>
<code> ?observationX ?propertyMeasureX ?x .</code>
<code> <strong>&lt;http://example.org/dataset/X&gt;</strong></code>
<code> qb:structure/stats:identityDimension ?propertyIdentityX .</code>
<code> ?observationX ?propertyIdentityX ?identityX .</code>
<code> OPTIONAL {</code>
<code> ?refAreaX skos:exactMatch ?refAreaXExactMatch .</code>
<code> FILTER (STRSTARTS(STR(?refAreaXExactMatch), "<strong>http://example.net/</strong>"))</code>
<code> }</code>
<code>}</code>
<code>}</code>
<code>SERVICE <strong>&lt;http://example.net/sparql&gt;</strong> {</code>
<code>SELECT DISTINCT ?identityY ?refAreaY ?refAreaYExactMatch ?y</code>
<code>WHERE {</code>
<code> ?observationY qb:dataSet <strong>&lt;http://example.net/dataset/Y&gt;</strong> .</code>
<code> ?observationY ?propertyRefPeriodY <strong>exampleRefPeriod:1234</strong> .</code>
<code> ?propertyRefAreaY rdfs:subPropertyOf* sdmx-dimension:refArea .</code>
<code> ?observationY ?propertyRefAreaY ?refAreaY .</code>
<code> ?propertyMeasureY rdfs:subPropertyOf* sdmx-measure:obsValue .</code>
<code> ?observationY ?propertyMeasureY ?y .</code>
<code> <strong>&lt;http://example.net/dataset/Y&gt;</strong></code>
<code> qb:structure/stats:identityDimension ?propertyIdentityY .</code>
<code> ?observationY ?propertyIdentityY ?identityY .</code>
<code> OPTIONAL {</code>
<code> ?refAreaY skos:exactMatch ?refAreaYExactMatch .</code>
<code> FILTER (STRSTARTS(STR(?refAreaYExactMatch), "<strong>http://example.org/</strong>"))</code>
<code> }</code>
<code>}</code>
<code>}</code>
<code>FILTER (SAMETERM(?refAreaYExactMatch, ?refAreaX)</code>
<code> || SAMETERM(?refAreaXExactMatch, ?refAreaY)</code>
<code> || SAMETERM(?refAreaY, ?refAreaX))</code>
<code>}</code>
</pre>
<figcaption><span about="[this:#federated-sparql-query]" property="schema:name">Federated SPARQL query integrating statistical linked data.</span></figcaption>
</figure>
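<p>The final <code>FILTER</code> of the query can be mirrored outside SPARQL. The following Python sketch joins rows from the two endpoints and keeps a pair only when the reference areas are identical or are linked by an exact match in either direction. The row layout, the function name <code>join_on_ref_area</code>, and the sample URIs are hypothetical.</p>

```python
def join_on_ref_area(rows_x, rows_y):
    """Pair rows whose reference areas are the same or interlinked.

    Each row is a dict with 'refArea', 'exactMatch' (may be None), and 'value'.
    """
    joined = []
    for rx in rows_x:
        for ry in rows_y:
            same = rx["refArea"] == ry["refArea"]
            x_links_y = rx["exactMatch"] is not None and rx["exactMatch"] == ry["refArea"]
            y_links_x = ry["exactMatch"] is not None and ry["exactMatch"] == rx["refArea"]
            if same or x_links_y or y_links_x:
                joined.append((ry["refArea"], rx["value"], ry["value"]))
    return joined

rows_x = [
    {"refArea": "http://example.org/area/CA", "exactMatch": "http://example.net/area/CA", "value": 4.9},
    {"refArea": "http://example.org/area/XX", "exactMatch": None, "value": 1.0},
]
rows_y = [
    {"refArea": "http://example.net/area/CA", "exactMatch": None, "value": 8.7},
]
print(join_on_ref_area(rows_x, rows_y))
```

<p>The second row of <code>rows_x</code> has no interlink and no identical counterpart, so it is dropped, which corresponds to retaining only observations with interlinked reference areas.</p>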
<p>For the time being, named graphs were deliberately excluded from the SPARQL queries. For federated queries to work with minimal knowledge about store organization, the queries had to work without including graph names. However, by employing the <a href="http://www.w3.org/TR/void/">Vocabulary of Interlinked Datasets</a> (VoID) [<a href="#ref-36">36</a>], it is possible to extract both the location of the SPARQL endpoint and the graph names within it. This is left as a future enhancement.</p>
<p>As statistical datasets are multi-dimensional, slicing the datasets by only reference area and reference period is insufficient to distinguish records. There would likely be duplicate results if the columns were limited to reference area, measure<sub>x</sub>, and measure<sub>y</sub>. For this reason, the datasets are additionally expected to indicate one other dimension by which to group the observations. This grouping is also used to display faceted scatter plots.</p>
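<p>To see why an additional dimension is needed, consider rows that share a reference area but differ in another dimension. The sketch below groups rows by the pair of identity values so that each group can be analysed or plotted as its own facet. The tuple layout, the function name <code>facet_by_identity</code>, and the sample identity values are hypothetical.</p>

```python
from collections import defaultdict

def facet_by_identity(rows):
    """Group (ref_area, x, y, identity_x, identity_y) rows by identity pair."""
    facets = defaultdict(list)
    for ref_area, x, y, identity_x, identity_y in rows:
        facets[(identity_x, identity_y)].append((ref_area, x, y))
    return dict(facets)

rows = [
    ("CA", 4.9, 8.7, "sex:female", "indicator:total"),
    ("CA", 5.4, 8.7, "sex:male", "indicator:total"),
    ("FR", 3.2, 6.9, "sex:female", "indicator:total"),
]
facets = facet_by_identity(rows)
print(sorted(facets))
```

<p>Without the identity values, the two rows for the same reference area would be indistinguishable duplicates in a three-column result.</p>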
<p>Recommendations from <a href="http://arxiv.org/abs/1304.0567">On the Formulation of Performant SPARQL Queries</a> [<a href="#ref-37">37</a>] and <a href="http://arxiv.org/abs/1306.1723">Querying over Federated SPARQL Endpoints — A State of the Art Survey</a> [<a href="#ref-38">38</a>] were applied where applicable.</p>
</div>
</section>
<section id="analysis-caching-and-storing" about="[this:#lsd-analysis]" rel="schema:hasPart">
<h3 about="[this:#analysis-caching-and-storing]" property="schema:name">Analysis caching and storing</h3>
<div about="[this:#analysis-caching-and-storing]" property="schema:description">
<p>In order to optimize application reactivity for all users, the analysis options previously selected by users are cached in the Shiny server session. That is, the service is able to provide cached results which were triggered by different users.</p>
<p>In addition to a cache that is closest to the user, the results of the federated queries as well as of the previously conducted R analysis are stored back into the RDF store with a SPARQL Update. This serves multiple purposes. In the event that the Shiny server is restarted and the cache is no longer available, previously calculated results in the store can be reused, which is still more cost-efficient than making new federated queries.</p>
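<p>The lookup order described above can be sketched as follows: the session cache is consulted first, then the RDF store, and a federated query is made only as a last resort. The callables <code>load_from_store</code>, <code>save_to_store</code>, and <code>run_federated_query</code> are hypothetical stand-ins and do not reflect the actual implementation, which is written in R.</p>

```python
def get_analysis(key, session_cache, load_from_store, save_to_store, run_federated_query):
    """Return the analysis for key, using the cheapest available source."""
    if key in session_cache:
        return session_cache[key]
    result = load_from_store(key)  # expected to return None when nothing is stored
    if result is None:
        result = run_federated_query(key)
        save_to_store(key, result)  # e.g. written back with a SPARQL Update
    session_cache[key] = result
    return result

# Hypothetical in-memory stand-ins to exercise the lookup order.
store = {}
calls = []

def run_query(key):
    calls.append(key)
    return {"analysis": key}

cache = {}
key = "worldbank:SP.DYN.IMRT.IN/transparency:CPI2009/year:2009"
get_analysis(key, cache, store.get, store.__setitem__, run_query)
cache.clear()  # simulate a server restart that empties the session cache
get_analysis(key, cache, store.get, store.__setitem__, run_query)
print(len(calls))
```

<p>In this run the federated query is made only once; after the simulated restart the result is served from the store.</p>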
<p>Another reason for storing the results back in the RDF store is to offer them over the stats.270a.info SPARQL endpoint for additional discovery and reuse of analyses by researchers. Interesting use cases from this approach emerge immediately. For instance, a researcher or journalist can investigate analyses that meet their criteria. Some examples are as follows:</p>
<ul>
<li>analyses which are statistically significant and concern Gross Domestic Product (GDP) and health subjects,</li>
<li>a list of indicator pairs with strong correlations,</li>
<li>using the line of best fit of a regression analysis to predict or forecast possible outcomes,</li>
<li>countries which have a lower-than-average mortality rate together with high corruption.</li>
</ul>
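<p>The first two use cases amount to filtering the stored analyses. The sketch below does so over plain Python dictionaries; the keys, the sample values, and the thresholds (a p-value of at most 0.05 and an absolute correlation of at least 0.7) are assumptions made for illustration and are not taken from the stored data or its vocabulary.</p>

```python
def strong_and_significant(analyses, min_abs_correlation=0.7, max_p_value=0.05):
    """Keep analyses that pass both the correlation and the p-value threshold."""
    return [
        a for a in analyses
        if abs(a["correlation"]) >= min_abs_correlation
        and max_p_value >= a["p_value"]
    ]

# Hypothetical analysis summaries.
analyses = [
    {"pair": ("indicator-a", "indicator-b"), "correlation": -0.82, "p_value": 0.001},
    {"pair": ("indicator-a", "indicator-c"), "correlation": 0.15, "p_value": 0.40},
    {"pair": ("indicator-b", "indicator-c"), "correlation": 0.91, "p_value": 0.20},
]
print([a["pair"] for a in strong_and_significant(analyses)])
```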
</div>
</section>
<section id="uri-patterns" about="[this:#lsd-analysis]" rel="schema:hasPart">
<h3 about="[this:#uri-patterns]" property="schema:name">URI patterns</h3>
<div about="[this:#uri-patterns]" property="schema:description">
<p>The design pattern for analysis URIs aims to keep their length minimal, while leaving a trace to encourage self-exploration and reuse. The general URI pattern for a regression analysis, with base <code>http://stats.270a.info/analysis/</code>, is as follows:</p>
<pre>{independentVariable}/{dependentVariable}/{referencePeriod}</pre>
<p>As the URIs for both the independent and the dependent variable are based on datasets, and the reference period is codified, their prefixed names are used in the analysis URI instead to keep it short and friendly:</p>
<pre>{prefix}:{dataset}/{prefix}:{dataset}/{prefix}:{refPeriod}</pre>
<p>For example, the URI <a href="http://stats.270a.info/analysis/worldbank:SP.DYN.IMRT.IN/transparency:CPI2009/year:2009">http://stats.270a.info/analysis/worldbank:SP.DYN.IMRT.IN/transparency:CPI2009/year:2009</a> refers to an analysis with the infant mortality rate from the World Bank dataset as the independent variable, the 2009 Corruption Perceptions Index from the Transparency International dataset as the dependent variable, and the year 2009 as the reference period. The variable values are prefixed names which correspond to their respective datasets, i.e., <code>worldbank:SP.DYN.IMRT.IN</code> becomes <a href="http://worldbank.270a.info/dataset/SP.DYN.IMRT.IN">http://worldbank.270a.info/dataset/SP.DYN.IMRT.IN</a>, and <code>transparency:CPI2009</code> becomes <a href="http://transparency.270a.info/dataset/CPI2009">http://transparency.270a.info/dataset/CPI2009</a> when processed.</p>
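<p>The pattern and the expansion of prefixed names can be sketched in a few lines of Python. The base URI and the two namespace entries come from the example above; the function names and the error handling are hypothetical and only meant to illustrate the mapping.</p>

```python
BASE = "http://stats.270a.info/analysis/"

# Prefix-to-namespace map; only these two entries are taken from the text.
DATASET_NAMESPACES = {
    "worldbank": "http://worldbank.270a.info/dataset/",
    "transparency": "http://transparency.270a.info/dataset/",
}

def analysis_uri(independent, dependent, ref_period):
    """Build an analysis URI from three prefixed names."""
    return BASE + "/".join([independent, dependent, ref_period])

def expand_dataset(prefixed_name):
    """Expand a prefixed dataset name such as 'worldbank:SP.DYN.IMRT.IN'."""
    prefix, _, local = prefixed_name.partition(":")
    if prefix not in DATASET_NAMESPACES or not local:
        raise ValueError("unknown or malformed prefixed name: " + prefixed_name)
    return DATASET_NAMESPACES[prefix] + local

print(analysis_uri("worldbank:SP.DYN.IMRT.IN", "transparency:CPI2009", "year:2009"))
print(expand_dataset("transparency:CPI2009"))
```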
</div>
</section>
<section id="vocabularies" about="[this:#lsd-analysis]" rel="schema:hasPart">
<h3 about="[this:#vocabularies]" property="schema:name">Vocabularies</h3>
<div about="[this:#vocabularies]" property="schema:description">
<p>Besides the common vocabularies: RDF, RDFS, XSD, OWL, the RDF Data Cube vocabulary is used to describe multi-dimensional statistical data, and SDMX-RDF for the statistical information model. PROV-O is used for provenance coverage.</p>
<p>A statistical vocabulary (<a href="http://stats.270a.info/vocab">http://stats.270a.info/vocab</a>) [<a href="#ref-39">39</a>] is created to describe analyses. It contains classes for analyses, summaries, and each data row that is retrieved. Some of the properties include: graph (e.g., scatter plot), independent and dependent variables, reference period, sample size, p-value, correlation value, correlation method that is used, adjusted R-squared, best model that is tested, reference area, measure values for both variables, and the identity concept for both variables.</p>
<p>Future plans for this vocabulary are to reflect back on the experience and to consider alignment with the <a href="http://semanticscience.org/ontology/sio.owl">Semanticscience Integrated Ontology</a> (SIO) [<a href="#ref-40">40</a>]. While SIO is richer, queries over it are more complex than necessary for simple analysis reuse at stats.270a.info.</p>
</div>
</section>
</div>
</section>
<section id="results" about="[this:]" rel="schema:hasPart">
<h2 about="[this:#results]" property="schema:name">Results</h2>
<div about="[this:#results]" property="schema:description">
<p>Putting it all together: following the Linked Data design principles, the platform for linked statistical data analyses is now available for different types of users. Human users with a Web browser can interact with the application with a few clicks. This is arguably the simplest approach for researchers and journalists, as it spares them from going down the development road. Additionally, humans as well as machines can consume the same analysis as an RDF or JSON serialization. In the case of JSON, the analyses can be used as part of a widget on a webpage. The Scalable Vector Graphics (SVG) format of the scatter plot can be used in articles on the Web. Storing the analyses permanently and making them accessible over a SPARQL endpoint opens up the possibility for researchers to discover interesting statistics. Finally, with the help of Apache Rewrites, <a href="http://csarven.ca/statistical-linked-dataspaces#linked-data-pages">Linked Data Pages</a> [<a href="#ref-41">41</a>] handles the top-down direction of these requests and provides dereferenceable URIs for a <em>follow your nose</em> type of exploration. The <a href="https://github.com/csarven/lsd-analysis">source code</a> [<a href="#ref-42">42</a>] is available in a public repository.</p>
</div>
</section>
<section id="discussion-and-evaluation" about="[this:]" rel="schema:hasPart">
<h2 about="[this:#discussion-and-evaluation]" property="schema:name">Discussion and Evaluation</h2>
<div about="[this:#discussion-and-evaluation]" property="schema:description">
<p>In order to test and experiment with the techniques outlined in our work, we postulate that the approaches to conducting federated queries that gather the data necessary for analysis can be summarized as constructing either general or custom query patterns. The generalized approach is essentially where the same query pattern is used for all endpoints. The custom approach is where each endpoint gets a uniquely constructed query pattern.</p>
<p>The generalized approach is to write queries that can work over any endpoint which hosts well-formed statistical data passing the integrity checks of the RDF Data Cube vocabulary. In this case, therefore, the queries do not contain any predetermined information about the endpoint. The approach is essentially aimed at scaling the number of endpoints within its reach. Achieving optimal query patterns tends to be challenging in this case.</p>
<p>In contrast, the custom approach offers an improvement over the generalized approach when performance is in question, since it can factor in information about the data organization for each endpoint. This typically includes information such as the named graphs to look into, available interlinks, metadata about comparability, and so on. When this information is available to the system in a controlled environment (e.g., through endpoint monitoring), industrial use cases would benefit from the custom approach, as the performance of the system is essential.</p>
<p>For our research, the technique we focused on favoured the scalability of the system with minimal human intervention. This was achieved with an implementation in which the system only needs to be made aware of new statistical Linked Data endpoints; the rest of the pipeline and interface then function as consistently as before. Having said that, it is worth repeating that this approach is not ideal if performance is the top priority.</p>
</div>
</section>
<section id="conclusions" about="[this:]" rel="schema:hasPart">
<h2 about="[this:#conclusions]" property="schema:name">Conclusions and Future Work</h2>
<div about="[this:#conclusions]" property="schema:description" typeof="deo:Conclusion">
<p>We believe that the work presented here and the prior <a href="http://csarven.ca/linked-sdmx-data">Linked SDMX Data</a> effort contributed towards strengthening the relationship between the Semantic Web / Linked Data and statistical communities. The stats.270a.info service is intended to allow humans and machines to explore statistical analyses.</p>
<p>In the following we discuss some research and application areas that are planned in future work:</p>
<p>Making the query optimization file from Jena TDB available in RDF and at SPARQL endpoints (or placed in VoID along with <a href="https://github.com/AKSW/LODStats">LODStats</a> [<a href="#ref-43">43</a>]) can help to devise better performing federated queries.</p>
<p>With the availability of more interlinks across datasets, we can investigate analyses that are not dependent on reference areas. For instance, interlinking currencies, health matters, policies, or concepts on comparability can contribute towards various analyses.</p>
<p>Enriching the datasets with information on comparability can lead to achieving more coherent results. This is particularly important given that the <a href="http://epp.eurostat.ec.europa.eu/portal/page/portal/quality/code_of_practice">European Statistics Code of Practice</a> [<a href="#ref-44">44</a>] from the <a href="http://ec.europa.eu/">European Commission</a> lists <em>Coherence and Comparability</em> as one of the principles that national and community statistical authorities should adhere to. While the research at hand is not obligated to follow those guidelines, they are highly relevant for providing quality statistical analyses.</p>
<p>The availability of the analysis in a JSON serialization, and the cached scatter plot in SVG format, makes it possible for a webpage widget to use them. For instance, they can be dynamically used in articles or wiki pages with all references intact. As the Linked Data approach allows one to explore resources from one item to another, consumers of the article can follow the trace all the way back to the source. This is arguably an ideal scenario to show provenance and references for fact-checking in online or journal articles. Moreover, since the analysis is stored, and the queried data can also be exported in different formats, it can be reused to reproduce the results.</p>
<p>This brings us to an outlook for Linked Statistical Data Analyses. The reuse of Linked analyses artefacts as well as the approach to collecting data from different sources can help us build smarter systems. They can be employed in fact-checking scenarios as well as in uncovering decision-making processes, where knowledge from different sources is put to its potential use when combined.</p>
</div>
</section>
<section id="acknowledgements" about="[this:]" rel="schema:hasPart">
<h2 about="[this:#acknowledgements]" property="schema:name">Acknowledgements</h2>
<div about="[this:#acknowledgements]" property="schema:description">
<p>Many thanks to colleagues who helped in one way or another during the course of this work (not implying any endorsement); in no particular order: <a href="http://www.transparency.org/whoweare/contact#S_deborah_hardoon">Deborah Hardoon</a> (<a href="http://transparency.org/">Transparency International</a>), <a href="http://bis.uni-leipzig.de/AxelNgonga">Axel-Cyrille Ngonga Ngomo</a> (<a href="http://www.zv.uni-leipzig.de/">Universität Leipzig</a>, <a href="http://aksw.org/">AKSW</a>), <a href="http://www.wirtschaft.bfh.ch/de/ueber_uns/kontakt/detailseite.html?tx_bfhpersonalpages_p=rca2&tx_bfhpersonalpages_screen=data">Alberto Rascón</a> (<a href="http://bfh.ch/">Berner Fachhochschule</a> [BFH]), <a href="http://www.wirtschaft.bfh.ch/de/ueber_uns/kontakt/detailseite.html?tx_bfhpersonalpages_p=mam10&tx_bfhpersonalpages_screen=data">Michael Mosimann</a> (BFH), <a href="http://www.joecheng.com/">Joe Cheng</a> (<a href="http://www.rstudio.com/">RStudio, Inc.</a>), <a href="http://www.w3.org/2011/gld/wiki/Main_Page">Government Linked Data Working Group</a>, <a href="https://groups.google.com/forum/#!forum/publishing-statistical-data">Publishing Statistical Data</a> group, <a href="http://jena.apache.org/">Apache Jena</a>, <a href="http://www.epimorphics.com/web/about#afs">Andy Seaborne</a> (<a href="http://epimorphics.com/">Epimorphics Ltd</a>), <a href="http://richard.cyganiak.de/#">Richard Cyganiak</a> (<a href="http://deri.ie/">Digital Enterprise Research Institute</a> [DERI]). Thanks also to DERI for graciously offering to host this work on their servers.</p>
</div>
</section>
<section id="references">
<h2>References</h2>
<div>
<ol>
<li id="ref-1">stats.270a.info, <a about="[this:]" rel="schema:citation" href="http://stats.270a.info/">http://stats.270a.info/</a></li>
<li id="ref-2">Capadisli, S.: Statistical Linked Dataspaces. Master's thesis, National University of Ireland (2012), <a about="[this:]" rel="schema:citation" href="http://csarven.ca/statistical-linked-dataspaces">http://csarven.ca/statistical-linked-dataspaces</a></li>
<li id="ref-3">The RDF Data Cube vocabulary, <a about="[this:]" rel="schema:citation" href="http://www.w3.org/TR/vocab-data-cube/">http://www.w3.org/TR/vocab-data-cube/</a></li>
<li id="ref-4">Capadisli, S., Auer, S., Ngonga Ngomo, A.-C.: Linked SDMX Data, Semantic Web Journal (2013), <a about="[this:]" rel="schema:citation" href="http://csarven.ca/linked-sdmx-data">http://csarven.ca/linked-sdmx-data</a></li>
<li id="ref-5">270a.info, <a about="[this:]" rel="schema:citation" href="http://270a.info/">http://270a.info/</a></li>
<li id="ref-6">The PROV Ontology, <a about="[this:]" rel="schema:citation" href="http://www.w3.org/TR/prov-o/">http://www.w3.org/TR/prov-o/</a></li>
<li id="ref-7">Zapilko, B., Mathiak, B.: Performing Statistical Methods on Linked Data, Proc. Int'l Conf. on Dublin Core and Metadata Applications (2011), <a about="[this:]" rel="schema:citation" href="http://dcevents.dublincore.org/IntConf/dc-2011/paper/download/27/16">http://dcevents.dublincore.org/IntConf/dc-2011/paper/download/27/16</a></li>
<li id="ref-8">The R Project for Statistical Computing, <a href="http://www.r-project.org/">http://www.r-project.org/</a></li>
<li id="ref-9">Zapilko, B., Mathiak, B.: Defining and Executing Assessment Tests on Linked Data for Statistical Analysis, COLD, ISWC (2011), <a about="[this:]" rel="schema:citation" href="http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Workshops/COLD/cold2011_submission_13.pdf">http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Workshops/COLD/cold2011_submission_13.pdf</a></li>
<li id="ref-10">Hage, W. R. v., Erp, M. v., Malaisé, V.: Linked Open Piracy: A story about e-Science, Linked Data, and statistics (2012), <a about="[this:]" rel="schema:citation" href="http://www.few.vu.nl/~wrvhage/papers/LOP_JoDS_2012.pdf">http://www.few.vu.nl/~wrvhage/papers/LOP_JoDS_2012.pdf</a></li>
<li id="ref-11">SPARQL client for R, <a about="[this:]" rel="schema:citation" href="http://cran.r-project.org/web/packages/SPARQL/index.html">http://cran.r-project.org/web/packages/SPARQL/index.html</a></li>
<li id="ref-12">McCusker, J. P., McGuinness, D. L., Lee, J., Thomas, C., Courtney, P., Tatalovich, Z., Contractor, N., Morgan, G., Shaikh, A.: Towards Next Generation Health Data Exploration: A Data Cube-based Investigation into Population Statistics for Tobacco, Hawaii International Conference on System Sciences (2012), <a about="[this:]" rel="schema:citation" href="http://www.hicss.hawaii.edu/hicss_46/bp46/hc6.pdf">http://www.hicss.hawaii.edu/hicss_46/bp46/hc6.pdf</a></li>
<li id="ref-13">qb.js, <a about="[this:]" rel="schema:citation" href="http://orion.tw.rpi.edu/~jimmccusker/qb.js/">http://orion.tw.rpi.edu/~jimmccusker/qb.js/</a></li>
<li id="ref-14">Salas, P. E. R., Mota, F. M. D., Martin, M., Auer, S., Breitman, K., Casanova, M. A.: Publishing Statistical Data on the Web, ISWC (2012), <a about="[this:]" rel="schema:citation" href="http://svn.aksw.org/papers/2012/ESWC_PublishingStatisticData/public.pdf">http://svn.aksw.org/papers/2012/ESWC_PublishingStatisticData/public.pdf</a></li>
<li id="ref-15">CubeViz, <a about="[this:]" rel="schema:citation" href="http://aksw.org/Projects/CubeViz">http://aksw.org/Projects/CubeViz</a></li>
<li id="ref-16">Google Public Data Explorer, <a about="[this:]" rel="schema:citation" href="http://www.google.com/publicdata/">http://www.google.com/publicdata/</a></li>
<li id="ref-17">Gapminder, <a about="[this:]" rel="schema:citation" href="http://www.gapminder.org/">http://www.gapminder.org/</a></li>
<li id="ref-18">Dataset Publishing Language, <a about="[this:]" rel="schema:citation" href="https://developers.google.com/public-data/">https://developers.google.com/public-data/</a></li>
<li id="ref-19">Paulheim, H.: Generating Possible Interpretations for Statistics from Linked Open Data, ESWC (2012), <a about="[this:]" rel="schema:citation" href="http://www.ke.tu-darmstadt.de/bibtex/attachments/single/310">http://www.ke.tu-darmstadt.de/bibtex/attachments/single/310</a></li>
<li id="ref-20">Explain-a-LOD, <a about="[this:]" rel="schema:citation" href="http://www.ke.tu-darmstadt.de/resources/explain-a-lod">http://www.ke.tu-darmstadt.de/resources/explain-a-lod</a></li>
<li id="ref-21">Sundgren, B.: <a about="[this:]" rel="schema:citation" href="https://docs.google.com/viewer?a=v&pid=sites&srcid=ZGVmYXVsdGRvbWFpbnxib3N1bmRncmVufGd4OjJiY2M1OWMyMTBiNmNlZDg">Data quality in information systems</a>, Workshop on Data Quality (2013)</li>
<li id="ref-22">Wang, S., Schlobach, S., Klein, M.C.A.: What Is Concept Drift and How to Measure It? In: Knowledge Engineering and Management by the Masses - 17th International Conference, EKAW 2010. Proceedings. pp. 241–256. Lecture Notes in Computer Science, 6317, Springer (2010), <a about="[this:]" rel="schema:citation" href="http://link.springer.com/chapter/10.1007%2F978-3-642-16438-5_17">http://link.springer.com/chapter/10.1007%2F978-3-642-16438-5_17</a></li>
<li id="ref-23">Ngonga Ngomo, A.-C.: LInk discovery framework for MEtric Spaces (LIMES): A Time-Efficient Hybrid Approach to Link Discovery (2011), <a about="[this:]" rel="schema:citation" href="http://aksw.org/Projects/limes">http://aksw.org/Projects/limes</a></li>
<li id="ref-24">Shiny, <a about="[this:]" rel="schema:citation" href="http://www.rstudio.com/shiny/">http://www.rstudio.com/shiny/</a></li>
<li id="ref-25">Shiny server, <a about="[this:]" rel="schema:citation" href="https://github.com/rstudio/shiny-server">https://github.com/rstudio/shiny-server</a></li>
<li id="ref-26">Transparency International, <a about="[this:]" rel="schema:citation" href="http://transparency.270a.info/">http://transparency.270a.info/</a></li>
<li id="ref-27">Food and Agriculture Organization of the United Nations, <a about="[this:]" rel="schema:citation" href="http://fao.270a.info/">http://fao.270a.info/</a></li>
<li id="ref-28">Organisation for Economic Co-operation and Development, <a about="[this:]" rel="schema:citation" href="http://oecd.270a.info/">http://oecd.270a.info/</a></li>
<li id="ref-29">World Bank, <a about="[this:]" rel="schema:citation" href="http://worldbank.270a.info/">http://worldbank.270a.info/</a></li>
<li id="ref-30">European Central Bank, <a about="[this:]" rel="schema:citation" href="http://ecb.270a.info/">http://ecb.270a.info/</a></li>
<li id="ref-31">International Monetary Fund, <a about="[this:]" rel="schema:citation" href="http://imf.270a.info/">http://imf.270a.info/</a></li>
<li id="ref-32">Apache Jena, <a about="[this:]" rel="schema:citation" href="http://jena.apache.org/">http://jena.apache.org/</a></li>
<li id="ref-33">Jena TDB, <a about="[this:]" rel="schema:citation" href="http://jena.apache.org/documentation/tdb/index.html">http://jena.apache.org/documentation/tdb/index.html</a></li>
<li id="ref-34">Jena TDB Optimizer, <a about="[this:]" rel="schema:citation" href="http://jena.apache.org/documentation/tdb/optimizer.html">http://jena.apache.org/documentation/tdb/optimizer.html</a></li>
<li id="ref-35">Jena Fuseki, <a about="[this:]" rel="schema:citation" href="https://jena.apache.org/documentation/serving_data/">https://jena.apache.org/documentation/serving_data/</a></li>
<li id="ref-36">Vocabulary of Interlinked Datasets, <a about="[this:]" rel="schema:citation" href="http://www.w3.org/TR/void/">http://www.w3.org/TR/void/</a></li>
<li id="ref-37">Loizou, A., Groth, P.: On the Formulation of Performant SPARQL Queries, arXiv:1304.0567 (2013), <a about="[this:]" rel="schema:citation" href="http://arxiv.org/abs/1304.0567">http://arxiv.org/abs/1304.0567</a></li>
<li id="ref-38">Rakhmawati, N.R., Umbrich, J., Karnstedt, M., Hasnain, A., Hausenblas, M.: Querying over Federated SPARQL Endpoints — A State of the Art Survey, arXiv:1306.1723 (2013), <a about="[this:]" rel="schema:citation" href="http://arxiv.org/abs/1306.1723">http://arxiv.org/abs/1306.1723</a></li>
<li id="ref-39">Stats Vocab, <a about="[this:]" rel="schema:citation" href="http://stats.270a.info/vocab">http://stats.270a.info/vocab</a></li>
<li id="ref-40">Semanticscience Integrated Ontology, <a about="[this:]" rel="schema:citation" href="http://semanticscience.org/ontology/sio.owl">http://semanticscience.org/ontology/sio.owl</a></li>
<li id="ref-41">Linked Data Pages, <a about="[this:]" rel="schema:citation" href="http://csarven.ca/statistical-linked-dataspaces#linked-data-pages">http://csarven.ca/statistical-linked-dataspaces#linked-data-pages</a></li>
<li id="ref-42">LSD Analysis code at GitHub, <a about="[this:]" rel="schema:citation" href="https://github.com/csarven/lsd-analysis">https://github.com/csarven/lsd-analysis</a></li>
<li id="ref-43">Demter, J., Auer, S., Martin, M., Lehmann, J.: LODStats – An Extensible Framework for High-performance Dataset Analytics, EKAW (2012), <a about="[this:]" rel="schema:citation" href="http://svn.aksw.org/papers/2011/RDFStats/public.pdf">http://svn.aksw.org/papers/2011/RDFStats/public.pdf</a></li>
<li id="ref-44">European Statistics Code of Practice, <a about="[this:]" rel="schema:citation" href="http://ec.europa.eu/eurostat/web/products-manuals-and-guidelines/-/KS-32-11-955">http://ec.europa.eu/eurostat/web/products-manuals-and-guidelines/-/KS-32-11-955</a></li>
</ol>
</div>
</section>
</div>
</article>
</body>
</html>