<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml"
xml:lang="en" lang="en">
<head>
<meta charset="utf-8"/>
<title>Semantic Similarity and Correlation of Linked Statistical Data Analysis</title>
<link rel="stylesheet" media="all" title="LNCS" href="media/css/lncs.css"/>
<link rel="stylesheet alternate" media="all" title="ACM" href="media/css/acm.css"/>
<link rel="stylesheet" media="all" href="media/css/lr.css"/>
<script src="http://code.jquery.com/jquery-2.1.3.min.js"></script>
<script src="scripts/html.sortable.min.js"></script>
<script src="scripts/lr.js"></script>
</head>
<body about="[this:]" typeof="schema:CreativeWork sioc:Post schema:ScholarlyArticle prov:Entity" class="h-feed" prefix="rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs: http://www.w3.org/2000/01/rdf-schema# owl: http://www.w3.org/2002/07/owl# xsd: http://www.w3.org/2001/XMLSchema# dcterms: http://purl.org/dc/terms/ foaf: http://xmlns.com/foaf/0.1/ v: http://www.w3.org/2006/vcard/ns# pimspace: http://www.w3.org/ns/pim/space# cc: http://creativecommons.org/ns# skos: http://www.w3.org/2004/02/skos/core# prov: http://www.w3.org/ns/prov# schema: http://schema.org/ rsa: http://www.w3.org/ns/auth/rsa# cert: http://www.w3.org/ns/auth/cert# cal: http://www.w3.org/2002/12/cal/ical# wgs: http://www.w3.org/2003/01/geo/wgs84_pos# org: http://www.w3.org/ns/org# biblio: http://purl.org/net/biblio# bibo: http://purl.org/ontology/bibo/ book: http://purl.org/NET/book/vocab# ov: http://open.vocab.org/terms/ doap: http://usefulinc.com/ns/doap# dbr: http://dbpedia.org/resource/ dbp: http://dbpedia.org/property/ sio: http://semanticscience.org/resource/ opmw: http://www.opmw.org/ontology/ deo: http://purl.org/spar/deo/ doco: http://purl.org/spar/doco/ cito: http://purl.org/spar/cito/ fabio: http://purl.org/spar/fabio/ oa: http://www.w3.org/ns/oa# this: http://csarven.ca/sense-of-lsd-analysis">
<article class="h-entry">
<h1 class="p-name" property="schema:name">Semantic Similarity and Correlation of Linked Statistical Data Analysis</h1>
<div id="authors">
<dl id="author-name">
<dt>Authors</dt>
<dd id="author-1"><span about="[this:]" rel="schema:creator schema:publisher schema:contributor schema:author"><a about="http://csarven.ca/#i" typeof="schema:Person" rel="schema:url" property="schema:name" href="http://csarven.ca/">Sarven Capadisli</a></span><span about="http://csarven.ca/#i" rel="schema:memberOf" resource="[dbr:Bern_University_of_Applied_Sciences]"></span><span about="http://csarven.ca/#i" rel="schema:memberOf" resource="[dbr:University_of_Bonn]"></span><sup><a href="#author-org-1">1</a></sup><sup><a href="#author-org-2">2</a></sup><sup><a href="#author-email-1">✊</a></sup></dd>
<dd id="author-3"><span about="[this:]" rel="schema:contributor"><a href="http://www.albertmeronyo.org/" property="schema:name">Albert Meroño-Peñuela</a></span><span about="http://www.albertmeronyo.org/" rel="schema:memberOf" resource="[dbr:VU_University_Amsterdam]"></span><sup><a href="#author-org-3">3</a></sup><sup><a href="#author-email-2">ℹ</a></sup></dd>
<dd id="author-2"><span about="[this:]" rel="schema:contributor"><a about="[this:#SörenAuer]" typeof="schema:Person" rel="schema:url" property="schema:name" href="http://www.iai.uni-bonn.de/~auer/">Sören Auer</a></span><span about="[this:#SörenAuer]" rel="schema:memberOf" resource="[dbr:University_of_Bonn]"></span><sup><a href="#author-org-2">2</a></sup><sup><a href="#author-email-3">⚛</a></sup></dd>
<dd id="author-4"><span about="[this:]" rel="schema:contributor"><a about="[this:#ReinhardRiedl]" typeof="schema:Person" rel="schema:url" property="schema:name" href="http://www.wirtschaft.bfh.ch/de/ueber_uns/kontakt/detailseite.html?tx_bfhpersonalpages_p=rer2&tx_bfhpersonalpages_screen=data">Reinhard Riedl</a></span><span about="[this:#ReinhardRiedl]" rel="schema:memberOf" resource="[dbr:Bern_University_of_Applied_Sciences]"></span><sup><a href="#author-org-1">1</a></sup><sup><a href="#author-email-4">𝄞</a></sup></dd>
</dl>
<ul id="author-org">
<li id="author-org-1"><sup>1</sup><a about="[dbr:Bern_University_of_Applied_Sciences]" typeof="schema:Organization" property="schema:name" rel="schema:url" href="http://bfh.ch/">Bern University of Applied Sciences</a>, E-Government-Institute, Bern, Switzerland</li>
<li id="author-org-2"><sup>2</sup><a about="[dbr:University_of_Bonn]" typeof="schema:Organization" property="schema:name" rel="schema:url" href="http://uni-bonn.de/">University of Bonn</a>, Enterprise Information Systems Department, Bonn, Germany</li>
<li id="author-org-3"><sup>3</sup><a about="[dbr:VU_University_Amsterdam]" typeof="schema:Organization" property="schema:name" rel="schema:url" href="http://vu.nl/">VU University Amsterdam</a>, Department of Computer Science, Amsterdam, Netherlands</li>
<li></li>
</ul>
<ul id="author-email">
<li id="author-email-1"><sup>✊</sup><a about="http://csarven.ca/#i" rel="schema:email" href="mailto:[email protected]" class="author-email">[email protected]</a></li>
<li id="author-email-2"><sup>ℹ</sup><a about="http://www.albertmeronyo.org/" rel="schema:email" href="mailto:[email protected]" class="author-email">[email protected]</a></li>
<li id="author-email-3"><sup>⚛</sup><a about="http://aksw.org/SoerenAuer" rel="schema:email" href="mailto:[email protected]">[email protected]</a></li>
<li id="author-email-4"><sup>𝄞</sup><a about="http://www.wirtschaft.bfh.ch/de/ueber_uns/kontakt/detailseite.html?tx_bfhpersonalpages_p=rer2&tx_bfhpersonalpages_screen=data" rel="schema:email" href="mailto:[email protected]">[email protected]</a></li>
</ul>
</div>
<dl id="document-identifier">
<dt>Document ID</dt>
<dd><a href="http://csarven.ca/sense-of-lsd-analysis">http://csarven.ca/sense-of-lsd-analysis</a></dd>
</dl>
<dl id="document-published">
<dt>Published</dt>
<dd><time datetime="2014-07-21" property="schema:datePublished" content="2014-07-21T09:00:00Z" datatype="xsd:dateTime">2014-07-21</time></dd>
</dl>
<dl id="document-modified">
<dt>Modified</dt>
<dd><time datetime="2015-03-21" property="schema:dateModified" content="2015-03-21T09:00:00Z" datatype="xsd:dateTime">2015-03-21</time></dd>
</dl>
<dl id="document-license">
<dt>License</dt>
<dd><a about="[this:]" rel="license schema:license" href="http://creativecommons.org/licenses/by-sa/4.0/" title="Creative Commons Attribution-ShareAlike 4.0 International">CC BY-SA 4.0</a></dd>
</dl>
<dl id="document-purpose">
<dt>Purpose</dt>
<dd property="schema:purpose">Making “sense” of Linked Statistical Data and Analysis.</dd>
</dl>
<div id="content" class="e-content">
<section id="abstract" about="[this:]">
<h2>Abstract</h2>
<div property="schema:abstract" class="p-summary">
<p>Statistical data is increasingly made available in the form of Linked Data on the Web. As more and more statistical datasets become available, a fundamental question on statistical data comparability arises: To what extent can arbitrary statistical datasets be faithfully compared? Besides purely statistical comparability, we are interested in the role that semantics plays in the data to be compared. Our hypothesis is that semantic relationships between different components of statistical datasets might be related to their statistical correlation. Our research focuses on studying whether these statistical and semantic relationships influence each other, by comparing the correlation of statistical data with their semantic similarity. The ongoing research problem is, hence, to investigate why machines have difficulty in revealing meaningful correlations or establishing non-coincidental connections between variables in statistical datasets. We describe a fully reproducible pipeline to compare statistical correlation with semantic similarity in arbitrary Linked Statistical Data. We present a use case using World Bank data expressed as RDF Data Cube, and we examine whether dataset titles can help predict strong correlations.</p>
</div>
</section>
<section id="keywords" about="[this:]">
<h2>Keywords</h2>
<div>
<ul rel="schema:about">
<li><a resource="http://dbpedia.org/resource/Linked_Data" href="http://en.wikipedia.org/wiki/Linked_Data">Linked Data</a></li>
<li><a resource="http://dbpedia.org/resource/Statistics" href="http://en.wikipedia.org/wiki/Statistics">Statistics</a></li>
<li><a resource="http://dbpedia.org/resource/Statistical_database" href="http://en.wikipedia.org/wiki/Statistical_database">Statistical database</a></li>
<li><a resource="http://dbpedia.org/resource/Semantic_similarity" href="http://en.wikipedia.org/wiki/Semantic_similarity">Semantic Similarity</a></li>
<li><a resource="http://dbpedia.org/resource/Correlation_and_dependence" href="http://en.wikipedia.org/wiki/Correlation_and_dependence">Correlation</a></li>
</ul>
</div>
</section>
<section id="introduction" about="[this:]" rel="schema:hasPart">
<h2 about="[this:#introduction]" property="schema:name">Introduction</h2>
<div about="[this:#introduction]" property="schema:description" typeof="deo:Introduction">
<p><span about="[this:]" rel="schema:hasPart"><q about="[this:#prologue]" typeof="deo:Prologue" property="schema:description" cite="http://www.open.edu/openlearn/science-maths-technology/mathematics-and-statistics/statistics/the-joy-stats-meaningless-and-meaningful-correlations">There was this American who was afraid of a heart attack and he found out that the Japanese ate very little fat and almost did not drink wine but they had much less heart attacks than the Americans. But on the other hand he also found out that the French eat as much fat as the Americans and they drink much more wine but they also have less heart attacks. So he concluded that what kills you is speaking English</q></span> [<a href="#ref-1">1</a>]. While computers can assist us in discovering strong correlations in large numbers of statistical datasets, whether by chance or through sophisticated methods, humans (sometimes known as <em>domain experts</em>) still need to be critical about the results and interpret them appropriately. This implies that we are still very much involved in the process of discovering meaningful correlations by filtering through everything that is presented to us.</p>
<p about="[this:#motivation]" typeof="deo:Motivation" property="schema:description">If, however, we could improve the situation slightly by having machines present us with only <em>useful</em> correlations from a random mass of correlations, then we could give more of our attention to what is interesting. Hence, our goal is to set a path towards identifying why some variables have a semantic link between them. Before we establish that, our ongoing approach (as outlined in this research and afterwards) will be to refute or rule out things which may masquerade as semantic similarity.</p>
<p>Therefore, we set up our investigation as a workflow for experimenting with Linked Statistical Datasets in the <a href="http://270a.info/">270a Cloud</a> [<a href="#ref-2">2</a>]. We first set our hypothesis to explore the possibility that <em>semantically similar</em> variables or datasets need to incorporate semantically rich information in order to find thought-provoking correlations. Then, the question is, what do exceptional or intriguing linkages for semantic similarity look like? We start with our null hypothesis by checking whether the dataset titles in World Bank indicators can help indicate strong correlations. Our results show that dataset titles, whether by themselves or within a particular topic area, are not a good indicator for predicting correlations.</p>
</div>
</section>
<section id="methodology" about="[this:]" rel="schema:hasPart">
<h2 about="[this:#methodology]" property="schema:name">Methodology</h2>
<div about="[this:#methodology]" typeof="deo:Methods" property="schema:description">
<p>We first state our research design and hypothesis, then discuss how we employed Linked Statistical Data (LSD) and Semantic Similarity approaches for a workflow in our <a href="https://github.com/csarven/lsd-sense">LSD Sense</a> [<a href="#ref-3">3</a>] implementation.</p>
<section id="research-design" about="[this:#methodology]" rel="schema:hasPart">
<h3 about="[this:#research-design]" rel="schema:name">Research design</h3>
<div about="[this:#research-design]" typeof="deo:ProblemStatement" property="schema:description">
<p><strong>Research problem</strong>: Why do machines have difficulty in revealing meaningful correlations or establishing non-coincidental connections between variables in statistical datasets? Put another way: How can machines uncover interesting correlations?</p>
<p>Over this ongoing investigation, we want to uncover some of the fundamental components for measuring and declaring semantic similarity between datasets, in order to better predict relevant strong relationships. Can semantic relatedness between datasets imply statistical correlation of the related data points in the datasets?</p>
</div>
</section>
<section id="hypothesis" about="[this:#methodology]" rel="schema:hasPart">
<h3 about="[this:#hypothesis]" property="schema:name">Hypothesis</h3>
<div about="[this:#hypothesis]" typeof="deo:ProblemStatement" property="schema:description">
<p>Given our research question, we would like to propose a viable research hypothesis, followed by our investigation with the null hypothesis:</p>
<p id="hypothesis-alternative" about="[this:#hypothesis-alternative]" typeof="sio:SIO_000284" property="rdfs:label">H₁: If the absence of semantically rich connections between datasets is inadequate to distinguish meaningful relationships, then making relevant information about dataset connectivity available will improve the prediction of dataset correlations by observing their semantic similarity.</p>
<p id="hypothesis-null" about="[this:#hypothesis-null]" typeof="sio:SIO_000284" property="rdfs:label">H₀: There exists a significant relationship between the semantic similarity of statistical dataset titles and the correlation among those datasets, because dataset titles can indicate rich connectivity.</p>
<span about="[this:#variable-semantic-similarity-lsd-titles]" typeof="sio:SIO_000367" property="rdfs:label" content="semantic similarity"></span><span about="[this:#variable-correlation-lsd]" typeof="sio:SIO_000367" property="rdfs:label" content="correlation"></span>
<p>We set the significance level at 5%.</p>
</div>
</section>
<section id="semantic-similarity" about="[this:#methodology]" rel="schema:hasPart">
<h3 about="[this:#semantic-similarity]" property="schema:name">Linked Statistical Data and Semantic Similarity</h3>
<div about="[this:#semantic-similarity]" property="schema:description">
<p>The RDF Data Cube vocabulary not only allows one to express statistical data in a Web exchangeable format, but also to represent the (semantic) links within those statistical data. This ability poses some interesting new research questions around the relationship between the statistical and semantic relatedness of datasets. We are interested in the interplay between the statistical correlation of LSD and their semantic similarity, in order to answer questions like: Does correlation between statistical datasets imply some kind of semantic relation? Do certain semantic links imply the existence of correlation? We propose a generic workflow for studying whether or not this relation between correlation and similarity holds for arbitrary LSD. We aim at generic correlation and similarity measures, and our workflow enables the use of any correlation and similarity indicators. For the specific goal of this paper, though, we use Kendall's rank correlation coefficient and Latent Semantic Analysis (LSA) similarity.</p>
</div>
</section>
<section id="workflow" about="[this:#methodology]" rel="schema:hasPart">
<h3 about="[this:#workflow]" rel="schema:name">Workflow</h3>
<div about="[this:#workflow]" typeof="opmw:WorkflowTemplate deo:Model">
<p>Based on preliminary experimentation from data acquisition to analysis, we have created the <span property="rdfs:label">LSD Sense workflow</span>:</p>
<ol>
<li id="workflow-create-hypothesis" about="[this:#workflow-create-hypothesis]" typeof="opmw:WorkflowTemplateProcess" rel="opmw:isStepOfTemplate" resource="[this:#workflow]" property="rdfs:label">Create hypothesis.</li>
<li id="workflow-configure" about="[this:#workflow-configure]" typeof="opmw:WorkflowTemplateProcess" rel="opmw:isStepOfTemplate" resource="[this:#workflow]" property="rdfs:label">Determine datasets and configurations.</li>
<li id="workflow-get-metadata-lsd" about="[this:#workflow-get-metadata-lsd]" typeof="opmw:WorkflowTemplateProcess" rel="opmw:isStepOfTemplate" resource="[this:#workflow]" property="rdfs:label">Get metadata of datasets.</li>
<li id="workflow-get-observations-lsd" about="[this:#workflow-get-observations-lsd]" typeof="opmw:WorkflowTemplateProcess" rel="opmw:isStepOfTemplate" resource="[this:#workflow]" property="rdfs:label">Get each dataset's observations.</li>
<li id="workflow-create-analysis-lsd" about="[this:#workflow-create-analysis-lsd]" typeof="opmw:WorkflowTemplateProcess" rel="opmw:isStepOfTemplate" resource="[this:#workflow]" property="rdfs:label">Create correlations and other analysis for each dataset pair combination.</li>
<li id="workflow-create-preprocess-semantic-similarity" about="[this:#workflow-create-preprocess-semantic-similarity]" typeof="opmw:WorkflowTemplateProcess" rel="opmw:isStepOfTemplate" resource="[this:#workflow]" property="rdfs:label">Create dataset metadata subset for semantic similarity.</li>
<li id="workflow-create-analysis-semantic-similarity" about="[this:#workflow-create-analysis-semantic-similarity]" typeof="opmw:WorkflowTemplateProcess" rel="opmw:isStepOfTemplate" resource="[this:#workflow]" property="rdfs:label">Create semantic similarity for each dataset pair combination.</li>
<li id="workflow-create-analysis-semantic-similarity-correlation" about="[this:#workflow-create-analysis-semantic-similarity-correlation]" typeof="opmw:WorkflowTemplateProcess" rel="opmw:isStepOfTemplate" resource="[this:#workflow]" property="rdfs:label">Create correlation and other analysis using variables semantic similarity and correlation of LSD.</li>
<li id="workflow-test-verify-hypothesis" about="[this:#workflow-test-verify-hypothesis]" typeof="opmw:WorkflowTemplateProcess" rel="opmw:isStepOfTemplate" resource="[this:#workflow]" property="rdfs:label">Test and verify hypothesis.</li>
<li id="workflow-analysis" about="[this:#workflow-analysis]" typeof="opmw:WorkflowTemplateProcess" rel="opmw:isStepOfTemplate" resource="[this:#workflow]" property="rdfs:label">Analysis.</li>
</ol>
</div>
</section>
<section id="implementation" about="[this:#methodology]" rel="schema:hasPart">
<h3 about="[this:#implementation]" property="schema:name">Implementation</h3>
<div about="[this:#implementation]" property="schema:description">
<p>We have an implementation of the <a href="https://github.com/csarven/lsd-sense">LSD Sense</a> workflow which can be used both to reproduce our experiments and to run the workflow on new input datasets. With the exception of determining which datasets to inspect, and the system configuration, LSD Sense is automated.</p>
<p id="semantic-correlation"><strong>Semantic Correlation</strong>: The semantic similarity algorithm is based on <a href="http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf">Latent Semantic Indexing</a> (LSI) [<a href="#ref-4">4</a>]. We use the dataset titles to check for their similarity. Essentially, LSI maps each dataset title into a space of latent topics, so that titles with similar topic weights fall into the same cluster. The number of topics can be adjusted (the default is 200). What this number should be remains an open research question. Generally, research has demonstrated that optimal values depend on the size and nature of the dataset [<a href="#ref-5">5</a>]. We use <a href="http://radimrehurek.com/gensim/index.html">gensim</a> [<a href="#ref-6">6</a>] in our <a href="https://github.com/albertmeronyo/SemanticCorrelation">Semantic Correlation</a> [<a href="#ref-7">7</a>] implementation for LSD Sense.</p>
<p>Concerning the quality of the dataset titles, it is possible to come across datasets whose titles differ by only one word, e.g., <q>male</q>, <q>female</q>. This potentially lowers the accuracy with which datasets can be differentiated. As part of preprocessing, we removed the attribute information from the dataset titles on the assumption that doing so reduces noise.</p>
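<p>To illustrate why titles that differ by a single word are hard to tell apart, the following minimal sketch scores title pairs with plain bag-of-words cosine similarity. This is a simplified stand-in and not the LSI model used in LSD Sense: it omits the dimensionality reduction step entirely. The function names and the example titles are assumptions made for illustration only.</p>

```python
import math
import re
from collections import Counter


def tokenize(title):
    """Lowercase a title and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", title.lower())


def cosine_similarity(title_a, title_b):
    """Cosine similarity between the word-count vectors of two titles."""
    counts_a = Counter(tokenize(title_a))
    counts_b = Counter(tokenize(title_b))
    # Counter returns 0 for words that are missing from the other title.
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Illustrative titles (assumed); the first two differ by one word only.
female = "Literacy rate, adult female"
male = "Literacy rate, adult male"
unrelated = "Mortality rate, infant"

near_duplicate_score = cosine_similarity(female, male)
unrelated_score = cosine_similarity(female, unrelated)
```

<p>In this sketch the two near-duplicate titles share three of their four words and therefore score much higher than the unrelated pair, even though the underlying indicators describe different populations. A latent topic model may smooth this effect, but it starts from the same word counts.</p>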
</div>
</section>
</div>
</section>
<section id="experiment" about="[this:]" rel="schema:hasPart">
<h2 about="[this:#experiment]" resource="[this:#experiment]">Experiment</h2>
<div about="[this:#experiment]" typeof="sio:SIO_000994 deo:Materials">
<p>Two experiments were conducted using the same workflow. The experiments differed only in their input data. In the first experiment, the analysis was done for a particular reference year over all available datasets. In the second experiment, we further restricted the data to a particular dataset domain (topic), thereby making it possible to check whether controlling for topic makes a significant difference to the semantic similarity of the dataset titles.</p>
<section id="data" about="[this:#experiment]" rel="schema:hasPart">
<h3 about="[this:#data]" rel="schema:name">Data</h3>
<div about="[this:#data]" typeof="deo:Data" property="schema:description">
<p>We decided to conduct our experiment on a simple dataset structure containing two dimensions, <em>reference area</em> and <em>reference period</em>, and one measure, <em>value</em>, for its observations. The <span about="[this:]" rel="schema:hasPart"><a id="data-worldbank-indicators" about="[this:#data-worldbank-indicators]" typeof="deo:DatasetDescription" property="rdfs:label" rel="rdfs:seeAlso" href="http://worldbank.270a.info/dataset/world-bank-indicators">World Bank indicators</a></span> were a good candidate from the 270a Cloud. The rationale for using only one dataspace (at this time) was to remain within a consistent classification space to measure semantic similarity. We fixed the reference period to 2012 and, for the second experiment, restricted the datasets to those that are part of the <span about="[this:]" rel="schema:hasPart"><a id="data-worldbank-indicators-topic-4" about="[this:#data-worldbank-indicators-topic-4]" typeof="deo:DatasetDescription" property="rdfs:label" rel="rdfs:seeAlso" href="http://worldbank.270a.info/classification/topic/4">World Bank's education topic</a></span>. We identified one downside concerning data quality: the attribute/unit information was incorporated as part of the dataset title, usually as a suffix within brackets. We dealt with this by removing the attribute information from the titles as part of preprocessing in the semantic similarity phase.</p>
</div>
</section>
<section id="experiment-workflow-worldbank" about="[this:#experiment]" rel="schema:hasPart">
<h3 about="[this:#experiment-workflow-worldbank]" rel="schema:name">World Bank Indicators workflow</h3>
<div about="[this:#experiment-workflow-worldbank]" property="schema:description">
<p>The workflow of our experiment is summarized as follows:</p>
<section id="experiment-workflow-worldbank-correlations" about="[this:#experiment-workflow-worldbank]" rel="schema:hasPart">
<h4 about="[this:#experiment-workflow-worldbank-correlations]" rel="schema:name">Correlations for each dataset pair</h4>
<div about="[this:#experiment-workflow-worldbank-correlations]" rel="schema:description">
<p>We retrieved the 2012 World Bank Indicators datasets, 3267 in total, via SPARQL queries from the <a href="http://worldbank.270a.info/">World Bank Linked Dataspace</a> [<a href="#ref-8">8</a>]. The correlations were computed using R, the statistical software, by joining each dataset pair on their reference area (one of the dimensions of the dataset structure), and using their measure values for the correlation coefficient. Based on a preliminary inspection for normality on sample datasets, we noted that the observations did not come from a bivariate normal distribution. Hence, we computed Kendall's rank correlation coefficient in our analysis. Initially we computed and stored the correlations for dataset pairs with a sample size n&gt;10, resulting in 2126912 correlation values. The analysis output we generated consisted of the following headers: <code>datasetX</code>, <code>datasetY</code>, <code>correlation</code>, <code>pValue</code>, <code>n</code>, where <code>datasetX</code> and <code>datasetY</code> are the identifiers of the dataset pair being compared. We later filtered out dataset pairs with a sample size n&lt;50, using that as our threshold for significance. The population size, i.e., the number of potential reference areas that can have an observation, is 260. That is the number of reference area codes in the World Bank classification; however, it is not known beforehand which reference areas occur in a given dataset. We retained the majority of the computations in any case, giving us the possibility to do better pruning in the future, in light of more information.</p>
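<p>The join-then-correlate step can be sketched as follows. The experiment itself used R; this is a Python re-expression under stated assumptions: each dataset is modelled as a mapping from reference area code to measure value, the coefficient is Kendall's tau-b (which corrects for ties), and the <em>p</em>-value computation is left out. The function names, the <code>min_n</code> parameter and the sample values are hypothetical.</p>

```python
import math
from itertools import combinations


def join_on_reference_area(dataset_x, dataset_y):
    """Pair up measure values for reference areas present in both datasets."""
    shared = sorted(set(dataset_x).intersection(dataset_y))
    return [(dataset_x[area], dataset_y[area]) for area in shared]


def sign(value):
    """Return 1, 0 or -1 depending on the sign of value."""
    return (value > 0) - (0 > value)


def kendall_tau_b(pairs):
    """Kendall's tau-b: rank correlation with a correction for ties."""
    concordant_minus_discordant = 0
    untied_x = 0  # observation pairs whose x values differ
    untied_y = 0  # observation pairs whose y values differ
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        sx = sign(x1 - x2)
        sy = sign(y1 - y2)
        concordant_minus_discordant += sx * sy
        untied_x += abs(sx)
        untied_y += abs(sy)
    if untied_x == 0 or untied_y == 0:
        return float("nan")  # undefined when one variable is constant
    return concordant_minus_discordant / math.sqrt(untied_x * untied_y)


def correlate(dataset_x, dataset_y, min_n=10):
    """Correlate two datasets; skip pairs whose sample size is not above min_n."""
    pairs = join_on_reference_area(dataset_x, dataset_y)
    n = len(pairs)
    if not n > min_n:
        return None
    return {"correlation": kendall_tau_b(pairs), "n": n}


# Hypothetical observations keyed by reference area code.
x = {"CA": 1.0, "CH": 2.0, "DE": 3.0, "NL": 4.0, "FR": 5.0}
y = {"CA": 2.0, "CH": 1.0, "DE": 4.0, "NL": 3.0, "JP": 9.0}

# The toy data shares only four areas, so the threshold is lowered here.
result = correlate(x, y, min_n=3)
```

<p>Only the reference areas present in both datasets contribute to the coefficient, which is why the sample size <code>n</code> varies from pair to pair and has to be stored alongside each correlation value.</p>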
</div>
</section>
<section id="experiment-workflow-worldbank-similarity" about="[this:#experiment-workflow-worldbank]" rel="schema:hasPart">
<h4 about="[this:#experiment-workflow-worldbank-similarity]" rel="schema:name">Semantic similarity for each dataset pair</h4>
<div about="[this:#experiment-workflow-worldbank-similarity]" rel="schema:description">
<p>Before computing the semantic similarity, we first took a unique list of the dataset identifiers from <code>datasetX</code> and <code>datasetY</code>, so that similarity is checked only for those datasets, as opposed to the complete set of datasets which we originally retrieved. At this point, we have 2200 unique datasets. The similarity was measured based on dataset titles, which are short phrases, e.g., <q>Mortality rate, infant (per 1,000 live births)</q>. After minor preprocessing, e.g., removal of the text pertaining to the unit within brackets, we were left with <q>Mortality rate, infant</q>. The semantic similarity algorithm is based on LSA. Essentially, LSA puts each dataset title into a cluster (the default number is 200). The resulting headers of the output were: <code>datasetX</code>, <code>datasetY</code>, <code>similarity</code>.</p>
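<p>A minimal sketch of this preprocessing, assuming the unit always appears as one trailing parenthesised group and that correlation rows are dictionaries keyed by the headers named above. The helper names and the sample identifiers are hypothetical.</p>

```python
import re

# Matches one trailing parenthesised group, e.g. " (per 1,000 live births)".
TRAILING_UNIT = re.compile(r"\s*\([^()]*\)\s*$")


def strip_unit(title):
    """Remove a trailing unit in brackets from a dataset title, if present."""
    return TRAILING_UNIT.sub("", title).strip()


def unique_dataset_ids(correlation_rows):
    """Collect every identifier that occurs as datasetX or datasetY."""
    ids = set()
    for row in correlation_rows:
        ids.add(row["datasetX"])
        ids.add(row["datasetY"])
    return sorted(ids)


cleaned = strip_unit("Mortality rate, infant (per 1,000 live births)")

# Hypothetical correlation rows; only the identifier columns matter here.
rows = [
    {"datasetX": "A", "datasetY": "B"},
    {"datasetX": "A", "datasetY": "C"},
    {"datasetX": "B", "datasetY": "C"},
]
dataset_ids = unique_dataset_ids(rows)
```

<p>Titles without a bracketed suffix pass through unchanged, so the same function can be applied to every title.</p>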
</div>
</section>
<section id="experiment-workflow-worldbank-similarity-correlations" about="[this:#experiment-workflow-worldbank]" rel="schema:hasPart">
<h4 about="[this:#experiment-workflow-worldbank-similarity-correlations]" rel="schema:name">Correlation analysis between semantic similarity and dataset correlation</h4>
<div about="[this:#experiment-workflow-worldbank-similarity-correlations]" rel="schema:description">
<p>We then took the absolute values of both variables, <code>|similarity|</code> and <code>|correlation|</code> (caring only for the strength of the relationships as opposed to their direction). We then filtered out similarity and correlation values &lt;0.05 and &gt;0.95, as well as correlations with a <em>p</em>-value&gt;0.05, in order to exclude potential outliers and misleading perfect relations, as well as insignificant correlations. The final correlation and scatter plot were generated by joining the similarity and correlation tables on the <code>datasetX</code> and <code>datasetY</code> columns. Finally, the correlation of the final data table was computed using the Kendall method, as the data had a non-normal distribution and we were not interested in modeling (line fitting).</p>
<p>The second experiment followed the same procedure for the analysis, but considering only the datasets associated with the topic education for the same reference period.</p>
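<p>The filtering and joining described in this section can be sketched as below. It is an illustration under assumptions: rows are dictionaries keyed by the headers named earlier, both tables list each pair in the same <code>datasetX</code>, <code>datasetY</code> order, and the bounds 0.05 and 0.95 are treated as inclusive. The function names and the sample rows are hypothetical.</p>

```python
def keep(value, low=0.05, high=0.95):
    """True when the absolute value lies within the closed range [low, high]."""
    return high >= abs(value) >= low


def join_and_filter(similarity_rows, correlation_rows, alpha=0.05):
    """Join both tables on (datasetX, datasetY) and apply the filters."""
    similarity = {
        (row["datasetX"], row["datasetY"]): row["similarity"]
        for row in similarity_rows
    }
    joined = []
    for row in correlation_rows:
        key = (row["datasetX"], row["datasetY"])
        if key not in similarity:
            continue  # no similarity score for this pair
        if row["pValue"] > alpha:
            continue  # correlation is not significant
        sim = abs(similarity[key])
        cor = abs(row["correlation"])
        if keep(sim) and keep(cor):
            joined.append((sim, cor))
    return joined


# Hypothetical rows for three dataset pairs.
similarity_rows = [
    {"datasetX": "A", "datasetY": "B", "similarity": 0.40},
    {"datasetX": "A", "datasetY": "C", "similarity": 0.99},
    {"datasetX": "B", "datasetY": "C", "similarity": -0.30},
]
correlation_rows = [
    {"datasetX": "A", "datasetY": "B", "correlation": -0.60, "pValue": 0.01},
    {"datasetX": "A", "datasetY": "C", "correlation": 0.50, "pValue": 0.01},
    {"datasetX": "B", "datasetY": "C", "correlation": 0.20, "pValue": 0.30},
]

joined = join_and_filter(similarity_rows, correlation_rows)
```

<p>In this toy input only the first pair survives: the second is dropped for a near-perfect similarity and the third for an insignificant correlation. The surviving (similarity, correlation) tuples are what the final Kendall correlation and the scatter plots would be computed from.</p>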
</div>
</section>
</div>
</section>
</div>
</section>
<section id="results" about="[this:]" rel="schema:hasPart">
<h2 resource="[this:#results]" property="schema:name">Results</h2>
<div about="[this:#results]" typeof="deo:Results" property="schema:description">
<p>All of the experiment results are available at the <a href="https://github.com/csarven/lsd-sense">LSD Sense</a> GitHub repository, and can be reproduced. Table [<a href="#experiment-results">Experiment results</a>] provides our findings, with Figures [<a href="#figure_lsd-sense-worldbank-2012">1</a>] and [<a href="#figure_lsd-sense-worldbank-2012-topic-4">2</a>]:</p>
<table id="experiment-results">
<caption>Experiment Results</caption>
<thead>
<tr>
<th></th>
<th>All topics</th>
<th>One topic (<em>education</em>)</th>
</tr>
</thead>
<tfoot>
<tr><td colspan="3">Datasets are from 2012 World Bank indicators. n is the number of dataset pairs with semantic similarity and correlation as variables.</td></tr>
</tfoot>
<tbody>
<tr><th>Correlation</th><td>0.182</td><td>0.227</td></tr>
<tr><th><em>p</em>-value</th><td>&lt; 2.2e-16</td><td>&lt; 2.2e-16</td></tr>
<tr><th>n</th><td>92819</td><td>33184</td></tr>
</tbody>
</table>
<div class="figure-column-2">
<figure id="figure_lsd-sense-worldbank-2012">
<img src="lsd-sense-worldbank-2012.png" width="300" height="300" alt="Figure of scatter plot showing 2012 World Bank indicators with all topics"/>
<figcaption>2012 World Bank indicators with all topics</figcaption>
</figure>
<figure id="figure_lsd-sense-worldbank-2012-topic-4">
<img src="lsd-sense-worldbank-2012-4.png" width="300" height="300" alt="Figure of scatter plot showing 2012 World Bank indicators with topic education"/>
<figcaption>2012 World Bank indicators with topic education.</figcaption>
</figure>
</div>
<p>Given that both experiments resulted in <em>p</em>-values that are statistically significant and that the correlation values are weak, we reject our null hypothesis. For extra measure, we can also verify the meaninglessness by looking at the plots. There is <strong>nothing</strong> interesting <strong>to see here</strong>. We will <strong>move along</strong> with our alternative hypothesis.</p>
</div>
</section>
<section id="related-work" about="[this:]" rel="schema:hasPart">
<h2 about="[this:#related-work]" property="schema:name">Related Work</h2>
<div about="[this:#related-work]" typeof="deo:RelatedWork" property="schema:description">
<p><a href="http://csarven.ca/linked-statistical-data-analysis">Linked Statistical Data Analysis</a> [<a href="#ref-9">9</a>], explores a way to reuse statistical linked dataspaces, federated queries, and generation of statistical analyses e.g., regression, for humans and machines. The <a href="http://stats.270a.info/">stats.270a.info</a> [<a href="#ref-10">10</a>] service stores computed analysis, and makes it possible for future discovery.</p>
<p><a href="http://www.hicss.hawaii.edu/hicss_46/bp46/hc6.pdf">Towards Next Generation Health Data Exploration</a>: A Data Cube-based Investigation into Population Statistics for Tobacco [<a href="#ref-11">11</a>], presents the <a href="http://orion.tw.rpi.edu/~jimmccusker/qb.js/">qb.js</a> [<a href="#ref-12">12</a>] tool to explore data that is expressed as RDF Data Cubes. It is designed to formulate and explore hypotheses. Under the hood, it makes a SPARQL query to an endpoint which contains the data that it analyzes.</p>
<p><a href="http://www.ke.tu-darmstadt.de/bibtex/attachments/single/310">Generating Possible Interpretations for Statistics from Linked Open Data</a> [<a href="#ref-13">13</a>] describes the Explain-a-LOD tool, which focuses on generating hypotheses that explain statistics. It can be configured to compare two variables, and then provides possible interpretations of the correlation analysis for users to review.</p>
<p><a href="http://svn.aksw.org/papers/2013/LODSEM/ISWC2013_AZ_LODSEM_public.pdf">Using Linked Data to Evaluate the Impact of Research and Development in Europe</a>: A Structural Equation Model [<a href="#ref-14">14</a>] investigates the feasibility of combining different LOD sources to assess the impact of one variable on others.</p>
<p><a href="http://tylervigen.com/">Spurious Correlations</a> [<a href="#ref-15">15</a>] presents correlations that are not genuine for practical purposes. In other words, the correlations are type I errors. It emphasizes the importance for humans of being critical of random correlations and of investigating whether there is a direct relation between the variables.</p>
<p><a href="http://www.ontologymatching.org/">Ontology Matching</a> [<a href="#ref-16">16</a>] is perhaps the most mature field in the Semantic Web dealing with the general problem of finding semantically related entities of ontologies and Linked Data, although resources like WordNet and DBpedia are also related.</p>
<p>These studies and engineering efforts have created, inspected, and hypothesized possible correlations. However, the gap in research is that there is no integrated study on how semantic relatedness between datasets may enhance the detection of meaningful or useful correlations in statistical data. Our contribution is the investigation of highly probable elements that would lead to better prediction of interesting correlations by employing linked statistical datasets and semantic analysis.</p>
</div>
</section>
<section id="conclusions" about="[this:]" rel="schema:hasPart">
<h2 about="[this:#conclusions]" property="schema:name">Conclusions and Future Work</h2>
<div about="[this:#conclusions]" property="schema:description" typeof="deo:Conclusion">
<p>We believe that the work presented here and the prior Linked Statistical Data Analysis effort contribute towards strengthening the relationship between Semantic Web and statistical research. We set out to investigate how to minimize human involvement in discovering useful correlations in statistical data. We have implemented a workflow that automates the analysis process, from data retrieval to outputting analysis results for candidate semantic linkages in Linked Statistical Data.</p>
<p>We have evaluated our results by testing the null hypothesis that we put forward. It turned out that the semantic similarity between dataset titles was not useful for determining strong and meaningful correlations, which is a useful finding in any case. This leaves us with the alternative hypothesis, which can be used in future research.</p>
<p about="[this:]" rel="schema:hasPart" resource="[this:#future-work]"><span id="future-work" about="[this:#future-work]" typeof="deo:FutureWork" property="schema:description">Future work could run a similar experiment with the semantic similarity of dataset descriptions, test manually configured useful relations for a controlled set of datasets, or look into interlinked topic domains across linked dataspaces.</span></p>
<p>Where is <em>interestingness</em> hidden?</p>
</div>
</section>
<section id="acknowledgements" about="[this:]" rel="schema:hasPart">
<h2 about="[this:#acknowledgements]" property="schema:name">Acknowledgements</h2>
<div about="[this:#acknowledgements]" typeof="deo:Acknowledgements" property="schema:description">
<p>This work was supported by an STSM Grant from the <a href="http://www.cost.eu/domains_actions/mpns/Actions/TD1210">COST Action TD1210</a>. Many thanks to the colleagues who helped in one way or another during the course of this work (not implying any endorsement); in no particular order: <a href="http://bosamber.wordpress.com/">Amber van den Bos</a> (Dakiroa), <a href="http://www.wirtschaft.bfh.ch/de/ueber_uns/kontakt/detailseite.html?tx_bfhpersonalpages_p=mam10&amp;tx_bfhpersonalpages_screen=data">Michael Mosimann</a> (BFS), <a href="http://nl.linkedin.com/pub/anton-heijs/1/489/861">Anton Heijs</a> (Treparel b.v.), <a href="http://en.wikipedia.org/wiki/Frank_van_Harmelen">Frank van Harmelen</a> (VU Amsterdam).</p>
</div>
</section>
<section id="references">
<h2>References</h2>
<div>
<ol>
<li id="ref-1">Rosling, H., Marmot, M.: The Joy Of Stats: Meaningless and meaningful correlations, <a about="[this:]" rel="schema:citation" href="http://www.open.edu/openlearn/science-maths-technology/mathematics-and-statistics/statistics/the-joy-stats-meaningless-and-meaningful-correlations">http://www.open.edu/openlearn/science-maths-technology/mathematics-and-statistics/statistics/the-joy-stats-meaningless-and-meaningful-correlations</a></li>
<li id="ref-2">270a.info, <a about="[this:]" rel="schema:citation" href="http://270a.info/">http://270a.info/</a></li>
<li id="ref-3">LSD Sense code at GitHub, <a about="[this:]" rel="schema:citation" href="https://github.com/csarven/lsd-sense">https://github.com/csarven/lsd-sense</a></li>
<li id="ref-4">Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., Harshman, R.: Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, 41(6), pp.391–407 (1990), <a about="[this:]" rel="schema:citation" href="http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf">http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf</a></li>
<li id="ref-5">Bradford, R.: An Empirical Study of Required Dimensionality for Large-scale Latent Semantic Indexing Applications, Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp.153–162 (2008), <a about="[this:]" rel="schema:citation" href="http://dl.acm.org/citation.cfm?id=1458105">http://dl.acm.org/citation.cfm?id=1458105</a></li>
<li id="ref-6">gensim: Topic modeling for humans, <a about="[this:]" rel="schema:citation" href="http://radimrehurek.com/gensim/index.html">http://radimrehurek.com/gensim/index.html</a></li>
<li id="ref-7">SemanticCorrelation code at GitHub, <a about="[this:]" rel="schema:citation" href="https://github.com/albertmeronyo/SemanticCorrelation">https://github.com/albertmeronyo/SemanticCorrelation</a></li>
<li id="ref-8">World Bank Linked Dataspace, <a about="[this:]" rel="schema:citation" href="http://worldbank.270a.info/">http://worldbank.270a.info/</a></li>
<li id="ref-9">Capadisli, S., Auer, S., Riedl, R.: Linked Statistical Data Analysis, ISWC SemStats (2013), <a about="[this:]" rel="schema:citation" href="http://csarven.ca/linked-statistical-data-analysis">http://csarven.ca/linked-statistical-data-analysis</a></li>
<li id="ref-10">stats.270a.info, <a about="[this:]" rel="schema:citation" href="http://stats.270a.info/">http://stats.270a.info/</a></li>
<li id="ref-11">McCusker, J. P., McGuinness, D. L., Lee, J., Thomas, C., Courtney, P., Tatalovich, Z., Contractor, N., Morgan, G., Shaikh, A.: Towards Next Generation Health Data Exploration: A Data Cube-based Investigation into Population Statistics for Tobacco, Hawaii International Conference on System Sciences (2012), <a about="[this:]" rel="schema:citation" href="http://www.hicss.hawaii.edu/hicss_46/bp46/hc6.pdf">http://www.hicss.hawaii.edu/hicss_46/bp46/hc6.pdf</a></li>
<li id="ref-12">qb.js, <a about="[this:]" rel="schema:citation" href="http://orion.tw.rpi.edu/~jimmccusker/qb.js/">http://orion.tw.rpi.edu/~jimmccusker/qb.js/</a></li>
<li id="ref-13">Paulheim, H.: Generating Possible Interpretations for Statistics from Linked Open Data, ESWC (2012), <a about="[this:]" rel="schema:citation" href="http://www.ke.tu-darmstadt.de/bibtex/attachments/single/310">http://www.ke.tu-darmstadt.de/bibtex/attachments/single/310</a></li>
<li id="ref-14">Zaveri, A., Vissoci, J. R. N., Daraio, C., Pietrobon, R.: Using Linked Data to Evaluate the Impact of Research and Development in Europe: A Structural Equation Model, ISWC, pp.244–259 (2013), <a about="[this:]" rel="schema:citation" href="http://svn.aksw.org/papers/2013/LODSEM/ISWC2013_AZ_LODSEM_public.pdf">http://svn.aksw.org/papers/2013/LODSEM/ISWC2013_AZ_LODSEM_public.pdf</a></li>
<li id="ref-15">Vigen, T.: Spurious Correlations, <a about="[this:]" rel="schema:citation" href="http://tylervigen.com/">http://tylervigen.com/</a></li>
<li id="ref-16">Shvaiko, P., Euzenat, J.: Ontology Matching: State of the Art and Future Challenges, IEEE Transactions on Knowledge and Data Engineering (2013), <a about="[this:]" rel="schema:citation" href="http://disi.unitn.it/~p2p/RelatedWork/Matching/SurveyOMtkde_SE.pdf">http://disi.unitn.it/~p2p/RelatedWork/Matching/SurveyOMtkde_SE.pdf</a></li>
</ol>
</div>
</section>
</div>
</article>
</body>
</html>