forked from Knowledge-Graphs-Book/HTML-Book
-
Notifications
You must be signed in to change notification settings - Fork 0
/
index-mathjax.html
3715 lines (3316 loc) · 885 KB
/
index-mathjax.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE html>
<html lang="en">
<head>
<title>Knowledge Graphs</title>
<meta charset="UTF-8"/>
<link rel="stylesheet" href="css/style.css"/>
<link rel="stylesheet" href="css/prism.css"/>
<link rel="stylesheet" href="css/fonts.css"/>
<link rel="stylesheet" href="css/print.css" media="print"/>
<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script id="MathJax-script" async="async" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<script src="js/prism.js"></script>
</head>
<body>
<div style="display:none" id="tex-macros">
\(
\newcommand{\coloneqq}{\mathrel{\vcenter{:}}=}
\newcommand{\con}{\mathbf{Con}}
\newcommand{\var}{\mathbf{Var}}
\newcommand{\term}{\mathbf{Term}}
\newcommand{\dom}{\mathbf{dom}}
\newcommand{\datatype}[1]{\Delta_{#1}}
\newcommand{\datatypeL}[1]{\datatype{\texttt{#1}}}
\newcommand{\gelab}[1]{{\color{blue}\textsf{#1}}}
\newcommand\arc[2]{\xrightarrow{#1}#2}
\newcommand{\qualified}[4]{\arc{#1}{#2}\{#3,#4\}}
\newcommand{\qualifiedcard}[3]{\arc{#1}{#2}~#3}
\newcommand{\qualifiedL}[4]{\qualified{\gelab{#1}}{#2}{#3}{#4}}
\newcommand{\qualifiedcardL}[3]{\qualifiedcard{\gelab{#1}}{#2}{#3}}
\newcommand{\semantics}[4]{[#1]^{#2,#3,#4}}
\newcommand{\inp}[1]{#1^I}
\newcommand{\inpdom}{\inp{\Delta}}
\newcommand{\T}[1]{#1^{\rm T}}
\newcommand{\D}[1]{#1^{\rm D}}
\)
</div>
<div class="cover"><img alt="mock cover" src="images/mock-cover.jpg"/></div>
<header>
<h1 id="title"><span class="big-letter">K</span>nowledge <span class="big-letter">G</span>raphs</h1>
<ul class="authorlist">
<li><span class="author"><a href="https://orcid.org/0000-0001-9482-1982">Aidan Hogan</a></span> <span class="affiliation">IMFD, DCC, Universidad de Chile <span class="country">Chile</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0003-0036-6662">Eva Blomqvist</a></span> <span class="affiliation">Linköping University <span class="country">Sweden</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0001-5726-4638">Michael Cochez</a></span> <span class="affiliation">Vrije Universiteit and Discovery Lab, Elsevier <span class="country">The Netherlands</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0002-3385-987X">Claudia d’Amato</a></span> <span class="affiliation">University of Bari <span class="country">Italy</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0002-2930-2059">Gerard de Melo</a></span> <span class="affiliation">HPI, University of Potsdam and Rutgers University <span class="country">USA</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0002-4559-6544">Claudio Gutierrez</a></span> <span class="affiliation">IMFD, DCC, Universidad de Chile <span class="country">Chile</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0002-6955-7718">Sabrina Kirrane</a></span> <span class="affiliation">WU Vienna <span class="country">Austria</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0001-8907-5348">José Emilio Labra Gayo</a></span> <span class="affiliation">Universidad de Oviedo <span class="country">Spain</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0003-3831-9706">Roberto Navigli</a></span> <span class="affiliation">Sapienza University of Rome <span class="country">Italy</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0002-9804-4882">Sebastian Neumaier</a></span> <span class="affiliation">St. Pölten University of Applied Sciences <span class="country">Austria</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0001-7112-3516">Axel-Cyrille Ngonga Ngomo</a></span> <span class="affiliation">DICE, Universität Paderborn <span class="country">Germany</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0001-5670-1146">Axel Polleres</a></span> <span class="affiliation">WU Vienna <span class="country">Austria</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0002-4162-8334">Sabbir M. Rashid</a></span> <span class="affiliation">Tetherless World Constellation, Rensselaer Polytechnic Institute <span class="country">USA</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0002-8046-7502">Anisa Rula</a></span> <span class="affiliation">University of Milano-Bicocca <span class="country">Italy</span> and University of Bonn <span class="country">Germany</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0002-2108-2303">Lukas Schmelzeisen</a></span> <span class="affiliation">Universität Stuttgart <span class="country">Germany</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0003-3112-9299">Juan Sequeda</a></span> <span class="affiliation">data.world <span class="country">USA</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0002-0780-4154">Steffen Staab</a></span> <span class="affiliation">Universität Stuttgart <span class="country">Germany</span> and University of Southampton <span class="country">UK</span></span></li>
<li><span class="author"><a href="https://orcid.org/0000-0003-1502-6986">Antoine Zimmermann</a></span> <span class="affiliation">École des mines de Saint-Étienne <span class="country">France</span></span></li>
</ul>
<div id="about" class="info">
<h2>About the book</h2>
<p>The book is published by <a href="http://www.morganclaypool.com/">Morgan & Claypool</a> in the series <a href="http://www.morganclaypool.com/toc/wbe.1/7/1">Synthesis Lectures on the Semantic Web: Theory and Technology</a> edited by <a href="http://info.slis.indiana.edu/~dingying/">Ying Ding</a> and Paul Groth. Please, cite the book as:</p>
<blockquote class="quote">
Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard de Melo, Claudio Gutierrez, Sabrina Kirrane, José Emilio Labra Gayo, Roberto Navigli, Sebastian Neumaier, Axel-Cyrille Ngonga Ngomo, Axel Polleres, Sabbir M. Rashid, Anisa Rula, Lukas Schmelzeisen, Juan Sequeda, Steffen Staab, Antoine Zimmermann (2021) <em>Knowledge Graphs</em>, Synthesis Lectures on the Semantic Web: Theory and Technology, No. 22, 1-233, DOI: 10.2200/S01125ED1V01Y202109DSK022, Morgan & Claypool
</blockquote>
<dl>
<dt>ISBN paperback:</dt>
<dd>9781636392356</dd>
<dt>ISBN ebook:</dt>
<dd>9781636392363</dd>
<dt>ISBN hardcover:</dt>
<dd>9781636392370</dd>
</dl>
<p>Copyright © 2021 by Morgan & Claypool. All rights reserved.</p>
<!--<p><a href="bibtex.txt">Bibtex</a></p>-->
<h2>Access options</h2>
<dl>
<dt>HTML version:</dt>
<dd>You are currently reading the free HTML version of the book, the most recent of which is available at <a href="https://kg-book.org/">https://kg-book.org/</a></dd>
<dt>PDF Version:</dt>
<dd>You can download or buy the book from <a href="http://www.morganclaypool.com/">Morgan & Claypool</a>. Academic and Corporate licences are available.</dd>
<dt>Hard copy:</dt>
<dd>You can order from from <a href="http://www.morganclaypool.com/">Morgan & Claypool</a> or <a href="https://www.amazon.com/">Amazon</a>.</dd>
</dl>
</div>
<p><em>SYNTHESIS LECTURES ON ON THE SEMANTIC WEB #22</em></p>
</header>
<h2 id="abstract"><span>Abstract</span></h2>
<p>This book provides a comprehensive and accessible introduction to knowledge graphs, which have recently garnered notable attention from both industry and academia. Knowledge graphs are founded on the principle of applying a graph-based abstraction to data, and are now broadly deployed in scenarios that require integrating and extracting value from multiple, diverse sources of data at large scale.</p>
<p>The book is divided into ten chapters. The first chapter provides a general introduction to the area, defines the concept of a “knowledge graph”, and provides a high-level overview of how knowledge graphs are currently being used. The second chapter presents and contrasts popular graph models that are commonly used to represent data as graphs, and the languages by which they can be queried. The third chapter describes how the resulting data graph can be enhanced with notions of schema, identity and context. The fourth chapter discusses how ontologies and rules can be used to encode knowledge, and how they enable deductive forms of reasoning. The fifth chapter delves into how inductive techniques – based on statistics, graph analytics, machine learning, etc. – can be used to encode and extract knowledge. The sixth chapter is dedicated to techniques for the creation and enrichment of knowledge graphs from legacy sources of data. The seventh chapter enumerates a variety of quality measures that can be used to assess a knowledge graph in terms of its fitness for use in a variety of applications. The eighth chapter presents key methods for the refinement of knowledge graphs, with the goal of improving their completeness and correctness. The ninth chapter provides a survey of the open and enterprise knowledge graphs that have emerged in recent years, along with the industries within which, and the applications for which, they have been most widely adopted. The tenth chapter wraps up the book with discussion of the current limitations and future directions along which knowledge graphs are likely to evolve. An appendix further covers knowledge graphs from an historical perspective, establishing their significance in the broader context of the academic study of data and knowledge, as well as surveying prior definitions of “knowledge graphs” from the literature.</p>
<p>The book is aimed at students, researchers and practitioners who wish to learn more about knowledge graphs, and how they facilitate extracting value from diverse data at large-scale. To make the book accessible for newcomers, running examples and graphical notation are used throughout. Formal definitions and extensive references are also provided for those who opt to delve more deeply into specific topics.</p>
<h2 id="keywords"><span>Keywords</span></h2>
<p>knowledge graphs, artificial intelligence, semantic web, machine learning</p>
<nav id="toc">
<div><p>Table of ▼ Contents</p></div>
<ol>
<li><a href="#sec-preface">Preface</a></li>
<li><a href="#sec-ack">Acknowledgements</a></li>
<li><a href="#chap-intro">1. Introduction</a></li>
<li><a href="#chap-graph">2. Data Graphs</a>
<ol>
<li><a href="#ssec-graphModels">2.1. Models</a></li>
<li><a href="#ssec-querying">2.2. Querying</a></li>
</ol></li>
<li><a href="#chap-knowledge">3. Schema, Identity, Context</a>
<ol>
<li><a href="#sec-schema">3.1. Schema</a></li>
<li><a href="#sec-identity">3.2. Identity</a></li>
<li><a href="#ssec-knowledgeContext">3.3. Context</a></li>
</ol></li>
<li><a href="#chap-deductive">4. Deductive Knowledge</a>
<ol>
<li><a href="#ssec-ontologies">4.1. Ontologies</a></li>
<li><a href="#ssec-reasoning">4.2. Reasoning</a></li>
</ol></li>
<li><a href="#chap-inductive">5. Inductive Knowledge</a>
<ol>
<li><a href="#sec-gAnalytics">5.1. Graph Analytics</a></li>
<li><a href="#ssec-embeddings">5.2. Knowledge Graph Embeddings</a></li>
<li><a href="#ssec-gnns">5.3. Graph Neural Networks</a></li>
<li><a href="#ssec-symlearn">5.4. Symbolic Learning</a></li>
</ol></li>
<li><a href="#chap-create">6. Creation and Enrichment</a>
<ol>
<li><a href="#sssec-graphCreationHuman">6.1. Human Collaboration</a></li>
<li><a href="#sssec-graphCreationText">6.2. Text Sources</a></li>
<li><a href="#sssec-graphCreationSemistructured">6.3. Markup Sources</a></li>
<li><a href="#sssec-graphCreationStructured">6.4. Structured Sources</a></li>
<li><a href="#ssec-knowledgeConceptual">6.5. Schema/Ontology Creation</a></li>
</ol></li>
<li><a href="#chap-quality">7. Quality Assessment</a>
<ol>
<li><a href="#ssec-accuracy">7.1. Accuracy</a></li>
<li><a href="#sssec-coverage">7.2. Coverage</a></li>
<li><a href="#ssec-coherency">7.3. Coherency</a></li>
<li><a href="#ssec-succinctness">7.4. Succinctness</a></li>
<li><a href="#ssec-other-quality">7.5. Other Quality Dimensions</a></li>
</ol></li>
<li><a href="#chap-refine">8. Refinement</a>
<ol>
<li><a href="#ssec-completion">8.1. Completion</a></li>
<li><a href="#ssec-correction">8.2. Correction</a></li>
<li><a href="#ssec-other-refinement-tasks">8.3. Other Refinement Tasks</a></li>
</ol></li>
<li><a href="#chap-publish">9. Publication</a>
<ol>
<li><a href="#ssec-principles">9.1. Best Practices</a></li>
<li><a href="#ssec-access">9.2. Access Protocols</a></li>
<li><a href="#ssec-UsageControl">9.3. Usage Control</a></li>
</ol></li>
<li><a href="#chap-kgs">10. Knowledge Graphs in Practice</a>
<ol>
<li><a href="#sec-openkgs">10.1. Open Knowledge Graphs</a></li>
<li><a href="#ssec-enterprise-kgs">10.2. Enterprise Knowledge Graphs</a></li>
</ol></li>
<li><a href="#chap-conclude">11. Summary and Conclusion</a></li>
<li id="toc-ref"><a href="#sec-references">Bibliography</a></li>
<li class="toc-app"><a href="#chap-defs">A. Background</a>
<ol>
<li><a href="#app-historical">A.1. Historical Perspective</a></li>
<li><a href="#app-pre2012">A.2. “Knowledge Graphs”: Pre 2012</a></li>
<li><a href="#app-post2012">A.3. “Knowledge Graphs”: 2012 Onwards</a></li>
</ol></li>
<li><a href="#sec-bio">Authors’ Biography</a></li>
<li id="toc-filler"> </li>
<li id="about"><a href="#about">about this book</a></li>
</ol>
</nav>
<section id="sec-preface" class="prechapter">
<h2 id="preface">Preface</h2>
<p>The origins of this book can be traced back to a Dagstuhl Seminar, held in 2018, on the topic of Knowledge Graphs. At the time of the seminar, the topic was quickly becoming mainstream in academia and industry, but there were conflicting messages as to what a “knowledge graph” was. Much of the discussion of the seminar centred on this question, and there were divergent opinions as to how knowledge graphs could (or should) be defined; how they relate to previous concepts such as graph databases, knowledge bases, ontologies, RDF graphs, property graphs, semantic networks, etc.; and how the emerging area of Knowledge Graphs should be positioned with respect to the established areas of Artificial Intelligence, Big Data, Databases, Graph Theory, Logic, Machine Learning, Knowledge Representation, Natural Language Processing, Networks (in their various forms), and the Semantic Web. As the discussion continued, a consensus began to emerge: Knowledge Graphs, as a topic, involves a novel confluence of techniques stemming from previously disparate scientific communities, with the unifying goal of developing novel graph-based techniques for better integrating and extracting value from diverse knowledge sources at large scale.</p>
<p>As a follow-up to the seminar, the attendees agreed that in order to foster this unifying view of Knowledge Graphs, there was a need for a manuscript that would serve as a general introduction to the area. This manuscript would:</p>
<ul>
<li>motivate knowledge graphs and the value of abstracting data as graphs;</li>
<li>survey the historical context of knowledge graphs and the key initiatives leading to their popularisation;</li>
<li>draw together disparate views of knowledge graphs into a unifying definition;</li>
<li>provide an introduction to the key techniques that knowledge graphs enable, relating to querying, validation, reasoning, learning, refinement, enrichment, quality assessment, and more besides;</li>
<li>describe how knowledge graphs are used in practice, surveying the companies using knowledge graphs, the applications they are used for, the open knowledge graphs that have been published, etc.;</li>
<li>delineate future research directions for knowledge graphs.</li>
</ul>
<p>The manuscript would then serve as an introductory text for students, practitioners and researchers new to the area, helping to form a consensus in terms of what is a knowledge graph, laying the foundations for future developments.</p>
<p>The goal of preparing this manuscript was an ambitious one, and involved drawing together and distilling down a vast amount of literature on a diverse range of topics into a set of key concepts described in an accessible way. For this reason, the manuscript has been prepared by many authors, who have lent their knowledge and expertise to the preparation of specific sections.</p>
<p>A key aim of this book is to be accessible to a broader audience. While background knowledge of related topics such as Databases, Logic, Machine Learning, Semantic Web, etc., will help to understand some of the particular topics mentioned, such a background is not necessary to follow the general concepts described within. The book aims to motivate and illustrate the various concepts it introduces from a practical perspective, and in order to be as accessible as possible, relies heavily on an example-driven presentation using a graphical notation. For the reader wishing to dig more into the technical minutiae, we complement this discussion with formal definitions throughout; however, the reader more interested in understanding the general concepts and their rationale will find the discussion to be self-contained if they choose to skip the definitions presented in visually distinctive boxes.</p>
<p>The book serves as an entry point for those new to the topic, and may thus serve as a useful textbook for university courses, for researchers who are venturing into the topic for the first time, and for practitioners who wish to understand more about how knowledge graphs might be of use within their company or organisation, or indeed, how to maximise the value of the knowledge graphs that they are currently developing. Readers who are already active within specific sub-areas of Knowledge Graphs may further appreciate the technical definitions included, the references to other literature provided, and the broader perspective that this book offers in terms of the other related sub-areas and how they complement each other.</p>
<p>By drawing together diverse techniques from disparate areas, Knowledge Graphs has become an exciting topic in terms of both research and applications. We expect to see growing interest on this topic as the years advance, and indeed hope that this book will help to more firmly establish the foundations of this topic, and to foster future developments upon these foundations, potentially by its readers.</p>
<p style="text-align: right; font-style: italic;">Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard de Melo, Claudio Gutierrez, Sabrina Kirrane, José Emilio Labra Gayo, Roberto Navigli, Sebastian Neumaier, Axel-Cyrille Ngonga Ngomo, Axel Polleres, Sabbir M. Rashid, Anisa Rula, Lukas Schmelzeisen, Juan Sequeda, Steffen Staab, Antoine Zimmermann<br/>
September 2021</p>
</section>
<section id="sec-ack" class="prechapter">
<h2 id="ack">Acknowledgements</h2>
<p>We thank the organisers and attendees of the Dagstuhl Seminar on “Knowledge Graphs”. We also thank those who provided feedback on this content.</p>
<p>Hogan was funded by Fondecyt Grant No. 1181896. Hogan & Gutierrez were funded by ANID – Millennium Science Initiative Program – Code ICN17_002. Cochez did part of the work while employed at Fraunhofer FIT, Germany and was later partially funded by Elsevier’s Discovery Lab. Kirrane, Ngonga Ngomo, Polleres & Staab received funding through the project “KnowGraphs” from the European Union’s Horizon programme under the Marie Skłodowska-Curie grant agreement No. 860801. Kirrane & Polleres were supported by the European Union’s Horizon 2020 research and innovation programme under grant 731601. Labra was supported by the Spanish Ministry of Economy and Competitiveness (Society challenges: TIN2017-88877-R). Navigli was supported by the MOUSSE ERC Grant No. 726487 under the European Union’s Horizon 2020 research and innovation programme. Rashid was supported by IBM Research AI through the AI Horizons Network. Schmelzeisen was supported by the German Research Foundation (DFG) grant STA 572/18-1.</p>
<p style="text-align: right; font-style: italic;">Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard de Melo, Claudio Gutierrez, Sabrina Kirrane, José Emilio Labra Gayo, Roberto Navigli, Sebastian Neumaier, Axel-Cyrille Ngonga Ngomo, Axel Polleres, Sabbir M. Rashid, Anisa Rula, Lukas Schmelzeisen, Juan Sequeda, Steffen Staab, Antoine Zimmermann<br/>
September 2021</p>
</section>
<section id="chap-intro" class="chapter">
<h2>Introduction</h2>
<p>Though the phrase “knowledge graph” has been used in the literature since at least 1972 [<a href="#ref-Schneider72">Schneider, 1973</a>], the modern incarnation of the phrase stems from the 2012 announcement of the Google Knowledge Graph [<a href="#ref-GoogleKG">Singhal, 2012</a>], followed by further announcements of knowledge graphs by Airbnb [<a href="#ref-AirBnBKG">Chang, 2018</a>], Amazon [<a href="#ref-AmazonKG">Krishnan, 2018</a>], eBay [<a href="#ref-eBayKG">Pittman et al., 2017</a>], Facebook [<a href="#ref-NoyGJNPT19">Noy et al., 2019</a>], IBM [<a href="#ref-IBMKG">Devarajan, 2017</a>], LinkedIn [<a href="#ref-LinkedInKG">He et al., 2016</a>], Microsoft [<a href="#ref-BingKG">Shrivastava, 2017</a>], Uber [<a href="#ref-UberKG">Hamad et al., 2018</a>], and more besides. The growing industrial uptake of the concept proved difficult for academia to ignore: more and more scientific literature is being published on knowledge graphs, which includes books (e.g., [<a href="#ref-PVGW2017">Pan et al., 2017</a>, <a href="#ref-QiCLWJW19">Qi et al., 2021</a>, <a href="#ref-FenselSAHKPTUW20">Fensel et al., 2020</a>, <a href="#ref-KejriwalKS2021">Kejriwal et al., 2021</a>]), as well as papers outlining definitions (e.g., [<a href="#ref-EhrlingerW16">Ehrlinger and Wöß, 2016</a>]), novel techniques (e.g., [<a href="#ref-PujaraMGC13">Pujara et al., 2013</a>, <a href="#ref-wang2014knowledge">Wang et al., 2014</a>, <a href="#ref-lin2015learning">Lin et al., 2015</a>]), and surveys of specific aspects of knowledge graphs (e.g., [<a href="#ref-Paulheim17">Paulheim, 2017</a>, <a href="#ref-Wang2017KGEmbedding">Wang et al., 2017</a>]).</p>
<p>Underlying all such developments is the core idea of using graphs to represent data, often enhanced with some way to explicitly represent knowledge [<a href="#ref-NoyGJNPT19">Noy et al., 2019</a>]. The result is most often used in application scenarios that involve integrating, managing and extracting value from diverse sources of data at large scale [<a href="#ref-NoyGJNPT19">Noy et al., 2019</a>]. Employing a graph-based abstraction of knowledge has numerous benefits in such settings when compared with, for example, a relational model or NoSQL alternatives. Graphs provide a concise and intuitive abstraction for a variety of domains, where edges capture the (potentially cyclical) relations between the entities inherent in social data, biological interactions, bibliographical citations and co-authorships, transport networks, and so forth [<a href="#ref-AnglesG08">Angles and Gutierrez, 2008</a>]. Graphs allow maintainers to postpone the definition of a schema, allowing the data – and its scope – to evolve in a more flexible manner than typically possible in a relational setting, particularly for capturing incomplete knowledge [<a href="#ref-Abiteboul97">Abiteboul, 1997</a>]. Unlike (other) NoSQL models, specialised graph query languages support not only standard relational operators (joins, unions, projections, etc.), but also navigational operators for recursively finding entities connected through arbitrary-length paths [<a href="#ref-AnglesABHRV17">Angles et al., 2017</a>]. Standard knowledge representation formalisms – such as ontologies [<a href="#ref-OWL2">Hitzler et al., 2012</a>, <a href="#ref-RDFS">Brickley and Guha, 2014</a>, <a href="#ref-obof">Mungall et al., 2012</a>] and rules [<a href="#ref-swrl">Horrocks et al., 2004</a>, <a href="#ref-rif">Kifer and Boley, 2013</a>] – can be employed to define and reason about the semantics of the terms used to label and describe the nodes and edges in the graph. Scalable frameworks for graph analytics [<a href="#ref-MalewiczABDHLC10">Malewicz et al., 2010</a>, <a href="#ref-XinGFS13">Xin et al., 2013a</a>, <a href="#ref-signalcollect">Stutz et al., 2016</a>] can be leveraged for computing centrality, clustering, summarisation, etc., in order to gain insights about the domain being described. Various representations have also been developed that support applying machine learning techniques both directly and indirectly over graphs [<a href="#ref-Wang2017KGEmbedding">Wang et al., 2017</a>, <a href="#ref-abs-1901-00596">Wu et al., 2019</a>].</p>
<p>In summary, the decision to build and use a knowledge graph opens up a range of techniques that can be brought to bear for integrating and extracting value from diverse sources of data et large scale. The goal of this book is to motivate and give a comprehensive introduction to knowledge graphs: to describe their foundational data models and how they can be queried; to discuss representations relating to schema, identity, and context; to discuss deductive and inductive ways to make knowledge explicit; to present a variety of techniques that can be used for the creation and enrichment of graph-structured data; to describe how the quality of knowledge graphs can be discerned and how they can be refined; to discuss standards and best practices by which knowledge graphs can be published; and to provide an overview of existing knowledge graphs found in practice. Our intended audience includes researchers and practitioners who are new to knowledge graphs. As such, we do not assume that readers have specific expertise on knowledge graphs.</p>
<p><em class="paragraph">Knowledge graph</em>. The definition of a “<em>knowledge graph</em>” remains contentious [<a href="#ref-EhrlingerW16">Ehrlinger and Wöß, 2016</a>, <a href="#ref-BonattiDPP18">Bonatti et al., 2018</a>, <a href="#ref-Bergman19">Bergman, 2019</a>], where a number of (sometimes conflicting) definitions have emerged, varying from specific technical proposals to more inclusive general proposals; we address these prior definitions in Appendix <a href="#chap-defs">A</a>. Herein we adopt an inclusive definition, where we view a knowledge graph as <em>a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities of interest and whose edges represent relations between these entities</em>. The graph of data (aka <em>data graph</em>) conforms to a graph-based data model, which may be a <em>directed edge-labelled graph</em>, a <em>property graph</em>, etc. (we discuss concrete alternatives in Chapter <a href="#chap-graph">2</a>). By <em>knowledge</em>, we refer to something that is <em>known</em>. Such knowledge may be accumulated from external sources, or extracted from the knowledge graph itself. Knowledge may be composed of simple statements, such as “<em>Santiago is the capital of Chile</em>”, or quantified statements, such as “<em>all capitals are cities</em>”. Simple statements can be accumulated as edges in the data graph. If the knowledge graph intends to accumulate quantified statements, a more expressive way to represent knowledge – such as <em>ontologies</em> or <em>rules</em> – is required. <em>Deductive methods</em> can then be used to entail and accumulate further knowledge (e.g., “<em>Santiago is a city</em>”). Additional knowledge – based on simple or quantified statements – can also be extracted from and accumulated by the knowledge graph using <em>inductive methods</em>.</p>
<p>Knowledge graphs are often assembled from numerous sources, and as a result, can be highly diverse in terms of structure and granularity. To address this diversity, representations of <em>schema</em>, <em>identity</em>, and <em>context</em> often play a key role, where a <em>schema</em> defines a high-level structure for the knowledge graph, <em>identity</em> denotes which nodes in the graph (or in external sources) refer to the same real-world entity, while <em>context</em> may indicate a specific setting in which some unit of knowledge is held true. As aforementioned, effective methods for <em>extraction</em>, <em>enrichment</em>, <em>quality assessment</em>, and <em>refinement</em> are required for a knowledge graph to grow and improve over time.</p>
<p><em class="paragraph">In practice</em>. Knowledge graphs aim to serve as an ever-evolving shared substrate of knowledge within an organisation or community [<a href="#ref-NoyGJNPT19">Noy et al., 2019</a>]. We distinguish two types of knowledge graphs in practice: <em>open knowledge graphs</em> and <em>enterprise knowledge graphs</em>. Open knowledge graphs are published online, making their content accessible for the public good. The most prominent examples – DBpedia [<a href="#ref-LehmannIJJKMHMK15">Lehmann et al., 2015</a>], Freebase [<a href="#ref-bollacker2007freebase">Bollacker et al., 2007b</a>], Wikidata [<a href="#ref-VrandecicK14">Vrandečić and Krötzsch, 2014</a>], YAGO [<a href="#ref-YAGO">Hoffart et al., 2011</a>], etc. – cover many domains and are either extracted from Wikipedia [<a href="#ref-LehmannIJJKMHMK15">Lehmann et al., 2015</a>, <a href="#ref-YAGO">Hoffart et al., 2011</a>], or built by communities of volunteers [<a href="#ref-bollacker2007freebase">Bollacker et al., 2007b</a>, <a href="#ref-VrandecicK14">Vrandečić and Krötzsch, 2014</a>]. Open knowledge graphs have also been published within specific domains, such as media [<a href="#ref-RaimondFSA14">Raimond et al., 2014</a>], government [<a href="#ref-HendlerHMT12">Hendler et al., 2012</a>, <a href="#ref-ShadboltO13">Shadbolt and O'Hara, 2013</a>], geography [<a href="#ref-StadlerLHA12">Stadler et al., 2012</a>], tourism [<a href="#ref-LuLS16">Lu et al., 2016</a>, <a href="#ref-abs-1805-05744">Kärle et al., 2018</a>, <a href="#ref-MaturanaALMH18">Maturana et al., 2018</a>, <a href="#ref-ZhangCHYAL19">Zhang et al., 2019</a>], life sciences [<a href="#ref-CallahanCAD13">Callahan et al., 2013</a>], and more besides. Enterprise knowledge graphs are typically internal to a company and applied for commercial use-cases [<a href="#ref-NoyGJNPT19">Noy et al., 2019</a>]. Prominent industries using enterprise knowledge graphs include Web search (e.g., Bing [<a href="#ref-BingKG">Shrivastava, 2017</a>], Google [<a href="#ref-GoogleKG">Singhal, 2012</a>]), commerce (e.g., Airbnb [<a href="#ref-AirBnBKG">Chang, 2018</a>], Amazon [<a href="#ref-AmazonKG">Krishnan, 2018</a>, <a href="#ref-dong2019building">Dong, 2019</a>], eBay [<a href="#ref-eBayKG">Pittman et al., 2017</a>], Uber [<a href="#ref-UberKG">Hamad et al., 2018</a>]), social networks (e.g., Facebook [<a href="#ref-NoyGJNPT19">Noy et al., 2019</a>], LinkedIn [<a href="#ref-LinkedInKG">He et al., 2016</a>]), finance (e.g., Accenture [<a href="#ref-AccentureKG">Okorafor and Ray, 2019</a>], Banca d’Italia [<a href="#ref-BellomariniFGS19">Bellomarini et al., 2019</a>], Bloomberg [<a href="#ref-BloombergKG">Meij, 2019</a>], Capital One [<a href="#ref-CapitalOneKG">Branum and Sehon, 2019</a>], Wells Fargo [<a href="#ref-WellsFargoKG">Newman, 2019</a>]), among others. Applications include search [<a href="#ref-BingKG">Shrivastava, 2017</a>, <a href="#ref-GoogleKG">Singhal, 2012</a>], recommendations [<a href="#ref-AirBnBKG">Chang, 2018</a>, <a href="#ref-UberKG">Hamad et al., 2018</a>, <a href="#ref-LinkedInKG">He et al., 2016</a>, <a href="#ref-NoyGJNPT19">Noy et al., 2019</a>], personal agents [<a href="#ref-eBayKG">Pittman et al., 2017</a>], advertising [<a href="#ref-LinkedInKG">He et al., 2016</a>], business analytics [<a href="#ref-LinkedInKG">He et al., 2016</a>], risk assessment [<a href="#ref-ThompsonReutersKG">Tobin, 2017</a>, <a href="#ref-MaanaKG">Dalgliesh, 2016</a>], automation [<a href="#ref-HensonSTK19">Henson et al., 2019</a>], and more besides. We will provide more details on the use of knowledge graphs in practice in Chapter <a href="#chap-kgs">10</a>.</p>
<p><em class="paragraph">Running example</em>. To keep the discussion accessible, throughout the book, we present concrete examples in the context of a hypothetical knowledge graph relating to tourism in Chile (loosely inspired by related use-cases [<a href="#ref-abs-1805-05744">Kärle et al., 2018</a>, <a href="#ref-LuLS16">Lu et al., 2016</a>]). The knowledge graph is managed by a tourism board that aims to increase tourism in the country and promote new attractions in strategic areas. The knowledge graph itself will eventually describe tourist attractions, cultural events, services, businesses, travel routes, etc. Some applications the organisation envisages are to:</p>
<ul>
<li>create a tourism portal that allows visitors to search for attractions, upcoming events, and other related services (in multiple languages);</li>
<li>gain insights into tourism demographics in terms of season, nationalities, etc.;</li>
<li>analyse sentiment about tourist attractions, including positive reviews, summaries of complaints about events and services, crime reports, etc.;</li>
<li>understand tourism trajectories: the sequence of attractions, events, etc., that tourists often visit;</li>
<li>cross-reference these tourism trajectories with currently available flights, buses, etc., to suggest new strategic routes for public transport;</li>
<li>offer personalised recommendations of places to visit;</li>
<li>and so forth.</li>
</ul>
<p><em class="paragraph">Outline</em>. The remainder of the book is structured as follows:</p>
<dl id="outline">
<dt>Chapter <a href="#chap-graph">2</a></dt>
<dd>outlines graph data models and the languages used to query them.</dd>
<dt>Chapter <a href="#chap-knowledge">3</a></dt>
<dd>describes representations of schema, identity, and context for graphs.</dd>
<dt>Chapter <a href="#chap-deductive">4</a></dt>
<dd>presents deductive formalisms for representing and entailing knowledge.</dd>
<dt>Chapter <a href="#chap-inductive">5</a></dt>
<dd>describes inductive techniques for learning from graphs.</dd>
<dt>Chapter <a href="#chap-create">6</a></dt>
<dd>discusses the creation and enrichment of knowledge graphs.</dd>
<dt>Chapter <a href="#chap-quality">7</a></dt>
<dd>enumerates dimensions for assessing knowledge graph quality.</dd>
<dt>Chapter <a href="#chap-refine">8</a></dt>
<dd>discusses various techniques for knowledge graph refinement.</dd>
<dt>Chapter <a href="#chap-publish">9</a></dt>
<dd>introduces principles and protocols for publishing knowledge graphs.</dd>
<dt>Chapter <a href="#chap-kgs">10</a></dt>
<dd>surveys some prominent knowledge graphs and their applications.</dd>
<dt>Chapter <a href="#chap-conclude">11</a></dt>
<dd>concludes with future directions for knowledge graphs.</dd>
<dt>Appendix <a href="#chap-defs">A</a></dt>
<dd>outlines the historical background for knowledge graphs.</dd>
</dl>
</section>
<section id="chap-graph" class="chapter">
<h2>Data Graphs</h2>
<p>At the foundation of any knowledge graph is the principle of first applying a graph abstraction to data, resulting in an initial data graph. We now discuss a selection of graph-structured data models that are commonly used in practice to represent data graphs. We then discuss the primitives that form the basis of graph query languages used to interrogate such data graphs.</p>
<section id="ssec-graphModels" class="section">
<h3>Models</h3>
<p>Leaving aside graphs, let us assume that the tourism board from our running example has not yet decided how to model relevant data about attractions, events, services, etc. The board first considers using a tabular structure – in particular, relational databases – to represent the required data, and though they do not know precisely what data they will need to capture, they begin to design an initial relational schema. They begin with an <span class="sf">Event</span> table with five columns:</p>
<p class="mathblock"><span class="sf">Event</span>(<span class="sf underline">name</span>, <span class="sf">venue</span>, <span class="sf">type</span>, <span class="sf underline">start</span>, <span class="sf">end</span>)</p>
<p>where <span class="sf underline">name</span> and <span class="sf underline">start</span> together form the primary key of the table in order to uniquely identify recurring events. But as they start to populate the data, they encounter various issues: events may have multiple names (e.g., in different languages), events may have multiple venues, they may not yet know the start and end date-times for future events, events may have multiple types, and so forth. Incrementally addressing these modelling issues as the data become more diverse, they generate internal identifiers for events and adapt their relational schema until they have:</p>
<p class="mathblock" id="al-schema"><span class="sf">EventName</span>(<span class="sf underline">id</span>,<span class="sf underline">name</span>), <span class="sf">EventStart</span>(<span class="sf underline">id</span>,<span class="sf">start</span>), <span class="sf">EventEnd</span>(<span class="sf underline">id</span>,<span class="sf">end</span>), <span class="sf">EventVenue</span>(<span class="sf underline">id</span>,<span class="sf underline">venue</span>), <span class="sf">EventType</span>(<span class="sf underline">id</span>,<span class="sf underline">type</span>)<span style="float: right;">(1)</span></p>
<p>With the above schema, the organisation can now model events with \(0{-}n\) names, venues, and types, and \(0{-}1\) start dates and end dates (without needing relational nulls).</p>
<p>Along the way, the board has to incrementally change the schema several times in order to support new sources of data. Each such change requires a costly remodelling, reloading, and reindexing of data; here we only considered one table. The tourism board struggles with the relational model because they do not know, <em>a priori</em>, what data will need to be modelled or what sources they will use. But once they reach the latter relational schema, the board finds that they can integrate further sources without more changes: with minimal assumptions on <em>multiplicities</em> (\(1{-}1\), \(1{-}n\), etc.) this schema offers a lot of flexibility for integrating incomplete and diverse data.</p>
<p>In fact, the refined, flexible schema that the board ends up with – as shown in (<a href="#al-schema">2.1</a>) – is modelling a set of binary relations between entities, which indeed can be viewed as modelling a graph. By instead adopting a graph data model from the outset, the board could forgo the need for an upfront schema, and could define any (binary) relation between any pair of entities at any time.</p>
<p>We now introduce graph data models popular in practice [<a href="#ref-AnglesABHRV17">Angles et al., 2017</a>].</p>
<h4 id="sssec-directedelg" class="subsection">Directed edge-labelled graphs</h4>
<p>A directed edge-labelled graph (sometimes known as a <em>multi-relational graph</em> [<a href="#ref-nickel2013tensor">Nickel and Tresp, 2013</a>, <a href="#ref-bordes2013translating">Bordes et al., 2013</a>, <a href="#ref-BalazevicAH19">Balazevic et al., 2019a</a>]) is defined as a set of nodes – like <span class="gnode">Santiago</span>, <span class="gnode">Arica</span>, <span class="gnode">EID16</span>, <span class="gnode">2018-03-22 12:00</span> – and a set of directed labelled edges between those nodes, like <span class="gnode">Santa Lucía</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">city</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">Santiago</span>. In the case of knowledge graphs, nodes are used to represent entities and edges are used to represent (binary) relations between those entities. Figure <a href="#fig-delg">2.1</a> provides an example of how the tourism board could model some relevant event data as a directed edge-labelled graph. The graph includes data about the names, types, start and end date-times, and venues for events.<sup class="fnmark" id="fnm1"><a href="#fn1">1</a></sup><span class="footnote" id="fn1"><sup><a href="#fnm1">note 1</a></sup> We represent bidirectional edges as <span class="gnode">Viña del Mar</span><img class="tip" src="images/edge-revtip.png" width="15" alt="arrow tip leftward"/><span class="edge">bus</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">Arica</span>, which more concisely depicts two directed edges: <span class="gnode">Viña del Mar</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">bus</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">Arica</span> and <span class="gnode">Viña del Mar</span><img class="tip" src="images/edge-revtip.png" width="15" alt="arrow tip leftward"/><span class="edge">bus</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="gnode">Arica</span>. Also while some naming conventions recommend more complete edge labels that include a verb, such as <span class="gelab">has venue</span> or <span class="gelab">is valid from</span>, in this book, for presentation purposes, we will omit the “<code>has</code>” and “<code>is</code>” verbs from such labels, using simply <span class="gelab">venue</span> or <span class="gelab">valid from</span>.</span> Adding information to such a graph typically involves adding new nodes and edges (with some exceptions discussed later). Representing incomplete information requires simply omitting a particular edge; for example, the graph does not yet define a start/end date-time for the Food Truck festival.</p>
<figure id="fig-delg">
<img src="images/fig-delg.svg" alt="Directed edge-labelled graph describing events and their venues" />
<figcaption>Directed edge-labelled graph describing events and their venues</figcaption>
</figure>
<p>Modelling data as a graph in this way offers more flexibility for integrating new sources of data, compared to the standard relational model, where a schema must be defined upfront and followed at each step. While other structured data models such as trees (XML, JSON, etc.) would offer similar flexibility, graphs do not require organising the data hierarchically (should <code>venue</code> be a parent, child, or sibling of <code>type</code> for example?). They also allow cycles to be represented and queried (e.g., note the directed cycle in the routes between Santiago, Arica, and Viña del Mar).</p>
<p>A standardised data model based on directed edge-labelled graphs is the Resource Description Framework (RDF) [<a href="#ref-rdf11">Cyganiak et al., 2014</a>], which has been recommended by the W3C. The RDF model defines different types of nodes, including <em>Internationalized Resource Identifiers</em> (IRIs) [<a href="#ref-rfc3987">Dürst and Suignard, 2005</a>] which allow for global identification of entities on the Web; <em>literals</em>, which allow for representing strings (with or without language tags) and other datatype values (integers, dates, etc.); and <em>blank nodes</em>, which are anonymous nodes that are not assigned an identifier (for example, rather than create internal identifiers like <code>EID15</code>, <code>EID16</code>, in RDF, we have the option to use blank nodes). We will discuss these different types of nodes further in Section <a href="#sec-identity">3.2</a> when we speak about issues relating to identity.</p>
<div class="formal">
<p>We now formally define a directed edge-labelled graph, where we denote by \(\con\) a countably infinite set of constants.</p>
<dl class="definition" id="def-delg">
<dt>Directed edge-labelled graph</dt>
<dd>A <em>directed edge-labelled graph</em> is a tuple \(G = (V,E,L)\), where \(V \subseteq \con\) is a set of nodes, \(L \subseteq \con\) is a set of edge labels, and \(E \subseteq V \times L \times V\) is a set of edges.</dd>
</dl>
<div class="example">
<p>In reference to Figure <a href="#fig-delg">2.1</a>, the set of nodes \(V\) has 15 elements, including <code>Arica</code>, <code>EID16</code>, etc. The set of edges \(E\) has 23 triples, including (<code>Arica</code>, <code>flight</code>, <code>Santiago</code>). Bidirectional edges are represented with two edges. The set of edge labels \(L\) has 8 elements, including <code>start</code>, <code>flight</code>, etc.</p>
</div>
<p>Definition <a href="#def-delg">2.1</a> does not state that \(V\) and \(L\) are disjoint: though not present in the example, a node can also serve as an edge-label. The definition also permits that nodes and edge labels can be present without any associated edge. Either restriction could be explicitly stated – if necessary – in a particular application while still conforming to a directed edge-labelled graph.</p>
<p>For ease of presentation presentation, we may treat a set of (directed labelled) edges \(E \subseteq V \times L \times V\) as a directed edge-labelled graph \((V,E,L)\), in which case we refer to the graph induced by \(E\) assuming that \(V\) and \(L\) contain all and only those nodes and edge labels, respectively, used in \(E\). We may similarly apply set operators on directed edge-labelled graphs, which should be interpreted as applying to their sets of edges; for example, given \(G_1 = (V_1,E_1,L_1)\) and \(G_2 = (V_2,E_2,L_2)\), by \(G_1 \cup G_2\) we refer to the directed edge-labelled graph induced by \(E_1 \cup E_2\).</p>
</div>
<h4 id="subsub-heterograph" class="subsection">Heterogeneous graphs</h4>
<p>A heterogeneous graph [<a href="#ref-HusseinYC18">Hussein et al., 2018</a>, <a href="#ref-WangJSWYCY19">Wang et al., 2019</a>, <a href="#ref-YangXJWHW20">Yang et al., 2020</a>] (or <em>heterogeneous information network</em> [<a href="#ref-sun2011pathsim">Sun et al., 2011</a>, <a href="#ref-2012Sun">Sun and Han, 2012</a>]) is a directed graph where each node and edge is assigned one type. Heterogeneous graphs are thus akin to directed edge-labelled graphs – with edge labels corresponding to edge types – but where the type of node forms part of the graph model itself, rather than being expressed with a relation (as seen in Figure <a href="#fig-capital">2.2</a>). An edge is called <em>homogeneous</em> if it is between two nodes of the same type (e.g., <span class="gelab">borders</span> in Figure <a href="#fig-capital">2.2</a>); otherwise it is called <em>heterogeneous</em> (e.g., <span class="gelab">capital</span> in Figure <a href="#fig-capital">2.2</a>). Heterogeneous graphs allow for partitioning nodes according to their type, for example, for the purposes of machine learning tasks [<a href="#ref-HusseinYC18">Hussein et al., 2018</a>, <a href="#ref-WangJSWYCY19">Wang et al., 2019</a>, <a href="#ref-YangXJWHW20">Yang et al., 2020</a>]. Conversely, such graphs typically only support a one-to-one relation between nodes and types, which is not the case for directed edge-labelled graphs (see, for example, the node <span class="gnode">Santiago</span> with zero types and <span class="gnode">EID15</span> with multiple types in Figure <a href="#fig-delg">2.1</a>.</p>
<figure id="fig-capital">
<figure id="fig-cap">
<img src="images/fig-cap.svg" alt="Del graph"/>
<figcaption>Directed edge-labelled graph</figcaption>
</figure>
<figure id="fig-hg">
<img src="images/fig-hg.svg" alt="Heterogenous graph"/>
<figcaption>Heterogenous graph</figcaption>
</figure>
<figcaption>Comparing directed edge-labelled graphs and heterogeneous graphs</figcaption>
</figure>
<div class="formal">
<p>We next define the notion of a heterogeneous graph.</p>
<dl class="definition" id="def-hg">
<dt>Heterogeneous graph</dt>
<dd>A <em>heterogeneous graph</em> is a tuple \(G = (V,E,L,l)\), where \(V \subseteq \con\) is a set of nodes, \(L \subseteq \con\) is a set of edge/node labels, \(E \subseteq V \times L \times V\) is a set of edges, and \(l : V \rightarrow L\) maps each node to a label.</dd>
</dl>
<div class="example">
<p>In reference to Figure <a href="#fig-hg">2.2b</a>, the set of nodes \(V\) has three elements: <code>Santiago</code>, <code>Chile</code>, and <code>Perú</code>. The set of edges \(E\) has 3 triples, including (<code>Santiago</code>, <code>capital</code>, <code>Chile</code>). The set of edge labels \(L\) has 4 elements: <code>capital</code>, <code>borders</code>, <code>City</code>, <code>Country</code>. Finally, with respect to the node labels, \(l(\)<code>Santiago</code>\() =\) <code>City</code>, \(l(\)<code>Chile</code>\() =\) <code>Country</code>, and \(l(\)<code>Perú</code>\() =\) <code>Country</code>.</p>
</div>
<p>In heterogeneous graphs, edge and node labels are often called <em>types</em>. By rather defining edges with labels as per directed edge-labelled graphs – rather than separately labelling edges with \(l\) – two nodes can be related by \(n\) edges with \(n\) different labels; for example, we can represent both \((\)<code>Santiago</code>, <code>capital</code>, <code>Chile</code>\()\) and \((\)<code>Santiago</code>, <code>country</code>, <code>Chile</code>\()\) as edges in the heterogeneous graph.</p>
</div>
<h4 id="sssec-propgraph" class="subsection">Property graphs</h4>
<p>Property graphs constitute an alternative graph model that offers additional flexibility when modelling more complex relations. Consider integrating incoming data that provide further details on which companies offer fares on which flights, allowing the board to better understand available routes between cities (for example, on national airlines). In the case of directed edge-labelled graphs, we cannot directly annotate an edge like <span class="gnode">Santiago</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">flight</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">Arica</span> with the company (or companies) offering that route. But we could add a new node denoting a flight, connect it with the source, destination, companies, and mode, as shown in Figure <a href="#fig-fsa">2.3a</a>. Applying this modelling to all routes in Figure <a href="#fig-delg">2.1</a> would, however, involve significant changes.</p>
<p>The property graph model was thus proposed to offer additional flexibility when modelling data as a graph [<a href="#ref-Miller13">Miller, 2013</a>, <a href="#ref-AnglesABHRV17">Angles et al., 2017</a>]. A property graph allows a set of <em>property–value</em> pairs and a <em>label</em> to be associated with both nodes and edges. Figure <a href="#fig-pg">2.3b</a> depicts an example of a property graph with data analogous to Figure <a href="#fig-fsa">2.3a</a>. We use property–value pairs on edges to model the companies. The type of relation is captured by the label <code>flight</code>}. We further use node labels to indicate the types of the two nodes, and property–value pairs for their latitude and longitude.</p>
<figure id="fig-flghts">
<figure id="fig-fsa">
<img src="images/fig-fsa.svg" alt="Directed edge-labelled graph"/>
<figcaption>Directed edge-labelled graph</figcaption>
</figure>
<figure id="fig-pg">
<img src="images/fig-pg.svg" alt="Property graph"/>
<figcaption>Property graph</figcaption>
</figure>
<figcaption>Comparing directed edge-labelled graphs and property graphs</figcaption>
</figure>
<p>Property graphs are prominently used in graph databases, such as Neo4j [<a href="#ref-Miller13">Miller, 2013</a>, <a href="#ref-AnglesABHRV17">Angles et al., 2017</a>]. Property graphs can be converted to/from directed edge-labelled graphs [<a href="#ref-HernandezHK15">Hernández et al., 2015</a>, <a href="#ref-AnglesTT19">Angles et al., 2019</a>] (per, e.g., Figure <a href="#fig-pg">2.3b</a>). In summary, directed edge-labelled graphs offer a more minimal model, while property graphs offer a more flexible one. Often the choice of model will be secondary to other practical factors, such as the implementations available for different models, etc.</p>
<div class="formal">
<p>We formally define a property graph.</p>
<dl class="definition" id="def-pg">
<dt>Property graph</dt>
<dd>A <em>property graph</em> is a tuple \(G = (V,E,L,P,U,e,l,p)\), where \(V \subseteq \con\) is a set of node ids, \(E \subseteq \con\) is a set of edge ids, \(L \subseteq \con\) is a set of labels, \(P \subseteq \con\) is a set of properties, \(U \subseteq \con\) is a set of values, \(e : E \rightarrow V \times V\) maps an edge id to a pair of node ids, \(l : V \cup E \rightarrow 2^L\) maps a node or edge id to a set of labels, and \(p : V \cup E \rightarrow 2^{P \times U}\) maps a node or edge id to a set of property–value pairs.</dd>
</dl>
<div class="example">
<p>Returning to Figure <a href="#fig-pg">2.3b</a>:</p>
<ul>
<li>the set \(V\) contains <code>Santiago</code> and <code>Arica</code>;</li>
<li>the set \(E\) contains <code>LA380</code> and <code>LA381</code>;</li>
<li>the set \(L\) contains <code>Capital City</code>, <code>Port City</code>, and <code>flight</code>;</li>
<li>the set \(P\) contains <code>lat</code>, <code>long</code>, and <code>company</code>;</li>
<li>the set \(U\) contains <code>–33.45</code>, <code>–70.66</code>, <code>LATAM</code>, <code>–18.48</code>, and <code>–70.33</code>;</li>
<li>the mapping \(e\) gives, for example, \(e(\)<code>LA380</code>\() = (\)<code>Santiago</code>, <code>Arica</code>\()\);</li>
<li>the mapping \(l\) gives, for example, \(l(\)<code>Santiago</code>\() =\{ \)<code>Capital City</code>\(\}\) and \(l(\)<code>LA380</code>\() =\{ \)<code>flight</code>\(\}\);</li>
<li>the mapping \(p\) gives, for example, \(p(\)<code>LA380</code>\() =\{ (\)<code>company</code>, <code>LATAM</code>\() \}\) and \(p(\)<code>Santiago</code>\() =\{ (\)<code>lat</code>, <code>–33.45</code>\(), (\)<code>long</code>, <code>–70.66</code>\() \}\).</li>
</ul>
</div>
<p>Unlike previous definitions [<a href="#ref-AnglesABHRV17">Angles et al., 2017</a>], we allow a node or edge to have several values for a given property. In practice, systems like Neo4j [<a href="#ref-Miller13">Miller, 2013</a>] may rather support this by allowing a single array (i.e., list) of values.</p>
</div>
<h4 id="subsub-graphdataset" class="subsection">Graph dataset</h4>
<p>Although multiple directed edge-labelled graphs can be merged by taking their union, it is often desirable to manage several graphs rather than one monolithic graph; for example, it may be beneficial to manage multiple graphs from different sources, making it possible to update or refine data from one source, to distinguish untrustworthy sources from more trustworthy ones, and so forth. A graph dataset then consists of a set of <em>named graphs</em> and a <em>default graph</em>. Each named graph is a pair of a graph ID and a graph. The default graph is a graph without an ID, and is referenced “by default” if a graph ID is not specified. Figure <a href="#fig-gd">2.4</a> provides an example where events and routes are stored in two named graphs, and the default graph manages metadata about the named graphs. Graph names can also be used as nodes in a graph. Furthermore, nodes and edges can be repeated across graphs, where the same node in different graphs will typically refer to the same entity, allowing data on that entity to be integrated when merging graphs. Though the example depicts a dataset of directed edge-labelled graphs, the concept generalises straightforwardly to datasets of other types of graphs.</p>
<figure id="fig-gd">
<img src="images/fig-gd.svg" alt="Graph dataset with two named graphs and a default graph describing events and routes"/>
<figcaption>Graph dataset based on directed edge-labelled graphs with two named graphs and a default graph describing events and routes</figcaption>
</figure>
<p>An RDF dataset is a graph dataset model standardised by the W3C [<a href="#ref-rdf11">Cyganiak et al., 2014</a>] where each graph is an RDF graph, and graph names can be blank nodes or IRIs. A prominent use-case for RDF datasets is to manage and query <em>Linked Data</em> composed of interlinked documents of RDF graphs spanning the Web. When dealing with Web data, tracking the source of data becomes of key importance [<a href="#ref-Dividino09">Dividino et al., 2009</a>, <a href="#ref-BonattiHPS11">Bonatti et al., 2011</a>, <a href="#ref-zimm-etal-2012-JWS">Zimmermann et al., 2012</a>]. We will discuss Linked Data later in Section <a href="#sec-identity">3.2</a> and further discuss provenance in Section <a href="#ssec-knowledgeContext">3.3</a>.</p>
<div class="formal">
<p>We more formally define a graph dataset. We assume that all data graphs featured in a given graph dataset follow the same model (directed edge-labelled graph, heterogeneous graph, property graph, etc).</p>
<dl class="definition" id="def-gd">
<dt>Graph dataset</dt>
<dd>A <em>named graph</em> is a pair \((n,G)\) where \(G\) is a data graph, and \(n \in \con\) is a graph name. A <em>graph dataset</em> is a pair \(D = (G_D,N)\) where \(G_D\) is a data graph called the <em>default graph</em> and \(N\) is either the empty set, or a set of named graphs \(\{ (n_1,G_1), \ldots (n_k,G_k) \}\) (\(k > 0\)) such that if \(i \neq j\) then \(n_i \neq n_j\) (for all \(1 \leq i \leq k\), \(1 \leq j \leq k\)).</dd>
</dl>
<div class="example">
<p>Figure <a href="#fig-gd">2.4</a> provides an example of a directed edge-labelled graph dataset \(D\) consisting of two named graphs and a default graph. The default graph does not have a name associated with it. The two graph names are <code>Events</code> and <code>Routes</code>; these are also used as nodes in the default graph.</p>
</div>
</div>
<h4 id="sssec-othergraphs" class="subsection">Other graph data models</h4>
<p>The previous models are popular examples of graph representations. Other graph data models exist with <em>complex nodes</em> that may contain individual edges [<a href="#ref-AnglesG08">Angles and Gutierrez, 2008</a>, <a href="#ref-Hartig14">Hartig and Thompson, 2014</a>] or nested graphs [<a href="#ref-AnglesG08">Angles and Gutierrez, 2008</a>, <a href="#ref-n3">Berners-Lee and Connolly, 2011</a>] (sometimes called <em>hypernodes</em> [<a href="#ref-LeveneP89">Levene and Poulovassilis, 1989</a>]. Likewise the mathematical notion of a <em>hypergraph</em> defines <em>complex edges</em> that connect sets rather than pairs of nodes. In our view, a knowledge graph can adopt any such graph data model based on nodes and edges: often data can be converted from one model to another (see Figure <a href="#fig-fsa">2.3a</a> vs. Figure <a href="#fig-pg">2.3b</a>). In the rest of the paper, we prefer discussing directed edge-labelled graphs given their relative succinctness, but most discussion extends naturally to other models.</p>
<h4 id="sssec-graphstore" class="subsection">Graph stores</h4>
<p>A variety of techniques have been proposed for storing and indexing graphs, facilitating the efficient evaluation of queries (as discussed next). Directed edge-labelled graphs can be stored in relational databases either as a single relation of arity three (<em>triple table</em>), as a binary relation for each property (<em>vertical partitioning</em>), or as \(n\)-ary relations for entities of a given type (<em>property tables</em>) [<a href="#ref-WylotHCS18">Wylot et al., 2018</a>]. Custom (so-called <em>native</em>) storage techniques have also been developed for a variety of graph models, providing efficient access for finding nodes, edges and their adjacent elements [<a href="#ref-AnglesG08">Angles and Gutierrez, 2008</a>, <a href="#ref-Miller13">Miller, 2013</a>, <a href="#ref-WylotHCS18">Wylot et al., 2018</a>]. A number of systems further allow for distributing graphs over multiple machines based on popular NoSQL stores or custom partitioning schemes [<a href="#ref-WylotHCS18">Wylot et al., 2018</a>, <a href="#ref-JankeS18">Janke and Staab, 2018</a>]. For further details we refer to the book chapter by <a href="#ref-JankeS18">Janke and Staab [2018]</a> and the survey by <a href="#ref-WylotHCS18">Wylot et al. [2018]</a> dedicated to this topic.</p>
</section>
<section id="ssec-querying" class="section">
<h3>Querying</h3>
<p>A number of languages have been proposed for querying graphs [<a href="#ref-AnglesABHRV17">Angles et al., 2017</a>], including the SPARQL query language for RDF graphs [<a href="#ref-sparql11">Harris et al., 2013</a>]; and Cypher [<a href="#ref-FrancisGGLLMPRS18">Francis et al., 2018</a>], Gremlin [<a href="#ref-Rodriguez15">Rodriguez, 2015</a>], and G-CORE [<a href="#ref-AnglesABBFGLPPS18">Angles et al., 2018</a>] for querying property graphs. We refer to <a href="#ref-seifer19">Seifer et al. [2019]</a> for an investigation of the popularity of these languages. Underlying these query languages are some common primitives, including (basic) graph patterns, relational operators, path expressions, and more besides [<a href="#ref-AnglesABHRV17">Angles et al., 2017</a>]. We now describe these core features for querying graphs in turn, starting with basic graph patterns.</p>
<h4 id="sssec-graphpatterns" class="subsection">Basic graph patterns</h4>
<p>At the core of every structured query language for graphs lie <em>basic graph patterns</em> [<a href="#ref-ConsensM90">Consens and Mendelzon, 1990</a>, <a href="#ref-AnglesABHRV17">Angles et al., 2017</a>], which follow the same model as the data graph being queried (see Section <a href="#ssec-graphModels">2.1</a>), additionally allowing variables as terms.<sup class="fnmark" id="fnm2"><a href="#fn2">2</a></sup><span class="footnote" id="fn2"><sup><a href="#fnm2">note 2</a></sup> The terms of a directed edge-labelled graph are its nodes and edge-labels. The terms of a property graph are its ids, labels, properties, and values (as used on either edges or nodes).</span> Terms in basic graph patterns are thus divided into constants, such as <span class="gnode">Arica</span> or <span class="gelab">venue</span>, and variables, which we prefix with question marks, such as <span class="gvar">?event</span> or <span class="gelab" style="color: black">?rel</span>. A basic graph pattern is then evaluated against the data graph by generating mappings from the variables of the graph pattern to constants in the data graph such that the image of the graph pattern under the mapping (replacing variables with the assigned constants) is contained within the data graph.</p>
<p>Figure <a href="#fig-gp">2.5</a> provide an example of a basic graph pattern looking for the venues of Food Festivals, along with the possible mappings generated by the graph pattern against the data graph of Figure <a href="#fig-delg">2.1</a>. In some of the presented mappings (the last two listed), multiple variables are mapped to the same term, which may or may not be desirable depending on the application. Hence a number of semantics have been proposed for evaluating basic graph patterns [<a href="#ref-AnglesABHRV17">Angles et al., 2017</a>], amongst which the most important are: <em>homomorphism-based semantics</em>, which allows multiple variables to be mapped to the same term such that all mappings shown in Figure <a href="#fig-gp">2.5</a> would be considered results; and <em>isomorphism-based semantics</em>, which requires variables on nodes and/or edges to be mapped to unique terms, thus excluding the latter three mappings of Figure <a href="#fig-gp">2.5</a> from the results. Different languages may adopt different semantics for evaluating basic graph patterns; for example, SPARQL adopts a homomorphism-based semantics, while Cypher adopts an isomorphism-based semantics specifically on edges (while allowing multiple variables to map to one node).</p>
<figure id="fig-gp">
<img src="images/fig-gp.svg" alt="Graph pattern" class="multi" />
<div style="height:.5em;"> </div>
<table class="condensedTable">
<thead>
<tr>
<th><span class="sf">?ev</span></th>
<th><span class="sf">?vn1</span></th>
<th><span class="sf">?vn2</span></th>
</tr>
</thead>
<tbody>
<tr>
<td><code>EID16</code></td>
<td><code>Piscina Olímpica</code></td>
<td><code>Sotomayor</code></td>
</tr>
<tr>
<td><code>EID16</code></td>
<td><code>Sotomayor</code></td>
<td><code>Piscina Olímpica</code></td>
</tr>
<tr>
<td><code>EID16</code></td>
<td><code>Piscina Olímpica</code></td>
<td><code>Piscina Olímpica</code></td>
</tr>
<tr>
<td><code>EID16</code></td>
<td><code>Sotomayor</code></td>
<td><code>Sotomayor</code></td>
</tr>
<tr>
<td><code>EID15</code></td>
<td><code>Santa Lucía</code></td>
<td><code>Santa Lucía</code></td>
</tr>
</tbody>
</table>
<div style="height:.5em;"> </div>
<figcaption>basic directed edge-labelled graph pattern (left) with mappings generated over the directed edge-labelled graph of Figure <a href="#fig-delg">2.1</a> (right)</figcaption>
</figure>
<p>As we will see in later examples (particularly Figure <a href="#fig-cgp">2.7</a>), basic graph patterns may also form cycles (be they directed or undirected), and may replace edge labels with variables. Basic graph patterns in the context of other models – such as property graphs – can be defined analogously by allowing variables to replace constants in any position of the model.</p>
<div class="formal">
<p>We formalise basic graph patterns first for directed edge-labelled graphs, and subsequently for property graphs [<a href="#ref-AnglesABHRV17">Angles et al., 2017</a>]. For these definitions, we introduce a countably infinite set of <em>variables</em> \(\var\) ranging over (but disjoint from: \(\con \cap \var = \emptyset\)) the set of constants. We refer generically to constants and variables as <em>terms</em>, denoted and defined as \(\term = \con \cup \var\). We define a basic graph pattern for a model by simply replacing constants with terms (that may be variables). Though we focus on directed edge-labelled graphs and property graphs, basic graph patterns for other graph models can be defined analogously.</p>
<dl class="definition" id="def-delgp">
<dt>Basic directed edge-labelled graph pattern</dt>
<dd>We define a <em>basic directed edge-labelled graph pattern</em> as a tuple \(Q = (V,E,L)\), where \(V \subseteq \term\) is a set of node terms, \(L \subseteq \term\) is a set of edge terms, and \(E \subseteq V \times L \times V\) is a set of edges (triple patterns).</dd>
</dl>
<div class="example">
<p>Returning to the example of Figure <a href="#fig-gp">2.5</a>:</p>
<ul>
<li>the set \(V\) contains the constant <code>Food Festival</code> and variables <code>?event</code>, <code>?ven1</code> and <code>?ven2</code>;</li>
<li>the set \(E\) contains four edges, including \((\)<code>?event</code>, <code>type</code>, <code>Food Festival</code>\()\);</li>
<li>the set \(L\) contains the constants <code>type</code> and <code>venue</code>.</li>
</ul>
</div>
<p>A basic property graph pattern is also defined by introducing variables.</p>
<dl class="definition" id="def-pgp">
<dt>Basic property graph pattern</dt>
<dd>We define a <em>basic property graph pattern</em> as a tuple \(Q = (V,E,L,P,U,e,l,p)\), where \(V \subseteq \term\) is a set of node id terms, \(E \subseteq \term\) is a set of edge id terms, \(L \subseteq \term\) is a set of label terms, \(P \subseteq \term\) is a set of property terms, \(U \subseteq \term\) is a set of value terms, \(e : E \rightarrow V \times V\) maps an edge id term to a pair of node id terms, \(l : V \cup E \rightarrow 2^{L}\) maps a node or edge id term to a set of label terms, and \(p : V \cup E \rightarrow 2^{P \times U}\) maps a node or edge id term to a set of pairs of property–value terms.</dd>
</dl>
<p>Towards defining the results of evaluating a basic graph pattern over a data graph (following the same model), we first define a partial mapping \(\mu : \var \rightarrow \con\) from variables to constants, whose <em>domain</em> (the set of variables for which it is defined) is denoted by \(\dom(\mu)\). Given a basic graph pattern \(Q\), let \(\var(Q)\) denote the set of all variables appearing in (some recursively nested element of) \(Q\). We further denote by \(\mu(Q)\) the image of \(Q\) under \(\mu\), meaning that any variable \(v \in \var(Q) \cap \dom(\mu)\) is replaced in \(Q\) by \(\mu(v)\). Observe that when \(\var(Q) \subseteq \dom(\mu)\), then \(\mu(Q)\) is a data graph (in the corresponding model of \(Q\)).</p>
<p>Next, we define the notion of containment between data graphs. For two directed edge-labelled graphs \(G_1 = (V_1,E_1,L_1)\) and \(G_2 = (V_2,E_2,L_2)\), we say that \(G_1\) is a <em>sub-graph</em> of \(G_2\), denoted \(G_1 \subseteq G_2\), if and only if \(V_1 \subseteq V_2\), \(E_1 \subseteq E_2\), and \(L_1 \subseteq L_2\).<sup class="fnmark" id="fnm3"><a href="#fn3">3</a></sup><span class="footnote" id="fn3"><sup><a href="#fnm3">note 3</a></sup> Given, for example, \(G_1 = (\{a\},\{(a,b,a)\},\{b,c\})\) and \(G_2 = (\{a,c\},\{(a,b,a)\},\{b\})\), we remark that \(G_1 \not\subseteq G_2\) and \(G_2 \not\subseteq G_1\): the former has a label not used on an edge while the latter has a node without an incident edge. In concrete data models like RDF where such cases of nodes or labels without edges cannot occur, the sub-graph relation \(G_1 \subseteq G_2\) holds if and only if \(E_1 \subseteq E_2\) holds.</span> Conversely, in property graphs, nodes can often be defined without edges. For two property graphs \(G_1 = (V_1,E_1,L_1,P_1,U_1,e_1,l_1,p_1)\) and \(G_2 = (V_2,E_2,L_2,P_2,U_2,e_2,l_2,p_2)\), we say that \(G_1\) is a <em>sub-graph</em> of \(G_2\), denoted \(G_1 \subseteq G_2\), if and only if \(V_1 \subseteq V_2\), \(E_1 \subseteq E_2\), \(L_1 \subseteq L_2\), \(P_1 \subseteq P_2\), \(U_1 \subseteq U_2\), for all \(x \in E_1\) it holds that \(e_1(x) = e_2(x)\), and for all \(y \in E_1 \cup V_1\) it holds that \(l_1(y) \subseteq l_2(y)\) and \(p_1(y) \subseteq p_2(y)\).</p>
<p>We are now ready to define the evaluation of a basic graph pattern.</p>
<dl class="definition" id="def-evgp">
<dt>Evaluation of a basic graph pattern</dt>
<dd>Let \(Q\) be a basic graph pattern and let \(G\) be a data graph (in the same model). We then define the <em>evaluation of the basic graph pattern \(Q\) over the data graph \(G\)</em>, denoted \(Q(G)\), to be the set of mappings \(Q(G) = \{ \mu \mid \mu(Q) \subseteq G \text{ and } \dom(\mu) = \var(Q) \}\).</dd>
</dl>
<div class="example">
<p>Figure <a href="#fig-gp">2.5</a> enumerates all of the mappings given by the evaluation of the depicted basic graph pattern over the data graph of Figure <a href="#fig-delg">2.1</a>. Each non-header row indicates a mapping \(\mu\).</p>
</div>
<p>The final results of evaluating a basic graph pattern may vary depending on the choice of semantics: the results under <em>homomorphism-based semantics</em> are defined as \(Q(G)\). Conversely, under <em>isomorphism-based</em> semantics, mappings that send two edge variables to the same constant and/or mappings that send two node variables to the same constant may be excluded from the results. Henceforth we assume the more general <em>homomorphism-based semantics</em>.</p>
</div>
<h4 id="sssec-complexpatterns" class="subsection">Complex graph patterns</h4>
<p>A (basic) graph pattern transforms an input graph into a table of results (as shown in Figure <a href="#fig-gp">2.5</a>). We may then consider using the relational algebra to combine and/or transform such tables, thus forming more complex queries from one or more graph patterns. Recall that the relational algebra consists of unary operators that accept one input table, and binary operators that accept two input tables. Unary operators include projection (\(\pi\)) to output a subset of columns, selection (\(\sigma\)) to output a subset of rows matching a given condition, and renaming of columns (\(\rho\)). Binary operators include union (\(\cup\)) to merge the rows of two tables into one table, difference (\(-\)) to remove the rows from the first table present in the second table, and joins (\(\Join\)) to extend the rows of one table with rows from the other table that satisfy a join condition. Selection and join conditions typically include equalities (\(=\)), inequalities (\(\leq\)), negation (\(\neg\)), disjunction (\(\vee\)), etc. From these operators, we can further define other (syntactic) operators, such as intersection (\(\cap\)) to output rows in both tables, anti-join (\(\rhd\), aka <em>minus</em>) to output rows from the first table for which there are no join-compatible rows in the second table, left-join (\(\mathbin{\rule[0ex]{0.3em}{.5pt}\llap{\rule[1ex]{0.3em}{.5pt}}\mkern-6mu\Join}\), aka <em>optional</em>) to perform a join but keeping rows from the first table without a compatible row in the second table, etc.</p>
<p>Basic graph patterns can then be expressed in a subset of relational algebra (namely \(\pi\), \(\sigma\), \(\rho\), \(\Join\)). Assuming, for example, a single ternary relation \(G(s,p,o)\) representing a graph – i.e., a table \(G\) with three columns \(s\), \(p\), \(o\) – the query of Figure <a href="#fig-gp">2.5</a> can be expressed in relational algebra as:</p>
<p class="mathblock">\(\pi_{ev,vn_1,vn_2}(\sigma_{p=\texttt{type} \wedge o=\texttt{Food Festival} \wedge p_1=p_2=\texttt{venue}}(\rho_{s/ev}(G \bowtie \rho_{p/p_1,o/vn_1}(G) \bowtie \rho_{p/p_2,o/vn_2}(G))))\)</p>
<p>where \(\Join\) denotes a <em>natural join</em>, meaning that equality is checked across pairs of columns with the same name in both tables (here, the join is thus performed on the subject column \(s\)). The result of this query is a table with a column for each variable: \(ev,vn1,vn2\). However, not all queries using \(\pi, \sigma, \rho\) and \(\Join\) on \(G\) can be expressed as basic graph patterns; for example, we cannot choose which variables to project in a basic graph pattern, but rather must project all variables not fixed to a constant.</p>
<p>Graph query languages such as SPARQL [<a href="#ref-sparql11">Harris et al., 2013</a>] and Cypher [<a href="#ref-FrancisGGLLMPRS18">Francis et al., 2018</a>] allow the full use of relational operators over the results of graph patterns, giving rise to <em>complex graph patterns</em> [<a href="#ref-AnglesABHRV17">Angles et al., 2017</a>]. Figure <a href="#fig-cq">2.6</a> presents an example of a complex graph pattern with projected variables in bold, choosing particular variables to appear in the final results. In Figure <a href="#fig-cgp">2.7</a>, we give another example of a complex graph pattern looking for food festivals or drinks festivals not held in Santiago, optionally returning their start date and name (where available).</p>
<figure id="fig-cq">
<img src="images/fig-cq.svg" alt="Conjunctive query" class="multi" />
<table class="condensedTable">
<thead>
<tr>
<th><span class="sf">?name1</span></th>
<th><span class="sf">?con</span></th>
<th><span class="sf">?name2</span></th>
</tr>
</thead>
<tbody>
<tr>
<td><code>Food Truck</code></td>
<td><code>bus</code></td>
<td><code>Food Truck</code></td>
</tr>
<tr>
<td><code>Food Truck</code></td>
<td><code>bus</code></td>
<td><code>Food Truck</code></td>
</tr>
<tr>
<td><code>Food Truck</code></td>
<td><code>bus</code></td>
<td><code>Ñam</code></td>
</tr>
<tr>
<td><code>Food Truck</code></td>
<td><code>flight</code></td>
<td><code>Ñam</code></td>
</tr>
<tr>
<td><code>Food Truck</code></td>
<td><code>flight</code></td>
<td><code>Ñam</code></td>
</tr>
<tr>
<td><code>Ñam</code></td>
<td><code>bus</code></td>
<td><code>Food Truck</code></td>
</tr>
<tr>
<td><code>Ñam</code></td>
<td><code>flight</code></td>
<td><code>Food Truck</code></td>
</tr>
<tr>
<td><code>Ñam</code></td>
<td><code>flight</code></td>
<td><code>Food Truck</code></td>
</tr>
</tbody>
</table>
<figcaption>Complex graph pattern (left) with mappings generated over the graph of Figure <a href="#fig-delg">2.1</a> (right)</figcaption>
</figure>
<p>Complex graph patterns can give rise to duplicate results; for example, the first result in Figure <a href="#fig-cq">2.6</a> appears twice since <code>?city1</code> matches <code>Arica</code> and <code>?city2</code> matches <code>Viña del Mar</code> in one result, and vice-versa in the other. Query languages then offer two semantics: <em>bag semantics</em> preserves duplicates according to the multiplicity of the underlying mappings, while <em>set semantics</em> (typically invoked with a <code>DISTINCT</code> keyword) removes duplicates from the results.</p>
<figure id="fig-cgp">
<img class="inlined" src="images/fig-cgp1.svg" alt="Complex graph pattern 1"/>
<img class="inlined" src="images/fig-cgp2.svg" alt="Complex graph pattern 2"/>
<img class="inlined" src="images/fig-cgp3.svg" alt="Complex graph pattern 3"/>
<img class="inlined" src="images/fig-cgp4.svg" alt="Complex graph pattern 4"/>
<img class="inlined" src="images/fig-cgp5.svg" alt="Complex graph pattern 5"/>
<div><div style="display:inline;">\(Q := ((((Q_1 \cup Q_2) \rhd Q_3)\) \(\mathbin{\rule[0ex]{0.3em}{.5pt}\llap{\rule[1ex]{0.3em}{.5pt}}\mkern-6mu\Join}\) \(Q_4 )\) \(\mathbin{\rule[0ex]{0.3em}{.5pt}\llap{\rule[1ex]{0.3em}{.5pt}}\mkern-6mu\Join}\) \(Q_5),\qquad Q(G) =\) <table class="condensedTable" style="position:relative;top:.6em;display:inline-block;vertical-align:middle;"><thead><tr><th>?event</th><th>?start</th><th>?name</th></tr></thead><tbody><tr><td><code>EID16</code></td><td></td><td><code>Food Truck</code></td></tr></tbody></table></div></div>
<figcaption>Complex graph pattern (\(Q\)) with mappings generated (\(Q(G)\)) over the graph of Figure <a href="#fig-delg">2.1</a> (\(G\))</figcaption>
</figure>
<div class="formal">
<p>We now formally define complex graph patterns.</p>
<dl class="definition" id="def-cgp">
<dt>Complex graph pattern</dt>
<dd><em>Complex graph patterns</em> are defined recursively, as follows:
<ul>
<li>If \(Q\) is a basic graph pattern, then \(Q\) is a <em>complex graph pattern</em>.</li>
<li>If \(Q\) is a complex graph pattern, and \(\mathcal{V} \subseteq \var(Q)\), then \(\pi_\mathcal{V}(Q)\) is a <em>complex graph pattern</em>.</li>
<li>If \(Q\) is a complex graph pattern, and \(R\) is a selection condition with Boolean and equality connectives (\(\wedge\), \(\vee\), \(\neg\), \(=\)) , then \(\sigma_R(Q)\) is a <em>complex graph pattern</em>.</li>
<li>If both \(Q_1\) and \(Q_2\) are complex graph patterns, then \(Q_1 \Join Q_2\), \(Q_1 \cup Q_2\), \(Q_1 - Q_2\) and \(Q_1 \rhd Q_2\) are also <em>complex graph patterns</em>.</li>
</ul>
</dd>
</dl>
<p>We now define the evaluation of complex graph patterns. Given a mapping \(\mu\), for a set of variables \(\mathcal{V} \subseteq \var\) let \(\mu[\mathcal{V}]\) denote the mapping \(\mu'\) such that \(\dom(\mu') = \dom(\mu) \cap \mathcal{V}\) and \(\mu'(v) = \mu(v)\) for all \(v \in \dom(\mu')\) (in other words, \(\mu[\mathcal{V}]\) projects the variables \(\mathcal{V}\) from \(\mu\)). Letting \(R\) denote a Boolean selection condition and \(\mu\) a mapping, we denote by \(\mu \models R\) that \(\mu\) satisfies the Boolean condition. Finally, we define two mappings \(\mu_1\) and \(\mu_2\) to be <em>compatible</em>, denoted \(\mu_1 \sim \mu_2\), if and only if \(\mu_1(v) = \mu_2(v)\) for all \(v \in \dom(\mu_1) \cap \dom(\mu_2)\) (i.e., they map common variables to the same constant). We are now ready to provide the definition.</p>
<dl class="definition" id="def-evalcgp">
<dt>Complex graph pattern evaluation</dt>
<dd>Given a complex graph pattern \(Q\), if \(Q\) is a basic graph pattern, then \(Q(G)\) is defined per Definition <a href="#def-evgp">2.7</a>. Otherwise, \(Q(G)\) is defined as follows:
\begin{align*}
\pi_\mathcal{V}(Q)(G) = & \,\{ \mu[\mathcal{V}] \mid \mu \in Q(G) \} \\
\sigma_R(Q)(G) = & \, \{ \mu \mid \mu \in Q(G)\text{ and }\mu \models R\}\\
Q_1 \Join Q_2(G) = & \,\{ \mu_1 \cup \mu_2 \mid \mu_1 \in Q_2(G), \mu_2 \in Q_1(G)\text{ and }\mu_1 \sim \mu_2 \} \\
Q_1 \cup Q_2(G) = & \,\{ \mu \mid \mu \in Q_1(G)\text{ or } \mu \in Q_2(G) \} \\
Q_1 - Q_2(G) = & \,\{ \mu \mid \mu \in Q_1(G)\text{ and } \mu \notin Q_2(G) \} \\
Q_1 \rhd Q_2(G) = & \,\{ \mu \mid \mu \in Q_1(G)\text{ and }\nexists \mu_2 \in Q_2(G)\text{ such that }\mu \sim \mu_2 \}
\end{align*}</dd>
</dl>
<p>Based on these operators, we can define some additional syntactic operators, such as the <em>left-join</em> \(\mathbin{\rule[0ex]{0.3em}{.5pt}\llap{\rule[1ex]{0.3em}{.5pt}}\mkern-6mu\Join}\), aka <em>optional</em>):</p>
<p>
\begin{align*}
Q_1 \mathbin{\rule[0ex]{0.3em}{.5pt}\llap{\rule[1ex]{0.3em}{.5pt}}\mkern-6mu\Join} Q_2(G) = & \,(Q_1(G) \Join Q_2(G)) \cup (Q_1(G) \rhd Q_2(G))
\end{align*}
</p>
<p>We call such operators <em>syntactic</em> as they do not add expressivity.</p>
<div class="example">
<p>Figure <a href="#fig-cgp">2.7</a> illustrates a complex graph pattern and its evaluation.</p>
</div>
</div>
<h4 id="sssec-navpatterns" class="subsection">Navigational graph patterns</h4>
<p>A key feature that distinguishes graph query languages is the ability to include <em>path expressions</em> in queries. A path expression \(r\) is a regular expression that allows for matching arbitrary-length paths between two nodes using a <em>regular path query</em> \((x,r,y)\), where \(x\) and \(y\) can be variables or constants (or even the same term). The base path expression is where \(r\) is a constant (an edge label). Furthermore if \(r\) is a path expression, then \(r^*\) (<em>Kleene star</em>: zero-or-more) is also a path expression. Finally, if \(r_1\) and \(r_2\) are path expressions, then \(r_1 \mid r_2\) (<em>disjunction</em>) and \(r_1 \cdot r_2\) (<em>concatenation</em>) are also path expressions. A related notion is that of <em>2-way regular path queries</em>, which also allow for querying inverse paths; specifically, if \(r\) is path expression, then it is a <em>2-way path expression</em>, and if \(r\) is a <em>2-way path expression</em>, then \(r^-\) (<em>inverse</em>) is a <em>2-way path expression</em>. Henceforth we will refer generically to both the 1-way and 2-way variants as path expressions and regular path queries.</p>
<p>Regular path queries can be evaluated under a number of different semantics. For example, \((\)<code>Arica</code>, <code>bus*</code>, <code>?city</code>\()\) evaluated against the graph of Figure <a href="#fig-delg">2.1</a> may match the paths shown in Figure <a href="#fig-path">2.8</a>. In fact, since a cycle is present, an infinite number of paths are potentially matched. For this reason, restricted semantics are often applied, returning only the shortest paths, or paths without repeated nodes or edges (as in the case of Cypher).<sup class="fnmark" id="fnm4"><a href="#fn4">4</a></sup><span class="footnote" id="fn4"><sup><a href="#fnm4">note 4</a></sup> Mapping variables to paths requires special treatment [<a href="#ref-AnglesABHRV17">Angles et al., 2017</a>]. Cypher [<a href="#ref-FrancisGGLLMPRS18">Francis et al., 2018</a>] returns a string that encodes a path, upon which certain functions such as <code>length(·)</code> can be applied. G-CORE [<a href="#ref-AnglesABBFGLPPS18">Angles et al., 2018</a>], on the other hand, allows for returning paths, and supports additional operators on them, including projecting them as graphs, applying cost functions, and more besides.</span> Rather than returning paths, another option is to instead return the (finite) set of pairs of nodes connected by a matching path (as in the case of SPARQL 1.1).</p>
<figure id="fig-path">
<img class="inlined" src="images/fig-path1.svg" alt="Path matching 1"/>
<img class="inlined" src="images/fig-path2.svg" alt="Path matching 2"/>
<img class="inlined" src="images/fig-path3.svg" alt="Path matching 3"/>
<img class="inlined" src="images/fig-path4.svg" alt="Path matching 4"/>
<span style="margin-left:2em;">⋯</span>
<figcaption>Example paths matching \((\)<code>Arica</code>, <code>bus*</code>, <code>?city</code>\()\) over the graph of Figure <a href="#fig-delg">2.1</a></figcaption>
</figure>
<p>Regular path queries can then be used in basic graph patterns to express <em>navigational graph patterns</em> [<a href="#ref-AnglesABHRV17">Angles et al., 2017</a>], as shown in Figure <a href="#fig-ngp">2.9</a>, which illustrates a query searching for food festivals in cities reachable (recursively) from Arica by bus or flight. Furthermore, when regular path queries and graph patterns are combined with operators such as projection, selection, union, difference, and optional, the result is known as <em>complex navigational graph patterns</em> [<a href="#ref-AnglesABHRV17">Angles et al., 2017</a>].</p>
<div class="formal">
<p>We first define path expressions and regular path queries.</p>
<dl class="definition" id="def-path-expression">
<dt>Path expression</dt>
<dd>A constant (edge label) \(c\) is a <em>path expression</em>. Furthermore, if \(r\), \(r_1\) and \(r_2\) are path expressions, then:
<ul>
<li>\(r^-\) (<em>inverse</em>) and \(r^*\) (<em>Kleene star</em>) are <em>path expressions</em>.</li>
<li>\(r_1 \cdot r_2\) (<em>concatenation</em>) and \(r_1 \mid r_2\) (<em>disjunction</em>) are <em>path expressions</em>.</li>
</ul>
</dd>
</dl>
<p>We now define the evaluation of a path expression on a directed-edge labelled graph under the SPARQL 1.1-style semantics whereby the endpoints (pairs of start and end nodes) of the path are returned [<a href="#ref-sparql11">Harris et al., 2013</a>].</p>
<dl class="definition" id="def-path-expression-evaluation">
<dt>Path evaluation (directed edge-labelled graph)</dt>
<dd>Given a directed edge-labelled graph \(G = (V,E,L)\) and a path expression \(r\), we define the <em>evaluation of \(r\) over \(G\)</em>, denoted \(r[G]\), as follows:
\begin{align*}
r[G] = &\, \{ (u,v) \mid (u,r,v) \in E \} \,(\text{for }r \in \con) \\
r^-[G] = &\, \{ (u,v) \mid (v,u) \in r[G] \} \\
r_1 \mid r_2[G] = &\, r_1[G] \cup r_2[G] \\
r_1 \cdot r_2[G] = &\, \{ (u,v) \mid \exists w \in V : (u,w) \in r_1[G]\text{ and }(w,v) \in r_2[G]\}\\
r^*[G] = &\, V \cup \bigcup_{n \in \mathbb{N^+}} r^n[G]
\end{align*}
where by \(r^n\) we denote the \(n\)<sup>th</sup>-concatenation of \(r\) (e.g., \(r^3 = r \cdot r \cdot r\)).</dd>
</dl>
<p>The evaluation of a path expression on a property graph \(G = (V,E,L,P,U,e,l,p)\) can be defined analogously by adapting the first definition (in the case that \(r \in \con\)) as follows:</p>
<p>\[ r[G] = \{(u,v) \mid \exists x \in E : e(x) = (u,v)\text{ and }l(e) = r \} \,.\]</p>
<p>The rest of the definitions then remain unchanged.</p>
<p>Query languages may support additional operators, some of which are syntactic (e.g., \(r^+\) is sometimes used for one-or-more, but can be rewritten as \(r \cdot r^*\)), while others may add expressivity such as the case of SPARQL [<a href="#ref-sparql11">Harris et al., 2013</a>], which allows a limited form of negation in expressions (e.g., \(!r\), with \(r\) being a constant or the inverse of a constant, matching any path not labelled \(r\)).</p>
<p>Next we define a regular path query and its evaluation.</p>
<dl class="definition" id="def-regular-path-query">
<dt>Regular path query</dt>
<dd>A <em>regular path query</em> is a triple \((x,r,y)\) where \(x,y \in \con \cup \var\) and \(r\) is a path expression.</dd>
</dl>
<dl class="definition" id="def-regular-path-query-evaluation">
<dt>Regular path query evaluation</dt>
<dd>Let \(G\) denote a directed edge-labelled graph, \(c\), \(c_1\), \(c_2 \in \con\) denote constants and \(z\), \(z_1\), \(z_2 \in \var\) denote variables. Then the <em>evaluation of a regular path query</em> is defined as follows:
\begin{align*}
(c_1,r,c_2)(G) = & \{ \mu_\emptyset \mid (c_1,c_2) \in r[G] \} \\
(c,r,z)(G) = & \{ \mu \mid \dom(\mu) = \{ z \}\text{ and }(c,\mu(z)) \in r[G] \} \\
(z,r,c)(G) = & \{ \mu \mid \dom(\mu) = \{ z \}\text{ and }(\mu(z),c) \in r[G] \} \\
(z_1,r,z_2)(G) = & \{ \mu \mid \dom(\mu) = \{ z_1, z_2 \}\text{ and }(\mu(z_1),\mu(z_2)) \in r[G] \}
\end{align*}
where \(\mu_\emptyset\) denotes the empty mapping such that \(\dom(\mu) = \emptyset\) (the join identity).</dd>
</dl>
<dl class="definition" id="def-navigational-graph-pattern">
<dt>Navigational graph pattern</dt>
<dd>If \(Q\) is a basic graph pattern, then \(Q\) is a <em>navigational graph pattern</em>. If \(Q\) is a navigational graph pattern and \((x,r,y)\) is a regular path query, then \(Q \Join (x,r,y)\) is a <em>navigational graph pattern</em>.</dd>
</dl>
<p>The definition of the evaluation of a navigational graph pattern then follows from the previous definition of a join and the definition of the evaluation of a regular path query (for a directed edge-labelled graph or a property graph, respectively). Likewise, <em>complex navigational graph patterns</em> – and their evaluation – are defined by extending this definition in the natural way with the same operators from Definition <a href="#def-cgp">2.8</a> following the same semantics seen in Definition <a href="#def-evalcgp">2.9</a>.</p>
</div>
<figure id="fig-ngp">
<img src="images/fig-ngp.svg" alt="Navigational graph pattern" class="multi" />
<div style="height:2em;"> </div>
<table class="condensedTable">
<thead>
<tr>
<th><span class="sf">?event</span></th>
<th><span class="sf">?name</span></th>
<th><span class="sf">?city</span></th>
</tr>
</thead>
<tbody>
<tr>
<td><code>EID15</code></td>
<td><code>Ñam</code></td>
<td><code>Santiago</code></td>
</tr>
<tr>
<td><code>EID16</code></td>
<td><code>Food Truck</code></td>
<td><code>Arica</code></td>
</tr>
<tr>
<td><code>EID16</code></td>
<td><code>Food Truck</code></td>
<td><code>Viña del Mar</code></td>
</tr>
</tbody>
</table>
<div style="height:1em;"> </div>
<figcaption>Navigational graph pattern (left) with mappings generated over the graph of Figure <a href="#fig-delg">2.1</a> (right)</figcaption>
</figure>
<h4 id="app-qother" class="subsection">Other features</h4>
<p>Thus far, we have discussed features that form the practical and theoretical foundation of any query language for graphs [<a href="#ref-AnglesABHRV17">Angles et al., 2017</a>]. However, specific query languages for graphs may support other features, such as aggregation (<code>GROUP BY</code>, <code>COUNT</code>, etc.), more complex filters and datatype operators (e.g., range queries on years extracted from a date), federation for querying remotely hosted graphs over the Web, languages for updating graphs, support for entailment, etc. For more information, we refer to the documentation of the respective query languages (e.g., [<a href="#ref-sparql11">Harris et al., 2013</a>, <a href="#ref-AnglesABBFGLPPS18">Angles et al., 2018</a>]) and to the survey by <a href="#ref-AnglesABHRV17">Angles et al. [2017]</a>.</p>
<h4 id="app-quis" class="subsection">Query Interfaces</h4>
<p>Knowledge graphs are often queried by non-expert users who may not be able to express their information needs in terms of a particular graph query language. Different types of interfaces have thus been proposed in order to assist users in querying data graphs. Such interfaces may support, for example:</p>
<dl>
<dt>Faceted browsing:</dt>
<dd>Users start by specifying a simple search, such as a keyword search, a type of node like <code>Food Festival</code>, or possibly other kinds of search. They are then presented with a set of matching results, and a set of facets, which are typically attributes (e.g., <code>venue</code>) and values (e.g., <code>Santa Lucía</code>) present in the current results set. Selecting a value for a facet restricts the current results set to include only results with the indicated value; this selection process can be applied iteratively to restrict results per multiple facets. Often the faceted criteria are translated into and evaluated as graph queries. Though relatively intuitive for users, such systems typically support acyclic queries that generate lists of results (analogous to graph queries that project a single variable), and rarely support more expressive queries. Examples of faceted browsing systems for graphs include VisiNav [<a href="#ref-Harth10">Harth, 2010</a>], Broccoli [<a href="#ref-BastB13">Bast and Buchhold, 2013</a>], SemFacet [<a href="#ref-ArenasGKMZ16">Arenas et al., 2016</a>], GraFa [<a href="#ref-Moreno-VegaH18">Moreno-Vega and Hogan, 2018</a>], etc.</dd>
<dt>Query building:</dt>
<dd>Users are provided with a form or graphical interface that can be used to specify a graph query without needing to understand the syntax of a specific query language. Such query builders allow for incrementally adding nodes or edges to the query, assisted by features such as auto-completion, previewing intermediate results, and graph navigation. Query builders typically allow for expressing queries equivalent to (cyclic) basic graph patterns, but may not support more expressive features of query languages as described herein. Graph query builder systems include Smeagol [<a href="#ref-ClemmerD11">Clemmer and Davies, 2011</a>], QueryVOWL [<a href="#ref-HaagLSE15a">Haag et al., 2015</a>], VIIQ [<a href="#ref-JayaramGL15">Jayaram et al., 2015a</a>], Sparklis [<a href="#ref-Ferre17">Ferré, 2017</a>], RDF Explorer [<a href="#ref-VargasAHL19">Vargas et al., 2019</a>], and more besides.</dd>
<dt>Query-by-example:</dt>
<dd>Users provide examples of positive and sometimes negative answers to their queries. For example, they may provide as positive examples the nodes <span class="gnode">Arica</span>, <span class="gnode">Santiago</span>, <span class="gnode">Viña del Mar</span>, and as negative examples the nodes <span class="gnode">Chile</span>, <span class="gnode">Lima</span>, where the system will then “reverse engineer” a query that returns positive examples but not negative examples (in this case, the query proposed may return nodes of type <code>City</code> whose <code>country</code> is <code>Chile</code>). Query-by-example systems typically support basic graph patterns, and may not support more expressive querying features. They are useful in cases where users have examples of what they are looking for, but are not necessarily sure of the query they need to retrieve similar examples. Query-by-example systems for graphs include GQBE [<a href="#ref-JayaramKLYE15">Jayaram et al., 2015b</a>] and SPARQLByE [<a href="#ref-DiazAB16">Diaz et al., 2016</a>].</dd>
<dt>Question answering:</dt>
<dd>Users express their queries as questions in natural language; for example, they might ask “<em>What food festivals will be held in Arica?</em>”. The question answering system will then generate answers from the graph based on its best interpretation of the question. We identify three types of question answering system. <em>Navigation-based systems</em> identify entities/nodes from the graph that are mentioned in the query, and then attempt to navigate edges from those nodes whose labels best match the question; for example, they may match the nodes <span class="gnode">Food Festival</span> and <span class="gnode">Arica</span> in the graph based on the question, and from there, try to navigate edges in the graph whose labels match the question in order to find answers. <em>Template-based systems</em> rather pre-suppose a fixed list of question templates expressed in the query language, with placeholder variables that will be replaced with entities/nodes detected in the question; a template matched for the previous example may be of the form “<em>What <code>X</code> will be held in <code>Y</code>?</em>”. <em>Translation-based systems</em> attempt to translate the question into a query in the structured query language, using (typically neural) machine translation techniques. The latter two types of question answering systems can additionally return a graph query that explains the answers generated. Question answering systems are often very intuitive to use, but may not always return correct results, particularly when considering complex questions/queries. Examples of question answering systems for knowledge graphs include Treo [<a href="#ref-FreitasOOCS11a">Freitas et al., 2011</a>], NFF [<a href="#ref-Hu0YWZ18">Hu et al., 2018</a>], TemplateQA [<a href="#ref-ZhengYZC18">Zheng et al., 2018</a>], WDAqua-core1 [<a href="#ref-DiefenbachBSM20">Diefenbach et al., 2020</a>], and more besides.</dd>
</dl>
<p>Such query interfaces enable non-expert users to formulate queries over graphs, which in turn broadens the potential impact of knowledge graphs.</p>
</section>
</section>
<section id="chap-knowledge" class="chapter">
<h2>Schema, Identity, Context</h2>
<p>In this chapter we describe extensions of the data graph – relating to schema, identity and context – that provide additional structures for accumulating knowledge. Henceforth, we refer to a <em>data graph</em> as a collection of data represented as nodes and edges using one of the models discussed in Chapter <a href="#chap-graph">2</a>. We refer to a <em>knowledge graph</em> as a data graph potentially enhanced with representations of schema, identity, context, ontologies and/or rules. These additional representations may be embedded in the data graph, or layered above. Representations for schema, identity and context are discussed now, while ontologies and rules will be discussed in Chapter <a href="#chap-deductive">4</a>.</p>
<section id="sec-schema" class="section">
<h3>Schema</h3>
<p>One of the benefits of modelling data as graphs – versus, for example, the relational model – is the option to forgo or postpone the definition of a schema. However, when modelling data as graphs, schemata <em>can</em> be used to prescribe a high-level structure and/or semantics that the graph follows or should follow. We discuss three types of graph schemata: <em>semantic</em>, <em>validating</em>, and <em>emergent</em>.</p>
<h4 id="sec-semSchema" class="subsection">Semantic schema</h4>
<p>A semantic schema allows for defining the meaning of high-level terms (aka <em>vocabulary</em> or <em>terminology</em>) used in the graph, which facilitates reasoning over graphs using those terms. Looking at Figure <a href="#fig-delg">2.1</a>, for example, we may notice some natural groupings of nodes based on the types of entities to which they refer. We may thus decide to define <em>classes</em>, such as <code>Event</code>, <code>City</code>, etc., to denote these groupings. In fact, Figure <a href="#fig-delg">2.1</a> already illustrates three low-level classes – <code>Open Market</code>, <code>Food Market</code>, <code>Drinks Festival</code> – grouping similar entities with an edge labelled <span class="gelab">type</span>. We may subsequently wish to capture some relations between some of these classes. In Figure <a href="#fig-classhier">3.1</a>, we present a class hierarchy for events where children are defined to be <em>sub-classes</em> of their parents such that if we find an edge <span class="gnode">EID15</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">type</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">Food Festival</span> in our graph, we may also <em>infer</em> that <span class="gnode">EID15</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">type</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">Festival</span> and <span class="gnode">EID15</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">type</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">Event</span> hold in the graph.</p>
<figure id="fig-classhier">
<img src="images/fig-classhier.svg" alt="Example class hierarchy for Event"/>
<figcaption>Example class hierarchy for <code>Event</code></figcaption>
</figure>
<p>Aside from classes, we may also wish to define the semantics of edge labels, aka <em>properties</em>. Returning to Figure <a href="#fig-delg">2.1</a>, we may consider that the properties <span class="gelab">city</span> and <span class="gelab">venue</span> are <em>sub-properties</em> of a more general property <span class="gelab">location</span>, such that given an edge <span class="gnode">Santa Lucía</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">city</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">Santiago</span>, for example, we may also infer that <span class="gnode">Santa Lucía</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">location</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">Santiago</span> must hold as an edge in the graph. We may also consider, for example, that <span class="gelab">bus</span> and <span class="gelab">flight</span> are both sub-properties of a more general property <span class="gelab">connects to</span>. Along these lines, properties may also form a hierarchy similar to what we saw for classes. We may further define the <em>domain</em> of properties, indicating the class(es) of entities for nodes from which edges with that property extend; for example, we may define that the domain of <span class="gelab">connects to</span> is a class <code>Place</code>, such that given the previous sub-property relations, we infer <span class="gnode">Arica</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">type</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">Place</span>. Conversely, we may define the <em>range</em> of properties, indicating the class(es) of entities for nodes to which edges with that property extend; for example, we may define that the range of <span class="gelab">city</span> is a class <code>City</code>, inferring that <span class="gnode">Arica</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">type</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">City</span>.</p>
<p>A prominent standard for defining a semantic schema for (RDF) graphs is the <em>RDF Schema</em> (<em>RDFS</em>) standard [<a href="#ref-RDFS">Brickley and Guha, 2014</a>], which allows for defining sub-classes, sub-properties, domains, and ranges amongst the classes and properties used in an RDF graph, where such definitions can be serialised as a graph. We illustrate the semantics of these features in Table <a href="#tab-semSchema">3.1</a> and provide a concrete example of definitions in Figure <a href="#fig-sg">3.2</a> for a sample of terms used in the running example. These definitions can then be embedded into a data graph. More generally, the semantics of terms used in a graph can be defined in much more depth than seen here, as is supported by the <em>Web Ontology Language</em> (<em>OWL</em>) standard [<a href="#ref-OWL2">Hitzler et al., 2012</a>] for RDF graphs. We will return to such semantics later in Chapter <a href="#chap-deductive">4</a>.</p>
<table class="normalTable" id="tab-semSchema">
<caption>Definitions for sub-class, sub-property, domain and range</caption>
<thead>
<tr>
<th>Feature</th>
<th>Definition</th>
<th>Condition</th>
<th>Example</th>
</tr>
</thead>
<tbody>
<tr>
<td>Subclass</td>
<td><span class="gnode">\(c\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">subc. of</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(d\)</span></td>
<td><span class="gnode">\(x\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">type</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(c\)</span> implies <span class="gnode">\(x\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">type</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(d\)</span></td>
<td><span class="gnode">City</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">subc. of</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">Place</span></td>
</tr>
<tr>
<td>Subproperty</td>
<td><span class="gnode">\(p\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">subp. of</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(q\)</span></td>
<td><span class="gnode">\(x\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">\(p\)</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(y\)</span> implies <span class="gnode">\(x\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">\(q\)</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(y\)</span></td>
<td><span class="gnode">venue</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">subp. of</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">location</span></td>
</tr>
<tr>
<td>Domain</td>
<td><span class="gnode">\(p\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">domain</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(c\)</span></td>
<td><span class="gnode">\(x\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">\(p\)</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(y\)</span> implies <span class="gnode">\(x\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">type</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(c\)</span></td>
<td><span class="gnode">venue</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">domain</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">Event</span></td>
</tr>
<tr>
<td>Range</td>
<td><span class="gnode">\(p\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">range</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(c\)</span></td>
<td><span class="gnode">\(x\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">\(p\)</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(y\)</span> implies <span class="gnode">\(y\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">type</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(c\)</span></td>
<td><span class="gnode">venue</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">range</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">Venue</span></td>
</tr>
</tbody>
</table>
<figure id="fig-sg">
<img src="images/fig-sg.svg" alt="Example schema graph describing sub-classes, sub-properties, domains, and ranges"/>
<figcaption>Example schema with sub-classes, sub-properties, domains, and ranges</figcaption>
</figure>
<p>Semantic schemata are typically defined for incomplete graph data, where the absence of an edge between two nodes, such as <span class="gnode">Viña del Mar</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">flight</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">Arica</span>, does not mean that the relation does not hold in the real world. Therefore, from the graph of Figure <a href="#fig-delg">2.1</a>, we cannot assume that there is no flight between Viña del Mar and Arica. In contrast, if the <em>Closed World Assumption</em> (<em>CWA</em>) were adopted – as is the case in many classical database systems – it would be assumed that the data graph is a complete description of the world, thus allowing to assert with certainty that no flight exists between the two cities. Systems that do not adopt the CWA are said to adopt the <em>Open World Assumption</em> (<em>OWA</em>). Considering our running example, it would be unreasonable to assume that the tourism organisation has complete knowledge of everything describable in its knowledge graph, and hence adopting the OWA appears more appropriate. However, it can be inconvenient if a system is unable to definitely answer “<em>yes</em>” or “<em>no</em>” to questions such as “<em>is there a flight between Arica and Viña del Mar?</em>”, especially when the organisation is certain that it has complete knowledge of the flights. A compromise between OWA and CWA is the <em>Local Closed World Assumption</em> (<em>LCWA</em>), where portions of the data graph are assumed to be complete.</p>
<h4 id="sssec-validating-schema" class="subsection">Validating schema</h4>
<p>When graphs are used to represent diverse, incomplete data at large scale, the OWA is the most appropriate choice for a <em>default</em> semantics. But in some scenarios, we may wish to guarantee that our data graph – or specific parts thereof – are in some sense “complete”. Returning to Figure <a href="#fig-delg">2.1</a>, for example, we may wish to ensure that all events have at least a name, a venue, a start date, and an end date, such that applications using the data – e.g., one that sends event notifications to users – can ensure that they have the minimal information required. Furthermore, we may wish to ensure that the city of an event is <em>stated to be</em> a city (rather than <em>inferring</em> that it is a city). We can define such constraints in a validating schema and validate the data graph with respect to the resulting schema, listing constraint violations (if any). Thus while semantic schemata allow for inferring new graph data, validating schemata allow for validating a given data graph with respect to some constraints.</p>
<p>A standard way to define a validating schema for graphs is using <em>shapes</em> [<a href="#ref-SHACLSpec">Knublauch and Kontokostas, 2017</a>, <a href="#ref-Prudhommeaux2014">Prud'hommeaux et al., 2014</a>, <a href="#ref-Labra2017">Labra Gayo et al., 2018</a>]. A shape <em>targets</em> a set of nodes in a data graph and specifies <em>constraints</em> on those nodes. The shape’s target can be defined in many ways, such as targeting all instances of a class, the domain or range of a property, the result of a query, nodes connected to the target of another shape by a given property, etc. Constraints can then be defined on the targeted nodes, such as to restrict the number or types of values taken on a given property, the shapes that such values must satisfy, etc</p>
<p>A <em>shapes graph</em> is formed from a set of interrelated shapes. Shapes graphs can be depicted as UML-like class diagrams, where Figure <a href="#fig-shapeExample">3.3</a> illustrates an example of a shapes graph based on Figure <a href="#fig-delg">2.1</a>, defining constraints on four interrelated shapes. Each shape – denoted with a box like <span class="shap">Place</span>, <span class="shap">Event</span>, etc. – is associated with a set of constraints. Nodes conform to a shape if and only if they satisfy all constraints defined on the shape. Inside each shape box are placed constraints on the number (e.g., <code>[1..*]</code> denotes one-to-many, <code>[1..1]</code> denotes precisely one, etc.) and types (e.g., <code>string</code>, <code>dateTime</code>, etc.) of nodes that conforming nodes can relate to with a property (e.g., <span class="gelab">name</span>, <span class="gelab">start</span>, etc.). Another option is to place constraints on the number of nodes conforming to a particular shape that the conforming node can relate to with a property (thus generating edges between shapes); for example, <span class="shap">Event</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="stack"><span class="edge">venue</span><br/><span class="edge">1..*</span></span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="shap">Venue</span> denotes that conforming nodes for <span class="shap">Event</span> must relate to at least one node with the property <span class="gelab">venue</span> that conforms to the <span class="shap">Venue</span> shape. Shapes can inherit the constraints of parent shapes – with inheritance denoted with an \(\triangle\) connector – as in the case of <span class="shap">City</span> and <span class="shap">Venue</span>, whose conforming nodes must also conform to the <span class="shap">Place</span> shape.</p>
<figure id="fig-shapeExample">
<img src="images/fig-shapeExample.svg" alt="Example shapes graph depicted as a UML-like diagram"/>
<figcaption>Example shapes graph depicted as a UML-like diagram</figcaption>
</figure>
<p>Given a shape and a targeted node, it is possible to check if the node conforms to that shape or not, which may require checking conformance of other nodes; for example, the node <span class="gnode">EID15</span> conforms to the <span class="shap">Event</span> shape not only based on its local properties, but also based on conformance of <span class="gnode">Santa Lucía</span> to <span class="shap">Venue</span> and <span class="gnode">Santiago</span> to <span class="shap">City</span>. Conformance dependencies may also be recursive, where the conformance of <span class="gnode">Santiago</span> to <span class="shap">City</span> requires that it conforms to <span class="shap">Place</span>, which requires that <span class="gnode">Viña del Mar</span> and <span class="gnode">Arica</span> conform to <span class="shap">Place</span>, and so on. Conversely, <span class="gnode">EID16</span> does not conform to <span class="gnode">Event</span>, as it does not have the <span class="gelab">start</span> and <span class="gelab">end</span> properties required by the example shapes graph.</p>
<p>When declaring shapes, the data modeller may not know in advance the entire set of properties that some nodes can have (now or in the future). An <em>open shape</em> allows the node to have additional properties not specified by the shape, while a <em>closed shape</em> does not. For example, if we add the edge <span class="gnode">Santiago</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">founder</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">Pedro de Valdivia</span> to the graph represented in Figure <a href="#fig-delg">2.1</a>, then <span class="gnode">Santiago</span> only conforms to the <span class="shap">City</span> shape if the shape is defined as open (since the shape does not mention <span class="gelab">founder</span>).</p>
<p>Practical languages for shapes often support additional Boolean features, such as conjunction (<em class="sc">and</em>), disjunction (<em class="sc">or</em>), and negation (<em class="sc">not</em>) of shapes; for example, we may say that all the values of <span class="gelab">venue</span> should conform to the shape <span class="shap"><span class="sc">Venue</span> <em>and</em> (<em>not</em> <span class="sc">City</span>)</span>, making explicit that venues in the data graph should not be directly given as cities. However, shapes languages that freely combine recursion and negation may lead to semantic problems, depending on how their semantics are defined. To illustrate, consider the following case inspired by the barber paradox [<a href="#ref-Labra2017">Labra Gayo et al., 2018</a>], involving a shape <span class="shap">Barber</span> whose conforming nodes <span class="gelab">shave</span> at least one node conforming to <span class="shap"><span class="sc">Person</span> <em>and</em> (<em>not</em> <span class="sc">Barber</span>)</span>. Now, given (only) <span class="gnode">Bob</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">shave</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">Bob</span> with <span class="gnode">Bob</span> conforming to <span class="shap">Person</span>, does <span class="gnode">Bob</span> conform to <span class="shap">Barber</span>? If <em>yes</em> – if <span class="gnode">Bob</span> conforms to <span class="shap">Barber</span> – then <span class="gnode">Bob</span> violates the constraint by not shaving at least one node conforming to <span class="shap"><span class="sc">Person</span> <em>and</em> (<em>not</em> <span class="sc">Barber</span>)</span>. If <em>no</em> – if <span class="gnode">Bob</span> does not conform to <span class="shap">Barber</span> – then <span class="gnode">Bob</span> satisfies the <span class="shap">Barber</span> constraint by shaving such a node. Semantics to avoid such paradoxical situations have been proposed based on stratification [<a href="#ref-Boneva2017">Boneva et al., 2017</a>], partial assignments [<a href="#ref-Corman2018b">Corman et al., 2018</a>], and stable models [<a href="#ref-Gelfond88">Gelfond and Lifschitz, 1988</a>].</p>
<p>Although validating schemata and semantic schemata serve different purposes, they can complement each other. In particular, a validating schema can take into consideration a semantic schema, such that, for example, validation is applied on the data graph including inferences. Taking the class hierarchy of Figure <a href="#fig-classhier">3.1</a> and the shapes graph of Figure <a href="#fig-shapeExample">3.3</a>, for example, we may define the target of the <span class="shap">Event</span> shape as the nodes that are of type <code>Event</code> (the class). If we first apply inferencing with respect to the class hierarchy of the semantic schema, the <span class="shap">Event</span> shape would now target <span class="gnode">EID15</span> and <span class="gnode">EID16</span>. The presence of a semantic schema may, however, require adapting the validating schema. Taking into account, for example, the aforementioned class hierarchy would require defining a relaxed cardinality on the <span class="gelab">type</span> property. Open shapes may also be preferred in such cases rather than enumerating constraints on all possible properties that may be inferred on a node.</p>
<p>Two shapes languages have recently emerged for RDF graphs: <em>Shape Expressions</em> (<em>ShEx</em>), published as a W3C Community Group Report [<a href="#ref-Prudhommeaux2014">Prud'hommeaux et al., 2014</a>]; and <em>SHACL</em> (<em>Shapes Constraint Language</em>), published as a W3C Recommendation [<a href="#ref-SHACLSpec">Knublauch and Kontokostas, 2017</a>]. These languages support the discussed features (and more) and have been adopted for validating graphs in a number of domains relating to healthcare [<a href="#ref-ThorntonSSGMPW19">Thornton et al., 2019</a>], scientific literature [<a href="#ref-HammondPT17">Hammond et al., 2017</a>], spatial data [<a href="#ref-Car2019">Car et al., 2019</a>], amongst others. More details about ShEx and SHACL can be found in the book by <a href="#ref-Labra2017">Labra Gayo et al. [2018]</a>. A recently proposed language that can be used as a common basis for both ShEx and SHACL reveals their similarities and differences [<a href="#ref-Labra-Gayo2019">Labra Gayo et al., 2019</a>]. A similar notion of schema has been proposed by <a href="#ref-Angles18">Angles [2018]</a> for property graphs.</p>
<div class="formal">
<p>We formally define shapes following the conventions of <a href="#ref-Labra-Gayo2019">Labra Gayo et al. [2019]</a>.</p>
<dl class="definition" id="def-shape">
<dt>Shape</dt>
<dd>A <em>shape</em> \(\phi\) is defined as:
<table>
<tr>
<td>\(\phi\)</td>
<td>::=</td>
<td>\(\top\)</td>
<td>true</td>
</tr>
<tr>
<td> </td>
<td> \( | \) </td>
<td>\(\datatype{N}\)</td>
<td>node belongs to the set of nodes \(N\)</td>
</tr>
<tr>
<td></td>
<td> \( | \) </td>
<td>\(\Psi_{\mathrm{cond}}\)</td>
<td>node satisfies the Boolean condition \(\mathrm{cond}\)</td>
</tr>
<tr>
<td></td>
<td> \( | \) </td>
<td>\(\phi_1 \wedge \phi_2\)</td>
<td>conjunction of shape \(\phi_1\) and shape \(\phi_2\)</td>
</tr>
<tr>
<td></td>
<td> \( | \) </td>
<td>\(\lnot \phi \)</td>
<td>negation of shape \(\phi\)</td>
</tr>
<tr>
<td></td>
<td> \( | \) </td>
<td>\(@s\)</td>
<td>reference to shape with label \(s\)</td>
</tr>
<tr>
<td></td>
<td> \( | \) </td>
<td>\(\qualified{p}{\phi}{min}{max}\)</td>
<td>between \(min\) and \(max\) outward edges (inclusive) with label \(p\) to nodes satisfying shape \(\phi\)</td>
</tr>
</table>
where \(min \in \mathbb{N}_{(0)}\), \(max \in \mathbb{N}_{(0)} \cup \{ * \}\), with “\(*\)” indicating unbounded.</dd>
</dl>
<dl class="definition" id="def-shapes-schema">
<dt>Shapes schema</dt>
<dd>A <em>shapes schema</em> is defined as a tuple \(\Sigma = (\Phi,S,\lambda)\) where \(\Phi\) is a set of shapes, \(S\) is a set of shape labels, and \(\lambda : S \rightarrow \Phi\) is a total function from labels to shapes.</dd>
</dl>
<div class="example">
<p>The shapes schema from Figure <a href="#fig-shapeExample">3.3</a> can be expressed as:</p>
<table>
<tr>
<td><span class="shap">Event</span></td>
<td>\(\mapsto\)</td>
<td>\(\qualifiedL{name}{\datatypeL{string}}{1}{*}\wedge\qualifiedL{start}{\datatypeL{dateTime}}{1}{1}\wedge\qualifiedL{end}{\datatypeL{dateTime}}{1}{1}\)</td>
</tr>
<tr>
<td></td>
<td></td>
<td>\(\qquad\wedge\qualifiedL{type}{\top}{1}{*}\wedge\xrightarrow{venue}\)<span class="shap">Venue</span>\(\{1,*\}\)</td>
</tr>
<tr>
<td><span class="shap">Venue</span></td>
<td>\(\mapsto\)</td>
<td><span class="shap">Place</span>\(\:\wedge\qualifiedL{indoor}{\datatypeL{boolean}}{0}{1}\wedge\xrightarrow{city}\)<span class="shap">City</span>\(\{0,1\}\)</td>
</tr>
<tr>
<td><span class="shap">City</span></td>
<td>\(\mapsto\)</td>
<td><span class="shap">Place</span>\(\:\wedge\qualifiedL{population}{(\datatypeL{int}\wedge \Psi_{>5000})}{0}{1}\)</td>
</tr>
<tr>
<td><span class="shap">Place</span></td>
<td>\(\mapsto\)</td>
<td>\(\qualifiedL{lat}{\datatypeL{float}}{0}{1}\wedge\qualifiedL{long}{\datatypeL{float}}{0}{1}\)</td>
</tr>
<tr>
<td></td>
<td></td>
<td>\(\qquad\wedge\xrightarrow{flight}\)<span class="shap">Place</span>\(\{0,*\}\wedge\xrightarrow{bus}\)<span class="shap">Place</span>\(\{0,*\}\)</td>
</tr>
</table>
<p>For example, <span class="shap">Event</span> is a shape label (an element of \(S\)) that maps to a shape (an element of \(\phi\)). This mapping is defined by \(\lambda\).</p>
</div>
<p>In a shapes schema, shapes may refer to other shapes, giving rise to a graph that is sometimes known as the <em>shapes graph</em> [<a href="#ref-SHACLSpec">Knublauch and Kontokostas, 2017</a>]. Figure <a href="#fig-shapeExample">3.3</a> illustrates a shapes graph of this form.</p>
<p>The semantics of a shape is defined in terms of the evaluation of that shape over each node of a given data graph. The semantics of a shapes schema, in turn, is the result of evaluating each shape of the schema over each node of a given data graph; the result of this evaluation is a <em>shapes map</em>.</p>
<dl class="definition" id="def-shape-map">
<dt>Shapes map</dt>
<dd>Given a directed edge-labelled graph \(G = (V,E,L)\) and a shapes schema \(\Sigma = (\Phi,S,\lambda)\), a <em>shapes map</em> is a (partial) mapping \(\sigma: V \times S \rightarrow \{ 0, 1 \}\).</dd>
</dl>
<p>The shapes map \(\sigma\) is a way of labelling the nodes of \(G\) with the labels of shapes from \(S\). If \(\sigma(v,s) = 1\), then node \(v\) is labelled \(s\) (possibly amongst other labels); otherwise if \(\sigma(v,s) = 0\), then node \(v\) is not labelled \(s\). The precise semantics depends on whether or not \(\sigma\) is a total or partial mapping: whether or not it is defined for every pair in \(V \times S\). Herein we present the semantics for the more straightforward case wherein \(\sigma\) is assumed to be a total shapes map.</p>
<dl class="definition" id="def-shape-evaluation">
<dt>Shape evaluation</dt>
<dd>Given a shapes schema \(\Sigma \coloneqq (\Phi,S,\lambda)\), a directed edge-labelled graph \(G = (V,E,L)\), a node \(v \in V\) and a total shapes map \(\sigma\), the <em>shape evaluation function</em> \(\semantics{\phi}{G}{v}{\sigma} \in \{ 0 , 1 \}\) is defined as follows:
<table>
<tr>
<td>\(\semantics{\top}{G}{v}{\sigma}\)</td>
<td>\(=\)</td>
<td>\(1\)</td>
</tr>
<tr>
<td>\(\semantics{\datatype{N}}{G}{v}{\sigma}\)</td>
<td>\(=\)</td>
<td>\(1\) iff \(v \in N\)</td>
</tr>
<tr>
<td>\(\semantics{\Psi_{\mathrm{cond}}}{G}{v}{\sigma}\)</td>
<td>\(=\)</td>
<td>\(1\) iff \(\mathrm{cond}(v)\) is true</td>
</tr>
<tr>
<td>\(\semantics{\phi_1 \wedge \phi_2}{G}{v}{\sigma}\)</td>
<td>\(=\)</td>
<td>\(\min\{\semantics{\phi_1}{G}{v}{\sigma}, \semantics{\phi_2}{G}{v}{\sigma}\}\)</td>
</tr>
<tr>
<td>\(\semantics{\lnot \phi}{G}{v}{\sigma}\)</td>
<td>\(=\)</td>
<td>\(1 - \semantics{\phi}{G}{v}{\sigma}\)</td>
</tr>
<tr>
<td>\(\semantics{@s}{G}{v}{\sigma}\)</td>
<td>\(=\)</td>
<td>\(1\) iff \(\sigma(v,s) = 1\)</td>
</tr>
<tr>
<td>\(\semantics{\qualified{p}{\phi}{min}{max}}{G}{v}{\sigma}\)</td>
<td>\(=\)</td>
<td>\(1\) iff \(min \leq \lvert \{ (v,p,u)\in E \mid \semantics{\phi}{G}{u}{\sigma}=1 \} \rvert \leq max\)</td>
</tr>
</table>
If \(\semantics{\phi}{G}{v}{\sigma} = 1\), then \(v\) is said to <em>satisfy</em> \(\phi\) in \(G\) under \(\sigma\).</dd>
</dl>
<p>Typically for the purposes of validating a graph with respect to a shapes schema, a <em>target</em> is defined that requires certain nodes to satisfy certain shapes.</p>
<dl class="definition" id="def-shape-target">
<dt>Shapes target</dt>
<dd>Given a directed edge-labelled graph \(G = (V,E,L)\) and a shapes schema \(\Sigma = (\Phi,S,\lambda)\), a <em>shapes target</em> \(T \subseteq V \times S\) is a set of pairs of nodes and shape labelsfrom \(G\) and \(\Sigma\), respectively.</dd>
</dl>
<p>The nodes that a shape targets can be selected a manual selection, based on the type(s) of the nodes, based on the results of a graph query, etc. [<a href="#ref-Corman2018b">Corman et al., 2018</a>, <a href="#ref-Labra-Gayo2019">Labra Gayo et al., 2019</a>].</p>
<p>Lastly, we define the notion of a valid graph under a given shapes schema and target based on the existence of a shapes map satisfying certain conditions.</p>
<dl class="definition" id="def-valid-graph">
<dt>Valid graph</dt>
<dd>Given a shapes schema \(\Sigma = (\Phi,S,\lambda)\), a directed edge-labelled graph \(G = (V,E,L)\), and a shapes target \(T\), we say that <em>\(G\) is valid under \(\Sigma\) and \(T\)</em> if and only if there exists a shapes map \(\sigma\) such that, for all \(s \in S\) and \(v \in V\) it holds that \(\sigma(v,s) = \semantics{\lambda(s)}{G}{v}{\sigma}\), and \((v,s) \in T\) implies \(\sigma(v,s) = 1\).</dd>
</dl>
<div class="example">
<p>Taking the graph \(G\) from Figure <a href="#fig-delg">2.1</a> and the shapes schema \(\Sigma\) from Figure <a href="#fig-shapeExample">3.3</a>, first assume an empty shapes target \(T = \{\}\). If we consider a shapes map where (e.g.) \(\sigma(\)<span class="gnode">EID15</span>, <span class="shap">Event</span>\() = 1\), \(\sigma(\)<span class="gnode">Santa Lucía</span>, <span class="shap">Venue</span>\() = 1\), \(\sigma(\)<span class="gnode">Santa Lucía</span>, <span class="shap">Place</span>\() = 1\), etc., but where \(\sigma(\)<span class="gnode">EID16</span>, <span class="shap">Event</span>\() = 0\) (as it does not have the required values for <span class="gelab">start</span> and <span class="gelab">end</span>), etc., then we see that \(G\) is valid under \(\Sigma\) and \(T\). However, if we were to define a shapes target \(T\) to ensure that the <span class="shap">Event</span> shape targets <span class="gnode">EID15</span> and <span class="gnode">EID16</span> – i.e., to define \(T\) such that \(\{ (\)<span class="gnode">EID15</span>, <span class="shap">Event</span>\(), (\)<span class="gnode">EID16</span>, <span class="shap">Event</span>\() \} \subseteq T\) – then the graph would no longer be valid under \(\Sigma\) and \(T\) since <span class="gnode">EID16</span> does not satisfy <span class="shap">Event</span>.</p>
</div>
<p>The semantics we present here assumes that each node in the graph either satisfies or does not satisfy each shape labelled by the schema. More complex semantics – for example, based on Kleene’s three-valued logic [<a href="#ref-Corman2018b">Corman et al., 2018</a>, <a href="#ref-Labra-Gayo2019">Labra Gayo et al., 2019</a>] – have been proposed that support partial shapes maps, where the satisfaction of some nodes for some shapes can be left as undefined. Shapes languages in practice may support other more advanced forms of constraints, such as counting on paths [<a href="#ref-SHACLSpec">Knublauch and Kontokostas, 2017</a>]. In terms of implementing validation with respect to shapes, work has been done on translating constraints into sets of graph queries, whose results are input to a SAT solver for recursive cases [<a href="#ref-CormanFRS19a">Corman et al., 2019</a>].</p>
</div>
<h4 id="ssec-emergentSchema" class="subsection">Emergent schema</h4>
<p>Both semantic and validating schemata require a domain expert to explicitly specify definitions and constraints. However, a data graph will often exhibit latent structures that can be automatically extracted as an <em>emergent schema</em> [<a href="#ref-PhamPEB15">Pham et al., 2015</a>] (aka <em>graph summary</em> [<a href="#ref-LiuSDK18">Liu et al., 2018</a>, <a href="#ref-CebiricGKKMTZ19">Čebirić et al., 2019</a>, <a href="#ref-SpahiuPPRM16a">Spahiu et al., 2016</a>]).</p>
<p>A framework often used for defining emergent schema is that of <em>quotient graphs</em>, which partition groups of nodes in the data graph according to some equivalence relation while preserving some structural properties of the graph. Taking Figure <a href="#fig-delg">2.1</a>, we can intuitively distinguish different <em>types</em> of nodes based on their context, such as event nodes, which link to venue nodes, which in turn link to city nodes, and so forth. In order to describe the structure of the graph, we could consider six partitions of nodes: <em>event</em>, <em>name</em>, <em>venue</em>, <em>class</em>, <em>date-time</em>, <em>city</em>. In practice, these partitions may be computed based on the class or shape of the node. Merging the nodes of each partition into one node while preserving edges leads to the quotient graph shown in Figure <a href="#fig-emergentSchema">3.4</a>: the nodes of this quotient graph are the partitions of nodes from the data graph and an edge <span class="gnode">\(X\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">\(y\)</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(Z\)</span> is included the quotient graph if and only if there exists \(x \in X\) and \(z \in Z\) such that <span class="gnode">\(x\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">\(y\)</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(z\)</span> is in the original data graph.</p>
<figure id="fig-emergentSchema">
<img src="images/fig-emergentSchema.svg" alt="Example quotient graph simulating the data graph in Figure 1"/>
<figcaption>Example quotient graph simulating the data graph in Figure <a href="#fig-delg">2.1</a></figcaption>
</figure>
<p>There are many ways in which quotient graphs may be defined, depending not only on how nodes are partitioned, but also how the edges are defined. Different quotient graphs may provide different guarantees with respect to the structure they preserve. Formally, we can say that every quotient graph <em>simulates</em> its input graph (based on the <em>simulation relation</em> of set membership between data nodes and quotient nodes), meaning that for all \(x \in X\) with \(x\) an input node and \(X\) a quotient node, if <span class="gnode">\(x\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">\(y\)</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(z\)</span> is an edge in the data graph, then there must exist an edge <span class="gnode">\(X\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">\(y\)</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(Z\)</span> in the quotient graph such that \(z \in Z\); for example, the quotient graph of Figure <a href="#fig-emergentSchema">3.4</a> simulates the data graph of Figure <a href="#fig-delg">2.1</a>. However, this quotient graph seems to suggest (for instance) that <span class="gnode">EID16</span> would have a start and end date in the data graph when this is not the case. A stronger notion of structural preservation is given by <em>bisimilarity</em>, which in this case would further require that if <span class="gnode">\(X\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">\(y\)</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(Z\)</span> is an edge in the quotient graph, then for all \(x \in X\), there must exist a \(z \in Z\) such that <span class="gnode">\(x\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">\(y\)</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(z\)</span> is in the data graph; this is not satisfied by <span class="gnode">EID16</span> in the quotient graph of Figure <a href="#fig-emergentSchema">3.4</a>, which does not have an outgoing edge labelled <span class="gelab">start</span> or <span class="gelab">end</span> in the original data graph. Figure <a href="#fig-emergentSchema2">3.5</a> illustrates a bisimilar version of the quotient graph, splitting the <em>event</em> partition into two nodes reflecting their different outgoing edges. An interesting property of bisimilarity is that it preserves forward-directed paths: given a path expression \(r\) without inverses and two bisimilar graphs, \(r\) will match a path in one graph if and only if it matches a corresponding path in the other bisimilar graph. One can verify, for example, that a path matches <span class="gnode">\(x\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">city\(\cdot\)(flight|bus)*</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(z\)</span> in Figure <a href="#fig-delg">2.1</a> if and only if there is a path matching <span class="gnode">\(X\)</span><img class="tip" src="images/edge-source.png" width="8" alt="arrow source"/><span class="edge">city\(\cdot\)(flight|bus)*</span><img class="tip" src="images/edge-tip.png" width="15" alt="arrow tip rightward"/><span class="gnode">\(Z\)</span> in Figure <a href="#fig-emergentSchema2">3.5</a> such that \(x \in X\) and \(z \in Z\).</p>
<figure id="fig-emergentSchema2">
<img src="images/fig-emergentSchema2.svg" alt="Example quotient graph bisimilar with the data graph in Figure 1"/>
<figcaption>Example quotient graph bisimilar with the data graph in Figure <a href="#fig-delg">2.1</a></figcaption>
</figure>