% This is samplepaper.tex, a sample chapter demonstrating the
% LLNCS macro package for Springer Computer Science proceedings;
% Version 2.20 of 2017/10/04
%
\documentclass[runningheads]{llncs}
%
\usepackage{graphicx}
\usepackage{todonotes}
\usepackage{verbatim}
\usepackage{caption}
\usepackage{hyperref}
\usepackage{tabularx}
\usepackage{adjustbox}
% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.
%
% If you use the hyperref package, please uncomment the following line
% to display URLs in blue roman font according to Springer's eBook style:
% \renewcommand\UrlFont{\color{blue}\rmfamily}
\begin{document}
%
\title{Clustering Knowledge Graphs}
%
%\titlerunning{Abbreviated paper title}
% If the paper title is too long for the running head, you can set
% an abbreviated paper title here
%
\author{Lina Teresa Molinas Comet}
%
\authorrunning{Lina Teresa Molinas Comet.}
% First names are abbreviated in the running head.
% If there are more than two authors, 'et al.' is used.
%
\institute{RWTH Aachen University, Aachen, Germany \\
\email{[email protected]}\\
\url{http://dbis.rwth-aachen.de/cms}}
%
\maketitle % typeset the header of the contribution
%
\begin{abstract}
Nowadays the use of graph-like structures stands out in the context of data representation because it allows the integration of information from multiple sources. Moreover, clustering techniques are applied on top of graphs to group entities by their relevant characteristics and thereby reveal significant information. Consequently, in this paper we present an overview of the most interesting recent techniques for clustering knowledge graphs, where we notice that similarity functions are the preferred means of grouping information. We describe each approach to show its features in terms of applied criteria and areas of use. We also provide a comparative analysis of those approaches, concluding that the CLIP algorithm and the AppGrouper approach can be used in the context of Linked Data for grouping related information.
\keywords{Knowledge Graphs \and Clustering \and Knowledge Bases \and Algorithms}
\end{abstract}
%
%
%
\section{Introduction} \label{introduction}
We are living in the era of big data, meaning that we need to deal with large amounts of data available in different representations. Therefore, in order to get valuable insights it is necessary to extract the right portion of data to make sense of the underlying information \cite{Pedrycz}. However, the extraction process is not an easy task, due to the complexity caused by the different data representations and the underlying semantics that may be lost in the process. One way of dealing with this kind of problem is using a graph-based data representation, which allows the integration of information from multiple sources. In addition, it is important to apply the right procedures to derive worthwhile knowledge. One technique gaining popularity is data clustering, which groups related entities \cite{Pedrycz}. Therefore, in this paper we present an overview of the most recent techniques for clustering knowledge graphs, where we summarize the main characteristics of each algorithm, namely: the motivation and adopted criteria, the required steps, and the relevant findings reported by each paper's authors.
Our contribution in this paper is an analysis comparing those different approaches in terms of required input, the criteria used (ranging from shortest-path computation to the prioritization of strong connections among entities), and the areas of application. We also recap the particular advantages of each algorithm and the still unresolved issues in those approaches, mainly related to scalability and performance when dealing with large ontologies.
The structure of this paper is as follows: first, in section \ref{background} we introduce some relevant concepts related to the topic. Then, in section \ref{general-techniques}, we briefly look at some traditional clustering approaches and the common problems in graph clustering. After that, in section \ref{algorithms}, we examine some of the new techniques and algorithms developed for graph clustering on the Web. Next, in section \ref{analysis} we discuss the different techniques by comparing them and suggest application areas. Finally, we conclude that the CLIP algorithm is suitable for generating clusters of good quality, and that the AppGrouper approach can be extended to be used in the context of Linked Data.
\section{Background} \label{background}
First of all, for a better understanding, in this section we define the main concepts related to the topic under study. After that, we provide a short description of the relation between those concepts.
\subsection{Terminology and definitions} \label{terminology}
\subsubsection{Graphs} \label{graphs}
In the formal definition of Diestel \cite{Diestel}, ``a $graph$ is a pair $G = (V, E)$ of sets satisfying $E \subseteq [V]^2$; thus, the elements of $E$ are 2-element subsets of $V$". He also indicates that ``the elements of $V$ are the $vertices$ (...) of the graph $G$, the elements of $E$ are its $edges$ (...)". In other words, a graph is a set of vertices (nodes) and edges (links) connecting those vertices. The nodes of a graph represent different entities from the real world \cite{Robinson}, while the edges depict the relationships among them.
\subsubsection{Knowledge graphs} \label{knowledge-graphs}
Currently there is no common definition of the term knowledge graphs
(KG). There neither exists, as explained by Ehrlinger and W{\"o}{\ss}
\cite{Ehrlinger}, an exact differentiation between the use of this term and other related terms (e.g. knowledge bases, knowledge vault, and
ontology). What is more, Google has its own implementation of what they
call Knowledge Graph\footnote{Introducing the Knowledge Graph: things, not strings, accessed December 12, 2018, \href{https://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html}
{https://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html}},
a model which considers some semantics and allows the creation of a short summary related to a topic, but which does not cover all
the aspects considered in a KG according to other interpretations of
the term \cite{Ehrlinger}.
In the definition of Paulheim \cite{Paulheim}, ``a knowledge graph
1. mainly describes real world entities and their interrelations, organized in a graph, 2. defines possible classes and relations of entities in a schema, 3. allows for potentially interrelating arbitrary entities with each other, and 4. covers various topical domains". In the context of the Semantic Web, F{\"a}rber et al. \cite{Farber} indicate that a knowledge graph is defined as an RDF graph consisting of a finite set of RDF triples, where every triple consists of a subject, a predicate and an object. Additionally, Ehrlinger and W{\"o}{\ss} \cite{Ehrlinger} propose the following definition: ``a knowledge graph acquires and integrates information into an ontology and applies a reasoner to derive new knowledge". Moreover, they suggest that KG involve the use of a graph-based structure to store data. However, KG focus on the instances rather than on the schema of the represented knowledge \cite{Paulheim}.
\subsubsection{Knowledge-based systems} \label{knowledge-based}
A knowledge-based (KB) system belongs to one of the areas of Artificial Intelligence (AI) \cite{Tripathi}. More specifically, a knowledge base is a database containing facts, rules, and relations \cite{Engelmore}. Additionally, expert systems aim to acquire expert knowledge coming from a human who is a specialist in a particular domain, subsequently making that information available to non-expert users \cite{Tripathi}.
Therefore, using a KB system can lead to benefits in handling knowledge, for instance the quality improvement of tasks related to decision making \cite{Engelmore}.
\subsubsection{Clustering} \label{clustering}
Clustering is a field of study which helps to discover and expose known or unknown clusters in datasets \cite{Han,Mirkin}. It aims to divide a dataset into various clusters relying on the entities' attributes. After the division, the entities in one group are highly similar to each other, while they differ from entities in other groups \cite{Han}. In this sense, clustering is useful for coping with large amounts of data by supporting data analysis \cite{Pedrycz,Mirkin}. Therefore, it is used in many application areas \cite{Pedrycz,Han} and can be seen from many different perspectives (e.g. machine learning, data mining, knowledge discovery, statistics, etc.). Those perspectives can also overlap, as we will see in section \ref{algorithms}.
\subsection{Interrelation of terms} \label{interrelation}
We previously mentioned the benefits of using KB systems \cite{Engelmore}. Likewise, implementing KG is rewarding, as shown by their use in industry. In this context, Pan et al. \cite{Pan} introduce some success cases in which KG are used first to collect and then to provide not only data but also knowledge. As a result, KG help knowledge-based search services to discover and understand information. Moreover, as Tang et al. \cite{Tang} argue, clustering can be used to find related topics for a specific application (e.g. common topics between two people exchanging emails). Additionally, from the Linked Data perspective, this involves linking content with meaning for a better topic understanding \cite{Pan}.
In summary, clustering techniques support KG by associating related content, thereby improving data exploration and knowledge discovery.
\section{State of the art}\label{state-art}
In this section, we present a brief summary of the most well-known traditional techniques and algorithms used for clustering in the context of data mining and knowledge-based systems. After that, we introduce some novel approaches focusing on KG clustering.
\subsection{General techniques for knowledge graph clustering} \label{general-techniques}
In general terms, standard clustering techniques are used in a variety of scenarios and application areas, for example in data analysis and interpretation \cite{Pedrycz}. Algorithms focused on clustering, pattern mining, and classification can be extended to graph representations \cite{Aggarwal}. In this respect, clustering algorithms are classified into a variety of taxonomies \cite{Pedrycz} \cite{Zacharski} \cite{Berkhin}. Yet, regardless of the classification, those algorithms mainly consider the similarity and/or distance among entities \cite{Pedrycz} to form clusters. In this context, there are also many surveys covering the different techniques used for graph clustering, for instance: \cite{Aggarwal}, \cite{Schaeffer}, \cite{Carpineto}.
Regarding graph clustering, Aggarwal et al. \cite{Aggarwal} indicate that clustering algorithms can be grouped into two broad categories: \textit{node clustering}, which clusters a graph based on distance or similarity functions; and \textit{structural clustering}, which groups multiple graphs based on their structural behaviour. From another point of view, Carpineto et al. \cite{Carpineto} remark that clustering of search results\footnote{``Search Results Clustering (SRC) relates to grouping search results by topics, and it is a common approach for showing results in plain ranked lists \cite{Chang}."} can improve the semantic match between users' requests and the results returned by a search engine. Besides, they distinguish three categories of algorithms in the context of SRC: \textit{data-centric}, \textit{description-aware}, and \textit{description-centric} \cite{Carpineto}.
Moreover, Aggarwal et al. \cite{Aggarwal} mention problems of graph-based clustering related to node and graph clustering, such as \textit{pseudo-cliques} (a high probability that an edge exists between any pair of nodes), \textit{shingles} (sub-graphs having a large number of common links), and unmatched graph structures. In this context, Chang et al. \cite{Chang} also refer to some of the most common issues of traditional clustering when applied to search results, namely: clusters' topic labels are not semantically meaningful for users, topics should be coherent within a cluster and distinct from other clusters, cluster sizes should be balanced, clusters should only consider the most common results, etc.
Furthermore, Carpineto et al. \cite{Carpineto} recognize the fundamental differences between applying clustering techniques to documents and applying them to SRC. The main distinctions concern the use of meaningful labels in SRC, whereas traditional algorithms use centroids. Also, the cluster computation is done online in SRC, while in document clustering it is done offline. Another difference is that in SRC the input data is shorter (snippets), unlike traditional clustering, which uses the whole document. In addition, the number of clusters in a traditional algorithm is fixed and clusters are disjoint, while in SRC the number of clusters is variable and overlapping may exist. Hence, in order to use traditional clustering algorithms in the context of graph representations, some changes need to be made to create good clusters (e.g. adjustments in the use of centroids in k-means algorithms, modifications of the similarity functions and of the distance computation between entities) \cite{Aggarwal}.
\subsection{Techniques and algorithms for clustering KG}\label{algorithms}
As we have seen in the previous section, classic clustering techniques should be adapted to achieve better results when clustering KG. Here we present some of the new algorithms and techniques for graph clustering which follow this premise and therefore include changes to traditional algorithms.
\subsubsection{Graph clustering for content aggregation for knowledge bases}\label{content-aggregation}
Schmitz et al. \cite{Schmitz} focus on applying clustering in peer-to-peer networks, more specifically in peer-to-peer personal knowledge management (P2PPKM) systems. In these systems a similarity function is available, and every peer creates and shares its own personal knowledge base. These networks are self-organized, which means that the network uses indices to choose one route instead of another (i.e. each peer stores a view of the content of its neighbors, so it can decide to which peer to steer a query or answer).
As much as P2PPKM systems help in sharing knowledge among peers in the network, Schmitz et al. \cite{Schmitz} identify the lack of expertise or ``semantic self-descriptions" shared among them (i.e. every peer publishes its whole content, generating inefficiencies such as excessive traffic in the network). Hence, they propose a new technique for knowledge clustering in P2PPKM, focusing on the notion of semantic self-descriptions, which consists of sharing only a portion of the total knowledge base. Indeed, each peer shares its domain-specific ontologies (e.g. metadata related to bibliographic information: a BibTeX ontology and a classification scheme).
Moreover, the suggested approach measures the semantic distance between nodes in a graph by using a shortest-path algorithm. In addition, for the extraction of expertise and content aggregation of the knowledge bases, they use bi-section {\textit{k-modes}} clustering, which is an extension of the original {\textit{k-modes}} clustering algorithm\footnote{``As described in \cite{Huang}, the {\textit{k-modes}} technique extends the {\textit{k-means}} algorithm, but instead of using mean values for clustering it uses mode values. Besides, this algorithm calculates dissimilarity among categorical entities, and updates the value of the modes based on frequency, to minimize the clustering cost."}. In particular, in their implementation \cite{Schmitz}, the clustering first generates one cluster consisting of all nodes in the graph. Then, it divides the cluster with the largest variance using 2-modes. The algorithm repeats these steps recursively until {\textit{k}} clusters are reached. In this respect, the value of {\textit{k}} is determined by a ``silhouette coefficient"\footnote{``The silhouette coefficient is an indicator for the quality of the clustering. It determines how well clusters are separated in terms of the distances of each item to the nearest and the second nearest centroid \cite{Schmitz}."}.
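To make the recursive splitting procedure more concrete, the following Python sketch outlines a simplified bi-section \textit{k-modes} loop over categorical feature vectors. It only illustrates the general idea and is not the implementation of \cite{Schmitz}: the \texttt{two\_modes} and \texttt{spread} helpers are our own simplifications, and the silhouette-based choice of \textit{k} is omitted (\textit{k} is passed directly).
\begin{verbatim}
from collections import Counter
import random

def dissimilarity(a, b):
    # Simple matching dissimilarity between categorical vectors.
    return sum(x != y for x, y in zip(a, b))

def mode_of(items):
    # Per-attribute most frequent value over a set of vectors.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*items))

def two_modes(items, iters=10):
    # Split one cluster into two with a small 2-modes pass.
    if len(items) < 2:
        return [items]
    modes = random.sample(items, 2)
    parts = ([], [])
    for _ in range(iters):
        parts = ([], [])
        for x in items:
            d = [dissimilarity(x, m) for m in modes]
            parts[d.index(min(d))].append(x)
        if any(len(p) == 0 for p in parts):
            return [items]          # degenerate split, keep cluster intact
        modes = [mode_of(p) for p in parts]
    return [parts[0], parts[1]]

def spread(cluster):
    # Average dissimilarity to the cluster mode; stands in for the
    # "largest variance" criterion used to pick the next cluster to split.
    m = mode_of(cluster)
    return sum(dissimilarity(x, m) for x in cluster) / len(cluster)

def bisection_k_modes(items, k):
    clusters = [list(items)]        # start with one cluster of all items
    while len(clusters) < k:
        worst = max(clusters, key=spread)
        parts = two_modes(worst)
        if len(parts) < 2:
            break                   # nothing left to split
        clusters.remove(worst)
        clusters.extend(parts)
    return clusters

print(bisection_k_modes(
    [("ml", "graph"), ("ml", "text"), ("db", "graph"), ("db", "sql")], k=2))
\end{verbatim}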
Schmitz et al. \cite{Schmitz} test their proposed technique with a few entities related to data about scientists and the bibliographic information of their publications. In conclusion, their findings indicate that the suggested clustering technique helps to extract the appropriate expertise from P2PPKM systems. Moreover, they find that the size of the expert knowledge shared among peers influences the finding of information (i.e. a larger knowledge base contains more information than a smaller one). Besides, the k-modes clustering technique, by using indices, performs better than other approaches because it needs fewer queries to obtain specific information.
However, it is important to observe that in large ontologies scalability issues may arise due to the shortest-path computation \cite{Schmitz}.
\subsubsection{Structural similarity clustering in KG} \label{structural-similarity}
Elbattah et al. \cite{Elbattah} suggest that it is possible to divide the entities in a KG into group-forming communities. Hence, in the context of a large-scale KG environment, they recommend a technique that focuses on clustering entities which are similar when considering their link-based structure. Regarding the clustering process, they also remark the importance of selecting the right algorithm, because it influences scalability and the quality of the resulting clusters. In their study they select the Louvain clustering algorithm because, due to its optimization-based nature, it requires less computational power than other algorithms. Specifically, this algorithm first assigns each node to an individual community (i.e. the number of communities is equal to the number of nodes). Then, for each node, the change in modularity obtained by moving it into a neighboring community is computed; if the modularity is better in a particular community, the node is moved to that one. Finally, the algorithm creates a new graph whose nodes are the detected communities. Those steps are repeated until the maximum modularity is reached \cite{Elbattah}.
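For reference, the modularity that the Louvain method maximizes is commonly defined, following the standard formulation in the community detection literature (and not specific to \cite{Elbattah}), as
\begin{equation}
Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j),
\end{equation}
where $A_{ij}$ is the weight of the edge between nodes $i$ and $j$, $k_i$ is the sum of the edge weights attached to node $i$, $m$ is the total edge weight in the graph, $c_i$ is the community of node $i$, and $\delta(c_i, c_j)$ is 1 if both nodes belong to the same community and 0 otherwise.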
In order to test their hypothesis and the chosen algorithm, they use a subset of the Freebase knowledge base. As a result, after the clustering, they find that some of the computed clusters are part of larger categories, while others exhibit substantial cross-category structure, meaning that one entity is part of more than one category.
Elbattah et al. \cite{Elbattah} conclude that, taking into account the computed cluster density and the general graph modularity, the resulting clusters are good. Furthermore, they emphasize that the computed clusters can help to get insights and to reveal unknown relationships among entities. In other words, they help to find new interpretations of the data.
\subsubsection{Entity clustering using link features} \label{entity-clustering}
Saeedi et al. \cite{Saeedi} emphasize that problems may arise in large-scale environments when trying to discover links and identify multiple representations referring to the same entity. Therefore, in a previous paper \cite{Peukert}, they suggested merging all the matches of the same entity into a single entity representation of the KG by using a similarity graph in which edges carry a value indicating the degree of similarity \cite{Saeedi}. This graph is built by FAMER, a framework for distributed multi-source entity resolution (e.g. linking schemes, links from the Web of Data) \cite{Peukert}. Then, the clustering schemes\footnote{These schemes are: 1. Connected components, 2. Center clustering, 3. Merge Center, 4. Star clustering (two variations), 5. Correlation clustering. \cite{Peukert}} use the resulting graph to resolve entity matching and to create clusters with highly similar entities within them. Although the proposed technique seems valid, Saeedi et al. discovered some issues in the resulting clusters (e.g. overlapping clusters, and source-inconsistent clusters containing more than one entity from the same source), and therefore they propose the CLIP algorithm, which prioritizes strong links and ignores weak links\footnote{``A link is strong if it is the one with highest similarity from both sides in an entity. On the other hand, a link is weak if it is the least similar for any of the two sides."\cite{Saeedi}} when clustering entities. The input of the CLIP algorithm is a similarity graph; as a first step, the algorithm determines the strength of all links in that graph. Then, it identifies complete clusters considering only entities with strong links among them. In the following step it also includes normal links\footnote{``A link is considered to be normal if it is the one with highest similarity for only one of the two sides."\cite{Saeedi}}. This process is done iteratively and in parallel, and the output is the resulting cluster set including all the source-consistent clusters.
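As a rough illustration of the link classification that CLIP builds on, the following Python sketch labels the links of a similarity graph as strong, normal, or weak depending only on whether a link carries the highest similarity seen at each of its two endpoints. This is a simplified reading of the description above, not the FAMER/CLIP implementation of \cite{Saeedi} (which also takes the data sources of the entities into account).
\begin{verbatim}
def classify_links(edges):
    # edges: dict mapping (a, b) entity pairs to a similarity value
    best = {}                              # highest similarity per entity
    for (a, b), sim in edges.items():
        best[a] = max(best.get(a, 0.0), sim)
        best[b] = max(best.get(b, 0.0), sim)
    labels = {}
    for (a, b), sim in edges.items():
        ends_at_max = (sim == best[a]) + (sim == best[b])
        labels[(a, b)] = {2: "strong", 1: "normal", 0: "weak"}[ends_at_max]
    return labels

print(classify_links({("x1", "y1"): 0.9,
                      ("x1", "y2"): 0.4,
                      ("x2", "y2"): 0.7}))
\end{verbatim}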
The authors also propose RLIP for repairing clusters generated by different clustering schemes. In its first step, the RLIP algorithm also prioritizes strong links among entities within a cluster, and thereby resolves overlapping clusters by allocating an overlapped entity to a single cluster. Here, when overlapped entities have strong links to multiple clusters, the algorithm considers the highest association degree\footnote{``The association degree of an entity for a cluster of size $k$ corresponds to the average similarity of the entity to the $k-1$ other entities in the cluster."\cite{Saeedi}} to select a cluster. Besides, overlapped entities with no strong links, and entities with strong links only to other overlapped entities, are considered singletons \cite{Saeedi}. If necessary, it uses the CLIP algorithm to fix source-inconsistent clusters.
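The association degree used to resolve such overlaps can be sketched in the same spirit; the \texttt{sim} function below is an assumed pairwise similarity and the candidate clusters are assumed to contain at least one other member, so this is an illustration rather than the implementation of \cite{Saeedi}.
\begin{verbatim}
def association_degree(entity, cluster, sim):
    # Average similarity of an entity to the other members of a cluster.
    others = [e for e in cluster if e != entity]
    return sum(sim(entity, e) for e in others) / len(others)

def assign_overlapped(entity, candidate_clusters, sim):
    # Place an overlapped entity into the candidate cluster to which it
    # has the highest association degree.
    return max(candidate_clusters,
               key=lambda c: association_degree(entity, c, sim))

def toy_sim(a, b):
    # Toy similarity: entities sharing their first character are similar.
    return 1.0 if a[0] == b[0] else 0.2

print(assign_overlapped("a1", [["a1", "a2"], ["a1", "b1"]], toy_sim))
\end{verbatim}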
As part of their evaluation, Saeedi et al. \cite{Saeedi} point out that CLIP performs better in precision and F-measure than the other clustering schemes they studied. Also, they indicate that the cluster quality is very good because the CLIP algorithm does not consider weak links. Additionally, they remark that the RLIP algorithm can improve the F-measure for all the studied clustering schemes, and that both algorithms support scalability on large datasets through their parallel implementation.
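For reference, the F-measure reported there is the usual harmonic mean of precision and recall,
\begin{equation}
F = \frac{2 \cdot \mathit{precision} \cdot \mathit{recall}}{\mathit{precision} + \mathit{recall}},
\end{equation}
so a high value requires the produced clusters to be both accurate and complete.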
\subsubsection{Interactive knowledge-graph-based clustering tool for App search results} \label{app}
Chang et al. \cite{Chang} point out some drawbacks of Search Results Clustering (SRC), such as finding the most suitable cluster or determining the best granularity level for clustering. As a consequence, they propose a new clustering technique (AppGrouper), based on previous work by Scaiella et al. \cite{Scaiella}, which incorporates Machine Learning (ML) and KG concepts. The technique works on a query and its associated search results in the Google Play Store.
The main steps of the clustering process in this algorithm, as explained in \cite{Chang}, are: 1. \textit{topic label extraction and topic label graph construction}, where the algorithm maps topic labels from the search results to concepts in knowledge bases (e.g. DBpedia) and then, by using ML techniques, constructs the topic label graph based on the similarity between nodes; 2. \textit{topic label clustering}, where Hierarchical Agglomerative Clustering (HAC) is applied to the topic label graph with a parameter indicating the minimum-similarity threshold for merging two nodes into the same cluster (see the sketch after this paragraph); 3. \textit{cluster filtering, ranking and assignment}, where the algorithm filters and then ranks the clusters by computing the quality of each cluster. Additionally, the algorithm generates a cluster which includes all the apps that are not included in the other clusters. Furthermore, the algorithm optimizes the clustering by maintaining coherence within each cluster and separation between clusters, and by keeping the clusters similar in size. Besides, AppGrouper also includes an interactive user interface in which users can guide the clustering process at any stage by cleansing the input to the algorithm, by guiding the algorithm towards more detailed clusters, and by editing clusters and topic labels \cite{Chang}.
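As an illustration of the HAC step, the following Python sketch clusters a toy topic-label similarity matrix with SciPy, cutting the dendrogram at a minimum-similarity threshold; the labels and similarity values are invented for the example, and this is not the AppGrouper implementation of \cite{Chang}.
\begin{verbatim}
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

labels = ["puzzle", "brain teaser", "sudoku", "fitness", "workout"]
similarity = np.array([
    [1.0, 0.8, 0.7, 0.1, 0.1],
    [0.8, 1.0, 0.6, 0.1, 0.2],
    [0.7, 0.6, 1.0, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.9],
    [0.1, 0.2, 0.1, 0.9, 1.0],
])

min_similarity = 0.5                  # threshold for merging two nodes
distance = 1.0 - similarity           # HAC operates on distances
np.fill_diagonal(distance, 0.0)
Z = linkage(squareform(distance), method="average")
assignment = fcluster(Z, t=1.0 - min_similarity, criterion="distance")

for label, cluster_id in zip(labels, assignment):
    print(cluster_id, label)
\end{verbatim}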
Chang et al. \cite{Chang} conclude that AppGrouper has advantages compared with other approaches. More specifically, it produces meaningful topic labels describing the clusters, and it improves both the quality of the resulting clusters and the computational efficiency by allowing dynamic clustering.
\section{Analysis and discussion} \label{analysis}
In this section we compare and contrast the presented approaches. We also summarize the advantages and current issues of the different techniques.
In Table \ref{table1} we show the essence of each technique. As we can observe, the input values differ for each algorithm, except for the Louvain and RLIP algorithms. Moreover, we notice that even though the criteria of the algorithms are not the same, they are related, because all of them prefer to cluster similar entities into the same group, considering shortest paths, modularity, or strong links among those entities.
\captionof{table}{Comparison of Algorithms for Knowledge Graph Clustering} \label{table1}
\begin{tabular}{|p{2cm}|p{2.3cm}|p{3.5cm}|p{3.5cm}|p{3.5cm}|}
\hline
\textbf{Algorithm} & \textbf{Input} & \textbf{Criteria} & \textbf{Output} & \textbf{Application}\\
\hline
Bi-section k-modes (\ref{content-aggregation}) & Domain specific ontologies & Semantic distance (shortest path) and centroids computation & Self-descriptions of knowledge bases from each peer in a network & Content/expertise extraction \\
\hline
Louvain (\ref{structural-similarity}) & Set of nodes & Maximal modularity & A new graph where nodes are the detected communities & Data exploration and discovery\\
\hline
CLIP (\ref{entity-clustering}) & Similarity graph & Strong links prioritization & Cluster set with matching entities & Creation of high quality, source-consistent and overlap-free entity clusters \\
\hline
RLIP (\ref{entity-clustering}) & Set of entity clusters & Strong links prioritization & Repaired cluster set & Overlapping and source-inconsistent clusters resolution \\
\hline
Knowledge-graph-based clustering (\ref{app}) & Semantic graph of topic labels & Hierarchical Agglomerative Clustering (HAC) with a minimum-similarity threshold & Ranked topic clusters & Clustering search results in the Google Play Store \\
\hline
\end{tabular}
\smallskip In Table \ref{table2} we sum up the main advantages and the detected issues mentioned by the authors of the introduced papers, as well as those noticed during this analysis. Here we can observe that the main issue of the algorithms is still related to scalability, as performance is reduced when large KG are involved. On the other hand, the evident benefit of those algorithms is the quality improvement of the generated clusters, because the optimization processes focus on enhancing the computation of similarity properties. Furthermore, the target application areas of the various approaches are diverse. While the first approach focuses on expertise extraction in a P2P network, the others concentrate on producing high-quality clusters that can be used to explore data and to identify connections among entities in the same cluster. Besides, as expressed by its authors, the AppGrouper technique can also be extended to resolve other SRC problems, for example topic modeling and knowledge extraction.
In conclusion, we consider that the most promising techniques for KG clustering are the CLIP algorithm and AppGrouper. The first one stands out because of its favorable clustering quality and its cluster repair handling via the RLIP component. Therefore, it can be used for knowledge discovery by identifying not only explicit but also implicit connections among the well-structured clusters, especially when taking into account the semantic relationship between terms. Moreover, if the CLIP algorithm is combined with the bi-section k-modes algorithm (\ref{content-aggregation}), the results could be even better in terms of efficiency and runtime performance. The second one, even though it is a specific implementation for the Google Play Store, could be extended for finding knowledge in the ranked clusters as well. In this respect, by taking advantage of AppGrouper's user interface and its use of semantic labels, it would be possible to obtain the most relevant query results and their connections (in the context of Linked Data) when using a search engine.
\captionof{table}{Advantages vs Issues in the proposed algorithms} \label{table2}
\begin{tabular}{|p{3cm}|p{8.5cm}|p{3.5cm}|}
\hline
\textbf{Algorithm} & \textbf{Advantages} & \textbf{Issues}\\
\hline
Bi-section k-modes (\ref{content-aggregation}) & - The amount of retrieved information depends on the size of the shared expert knowledge base; - The use of indices allows better performance because fewer queries are required & Scalability problems in large KG \\
\hline
Louvain (\ref{structural-similarity}) & - Less computational power is required; - Good quality of the resulting clusters; - Cross-category clusters implying implicit connections among entities & Performance issues may arise when applying it to large KG \\
\hline
CLIP (\ref{entity-clustering}) & - Good quality in resulting clusters; - Parallel execution; - Overlap-free and source-consistent clusters; - Better precision and accuracy performance & Slow execution time when iteratively processing source-inconsistent components \\
\hline
RLIP (\ref{entity-clustering}) & - Cluster repair can be performed on clusters generated by the FAMER framework or by other clustering techniques; - Parallel execution; - Effective cluster repair by improving accuracy performance & Slow execution time due to the use of the CLIP component\\
\hline
Knowledge-graph-based clustering (\ref{app}) & - Supports meta-data information; - Clusters' ranking computation; - Includes semantic labels; - Allows dynamic clustering by interactive user interface & Performance may be an issue when working with large KG \\
\hline
\end{tabular}
\section{Conclusion} \label{conclusion}
In this paper we introduced a general summary of the most interesting techniques used for data clustering and the most frequent problems when applying conventional clustering techniques to KG. Moreover, we mentioned that traditional algorithms need changes to their computation processes to achieve better clustering on graphs. After that, we presented some of the new techniques for KG clustering by explaining the clustering process tackled by each of them.
Then, in our analysis and evaluation of the presented KG clustering algorithms, we highlighted that the main benefit of using them is achieving better data classification by grouping similar entities. However, the still unresolved issue when clustering large-scale KG concerns performance and scalability.
In summary, the presented clustering methods focus on generating good clusters to help data exploration and content extraction, as well as on repairing ambiguous clusters by optimizing similarity computations. Furthermore, based on our analysis, we conclude that the CLIP algorithm is the most suitable for generating good-quality clusters, and it is also a promising approach as it repairs inaccurate clusters through its RLIP component. Besides, we consider that extending the AppGrouper approach would make it possible to use it in the context of information and knowledge discovery, more specifically when using search engines. Nevertheless, we acknowledge that more research should be done in this area in order to reach real expertise clustering and to effectively uncover semantic connections with regard to Linked Data.
%
% ---- Bibliography ----
%
% BibTeX users should specify bibliography style 'splncs04'.
% References will then be sorted and formatted in the correct style.
%
\bibliographystyle{splncs04}
\bibliography{mybibliography}
%
\begin{thebibliography}{8}
\bibitem{Pedrycz}
Pedrycz, W.: Knowledge-Based Clustering: From Data to Information Granules. 2nd edn. Wiley-Interscience, New York, NY, USA (2005)
\bibitem{Diestel}
Diestel, R.: Graph Theory. 4th edn. Springer, New York, NY, USA (2012)
\bibitem{Robinson}
Robinson, I., Webber, J., Eifrem, E.: Graph Databases. 2nd edn. O'Reilly Media, Inc., Sebastopol, CA (2015)
\bibitem{Ehrlinger}
Ehrlinger, L, W{\"o}{\ss}, W.: Towards a Definition of Knowledge Graphs. In: Martin, M., Cuquet M., Folmer, E. (eds.) In Joint Proceedings of the Posters and Demos Track of the 12th International Conference on Semantic Systems - SEMANTiCS2016 and the 1st International Workshop on Semantic Change \& Evolving Semantics (SuCCESS'16), CEUR-WS, vol. 1695, Leipzig, Germany (2016). \doi{10.10007/1234567890}
\bibitem{Paulheim}
Paulheim, H.: Knowledge Graph Refinement: A Survey of Approaches and Evaluation Methods. Semantic Web Journal, 489--508 (2017). \doi{10.3233/SW-160218}
\bibitem{Farber}
F{\"a}rber, M., Bartscherer, F, Menne,C., Rettinger, A.: Linked data quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web Journal, 77--129 (2018). \doi{10.3233/SW-170275}
\bibitem{Tripathi}
Tripathi, K.: A Review on Knowledge-based Expert System: Concept and Architecture. IJCA Special Issue on Artificial Intelligence Techniques-Novel Approaches \& Practical Applications (2011). \doi{10.5120/2845-226}
\bibitem{Engelmore}
Engelmore, R.S.: Artificial Intelligence and Knowledge Based Systems: Origins, Methods and Opportunities for NDE. In: Thompson, D.O., Chimenti, D.E. (eds.) Review of Progress in Quantitative Nondestructive Evaluation, vol. 6A,
Springer, Boston, MA (1987). \doi{10.1007/978-1-4613-1893-4\_1}
\bibitem{Han}
Han, J., Kamber, M., Pei, J.: Data Mining: Concepts and Techniques. 3rd edn. Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA (2011)
\bibitem{Mirkin}
Mirkin, B.: Clustering For Data Mining: A Data Recovery Approach. 2nd edn. Chapman \& Hall/CRC,
Boca Raton, FL, USA (2005)
\bibitem{Pan}
Pan, J., Vetere, G., Gomez-Perez, J., Wu, H.: Exploiting Linked Data and Knowledge Graphs in Large Organisations. 1st edn. Springer International Publishing, Switzerland (2017)
\bibitem{Tang}
Tang, G., Pei, J., Luk, W.: Email Mining: Tasks, Common Techniques, and Tools. Knowl. Inf. Syst., 1--31 (2014). \doi{10.1007/s10115-013-0658-2}
\bibitem{Aggarwal}
Aggarwal, C., Wang, H.: A Survey of Clustering Algorithms for Graph Data. In: Aggarwal, C., Wang, H. (eds.) Managing and Mining Graph Data, pp. 275--301.
Springer US, Boston, MA, USA (2010). \doi{10.1007/978-1-4419-6045-0\_9}
\bibitem{Zacharski}
Zacharski, R.: A Programmer's Guide to Data Mining. http://guidetodatamining.com/ (2012)
\bibitem{Berkhin}
Berkhin, P.: A Survey of Clustering Data Mining Techniques. In: Kogan, J., Nicholas, C., Teboulle, M. (eds.) Grouping Multidimensional Data: Recent Advances in Clustering 2006, pp. 25--71.
Springer Berlin Heidelberg, Berlin, Heidelberg (2006). \doi{10.1007/3-540-28349-8\_2}
\bibitem{Schaeffer}
Schaeffer, S.: Survey: Graph Clustering. Comput. Sci. Rev., 27--64 (2007). \doi{10.1016/j.cosrev.2007.05.001}
\bibitem{Carpineto}
Carpineto, C., Osi\'{n}ski, S., Romano, G., Weiss, D.: A Survey of Web Clustering Engines. ACM Comput. Surv., 17:1--17:38 (2009)
\bibitem{Chang}
Chang, S., Dai, P., Hong, L., Sheng, C., Zhang, T., Chi, E.: AppGrouper: Knowledge-based Interactive Clustering Tool for App Search Results. In: Proceedings of the 21st International Conference on Intelligent User Interfaces, pp. 348--358. ACM, New York, NY, USA (2016) \doi{10.1145/2856767.2856783}
\bibitem{Schmitz}
Schmitz, C., Hotho, A., J{\"a}schke, R., Stumme, G.: Content Aggregation on Knowledge Bases Using Graph Clustering. In: Sure, Y., Domingue, J. (eds.) The Semantic Web: Research and Applications. ESWC 2006, LNCS, vol. 4011, pp. 530--544.
Springer, Berlin, Heidelberg (2006). \doi{10.1007/11762256\_39}
\bibitem{Elbattah}
Elbattah, M., Roushdy, M., Aref, M., Salem, A.M.: Large-Scale Entity Clustering Based on Structural Similarity within Knowledge Graphs. In: Arun, K., Somani, G. (eds.) Big Data Analytics: Tools and Technology for Effective Planning, 1st edn., chapter 14, pp. 311--334. CRC Press (2017). \doi{10.1201/b21822-14}
\bibitem{Saeedi}
Saeedi, A., Peukert, E., Rahm, E.: Using Link Features for Entity Clustering in Knowledge Graphs. In: Gangemi, A., Navigli, R., Vidal, M., Hitzler, P., Troncy, R., Hollink, L., Tordai, A., Alam, M. (eds.) The Semantic Web. ESWC 2018, LNCS, vol. 10843, pp. 576--592.
Springer International Publishing, Cham (2018). \doi{10.1007/978-3-319-93417-4\_37}
\bibitem{Peukert}
Saeedi, A., Peukert, E., Rahm, E.: Comparative Evaluation of Distributed Clustering Schemes for Multi-source Entity Resolution. In: Kirikova, M., N{\o}rv{\aa}g, K., Papadopoulos, G. (eds.) ADBIS 2017, LNCS, vol. 10509, pp. 278--293. Springer International Publishing, Cham (2017). \doi{10.1007/978-3-319-66917-5\_19}
\bibitem{Scaiella}
Scaiella, U., Ferragina, P., Marino, A., Ciaramita, M.: Topical Clustering of Search Results. In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, pp. 223--232. ACM, Seattle, Washington, USA (2012) \doi{10.1145/2124295.2124324}
\bibitem{Huang}
Huang, Z.: Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery \textbf{2}(3), 283--304 (1998). \doi{10.1023/A:1009769707641}
\end{thebibliography}
\end{document}