# -*- coding: utf-8 -*-
"""bsca
Automatically generated by Colab.
Original file is located at
https://colab.research.google.com/drive/1sP__99kgLd-gS9tVLCgZCssSHby2L64m
# Basic Sequential Clustering Algorithm (BSCA) Analysis
Cluster analysis groups similar objects together and separates them from dissimilar objects. Clustering is an unsupervised learning technique: it is applied to data that has inputs but no target value [1]. It is used to identify subgroups within the data. Cluster formation depends on different factors, such as distance, graphs, and density of the data points [1]. There are various types of clustering methods, such as hierarchical, partitioning, distribution-based, fuzzy, and model-based methods. This study uses the basic sequential clustering algorithm to perform cluster analysis on two different datasets.
## Explanation of Datasets
For this study, two datasets are used. Principal component analysis (PCA) has already been performed in Python on the Breast Cancer dataset, reducing it to 2-D data. Linear discriminant analysis (LDA) has already been performed in Python on the Iris dataset, reducing it to 1-D data.
## Methods and Design
Basic sequential clustering, also known as the basic sequential algorithmic scheme (BSAS), is a simple method for producing clusters. In BSAS, each data vector is considered only once. The number of clusters is not known a priori; only the maximum number of clusters allowed is fixed before analysis. Each vector is assigned to an existing or new cluster, depending on its distance from the existing clusters. BSAS has two mandatory parameters: the maximum number of clusters (M) and the threshold of dissimilarity (a), the largest distance at which a point is still assigned to an existing cluster. In the code below, that distance is the absolute difference between the point and the mean of each cluster, computed on a single column (PCA1 or LDA1). As with other clustering methods, there is no single correct number of clusters, and different methods can be employed to help choose one. Since principal component analysis and linear discriminant analysis were already applied to these datasets, the results from that analysis are used to help decide the number of clusters and the values for the threshold of dissimilarity. BSCA follows a simple process of defining the number of clusters and threshold of dissimilarity and then assigning points to clusters, as shown in the flowchart below. Three different combinations of maximum number of clusters and threshold of dissimilarity were used for each dataset.
![cluster flowchart.png]()
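The assignment rule described above can be sketched as a small standalone function. This is a minimal illustration, assuming 1-D input and using the cluster mean as the representative, as the code in this notebook does; the name `bsas_1d` and its parameter names are hypothetical and are not used elsewhere in the notebook.

```python
from statistics import mean

def bsas_1d(values, max_clusters, threshold):
    # Hypothetical helper: assign each value, in order, to a cluster.
    # Returns a list of clusters, each a list of values.
    values = list(values)
    if not values:
        return []
    clusters = [[values[0]]]  # the first point seeds cluster 1
    for x in values[1:]:
        # distance from x to the mean of every cluster opened so far
        distances = [abs(x - mean(c)) for c in clusters]
        nearest = min(range(len(clusters)), key=distances.__getitem__)
        if distances[nearest] > threshold and len(clusters) < max_clusters:
            clusters.append([x])  # too far from every cluster: open a new one
        else:
            clusters[nearest].append(x)  # otherwise join the nearest cluster
    return clusters

demo = bsas_1d([0.0, 0.2, 5.0, 5.1, 9.8, 0.1], max_clusters=3, threshold=1.0)
print(demo)
```

Because each point is visited once and cluster means shift as points are added, the result can depend on the order in which the points are presented.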
Clustering on PCA Breast Cancer dataset
"""
import pandas as pd
import numpy as np
from statistics import mean
import plotly.graph_objects as go
import plotly.express as px
#import data set with 2D from PCA
from google.colab import files
uploaded=files.upload()
df=pd.read_csv('PCA_Breast Cancer.csv', header=None, names=['PCA1', 'PCA2'])
#define parameters for max number of clusters, threshold dissimilarity (a), and counter (t)
max_clusters= 4
a= 0.5
t= 1
#Create cluster dictionary
clusters = dict(zip(range(max_clusters), [[] for i in range(max_clusters)]))
#assign first point to cluster 1
clusters[0].append(df.PCA1[0])
#iterate through data points after first point
for row in df.PCA1[1:]:
    currentclusters=t
    #distance from this point to the mean of each cluster opened so far
    distance = {c: abs(row-mean(clusters[c])) for c in range(currentclusters)}
    nearest_cluster=min(distance, key=distance.get)
    closest=distance[nearest_cluster]
    #open a new cluster if the point is farther than a from every cluster and the maximum is not reached
    if t<max_clusters and closest >a:
        print('cluster ' + str(currentclusters+1) + ' is made')
        clusters[currentclusters].append(row)
        t=t+1
    #otherwise assign the point to the cluster with the closest mean
    else:
        clusters[nearest_cluster].append(row)
print(clusters)
#collect every value with its cluster label for plotting
c_df=pd.DataFrame([(i, c) for c in clusters for i in clusters[c]], columns=['Value', 'Cluster'])
fig=px.scatter(x=c_df.Value, color=c_df.Cluster, color_continuous_scale=['red','green','blue'], title="BSCA with PCA Breast Cancer Data alpha=" + str(a) + ", max clusters=" + str(max_clusters))
fig.update_layout(coloraxis_colorbar=dict(
    title="Clusters",
    #one tick per cluster, labelled c1..cM
    tickvals=list(range(max_clusters)),
    ticktext=["c" + str(i+1) for i in range(max_clusters)],
    lenmode="pixels", len=500,
))
fig.show()
#define parameters for max number of clusters, threshold dissimilarity (a), and counter (t)
max_clusters= 5
a= 1.5
t= 1
#Create cluster dictionary
clusters = dict(zip(range(max_clusters), [[] for i in range(max_clusters)]))
#assign first point to cluster 1
clusters[0].append(df.PCA1[0])
#iterate through data points after first point
for row in df.PCA1[1:]:
    currentclusters=t
    #distance from this point to the mean of each cluster opened so far
    distance = {c: abs(row-mean(clusters[c])) for c in range(currentclusters)}
    nearest_cluster=min(distance, key=distance.get)
    closest=distance[nearest_cluster]
    #open a new cluster if the point is farther than a from every cluster and the maximum is not reached
    if t<max_clusters and closest >a:
        print('cluster ' + str(currentclusters+1) + ' is made')
        clusters[currentclusters].append(row)
        t=t+1
    #otherwise assign the point to the cluster with the closest mean
    else:
        clusters[nearest_cluster].append(row)
print(clusters)
#collect every value with its cluster label for plotting
c_df=pd.DataFrame([(i, c) for c in clusters for i in clusters[c]], columns=['Value', 'Cluster'])
fig=px.scatter(x=c_df.Value, color=c_df.Cluster, color_continuous_scale=['red','green','blue'], title="BSCA with PCA Breast Cancer Data alpha=" + str(a) + ", max clusters=" + str(max_clusters))
fig.update_layout(coloraxis_colorbar=dict(
    title="Clusters",
    #one tick per cluster, labelled c1..cM
    tickvals=list(range(max_clusters)),
    ticktext=["c" + str(i+1) for i in range(max_clusters)],
    lenmode="pixels", len=500,
))
fig.show()
#define parameters for max number of clusters, threshold dissimilarity (a), and counter (t)
max_clusters= 6
a= 0.8
t= 1
#Create cluster dictionary
clusters = dict(zip(range(max_clusters), [[] for i in range(max_clusters)]))
#assign first point to cluster 1
clusters[0].append(df.PCA1[0])
#iterate through data points after first point
for row in df.PCA1[1:]:
    currentclusters=t
    #distance from this point to the mean of each cluster opened so far
    distance = {c: abs(row-mean(clusters[c])) for c in range(currentclusters)}
    nearest_cluster=min(distance, key=distance.get)
    closest=distance[nearest_cluster]
    #open a new cluster if the point is farther than a from every cluster and the maximum is not reached
    if t<max_clusters and closest >a:
        print('cluster ' + str(currentclusters+1) + ' is made')
        clusters[currentclusters].append(row)
        t=t+1
    #otherwise assign the point to the cluster with the closest mean
    else:
        clusters[nearest_cluster].append(row)
print(clusters)
#collect every value with its cluster label for plotting
c_df=pd.DataFrame([(i, c) for c in clusters for i in clusters[c]], columns=['Value', 'Cluster'])
fig=px.scatter(x=c_df.Value, color=c_df.Cluster, color_continuous_scale=['red','green','blue'], title="BSCA with PCA Breast Cancer Data alpha=" + str(a) + ", max clusters=" + str(max_clusters))
fig.update_layout(coloraxis_colorbar=dict(
    title="Clusters",
    #one tick per cluster, labelled c1..cM
    tickvals=list(range(max_clusters)),
    ticktext=["c" + str(i+1) for i in range(max_clusters)],
    lenmode="pixels", len=500,
))
fig.show()
"""Clustering on LDA Iris Dataset"""
import pandas as pd
import numpy as np
from statistics import mean
import plotly.graph_objects as go
import plotly.express as px
#import data set with 1D from LDA
from google.colab import files
uploaded=files.upload()
df=pd.read_csv('LDA_Iris.csv', header=None, names=['LDA1'])
#define parameters for max number of clusters, threshold dissimilarity (a), and counter (t)
max_clusters= 3
a= 0.9
t= 1
#Create cluster dictionary
clusters = dict(zip(range(max_clusters), [[] for i in range(max_clusters)]))
#assign first point to cluster 1
clusters[0].append(df.LDA1[0])
#iterate through data points after first point
for row in df.LDA1[1:]:
    currentclusters=t
    #distance from this point to the mean of each cluster opened so far
    distance = {c: abs(row-mean(clusters[c])) for c in range(currentclusters)}
    nearest_cluster=min(distance, key=distance.get)
    closest=distance[nearest_cluster]
    #open a new cluster if the point is farther than a from every cluster and the maximum is not reached
    if t<max_clusters and closest >a:
        print('cluster ' + str(currentclusters+1) + ' is made')
        clusters[currentclusters].append(row)
        t=t+1
    #otherwise assign the point to the cluster with the closest mean
    else:
        clusters[nearest_cluster].append(row)
print(clusters)
#collect every value with its cluster label for plotting
c_df=pd.DataFrame([(i, c) for c in clusters for i in clusters[c]], columns=['Value', 'Cluster'])
fig=px.scatter(x=c_df.Value, color=c_df.Cluster, color_continuous_scale=['red','green','blue'], title="BSCA with LDA Iris Data alpha=" + str(a) + ", max clusters=" + str(max_clusters))
fig.update_layout(coloraxis_colorbar=dict(
    title="Clusters",
    #one tick per cluster, labelled c1..cM
    tickvals=list(range(max_clusters)),
    ticktext=["c" + str(i+1) for i in range(max_clusters)],
    lenmode="pixels", len=500,
))
fig.show()
#define parameters for max number of clusters, threshold dissimilarity (a), and counter (t)
max_clusters= 3
a= 1.5
t= 1
#Create cluster dictionary
clusters = dict(zip(range(max_clusters), [[] for i in range(max_clusters)]))
#assign first point to cluster 1
clusters[0].append(df.LDA1[0])
#iterate through data points after first point
for row in df.LDA1[1:]:
    currentclusters=t
    #distance from this point to the mean of each cluster opened so far
    distance = {c: abs(row-mean(clusters[c])) for c in range(currentclusters)}
    nearest_cluster=min(distance, key=distance.get)
    closest=distance[nearest_cluster]
    #open a new cluster if the point is farther than a from every cluster and the maximum is not reached
    if t<max_clusters and closest >a:
        print('cluster ' + str(currentclusters+1) + ' is made')
        clusters[currentclusters].append(row)
        t=t+1
    #otherwise assign the point to the cluster with the closest mean
    else:
        clusters[nearest_cluster].append(row)
print(clusters)
#collect every value with its cluster label for plotting
c_df=pd.DataFrame([(i, c) for c in clusters for i in clusters[c]], columns=['Value', 'Cluster'])
fig=px.scatter(x=c_df.Value, color=c_df.Cluster, color_continuous_scale=['red','green','blue'], title="BSCA with LDA Iris Data alpha=" + str(a) + ", max clusters=" + str(max_clusters))
fig.update_layout(coloraxis_colorbar=dict(
    title="Clusters",
    #one tick per cluster, labelled c1..cM
    tickvals=list(range(max_clusters)),
    ticktext=["c" + str(i+1) for i in range(max_clusters)],
    lenmode="pixels", len=500,
))
fig.show()
#define parameters for max number of clusters, threshold dissimilarity (a), and counter (t)
max_clusters= 4
a= 0.7
t= 1
#Create cluster dictionary
clusters = dict(zip(range(max_clusters), [[] for i in range(max_clusters)]))
#assign first point to cluster 1
clusters[0].append(df.LDA1[0])
#iterate through data points after first point
for row in df.LDA1[1:]:
    currentclusters=t
    #distance from this point to the mean of each cluster opened so far
    distance = {c: abs(row-mean(clusters[c])) for c in range(currentclusters)}
    nearest_cluster=min(distance, key=distance.get)
    closest=distance[nearest_cluster]
    #open a new cluster if the point is farther than a from every cluster and the maximum is not reached
    if t<max_clusters and closest >a:
        print('cluster ' + str(currentclusters+1) + ' is made')
        clusters[currentclusters].append(row)
        t=t+1
    #otherwise assign the point to the cluster with the closest mean
    else:
        clusters[nearest_cluster].append(row)
print(clusters)
#collect every value with its cluster label for plotting
c_df=pd.DataFrame([(i, c) for c in clusters for i in clusters[c]], columns=['Value', 'Cluster'])
fig=px.scatter(x=c_df.Value, color=c_df.Cluster, color_continuous_scale=['red','green','blue'], title="BSCA with LDA Iris Data alpha=" + str(a) + ", max clusters=" + str(max_clusters))
fig.update_layout(coloraxis_colorbar=dict(
    title="Clusters",
    #one tick per cluster, labelled c1..cM
    tickvals=list(range(max_clusters)),
    ticktext=["c" + str(i+1) for i in range(max_clusters)],
    lenmode="pixels", len=500,
))
fig.show()
"""## Results
The first cluster analysis on the PCA data used a maximum of 4 clusters and a threshold of dissimilarity of 0.5. Based on the scatter plot, clusters 1, 2, and 4 have fairly tight groupings, whereas cluster 3 is very spread out and more rectangular than the others. The data was then clustered using a maximum of 5 clusters and a threshold of dissimilarity of 1.5. Clusters 1-4 are similar in size and shape, with tight groupings. Cluster 5 is more spread out and rectangular than the others. For the third analysis, the data was clustered using a maximum of 6 clusters and a threshold of dissimilarity of 0.8. Clusters 1-5 are evenly spread out and sized. Cluster 6 is more spread out and rectangular. The data in cluster 6 of the third analysis appears to be the same data as cluster 5 in the previous analysis.
The first analysis on the LDA data used a maximum of 3 clusters and a threshold of dissimilarity of 0.9. The three clusters are fairly even in shape and size. There appears to be a slight outlier belonging to the first cluster. The second analysis again used 3 clusters, but a threshold of dissimilarity of 1.5. The resulting clusters are similar to those in the first analysis. There is still an outlier belonging to cluster 1 that is spread out from the group. The third analysis used 4 clusters and a threshold of dissimilarity of 0.7. Clusters 2, 3, and 4 are larger than cluster 1 and similar in shape. Cluster 1 still contains the outlier, but the outlier is grouped with a few similar points.
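The visual comparisons above ("tight", "spread out", "similar in size") could be complemented with per-cluster numbers. The following is a minimal sketch, assuming a `clusters` dictionary shaped like the one built in this notebook (0-based integer keys mapping to lists of values); the `summarize` function and the example values are hypothetical and are not part of the original analysis.

```python
from statistics import mean, pstdev

def summarize(clusters):
    # Hypothetical helper: one summary row per non-empty cluster.
    rows = []
    for label, members in clusters.items():
        if not members:
            continue  # a cluster slot that was never opened
        rows.append({
            "cluster": label + 1,                  # 1-based label, as in the plots
            "size": len(members),                  # number of points
            "mean": mean(members),                 # cluster representative
            "spread": pstdev(members),             # population standard deviation
            "range": max(members) - min(members),  # extent along the axis
        })
    return rows

# made-up values for illustration only, not results from either dataset
example = {0: [1.0, 1.2, 0.8], 1: [4.0, 6.0], 2: []}
rows = summarize(example)
for row in rows:
    print(row)
```

A small spread and range relative to the other clusters would correspond to a tight grouping in the scatter plot.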
## Discussion and Conclusions
Based on the results from the previous section, 6 clusters with a threshold of dissimilarity of 0.8 appear to be the best combination of cluster count and value of a for the PCA Breast Cancer dataset. This is because the shape and size of the clusters are the most similar across clusters in this analysis. This clustering breaks the objects into the most uniform groups.
The results from the cluster analysis on the LDA Iris dataset showed that the optimal number of clusters is 3, with a threshold of dissimilarity of 0.9. The clusters from this analysis were the most separated of the three analyses. I would recommend removing the outlier at (0,0), as it remains separated from the rest of the data in all three analyses. Removing this one data point should preserve the information in the dataset without extensively affecting the cluster grouping.
This type of cluster analysis has no right or wrong number of clusters, as clustering is inherently subjective. In this study, the optimal number of clusters was chosen based on the groupings that best fit the data and on personal preference.
## References
[1] Monsalves, B., & Damjan. (2022, August 8). Types of clustering algorithms in machine learning with examples. Blogs & Updates on Data Science, Business Analytics, AI Machine Learning. https://www.analytixlabs.co.in/blog/types-of-clustering-algorithms/
"""