-
Notifications
You must be signed in to change notification settings - Fork 0
/
clusterFactory.h
327 lines (272 loc) · 15.1 KB
/
clusterFactory.h
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
/* This file is part of CongressRelationVis. This project creates graphs
showing simmilarity of voting of different members of the US House of
Representatives.
Copyright (C) 2013 Ezra Erb
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License version 3 as published
by the Free Software Foundation.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.
I'd appreciate a note if you find this program useful or make
updates. Please contact me through LinkedIn or github (my profile
also has a link to the code depository)
*/
/* The classes in this file take the map of vote differences and group
Congresspeople based on vote similarity. It uses a complete link
clustering algorithm, so groups are formed such that the maximum number
of vote differences between members of the group is under some limit.
The class exists for a number of reasons:
1. No set of Congresspeople, no matter how partisan, vote the exact same
way on all votes. The absences alone see to that. This creates noise
in the results. The clustering declares that vote differences below
some threshold are not meaningful and should be ignored. The
groups show which Congressmen should be treated by the final display
has having identical votes
2. Displaying a large number of Congressmen on the screen at once leads to a
messy, hard to read result. Removing the least meaningful differences
leads to a more comprehensible graph.
3. Force based layout algorithms are NP-complete, which means that
computation time scales exponentially with the number of nodes. Reducing
their number is essential to making the problem tractable, and clustering
is the best method available for doing that.
*/
#include<map>
#include<set>
using std::multimap;
using std::set;
/* NOTE: The following are defined in headers deliberately not included,
because they are widely used and the source should include the headers anyway */
using std::swap;
using std::pair;
using std::vector;
typedef set<unsigned short> CongressGroup;
typedef vector<CongressGroup> CongressGroupVector;
struct CongressGroupData
{
CongressGroup _group; // Members of this group
vector<short> _parties; // Count of members in each political party: D, R, other
vector<short> _regions; // Count of members in each geographic region
// Returns the total number of Congresspeople in a group
unsigned short getCount(void) const;
};
typedef vector<CongressGroupData> CongressGroupDataList;
class GroupDistanceMap
{
/* This class manages the data of distances between clusters. The data
management is complex enough that it needs to be encapsulated in a
separate class. The basic design issue is that the data is accessed
in two very different ways: find the distance given a pair of clusters,
and find the lowest distance within the set. The lowest distance
changes with every cluster merge because the smaller distances of the
old clusters to the remainder are always eliminated. The optimal data
structures for each purpose are very different, so this class keeps
two separate data representations and keeps them in sync. The risk of
desynchronization should be obvious, which is why they are encapsulated
in this class.
For the purpose of getting distance given two clusters, it uses a ragged
array, a two dimensional array (stored as vectors) where the length of
each row varies. In particular, each row only records distances for
columns less than the row number. The ragged array means that only one
value update is needed for any pair of clusters.
For the purpose of finding the lowest distance, it keeps a sorted list
of cluster pairs indexed by distance, using a multi-map. Multimaps have
the nice property that iterators are stable unless the entry itself is
changed, so storing the iterators in the ragged array allows direct access
to the entry in the sorted distances multimap. This way, distance data
can be directly looked up without storing it twice.
Efficiency overview: Every merge will require updating N distance
entries, to account for the newly formed cluster relative to the remainder.
Deletion of entries can be done in constant time due to the stored iterator.
Updating an existing value will be done half the time on average, giving N/2
map updates. The update is done by binary search, so the time required is the log
of the number of entries. The size of the sorted list is N^2, so the cost of a
merge is 2logN. Combine this to get a final cost of O(NlogN), an amazing coincidence.
The first item in the multi-map is the next pair to merge, so finding the next
pair to merge is constant time. The alternative is to search the multi-map
every time. Updates are done in constant time, but every search is now O(N^2),
because the ragged array is not sorted and must be linearly searched. Finally,
if the merging process is reported until there is only one group left, N-1
merges must be performed, giving an overall performance of O(NNlogN). This
matches the current theoretical minimum for complete link clustering algorithms.
*/
public:
typedef pair<unsigned short, unsigned short> ClusterPair;
/* Initializes the class from a map of distances between clusters. The input
matches the distances between Congresspeople by deliberate coincidence,
since every one is initially their own cluster */
explicit GroupDistanceMap(const VoteDiffMatrix& distances);
// Returns the highest valid cluster index
unsigned short getClusterNoLimit(void) const;
// Returns true if distance data exists for a given pair of clusters.
bool haveDistanceData(unsigned short cluster1, unsigned short cluster2) const;
bool haveDistanceData(const ClusterPair& clusters) const;
/* Gets the distance between a pair of clusters. If either is no longer
defined, the distance is zero. */
short getDistance(unsigned short cluster1, unsigned short cluster2) const;
short getDistance(const ClusterPair& clusters) const;
/* Erases a cluster distance from the map. Also deletes the iterator.
If the distance is not defined, nothing happens */
void eraseDistance(unsigned short cluster1, unsigned short cluster2);
/* Updates a cluster distance to a new value. Deletes the map iterator,
inserts the new value, and stores the new iterator. If no existing
distance is defined, nothing happens. If the distance shrinks
(indicating an error in the algorithm) a warning is issued. */
void updateDistance(unsigned short cluster1, unsigned short cluster2,
short distance);
/* Returns the pair of clusters whose merger will create the shortest
distance within any possible cluster.*/
ClusterPair getShortestDistanceCluster(void);
private:
// Prohibit copying, because it will screw up iterator data
GroupDistanceMap(const GroupDistanceMap& other);
GroupDistanceMap operator=(const GroupDistanceMap& other);
typedef multimap<short, ClusterPair> DistanceMap;
typedef DistanceMap::iterator DistancePtr;
DistanceMap _sortedDistances;
vector<vector<DistancePtr> > _distanceByCluster;
// Sorts the requested cluster pair into order for the lookup
void sortClusterRequest(unsigned short& cluster1, unsigned short& cluster2) const;
};
// Returns the total number of Congresspeople in a group
inline unsigned short CongressGroupData::getCount(void) const
{
return _group.size();
}
// Returns the highest valid cluster index
inline unsigned short GroupDistanceMap::getClusterNoLimit(void) const
{
return _distanceByCluster.size();
}
// Sorts the requested cluster pair into order for the lookup
inline void GroupDistanceMap::sortClusterRequest(unsigned short& cluster1, unsigned short& cluster2) const
{
/* The ragged array is set up so the row is always the higher cluster number.
This routine means that callers don't need to know about this, or care */
if (cluster1 < cluster2)
swap(cluster1, cluster2);
}
/* Returns the pair of clusters whose merger will create the shortest
distance within any possible cluster.*/
inline GroupDistanceMap::ClusterPair GroupDistanceMap::getShortestDistanceCluster(void)
{
if (_sortedDistances.empty())
return std::make_pair(0, 0);
else
return _sortedDistances.begin()->second;
}
// Returns true if distance data exists for a given pair of clusters.
inline bool GroupDistanceMap::haveDistanceData(unsigned short cluster1, unsigned short cluster2) const
{
// NOTE: Variables passed by value, so can butcher at will
sortClusterRequest(cluster1, cluster2);
if ((cluster1 < getClusterNoLimit()) && (cluster1 > cluster2))
// An entry still exists if it has an iterator into the sorted distance list
return (_distanceByCluster.at(cluster1).at(cluster2) != _sortedDistances.end());
else // Outside the ragged array limits, so clearly no data
return false;
}
inline bool GroupDistanceMap::haveDistanceData(const ClusterPair& clusters) const
{
return haveDistanceData(clusters.first, clusters.second);
}
/* Gets the distance between a pair of clusters. If either is no longer
defined, the distance is zero. */
inline short GroupDistanceMap::getDistance(unsigned short cluster1, unsigned short cluster2) const
{
// NOTE: Variables passed by value, so can butcher them at will
sortClusterRequest(cluster1, cluster2);
if (!haveDistanceData(cluster1, cluster2))
return 0; // Outside the ragged array
else
// The at() calls return an iterator. Dereference it to get the pair, first value is distance
return _distanceByCluster.at(cluster1).at(cluster2)->first;
}
inline short GroupDistanceMap::getDistance(const ClusterPair& clusters) const
{
return getDistance(clusters.first, clusters.second);
}
/* Erases a cluster distance from the map. Also deletes the iterator.
If the distance is not defined, nothing happens */
inline void GroupDistanceMap::eraseDistance(unsigned short cluster1, unsigned short cluster2)
{
// NOTE: Variables passed by value, so can butcher them at will
sortClusterRequest(cluster1, cluster2);
if (haveDistanceData(cluster1, cluster2)) {
/* OPTIMIZATION NOTE: Its tempting to set a reference to the wanted
ragged array entry, to avoid looking it up twice. Don't bother, the
vector at() method is incredibly optimized, and inline */
_sortedDistances.erase(_distanceByCluster.at(cluster1).at(cluster2));
// Clear array entry
_distanceByCluster.at(cluster1).at(cluster2) = _sortedDistances.end();
} // Distance defined for the cluster pair
}
// This class does the clustering, using the classic complete link algorithm
class ClusterFactory
{
public:
static short meaningfulDifferenceLimit;
/* Given the map of vote differences indexed by congresspersons and the level
of vote differences considered to be noise, returns the groups of congressmen.
The last optional parameter is a lower limit on the number of groups, used to
prevent returning a few huge blobs if the noise threshold is chosen badly.
Internally, the method implements the classic complete linkage grouping algorithm */
static void formClusters(const VoteDiffMatrix& congressVotes,
CongressGroupVector& congressMatchGroups,
const CongressData& congressData, // Needed for trace
short noiseThreshold = 100, short minGroups = 0,
bool traceOutput = false);
/* Finds the average vote difference between every pair of clusters. The
difference map is indexed by the order clusters appear in the supplied
group list. This method assumes that the noise limit for the clustering
is high enough that rounding from the average calculation will not
materially affect the results. */
static void getClusterDistanceMap(const VoteDiffMatrix& congressVoteMap,
const CongressGroupVector& congressGroupList,
VoteDiffMatrix& groupVotesMap);
// Given groups of congress persons, summarizes data about their members
static void getClusterCongressData(const CongressGroupVector& congressGroupList,
const CongressData& congressData,
const RegionMapper& regions,
CongressGroupDataList& congressGroupData);
// Outputs the contents of the cluster list, with identification informmation
static void debugOutputClusterList(const CongressGroupVector& congressMatchGroups,
const CongressData& congressData);
// Outputs the contents of the cluster data list
static void debugOutputClusterList(const CongressGroupDataList& congressGroupData);
private:
// Finds the distances between a newly merged cluster and all other clusters
static void mergeClusters(CongressGroupVector& groups, unsigned short source,
unsigned short destination);
// Finds the cluster index number for a newly merged cluster
static unsigned short findMergeClusterIndex(unsigned short cluster1,
unsigned short cluster2);
// Merges the cluster contents within the list
static void mergeClusters(GroupDistanceMap& data, unsigned short cluster1,
unsigned short cluster2);
// Find the average distance between two clusters
static short findClusterDistance(const VoteDiffMatrix& congressVoteMap,
const CongressGroup& cluster1,
const CongressGroup& cluster2);
// Prints the distance output
static void debugOutputDistances(const GroupDistanceMap& distances);
};
inline unsigned short ClusterFactory::findMergeClusterIndex(unsigned short cluster1,
unsigned short cluster2)
{
if (cluster1 < cluster2)
return cluster1;
else
return cluster2;
}
// Finds the distances between a newly merged cluster and all other clusters
inline void ClusterFactory::mergeClusters(CongressGroupVector& groups, unsigned short source,
unsigned short destination)
{
groups.at(destination).insert(groups.at(source).begin(),
groups.at(source).end());
groups.at(source).clear();
}