\documentclass{article}
\def\pin{p_\mathrm{in}}
\def\pout{p_\mathrm{out}}
\def\kin{k_\mathrm{in}}
\def\kout{k_\mathrm{out}}
\def\in{\mathrm{in}}
\def\out{\mathrm{out}}
\def\low{\mathrm{low}}
\def\high{\mathrm{high}}
\def\code#1{\texttt{#1}}
\title{Dynamic community detection benchmark readme}
\begin{document}
\maketitle
This document describes the usage and implementation of the dynamic network
community benchmark.
\section{About the benchmark}
\label{sec:about}
The benchmark is fully described in our accompanying paper \cite{xxx}.
\subsection{Basic operations}
\label{sec:about-operations}
\noindent
There are two basic operations:
\textbf{Merging/splitting}: Two communities start off as random graphs
with internal link density $\pin$ and link density $\pout$ between
them. As time goes on, edges are added between them until the result
is a single random graph, one community with link density $\pin$.
The process then reverses until the communities are split again,
completing one cycle.
\textbf{Expanding/contracting}: Two equal-sized communities start off
as random graphs with internal link densities $\pin$. Nodes are
transferred from one community to the other until a fraction $f$ of
the nodes of one community has been moved. At $t=.25\tau$, the first
community is at its maximal size and the second at its minimal size.
The process reverses: at $t=.5\tau$ the communities are of equal size
again, and at $t=.75\tau$ the second community is at its maximal
size. Then, at $t=\tau$ the communities return to equal size, one
cycle is complete, and the process repeats. At all times, the
internal link density of both communities is maintained at $\pin$ and
the link density between them is maintained at $\pout$.
\subsection{Standard benchmarks}
\label{sec:about-std-benchmarks}
\noindent
These two unit operations are combined into three basic benchmarks.
This is only a summary of the mechanism; for full details see section
\ref{sec:technical}. All ``standard'' benchmarks in our first paper
use $q=4$, $n=32$, $\pin=0.5$, $\pout=0.05$, and $\tau=100$ and are
referred to precisely by the names \texttt{StdMerge},
\texttt{StdGrow}, and \texttt{StdMixed}.
\textbf{StdMerge}: There are $q=4$ equal size ($n=32$ nodes)
communities. The ``top'' two communities start out split, merge, and
re-split over one cycle. The ``bottom'' two start off as one
community, split, and re-merge over the cycle (a phase difference of
.5). The internal link density is $\pin$, the link density between
the split communities is $\pout$, and all other link densities are
$\pout$. The cycle time is $\tau=100$.
\textbf{StdGrow}: There are $q=4$ equal size ($n=32$ nodes)
communities. The ``top'' two communities begin at equal size and
undergo the expand/contract process. The ``bottom'' two communities
do the same with a .5 phase difference (however, the process is
symmetric, so the phase factor has no effect). The other variables
$\pin$, $\pout$, and $\tau$ are as in StdMerge.
\textbf{StdMixed}: There are $q=4$ communities. The ``top'' two begin
split and follow the merging process, and the ``bottom'' two follow
the expand/contract process. The other variables $\pin$, $\pout$, and
$\tau$ are as in StdGrow.
\section{Installation}
\begin{itemize}
\item The only dependencies are Python 2.x and the graph library
\texttt{networkx}. \texttt{networkx} is distributed with this code
and should not need separate installation.
\item This code should be runnable from its directory and does not
need installation.
\item To load the code as a library, place \texttt{bm} on your Python
path or the directory from which your other code runs. If you want
to do this, you probably already know what you are doing.
\end{itemize}
\section{Usage}
The code can be used as a normal program from the command line
(section \ref{sec:use-cli}), as a module for other Python programs
(section \ref{sec:use-library}), or can be extended by subclassing to
create your own custom tests (section \ref{sec:use-newtests}).
\subsection{Command line usage}
\label{sec:use-cli}
Execute \code{bm.py}. To get a listing of the help options, run:
\begin{verbatim}
python /path/to/bm.py -h
\end{verbatim}
Standard syntax:
\begin{verbatim}
bm.py ModelName OutputPrefix [options]
\end{verbatim}
You must provide a \texttt{ModelName} and an \texttt{OutputPrefix} to
specify where files will be written.
Output is written to \texttt{*.graph} files, which contain the graphs
at each time (in your desired format, a temporal edgelist by
default), and \texttt{*.comms} files, which contain instantaneous
community information. Output formats can be specified by
\code{--graph-format} or \code{--comm-format}.
The options \code{--p\_in}, \code{--p\_out}, \code{--n}, and \code{--q}
can control the most important other parameters.
\vspace{.5cm}
\noindent
Required positional arguments:
\begin{description}
\item[ModelName] Model name to simulate. Can be \texttt{StdMerge},
\texttt{StdGrow}, or \texttt{StdMixed}. These are described
in section \ref{sec:about-std-benchmarks}.
\item[OutputPrefix] Where to write output. Technically, this is
  optional: if it is not specified, no output is written and you will
  only see some information printed as things are calculated. This can
  be useful for testing parameters and watching the community sizes produced.
\texttt{OutputPrefix} can contain directory names. Files will be
written to \texttt{\textsl{OutputPrefix}.tNNNNN.*}, where \texttt{NNNNN} is
the corresponding time step.
\end{description}
\vspace{.5cm}
\noindent
Commonly used options:
\begin{description}
\item[-{}-n=INTEGER] Number of nodes in each community. The total
number of nodes is $n\times q$. (library model option: \code{n=INT})
\item[-{}-q=INTEGER] Number of total communities, default 4. This
should be a multiple of 2 for \texttt{StdMerge} and
\texttt{StdGrow}, and a multiple of 4 for \texttt{StdMixed}.
(library model option: \code{q=INT})
\item[-{}-p\_in=REAL] Set the link density within communities. Every
pair of nodes in the same community will independently have this
  probability of having an edge. Must be a real number between 0.0 and
1.0. (library model option: \code{p\_in=FLOAT})
\item[-{}-p\_out=REAL] Set the link density between nodes in different
communities. Analogous to \code{--p\_in}. (library model option:
\code{p\_out=FLOAT})
\item[-{}-k\_in=REAL] An alternative method of setting $\pin$. This
  effectively sets $\pin = \kin/(n-1)$. The meaning of $\kin$ is
  not well defined for the growing/shrinking benchmark; see below for
  more information. FIXME: describe this. (library option: use
  \code{p\_in="k=FLOAT"})
\item[-{}-k\_out=REAL] An alternative method of setting $\pout$. This
  effectively sets $\pout = \kout/n$. The meaning of $\kout$ is not
  well defined for the growing/shrinking benchmark; see below for more
  information. FIXME: describe this. This is the number of external
  edges \textsl{per community}, so the total number of external edges
  is $\kout(q-1)$. (library option: use \code{p\_out="k=FLOAT"})
\item[-{}-k\_out\_tot=REAL] Like \texttt{-{}-k\_out} but considers the
  total out degree. This effectively sets
  $\pout = k_\mathrm{out\_tot}/(n(q-1))$. The same caveat for
  grow/shrink applies as for \code{-{}-k\_out}. FIXME: describe
  this. (library option: use \code{p\_out="ktot=FLOAT"})
\end{description}
\vspace{.5cm}
\noindent
Other options:
\begin{description}
\item[-{}-t=INT] Maximum time to simulate when running from the command
  line. Graphs and communities will be written for every integer time
  in $[0, T]$. (library option: not applicable; you can directly
  get any time point from the library.)
\item[-{}-tau=REAL] Model time period. The model repeats after this
time. The graph and communities at points $t$ and $t+\tau$ are
identical. (library option: \code{tau=FLOAT})
\item [-{}-graph-format=NAME] Output format for graph writing. See
section \ref{sec:output-formats} for information on choices.
(library option: not applicable)
\item [-{}-comm-format=NAME] Output format for community writing. See
section \ref{sec:output-formats} for information on choices.
(library option: not applicable)
\item [-{}-seed=STRING] Seed for the random number generator, passed
directly to Python's \code{random.Random.seed}. \texttt{scipy}'s
random number generator is seeded from this, when needed. (library
option: \texttt{opts=\{"seed": SEED\}})
\item [-{}-no-det-limit] If this option is given, communities are only
  considered merged when $\pout=\pin$. If it is not given
  (detectability-limit support is on), communities are considered
  merged according to the sparse-limit detectability formula. (library
option: \texttt{opts=\{"no\_det\_limit": True\}})
\item [-{}-Gnm] If given, stochastic block models are taken from a
G$_\mathrm{nm}$ ensemble instead of a G$_\mathrm{np}$ ensemble. In
practice, this means that instead of the number of edges chosen from
a binomial distribution with parameters $n(n-1)/2, p$, we have
exactly $\textrm{round}(n(n-1)p/2)$ edges. This is also true for
edges between communities. \textbf{The \texttt{StdGrow} (and
\texttt{StdMerge}) benchmarks do not support this, since this has
not been implemented in the \texttt{ExpandContract} manager
yet!} (library option: \texttt{opts=\{"Gnm": True\}})
\item[-{}-cids=STRING] By default, community IDs are reused. Other
choices include: \texttt{snapshot}: New community IDs are generated
on every snapshot, they are never reused. \texttt{new}: Produce new
community IDs after each critical event. For example, after two
communities merge, a new community ID is created, and when this
splits again, yet again new community IDs are created. Community
IDs can last for several timesteps, but they will not come back if
they are lost. (library option: \texttt{opts=\{"cids": STRING\}})
\end{description}
\subsubsection{Examples}
\begin{itemize}
\item \texttt{./bm.py StdMerge output-prefix}
\item \texttt{./bm.py StdGrow output-prefix}
\item \texttt{./bm.py StdMixed output-prefix}
\item \texttt{./bm.py StdMerge output-prefix --n=32 --q=16 --p\_in=.5 --p\_out=.1}
\item \texttt{./bm.py StdMerge output-prefix --k\_in=16 --k\_out=3}
\item Specifying \texttt{k\_in} instead of \texttt{p\_in} will automatically
  calculate the $p$ values on a per-community basis. The total number of external edges is $k_\out(q-1)$.
\end{itemize}
\subsection{Usage as a library}
\label{sec:use-library}
The main module \texttt{bm} can be imported as a Python module. Then,
graphs and communities can be directly accessed at each individual
time step, without a need for creating and reading many temporary
files.
The easiest entry point is the \texttt{get\_model} function. All
options from the command line (that are applicable here) can also be
specified through the library interface. The meanings of these options
are not explained again here; for reference, see the previous section
on command-line usage.
\begin{description}
\item[bm.get\_model]\hspace{-.15cm}(\textsl{ModelName}, **\textsl{model-opts}, opts={})
Return a benchmark object \code{M}.
The following keyword options are accepted: \code{p\_in}, \code{p\_out},
\code{n}, \code{q}, \code{tau}, \code{opts}. Of these, \code{opts}
can be a dictionary which contains other options.
To specify a \code{k\_in}, use \code{p\_in="k=K"}, that is, the
\code{p\_in} option is a string starting with \code{"k="} followed
by the average degree desired. Then, $\pin=\kin/(n-1)$ and
$\pout=\kout/n$ are calculated and used.
Example usage: \code{bm.get\_model("StdMerge", p\_in=.5, p\_out=.1,
n=100, q=8, opts=\{"no\_det\_limit": True\})}
\item[Benchmark object] Object representing model state and
configurations.
\begin{description}
\item [Benchmark.graph]\hspace{-.15cm}(t)
Return \code{networkx} graph object corresponding to network state
at time $t$.
\item [Benchmark.comms]\hspace{-.15cm}(t)
Return the communities of the network at time $t$. This is a
dictionary with keys representing community names (currently integers, but
can be generalized to any string) and values which are
\texttt{set}s of node IDs.
\end{description}
\item[bm.main\_argv]\hspace{-.15cm}(argv)
Emulate command line usage from script. The argument \texttt{argv}
should be a list of strings of command line options to emulate. The
first element of the list should be some placeholder for the script name.
  This does \textsl{not} run or write any output; it only returns a
  Benchmark object. The \code{bm.run} function handles running and writing.
  Here, you would use the \code{graph} and \code{comms} methods to get
  exactly the times you want without temporary files.
Example: \code{bm.main\_argv(["X", "StdGrow", "--p\_in=.1"])}
Returns: tuple \texttt{(Benchmark \textrm{object}, args)}
\end{description}
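For concreteness, here is a minimal sketch of library usage (the
parameter values and sampled times are arbitrary illustrations; it
assumes the \texttt{bm} directory is on the Python path as described
in the installation section):
\begin{verbatim}
# Minimal library-usage sketch (Python 2; assumes bm is importable).
import bm

# Keyword options mirror the command-line ones described above.
M = bm.get_model("StdMerge", p_in=.5, p_out=.05, n=32, q=4, tau=100)

for t in range(0, 101, 10):       # sample one cycle every 10 time units
    g = M.graph(t)                # networkx graph at time t
    comms = M.comms(t)            # dict: community ID -> set of node IDs
    sizes = sorted(len(nodes) for nodes in comms.values())
    print t, g.number_of_nodes(), g.number_of_edges(), sizes
\end{verbatim}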
\subsection{Creating your own benchmarks}
\label{sec:use-newtests}
The benchmark includes three standard models: \texttt{StdMerge},
\texttt{StdGrow}, and \texttt{StdMixed}. These are built from three
basic operations implemented in the classes \texttt{Static}, which
places links statically; \texttt{Merging}, which implements a merging
operation occurring over time; and \texttt{ExpandContract}, which
makes two communities that expand and contract in time. These three
unit operations can be combined to make your own benchmarks, for
example with different time scales or different community sizes.
\textsl{This is not documented yet.}
\section{Usage notes}
\label{sec:usage-notes}
\subsection{Disconnected communities}
If the values of the community size $n$ and internal edge density $\pin$
are too low, it is easily possible for a community to become
disconnected. We do not recalculate links for individual disconnected
nodes or components. Instead, we abort the entire process and the
benchmark fails with \texttt{DisconnectedError}. This is a conscious
choice so as not to affect the ensemble of possible states in any way.
We take the view that the parameters must be such that disconnection
is an unlikely occurrence, and we abort when it happens.
\subsection{$\kin$ and $\kout$}
If $\kin$ or $\kout$ are specified, we take $\pin=\kin/(n-1)$ or
$\pout=\kout/n$, with $n$ being the community size.
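As a worked example: with the standard community size $n=32$, the
command-line example \texttt{--k\_in=16 --k\_out=3} corresponds to
$\pin = 16/31 \approx 0.52$ and $\pout = 3/32 \approx 0.094$.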
\subsection{Output formats}
\label{sec:output-formats}
The following output formats are available. It just so happens that
the code currently produces graphs with consecutive node IDs from $0$
to $N-1$. However, this is just due to the limited scope of the
current benchmark; internally and in most output formats, the
benchmark is designed to extend to cases where nodes are added and
removed, leading to non-consecutive node IDs (for example, node 2 is
removed). Furthermore, a likely extension is to overlapping
communities. Not all output formats can support this.
\textbf{Graph output formats}
\begin{itemize}
\item \texttt{edgelist}: An edgelist (lines of \texttt{node1 node2}),
one file per timestep.
\item \texttt{pajek}: Pajek graph format, one file per timestep.
  Unlike the other output formats, nodes are one-indexed instead of
  zero-indexed. The community formats will still be zero-indexed
  unless otherwise stated. This format will not support
  non-consecutive node IDs, once those exist.
\item \texttt{null}: Do not write graphs.
\item \texttt{tedgelist} (default): Temporal edgelist. Like an edgelist, but
with an extra column for time of the edge. Lines of \texttt{node1
  node2 weight time}. In the current implementation, the weight is always
  $1.0$. For each edge, there is a separate line for each time at which it
  exists. In the current implementation, times are ascending.
\item \texttt{other}: Any other graph writer in networkx can be used.
  The writer is looked up as \texttt{networkx.write\_}\textsl{name}.
\end{itemize}
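As an illustration only (the file name is a placeholder), a sketch
that reads a \texttt{tedgelist} file back into one \texttt{networkx}
graph per time:
\begin{verbatim}
# Read a tedgelist file (lines: node1 node2 weight time) into per-time
# graphs.  "output.graph" is a placeholder file name for illustration.
import networkx as nx

graphs = {}                                  # time -> networkx.Graph
for line in open("output.graph"):
    if not line.strip() or line.startswith("#"):
        continue                             # skip blank/comment lines
    node1, node2, weight, time = line.split()
    g = graphs.setdefault(time, nx.Graph())
    g.add_edge(node1, node2, weight=float(weight))
\end{verbatim}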
\textbf{Community output formats}
\begin{itemize}
\item \texttt{oneline}: There is one file per timestep. Each file
has one community written per line. Each line has node IDs written
space-separated. Time information is in the filename. The
community name is in a label in a comment above each line. This
format supports overlaps.
\item \texttt{bynode}: Lines with \texttt{node community} pairs,
  one file per timestep. Time information is in the filename. This
format supports overlaps.
\item \texttt{pajek}: Pajek \texttt{.clu} format, one file per
timestep. This format has the same problem as the Pajek graph
format.
\item \texttt{tcommlist} (default): One file for all timesteps, with
  lines \texttt{time node comm}. This format supports overlaps; there
  will be one line for each membership, with the same (time, node)
  information repeated.
\item \texttt{tmatrix}: All nodes are sorted by node ID, and
communities of each node are written consecutively, all nodes for
one timestep on each line. The file appears to be a big matrix with
dimensions of nodes (across) and time (lines). This format does not
contain information on node IDs (making it unsuitable for extensions
with added/removed nodes) or timestamps (so you must know timestamps
from some other source).
\end{itemize}
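Similarly, a sketch (the file name again a placeholder) that reads the
default \texttt{tcommlist} format back into per-time community
dictionaries:
\begin{verbatim}
# Read a tcommlist file (lines: time node comm) into
# {time: {community: set(nodes)}}.  "output.comms" is a placeholder name.
from collections import defaultdict

comms_by_time = defaultdict(lambda: defaultdict(set))
for line in open("output.comms"):
    if not line.strip() or line.startswith("#"):
        continue                             # skip blank/comment lines
    t, node, comm = line.split()
    comms_by_time[t][comm].add(node)
\end{verbatim}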
\subsection{Community tracking information}
\subsection{Time-reversibility}
The code is time-reversible. In other words, you can request the
graph at time $t=70$ and then request the graph at time $t=10$. This
is because all state is computed at benchmark initialization.
The downside is that the state is completely periodic: you can advance
by any number of multiples of the period $\tau$ and you will still
return to the same state.
\subsection{Threadsafety}
All code is threadsafe (including/especially random number generation).
\section{Dynamic community grammar}
\label{sec:dyn-grammar}
The dynamic community grammar provides a standard language for
describing transitions in communities. This is the ``critical
points'' that some methods try to detect. For example, the three
current grammar statements indicate one community continuing in time,
two communities merging, or two communities splitting.
The grammar is printed to \texttt{stdout} when it runs. There are no
tools for using this yet.
Each statement is represented as a Python tuple. The zeroth element
is a string describing the action, followed by other elements
describing the communities operated on. When using the benchmark as a
library, a list of these statements can be returned \textit{for the
last timestep} by \texttt{Benchmark.grammar()}.
The following statements are available:
\begin{itemize}
\item \texttt{("Continue", (old\_cid, ), (new\_cid, ))}. Indicates that the
community \texttt{old\_cid} is now represented by \texttt{new\_cid}.
This is the same community (maybe with a few nodes added or
removed), but the ID has changed. This statement is not generated
if \texttt{old\_cid} is the same as \texttt{new\_cid}. Note that
both IDs are in length-1 tuples.
\item \texttt{("Merge", (old\_cid0, old\_cid1, ...), (new\_cid, ))} The
communities in the first element lost their individuality (they
became too densely connected) and were
merged into a new community of name \texttt{new\_cid}. Note that
\texttt{new\_cid} is in a length-1 tuple.
\item \texttt{("Split", (old\_cid, ), (new\_cid0, new\_cid1, ...))} The
  community in the first element lost its cohesiveness and was split into
multiple new communities named in the second element. Note that
\texttt{old\_cid} is in a length-1 tuple.
\end{itemize}
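As an illustration (not part of the benchmark itself), a sketch of how
these statements might be consumed through the library interface; it
assumes that evaluating a timestep via \texttt{comms()} makes it the
``last timestep'' in the sense above:
\begin{verbatim}
# Sketch: report critical events from the grammar while stepping in time.
# Assumes evaluating M.comms(t) makes t the "last timestep" for grammar().
import bm

M = bm.get_model("StdMerge", n=32, q=4, tau=100)
for t in range(0, 101):
    M.comms(t)
    for action, old_cids, new_cids in M.grammar():
        if action == "Merge":
            print "t=%d: %s merged into %s" % (t, old_cids, new_cids)
        elif action == "Split":
            print "t=%d: %s split into %s" % (t, old_cids, new_cids)
        elif action == "Continue":
            print "t=%d: %s continues as %s" % (t, old_cids, new_cids)
\end{verbatim}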
\section{Dynamic community comparison}
The tool \texttt{dyncmtycmp.py} does a windowed community comparison,
as described in our paper.
Standard syntax:
\begin{verbatim}
dyncmtycmp.py input1 input2 [--dt=DT] [--cmp=NAME] [options]
\end{verbatim}
Positional arguments:
\begin{description}
\item[input1] Input file \#1. In either \texttt{tcommlist} or
  \texttt{tmatrix} format.
\item[input2] Input file \#2. Same as \texttt{input1}.
\end{description}
Options:
\begin{description}
\item[-{}-dt=INT] Timestep delta. The code compares the $n$th snapshot to
  the $(n+dt)$th snapshot. If $dt=0$, then this is the same as
  comparing identical snapshots to each other. Note that the name
  $\Delta t$ is a slight misnomer: this is not a time delta but an
  index delta.
\item[-{}-cmp=NAME] Which comparison function to use. Acceptable
names are \texttt{vi}, \texttt{vi\_norm}, \texttt{nmi},
\texttt{jaccard}, or anything in \texttt{pcd.cmtycmp}.
\item[-{}-format=NAME] Specify input format for the files. Choices
are \texttt{auto}, \texttt{tcommlist}, \texttt{tmatrix}. Default is
\texttt{auto}, which will do a basic check to figure it out.
\end{description}
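For example (the file names are placeholders for two
\texttt{tcommlist} outputs), comparing each snapshot of one run to the
next snapshot of the other using normalized mutual information:
\begin{verbatim}
python dyncmtycmp.py run1.tcommlist run2.tcommlist --dt=1 --cmp=nmi
\end{verbatim}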
Other notes:
\begin{itemize}
\item There is a simple library interface; it is not documented here,
  so look in the code to see how it works.
\end{itemize}
\section{Technical notes}
\label{sec:technical}
\subsection{Principles}
\label{sec:principles}
\begin{itemize}
\item The benchmark operates in intervals of time, not space. That
means that one cycle of the benchmark is time $\tau$ and snapshots
  are given at $0\tau$, $.01\tau$, etc., instead of one snapshot being
  the addition of one edge or one node. Reason: for the merging
  benchmark, since $p$ is fixed rather than the number of edges, a
  different number of edges needs to be added in each instance. It is ideal to have
the critical points (completely split, completely merged) at the
same time points in all instances. Further, when merging and
expansion are coupled, we need a way to make a consistent time
scale. This is the easiest way.
\item The configurations are programmed as a certain state at a certain
time, not as a dynamic process generating a certain state. This
  makes it easiest to ensure that the model always matches a
given ensemble. You can immediately generate (and check) state at
time $t$ without going through all times $[0,t)$.
\item Care is paid to the time and space complexity of all operations.
  There are no $O(N^2)$ time or space operations; everything is
  $O(Nk)$ or $O(M)$.
\item At every step, the communities are checked for connectedness,
  since there is no model-level guarantee of connectedness. If the graph
  is disconnected, we abort. This can be changed to ``resample and
continue'' if we want to change the ensemble.
\item The model is programmed with defensive programming techniques
(e.g. arxiv:1210.0530) to prevent bugs, and will have a full test suite.
\item Everything is fully modular. For example, a generic dynamic merging operation is
defined between two communities, and this can be applied to any
  realization of the benchmark, possibly multiple times. This gives
  the most flexibility and reusability, and the least chance of introducing bugs.
\item The benchmark is written in Python and depends only on the
  \texttt{networkx} package for graphs. This allows flexible
  deployment on the largest number of systems and makes flexibility
  and customizability easiest to implement.
\item The three ``standard'' benchmarks are presets. The defaults are
  set as we want, but $n$, $q$, $\pin$, and $\pout$ (and probably
  more) can also be customized. We can also have other ``advanced''
  versions for testing.
\item Public and open-source project. Git repository:
  \texttt{https://git.becs.aalto.fi/rkdarst/dynbench}
\end{itemize}
\subsection{Static mechanism}
\begin{itemize}
\item A community $A$ has $n$ nodes.
\item Each edge within $A$ (between two nodes in $A$, excluding
self-edges) is made with probability $\pin$. These edges never
change and remain static throughout the life of the benchmark.
\item This could be expanded to a ``dynamic equilibrium'' mechanism
where, on each step, $x$ possible pairs of nodes are re-sampled from
the distribution.
\item This mechanism can also apply to two communities $A$ and $B$,
then considering edges between $A$ and $B$ instead of within them.
This is used for ``external'' edges between non-interacting
communities, with edge density $\pout$.
\end{itemize}
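A minimal sketch of this mechanism (illustrative only; the real
implementation avoids the explicit $O(n^2)$ loop, see
section~\ref{sec:principles}):
\begin{verbatim}
# Sketch of the "static" mechanism: each pair of nodes within a community
# gets an edge independently with probability p_in, and these edges never
# change.  (Illustration only; the real code avoids the O(n^2) loop.)
import random
import networkx as nx

def static_edges(nodes, p, rng=random):
    nodes = list(nodes)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:        # each pair once, no self-edges
            if rng.random() < p:
                yield (u, v)

g = nx.Graph()
community_A = range(32)                # n = 32, as in the standard benchmarks
g.add_nodes_from(community_A)
g.add_edges_from(static_edges(community_A, 0.5))   # p_in = 0.5
\end{verbatim}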
\subsection{Merging mechanism}
\begin{itemize}
\item Two communities, $A$ and $B$, are initialized with $n$ nodes.
\item Edges internal to $A$ and $B$ are managed with the ``Static''
mechanism above.
\item The edge density of edges \textsl{between} communities ranges
from $\pout$ in the split state to $\pin$ in the merged state.
  Thus, at one endpoint the graph is a standard SBM graph, and at
  the other endpoint it is an ER random graph with density $\pin$. Note that the
\textsl{actual} edge density (edges divided by edges possible) in
the fully merged state can be higher or lower than the
\textsl{actual} internal densities of
the two sides, but they are chosen from the same distribution. This
is a natural consequence of being an ER random graph.
\item The exact number of inter-community edges $m_\low$ in the split
  state is chosen from the binomial distribution
  $\mathcal{B}(n^2, \pout)$, with one trial per possible edge between
  the two size-$n$ communities. The exact number $m_\high$ in the
  merged state is chosen from the binomial distribution
  $\mathcal{B}(n^2, \pin)$. To choose any particular point, a linear
  interpolation is taken between these values (see the sketch after
  this list).
\begin{itemize}
\item In particular, this means that at the halfway point, the
number of edges is $(m_\high+m_\low)/2$, and not chosen from a
binomial distribution with $(\pin+\pout)/2$. We are only in a
    random graph ensemble at the endpoints. If the count were re-drawn
    at each step, the number of edges would fluctuate up and down as
    time passes, which was requested not to happen.
\item This also means the same number of new edges are added on each
step.
\end{itemize}
\item Time is defined with linear interpolation between these
points. Alternative formulations are easy to add due to the
implementation above.
\begin{itemize}
\item $t'=0$, split state
\item $t'={1 \over 2} \tau$, fully merged state.
\item $t'=\tau$, split state.
\end{itemize}
$t'$ is the normalized time, $t' = t ~ \mathrm{mod} ~ \tau + \phi$. $\phi$ is a
phase factor.
\item The ``ground truth'' community is only considered merged when
$\pout=\pin$. If we desire, this can be changed to use the
detectability limit as a threshold.
\end{itemize}
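A sketch of the interpolation schedule described above (illustrative
only; $m_\low$ and $m_\high$ are drawn once per realization, and the
parameter values are arbitrary):
\begin{verbatim}
# Sketch of the merging schedule: the number of A-B edges interpolates
# linearly between m_low (split, t'=0 and t'=tau) and m_high (merged,
# t'=tau/2).  Illustration only.
import random

def n_edges_between(t, tau, m_low, m_high, phi=0.0):
    tprime = ((t % tau) + phi) % tau          # normalized time with phase
    x = 1.0 - abs(2.0 * tprime / tau - 1.0)   # 0 at endpoints, 1 at tau/2
    return int(round(m_low + x * (m_high - m_low)))

# Draw m_low and m_high once; n*n is the number of possible A-B edges.
n, p_in, p_out, tau = 32, 0.5, 0.05, 100
m_low  = sum(random.random() < p_out for _ in range(n * n))
m_high = sum(random.random() < p_in  for _ in range(n * n))
print [n_edges_between(t, tau, m_low, m_high) for t in (0, 25, 50, 75, 100)]
\end{verbatim}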
\subsection{Expansion/contraction mechanism.}
\begin{itemize}
\item Two communities, $A$ and $B$, have $n$ nodes. A fraction $f$
defines the fraction of each community that is moved to the other
  community. Our initial convention is $f=.5$ (with $n=32$ this gives
  the 16 to 48 split). Each community has an
internal edge density $\pin$ and all links between any two nodes in
different communities are present with independent probability
$\pout$.
\item At the ``left'' state, $A$ has $n_A(1-f)$ nodes and $B$ has
  $n_B+f n_A$ nodes. At the ``right'' state, $A$ has $n_A + f n_B$
  nodes and $B$ has $n_B(1-f)$ nodes. (This can be generalized to
  different numbers of nodes in $A$ and $B$, as you see here.)
\item A time interpolation is done between these two states. This is
a linear interpolation in number of nodes, but this can also be
easily adjusted in one function. Critical time points are defined
as such:
\begin{itemize}
\item $t'=0$, equal sized communities.
\item $t'={1 \over 4} \tau$, $A$ is smallest and $B$ is largest.
\item $t'={1 \over 2} \tau$, equal sized communities.
\item $t'={3 \over 4} \tau$, $A$ is largest and $B$ is smallest.
\item $t'=\tau$, equal sized communities.
\end{itemize}
  $t'$ is the normalized time, $t' = t ~\mathrm{mod}~ \tau + \phi$. $\phi$ is a
  phase factor.
\item If a community is ever disconnected, raise an error and abort.
\end{itemize}
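A sketch of the size schedule (illustrative only; equal initial sizes
$n_A=n_B=n$ are assumed, as in the standard benchmarks):
\begin{verbatim}
# Sketch of the expand/contract size schedule for two communities A and B
# of equal initial size n: A is smallest at t'=tau/4, largest at 3*tau/4,
# and the total number of nodes is conserved.  Illustration only.
def community_sizes(t, tau, n, f=0.5, phi=0.0):
    phase = (((t % tau) + phi) % tau) / float(tau)   # in [0, 1)
    if phase < 0.25:
        s = -phase / 0.25
    elif phase < 0.75:
        s = (phase - 0.5) / 0.25
    else:
        s = (1.0 - phase) / 0.25
    moved = int(round(f * n * s))   # nodes moved from B to A (or A to B if < 0)
    return n + moved, n - moved

print [community_sizes(t, 100, 32) for t in (0, 25, 50, 75, 100)]
# -> [(32, 32), (16, 48), (32, 32), (48, 16), (32, 32)]
\end{verbatim}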
\subsection{Standard benchmarks}
\textbf{This section is poorly written and unfinished. I should
  describe how different $q$ values work and make this more rigorous.}
\subsubsection{Standard - Merging}
Four communities. 0 and 1 merging, 2 and 3 merging with phase
factors of 0 and .5. Internal links are managed with the ``static''
mechanism with edge density $\pin$. All other links between \{0,1\} and \{2,3\} are
managed by ``static'' with edge density $\pout$.
\subsubsection{Standard - Growing}
Four communities. 0 and 1 ``growing'', 2 and 3 ``growing'' with phase
factors of 0 and .5. All other links between \{0,1\} and \{2,3\} are
managed by ``static'' with edge density $\pout$.
\subsubsection{Standard - Mixed}
Four communities. 0 and 1 ``merging'', 2 and 3 ``growing'' with phase
factors of 0 and 0. All other links between \{0,1\} and \{2,3\} are
managed by ``static'' with edge density $\pout$.
\end{document}