\documentclass{article}
\usepackage{arxiv}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
\usepackage{hyperref} % hyperlinks
\hypersetup{hidelinks = true}
\usepackage{url} % simple URL typesetting
\usepackage{booktabs} % professional-quality tables
\usepackage{amsfonts} % blackboard math symbols
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{nicefrac} % compact symbols for 1/2, etc.
\usepackage{microtype} % microtypography
\usepackage[]{algorithm2e}
\usepackage{fancyvrb}
\usepackage{mathtools}
\DeclarePairedDelimiter{\ceil}{\lceil}{\rceil}
\usepackage{graphicx}
\usepackage{subcaption}
\usepackage{csquotes}
\usepackage{afterpage}
% ENVIRONMENTS
\usepackage{amsthm}
\theoremstyle{plain}
\newtheorem{definition}{Definition}[section]
\newtheorem{lemma}[definition]{Lemma}
\newtheorem{proposition}[definition]{Proposition}
\newtheorem{corollary}[definition]{Corollary}
\newtheorem{example}[definition]{Example}
\theoremstyle{remark}
\newtheorem{remark}[definition]{Remark}
% TEXT-MODE MACROS
\newcommand{\IE}{\mathbb{E}}
\newcommand{\IN}{\mathbb{N}}
\newcommand{\IR}{\mathbb{R}}
\newcommand{\IZ}{\mathbb{Z}}
\newcommand{\union}{\cup}
\newcommand{\Union}{\bigcup}
\newcommand{\code}{\texttt} % for use as \code{test()}
\newcommand{\defn}{\emph} % for use as \defn{...}
% MATH-MODE MACROS
\newcommand{\qtext}[1]{\quad\text{#1}\quad} % for use as \def{...} inside text
\newcommand{\ra}{\rightarrow}
\newcommand{\lra}{\longrightarrow}
\newcommand{\floor}[1]{\lfloor#1\rfloor}
\newcommand{\abs}[1]{|#1|}
\newcommand{\eps}{\epsilon}
\newcommand{\float}{\mathrm{float}}
\newcommand{\interval}[1]{[#1)}
\title{circllhist}
\subtitle{A Log-Linear Histogram Data Structure for IT Infrastructure Monitoring}
\author{
Heinrich Hartmann \\
\texttt{[email protected]} \\
Circonus \\
\And
Theo Schlossnagle \\
\texttt{[email protected]} \\
Circonus
}
\begin{document}
\maketitle
\begin{abstract}
The circllhist histogram is a fast and memory-efficient data structure for summarizing large
numbers of latency measurements. It is particularly suited for applications in IT infrastructure
monitoring, and provides nanosecond data insertion, full mergeability, and accurate approximation of
quantiles with a priori bounds on the relative error.
Open-source implementations are available for C, Lua, Python, Go, Java, and JavaScript.
\end{abstract}
\section{Introduction}
Latency measurements have become an important part of IT infrastructure and application monitoring.
The latencies of a wide variety of events, like requests, function calls, garbage collection, disk
I/O, system calls, and CPU scheduling, are of great interest to engineers operating and developing IT systems.
There are a number of technical challenges associated with managing and analyzing latency data. The
volume emitted by a single data source can easily become very large. Furthermore, data has to be
collected and aggregated from a large number of different sources. The data has to be stored over
long time periods (months, years) in order to allow historic comparisons and long-term service
quality estimations (SLOs).
In order to address these challenges, a compression scheme has to be applied that drastically
reduces the size of the data to be stored and transmitted. Such a compression scheme needs to allow,
at minimum, (1) arbitrary aggregation of already compressed data, (2) accurate quantile
approximations with a priori bounds on the relative error, and (3) accurate counting of requests above
or below a given threshold. Furthermore, it is beneficial if (4) information about the full
distribution is retained, so that general probabilistic modeling techniques can be applied.
Traditionally, monitoring tools either store raw data on which calculations are performed (e.g.
ELK\footnote{\url{https://www.elastic.co/what-is/elk-stack}}) or compute latency quantiles on each host
separately and store them as numeric time series
(e.g. statsd\footnote{\url{https://github.com/statsd/statsd}}). Both approaches have obvious
drawbacks. The high volume of data makes raw data storage uneconomical for sources like request
latencies, and impractical for high-volume sources like function-call or system-call latencies.
Direct calculation of quantiles does not allow further aggregation, so accurate quantiles for
the total population cannot be calculated.
\begin{figure}
\includegraphics[width=\textwidth]{assets/demo.png}
\caption{Circllhist representation of 20M block-level disk I/O latencies measured over the course
of one month.}
\label{fig:demo}
\end{figure}
Circonus has addressed this problem with the circllhist data-structure that we describe in this document.
It has been in production use since 2011.
Figure \ref{fig:demo} shows an example of a visualized circllhist in the Circonus product.
We have presented this work at various conferences and in blog posts (e.g. \cite{TS0},\cite{TS1},\cite{TS2}, \cite{HH1}).
An open-source implementation is available at \cite{libcircllhist}.
However, no academic paper has been published until now.
This document describes the circllhist data-structure and compares it to other methods
that have been adopted by other monitoring vendors since then.
\section{Related Work} \label{sec:rw}
There is a fair amount of work in the academic literature, dating back to the 1980s, on the problem
of efficiently calculating aggregated quantiles. This work runs under the names ``mergeable summaries''
and ``quantile sketches''. A good summary of these methods can be found in \cite{dd}, Section 1.2.
Here we focus on methods that have been adopted in practice.
A very similar approach to ours has been suggested by G. Tene from Azul Systems, who developed
the High Dynamic Range (HDR) Histogram data-structure \cite{hdr} to capture latency
data for benchmarking applications.
T. Dunning and O. Ertl developed the t-digest data structure in \cite{tdigest}, which is used in the
Wavefront monitoring product. The t-digest aggregates
nearby points into clusters of adaptive size, in such a way that high-resolution data is available
at the tails of the distribution, where it is most critical for applications.
The Prometheus monitoring system \cite{prom} has added a simple histogram data-type that allows a
rough summarization of the distribution with a set of numeric time series.
Most recently, Datadog has published a logarithmic histogram data structure, DDSketch, in \cite{dd}.
In the next section we will develop some theory around general histogram summaries that allows us
to precisely define HDR Histograms, DDSketches, and the circllhist in Section \ref{sec:eval}.
\section{Theory}
\begin{figure}
\includegraphics[width=\textwidth]{assets/LLBins.png}
\caption{Construction of the Log-Linear Binning}
\label{fig:llbins}
\end{figure}
In this section we develop an abstract theory of histograms to a degree that allows us to
formally define the circllhist as a linear refinement of a logarithmic histogram structure.
The basic idea behind log-linear histograms like the circllhist is illustrated in Figure \ref{fig:llbins}.
We start with a logarithmic binning of the real axis that has bin boundaries at the powers of ten:
\begin{center}
\begin{BVerbatim}
... 0.01, 0.1, 1, 10, 100, ...
\end{BVerbatim}
\end{center}
We divide each logarithmic bin into $n=90$ equally spaced segments. In this way the bin boundaries
are precisely the base-10, precision-2 floating point numbers:
\begin{center}
\begin{BVerbatim}
... 1.0, 1.1, 1.2, ... 9.9,
10, 11, 12, ... 99,
100, 110, 120, ... 990, ...
\end{BVerbatim}
\end{center}
Those are the bin boundaries for the circllhist data structure.
When samples are inserted into the circllhist, we retain counts of the number of samples in each bin.
This information allows us to approximate the original location of the inserted samples with a
maximal relative error less than $5\%$.
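To make the binning concrete, the following Python sketch (our illustration, not the library
implementation) computes the lower boundary of the base-10, precision-2 bin containing a positive
sample:
\begin{Verbatim}[fontfamily=tt]
import math

def bin_boundary(x):
    # lower boundary of the base-10, precision-2 bin containing x > 0
    e = math.floor(math.log10(x))    # power-of-ten bin [10^e, 10^(e+1))
    d = math.floor(x / 10**(e - 1))  # linear segment, d in {10, ..., 99}
    return d * 10**(e - 1)

print(bin_boundary(12.34))  # 12, i.e. the bin [12, 13)
print(bin_boundary(250))    # 250, i.e. the bin [250, 260)
\end{Verbatim}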
\subsection{Binnings}
\begin{definition}
Let $D \subset \IR$ be a connected subset of the real axis (e.g. $D=\IR, D=[0,1)$).
A binning of $D$ is a collection of intervals $Bin[i], i \in I$, that are disjoint and collectively cover the binning domain $D$:
\begin{align*}
D = \Union_{i\in I} Bin[i] \qtext{and} Bin[i] \cap Bin[j] = \emptyset \qtext{for} i \neq j.
\end{align*}
The map that associates to each $x \in D$ the unique index $i$ such that $x \in Bin[i]$ is called
the binning map and is denoted by $bin(x) = i$.
\end{definition}
\begin{remark}
The binning map $bin: D \ra I$ determines the binning via $Bin[i] = \{ x \in D \,|\, bin(x) = i \}$.
\end{remark}
\begin{example}
The linear binning of $\IR$ is given by $I = \IZ$, with
\begin{align*}
Bin[i] = [i, i+1) \qtext{and} bin(x)=\floor{x}
\end{align*}
\end{example}
\begin{example}
The length $n$ linear binning of $[0,1)$ is given by $I = \{0, \dots, n-1\}$, with
\begin{align*}
Bin^{Lin}_n[i] = [ \frac{i}{n}, \frac{i+1}{n} )
\qtext{and}
bin^{Lin}_n(x) = \floor{x \cdot n}
\end{align*}
\end{example}
\begin{example}\label{example:log}
The logarithmic binning with basis $b > 0$ of $\IR_{>0}$ is given by $I=\IZ$, with
\begin{align*}
Bin^{Log}_b[i] = [b^i, b^{i+1})
\qtext{and}
bin^{Log}_b(x)=\floor{\log_b(x)}
\end{align*}
\end{example}
\begin{definition}\label{ref}
Given a map $\alpha: I \ra J$, and a binning $(I, Bin)$, we can define a new binning
$(J, Bin^*)$ by setting:
\begin{align*}
Bin^*[j] = \Union_{i, \alpha(i) = j} Bin[i], \qtext{and} bin^*(x) = \alpha(bin(x))
\end{align*}
In this situation we call $(J, Bin^*)$ a coarsening of $(I, Bin)$, and $(I, Bin)$ a refinement of $(J, Bin^*)$.
\end{definition}
\begin{definition}
Let $Bin[i], i\in I$ be a binning with half-open bins $Bin[i] = [a_i, b_i)$.
The length-n linear refinement of $(I, Bin)$ is given by the index set $I \times \{0,\dots, n-1\}$
and the bins
\begin{align}\label{eq:lref}
L_nBin[i,j] = [ a_i + \frac{j}{n}(b_i - a_i), a_i + \frac{j+1}{n}(b_i - a_i) )
\end{align}
\end{definition}
\begin{lemma}
The binning map of the length-n linear refinement is given by:
\begin{align}\label{eq:lrefmap}
L_nbin(x) = ( bin(x), \floor{\frac{x - a_{bin(x)}}{b_{bin(x)} - a_{bin(x)}} \cdot n } ).
\end{align}
\end{lemma}
\begin{proof}
We have to show that $x \in L_nBin[ L_nbin(x) ]$ for all $x \in D$.
Let $(i,j) = L_nbin(x)$.
Since $i = bin(x)$, we know that $x \in [a_i, b_i)$.
Now we consider the linear map $\phi(x) = (x-a_i)/(b_i-a_i)$ which maps $Bin[i]$ bijectively to $[0,1)$.
We have $j = \floor{ n \phi(x) }$ by definition of $L_nbin(x)$.
To show that $x \in L_nBin[ L_nbin(x) ]$ it suffices to verify
that $\phi(x) \in \phi(L_nBin[i,j]) = [ \frac{j}{n}, \frac{j+1}{n} )$.
And indeed,
\begin{align*}
\frac{j}{n} = \frac{\floor{ n \phi(x) }}{n} \leq \phi(x) <
\frac{\floor{ n \phi(x) } + 1 }{n} = \frac{j+1}{n}.
\end{align*}
\end{proof}
\begin{lemma}
The length-n linear refinement of a binning $(I, Bin)$ is a refinement in the sense of Definition \ref{ref}.
The index map is given by $\alpha(i,j) = i$ for $i \in I$, $j \in \{0,\dots,n-1\}$.
\end{lemma}
\begin{proof}
We have to show that $\Union_{j} L_nBin[i,j] = Bin[i]$, for all $i \in I$.
Again we consider the linear bijection $\phi(x) = (x-a_i)/(b_i-a_i)$, which maps $Bin[i]$ to $[0,1)$ and $L_nBin[i,j]$ to $[j/n, (j+1)/n)$.
Hence it suffices to show that the intervals $[j/n, (j+1)/n)$, $j=0,\dots,n-1$, cover $[0,1)$, which is evident.
\end{proof}
Now we are in a position to define log-linear binnings.
\begin{definition}\label{def:ll}
The base $b$, length $n$ log-linear binning of $\IR_{>0}$ is the length-n linear refinement of the base-b logarithmic binning of $\IR_{>0}$.
\end{definition}
\begin{proposition}\label{prop:ll}
Let $b,p$ be positive integers. The boundaries of the base $b$, length $n = b^p - b^{p-1}$ log-linear
binning of $\IR_{>0}$ are precisely the base-b precision-p floating point numbers:
\begin{align*}
\float_{b,p}(e, d) = \frac{d}{b^{p-1}} \cdot b^e = d \cdot b^{e-p+1} \qtext{with} e \in \IZ, d \in \{ b^{p-1}, \dots, b^p - 1 \}
\end{align*}
The binning map is given by
\begin{align*}
bin(x) = (e(x), d(x) - b^{p-1}), \qtext{with} e(x) = \floor{\log_b(x)},\; d(x) = \floor{x \cdot b^{-e(x) + p - 1}}
\end{align*}
with values $e(x) \in \IZ$ and $d(x) \in \{b^{p-1}, \dots, b^{p} - 1\}$.
This binning is also called base-b precision-p log-linear binning.
\end{proposition}
\begin{example}
According to Proposition \ref{prop:ll}, the base-10 precision-1 binning has bin boundaries at
\begin{align*}
\{ d \cdot 10^e \,|\, e \in \IZ, d \in \{ 1, \dots, 9 \} \} = \{ \dots 0.8, 0.9,\; 1, 2 \dots, 8, 9,\; 10, 20 \dots \}
\end{align*}
and the binning map
\begin{align*}
bin(x) = (e(x), d(x) - 1), \qtext{with} e(x) = \floor{\log_{10}(x)},\; d(x) = \floor{x / 10^{e(x)}}.
\end{align*}
\end{example}
\begin{proof}
To prove Proposition \ref{prop:ll}, we compute the log-linear bin boundaries using Equation \ref{eq:lref}
with $a_i = b^i, b_i = b^{i+1}$ and $n = b^p - b^{p-1}$:
\begin{align*}
Bin[e,j] &= [ b^e + \frac{j}{b^p - b^{p-1}}(b^{e+1} - b^e), b^e + \frac{j + 1}{b^p - b^{p-1}}(b^{e+1} - b^e) ) \\
&= [ b^{e-p+1}(b^{p-1} + j), b^{e-p+1}(b^{p-1} + j + 1) ) \\
&= [ d b^{e-p+1}, (d + 1) b^{e-p+1} )
\end{align*}
where we set $d = b^{p-1} + j$. If $j$ runs through $0, \dots, n-1$, then $d$ runs through $b^{p-1},\dots, b^p -1$.
This shows that the lower boundaries are exactly the base-b precision-p floating point numbers.
For the upper boundary, note that if $d=b^p-1$ then $(d+1)b^{e-p+1} = b^{p-1} \cdot b^{(e+1)-p+1}$ is again
a base-b precision-p floating point number (with a larger exponent).
The binning map can be explicitly calculated using Equation \ref{eq:lrefmap} as:
\begin{align*}
bin(x) &= (e(x), k(x)), \qtext{with} e(x) = \floor{\log_b(x)} \\
k(x) & = \floor{ \frac{x - b^{e}}{b^{e+1} - b^e} (b^p - b^{p-1}) }
= \floor{x \cdot b^{-e + p - 1}} - b^{p-1}
= d(x) - b^{p-1}
\end{align*}
as claimed.
\end{proof}
\begin{definition}\label{def:circllhist}
The circllhist binning is the base-10 precision-2 log-linear binning, extended to the whole real axis,
with bins:
\begin{align*}
Bin[+1,e,d] &= [ d \cdot {10}^{e-1}, (d + 1) 10^{e-1} ), \quad e \in \IZ, d \in \{ 10, \dots, 99 \} \\
Bin[0,0,0] &= \{ 0 \} \\
Bin[-1,e,d] &= ( -(d + 1) \cdot 10^{e-1}, -d \cdot {10}^{e-1} ].
\end{align*}
The binning map is given by:
\begin{align*}
bin(x) &= (+1, e, d), \quad e = \floor{\log_{10}(x)},\; d = \floor{x \cdot 10^{-e + 1}} \\
bin(0) &= (0, 0, 0) \\
bin(-x) &= (-1, e, d)
\end{align*}
for $x > 0$.
\end{definition}
\begin{proposition} \label{prop:rec}
The binning map of the circllhist can be recursively computed with the following algorithm:
\begin{BVerbatim}[fontfamily=tt]
function bin(x)
if x == 0:
return (0,0,0)
if x < 0:
(s, e, d) := bin(-x)
return (-1, e, d)
if x < 10:
return bin(x * 10) - (0, 1, 0)
if x >= 100:
return bin(x / 10) + (0, 1, 0)
else: # 10 <= x < 100:
return (+1, 1, floor(x))
end
\end{BVerbatim}
In particular, the circllhist binning map can be computed without use of the logarithm function.
\end{proposition}
\begin{proof}
The recursion terminates since every positive number $x$ can be brought into the range $10 \leq x <
100$ with a finite number of divisions or multiplications by $10$. It is straightforward to
verify that each case computes valid results, assuming that the results of the recursive call are
correct.
\end{proof}
\begin{remark}
If $x,e$ are integers, then the binning map of the number $x \cdot 10^{e}$ can be computed without
the use of floating point arithmetic as $bin(x) + (0,e,0)$, using the algorithm from Proposition \ref{prop:rec}.
This is of practical relevance in environments which do not have floating
point arithmetic available, examples being nanosecond latencies measured in the Linux kernel,
or embedded devices.
\end{remark}
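A direct Python transcription of this recursion may serve as an illustration (a sketch; the
optimized C implementation in \cite{libcircllhist} differs):
\begin{Verbatim}[fontfamily=tt]
def bin_recursive(x):
    # circllhist binning map (sign, exponent, mantissa), no logarithms
    if x == 0:
        return (0, 0, 0)
    if x < 0:
        s, e, d = bin_recursive(-x)
        return (-1, e, d)
    if x < 10:
        s, e, d = bin_recursive(x * 10)
        return (s, e - 1, d)
    if x >= 100:
        s, e, d = bin_recursive(x / 10)
        return (s, e + 1, d)
    return (+1, 1, int(x))       # base case: 10 <= x < 100

print(bin_recursive(12.3))   # (1, 1, 12): the bin [12, 13)
print(bin_recursive(1230))   # (1, 3, 12): the bin [1200, 1300)
\end{Verbatim}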
\subsection{Pareto Midpoints}
\begin{proposition} \label{prop:pdist}
Given an interval $[a,b]$ in $\IR_{>0}$, the unique point $m$ in $[a,b]$ for which the maximal
relative distance $rd(m, y) = |m-y|/y$ to all other points in $[a,b]$ is minimized
is the Pareto midpoint
$m = 2ab / (a + b)$.
The maximal relative distance to the Pareto midpoint is attained at the interval boundaries:
$rd(m,a) = rd(m,b) = (b - a) / (a + b)$.
\end{proposition}
\begin{proof}
We have to minimize the function $maxrd(x) = max_{y\in[a,b]} rd(x, y)$ over $[a,b]$.
The maximum $max_{y\in[a,b]} rd(x, y)$ is attained either for $y = a$ or $y = b$,
hence $maxrd(x) = max\{ rd(x,a), rd(x,b) \}$.
Note that the function $f(x) = rd(x, a)$ is continuous and strictly monotonically increasing on $[a,b]$, with $f(a) = 0, f(b) > 0$,
and the function $g(x) = rd(x, b)$ is continuous and strictly monotonically decreasing on $[a,b]$, with $g(a) > 0$ and $g(b) = 0$.
The point $m$ is the unique point in $[a,b]$ where both functions are equal, with
\begin{align*}
rd(m, a) = \frac{b - a}{a + b} = rd(m, b)
\end{align*}
Now if $a \leq x < m$ then $rd(x, b) = g(x) > g(m)$ and so $maxrd(x) > maxrd(m) = g(m)$.
Similarly if $m < x \leq b$ then $rd(x, a) = f(x) > f(m)$ and so $maxrd(x) > maxrd(m) = f(m)$.
This shows that $x=m$ is the unique minimum of $maxrd(x)$ on $[a,b]$.
\end{proof}
The following proposition gives a probabilistic interpretation of the location of the relative-distance-minimizing midpoint.
Recall that the expected value of a uniformly distributed random variable $X \sim U[a,b]$ is the midpoint $\IE[X] = (a+b)/2$.
\begin{proposition}
Given an interval $[a,b]$ and a Pareto-distributed random variable $X$ with tail index $\alpha=2$,
the Pareto midpoint is the conditional expectation:
\begin{align*}
\IE[ X \, | \, X \in [a,b] \,] = 2ab / (a + b).
\end{align*}
\end{proposition}
\begin{proof}
The Pareto distribution has density $p(x) = C \cdot 1/x^{\alpha+1}$ for some positive constant $C$, so for $\alpha=2$ we get
\begin{align*}
\IE[ X \, | \, X \in [a,b] \,] = \frac{\int_a^b x p(x) dx}{\int_a^b p(x) dx}
= \frac{\int_a^b x^{-2} dx}{\int_a^b x^{-3} dx} = 2 \frac{b^{-1} - a^{-1}}{b^{-2}- a^{-2}}
= \frac{2}{b^{-1} + a^{-1}}
= 2 \frac{ab}{a + b}
\end{align*}
which proves the claim.
\end{proof}
\begin{proposition}\label{prop:21}
The maximal relative distance to a Pareto midpoint in the circllhist binning is $1/21 \approx 4.76\%$.
\end{proposition}
\begin{proof}
Substituting the bin boundaries into the formula given in Proposition \ref{prop:pdist}, we find
$(b - a)/(a + b) = 1/(2d + 1)$ for the bin $Bin[e,d], e \in \IZ, d \in \{ 10, \dots, 99 \}$.
This is maximized for $d = 10$, with a value of $1/21$, as claimed.
\end{proof}
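The bound is also easy to verify numerically; a small Python sketch (ours) scans all mantissa values:
\begin{Verbatim}[fontfamily=tt]
def max_rel_dist(d):
    # Pareto-midpoint worst-case error of the bin [d, d+1):
    # (b - a)/(a + b) = 1/(2d + 1), independent of the exponent
    return 1 / (2 * d + 1)

worst = max(max_rel_dist(d) for d in range(10, 100))
print(worst, 1 / 21)  # both 0.047619...; attained at d = 10
\end{Verbatim}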
\subsection{Histograms}
Once we have established the binnings, histograms are easy to define.
\begin{definition}\label{def:hist}
A histogram with domain $D \subset \IR$ is a binning $Bin[i],I$, together with a count function $H: I \ra \IN_{0}$.
\end{definition}
Given a dataset and a binning, we can associate a histogram.
\begin{definition}
Given binning $Bin[i],I$ of $D \subset \IR$, and a dataset $X = (x_1,\dots,x_n)$ with values in
$D$, we define the histogram summary of $X$ as the histogram with binning $Bin[i],I$ and count
function
\begin{align*}
H_X(i) = \# \{ j \, | \, x_j \in Bin[i] \, \}.
\end{align*}
This means that $H_X(i)$ counts the number of points of $X$ lying in $Bin[i]$.
\end{definition}
Histograms can be freely merged without losing information.
\begin{definition}
Let $H_1, H_2$ be histograms for the same binning. The merged histogram $H_1 + H_2$ has count function:
\begin{align*}
(H_1+H_2)(i) = H_1(i) + H_2(i).
\end{align*}
\end{definition}
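On a sparse representation, merging amounts to adding counts bin by bin; a minimal Python sketch,
assuming histograms are stored as maps from bin index to count:
\begin{Verbatim}[fontfamily=tt]
from collections import Counter

def merge(h1, h2):
    # merge two sparse histograms given as {bin_index: count} maps
    merged = Counter(h1)
    merged.update(h2)  # adds counts for bins present in both
    return dict(merged)

h1 = {(1, 1, 12): 3, (1, 2, 45): 1}
h2 = {(1, 1, 12): 2}
print(merge(h1, h2))   # {(1, 1, 12): 5, (1, 2, 45): 1}
\end{Verbatim}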
The merge operation is clearly associative and commutative.
We can easily see that the histogram merge is compatible with the merge (concatenation) of datasets:
\begin{proposition}
Given binning $Bin[i],I$ of $D \subset \IR$, and two datasets $X = (x_1,\dots,x_n), Y=(y_1,\dots,y_m)$ with
values in $D$. Let $Z=(x_1, \dots, x_n, y_1, \dots, y_m)$ be the merged dataset, then $H_Z = H_X + H_Y$.
\end{proposition}
\begin{proof}
We have
\begin{align*}
H_Z(i) &= \# \{ j \, | \, z_j \in Bin[i] \, \} =
\# \{ j \leq n \, | \, x_j = z_j \in Bin[i] \, \} +
\# \{ j > n \, | \, z_j = y_{j-n} \in Bin[i] \, \} \\
&= H_X(i) + H_Y(i). \qedhere
\end{align*}
\end{proof}
\def\hyph{{\hbox{-}}}
\begin{remark}
Let $H$ be a histogram summary of a dataset $X$, and let $y$ be a threshold value. The count functions
$count\hyph below_X(y) = \#\{ i | x_i < y \}$ and $count\hyph above_X(y) = \#\{ i | x_i \geq y \}$
are of great practical interest. If the bin boundaries line up with the threshold, we can
recover the exact values of those functions from the histogram.
In the case of the circllhist, we get exact counts for every 2-digit precision decimal floating point number.
For logarithmic histograms we get exact counts at the powers of the base, $b^e, e\in \IZ$.
\end{remark}
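A Python sketch (ours, with an explicit boundary map for brevity) of the exact count-below
computation, valid whenever the threshold is a bin boundary:
\begin{Verbatim}[fontfamily=tt]
def count_below(hist, boundaries, y):
    # exact #(samples < y) when y is a bin boundary;
    # hist: {bin_index: count}, boundaries: {bin_index: (a, b)}
    return sum(c for i, c in hist.items() if boundaries[i][1] <= y)

hist = {0: 4, 1: 7}
boundaries = {0: (10, 11), 1: (11, 12)}
print(count_below(hist, boundaries, 11))  # 4: only [10, 11) lies below 11
\end{Verbatim}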
\subsection{Histogram Operations}
Given a histogram, we can try to approximate statistics (means, percentiles, etc.) of the original dataset.
There are three strategies for doing so that we have found valuable in practice:
\begin{enumerate}
\item Derive a probability distribution from a histogram (with uniform distribution inside the
bins), and apply the statistics to this distribution.
\item Resample the dataset by placing all points inside a bin at equally spaced distances ({\it fair resampling}).
\item Resample the dataset by placing all points inside a bin at the (Pareto) midpoints ({\it midpoint resampling}).
\end{enumerate}
From a theoretical perspective the probabilistic strategy is the most natural, and generally gives
good results for data sampled from continuous distributions (with a density). Midpoint resampling is
the simplest and gives low a priori error bounds independent of the distribution of $X$. Fair
resampling mediates between both approaches, and gives lower expected errors on a wide variety of
datasets while avoiding large relative errors in the case of single-sample bins.
The current implementation of circllhist uses midpoint resampling to define sum, mean, stddev and
moment estimation. For percentile calculations, fair resampling is used.
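For illustration, a Python sketch (ours, using the sparse map layout from above) of
midpoint-resampled mean estimation:
\begin{Verbatim}[fontfamily=tt]
def pareto_midpoint(sign, e, d):
    # Pareto midpoint 2ab/(a+b) of Bin[sign,e,d]; zero bin not handled
    a, b = d * 10.0**(e - 1), (d + 1) * 10.0**(e - 1)
    return sign * 2 * a * b / (a + b)

def hist_mean(hist):
    # hist: sparse {(sign, e, d): count} map
    count = sum(hist.values())
    total = sum(c * pareto_midpoint(*k) for k, c in hist.items())
    return total / count

print(hist_mean({(1, 1, 12): 2, (1, 1, 45): 1}))  # ~23.48
\end{Verbatim}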
\begin{proposition}
Let $X=(x_1,\dots,x_n), n>0$ be a dataset with values in $\IR_{>0}$, let $H$ be the circllhist summary of $X$.
For each bin $Bin[i]$, let $c_i \in Bin[i]$ be the Pareto midpoint. Let
\begin{align*}
count[H] = \sum_i H(i), \quad sum[H] = \sum_{i\in I} H(i) \cdot c_i, \qtext{and} mean[H] = sum[H] / count[H].
\end{align*}
then
\begin{enumerate}
\item $count[H] = count(X) = n$.
\item The relative error $|sum[H] - sum(X)| / sum(X)$ is smaller than $1/21 \approx 4.76\%$.
\item The relative error $|mean[H] - mean(X)| / mean(X)$ is smaller than $1/21 \approx 4.76\%$.
\end{enumerate}
\end{proposition}
\begin{proof}
The first claim follows immediately from the definitions of $count[H]$ and $H_X$.
For the second claim, we have
\begin{align*}
sum[H] - sum(X) = \sum_{i\in I} H(i) \cdot c_i - \sum_j x_j
= \sum_{i\in I} ( H(i) \cdot c_i - \sum_{j, x_j \in Bin[i]} x_j)
= \sum_{i\in I} \sum_{j, x_j \in Bin[i]} (c_i - x_j)
\end{align*}
And hence,
\begin{align*}
|sum[H] - sum(X)| \leq \sum_{i,j, x_j \in Bin[i]} |c_i - x_j| \leq \sum_{j} \frac{1}{21} |x_j| = \frac{1}{21} \cdot sum(X),
\end{align*}
where we used Proposition \ref{prop:21} in the second step.
This shows the second claim.
The third claim follows from the second, since $mean[H] = sum[H]/n$ and $mean(X) = sum(X)/n$.
\end{proof}
\subsection{Quantiles}
\label{sec:quantiles}
Quantiles can be approximated from histograms using any of the strategies outlined at the beginning
of the last section. The probabilistic strategy has the downside that, for $q$ close to $1$, the
$q$-quantile will always be near the high end of the largest bin with at least one sample in
it. Oftentimes this bin contains only a single sample. In this case the worst-case error of the
full bin size can easily be attained.
Midpoint resampling has the downside that expected errors for percentiles in the center of the
distribution are much larger than needed. For densely populated bins the uniform distribution is
often a good approximation, and can be used to estimate percentiles with high precision in typical
cases.
Fair resampling offers a welcome middle route. Instead of placing all samples at the (Pareto)
midpoint, or smoothing them out with a uniform distribution, we place the samples at equally
spaced positions within each bin. In this way, densely populated bins will be approximated by a near
uniform distribution, and bins with only a single sample will be replaced by a single sample at the
midpoint, reducing the worst-case error to half the bin size.
\begin{definition}
Given a histogram $H$ on a binning $Bin[i], i \in I$, we define the fair resampling $\hat{X}$ of $H$ as follows.
For each bin $Bin[i]$, with boundaries $a,b$ and count $H(i)=n$, we consider the points
\begin{align*}
x_{i,k} = a + \frac{k}{n+1} (b-a) \qtext{for} k=1,\dots,n
\end{align*}
and let $\hat{X} = (x_{i,k} \,|\, i \in I, k = 1,\dots,H(i))$. This is a dataset with $count[H]$ points.
Similarly, we define the midpoint resampling of $H$, with $H(i)$ samples at the midpoint of $Bin[i]$,
and the Pareto midpoint resampling of $H$, with $H(i)$ samples at the Pareto midpoint of $Bin[i]$.
\end{definition}
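A Python sketch (ours) of the fair resampling, again with bin boundaries passed in explicitly:
\begin{Verbatim}[fontfamily=tt]
def fair_resample(hist, boundaries):
    # place the n samples of each bin at a + k/(n+1)*(b-a), k = 1..n;
    # hist: {bin_index: count}, boundaries: {bin_index: (a, b)}
    out = []
    for i, n in hist.items():
        a, b = boundaries[i]
        out.extend(a + k / (n + 1) * (b - a) for k in range(1, n + 1))
    return sorted(out)

print(fair_resample({0: 1, 1: 3}, {0: (10, 11), 1: (11, 12)}))
# [10.5, 11.25, 11.5, 11.75]: single-sample bins land at the midpoint
\end{Verbatim}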
We now want to define the quantiles of a histogram as quantiles of the fair resampling of $H$.
Before we can do so, we first need to clarify the quantile definition for datasets.
While there is only a single established quantile definition for probability distributions, there
are multiple different quantile definitions for datasets found in the wild.
A comprehensive list was compiled by Hyndman and Fan in 1996 \cite{HF1996}.
For our discussion we will need the type-1 and type-7 quantiles from the Hyndman-Fan list,
as well as two other quantile functions used by data-structures covered in our evaluation.
\begin{definition} \label{def:quantiles}
Given a dataset $X=(x_1,\dots,x_n), n>0$ of real numbers, and a number $q \in [0,1]$.
Let $x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(n)}$ be the ordered version of $X$.
We define the minimal type-1 q-quantile as
\begin{align*}
Q^1_0(X) = x_{(1)} \qtext{and} Q^1_q(X) = x_{(\ceil{q \cdot n})} \qtext{for} q > 0.
\end{align*}
We define the minimal type-7 quantile as
\begin{align*}
Q^{7}_q(X) = x_{(\floor{q \cdot (n - 1)} + 1)}.
\end{align*}
The interpolated type-7 quantile is given by:
\begin{align*}
Q^{7i}_q(X) = (1 - \gamma) \cdot x_{(\floor{q \cdot (n - 1)} + 1)} + \gamma \cdot x_{(\ceil{q \cdot (n - 1)} + 1)},
\end{align*}
where the interpolation factor $\gamma \in [0,1]$ is given by $\gamma = q(n-1) - \floor{q(n-1)}$.
The type-hdr quantile $Q^{hdr}_q(X)$ is given by $x_{(1)}$ if $qn \leq \frac{1}{2}$, by $x_{(n)}$
if $qn \geq n - \frac{1}{2}$, and otherwise, by
\begin{align*}
Q^{hdr}_q(X) &= x_{(\floor{q n + \frac{1}{2}})}.
\end{align*}
Similarly, the type-tdigest quantile is given by $x_{(1)}$ if $qn \leq \frac{1}{2}$, by $x_{(n)}$
if $qn \geq n - \frac{1}{2}$, and otherwise, by
\begin{align*}
Q^{tdigest}_q(X) &= (1 - \gamma) \cdot x_{(\floor{q n + \frac{1}{2}})} + \gamma \cdot x_{(\ceil{q n + \frac{1}{2}})}
\end{align*}
where $\gamma = q n + \frac{1}{2} - \floor{q n + \frac{1}{2}}$.
\end{definition}
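As a concrete reference, a Python sketch (ours) of the minimal type-1 quantile, applied to the
dataset $[1,2,3,4]$ used in Figure \ref{fig:qf}:
\begin{Verbatim}[fontfamily=tt]
import math

def q1(xs, q):
    # minimal type-1 q-quantile: order statistic of rank ceil(q*n)
    xs = sorted(xs)
    if q == 0:
        return xs[0]
    return xs[math.ceil(q * len(xs)) - 1]  # 1-based rank

print([q1([1, 2, 3, 4], q) for q in (0, 0.25, 0.5, 0.75, 1)])
# [1, 1, 2, 3, 4]
\end{Verbatim}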
\begin{figure}
\begin{subfigure}{0.5\textwidth}
\includegraphics[width=\textwidth]{assets/tquant-4.png}
\caption{Theoretical Quantile Functions}
\label{fig:tqf}
\end{subfigure}
\begin{subfigure}{0.5 \textwidth}
\includegraphics[width=\textwidth]{evaluation/images/practical_quantile_comparison.png}
\caption{Computed Quantile Functions}
\label{fig:pqf}
\end{subfigure}
\caption{Quantile Function Comparison for the Dataset $[1,2,3,4]$}
\label{fig:qf}
\end{figure}
Figure \ref{fig:tqf} shows a plot of these quantile definitions as functions of $q$. From a
theoretical perspective, type-1 quantiles are the most natural, since they correspond to the
probabilistic quantiles of the empirical distribution function of the dataset $X$. Also, they allow
us to formulate precise statements about lower counts (cf. Proposition \ref{prop:count} below). For
this reason the circllhist implementation uses minimal type-1 quantiles on the fair resampling of
the histogram for quantile calculations.
Somewhat surprisingly, type-7 quantiles are most commonly found in software packages.
E.g. numpy \cite{numpy} computes interpolated type-7 quantiles by default.
Also DDSketch \cite{dd} and t-digest \cite{tdigest} implement variants of type-7 quantiles.
A detailed comparison between type-1 and type-7 quantiles can be found in \cite{HH19}.
The t-digest and the HDR Histograms use custom quantile functions that were not covered in Hyndman-Fan.
We call them ``type-hdr'' and ``type-tdigest'' here.
Figure \ref{fig:pqf} shows a plot of the quantile functions, produced by all relevant implementations.
\begin{proposition}\label{prop:count}
Let $X$ be a dataset of length $n > 1$, $y \in \IR$ a threshold value and $0 < q \leq 1$.
Then the following statements are equivalent:
\begin{itemize}
\item[(1)] There are at least $q n$ datapoints $x$ in $X$ with $x \leq y$.
\item[(2)] We have $Q^1_q(X) \leq y$.
\end{itemize}
\end{proposition}
\begin{proof}
To show (1) implies (2), assume that there are at least $q n$ datapoints at or below $y$.
Since the number of datapoints is always an integer, the same holds for $r = \ceil{q n}$.
This implies that the ordered values $x_{(1)}, \dots, x_{(r)}$ all lie at or below $y$,
and hence $x_{(r)} = Q^1_q(X) \leq y$.
To show (2) implies (1), assume that $Q^1_q(X) \leq y$; then the $r = \ceil{n q}$ values $x_{(1)},
\dots, x_{(r)} = Q^1_q(X)$ all lie at or below $y$. This shows that at least $n q$
datapoints lie at or below $y$.
\end{proof}
\begin{proposition}
Let $X$ be a dataset with $n > 0$ positive values and $0 < q \leq 1$. Let $H$ be the circllhist summary of $X$.
We denote by $\hat{X}^f(H)$ the fair resampling, and by $\hat{X}^p(H)$ the Pareto-midpoint resampling of $H$.
\begin{enumerate}
\item The relative error of the estimated type-1 quantile is less than $1/21 \approx 4.76\%$ for
the Pareto-midpoint resampling:
\begin{align*}
|Q^1_q(X) - Q^1_q(\hat{X}^p(H))| \leq \frac{1}{21} Q^1_q(X)
\end{align*}
\item The relative error of the estimated type-1 quantile is less than $1/10 = 10\%$ for
the fair-resampling:
\begin{align*}
|Q^1_q(X) - Q^1_q(\hat{X}^f(H))| \leq \frac{1}{10} Q^1_q(X)
\end{align*}
Furthermore, in case the quantile value falls into a bin with a single sample (typical for
outliers), the maximal relative error is $5\%$.
\end{enumerate}
\end{proposition}
\begin{proof}
Let $r = \ceil{q n}$, so that $Q^1_q(X) = X_{(r)}$.
By construction of the Pareto-midpoint resampling, the $r$-th ordered value $Q^1_q(\hat{X}^p) = \hat{X}^p_{(r)}$
is the Pareto midpoint of the bin containing $X_{(r)}$.
This shows that $|Q^1_q(X) - Q^1_q(\hat{X}^p)| < 1/21 \cdot Q^1_q(X)$ by Proposition \ref{prop:21}.
Similarly, for the fair resampling, the $r$-th ordered value $Q^1_q(\hat{X}^f) = \hat{X}^f_{(r)}$
lies in the bin containing $X_{(r)}$.
Now we claim that the maximal relative distance to any point in the same circllhist bin is $10\%$.
Indeed, the worst-case relative difference is attained for a bin $[10 \cdot 10^e, 11 \cdot 10^e)$ with
$x = 10 \cdot 10^e$ and $y \ra 11 \cdot 10^e$, so that $|x-y|/x \ra 1/10 = 10\%$.
In case there is only a single sample in the bin $[a,b)$ containing $X_{(r)}$, the value
$\hat{X}^f_{(r)}$ will be at the (arithmetic) midpoint of the bin, $m = a+\frac{1}{2}(b-a)$.
Hence, in this case $|X_{(r)} - \hat{X}^f_{(r)}| / X_{(r)} \leq \frac{1}{2} (b-a)/X_{(r)}$.
This number is maximized for $X_{(r)} = a$ and $a = 10 \cdot 10^e$ in the circllhist binning,
in which case $\frac{1}{2} (b-a)/a =\frac{1}{2} \cdot \frac{1}{10} = 5\%$.
\end{proof}
\begin{example}
With the notation from the last proposition, let $X=(10,10,10, \dots, 10)$ be of length $n$. Then
the fair resampling is given by $\hat{X}^f = (10 + \frac{1}{n+1}, \dots, 10 + \frac{n}{n+1})$.
So $Q^1_1(\hat{X}^f) = 10 + \frac{n}{n+1}$, which converges to $11$ as $n \ra \infty$.
On the other hand, $Q^1_1(X) = max(X) = 10$. So the worst-case error of 10\% is attained in
the asymptotic case.
\end{example}
This proposition shows that the worst-case relative error for the circllhist is 10\%. The
theoretical worst case is only realized when a large number of samples falls at the lower
end of a bin, as shown in the last example. This rarely happens in practice. We usually
see a 5\% worst-case relative error at the tails of the distribution, and high-accuracy
quantiles (<1\% relative error) in the body of the distribution.
\section{The Circllhist Implementation}
The data-structure implemented in \cite{libcircllhist} is a circllhist in the sense of
Definition~\ref{def:circllhist}, with a limited exponent range, such that both the exponent $e$ and
the mantissa $d$ can be represented by 8-bit integers.
The counts $H(i)$ are represented as 64-bit unsigned integers.
The largest representable number of the circllhist is $99 \cdot 10^{127}$.
This number is larger than the age of the universe (13.8 billion years) measured in nanoseconds.
The smallest representable positive number is $10 \cdot 10^{-128}$.
This number is smaller than the Planck time ($5.39 \cdot 10^{-44}$\,s) measured in years.
We have found this value range to be sufficient for all practical purposes.
Within that range the circllhist bins have a relative size of 1\%-10\% of the values contained in
them. Hence, the original values can be reconstructed with a relative error of no more than 10\%.
If the reconstructed values are placed at the paretro midpoint of the bin, the relative reconstruction
error can be reduced to 4.76\% (see Proposition \ref{prop:21}).
The accuracy of approximations of statistical quantities like sums, means and quantiles is commonly
much better than 10\%, since the individual reconstruction errors cancel across the dataset.
In the case of quantiles, we will see this in the evaluation below.
The counts of samples above or below a threshold can be reconstructed exactly for thresholds
that fall onto bin boundaries. In the case of the circllhist, those lie at decimal floating point
numbers with two digits of precision (e.g. 0.23, 1.5, 110). In practice, those are the numbers
that humans choose if they have to come up with thresholds.
A circllhist is internally represented as a sparse list of (bin, count) pairs. If a bin has a count
of 0, then it can be skipped in the representation. The serialized form of the histogram will only
contain bins with non-zero counts. The theoretical maximum of used bins is $2 \times 256 \times 90
+ 1 = 46081$ (for sign, exponent, mantissa and the zero bucket), which makes for a maximal size of 461\,KB.
In practice we have never seen histogram data structures come close to this size. Even in
extreme cases, when capturing billions of samples from a nanosecond to minute scale, the number of
allocated bins never exceeded 1000, and the total size stayed below 10\,KB.
Typical histograms in our system have anywhere from 0-200 allocated bins and occupy less than 2\,KB before
compression.
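The worst-case arithmetic is easy to reproduce, assuming (for illustration) roughly ten serialized
bytes per (bin, count) pair:
\begin{Verbatim}[fontfamily=tt]
# worst-case bin count and size, assuming ~10 bytes per (bin, count) pair
bins = 2 * 256 * 90 + 1   # sign x exponent x mantissa, plus the zero bin
print(bins, bins * 10)    # 46081 bins, ~461 KB
\end{Verbatim}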
A notable design goal of the circllhist is its use for measurements inside the kernel or on
low-powered embedded devices. In those environments floating point arithmetic is not available, and
insertion performance is particularly critical. For these purposes the circllhist provides a highly
optimized insertion function that avoids floating point arithmetic entirely (cf. Proposition
\ref{prop:rec}).
The C implementation of libcircllhist includes a number of performance optimizations. It comes with
an optional index structure that avoids iteration over bins when retrieving and inserting data, and
uses static branch annotations to aid CPU branch prediction. With these optimizations we can get
raw insertion latencies down to $\sim 10\,ns$ for integer, and $\sim 80\,ns$ for double
values.\footnote{ These latencies were measured in a tight C loop with the provided
\texttt{test/histogram\_perf.c} script, on a 2\,GHz Intel Xeon CPU. The evaluation in
section~\ref{sec:eval} is Python based and uses different iteration counts and data. }
Implementations of the circllhist are available for a variety of languages including:
\begin{itemize}
\item C/C++ -- \url{https://github.com/circonus-labs/libcircllhist}
\item Python -- \url{https://github.com/circonus-labs/libcircllhist/tree/master/src/python}
\item Go -- \url{https://github.com/circonus-labs/circonusllhist}
\item Lua -- \url{https://github.com/circonus-labs/libcircllhist/tree/master/src/lua}
\item JavaScript -- \url{https://github.com/circonus-labs/circllhist.js}
\item C\# .NET -- \url{https://github.com/circonus-labs/netcircllhist}.
\end{itemize}
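For orientation, a minimal usage sketch of the Python binding follows; the class and method names
shown here are assumptions and should be checked against the repository documentation.
\begin{Verbatim}[fontfamily=tt]
# usage sketch; class/method names are assumptions, see the repository
from circllhist import Circllhist

h1, h2 = Circllhist(), Circllhist()
h1.insert(0.123)   # record a latency sample
h2.insert(4.56)
h1.merge(h2)       # histograms merge freely
print(h1.count(), h1.quantile(0.5))
\end{Verbatim}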
\section{Evaluation}
\label{sec:eval}
\begin{figure}
\includegraphics[width=\textwidth/2]{evaluation/images/dist_API_Latencies.png}
\includegraphics[width=\textwidth/2]{evaluation/images/dist_Simulated_Latencies.png}
\caption{Total Distribution of the ``API Latencies'' and ``Simulated Latencies'' Datasets, with quantile markers.}
\label{fig:ds}
\end{figure}
In this section we present a numerical comparison between the quantile aggregation methods described
in Section \ref{sec:rw}. We will evaluate the precision and performance of quantile calculations,
performance of insertion and merge operations, as well as the size of the resulting data-structures.
The evaluation proceeds in three phases: insertion, merge and quantile calculation. In the insertion
phase, raw data is inserted into the data-structures. Each data-set is split into a number of batches. We
create individual data-structures for each batch. In the merge phase, the data-structures created for
each batch are merged into a single one. In the quantile phase, we perform quantile calculations on
the merged data-structure.
The evaluation was performed using a set of Jupyter notebooks, which are available at
\begin{center}
\url{https://github.com/circonus-labs/circllhist-paper}.
\end{center}
This repository contains the datasets and source code used to generate the exact tables and graphics we
use in this document. We also provide docker images and instructions that should aid the
reproduction of the evaluation results on other machines. We are open to extensions and improvements
to the evaluation setup. Please contact us via email or open a pull request if you identify flaws
or find improvements.
\subsection{Methods}
The following data-aggregation methods are considered for evaluation.
\begin{itemize}
\setlength{\itemindent}{2em}
\item[exact] Exact quantile computation based on numpy arrays \cite{numpy}.
\item[prom] Quantile estimation based on Prometheus Histograms \cite{prom}.
\item[hdr] The HDR Histogram data-structure introduced in \cite{hdr}.
\item[dd] The DDSketch data-structure introduced in \cite{dd}.
\item[t-digest] The t-digest data-structure introduced in \cite{tdigest}.
\item[circllhist] The circllhist data-structure described in this document.
\end{itemize}
The Prometheus monitoring system provides a histogram data-type that consists of a list of ``less
than'' metrics, which count how many samples were inserted that are below manually configured
threshold values. The data of a Prometheus histogram is equivalent to a histogram in the sense of
Definition \ref{def:hist}, with bin boundaries at the configured thresholds. Prometheus Histograms
are included here for their wide use in practice. The method itself is not really comparable, since
the results are highly dependent on the number and location of the chosen thresholds. For the
evaluation, we follow the recommended practice of using a total of ten bins at locations which cover
the whole data range, with emphasis on the likely quantile locations. We use a hand-written Python
translation of the original quantile functions written in Go.
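The general approach to estimating a quantile from such cumulative buckets is to locate the bucket
in which the target rank falls and to interpolate linearly inside it; the following Python sketch
illustrates this approach (ours, not the translation used in the evaluation):
\begin{Verbatim}[fontfamily=tt]
def bucket_quantile(q, buckets):
    # buckets: cumulative (upper_bound, count) pairs, last bound is +inf
    total = buckets[-1][1]
    rank = q * total
    lower, prev = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            frac = (rank - prev) / (count - prev)
            return lower + frac * (upper - lower)
        lower, prev = upper, count

buckets = [(0.1, 5), (0.5, 40), (1.0, 95), (float("inf"), 100)]
print(bucket_quantile(0.5, buckets))  # 0.5 + 10/55 * 0.5 ~ 0.59
\end{Verbatim}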
The HDR Histogram data structure is a log-linear histogram in the sense of Definition \ref{def:ll}.
It uses a base of 2, and a configurable range and precision. It is notable that the exposed API
lets the user specify decimal precision. The internal precision is then chosen in such a way that
the resulting base-2 bins are smaller than the specified base-10 accuracy. Using base-2 arithmetic
has the advantage of allowing bit-wise manipulation of floating point numbers to determine the bin
location. For our evaluation we used the Python
implementation\footnote{\url{https://github.com/HdrHistogram/HdrHistogram_py}}, and configured the
HDR Histograms to cover the same range as the circllhist ($10^{-128}\dots10^{+127}$) with two digits
of decimal precision.
The DDSketch is a histogram for the logarithmic binning introduced in Example \ref{example:log}.
It allows an arbitrary base $b > 1$, which configures the desired precision.
We use the Python implementation\footnote{\url{https://github.com/DataDog/sketches-py}}, with the default
precision of 1\% (corresponding to $b=101/99$).
The t-digest is the only evaluated method that does not internally use a histogram data structure.
Instead, the inserted data is aggregated into clusters of adaptive size. The sizes are chosen in
such a way that high-resolution data is available at the tails of the distribution. We use the
original Java implementation\footnote{\url{https://github.com/tdunning/t-digest}} called from Python
using pyjnius\footnote{\url{https://github.com/kivy/pyjnius}}. The digests are configured with a
compression parameter of 100, which is described as a ``common value for normal uses'' in
the source code.
For the circllhist we used the Python binding\footnote{\url{https://github.com/circonus-labs/libcircllhist}}.
It does not allow any configuration of bin sizes or accuracy.
\subsection{Datasets}
For the evaluation we chose three different datasets: ``Uniform Distribution'', ``API Latencies'' and ``Simulated Latencies'',
each containing at least 100,000 samples, split into 1000 or more batches.
For the ``Uniform Distribution'' dataset, random samples are generated from a uniform distribution on $[10,100]$.
These are split into 1000 batches each containing 100 values, for a total of 100,000 samples.
The ``API Latencies'' dataset contains data collected at one of our internal APIs in 10-minute
batches over the course of 6 weeks. The data is spread between 0.29 and 11,000 (ms). There
is a cutoff value around 11\,sec that was caused by timeout logic getting triggered. This results in
the higher quantile values being close together. It consists of a total of 4.65 million samples in
6048 batches.
For the ``Simulated Latencies'' dataset, we generate a total of 1000 batches of randomized sizes
with randomized data. The sizes of the batches follow a geometric distribution. The samples
themselves are generated from a randomly displaced and scaled Pareto distribution. The data is
spread between $10^{-5}$ and $10^{10}$ with an extremely long tail. The dataset contains
a total of close to 1,000,000 samples.
Figure \ref{fig:ds} contains a histogram visualization of the total distribution of these datasets.
\subsection{Size}
\begin{figure}
\begin{subfigure}{0.33\textwidth}
\includegraphics[width=\textwidth]{evaluation/images/all_size_line.png}
\caption{Size Comparison}
\label{fig:size}
\end{subfigure}
\begin{subfigure}{0.66\textwidth}
\input{evaluation/tables/all_size.tex}
\caption{Tabulated sizes in bytes}
\label{fig:tsize}
\end{subfigure}
\caption{Size Comparison}
\end{figure}
Figure \ref{fig:size} and Table \ref{fig:tsize} show the sizes of the data structures after all samples
from the respective datasets have been inserted and the data structures have been merged.
With our choice of parameters, the sizes of the HDR Histogram, DDSketch, and circllhist track each other
quite closely. Note that the size increases with the spread of the data, not with the number of
inserted samples.
It is important to note that we compare the sizes of the serialized versions of the data-structures,
as they would be consumed on disk or on the network, not the in-memory size (as this is harder to
estimate). The in-memory size might be larger than the serialized size. This is particularly true
for the HDR Histogram, which pre-allocates all bins in memory and skips empty bins during
serialization.
The Prometheus histogram stores a count for each of the ten configured threshold values, as well as
the total count (infinite bin), and hence consumes exactly 88 bytes.
The size of the t-digest is relatively constant across all three data-sets, which reflects the fact
that the number of clusters is kept constant when additional data is inserted.
\subsection{Performance}
\begin{figure}
\begin{subfigure}{\textwidth}
\includegraphics[width=\textwidth/3]{evaluation/images/all_perf_insert.png}
\includegraphics[width=\textwidth/3]{evaluation/images/all_perf_merge.png}
\includegraphics[width=\textwidth/3]{evaluation/images/all_perf_quantile.png}
\caption{Performance Comparison}
\label{fig:perf}
\end{subfigure}
\begin{subfigure}{\textwidth}
\centering
\input{evaluation/tables/all_perf.tex}
\caption{Tabulated performance data in usec\protect\footnotemark}
\label{tab:perf}
\end{subfigure}
\caption{Performance Comparison}
\end{figure}
\footnotetext{
The insertion time is reported as per inserted sample.
The merge time is the time per merged batch.
The quantile time is reported per calculated quantile.
}
Figure \ref{fig:perf} and Table \ref{tab:perf} show the measured performance for the insertion, merge
and analysis phases for the three considered datasets. As always, performance measurements should be
taken with a grain of salt, since they are heavily influenced by implementation, configuration
and hardware choices. In our case, we chose to perform all performance measurements in Python, with
the most popular Python implementations available. In the case of the t-digest we ran into accuracy
(and performance) problems with the Python version, so we used the official Java version via
pyjnius.
The measurements themselves were performed with the
timeit\footnote{\url{https://docs.python.org/3/library/timeit.html}} package, with 5 consecutive batches
of runs each taking longer than 0.2 seconds. As recommended, we report the minimal duration that
was attained in any of the runs.
We ran these experiments on a server with an Intel(R) Xeon(R) D-1540 CPU and 128\,GB RAM running
Linux 5.3.13 and Python 3.7.3.
With this configuration, the circllhist was consistently the fastest method when it comes to
insertion and merge. For the quantile calculations the t-digest is very efficient, followed by the
circllhist and DDSketch. Quantile calculations for the HDR Histogram take a lot longer, which
probably indicates optimization potential.
\subsection{Accuracy}
As we have seen in Section \ref{sec:quantiles}, there are a number of different quantile definitions
circulating in the wild. When evaluating quantile accuracy, we have to make sure we are comparing
the computed numbers to the theoretical quantiles the methods are approximating. With the notation
of Definition~\ref{def:quantiles}, the circllhist and Prometheus approximate type-1 quantiles, and
DDSketch approximates minimal type-7 quantiles.
The t-digest and HDR Histograms approximate the custom quantile functions that we called ``type-tdigest''
and ``type-hdr'' in Definition~\ref{def:quantiles}.
In all cases, the following quantile values were considered:
\begin{align*}
0, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99, 0.995, 0.999, 0.9999, 0.99999, 1.
\end{align*}
For each value, the quantile is computed once with the evaluated data-structure and once with the
respective theoretical function on the raw data. The relative difference between those values is
reported in Figure \ref{fig:acc} and Table \ref{tab:acc}.
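For reference, the relative difference is computed in the usual way; a Python sketch (ours, with a
stand-in dataset) of the comparison:
\begin{Verbatim}[fontfamily=tt]
import numpy as np

qs = [0, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99,
      0.995, 0.999, 0.9999, 0.99999, 1]
data = np.random.pareto(2, 100_000) + 1   # stand-in latency dataset
exact = np.quantile(data, qs)             # numpy: interpolated type-7
estimated = exact * 1.02                  # placeholder estimates
print(np.abs(estimated - exact) / exact)  # relative differences (2%)
\end{Verbatim}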
\begin{figure}
\begin{subfigure}{\textwidth}