-
Notifications
You must be signed in to change notification settings - Fork 1
/
introduction.tex
80 lines (72 loc) · 3.98 KB
/
introduction.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
\section{Introduction}
% ONE sentence defining what a performance model is
Processor performance models are analytical tools that statically predict
program performance without dynamic execution.
They can simplify compiler optimizations such as
auto-vectorization and instruction scheduling,
giving compiler writers an abstract view of the machines they are targeting.
Performance models can also guide manual optimization by showing
potential program bottlenecks.
It is, however, no easy feat to construct an accurate performance model,
which has to consider various microarchitectural optimizations employed by
modern processors such as micro-op decomposition and out-of-order execution.
To make the matter worse, some of these optimizations
are not documented or proprietary.
Despite the complexity that existing performance models have to deal with,
there is no systematic methodology to verify them.
This hinders the development of \emph{accurate} performance models.
For instance, llvm-mca~\cite{llvm-mca}---which uses
LLVM's scheduling models---has an average error of more than 40\% on vectorized
basic blocks (Figure~\ref{fig:skl-cluster-err}).
Inaccurate performance models can cause mis-optimizations;
for instance, goslp's \cite{goslp} optimal\footnote{
With certain caveat.
In particular their formulation is optimal only when the target machine only has vector-width of 2}
formulation of SLP vectorization~\cite{slp},
in some cases, causes performance regression due to inaccuracies in LLVM's~\cite{llvm} performance models.
Verifying a performance model
entails cross-validating its predictions with
measurements collected
from native profiling.
However, there is currently no scalable approach
to profile arbitrary machine code, which
prevents a systematic validation of existing models.
Existing machine code profilers are not fully automatic and
unload the responsibility to ensure a basic block's successful
execution to the user.
Part of this responsibility is to prevent a code snippet from crashing:
Code extracted from a larger program context, when executed
out-of-context, is likely to access memory addresses illegally and crash.
However, successfully executing a basic block alone is insufficient.
One also needs to ensure that a basic block
is profiled in an environment
conforming to common modeling invariants
assumed by a performance model,
such as the absence of cache misses,
but most of these invariants are \textit{not}
guaranteed by existing profilers.
In this paper, we present a novel profiler that can automatically
profile the throughput of \textit{arbitrary} basic blocks.
We show how we use this profiler to build a benchmark for validating
performance models of x86-64 basic blocks.
Specifically, we make the following contributions in this paper:
\begin{itemize}
\item We describe a fully automatic profiler
for arbitrary, memory-accessing basic blocks.
The profiler does not require any user intervention.
Its measurements conform with common
modeling assumption made by existing models;
e.g., it ensures that a basic block incurs
\textit{no cache misses}.
We have used this framework to profile more than 2 million basic blocks\footnote{
We are releasing less basic blocks than this because of licensing issues.
}.
\item We propose a benchmark suite of over 300,000 basic blocks collected from a wide range of domains from numerical computation (e.g., OpenBLAS) to databases (SQLite).
\item We evaluated four existing performance models---IACA~\cite{iaca}, llvm-mca~\cite{llvm-mca} (which exposes LLVM’s performance model used for instruction scheduling), OSACA\cite{osaca}, and Ithemal\cite{ithemal}---on three recent Intel microarchitectures:
Ivy Bridge, Haswell, and Skylake.
\item We automatically classified these basic blocks
by their use of different execution ports.
Using this classification,
we analyze the strengths and weaknesses of
the performance models on different categories of basic blocks.
\end{itemize}