-
Notifications
You must be signed in to change notification settings - Fork 1
/
impl.tex
387 lines (303 loc) · 27.9 KB
/
impl.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
\chapter{Design and implementation}\label{sec:impl}
In order to unroll a loop with non-static bounds, this thesis follows a specific approach:
First, we check whether we are able to unroll the loop.
\Cref{sec:impl:unrollability} describes the conditions necessary and how we check them.
If we determine a loop to be unrollable, we will unroll it with the unrolling process covered in \Cref{sec:impl:unroll}.
Once this process is complete, the loop condition of the unrolled loop will be adapted to make sure it runs less than or equal times compared to the original loop.
This is described in~\Cref{sec:impl:fixup:header-cond}.
After that, we will create the \textit{fixup code}\footnote{The term \textit{fixup code} describes that code that has to be added to account for cases where the number of times the loop is executed modulo the unrolling factor is not equal to zero.}
, as described in \Cref{sec:impl:fixup}.
It is to be noted that in terms of actually implementing this procedure, we will create the fixup code \textit{before} unrolling the loop.
While this order seems counter-intuitive, we chose it in order to simplify the implementation of loop duplication, as described in~\Cref{sec:impl:fixup:loop}.
Henceforth, we assume loops to be in the form of the loop shown in~\Cref{fig:impl:general-loop}.
In the reference loop \textit{cmp} refers to a comparison that can be one of the following: $<, >, \leq, \geq$.
Further, $I \in \mathbb{Z}$ refers to the starting value, $N \in \mathbb{Z}$ to the bound, and $c \in \mathbb{Z} \backslash \zeroset$ \label{sec:impl::def-c} to the increment\footnote{N.B.: $c$ may be negative and could hence also be a decrement} of such a loop.
We select this form in view of the fact that most loops follow the form of using a counter or iterating over a given container, which condenses down to this form.
Furthermore, this form allows for many arithmetic properties to be used, as seen in~\Cref{sec:impl:fixup}.
\input{fig/general-loop.tex}
\newpage
\section{Determining unrollability}\label{sec:impl:unrollability}
Given that the primary goal of any optimization is to conserve semantics, most optimizations are based upon assumptions.
These assumptions will be assured, by checking corresponding preconditions before the optimization is applied, so that its transformed product will be semantically equivalent.
In the case of loop unrolling, we laid out the structure of the targeted loops in \Cref{fig:impl:general-loop}.
This section formalizes the resluting requirements and extends them, such that the further unrolling process conserves semantics.
Firstly, in view of the fact that we use the existing loop unrolling functionality as a sub-step (see~\Cref{sec:impl:unroll}), it needs to be ensured that the respective~\libFIRM-graph is in LCSSA form.
We accomplish this by using the existing mechanics~\cite{aebi18bachelorarbeit}.
While it is a preliminary step, assuring LCSSA form can never be a hindrance to unrolling, since it is possible to convert any given \libFIRM{} graph into LCSSA form.
Due to the restrictions of the existing loop unrolling mechanism, a loop must also be the innermost loop, meaning it does not have any nested loops inside of it.
Nested loops inherently cause larger code sizes and hardly saves jumps, since most jumps will occur in the inner loops.
Therefore, the restriction will in practice most likely not harm performance.
Given these conditions are met, we use the mechanics, described in \Cref{sec:impl:sel-factor}, for determining if and how a given loop should be unrolled based on size.
Moreover, in order for loops to be in the form described in \Cref{fig:impl:general-loop}, loops have to have a header, which itself controls the control flow by comparing a counter to a bound, using any of the four allowed comparison types.
The header is the only point in the loop from which the loop can exit; meaning there are no conditionals in the body that allow the control flow to leave the loop.
This primarily requires there not to be any \texttt{break}-like structures.
Seeing that there is an explicit entry point for the loop, the header, there are no preconditions for $I$, since it is therefore only evaluated once in a block dominating the header, but inherently not determining of the control flow after the initial evaluation.
On the contrary, $N$, the bound, has to be loop-invariant, which means that it may not change through the entire evaluation of the loop, because it is checked against $i$ in every iteration.
As an example, consider a loop, such as the one in \Cref{fig:impl:general-loop}, replacing \texttt{DoSomething} with \texttt{$N =$ randomNumber}.
If we now execute the body $f > 1$ times consecutively, we will effectively lose $f - 1$ checks.
Assume that initially $I = 0, c = 1, f = 2, N = 2$, and assume in the first execution in the loop body $N$ is set to $0$ by chance, whereas in the second iteration it is set to $7$.
Now given that when unrolling the condition is removed for the entering the second body, the loop body would at least four times, which does not conserve semantics, as it should only be executed once.
Concluding, only if $N$ is loop-invariant, the bound checks can be performed less often, while keeping the original semantics intact.
If $N$ is constant it is obviously loop-invariant, but what if it is the result of a function call or of a load from memory?
For the case that $N$ is function call, the called function must be pure (i.e., not have any side-effects), and only have loop-invariant arguments, seeing that the call is then by definition loop-invariant itself.
In case that $N$ is being loaded from memory, stricter conditions have to be met.
All stores within the loop must be sure not to alias the memory location of $N$.
Further, any calls must either invoke functions known at compile time and none of these may contain aliasing stores or have aliasing parameters.
Otherwise, the loop cannot be unrolled with a loaded bound, due to these called functions potentially modifying $N$.
Lastly, the unroll-factor -- meaning how often the loop body is copied inside the unrolled loop -- $f$ is selected (see \Cref{sec:impl:sel-factor}) and hence known at compile time.
We can therefore further restrict the increment $c$, such that $t_{min} \leq c \cdot f \leq t_{max}$, where $t_{min}$ is the minimum value of the integer type of $c$ and $t_{max}$ the respective maximum.
Hence, we prevent $c \cdot f$ from overflowing, which will turn out to be important in Sections \ref{sec:impl:fixup:header-cond} and \ref{sec:impl:fixup:duff}, and further discussed there.
In order to assure this property, we have to force $c$ to be a compile-time constant (which inherently is loop invariant).
Even though the restrictions on $c$ seem comparatively tight, in real-world code (gcc, \texttt{spec2006}) only approximately $1.2\%$ of loops that meet the previous conditions are not unrollable because of the restriction that $t_{min} \leq c \cdot f \leq t_{max}$.
It is worth mentioning that the unrollability with the method above is only checked if the current loop unrolling mechanism~\cite{aebi18bachelorarbeit} determines that the current unrolling process cannot be applied.
We chose this design, for the reason that statically unrolling without any further fixup code inherently simplifies the control flow and hence should yield better, or (at least) equal, performance.
\section{Unrolling}\label{sec:impl:unroll}
To get started with unrolling loops that have unknown bounds, we unroll them by a given factor without considering whether the transformation is semantically invariant.
Semantic equivalence, which is broken due to the failure to consider how the factor relates to the original amount of iterations, will be restored in \Cref{sec:impl:fixup}.
\libFIRM{} already provides an unrolling mechanism for unrolling a loop with a given factor $f$~\cite{aebi18bachelorarbeit}.\footnote{N.B.: All following operations preserve the LCSSA property of the code.}
To avoid code duplication, we will use be utilizing this solution.
Further, figures~\ref{fig:impl:unroll:unroll-factor-2-before}~and~\ref{fig:impl:unroll:unroll-factor-2-after} show a \libFIRM{} graph of a loop that is to be unrolled or is unrolled using a factor of two, respectively.
Especially to be noted is that in \Cref{fig:impl:unroll:unroll-factor-2-after} we duplicate the loop header, and that hence the number of conditional jumps did not decrease through the loop unroll.
With the previous usage, this was not an issue, because~\libFIRM{} would automatically remove these excess headers using its constant bit analysis.
Unfortunately though in the use cases of an unknown bound, the constant bit analysis does not suffice.
This is due to the fact that the additional semantics, meaning that we are sure not to have to exit the unrolled loop from its body at any time, that are implicitly affixed to the transformed loop, cannot be recognized by~\libFIRM.
Therefore, the need to manually prune the graph to remove the excess headers arises.
We accomplish this by using \Cref{alg:impl:unroll:prune-headers}.
First, we rewire all $\Phi$-nodes in the excess header, such that all in-loop nodes depending on any given $\Phi$-node each get the in-loop predecessors of the $\Phi$-node as predecessors themselves, while the $\Phi$-node falls into desuetude and will therefore be automatically removed by a later optimization.
We apply the same transformations to the descendants of the block itself.
\input{fig/prune-header.tex}
\input{fig/alg-old-unrolling.tex}
\input{fig/plain-unrolling.tex}
\newpage
\section{Fixup strategies}\label{sec:impl:fixup}
In \Cref{sec:impl:unroll} we discussed the unrolling process.
There, we did not consider the fixup code needed, but instead plainly focused on unrolling the loop.
Firstly, we will now focus on making the loop run less than or equal times compared to the original loop in section~\Cref{sec:impl:fixup:header-cond}.
Less-than or equal is not good enough though, we want our transformed loop to run exactly as often as the original loop.
Therefore, we will create fixup code, as discussed in this section and its subsections.
\Cref{sec:impl:fixup:duff} uses a generalized version of Duff's device to create the required fixup code, whereas in \Cref{sec:impl:fixup:loop} a copy of the original loop will be used.
After that, in \Cref{sec:eval:perf}, we evaluate which approach yields faster binary run-times.
To see the reason why we need fixup code and to understand what is required of it, we formally lay out conditions that need to be met in Equations~(\ref{eqn:impl:fixup:duff:conserve-semantics-identity}) through~(\ref{eqn:impl:fixup:duff:fixup-i-mult}).
Let $M \in \mathbb{N}_0$ be the number of times a loop runs before the transformation; $\Mloop \in \mathbb{N}_0$, $\Mfixup \in \mathbb{N}_0$ the number of times the unrolled body will run in the unrolled loop, or the fixup code respectively, after the transformation.
Further, the unroll-factor will be again denoted by $f \in \mathbb{N}, f > 1$.
Henceforth, we will assume all arithmetic operations to be integer operations for integers in the interval $\lbrack t_{min}, t_{max} \rbrack$.
Please note that unmarked integer division will be assumed to round towards zero: For example $\frac{5}{3} \overset{\text{integer division}}{=} 1$.
Another convention we will introduce is that any interval will be integral, meaning it will only contain integers.
Additionally $x \mp y$ is henceforth defined as
$\begin{cases}
x + y &, x < 0\\
x - y &, x > 0
\end{cases}$
We will now lay out properties that form the basis of further arithmetic considerations.
The primary identity that is to be conserved, to retain the original semantics, is shown below in \Cref{eqn:impl:fixup:duff:conserve-semantics-identity}.
Since we know our original loop ran $M$ times, we know that our transformed loop and the fixup code must in total also run $M$ times.
\begin{equation}\label{eqn:impl:fixup:duff:conserve-semantics-identity}
\begin{aligned}
M = \Mloop + \Mfixup
\end{aligned}
\end{equation}
In order to use Duff's device, we need to restrict the amount of times the fixup code needs to run.
With the requirements to preserve the semantics in mind, we will maximize $\Mloop$ and minimize $\Mfixup$.
\begin{equation}\label{eqn:impl:fixup:duff:loop-iterations}
\begin{aligned}
\Mloop &\overset{\text{integer division}}{=} \frac{M}{f} \cdot f\\
&\overset{\text{integer division}}{\in} \medspace \rbrack M-f,M \rbrack
\end{aligned}
\end{equation}
By construction of the unrolled loop, \Cref{eqn:impl:fixup:duff:loop-iterations} is always true, as the unrolled loop tries to run as often as possible, while running less than or equal times compared to the original loop.
\begin{proof}\label{proof:impl:fixup:duff:loop-iterations}
To prove the conjecture of \Cref{eqn:impl:fixup:duff:loop-iterations}, assume for contradiction
\[\Mloop = M - f - b, b \in \left \lbrack 0, f \right\lbrack \]
and hence
\[\Mloop \leq M - f \Rightarrow \Mloop \notin \medspace \rbrack M-f,f \rbrack,\]
then by rerunning the unrolled again the body would be executed $f$ times causing $\Mloop = M - b \in \medspace \rbrack M-f,f \rbrack$, which would be a contradiction of the assumption.
We then induct this pattern for $\Mloop = M - nf - b, n \in \mathbb{N}_{+}, b \in \lbrack 0, f \lbrack $.
In these cases, the loop must merely be iterated multiple times.
\end{proof}
As the loop runs as often as possible, the fixup code will always run less than $f$ times.
\begin{equation}\label{eqn:impl:fixup:duff:fixup-interval}
\begin{aligned}
\Mfixup \in \lbrack0, f\lbrack
\end{aligned}
\end{equation}
\begin{proof}\label{proof:impl:fixup:duff:fixup-interval}
Conjecture: $\Mfixup \in \lbrack0, f\lbrack$.\\
Assume for contradiction $\Mfixup = f' > (f - 1)$
\begin{align*}
\Mloop + \Mfixup &\overset{\ref{eqn:impl:fixup:duff:loop-iterations}}{\geq} M - (f - 1) + f' \\
&> M - (f - 1) + (f - 1) \\
&= M \medspace \overset{\ref{eqn:impl:fixup:duff:conserve-semantics-identity}}{\mLightning}
\end{align*}
\end{proof}
For the following mathematical considerations, we need to round away from zero in integer division.
\Cref{lem:impl:fixup:duff:ceil-mp} describes how this can be accomplished.
\begin{lem}\label{lem:impl:fixup:duff:ceil-mp}
Given $Y \neq 0: \medspace \ceil{\frac{X}{Y}} = \frac{X + (Y \mp 1)}{Y}$
\end{lem}
\begin{proof}
To prove \Cref{lem:impl:fixup:duff:ceil-mp}, we consider the cases $X \text{ mod } Y = 0$ and $X \text{ mod } Y \neq 0$.
Further, we assume $Y > 0$, since the proof for $Y < 0$ can be performed analogously.
Consider the case that $X \text{ mod } Y = 0$.
In this case $\exists n \in \mathbb{N}: n \cdot Y = X$ and $\ceil{\frac{X}{Y}} = \frac{X}{Y} = n~(\star)$.
\begin{align*}
\Rightarrow \frac{X + (Y - 1)}{Y} &= \frac{n \cdot Y + (Y - 1)}{Y}\\
&=\underbrace{\frac{(n + 1) \cdot Y - 1}{Y}}_{< \frac{(n + 1) \cdot Y}{Y}}\\
&\overset{\text{integer division}}{=} \frac{n \cdot Y}{Y}\\
&= n\\
&\overset{\star}{=} \ceil{\frac{X}{Y}}
\end{align*}
Now consider $X \text{ mod } Y \neq 0$.
In this case $\exists n \in \mathbb{N}: n \cdot Y < X < (n + 1) \cdot Y$ and $\ceil{\frac{X}{Y}} \overset{\text{integer division}}{=} n + 1$.
\begin{alignat*}{2}
\Rightarrow n \cdot Y + (Y - 1) &< X + (Y - 1) &&< (n + 1) \cdot Y + (Y - 1)\\
\Rightarrow (n + 1) \cdot Y - 1&< X + (Y - 1) &&< (n + 2) \cdot Y - 1\\
\overset{\text{integers}}{\Rightarrow} (n + 1) \cdot Y &\leq X + (Y - 1) &&< (n + 2) \cdot Y\\
\overset{Y > 0}{\Rightarrow} \frac{(n + 1) \cdot Y}{Y} &\leq \frac{X + (Y - 1)}{Y} &&< \frac{(n + 2) \cdot Y}{Y}\\
&\Rightarrow \frac{X + (Y - 1)}{Y} &&\overset{\text{integer division}}= n + 1\\
& &&= \ceil{\frac{X}{Y}}
\end{alignat*}
\end{proof}
Using \Cref{lem:impl:fixup:duff:ceil-mp}, we use the loop parameters $I$, $N$ and $c$ to calculate the total number of loop iterations.
\begin{equation}\label{eqn:impl:fixup:duff:total-iteration-based-on-bounds}
\begin{aligned}
M &= \ceil{\frac{N - I}{c}} \\
& \overset{\text{integer division}}{=} \frac{N - I + (c \mp 1)}{c}
\end{aligned}
\end{equation}
With this result, we can then calculate $M$ accurately from the structure of the loop, since $M$ is not directly known.
As $I$ is not the initial value for the fixup's counter, we will need to calculate the initial value based on known parameters.
\begin{equation}\label{eqn:impl:fixup:duff:i-after-loop}
\begin{aligned}
\ipl = c \cdot \Mloop + I
\end{aligned}
\end{equation}
We can then use the last two equations to calculate $\Mfixup$ based on only quantities that are known at either compile-time or at run-time.
\begin{equation}\label{eqn:impl:fixup:duff:fixup-i}
\begin{aligned}
\Mfixup & \overset{\ref{eqn:impl:fixup:duff:conserve-semantics-identity}}{=} M - \Mloop \\
& \overset{\ref{eqn:impl:fixup:duff:total-iteration-based-on-bounds}}{=}
\frac{N - I + (c \mp 1)}{c} - \Mloop \\
& \overset{\ref{eqn:impl:fixup:duff:i-after-loop}}{=}
\frac{N - I + (c \mp 1)}{c} - \frac{\ipl + I}{c} \\
&= \frac{N - I + I - \ipl + (c \mp 1)}{c}\\
&= \frac{N - \ipl + (c \mp 1)}{c}
\end{aligned}
\end{equation}
As the above equation has a costly division operation in it, we will rearrange it, such that it never needs to be computed at runtime.
\begin{equation}\label{eqn:impl:fixup:duff:fixup-i-mult}
\begin{aligned}
(\ref{eqn:impl:fixup:duff:fixup-i}) &\overset{\text{integer division}}{\Longleftrightarrow} \Mfixup \cdot c = N - \ipl + (c \mp 1) \overset{\ref{eqn:impl:fixup:duff:fixup-interval}}{\in} \cinterval
\end{aligned}
\end{equation}
\Cref{eqn:impl:fixup:duff:fixup-i-mult} is especially significant in the construction of the generalization of Duff's device, as seen in~\Cref{sec:impl:fixup:duff}.
\subsection{Updating the loop condition}\label{sec:impl:fixup:header-cond}
In the following Sections~\ref{sec:impl:fixup:duff}, and~\ref{sec:impl:fixup:loop} we will use that $\Mloop = \frac{M}{f} \cdot f$.
Though, when unrolling (as described in~\Cref{sec:impl:unroll}), the original bound ($N$) is kept.
Unfortunately, this does not guarantee $\Mloop$ to be correct, as made clear by an example where a loop with $I = 0, N = 3, c = 1, f = 2, cmp = <$ is unrolled.
In this example, this would yield the following: $\Mloop = 4 > M = 3 \mLightning$, due to the fact that after the first iteration of the unrolled loop $i = 2 < 3 = N$.
To combat this, we set the bound of the unrolled loop to $\hat{N} = N - c \cdot (f - 1)$.
Now we will prove the conjecture that using the bound $\hat{N}$, the unrolled loop runs $\Mloop$ times, given the operation to calculate $\hat{N}$ will not over- or underflow.
\begin{proof}
Let $\Mloop'$ be the number of times then unrolled loop with bound $\hat{N}$ runs.
The proof is complete, iff $\Mloop' = \Mloop$.
Note that $c \cdot f$ cannot overflow, as per preconditions ($\star$).
\begin{align*}
\Mloop' &\overset{\text{loop construction}}{=} \ceil{\frac{\hat{N} - I}{c \cdot f}} \cdot f\\
&\overset{\text{integer division}}{=} \frac{\hat{N} - I + (c \cdot f \mp 1)}{c \cdot f} \cdot f\\
&= \frac{N - c \cdot (f - 1) - I + (c \cdot f \mp 1)}{c \cdot f} \cdot f\\
&= \frac{N - c \cdot f + c - I + c \cdot f \mp 1}{c \cdot f} \cdot f\\
&= \frac{N - I + c \mp 1}{c \cdot f} \cdot f\\
&\overset{\star}{=} \frac{\frac{N - I + c \mp 1}{c}}{f} \cdot f\\
&\overset{\text{integer division}}{=} \frac{\ceil{\frac{N - I}{c}}}{f} \cdot f\\
&\overset{\ref{eqn:impl:fixup:duff:total-iteration-based-on-bounds}}{=} \frac{M}{f} \cdot f\\
&\overset{\ref{eqn:impl:fixup:duff:loop-iterations}}{=} \Mloop
\end{align*}
\end{proof}
Therefore we change the header condition of the unrolled loop to $i~`cmp`~\hat{N}$.
Note that even though, we are calculating the rounding to a multiple of $f$ without the need for a slow division operation.
\Cref{fig:impl:fixup:header-cond:firm} shows the comparison of the original condition to the changed header condition, for the loop shown in \Cref{fig:impl:fixup:fixup-firm-loop}.
It is to be noted that the graph with bound $\hat{N}$ can be constant folded to the same size, as the original header.
\input{fig/changed-loop-condition.tex}
\input{fig/firm-duff-fixup-loop.tex}
We know that $c \cdot f$ cannot over- or underflow, as per the preconditions laid out in \Cref{sec:impl:unrollability}.
Since $f > 0$, $c \cdot (f - 1)$ will therefore also not overflow.
Though, $N - c \cdot (f - 1)$ can still over- or underflow, due to the subtraction of $c \cdot (f - 1)$ from $N$, and hence there is nevertheless a possibility to construct an example where this change does not conserve semantics.
Suppose the datatype of a loop with parameters $N = 2, I = 0, c = 1, cmp = <$ is a 32-bit unsigned integer, and we unroll this loop by a factor of four.
In this case $\hat{N} = 2 - 1 \cdot (4 - 1) = -1 \overset{\text{unsigned integer}}{=} t_{max}$.
Thus, the loop would run $t_{max} > 2$ times.
To circumvent this problem, we use \Cref{alg:basics:overflow:detect} from \Cref{sec:basics:overflow} as a check for over- or underflows of the operation.
We implement this check by placing a block between the header and its predecessors.
If an under- or overflow is detected, the control flow will jump directly the fixup code.
Otherwise, it will route the control flow to the header and let the loop progress as normal, given that $\hat{N}$ now restores semantics in the header.
\Cref{alg:impl:fixup:header-cond:preheader} shows the creation of this structure in~\libFIRM.
\input{fig/alg-preheader-check.tex}
\newpage
\subsection{Generalized Duff's device}\label{sec:impl:fixup:duff}
\Cref{sec:basics:duffs} describes the original version of Duff's device.
The problem with this initial approach is that it assumes $c = 1$, even though \hyperref[sec:impl::def-c]{$c$ is defined} as any non-zero integer in the considered loops.
Therefore, a need for generalization arises.
Using equations~\ref{eqn:impl:fixup:duff:conserve-semantics-identity} through~\ref{eqn:impl:fixup:duff:fixup-i-mult} and the general idea of Duff's device (see \Cref{sec:basics:duffs}), we will create fixup code in form of a generalized Duff's device.
The structure of this fixup code can be seen in \Cref{fig:impl:fixup:duff:fixup-M_fixup}, which we then practically implement, using \Cref{eqn:impl:fixup:duff:fixup-i-mult}, as shown in \Cref{fig:impl:fixup:duff:fixup-bound}.
\input{fig/alg-fixup-M_fixup.tex}
\input{fig/alg-fixup-bound.tex}
For the fixup code to work correctly, it is to be ensured that $c \cdot f$ does not overflow, as otherwise, the interval we switch over, i.e., \cinterval, could potentially be invalid, iff an integer over- or underflow occurs, meaning $
\begin{cases}
c \cdot f < 0 &, \medspace c > 0\\
c \cdot f > 0 &, \medspace c < 0
\end{cases}$.
To avoid these problems altogether, we restricted $c$ to being a compile-time constant, such that for integers defined from $t_{min}$ to $t_{max}$, $c \in \lbrack \frac{t_{min}}{f}, \frac{t_{max}}{f} \rbrack$.
Using this restriction, it can be asserted that $c \cdot f \in \lbrack t_{min}, t_{max} \rbrack$ and therefore does not overflow.
\Cref{alg:impl:fixup:duff:create-fixup} details how the mechanics described above are translated into \libFIRM.
At first, we duplicate the loop body $f - 1$ times and add keepalive edges to all duplicated nodes, to make sure they do not disappear through implicit premature optimizations.
Then we will create the \textit{fixup header}, meaning a block, with the calculation of $n \coloneqq N - i + (c \mp 1)$.
$f - 1$ newly created condition blocks will then use the calculated value by the header.
In the $i^{\text{th}}$ (counting starts at 0) condition block, it will be checked whether $n$ is in the interval spanned by $c \cdot (f - 1 - i)$ and $c \cdot (f - i)$.\footnote{N.B.: $c$ being positive or not determines, which limit is the upper and which is the lower bound.}
After this, we will wire all duplicated blocks such that they are reachable by the conditions.
Further, upon false evaluation of a condition, the following condition is evaluated, except if it is the last condition, in which case the false target is the post loop block.
Additionally, except in the case of the first duplicated header, they are attached to the previous blocks as fallthrough.
Lastly, the last block of the fixup code now precedes the post loop block, and the false exit fo the last condition.
An example of the result for creating fixup code for a loop and for $f = 2$, as seen in \Cref{fig:impl:fixup:fixup-firm-loop}, can be seen in \Cref{fig:impl:fixup:duff:fixup-firm}.
Further, \Cref{fig:impl:fixup:duff:fixup-firm-comp} shows the completed unroll process with the added generalized Duff's device, given $f = 2$.
\Cref{fig:impl:fixup:duff:general-loop} shows the resulting general structure in pseudo-code.
\input{fig/general-loop-duff.tex}
\input{fig/alg-create-fixup-duff.tex}
\input{fig/firm-duff-fixup-compl-duff.tex}
\input{fig/firm-duff-fixup.tex}
\subsection{Loop duplication}\label{sec:impl:fixup:loop}
Another, perhaps simpler, way of creating fixup code is to duplicate the original loop, such that it will run $\Mloop$ times after the unrolled loop.
Just like when using the generalized form of Duff's device, we unroll the loop using the existing mechanics by a factor of $f$.
Therefore, Equations~\ref{eqn:impl:fixup:duff:conserve-semantics-identity} through~\ref{eqn:impl:fixup:duff:i-after-loop} still hold true.
The approach now taken is to copy the original loop, change its initial value to $\ipl$ and use it as fixup code, as seen in \Cref{fig:impl:fixup:loop:fixup-loop}.
\input{fig/alg-fixup-loop.tex}
\begin{proof}
To prove that this fixup code preserves semantics, first note that $\Mloop \overset{\ref{eqn:impl:fixup:duff:loop-iterations}}{\leq} M$.
Then consider two cases:
\begin{enumerate}
\item $\Mloop = M$
\item $\Mloop < M$
\end{enumerate}
In the first case, $\ipl~`cmp`~N$ must be false, as otherwise the unrolled loop would have broken semantics, and hence the new loop is never run.
Therefore: $\Mfixup = 0 \Rightarrow \Mloop + \Mfixup = M$
In the second case, the new loop runs until the condition is met.
As the unrolled loop kept the increment semantics intact, the result is hence $\Mfixup = M - \Mloop$, which conserves the semantics, as per \Cref{eqn:impl:fixup:duff:conserve-semantics-identity}.
\end{proof}
\Cref{alg:impl:fixup:loop:fixup-loop} shows how we create this structure in~\libFIRM.
Firstly, we copy the loop, after which we rewire it, such that the fixup loop points to it, and its old successors point to the fixup loop.
Once this is completed, we can unroll the original loop.
\Cref{fig:impl:fixup:loop:general-loop} shows the resulting structure of the entire process in pseudo-code.
Note that, as mentioned in \Cref{sec:impl}, this process occurs before we unroll the original loop.
\Cref{fig:impl:fixup:loop:fixup-firm-comp} shows the result for unrolling the loop from \Cref{fig:impl:fixup:fixup-firm-loop} using loop duplication fixup code and a factor of two.
\input{fig/firm-duff-fixup-compl-loop.tex}
\input{fig/general-loop-loop.tex}
\input{fig/alg-create-fixup-loop.tex}
\newpage
\section{Selecting an unroll-factor}\label{sec:impl:sel-factor}
Previously the unroll-factor $f$ seemed like it was chosen somewhat arbitrarily.
Further, \Cref{sec:basics:unrolling} describes that there are multiple factors influencing the performance of unrolled loops.
Therefore, we devise a selection process.
As a convention, we will henceforth let size be the number of~\libFIRM-nodes in a given loop.
The -- admittedly straightforward -- algorithm tries to find an unroll-factor $f = 2^n, n \in \mathbb{N}_{>0}$, that minimizes the absolute difference between the unrolled size ($= f \cdot \text{original size}$), and a pre-determined maximum size.
\Cref{alg:impl:sel-factor:sel-factor} shows the procedure used to find these values.
It is to be noted that the algorithm can also return $0$ and $1$, which does not fit the definition of the $f$ described.
In the case that the algorithm returns one of these two values, we will interpret it as ``do not unroll''.
\input{fig/alg-calc-factor.tex}