\chapter{OpenACC}
\label{chap:openACC}
{\bf Why OpenACC?}: Existing code often contains parts, typically loops, that can be
offloaded to a GPU.
{\bf What is OpenACC?}: an open standard defining a collection of compiler
directives that mark loops and regions of code in standard C, C++, and
Fortran to be offloaded from a host CPU to an attached parallel accelerator,
such as a CUDA-capable GPU.
OpenACC is based on OpenMP directives. It was developed by The Portland
Group (PGI), Cray, CAPS and NVIDIA.
OpenMP (Sect.~\ref{chap:openmp}) uses
\verb!#pragma! directives to run on multi-core CPUs.
Being based on OpenMP, OpenACC targets programmers who are not experts in parallel
computing. OpenACC can also target non-Nvidia devices.
\begin{verbatim}
#pragma omp parallel for reduction (+:pi)
for (...)
{
   pi += 4.0/(1.0 + t*t);
}
\end{verbatim}
which, in OpenACC, becomes
\begin{verbatim}
#pragma acc kernels
for (...)
{
}
\end{verbatim}
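For reference, here is a minimal, self-contained sketch of the same idea written as a
numerical integration with an OpenACC loop; the midpoint rule, the variable names, and
the explicit \verb!reduction! clause are illustrative assumptions, not taken from the
original slides.
\begin{verbatim}
/* sketch: approximate pi as the integral of 4/(1+t^2) over [0,1] */
#include <stdio.h>

int main(void)
{
    const long n = 1000000;
    const double dx = 1.0 / n;
    double pi = 0.0;

    #pragma acc kernels loop reduction(+:pi)
    for (long i = 0; i < n; ++i) {
        double t = (i + 0.5) * dx;       /* midpoint of the i-th sub-interval */
        pi += 4.0 / (1.0 + t * t) * dx;
    }

    printf("pi ~= %f\n", pi);
    return 0;
}
\end{verbatim}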
\verb!#pragma acc! is used in OpenACC C; \verb.!$acc. is used in
OpenACC Fortran.
PGI Fortran, starting with release 12.3, partially supports OpenACC.
\begin{verbatim}
!$acc kernels
DO k = 1, n1
  DO i = 1, n2
    ...
  END DO
END DO
!$acc end kernels
\end{verbatim}
OpenACC vs. the PGI Accelerator model?
\begin{enumerate}
\item Instead of using \verb!region!, OpenACC uses \verb!kernels! as the directive
name:
\begin{verbatim}
!$acc directive [clause ...]
structured block
!$acc end directive
\end{verbatim}
and, in C (where the braces of the structured block themselves mark the end of the region),
\begin{verbatim}
#pragma acc directive [clause ...]
   { structured block }
\end{verbatim}
with clauses such as
\begin{verbatim}
if (cond.)
async (expression)
\end{verbatim}
\end{enumerate}
So far there is no explicit data transfer between the CPU and the GPU.
\textcolor{red}{Question: then how do we tell the GPU to keep the data without moving it
back???} We need to decouple data movement from compute off-load. Example: to avoid a
recopy after every iteration of the loop, we can keep, say, A and Anew on the device by
combining the two loops into a single kernel; or, for better data management, use the
\verb!data! directive with a number of different clauses:
\begin{verbatim}
copyin()
copyout()
...
\end{verbatim}
E.g.:
\begin{verbatim}
!$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))
In C the start index is zero, square brackets are used, and a
subarray is written as [start:length]:
#pragma acc data copyin(a[0:size]), copyout(b[s/4:s/2])
\end{verbatim}
Important keywords:
\begin{enumerate}
\item \verb!restrict!: an important keyword for C pointers; it improves
caching performance by ruling out the effects of {\it pointer
aliasing} (see the sketch after this list).
\item \verb!fp_kind!: a kind parameter placed after a floating-point literal in Fortran
to specify its byte length.
\end{enumerate}
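A minimal sketch of \verb!restrict! in an OpenACC loop; the routine name and the arrays
are made up for illustration:
\begin{verbatim}
/* Declaring the pointers restrict promises the compiler that x and y
   do not alias, which enables better caching and vectorization. */
void scale_add(int n, float a,
               const float *restrict x, float *restrict y)
{
    #pragma acc kernels loop
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}
\end{verbatim}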
Compile with the PGI compilers:
\begin{verbatim}
pgcc -acc [-Minfo=accel] -o binaryfile code.c
pgf90 -acc [-Minfo=accel] -o binaryfile code.f90
\end{verbatim}
Example: Jacobi iteration
\begin{verbatim}
while (err > tol && iter < iter_max) {
  err = 0.0;
#pragma omp parallel for shared(m, n, Anew, A) reduction(max:err)
  for (int j = 1; j < n-1; j++) {
    for (int i = 1; i < m-1; i++) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1]
                         + A[j-1][i] + A[j+1][i]);
      err = fmax(err, fabs(Anew[j][i] - A[j][i]));
    }
  }
  iter++;
}
\end{verbatim}
The OpenACC version replaces the OpenMP directive with \verb!acc kernels!:
\begin{verbatim}
while (err > tol && iter < iter_max) {
#pragma acc kernels
  {
  /* ... stencil update of Anew from A, as in the OpenMP version ... */
  for (int j = 1; j < n-1; j++) {
    for (int i = 1; i < m-1; i++) {
      A[j][i] = Anew[j][i];
    }
  }
  }
  iter++;
}
\end{verbatim}
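To keep the arrays resident on the GPU across iterations of the \verb!while! loop (the
data-movement question raised above), a \verb!data! region can be wrapped around it. This
is a minimal sketch assuming A and Anew are statically sized 2D arrays; for pointer-based
arrays the shapes would have to be spelled out, e.g. \verb!copy(A[0:n][0:m])!.
\begin{verbatim}
#pragma acc data copy(A), create(Anew)
{
  while (err > tol && iter < iter_max) {
    /* ... kernels region as above; no per-iteration host<->device copies ... */
    iter++;
  }
}
\end{verbatim}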
In Fortran, we also make use of the \verb!fp_kind! kind parameter mentioned above.
\textcolor{red}{Question: in which situations do directives not work, so that we need an
explicit kernel implementation???}
\begin{enumerate}
\item Nested for loops are the best candidates for parallelization.
\item Large loop counts are needed to offset the GPU/memcpy overhead.
\item \verb!PGI_ACC_TIME=1! is an environment variable that reports where time is
being spent (see the example after this list).
\item Eliminate pointer arithmetic.
\end{enumerate}
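For example (a shell sketch; the exact layout of the timing report depends on the PGI
runtime version):
\begin{verbatim}
$ pgcc -acc -Minfo=accel -o binaryfile code.c
$ export PGI_ACC_TIME=1
$ ./binaryfile     # per-region/kernel timing is printed when the program exits
\end{verbatim}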
\textcolor{red}{QUESTION: How do we tell OpenACC to do the compute only, not the data
transfer???}
Threads need to load/store blocks of memory; share results with each other or
cooperate to produce a result (e.g. a reduction); or synchronize with each other.
The OpenACC execution model has three levels: {\bf gang}, {\bf worker} and {\bf vector}.
This maps onto architectures that are a collection of {\bf Processing Elements}
(PEs). The device has one or more PEs per node, and each PE is multi-threaded.
\begin{verbatim}
gang == block; worker == warp; vector == threads of a warp
we can omit "worker", i.e. gang == block, and vector == threads of a block
\end{verbatim}
So the exact mapping is implementation-dependent (i.e. up to the compiler). The user can
also explicitly specify how a given loop should map to gangs, workers, and/or
vectors, e.g. by specifying the number of gangs/workers/vectors, which allows
programmers to tune the code for a particular target architecture.
\begin{verbatim}
// here, the compiler automatically chooses a suitable mapping
// to threads and blocks, e.g. 16 blocks, 256 threads each
#pragma acc kernels loop
for (int i =0; i<n; ++i) y[i] += a*x[i];
\end{verbatim}
Suppose we want to explicitly ask for 100 thread blocks, each with 128 threads, where
each thread executes one iteration of the loop:
\begin{verbatim}
#pragma acc kernels loop gang(100), vector(128)
for (int i =0; i<n; ++i) y[i] += a*x[i];
\end{verbatim}
If we want a group of loops that should run together, we use
\verb!acc parallel! with \verb!num_gangs()! and \verb!vector_length()!:
\begin{verbatim}
#pragma acc parallel num_gangs(100), vector_length(128)
{
#pragma acc loop gang
for (int i =0; i<n; ++i) y[i] += a*x[i];
}
\end{verbatim}
Here there are 100 thread blocks, each with apparently one thread. Without a
\verb!loop! directive every gang would redundantly execute the whole loop; with
\verb!loop gang!, the iterations are split across the gangs:
\begin{verbatim}
#pragma acc parallel num_gangs(100)
{
#pragma acc loop gang
for (int i =0; i<n; ++i) y[i] += a*x[i];
}
\end{verbatim}
Here, with 50 thread blocks of 128 threads each ($50 \times 128 = 6400$ threads) and
$n = 128000$, each thread does 20 elements' worth of work:
\begin{verbatim}
n = 128000
#pragma acc loop gang(50), vector(128)
for (int i =0; i<n; ++i) y[i] += a*x[i];
\end{verbatim}
To get multi-dimensional blocks and grids, we nest the loop directives:
\begin{verbatim}
#pragma acc kernels loop gang(100), vector(16)
for (...) {
#pragma acc loop gang(10), vector(32)
for (...) {
}
}
\end{verbatim}
NOTE: If we say \verb!gang!, it means \verb!gang(1)!.
Typically, on CUDA-capable GPUs, the number of threads in a block should be, at the very
least, a multiple of 32. Values between 256 and 512 are usually a good choice. Overly
large blocks can hinder performance, e.g. by increasing the cost of any
synchronization/barrier among all threads in a block. For 2D blocks, we can try a
few combinations like 32x8, 64x4, 32x16, 64x8, etc.
Each thread can do multiple pieces of work to help amortize the setup cost of
simple kernels. {\bf Rule of thumb:} we want at least enough threads
to fill the GPU several times over its number of cores (perhaps 10x or more),
i.e. we need 100,000+ threads.
\begin{verbatim}
#pragma acc loop gang, vector(8)
#pragma acc loop gang(16), vector(32)
\end{verbatim}
By tweaking the block sizes and how often data are copied in/out, we can significantly
improve performance.
{\bf QUESTION: which architectures can be targeted by the current OpenACC version}?
Nvidia GPUs only.
We can profile using the PGI Accelerator profiler or the Nvidia Visual Profiler (whose
interface differs between CUDA 4.0 and 4.1). Another option is the Nvidia Nsight
profiler (which targets Windows).
There is a directive to tell the compiler to keep an array in cache memory;
the compiler also has heuristics to do this for us.
\begin{verbatim}
#pragma acc cache
\end{verbatim}
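A minimal sketch of where the directive goes (the arrays \verb!a! and \verb!b! and the
3-element window are made up for illustration); \verb!cache! is placed at the top of the
loop body and names the subarray to keep in fast memory:
\begin{verbatim}
#pragma acc kernels loop
for (int i = 1; i < n-1; ++i) {
    #pragma acc cache(a[i-1:3])   /* keep a[i-1], a[i], a[i+1] cached */
    b[i] = 0.5f * (a[i-1] + a[i+1]);
}
\end{verbatim}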
Consider a second instance of the stencil kernel.
We need stride-1 access in the parallel loop to maximize performance. C does not have
true multi-dimensional arrays; what is commonly used instead is an array of pointers to
rows. How do we pass such a multi-dimensional array in C with OpenACC? It does not work
very well.
If we want to reuse data already on the GPU, we use the \verb!present! clause.
If the data may not be there yet and should be copied in that case, we use the
\verb!present_or_copy! or \verb!present_or_copyin! clause:
\begin{verbatim}
#pragma acc kernels loop present(a(0:n-1))
#pragma acc kernels loop present_or_copyin(a(0:n-1))
#pragma acc kernels loop present_or_copy(a(0:n-1))
\end{verbatim}
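A sketch of the usual pattern (the routine names \verb!driver! and \verb!solver_step!
are made up): the caller creates the device copy once with a \verb!data! region, and the
callee asserts the data is already present, so no per-call transfer happens.
\begin{verbatim}
void solver_step(int n, float *restrict a)
{
    #pragma acc kernels loop present(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] *= 0.5f;                  /* placeholder device computation */
}

void driver(int n, float *restrict a)
{
    #pragma acc data copy(a[0:n])      /* one transfer in, one transfer out */
    {
        for (int step = 0; step < 10; ++step)
            solver_step(n, a);
    }
}
\end{verbatim}
When data must periodically return to the host, e.g. for MPI communication, the
\verb!update! directive moves just what is needed, as in the following fragment: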
\begin{verbatim}
#pragma acc data copy(x[0:n]) ...
{
for (timestep = 0; ...) {
...compute on device ...
#pragma acc update host (x[0:n])
MPI_Sendrecv( x, ...)
#pragma acc update device (x[0:n])
... adjust on device
}
}
\end{verbatim}
Asynchronous data transfer
\begin{verbatim}
#pragma acc kernels loop async
for (i = 1; i < n; ++i)
y[i] += a*x[i];
#pragma acc update device (x[0:n], y[0:n]) async
#pragma acc update host (y[0:n]) async
...
#pragma acc wait
\end{verbatim}
There is no concept of pinned memory in OpenACC. Cray machines are huge, with
large memory, so pinned memory is not important there. With PGI, pinned memory is
important, as the machines are smaller and less powerful. What if we communicate
through a library where the user may have pinned the memory but the library assumes it
is unpinned, or vice versa? This requires more work with the OS to determine whether the
memory is pinned or not.
\begin{verbatim}
!$acc kernels loop present(a(:,js-1:je+1), b(:,js-1:js+1)) async if (usegpu)
...
\end{verbatim}
and then
\begin{verbatim}
do iter = 1, maxiters
  call smoothier(..., .true., ...)
  call smoothier(..., .false., ...)
  !$acc update host (a(:,js)) device (a(:,js+1)) async
  !$acc wait
  call smoothier(..., .true., ...)
  call smoothier(..., .false., ...)
  !$acc update host (b(:,js)) device (b(:,js+1)) async
  !$acc wait
end do
\end{verbatim}
But it is wasteful for the CPU to wait while the GPU does its work; we can rearrange the
lines so that both do their jobs concurrently (one computes while the other copies).
OpenMP and OpenACC are actually not very friendly to each other.
What about multiple GPUs? There is no good answer to the data-management
problem. PGI requires using MPI or OpenMP:
\begin{verbatim}
#include <openacc.h>
#include <omp.h>
#pragma omp parallel num_threads(2)
{
   int i = omp_get_thread_num();
   acc_set_device_num(i, acc_device_nvidia);  /* one GPU per OpenMP thread */
}
\end{verbatim}
and with MPI:
\begin{verbatim}
#include <openacc.h>
#include <mpi.h>
int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
int numdev = acc_get_num_devices(acc_device_nvidia);
int i = myrank % numdev;
acc_set_device_num( i, acc_device_nvidia);
\end{verbatim}
It is really hard to reindex all the data when we move it; the strides are
different, \ldots
\section{cuSPARSE}
cuSPARSE is reported to be more than 6x faster than Intel MKL.
\section{cuRAND}
The two widely used generators in CUDA 4.1 are \verb!MRG32k3a! and the \verb!MTGP11213!
Mersenne Twister. They support several output distributions (uniform,
normal, Poisson, etc.).
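A minimal sketch of the cuRAND host API with the MRG32k3a generator (error checking
omitted; the buffer size and seed are arbitrary):
\begin{verbatim}
#include <cuda_runtime.h>
#include <curand.h>

int main(void)
{
    const size_t n = 1 << 20;
    float *d_data;                              /* device buffer */
    cudaMalloc((void **)&d_data, n * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, d_data, n);      /* uniform samples on the GPU */

    curandDestroyGenerator(gen);
    cudaFree(d_data);
    return 0;
}
\end{verbatim}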
\section{cuFFT}
Example: 1D point-wise convolution using cuFFT (a sketch of the corresponding cuFFT
calls follows the list):
\begin{enumerate}
\item Transform the signal and the filter kernel with \verb!cufftExecC2C!.
\item \verb!complexPointwiseMulAndScale!: multiply the coefficients together
and normalize the result.
\item Transform the signal back.
\end{enumerate}
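The cuFFT side of those steps might look like the following sketch (the routine name
\verb!fft_convolve! and the device buffers are assumptions; error checking is omitted,
and step 2 is left as a placeholder since it is a user kernel or OpenACC loop rather
than a cuFFT call):
\begin{verbatim}
#include <cufft.h>

void fft_convolve(cufftComplex *d_signal, cufftComplex *d_filter, int n)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, 1);

    /* step 1: forward-transform signal and filter kernel in place */
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);
    cufftExecC2C(plan, d_filter, d_filter, CUFFT_FORWARD);

    /* step 2: pointwise complex multiply and scale by 1/n
       (the complexPointwiseMulAndScale step, e.g. over float* views) */

    /* step 3: transform the product back */
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_INVERSE);

    cufftDestroy(plan);
}
\end{verbatim}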
NOTE: the PGI C compiler does not support \verb!structs! in OpenACC loops, so we
need to cast the \verb!Complex*! pointer to \verb!float*! pointers and use interleaved
indexing.
We can use the \verb!deviceptr! data clause to pass pre-allocated device data to
OpenACC regions and loops, and \verb!host_data! to get the device address of
pointers inside an \verb!acc data! region.