\chapter{Lectures from VSCSE}
\label{chap:lectures-from-vscse}
\section{Desirable and Undesirable computing patterns}
\label{sec:desir-undes-comp}
% Convolution filtering (e.g. bilateral Gaussian filters)
Change the data structure to be more friendly for vectorizing on the GPU.
How do we develop efficient kernels? This is the most time-consuming
step in GPU programming.
All threads in a grid run the same kernel code = SPMD (single program
multiple data). However, each thread operates on its own data
element. To tell the thread which memory address to operate on, each
thread is given an index (computed from its thread ID and block ID).
For scalable operation, threads are organized in a hierarchical
structure: grid - thread blocks - threads.
\begin{itemize}
\item threads in different blocks cannot cooperate
\item threads in the same block cooperate via {\bf shared memory},
  {\bf atomic operations}, and {\bf barrier synchronization}.
\end{itemize}
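As a minimal sketch of this index computation (the kernel and array
names below are illustrative, not from the lecture code), a 1D grid of
1D blocks maps each thread to one array element:
\begin{lstlisting}
// Minimal sketch: each thread maps its block/thread IDs to one data element.
__global__ void scale(float *data, float alpha, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index of this thread
  if (i < n)                                      // guard against out-of-range threads
    data[i] *= alpha;
}
\end{lstlisting}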
In CUDA 3.x, multiple kernels from a single program can run at the
same time.
In CUDA 4.0, multiple kernels from different programs can run at the
same time.
Previously, only a thread block could be 1D, 2D, or 3D of threads,
while a grid could be 1D or 2D of blocks. From CUDA 4.0, a grid can
also be 3D of blocks.
In some applications (N-body simulation, ...) we often see sequential
code and parallel code run at the same time.
Typically, one thread is mapped to one input/output pair.
IMPORTANT: Avoid conflicts on critical resources:
\begin{itemize}
\item conflicting parallel updates to the same memory locations
\item off-chip DRAM (global memory) bandwidth (TUAN: what is the
  conflict here???)
\end{itemize}
\section{Massive parallelism}
\label{sec:massive-parallelism}
Massive parallelism doesn't have to be regular, i.e. the data access
pattern can be non-regular. However, to maximize throughput, the data
access pattern should be regular.
Example: at a wedding, food is served the same way whether or not the
guests like it; this guarantees quick, organized processing (large
scale and high throughput), rather than serving each guest's
individual request. So, regularity is important for high throughput.
In parallel processing, there's always a hurdle (bottleneck). So, we
need to maintain load balance.
How do we define a model with regularity in the data access pattern?
How do we define a model that balances the load between threads to
avoid the bottleneck issue? This requires a good understanding of the
application domain, the strengths and limitations of the computational
devices, and a good algorithmic design for the problem.
Atomic operations are a bottleneck that should be avoided. However,
in certain problems we need atomic operations, e.g. sum reduction of
an array. In the sum reduction example, the last sum over 16 or 32
threads is typically sequential, as this cannot be avoided.
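A minimal sketch of such a per-block sum reduction (assuming a
power-of-two block size of 256 and hardware support for floating-point
\verb!atomicAdd!; all names are illustrative) shows where the
serialized tail appears:
\begin{lstlisting}
#define BLOCK 256   // threads per block; assumed a power of two
__global__ void block_sum(const float *in, float *out, int n)
{
  __shared__ float partial[BLOCK];            // one slot per thread
  int tid = threadIdx.x;
  int i   = blockIdx.x * BLOCK + tid;
  partial[tid] = (i < n) ? in[i] : 0.0f;
  __syncthreads();

  // Tree reduction inside the block: half the threads drop out each step.
  for (int stride = BLOCK / 2; stride > 0; stride >>= 1) {
    if (tid < stride)
      partial[tid] += partial[tid + stride];
    __syncthreads();
  }

  // Combining the per-block results is the serialized part:
  // here done with one atomic add per block.
  if (tid == 0)
    atomicAdd(out, partial[0]);
}
\end{lstlisting}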
\section{Tiling and blocking}
\label{sec:tiling-blocking}
Sipping water from a glass is limited by the straw, which represents
the memory interface. As a result, we want to maximize the reuse of
the data we read in before sipping the next amount of water, i.e.
reading in the next amount of data.
\subsection{Registers}
\label{sec:registers}
Register access is part of the instruction itself, i.e. it doesn't
require a memory access instruction. So, by using registers to access
the data, we can reduce the number of binary instructions. However,
registers are private to each thread, so we cannot share data between
threads using registers. Still, if a single thread accesses the same
data repeatedly, keeping that data in a register avoids repeated
memory accesses (this is the basis of register tiling,
Sect.~\ref{sec:block-local}).
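A tiny illustrative example of this kind of register reuse
(hypothetical kernel, not from the lecture code):
\begin{lstlisting}
// Register reuse sketch: in[i] is read from global memory once into a
// register (the automatic variable v) and then reused without further loads.
__global__ void poly(const float *in, float *out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float v = in[i];                              // one global load
    out[i] = 1.0f + v * (2.0f + v * (3.0f + v));  // v reused from a register
  }
}
\end{lstlisting}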
\subsection{Kernels}
\label{sec:kernels}
All parameters passed by reference to the kernel should point to
global memory on the device. In CUDA Fortran, data is passed by
reference by default; to pass by value, you use the \verb!VALUE!
keyword.
Any call to a kernel function is asynchronous from CUDA 1.0 on;
explicit synchronization is needed for blocking.
In CUDA C, a kernel function uses the \verb!__global__! attribute and
must return ``void'' to be callable from the host. Otherwise, it must
be called from another kernel and use the \verb!__device__! attribute.
\verb!__device__! and \verb!__host__! can be used together. In such
cases, two copies are created: one to run on the CPU and one to run on
the GPU.
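A small illustrative sketch of these qualifiers (the names are made
up, not from the course code):
\begin{lstlisting}
__device__ __host__ float square(float x)    // compiled for both CPU and GPU
{
  return x * x;
}

__device__ float helper(float x)             // callable only from device code
{
  return square(x) + 1.0f;
}

__global__ void kernel(float *d_data, int n) // callable from host, must return void
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    d_data[i] = helper(d_data[i]);           // d_data must point to device global memory
}
\end{lstlisting}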
\section{Common algorithm techniques to convert from an undesirable one to a desirable one}
\label{sec:comm-algor-techn}
Check the website:
\url{http://courses.engr.illinois.edu/ece598/hk/}
\section{Sparse matrix processing}
\label{sec:sparse-matr-proc}
\section{Blocking/Tiling locality}
\label{sec:block-local}
The idea is to reuse the data fetched from global memory as many times
as possible.
\begin{itemize}
\item If the data is reused within a single thread, it's called {\bf
    register tiling}. (Optimization 1)
\item If the data is reused across threads, the data is loaded into
  shared memory ({\bf shared memory blocking}).
\end{itemize}
It's similar to carpooling, where moving multiple people in a single
car (a single thread) improves efficiency. In real life, people who
carpool should have similar schedules and work at nearby
companies. Similarly, a thread that loads a group of data should apply
the same operations to these data, and the data should be located near
each other.
So, we typically move a block/tile of data from global memory to
shared memory (on-chip memory). It's important to control the work of
the threads that use the same on-chip memory to make sure their timing
is roughly the same. In addition, the matrix can be very large, so we
typically divide the matrix into blocks or tiles small enough that a
whole tile can be loaded into on-chip memory. Typically, the tile size
is 16x16 or 32x32, which is also the size of a thread block. This maps
perfectly: each thread block processes one tile, and each thread
processes one element in the tile.
For matrix multiplication, we need two tiles in on-chip memory, to
hold the tiles from the two matrices. Note that we do not reduce the
number of memory accesses here; we actually create more of them.
However, this trades a few slow, high-latency global memory accesses
for many fast, low-latency shared memory accesses, and thus increases
performance.
To make sure all threads have loaded their data from global memory to
the shared (on-chip) memory, we need to synchronize them using
\verb!__syncthreads()!. This guarantees that all threads can access
the data in shared memory successfully. We also need to call it to
make sure all the elements in the tile are consumed, i.e. before we
load new data into shared memory, we need to make sure all threads
have used what they need from the on-chip memory to complete their
job.
Tiling is not only used to avoid low-bandwidth memory accesses, but
also to improve the coalescing of the memory access pattern, i.e. to
help neighboring threads access neighboring elements.
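A minimal sketch of shared-memory tiling for square matrix
multiplication (assuming the matrix width is a multiple of the 16x16
tile/thread-block size; names are illustrative) shows both uses of
\verb!__syncthreads()! and the coalesced tile loads described above:
\begin{lstlisting}
#define TILE 16
// Sketch: C = M * N for square Width x Width matrices, Width divisible by TILE.
__global__ void matmul_tiled(const float *M, const float *N, float *C, int Width)
{
  __shared__ float Ms[TILE][TILE];
  __shared__ float Ns[TILE][TILE];

  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.0f;

  for (int t = 0; t < Width / TILE; ++t) {
    // Each thread loads one element of each tile; neighboring threads
    // load neighboring elements, so the loads are coalesced.
    Ms[threadIdx.y][threadIdx.x] = M[row * Width + t * TILE + threadIdx.x];
    Ns[threadIdx.y][threadIdx.x] = N[(t * TILE + threadIdx.y) * Width + col];
    __syncthreads();                 // wait until the whole tile is loaded

    for (int k = 0; k < TILE; ++k)
      acc += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
    __syncthreads();                 // wait until the tile is fully consumed
  }
  C[row * Width + col] = acc;
}
\end{lstlisting}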
Slide 34: Put more pressure on each thread by using more registers per
thread.
The typical tile dimension for M (vertical) is 64 or more, loaded into
registers. The typical tile dimension for N (horizontal) is 16 or
more, loaded into shared memory.
However, instead of loading a small segment (a single line or column)
of M and N, they load a small rectangle of M and N. The size of the
rectangle is \verb!Width_N x K!, with K set so that
\verb!Width_M = Width_N * K!; so each thread loads
\begin{itemize}
\item one element from N
\item K elements from M
\end{itemize}
So, combining register tiling and shared memory tiling to balance the
load can improve matrix multiplication dramatically.
\section{Lab 1: 7-point stencil}
\label{sec:test-7-point}
\subsection{Optimization 1}
\label{sec:optimization-1}
Data reuse: saving loaded data into registers
\subsection{Optimization 2}
\label{sec:optimization-2}
Data reuse: each thread computes multiple outputs to increase data
reuse.
\section{Scalability}
\label{sec:scalability}
Tiling techniques work well with matrix multiplication, and especially
with dense matrices.
Here, we learn how parallelism can be scaled, i.e. how you turn things
around to make them much more scalable, more parallelizable
(increasing scalability).
{\bf Common sequential pattern}: use a doubly nested loop, where the
inner loop iterates over all inputs to produce one output element and
the outer loop iterates over all output elements. The complexity is
O(MN).
An example is MRI image reconstruction, where M is the number of scan
points and N is the number of regularized scan points.
\begin{lstlisting}
for (m = 0; m < M; m++) {
  for (n = 0; n < N; n++) {
    out[n] += f(in[m], m, n);
  }
}
\end{lstlisting}
In the most accurate sense, every input point affects every output
point to some extent (as in the above example). Practically, however,
this is too expensive, so we come to a more relaxed model:
{\bf scatter parallelization} (Sect.~\ref{sec:scatt-parall}). A better
approach is {\bf gather parallelization}
(Sect.~\ref{sec:gath-parall}).
In practice, however, scatter parallelization is used more often than
gather. The reason is that, in practice, each input element does not
affect all output elements. In addition, the output tends to be more
regular (in matrix form, uniformly distributed) than the input (which
can be in any form, with irregular ``distances'' or skewed).
%With scatter: easy thread kernel code, and harder to calculate
\subsection{Scatter parallelization}
\label{sec:scatt-parall}
A bunch of threads, each working on a single input element and making
a contribution to all outputs. The problem, however, is that all
threads have conflicting updates to the same output elements. One
solution is to use atomic operations for the updates, so that only one
thread can update at a time. The order of updates is unknown at
compile time. This, however, is very costly and slow.
Summary: even though the processing stage is parallelized, the final
stage, updating, is still serialized. So not much is gained.
NOTE: A load-modify-store operation has two full memory access delays.
\begin{itemize}
\item A DRAM delay to read one element (not to mention that fetching a
  single element wastes bandwidth): thousands of cycles on global
  memory. Shared memory cuts down the latency (tens of cycles), but
  the updates are still serialized. With shared memory, atomic
  operations are private to each thread block, so the programmer needs
  additional algorithmic work to properly merge the updates into
  global memory.
  Fermi uses the L2 cache for atomic operations, which gives medium
  latency and is global to all blocks. It reduces the programmer's
  effort, yet it is still serialized. A variable is kept in the cache
  when many threads access (read or update) it, which reduces the
  latency compared to accessing it in global memory. In the end,
  serialization is unavoidable.
\end{itemize}
A better approach is {\bf gather parallelization}
(Sect.~\ref{sec:gath-parall}).
\subsection{Gather parallelization}
\label{sec:gath-parall}
Here, instead of parallelizing the input processing, we parallelize
the output update: each thread updates one output element, rather than
processing one input element as scatter parallelization does.
So, each thread reads all input elements and does the proper
calculation to update its one output element.
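The contrast can be sketched as follows (hypothetical kernels;
\verb!f()! stands for the per-pair contribution from the sequential
loop above, and the scatter version assumes hardware floating-point
\verb!atomicAdd!):
\begin{lstlisting}
// Placeholder per-pair contribution (stands in for the real f of the problem).
__device__ static float f(float x, int m, int n) { return x; }

// Scatter: one thread per input element; conflicting output updates need atomics.
__global__ void scatter_kernel(const float *in, float *out, int M, int N)
{
  int m = blockIdx.x * blockDim.x + threadIdx.x;
  if (m < M)
    for (int n = 0; n < N; ++n)
      atomicAdd(&out[n], f(in[m], m, n));   // serialized per output element
}

// Gather: one thread per output element; no conflicts, no atomics.
__global__ void gather_kernel(const float *in, float *out, int M, int N)
{
  int n = blockIdx.x * blockDim.x + threadIdx.x;
  if (n < N) {
    float acc = 0.0f;
    for (int m = 0; m < M; ++m)
      acc += f(in[m], m, n);                // accumulate privately in a register
    out[n] = acc;
  }
}
\end{lstlisting}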
\subsection{Example: Direct Coulomb Summation (DCS) algorithm}
\label{sec:exampl-direct-coul}
Assume all input elements affect all output elements. At each
lattice point, the potential is
\begin{lstlisting}
potential += charge[i] / (distance to atom[i])
\end{lstlisting}
In 3D, we go through each z-slice, i.e. all the grid points on a slice
have the same z-coordinate.
Slide 20: good sequential C code
\begin{itemize}
\item loop 1: iterate over z-slices. All grid points on a slice have
  the same \verb!dz!, so it is calculated once.
\item loop 2: iterate over the y coordinate. All grid points in the
  same row of a z-slice have the same \verb!dy!, so it is calculated
  once.
\item loop 3: iterate over the x coordinate.
\end{itemize}
Here, the input is very regular, i.e. \verb!dz,dy! are calculated only
once. However, it's not a good algorithm to run on a GPU. Also, in
practice we don't have such regular input; for example, atoms come
from modeled molecular structures (irregularity by necessity).
\subsubsection{Simple (straightforward) CUDA parallelization}
\label{sec:simple-stra-cuda}
At first, as CPU memory is larger than the GPU's, we allocate the
whole potential map on the host CPU and ship each z-slice to the GPU
for processing. As a result, we know the \verb!dz! of the slice, so we
can precalculate it on the CPU. The kernel does the update work for
that slice, and the process is iterated for every z-slice.
Each thread computes the contribution of one atom to all grid points
(of the current z-slice). So, the index of the thread is mapped to the
index of the atom to process.
Again, each grid point receives multiple updates from different atoms,
so serialization is required, or atomic functions need to be used. At
the time, there was no effective and bug-free atomic operation for
floating-point addition; however, as of CUDA 4.0, atomic
floating-point addition works.
\subsubsection{A better CUDA parallelization (based on less efficient C sequential code)}
\label{sec:bett-cuda-parall}
Slide 29: it's output oriented, using the outer loop for grid points
and the inner loop to iterate over all atoms. This is less efficient
as sequential code, but works better on the GPU, where serialization
is avoided.
One important thing: there are two options
\begin{itemize}
\item an IF inside the kernel, to check whether the thread corresponds
  to an element inside the boundary of the matrix;
\item padding, to guarantee all threads have data to work on. This is
  necessary to align memory.
\end{itemize}
\textcolor{red}{CUDA 4.0 supports better tiling of non-aligned memory,
  as many people want to avoid data padding, since padding can consume
  resources. CUDA 4.0 has better tolerance for non-aligned data.}
Here, the index of the thread is mapped to the index of the grid point
(the coordinate of the output element).
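A hedged sketch of such an output-oriented kernel for one z-slice,
with the in-kernel boundary check option (this is not the lab's actual
kernel; the atom layout follows the (x,y,z,q) convention used in this
lab, everything else is illustrative):
\begin{lstlisting}
// Gather-style DCS kernel for one z-slice: one thread per grid point.
// slice_energy points to the current z-slice of the energy grid;
// atoms[] holds {x, y, z, q} per atom.
__global__ void dcs_slice(float *slice_energy, const float4 *atoms, int numatoms,
                          int nx, int ny, float gridspacing, float zslice)
{
  int ix = blockIdx.x * blockDim.x + threadIdx.x;
  int iy = blockIdx.y * blockDim.y + threadIdx.y;
  if (ix >= nx || iy >= ny) return;        // boundary check instead of padding

  float x = ix * gridspacing;
  float y = iy * gridspacing;
  float energy = 0.0f;
  for (int n = 0; n < numatoms; ++n) {     // every thread scans all atoms
    float dx = x - atoms[n].x;
    float dy = y - atoms[n].y;
    float dz = zslice - atoms[n].z;
    energy += atoms[n].w * rsqrtf(dx*dx + dy*dy + dz*dz);
  }
  slice_energy[iy * nx + ix] += energy;    // single writer per grid point
}
\end{lstlisting}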
QUESTION TO ASK YOURSELF ALL THE TIME: Is it parallel? If yes, is it
more efficient than the sequential code?
``Numbers really matter for a good engineer'' (Wen-mei Hwu).
Issues:
\begin{enumerate}
\item each thread, however, does redundant work; and
\item energygrid[] is a very large array, typically 20x or more larger
  than atom[];
\item on a modern CPU, cache effectiveness is often more important
  than compute efficiency. So, with a large array, the chance of cache
  misses increases, which may deteriorate performance.
\end{enumerate}
So, good fast sequential code uses 3 different arrays, a dz array, a
dy array and the atom array, which in total is about 3/20 of the
energygrid[] array. These 3 arrays typically fit in the cache, and
thus work more efficiently. So, what matters here are the numbers (the
sizes).
Tiling the energygrid[] array can be a working solution, yet it
requires more programming effort.
The L1 cache is non-coherent (not shared). Data will not go to the L1
cache unless it is constant. Shared data go to the L2 cache; there is
no private copy between L1 and L2. So, because of the shared nature of
the data, we may not need to use the L1 cache in some applications.
\subsection{Scatter 2 Gather transformation}
\label{sec:scatter-2-gather}
Even though gather works better on the GPU, it faces the problem of
input irregularity. Input tends to be much less regular, so it takes
time for each thread to locate the relevant input data to process.
\begin{itemize}
\item In a naive implementation, each thread loops through all input
  data to look for what it needs; this is the case in the
  electrostatic potential example (Sect.~\ref{sec:lab-2:-binning}).
  This makes execution time scale badly with data size.
\item A better approach defines bins in which to look. Elements that
  don't fit in the bins are put in an extra array and processed on the
  CPU. A ``cut-off'' value is defined for each grid point of the
  energy grid matrix, and only atoms within this radius are considered
  to contribute energy to the center grid point.
\end{itemize}
\subsection{Lab 2: binning with uniform}
\label{sec:lab-2:-binning}
\subsubsection{The problem: electrostatic potential map}
\label{sec:probl-electr-potent}
Calculate an electrostatic potential map: a regularly spaced lattice
of points in a space, each containing the summed potential
contributions from the atoms. The locations of the atoms are assumed
to be uniformly distributed.
An example, shown in 2D:
\begin{verbatim}
+ + + +
* *
+ + + * +
*
+ * + * + +
\end{verbatim}
where + denotes an energy point to calculate and * represents an
atom. The number of energy points (+) is very large compared to the
number of atoms (*).
Binning is a technique that helps solve the problem efficiently on
both GPU and CPU. It groups data into chunks called {\bf bins}, which
brings great efficiency for huge data sets. E.g., in ray tracing, a
KD-tree is a kind of non-uniformly sized binning that divides a scene
into multiple bounding boxes to group polygons close to each other, so
the polygons inside a bounding box can be ignored if the box has no
chance of colliding with the ray being traced.
\begin{framed}
  So, for each tile of grid points, first identify the set of atoms
  that need to be examined to calculate the energy at the grid points
  in the tile. The atom data are grouped into chunks called {\bf
    bins}.
  Generally, each bin collectively represents properties of the input
  data points in the bin; e.g., bins represent the location properties
  of atoms.
  The bins can be uniform bin arrays, variable bins, or a KD-tree.
\end{framed}
Atoms are modeled as point charges, each with a fixed partial charge
$q_i$ at a given coordinate (x,y,z). The electrostatic potential V is
closely related to the electrostatic force and the interaction energy
between atoms, and is expressed as a sum of contributions from all N
atoms. An exact Coulombic potential function (calculating
contributions over unbounded distances) has a huge computing
demand. So, a working approach is to use an approximate function
composed of 2 parts:
\begin{enumerate}
\item component 1 calculates the exact contribution from atoms within
  a given range (short-range or cut-off potential);
\item component 2 calculates the approximate contribution from atoms
  outside that range (long-range potential).
\end{enumerate}
This example tells how to calculate the first component on the GPU,
which is based on a predefined ``cut-off'' threshold.
\subsubsection{The fixed binning method}
\label{sec:binning-method}
So, the 3D simulation volume can be divided into a number of
same-sized cubes (i.e. uniform or fixed bins), each with a maximum
number of atoms the bin can contain. This is known as the capacity of
the bin. We also assume the atoms are uniformly distributed in the
simulation volume, so the number of atoms in each bin is almost
equal. Nevertheless, some bins may have no atoms and some cannot
contain all of their atoms due to the fixed capacity.
\textcolor{red}{Why don't we increase the capacity so that a bin
  always holds all of its atoms? - Answer: we don't want it too big,
  as that would waste memory space and is not very useful}.
So, in the latter case, we need to put the out-of-capacity atoms into
an extra list for sequential processing on the CPU.
So, the initial configuration of binning is the size and capacity of a
bin. \textcolor{red}{The first important issue here is the choice of
  bin capacity}. Using uniform, fixed-capacity bins allows us to use
an array implementation.
{\it Extended simulation volume}: padding elements are added around
the volume to make the computation more regular near the edges of the
simulation volume. The size of the padding is equal to the
``cut-off''. Also, a bounding cube is used, rather than a sphere, as
the search space of an output grid point defined by the ``cut-off''
radius. NOTE: using a cube rather than a sphere, some extra atoms may
be included in the neighbor lists even though their distances to the
center grid point are longer than the ``cutoff'' value. As a result,
they will be examined by each thread, yet not used.
So, the idea:
\begin{enumerate}
\item BINNING PROCESS: divide the extended simulation volume into
  non-overlapping uniform cubes (a CPU-side sketch of this step is
  given after this list)
\begin{verbatim}
sol_create_uniform_bin(grid, num_atoms, gridspacing, atoms, cutoff);
\end{verbatim}
  given the \verb!grid! array of the energy grid, the number of atoms,
  the grid spacing, a pointer to the array of atoms of (x,y,z,q), and
  the cutoff value.
\begin{verbatim}
BIN_LENGTH = cube size in x,y,z (the bigger the size, the more atoms
     will fall into each bin)
     we don't want to use too large a BIN_LENGTH (which may cause
     bin overflow or over-allocate memory space)
BIN_INVLEN = just the inverse of BIN_LENGTH
BIN_DEPTH  = bin capacity
\end{verbatim}
  Depending on the cutoff value, we may need one or more cubes on each
  side of the simulation volume. So, the dimension of the extended
  simulation volume is
\begin{verbatim}
2*c + lnx * gridspacing / BIN_LENGTH
\end{verbatim}
  with
\begin{verbatim}
c = cutoff / BIN_LENGTH = cutoff * BIN_INVLEN
\end{verbatim}
  the number of additional bins on each side of each dimension.
  \textcolor{red}{Each bin has a unique index in the simulation space
    for easy parallel processing}, i.e. \verb!bincntBaseAddr!.
  \verb!bincntZeroAddr! points to the entry for the first bin that
  lies inside the (non-extended) simulation volume.
\item Build the neighbor list for each output grid point: the list of
  all neighboring atoms within the capacity limit (a list of neighbor
  offsets), plus a list of extra atoms to be processed on the CPU.
\begin{verbatim}
sol_create_neighbor_list(gridspacing, cutoff);
\end{verbatim}
  The calculations on the CPU and on the GPU can be done at the same
  time.
\item Calculate the energy using the bins and the neighbor list.
\begin{verbatim}
sol_calc_energy_with_bins(energygrid, grid, atoms, num_atoms,
                          gridspacing, cutoff, k);
calc_extra(energygrid, grid, gridspacing, cutoff, k);
\end{verbatim}
\begin{itemize}
\item kernel: iterate through output grid
\begin{enumerate}
  \item for each grid point, identify the neighboring atoms within the
    ``cut-off'' threshold and accumulate their contributions
\begin{verbatim}
dist = |p.loc - atom.loc|
p.energy += atom.q / dist * s(dist)
\end{verbatim}
\end{enumerate}
\item CPU: iterate through output grid
\begin{enumerate}
\item for each grid point, identify the extra neighboring atoms and
update the contribution
\begin{verbatim}
dist = |p.loc - atom.loc|
p.energy += atom.q / dist * s(dist)
\end{verbatim}
\end{enumerate}
\end{itemize}
\end{enumerate}
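A hedged CPU-side sketch of the binning step (step 1 above);
\verb!BIN_INVLEN! and \verb!BIN_DEPTH! follow the description above,
but the function name, the assumed origin at (0,0,0), and the omission
of the extended-volume offset are simplifications, not the lab's
actual \verb!sol_create_uniform_bin!:
\begin{lstlisting}
#include <cuda_runtime.h>             // for float4 / make_float4
#define BIN_LENGTH 4.0f               /* cube edge (illustrative value)   */
#define BIN_INVLEN (1.0f / BIN_LENGTH)
#define BIN_DEPTH  8                  /* bin capacity (illustrative value) */

// CPU-side binning sketch: drop each atom into its uniform bin, or into the
// extra list if the bin is already at capacity.
void create_uniform_bins(const float *atoms, int numatoms,
                         float4 *bins, int *bincnt,  /* nbins*BIN_DEPTH and nbins */
                         int nbx, int nby, int nbz,
                         float4 *extra, int *numextra)
{
  *numextra = 0;
  for (int n = 0; n < numatoms; ++n) {
    float x = atoms[4*n], y = atoms[4*n+1], z = atoms[4*n+2], q = atoms[4*n+3];
    int bx = (int)(x * BIN_INVLEN);        // which bin this atom falls into
    int by = (int)(y * BIN_INVLEN);
    int bz = (int)(z * BIN_INVLEN);
    int b  = (bz * nby + by) * nbx + bx;   // linear bin index
    float4 a = make_float4(x, y, z, q);
    if (bincnt[b] < BIN_DEPTH)
      bins[b * BIN_DEPTH + bincnt[b]++] = a;  // store in the fixed-capacity bin
    else
      extra[(*numextra)++] = a;               // overflow: handled later on the CPU
  }
}
\end{lstlisting}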
The energygrid is organized in 1D form, though it's logically a 3D
array (grid.x,grid.y,grid.z). So the size is
\begin{lstlisting}
allocate(energygrid(grid.x * grid.y * grid.z))
\end{lstlisting}
The intermediate array that corresponds to a z-slice of the energy
grid is \verb!grid!. For generality, we use a \verb!voldim3i! with z=1.
\begin{lstlisting}
typedef struct {
int x;
int y;
int z;
} voldim3i;
\end{lstlisting}
An array of atoms, with the full information for each atom, is
organized as a flat array:
\begin{lstlisting}
coordinateatoms[4 * n + 0] : x
coordinateatoms[4 * n + 1] : y
coordinateatoms[4 * n + 2] : z
coordinateatoms[4 * n + 3] : q (charge)
\end{lstlisting}
with $0 \le n <$ number of atoms (numatoms). For a uniform lattice, a
single value \verb!gridspacing! specifies the grid spacing.
\begin{verbatim}
./parboil run cp 1 uniform
.......................
mkdir -p build/1_default
gcc -I/home/ac/tra294/CUDA_WORKSHOP_UIUC1108/common/include -I/usr/local/cuda/include -c s
rc/1/main.c -o build/1_default/main.o
gcc -I/home/ac/tra294/CUDA_WORKSHOP_UIUC1108/common/include -I/usr/local/cuda/include -c s
rc/1/cenergy.c -o build/1_default/cenergy.o
gcc -I/home/ac/tra294/CUDA_WORKSHOP_UIUC1108/common/include -I/usr/local/cuda/include -c /
home/ac/tra294/CUDA_WORKSHOP_UIUC1108/common/src/parboil_cuda.c -o build/1_default/parboil_c
uda.o
/usr/local/cuda/bin/nvcc build/1_default/main.o build/1_default/cenergy.o build/1_default/pa
rboil_cuda.o -o build/1_default/cp -L/usr/local/cuda/lib64 -lm -lpthread -lm
** waiting for 1420508.acm to finish...
** done. 178.0 seconds.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0:1E:0.
Resolving CUDA runtime library...
/usr/local/cuda_wrapper/lib64/cuda_wrapper.so (0x00007f4ab30e7000)
libcudart.so.4 => /usr/local/cuda/lib64/libcudart.so.4 (0x00007f4ab29d6000)
CUDA accelerated coulombic potential microbenchmark
Original version by John E. Stone <[email protected]>
This version maintained by Chris Rodrigues
IO : 2.025272s
Compute : 140.956296s
Timer Wall Time: 142.981598
Pass
Parboil parallel benchmark suite, version 0.2
\end{verbatim}
Enhancement 2.1: using the binning algorithm
\begin{verbatim}
Length of neighborhood bins list: 343
IO : 1.599365
Compute : 7.377883
Timer Wall Time: 8.977266
\end{verbatim}
\subsubsection{Bin sizes - large bins}
\label{sec:bin-sizes}
The second issue is the bin size, i.e. how big the bins should be.
\begin{enumerate}
\item A large bin contains many dummy (unused) atoms.
\item A bin that is too small may not be able to hold an adequate
  number of atoms (specified by the bin capacity).
\end{enumerate}
{\bf The large bin concept} is that a single bin is used to calculate
the potential energy for all grid points in a block. Consider the 3D
region processed by a thread block, known as the {\bf map region}.
Each thread calculates the potential at its corresponding grid
point. A cut-off sphere is used that covers the whole map region, and
then a larger cube is created to cover the cut-off sphere. This large
cube is used as the bin.
\begin{framed}
A typical cutoff distance in molecular structure is 8-12$\AA$; and
long-range potential can be calculated separately using an
approximate formula. The number of atoms within a cutoff distance
is relatively constant (with uniform atom density), e.g. 200-700
atoms within 8$\AA$ to 12$\AA$ cutoff sphere for typical molecular
structures.
\end{framed}
So, if the cutoff radius is 12$\AA$, the diameter of the cutoff
sphere is 24$\AA$, so the large cube should be of size $(24\AA)^3$ to
hold the sphere. Using the large bin concept, each map region requires
only a single bin of atoms.
\begin{enumerate}
\item For each map region, the atoms in the large bin are copied to
  constant memory (until it is full), then the kernel is launched.
\begin{lstlisting}
static __constant__ float4 atominfo[MAXATOMS];
\end{lstlisting}
  NOTE: atoms can be shipped to the 64KB constant memory.
\item Then, in the kernel, loop through the atoms in the copied
  constant buffer and check whether each atom is within the cutoff
  distance:
\begin{lstlisting}
__global__ static void mgpot_shortrng_energy(...) {
  [...]
  for (n = 0; n < natoms; n++) {
    float dx = coorx - atominfo[n].x;
    float dy = coory - atominfo[n].y;
    float dz = coorz - atominfo[n].z;
    float q = atominfo[n].w;
    float dxdy2 = dx*dx + dy*dy;
    float r2 = dxdy2 + dz*dz;
    if (r2 < CUTOFF2) {
      float gr2 = GC0 + r2*(GC1 + r2*GC2);
      float r_1 = 1.f/sqrtf(r2);
      accum_energy_z0 += q * (r_1 - gr2);
    }
  }
  [...]
}
\end{lstlisting}
\end{enumerate}
Using a radius of 12\AA, we have a cube of 24x24x24. Using a lattice
spacing of \verb!gridspacing=0.5!$\AA$, the map region corresponds to
$48^3$ lattice points, and each bin contains $20^3$ atoms on average,
which is quite big. However, even this quick and dirty implementation
gives a 6x performance increase. Can it be improved? YES - as only
6.5\% of the contained atoms are really used, i.e. are within the
cutoff distance.
Here, bin size and bin capacity are designed to allow each kernel
launch to cover enough lattice points to justify the kernel launch
overhead and fully utilize the GPU hardware.
\subsubsection{Bin sizes - small bins}
\label{sec:bin-sizes-small}
So, to have a much more accurate set of atoms in the bins, we need to
do more work in the binning function.
How about using small-bin kernels? Instead of using a single
24x24x24 bin, each thread block deals with a number of 4x4x4
bins. Here, all threads in a block scan the same bins and atoms. The
small bins, however, need to cover all the atoms out to the corners of
the cutoff sphere. Also, for the bins at the border, some of their
atoms will be used and some will not, so there's still a small amount
of divergence.
{\bf Neighborhood offset list}: from the center grid point, keep a
list of offsets to the bins that are within the cutoff distance. Then,
by visiting the bins in the neighborhood offset list, we can iterate
through the atoms in each such bin.
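A hedged sketch of how one thread walks the neighborhood offset list
for its grid point (the names, the bin layout, and the omission of the
switching function s(r) are assumptions, not the lab's code):
\begin{lstlisting}
#define BIN_DEPTH 8   // bin capacity, as in the binning sketch above (illustrative)

// Accumulate the short-range energy at one grid point (x, y, z) by visiting
// only the bins named by the neighborhood offset list. mybin is the linear
// index of the bin containing this block's map region; nboffsets[] holds
// relative linear bin indices within the cutoff distance.
__device__ float energy_from_bins(float x, float y, float z, int mybin,
                                  const float4 *bins, const int *nboffsets,
                                  int num_nboffsets, float cutoff2)
{
  float energy = 0.0f;
  for (int i = 0; i < num_nboffsets; ++i) {
    int b = mybin + nboffsets[i];               // neighboring bin to examine
    for (int j = 0; j < BIN_DEPTH; ++j) {
      float4 atom = bins[b * BIN_DEPTH + j];    // {x, y, z, q}; empty slots have q = 0
      float dx = x - atom.x, dy = y - atom.y, dz = z - atom.z;
      float r2 = dx*dx + dy*dy + dz*dz;
      if (r2 < cutoff2)                         // only atoms inside the cutoff contribute
        energy += atom.w * rsqrtf(r2);          // switching function s(r) omitted
    }
  }
  return energy;
}
\end{lstlisting}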
For the small bin design: with e.g. a 0.5$\AA$ lattice spacing, a
$(4\AA)^3$ cube covers 8x8x8 potential map points. This requires 128
threads per block (i.e. 4 points/thread). Then, 34\% of the examined
atoms are within the cutoff distance, so we improve the true-positive
rate. Small bins can also be used as tiles for locality.
Typically, the CPU runs the most sequential part of the execution and
the GPU runs the most parallel part. However, in this case, and some
other cases, the CPU processes the part of the parallel work that is
so irregular it would degrade performance if run on the GPU. So, this
split can bring better performance.
\subsubsection{At first, the data is assumed to be uniform.}
\label{sec:at-first-data}
What if the input data is non-uniform (i.e. irregular)?
\section{Sparse data}
\label{sec:sparse-data}
Sparse matrix-vector multiplication is a big issue. Example: a system
of tens of thousands of equations, each of which contains only a small
number of variables. The two main issues: (1) the sparse matrix is
very irregular (compression can be used to map the matrix to a regular
form); (2) memory bandwidth (we cannot use input sharing as in the
dense-matrix case, since only a few elements in each row are used).
\subsection{CSR (compressed sparse row)}
\label{sec:csr-compr-sparse}
Pack the elements by removing the zero elements, and keep an index
array to tell the exact location of the packed non-zero elements.
Slide 6: in \verb!ptr! (the row pointer array), if we see two
consecutive entries with the same value, there is an empty row; that
empty row corresponds to the first of the two entries.
\begin{verbatim}
data[7] = {3, 1, 2, 4, 1, 1, 1}
ptr[5]  = {0, 2, 2, 5, 7}
\end{verbatim}
The second row starts at \verb!data[2]!, and since there are no
non-zero elements in that row, the third row also starts at
\verb!data[2]!.
The data elements don't move; we just use some special indexing.
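A minimal CSR SpMV sketch, one thread per row, using the
\verb!data!/\verb!ptr! arrays above plus a column-index array
\verb!col! (illustrative kernel, not from the lecture code):
\begin{lstlisting}
// CSR SpMV: y = A * x, one thread per row.
// col[] holds the column index of each stored element of data[].
__global__ void spmv_csr(const float *data, const int *col, const int *ptr,
                         const float *x, float *y, int num_rows)
{
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < num_rows) {
    float dot = 0.0f;
    for (int j = ptr[row]; j < ptr[row + 1]; ++j)   // an empty row contributes nothing
      dot += data[j] * x[col[j]];
    y[row] = dot;
  }
}
\end{lstlisting}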
\subsection{ELL}
\label{sec:ell}
We no longer need the \verb!ptr! array, but we still need the column
index array. Rows are padded to the same length so they can be grouped
together.
Then, by transposing the (padded) matrix, each thread processes a
column of the stored array rather than a row (as in C layout), so
neighboring threads access adjacent data elements. This gives more
coalesced memory access.
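A minimal ELL SpMV sketch with the transposed (column-major) layout
just described (illustrative names; padded slots are assumed to hold
zeros):
\begin{lstlisting}
// ELL SpMV: data_ell[] and col_ell[] are padded to max_nnz_per_row entries per
// row and stored column-major ("transposed"), so at each step of the loop
// consecutive threads (rows) read consecutive memory locations -- coalesced.
__global__ void spmv_ell(const float *data_ell, const int *col_ell,
                         int max_nnz_per_row, const float *x, float *y, int num_rows)
{
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < num_rows) {
    float dot = 0.0f;
    for (int k = 0; k < max_nnz_per_row; ++k) {
      int idx = k * num_rows + row;        // column-major element index
      float v = data_ell[idx];             // padded slots hold 0
      if (v != 0.0f)
        dot += v * x[col_ell[idx]];
    }
    y[row] = dot;
  }
}
\end{lstlisting}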
\subsection{COO}
\label{sec:coo}
\subsection{Hybrid format}
\label{sec:hybrid-format}
For GPUs, there is a high chance that a sparse package nowadays uses a
hybrid format, where ELL handles the typical entries and COO handles
the exceptional entries (implemented with a segmented reduction).
This is similar to the concept of binning, where the extra atoms are
put into a separate array. This is one of the few ways that can help
regularize data.
\subsection{JDS}
\label{sec:jds}
Sort the rows by increasing number of non-zero elements. We can use
JDS and Hybrid together to launch multiple kernels.
\subsection{Variable Binning }
\label{sec:variable-binning-}
Using variable bins, we can have a compressed data structure. It looks
like CSR.
Do a counting pass first (on the CPU) to see how large each bin needs
to be (its capacity). If we do this on the GPU, we need atomic
operations to increment the count for each bin.
But what we need is where each bin should begin in the linear
layout. So, we need the cumulative sum, which gives the index of the
beginning location of each bin. On the GPU, we can use a parallel
prefix scan over the bin count array to generate the array of starting
points of all bins (e.g. with the CUDPP package).
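A CPU-side sketch of this count-then-scan step (the GPU version would
replace the loop with a parallel exclusive prefix scan such as
CUDPP's; the names are illustrative):
\begin{lstlisting}
// bincount[b] holds the number of atoms in bin b; an exclusive prefix sum
// turns the counts into the starting offset of each bin in one packed array.
void bin_start_offsets(const int *bincount, int *binstart, int nbins)
{
  int sum = 0;
  for (int b = 0; b < nbins; ++b) {
    binstart[b] = sum;          // where bin b begins in the compact atom array
    sum += bincount[b];
  }
  // On the GPU the same result comes from a parallel exclusive prefix scan
  // over bincount[].
}
\end{lstlisting}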
For the compact bins, give the
In an N-body simulation the atoms move after each iteration, so the
input is not stationary and the binning algorithm becomes very
tricky. We may need to rebuild the bins, or a brute-force approach
needs to be used, or we can use bins large enough that we don't have
to rebuild the neighbor list after every time step.
% \section{Debug}
% \label{sec:debug}
\section{Privatization}
\label{sec:privatization}
Privatization is used when:
\begin{enumerate}
\item the number of outputs is small compared to the number of inputs,
  e.g. sum reduction;
\item the number of outputs cannot be statically allocated,
  e.g. a histogram is a scatter operation whose output is not static.
\end{enumerate}
A good queue structure can support highly efficient extraction of
input data from bulk data structures.
Example: 1+2+...+10 can be done in parallel for 2 chunks
\begin{verbatim}
1+2+..+5 and 6+...+10
\end{verbatim}
In reality, when applied to floating-point numbers, different orders
of adding the chunks can give different results on a computer, due to
numerical issues. Reordering the execution relies on associativity and
commutativity; this holds for simple (integer) addition, but for other
operations, or even for addition between very small and very large
floating-point numbers, it is not always the case.
For a histogram, each thread creates its own private copy of the
output, so it knows that only it accesses and modifies that copy. This
continues until there is a small enough number of copies for a single
thread to combine them. Thus, at each phase of the computation, the
work to be done has to be dynamically determined and extracted from a
bulk data structure. This is a hard problem when the bulk data
structure is not designed/organized for massively parallel execution,
e.g. a graph.
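A minimal sketch of a privatized histogram (256 bins over byte-valued
input; the names are illustrative): each block accumulates into its
own shared-memory copy and only merges into the global histogram at
the end.
\begin{lstlisting}
#define NUM_BINS 256
// Each block builds a private copy in shared memory (cheap block-local
// atomics); the few per-block copies are then combined into the global
// histogram with one global atomic per bin per block.
__global__ void histogram_private(const unsigned char *in, int n, unsigned int *histo)
{
  __shared__ unsigned int local[NUM_BINS];
  for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
    local[b] = 0;                                   // clear the private copy
  __syncthreads();

  int i      = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (; i < n; i += stride)
    atomicAdd(&local[in[i]], 1u);                   // block-private update
  __syncthreads();

  for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
    atomicAdd(&histo[b], local[b]);                 // merge into the global result
}
\end{lstlisting}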
The current CUDA model is to launch a kernel which creates a bunch of
blocks, each block with a bunch of threads. Once it's launched,
there's no way to change that, nor to change the amount of data it
processes. This makes it hard to deal with dynamic data under the
current CUDA/OpenCL kernel configuration model.
In addition, each new generation of GPUs adds new features. This tends
to change the algorithmic strategies, which makes an algorithm hard to
generalize to every hardware architecture. So, you need to understand
the fundamental design first.
\subsection{BFS}
\label{sec:bfs}
In Breadth-First Search, suppose you start with a ``frontier'' vertex
and start searching to build a tree. You create a queue into which you
put the frontier vertices; when a new level is reached, you remove the
old ones and add the new ones. So, the vertices on the same level are
processed in parallel.
\begin{verbatim}
level 1: s
level 2: r, w
level 3: v, t, x,
level 4: u, y
\end{verbatim}
So, we need one GPU kernel for one level, and the size at each level
is known. By doing so, the kernel configuration can be determined
before the kernel launch. The complexity is O(V+E).
BFS can be used in VLSI CAD.
Node-oriented parallelization: each thread is assigned to a node, so
there is a large number of threads if there is a large number of
nodes. Each thread examines all the neighbors of its node in the graph
and determines which nodes will be in the frontier in the next
phase. As every thread does this brute-force check at every level, the
complexity is O(VL+E), with V the number of vertices and L the number
of levels. This is higher than O(V+E).
Matrix-based parallelization: for sparsely connected graphs, the
connectivity matrix is a sparse matrix. The complexity is
O(V+EL). This is slower than the sequential algorithm for large
graphs.
Two-level hierarchy: why not split the large queue into smaller
queues? Each thread writes to its own smaller queue, so there is far
less contention on the atomic operations used for writing data. In
addition, if a queue is small enough, we can put it into shared
memory. This is the idea of {\bf parallel insert-compact queues}; each
queue has a fixed-size capacity. In the end, the local queues' data
are put back into the big global queue.
Each thread processes one or more frontier nodes: it finds the indices
of the new frontier nodes and (the privatization process) builds the
next level's queue for these nodes. The actual number of data elements
to be stored in each queue is unknown in advance, so atomic writes are
needed: if the first thread block writes, say, 5 elements (locations
0-4), the second thread block may need 10 elements and has to wait for
the first block to finish reserving its space to know that it starts
at location 5.
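A hedged sketch of the block-level (b-queue) mechanics described
above; the neighbor expansion itself is elided, and all names and the
fixed capacity are illustrative:
\begin{lstlisting}
#define BQ_CAPACITY 1024
// Each block collects newly discovered frontier nodes in a shared-memory
// block queue (b-queue), then reserves a contiguous chunk of the global
// queue (g-queue) with a single atomicAdd and copies its entries there.
__global__ void bfs_expand(const int *frontier, int frontier_size, /* graph arrays elided */
                           int *g_queue, int *g_queue_size)
{
  __shared__ int b_queue[BQ_CAPACITY];
  __shared__ int b_count;
  __shared__ int g_offset;
  if (threadIdx.x == 0) b_count = 0;
  __syncthreads();

  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < frontier_size) {
    int node = frontier[tid];
    // ... examine node's neighbors; for each newly visited neighbor v:
    int v = node;                           // placeholder for a discovered neighbor
    int pos = atomicAdd(&b_count, 1);       // block-local atomic, low contention
    if (pos < BQ_CAPACITY)
      b_queue[pos] = v;                     // overflow handling omitted in this sketch
  }
  __syncthreads();

  if (threadIdx.x == 0) {
    if (b_count > BQ_CAPACITY) b_count = BQ_CAPACITY;  // clamp (sketch only)
    g_offset = atomicAdd(g_queue_size, b_count);       // one global atomic per block
  }
  __syncthreads();

  for (int i = threadIdx.x; i < b_count; i += blockDim.x)
    g_queue[g_offset + i] = b_queue[i];     // cooperative copy to the g-queue
}
\end{lstlisting}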
Three-level hierarchy: we can further split the local b-queue into
smaller queues (w-queues), with the number of w-queues matched to the
number of hardware units that can write at the same time, e.g. a
half-warp on Tesla or a full warp on Fermi. Here w-queue means
warp-queue and b-queue means block-queue. Each thread in a warp writes
to the same w-queue. Then all w-queues in a block are assembled into
the b-queue using atomic operations, and finally all b-queues are
assembled into the global g-queue using atomic operations. However, we
cannot control which thread block writes its data back to the g-queue
first, so the ordering in the g-queue changes from run to run.
\textcolor{red}{As a result, this is not a working solution when the
  output order matters (e.g. for sorting)}.
Using privatized queues, shortest path on regular graphs achieves a
good speedup of 6-10x. There are also many other (free-form) graph
algorithms for which we still don't have a good parallel GPU
algorithm, due to load imbalance.
\section{Page-locked memory}
\label{sec:page-locked-memory}
\begin{verbatim}
- check return code from calls to mlock() , mmap() ...