+ - lambda expressions
+```cpp
+ [=](id<1> i) {
+ y[i] += a * x[i];
+ }
+```
+
+
+
+ - function objects (functors)
+
+```cpp
+class AXPYFunctor {
+public:
+ AXPYFunctor(float a, accessor x, accessor y): a(a), x(x),
+ y(y) {}
+
+ void operator()(id<1> i) {
+ y[i] += a * x[i];
+ }
+
+private:
+ float a;
+ accessor x;
+ accessor y;
+};
+```
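+
+A sketch of how such a functor could be used: in SYCL it is passed to `parallel_for` in place of a lambda, e.g. `cgh.parallel_for(range<1>(N), AXPYFunctor(a, x, y));`. Below, a plain host-side analogue (with `std::vector` standing in for the SYCL accessors, an assumption for illustration) invokes the call operator once per index:
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <vector>
+
+// Host-side analogue of the functor slide: std::vector replaces the
+// SYCL accessors, and a serial loop stands in for parallel_for.
+class AXPYFunctor {
+public:
+  AXPYFunctor(float a, std::vector<float>& x, std::vector<float>& y)
+      : a(a), x(x), y(y) {}
+
+  // Called once per "work-item" index i
+  void operator()(std::size_t i) const { y[i] += a * x[i]; }
+
+private:
+  float a;
+  std::vector<float>& x;
+  std::vector<float>& y;
+};
+
+int main() {
+  std::vector<float> x(4, 1.0f), y(4, 2.0f);
+  AXPYFunctor axpy(3.0f, x, y);
+  for (std::size_t i = 0; i < y.size(); ++i) axpy(i);  // y = a*x + y
+  assert(y[0] == 5.0f);  // 3*1 + 2
+  return 0;
+}
+```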
+
+
+
+# Launching Kernels{.section}
+
+# Grid of Work-Items
+
+
+
+
+![](img/Grid_threads.png){.center width=37%}
+
+
A grid of work-groups executing the same **kernel**
+
+
+
+
+![](img/mi100-architecture.png){.center width=53%}
+
+
AMD Instinct MI100 architecture (source: AMD)
+
+
+ - a grid of work-items is created on a specific device to perform the work
+ - each work-item executes the same kernel
+ - each work-item typically processes a different element of the data
+ - there is no global synchronization or data exchange between work-items
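+
+The points above can be sketched as a serial analogue in plain C++ (an illustration, not SYCL code): the same kernel body is applied once per work-item index, and since the items are independent, no ordering between them is assumed:
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <vector>
+
+// Serial analogue of a 1-D grid launch: the same "kernel" body runs
+// once per work-item index; the items are independent of each other.
+void run_grid(std::size_t N, float a, std::vector<float>& x,
+              std::vector<float>& y) {
+  // Reverse order to emphasise that no ordering between items is assumed
+  for (std::size_t i = N; i-- > 0;)
+    y[i] += a * x[i];  // the kernel body, identical for every work-item
+}
+
+int main() {
+  std::vector<float> x(8, 1.0f), y(8, 1.0f);
+  run_grid(8, 2.0f, x, y);
+  for (float v : y) assert(v == 3.0f);  // 2*1 + 1
+  return 0;
+}
+```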
+
+# Basic Parallel Launch with `parallel_for`
+
+
+
+ - **range** class to prescribe the span of iterations
+ - **id** class to index an instance of a kernel
+ - **item** class provides additional functions (e.g. querying the range)
+
+
+
+
+
+```cpp
+cgh.parallel_for(range<1>(N), [=](id<1> idx){
+ y[idx] += a * x[idx];
+});
+```
+
+```cpp
+cgh.parallel_for(range<1>(N), [=](item<1> item){
+ auto idx = item.get_id();
+ auto R = item.get_range();
+ y[idx] += a * x[idx];
+});
+```
+
+
+
+ - the runtime chooses how to group the work-items
+ - supports 1D, 2D, and 3D grids
+ - no control over the group size, no locality within kernels
+
+
+# Parallel launch with **nd-range** I
+
+![](img/ndrange.jpg){.center width=100%}
+
+
https://link.springer.com/book/10.1007/978-1-4842-9691-2
+
+# Parallel launch with **nd-range** II
+
+ - enables low level performance tuning
+ - **nd_range** sets the global range and the local range
+ - iteration space is divided into work-groups
+ - work-items within a work-group are scheduled on a single compute unit
+ - **nd_item** enables querying the work-group range and index
+
+```cpp
+cgh.parallel_for(nd_range<1>(range<1>(N),range<1>(64)), [=](nd_item<1> item){
+ auto idx = item.get_global_id();
+ auto local_id = item.get_local_id();
+ y[idx] += a * x[idx];
+});
+```
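+
+As a sketch of the index arithmetic behind `nd_range`, the following plain C++ (hypothetical helper names, not SYCL API) shows how a global id decomposes into a work-group id and a local id, assuming the global range is divisible by the local range:
+
+```cpp
+#include <cassert>
+#include <cstddef>
+
+// How an nd_range<1>(N, L) decomposes indices: each global id maps to a
+// work-group id and a local id within that group (assuming N % L == 0).
+std::size_t group_of(std::size_t global_id, std::size_t L) { return global_id / L; }
+std::size_t local_of(std::size_t global_id, std::size_t L) { return global_id % L; }
+
+int main() {
+  const std::size_t N = 256, L = 64;
+  for (std::size_t g = 0; g < N / L; ++g)      // work-groups
+    for (std::size_t l = 0; l < L; ++l) {      // work-items within a group
+      std::size_t global_id = g * L + l;       // what get_global_id() would return
+      assert(group_of(global_id, L) == g);
+      assert(local_of(global_id, L) == l);
+    }
+  return 0;
+}
+```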
+
+# Parallel launch with **nd-range** III
+ - extra functionality
+ - each work-group has work-group *local memory*
+ - faster to access than global memory
+ - can be used as programmable cache
+ - group-level *barriers* and *fences* to synchronize work-items within a group
+ - *barriers* force all work-items to reach a specific point before continuing
+ - *fences* ensure writes are visible to all work-items before proceeding
+ - group-level collectives for communication (e.g. broadcast) or computation (e.g. scans)
+ - useful for reductions at group-level
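+
+A minimal host-side sketch of such a group-level reduction (plain C++, not SYCL; the per-group stage stands in for a local-memory reduction with a barrier): each "work-group" reduces its slice into one partial result, and the partials are then combined:
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <numeric>
+#include <vector>
+
+// Two-stage group-level reduction: each work-group reduces its slice into
+// one partial result (in SYCL this stage would use work-group local memory
+// plus a barrier), then the per-group partials are combined.
+int grouped_sum(const std::vector<int>& data, std::size_t L) {
+  std::vector<int> partial(data.size() / L, 0);
+  for (std::size_t g = 0; g < partial.size(); ++g)  // one "work-group" per slice
+    for (std::size_t l = 0; l < L; ++l)
+      partial[g] += data[g * L + l];
+  return std::accumulate(partial.begin(), partial.end(), 0);
+}
+
+int main() {
+  std::vector<int> data(256, 1);           // grid of 256 items, groups of 64
+  assert(grouped_sum(data, 64) == 256);
+  return 0;
+}
+```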
+
+
+# Summary
+
+ - **queues** are bridges between host and devices
+ - each queue maps to one device
+ - work is enqueued by submitting **command groups**
+ - gives a lot of flexibility
+ - parallel code (kernel) is submitted as a lambda expression or as a function object (functor)
+ - two methods to express the parallelism
+ - basic launching
+ - via **nd-range**