diff --git a/.wordlist.txt b/.wordlist.txt
index c0fee3d594..eb2fffa8ba 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -16,6 +16,7 @@ clr
 cuBLASLt
 cuCtx
 cuDNN
+dataflow
 deallocate
 denormal
 dll
@@ -74,6 +75,7 @@ Nsight
 overindex
 overindexing
 oversubscription
+pragmas
 preconditioners
 prefetched
 preprocessor
diff --git a/docs/tutorials/reduction.rst b/docs/tutorials/reduction.rst
index 8dd155237c..58d5ab634b 100644
--- a/docs/tutorials/reduction.rst
+++ b/docs/tutorials/reduction.rst
@@ -244,7 +244,7 @@ accesses". A notable exception is when the shared read uniformly evaluates to
 the same address across the entire warp/wavefront turning it into a broadcast.
 
 A better change naive implementation is to have not only the activity of
-threads form continous ranges but their memory accesses too.
+threads form continuous ranges but their memory accesses too.
 
 .. code-block:: diff
 
@@ -409,8 +409,8 @@ This compiles to the following binaries:
 
 LLVM unrolls the the loop and compiles to a flat series of ``printf`` invocations
 while GCC and MSVC both agree to keep the loop intact, visible from the compare
-(``cmp``) and the jump (``jne``, ``jl``) instructions. LLVM codegen is identical to
-us having written the unrolled loop manually:
+(``cmp``) and the jump (``jne``, ``jl``) instructions. LLVM code generation is
+identical to us having written the unrolled loop manually:
 
 .. code-block:: C++
 
@@ -697,7 +697,7 @@ elements in shared as warps within out block. Much like we could only launch
 kernels at block granularity to begin with, we can only warp reduce with
 ``WarpSize`` granularity (due to the collective nature of the cross-lane
 built-ins), hence we introduce ``read_shared_safe`` to pad overindexing by
-reading ``zero_elem`` -ents. Reading from global remains unchanged.
+reading ``zero_elem`` -s. Reading from global remains unchanged.
 
 .. code-block:: C++
 