diff --git a/.wordlist.txt b/.wordlist.txt
index c0fee3d594..eb2fffa8ba 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -16,6 +16,7 @@ clr
 cuBLASLt
 cuCtx
 cuDNN
+dataflow
 deallocate
 denormal
 dll
@@ -74,6 +75,7 @@ Nsight
 overindex
 overindexing
 oversubscription
+pragmas
 preconditioners
 prefetched
 preprocessor
diff --git a/docs/tutorials/reduction.rst b/docs/tutorials/reduction.rst
index 8dd155237c..58d5ab634b 100644
--- a/docs/tutorials/reduction.rst
+++ b/docs/tutorials/reduction.rst
@@ -244,7 +244,7 @@ accesses".
 A notable exception is when the shared read uniformly evaluates to the same
 address across the entire warp/wavefront turning it into a broadcast. A
 better change naive implementation is to have not only the activity of
-threads form continous ranges but their memory accesses too.
+threads form continuous ranges but their memory accesses too.
 
 .. code-block:: diff
 
@@ -409,8 +409,8 @@ This compiles to the following binaries:
 
 LLVM unrolls the the loop and compiles to a flat series of ``printf`` invocations
 while GCC and MSVC both agree to keep the loop intact, visible from the compare
-(``cmp``) and the jump (``jne``, ``jl``) instructions. LLVM codegen is identical to
-us having written the unrolled loop manually:
+(``cmp``) and the jump (``jne``, ``jl``) instructions. LLVM code generation is
+identical to us having written the unrolled loop manually:
 
 .. code-block:: C++
 
@@ -697,7 +697,7 @@ elements in shared as warps within out block. Much like we could only launch
 kernels at block granularity to begin with, we can only warp reduce with
 ``WarpSize`` granularity (due to the collective nature of the cross-lane
 built-ins), hence we introduce ``read_shared_safe`` to pad overindexing by
-reading ``zero_elem`` -ents. Reading from global remains unchanged.
+reading ``zero_elem`` values. Reading from global remains unchanged.
 
 .. code-block:: C++