% !TEX root = main.tex
% MMW: I'd order these subsections roughly in order of the software stack... most fundamental to most complex
\subsection{Kokkos Core}\label{subsec:kokkos}
Kokkos Core\footnote{For historical reasons the package name is ``Kokkos'' since it precedes the
creation of other Kokkos subprojects such as Kokkos Kernels.} is a programming model
designed for performance portability across hardware architectures such as CPUs and GPUs.
Two primary insights into achieving performance portability shaped Kokkos's development:
i) abstractions for both parallel execution and data structures are required, and
ii) the fundamental programming model must be descriptive, not prescriptive.
The parallel execution abstractions include
\texttt{execution spaces} to specify where an algorithm should run, \texttt{execution
policies} to indicate how the algorithm should be run in terms of hardware resources, and
\texttt{execution patterns} to designate which pattern among \texttt{for, reduce, scan}
will be used by the algorithm. The data structure abstractions are built around the
concepts of \texttt{memory spaces} to indicate where memory is allocated,
\texttt{memory layouts} to specify how data is laid out in memory, and \texttt{memory traits}
that enable customizations of how data is accessed. The primary data structure incorporating
these concepts is \texttt{Kokkos::View}, a multidimensional array abstraction that inspired
the ISO C++23 \texttt{mdspan} class. Detailed descriptions of Kokkos Core can be found
in~\cite{edwards2014} and \cite{trott2022kokkos}.
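As a minimal illustrative sketch (not drawn from a specific application; the array
length and values are arbitrary), the following code allocates \texttt{Kokkos::View}
arrays and launches \texttt{parallel\_for} and \texttt{parallel\_reduce} patterns on the
default execution space:
\begin{verbatim}
#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1000;
    // Views allocated in the default execution space's memory space.
    Kokkos::View<double*> x("x", n);
    Kokkos::View<double*> y("y", n);

    // "for" pattern: fill both vectors in parallel.
    Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
      x(i) = 1.0;
      y(i) = 2.0;
    });

    // "reduce" pattern: dot product of x and y.
    double dot = 0.0;
    Kokkos::parallel_reduce("dot", n,
      KOKKOS_LAMBDA(const int i, double& sum) { sum += x(i) * y(i); },
      dot);
  }
  Kokkos::finalize();
  return 0;
}
\end{verbatim}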
\subsection{Kokkos Kernels}\label{subsec:kk}
Kokkos Kernels~\cite{rajamanickam2021kokkoskernels} is part of the Kokkos
ecosystem~\cite{trott2021kokkos} and provides node-local implementations of mathematical
kernels widely used across packages in Trilinos. As a member of the Kokkos
ecosystem, Kokkos Kernels is tightly integrated with Kokkos features and aims to deliver
performance-portable algorithms across major CPU- and GPU-based HPC systems.
Due to its node-local nature, and unlike many other Trilinos packages, Kokkos Kernels
does not rely on MPI or other communication libraries.
The implementations of Kokkos Kernels algorithms leverage the hierarchical parallelism
exposed by the Kokkos library~\cite{kim2017designing} and increasingly provide coverage
for stream-callable kernels. To ensure flexibility for the distributed libraries that
might call its algorithms, Kokkos Kernels provides thread-safe and asynchronous
implementations for most of its kernels. Kokkos Kernels also serves as a major point of
integration for vendor-optimized libraries such as cuBLAS, cuSPARSE, rocBLAS, rocSPARSE,
MKL, ARMpl and others.
The capabilities provided by Kokkos Kernels can be divided into four major categories:
1. BLAS algorithms, 2. sparse linear algebra and preconditioners, 3. graph algorithms, and
4. batched dense and sparse linear algebra~\cite{liegeois2023performance}. The main
points of integration of Kokkos Kernels in Trilinos are Tpetra, for the dense and sparse
linear algebra capabilities; Ifpack2, for the preconditioners and batched algorithms;
and the multigrid package MueLu, which relies on these features both directly and
indirectly, as well as on specialized algorithms such as graph
coloring/coarsening~\cite{kelley2022parallel} and fused Jacobi-SpGEMM (generalized
sparse matrix-matrix multiply) kernels.
Like the Kokkos library, Kokkos Kernels is developed in its own GitHub
repository\footnote{https://github.com/kokkos/kokkos-kernels} outside of the Trilinos
GitHub repository. Every version of the library is integrated and tested in Trilinos
as part of the Kokkos ecosystem release process. Additional information on Kokkos
Kernels' capabilities can be found in~\cite{deveci2018multithreaded,wolf2017fast}.
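As a brief sketch of the BLAS category (the vector length and scalar values here are
purely illustrative), the following code applies node-local, performance-portable dense
vector kernels to Kokkos Views:
\begin{verbatim}
#include <Kokkos_Core.hpp>
#include <KokkosBlas1_axpby.hpp>
#include <KokkosBlas1_dot.hpp>
#include <KokkosBlas1_nrm2.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 10000;
    Kokkos::View<double*> x("x", n);
    Kokkos::View<double*> y("y", n);
    Kokkos::deep_copy(x, 1.0);
    Kokkos::deep_copy(y, 2.0);

    // y = 2*x + 1*y, executed on the default device.
    KokkosBlas::axpby(2.0, x, 1.0, y);

    // Dot product and 2-norm, returned to the host.
    const double d   = KokkosBlas::dot(x, y);
    const double nrm = KokkosBlas::nrm2(y);
    (void)d; (void)nrm;
  }
  Kokkos::finalize();
  return 0;
}
\end{verbatim}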
%\todo{I'm not sure if we want to include Teuchos, but it might be necessary since it is omnipresent in the use of Trilinos. We probably should expand this a bit more.}
% MMW I don't think Teuchos should be under framework. I think we want to stick to logical groupings versus load-balanced ones.
\subsection{Teuchos}
Teuchos provides a suite of common tools for many Trilinos packages. These tools include memory management classes~\cite{bartlett2010} such as ``smart'' pointers and arrays, ``parameter lists'' for communicating hierarchical lists of parameters between library or application layers, templated wrappers for the BLAS and LAPACK, XML parsers, and other utilities. They provide a unified ``look and feel'' across Trilinos packages, and help avoid common programming mistakes.
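The sketch below (the parameter names are made up for illustration) shows two of these
tools together: a reference-counted \texttt{Teuchos::RCP} smart pointer managing the
lifetime of a hierarchical \texttt{Teuchos::ParameterList}:
\begin{verbatim}
#include <Teuchos_RCP.hpp>
#include <Teuchos_ParameterList.hpp>

int main() {
  // Reference-counted smart pointer owning a parameter list.
  Teuchos::RCP<Teuchos::ParameterList> params =
      Teuchos::rcp(new Teuchos::ParameterList("Solver"));

  // Hierarchical parameters passed between application and library layers.
  params->set("Convergence Tolerance", 1.0e-8);
  params->set("Maximum Iterations", 200);
  params->sublist("Preconditioner").set("Type", "ILU");

  // Typed retrieval; get<T> throws if the entry is missing or mistyped.
  const double tol = params->get<double>("Convergence Tolerance");
  (void)tol;
  return 0;
}
\end{verbatim}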
\subsection{Tpetra}\label{subsec:tpetra}
Tpetra \cite{hoemmen2015tpetra} provides the distributed-memory
infrastructure for sparse linear algebra computations. It implements
distributed-memory linear algebra objects, such as sparse graphs,
sparse matrices and dense vectors, using Kokkos for local data
storage. Distributed-memory sparse linear algebra operations, such as
a sparse matrix-vector product, are implemented through on-node calls
to Kokkos Kernels and inter-node MPI communication. Tpetra features
include:
\begin{itemize}
\item \textit{Maps} --- distributions of objects over MPI ranks.
\item \textit{(Multi)Vectors} --- storage of dense vectors or collections of
vectors (multivectors) and associated BLAS-1-like kernels (e.g., dot
products, norms, scaling, vector addition, pointwise vector
multiplication) as well as tall skinny QR (TSQR) factorization for multivectors.
\item \textit{Import/export} --- moving vector, graph and matrix data
between different distributions (maps). This is key for performing
halo/boundary exchanges, as well as other kernels such as
sparse matrix-matrix multiplication.
\item \textit{Sparse graphs} --- in compressed sparse row (CSR)
format. Graphs also include import/export objects for use in
halo/boundary exchanges associated with the graph.
\item \textit{Sparse matrices} --- in compressed sparse row (CSR) and
block compressed sparse row (BSR) formats. Associated kernels include
sparse matrix-vector product (SpMV), sparse matrix-matrix
multiplication (SpGEMM) and triple product, sparse matrix-matrix addition, sparse matrix transpose, diagonal extraction,
Frobenius norm calculation, and row/column scaling.
\end{itemize}
\todo{@Brian/Chris: do you want to expand this a bit?}
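To make the list above concrete, the following sketch (with an arbitrary problem size)
builds a uniformly distributed map, creates vectors on it, and performs BLAS-1-like
operations whose local work runs through Kokkos and whose global reductions use MPI:
\begin{verbatim}
#include <Teuchos_RCP.hpp>
#include <Tpetra_Core.hpp>
#include <Tpetra_Map.hpp>
#include <Tpetra_Vector.hpp>

int main(int argc, char* argv[]) {
  // Initializes MPI (if enabled) and Kokkos; finalizes them at scope exit.
  Tpetra::ScopeGuard tpetraScope(&argc, &argv);
  {
    auto comm = Tpetra::getDefaultComm();

    // A Map distributing the global indices evenly across MPI ranks.
    const Tpetra::global_size_t numGlobalEntries = 1000000;
    using map_type = Tpetra::Map<>;
    Teuchos::RCP<const map_type> map =
        Teuchos::rcp(new map_type(numGlobalEntries, 0, comm));

    // Distributed vectors whose local data are stored in Kokkos Views.
    Tpetra::Vector<> x(map);
    Tpetra::Vector<> y(map);
    x.putScalar(1.0);
    y.putScalar(2.0);

    // BLAS-1-like operations: y = 2*x + y, then a global dot product.
    y.update(2.0, x, 1.0);
    const double d = y.dot(x);
    (void)d;
  }
  return 0;
}
\end{verbatim}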
\subsection{Thyra}
The Thyra package provides abstractions for use in writing high-level abstract numerical algorithms (ANAs). Thyra is used by other Trilinos packages to implement ANAs such as solvers for linear equations, nonlinear equations, bifurcation and stability analysis, optimization, and uncertainty quantification. The abstractions hide concrete linear algebra and solver implementations so that an algorithm can be written once and reused across many application codes. For example, one could write a Krylov solver once and use it with multiple concrete linear algebra implementations, such as Tpetra or Epetra in Trilinos, or the vector/matrix objects in PETSc.
Thyra is organized into three sets of abstractions depending on the type of operation required. All are templated on the scalar type. These sets are:
\begin{itemize}
\item \emph{Vector/Operator Interfaces:} The first set is an abstraction for vectors and operators \cite{Bartlett2007}. It includes abstractions for vector spaces, vectors, multivectors, and linear operators. Thyra provides a default set of vector-vector and vector-operator operations for standard linear algebra tasks such as AXPY from the BLAS. Defining an extensible interface such that users can add arbitrary vector/matrix operations while maintaining performance can be difficult. A general tool to achieve this is the Trilinos utility package RTOp \cite{rtop}, which provides an interface for users to add new Thyra operations. RTOp accepts a vector of Thyra input objects and a vector of Thyra output objects along with a functor, and then applies the functor to each element of the data.
\item \emph{Operator Solve Interfaces:} The second set of abstractions supply abstract linear solvers that can be attached to operators. The interfaces include base classes for preconditioners, preconditioner factories, linear operators with a solver, and factories for linear operators with a solver. The factories build and update the preconditioners and linear solvers as needed. The solver interfaces allow one to wrap any concrete linear solver or preconditioner for use with the ANAs. Trilinos provides a separate library named Stratimikos that wraps all of the Trilinos preconditioners and linear solvers in Thyra abstractions. This library is driven by \code{Teuchos::ParameterList} options.
\item \emph{Nonlinear Interfaces:} The final set of abstractions supports nonlinear operations. Thyra provides an interface that wraps nonlinear solvers. Implicit time integration and optimization solvers typically require nonlinear solves and can abstract away the concrete implementation via Thyra. The NOX package in Trilinos provides a concrete implementation of this interface. The final interface in Thyra is for querying a physics application for common objects required by high-level Newton-based solvers. The \code{Thyra::ModelEvaluator} interface allows codes to ask the application to evaluate residuals, Jacobians, parametric sensitivities, Hessians, and post-processed quantities/derivatives of interest. The interface is designed to be stateless (i.e., consecutive calls with the same inputs should produce the same outputs). This design was chosen to avoid pitfalls in complex application codes interacting with general nonlinear algorithms. It does not preclude use in stateful applications that keep a ``memory'' between time steps, but rather forces the stateful information to be handled explicitly in the interface.
\end{itemize}
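As a hedged sketch of how an ANA can be written purely against the first set of
interfaces, the fragment below performs a single gradient-descent-style update using
only \texttt{Thyra::VectorBase} and a standard vector operation; the function name and
step-length parameter are hypothetical and chosen only for illustration:
\begin{verbatim}
#include <Teuchos_Ptr.hpp>
#include <Thyra_VectorBase.hpp>
#include <Thyra_VectorStdOps.hpp>

// A tiny "abstract numerical algorithm": one gradient-descent-style step,
// written only against Thyra interfaces so that any concrete linear algebra
// implementation (e.g., Tpetra-backed vectors) can be supplied at run time.
template <class Scalar>
void gradientStep(const Thyra::VectorBase<Scalar>& gradient,
                  const Scalar stepLength,
                  Thyra::VectorBase<Scalar>& x)
{
  // x := x + (-stepLength) * gradient (an AXPY-style standard operation).
  Thyra::Vp_StV(Teuchos::ptrFromRef(x), -stepLength, gradient);
}
\end{verbatim}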