diff --git a/docs/src/packages.dot b/docs/src/assets/packages.dot similarity index 100% rename from docs/src/packages.dot rename to docs/src/assets/packages.dot diff --git a/docs/src/packages.png b/docs/src/assets/packages.png similarity index 100% rename from docs/src/packages.png rename to docs/src/assets/packages.png diff --git a/docs/src/packages_sketchy.png b/docs/src/assets/packages_sketchy.png similarity index 100% rename from docs/src/packages_sketchy.png rename to docs/src/assets/packages_sketchy.png diff --git a/docs/src/dev-notes/AssemblyStrategies.md b/docs/src/dev-notes/AssemblyStrategies.md new file mode 100644 index 00000000..c83a4bb1 --- /dev/null +++ b/docs/src/dev-notes/AssemblyStrategies.md @@ -0,0 +1,38 @@ +# Assembly strategies + +GridapDistributed offers several assembly strategies for distributed linear systems. These strategies modify the ghost layout for the rows and columns of the assembled matrix and vector. Depending on your use case, one strategy may be more convenient than the others. + +## SubAssembledRows + +!!! info + - **Main idea:** Both columns and rows are ghosted, with (potentially) different ghost layouts. Assembly is costly but matrix-vector products are cheap. + - **Pros:** Matrix-vector product fills both owned and ghost rows of the output vector. Communication is therefore not required to make the output vector consistent. + - **Cons:** Communication is required to assemble the matrix and vector. + - **Use cases:** Default assembly strategy. + +- Each processor integrates over the **owned cells**, i.e., there are no duplicated cell contributions. However, processors do not hold all the contributions they need to assemble their matrix and vector. + +## FullyAssembledRows + +!!! info + - **Main idea:** Columns are ghosted, but rows only contain owned indices. Assembly is cheap but matrix-vector products are costly. + - **Pros:** Assembly is local, i.e., no communication is required. Column vectors can also be used as row vectors. + - **Cons:** Matrix-vector product only fills the owned rows of the output vector. Communication is therefore required to make the output vector consistent. + - **Use cases:** This is the strategy used by PETSc. You should also use this strategy if you plan to feed back output row vectors as input column vectors during successive matrix-vector products. + +- Each processor integrates over **all its local (owned + ghost) cells**, i.e., contributions for interface cells are duplicated. This implies that each processor has access to **all** the contributions for its **owned dofs** without the need for any communication. +- Contributions whose row index is not owned by the processor are discarded, while owned rows can be fully assembled without any communication. + +## FEConsistentAssembly + +!!! info + - **Main idea:** Same as `FullyAssembledRows`, but the ghost layout for the columns is the same as the original `FESpace` ghost layout. + - **Pros:** Assembly is local, i.e., no communication is required. DoF `PVector`s from the `FESpace` can be used as column and row vectors for the matrix (like in serial). + - **Cons:** Matrix-vector product only fills the owned rows of the output vector. Communication is therefore required to make the output vector consistent. + - **Use cases:** You should use this strategy if you are constantly creating `FEFunction`s with vectors coming from the linear system (and vice versa). This is quite typical for geometric solvers. + +Problem: + +We cannot have the same partition for `FESpace`s and matrices.
The reason is that we want the matrix layouts to be owned-then-ghost, whereas `FESpace` local numberings mix owned and ghost indices. + +We can create the matrix layouts so that the order of the owned and ghost indices (taken separately) is the same as in the `FESpace`. Paired with the `OwnedAndGhostValues` data type, this allows no-copy, no-allocation reuse of values between the two dof layouts. However, this is prone to creating crash-less bugs (the worst kind) if the user is careless. How should this be handled? Is there a cheap way to check that we are not doing anything wrong (without checking the whole index ranges every time)? diff --git a/docs/src/index.md b/docs/src/index.md index 9fcca624..ca7e658b 100644 --- a/docs/src/index.md +++ b/docs/src/index.md @@ -8,21 +8,18 @@ Documentation of the `GridapDistributed.jl` library. ## Introduction -The ever-increasing demand for resolution and accuracy in mathematical models of physical processes governed by systems of Partial Differential Equations (PDEs) -can only be addressed using fully-parallel advanced numerical discretization methods and scalable solution methods, thus able to exploit the vast amount of computational resources in state-of-the-art supercomputers. To this end, `GridapDistributed.jl` is a registered software package which provides -fully-parallel distributed memory data structures and associated methods -for the Finite Element (FE) numerical solution of PDEs on parallel computers. Thus, it can be run on multi-core CPU desktop computers at small scales, as well as on HPC clusters and supercomputers at medium/large scales. The data structures in `GridapDistributed.jl` are designed to mirror as far as possible their counterparts in the `Gridap.jl` Julia software package, while implementing/leveraging most of their abstract interfaces. As a result, sequential Julia scripts written in the high-level Application Programming Interface (API) of `Gridap.jl` can be used verbatim up to minor adjustments in a parallel distributed memory context using `GridapDistributed.jl`. +The ever-increasing demand for resolution and accuracy in mathematical models of physical processes governed by systems of Partial Differential Equations (PDEs) can only be addressed using fully-parallel advanced numerical discretization methods and scalable solution methods, thus able to exploit the vast amount of computational resources in state-of-the-art supercomputers. To this end, `GridapDistributed.jl` is a registered software package which provides fully-parallel distributed memory data structures and associated methods for the Finite Element (FE) numerical solution of PDEs on parallel computers. Thus, it can be run on multi-core CPU desktop computers at small scales, as well as on HPC clusters and supercomputers at medium/large scales. The data structures in `GridapDistributed.jl` are designed to mirror as far as possible their counterparts in the `Gridap.jl` Julia software package, while implementing/leveraging most of their abstract interfaces. As a result, sequential Julia scripts written in the high-level Application Programming Interface (API) of `Gridap.jl` can be used verbatim up to minor adjustments in a parallel distributed memory context using `GridapDistributed.jl`. This equips end-users with a tool for the development of simulation codes able to solve real-world application problems on massively parallel supercomputers while using a highly expressive, compact syntax, that resembles mathematical notation.
This is indeed one of the main advantages of `GridapDistributed.jl` and a major design goal that we pursue. In order to scale FE simulations to large core counts, the mesh used to discretize the computational domain on which the PDE is posed must be partitioned (distributed) among the parallel tasks such that each of these only holds a local portion of the global mesh. The same requirement applies to the rest of data structures in the FE simulation pipeline, i.e., FE space, linear system, solvers, data output, etc. The local portion of each task is composed by a set of cells that it owns, i.e., the **local cells** of the task, and a set of off-processor cells (owned by remote processors) which are in touch with its local cells, i.e., the **ghost cells** of the task. -This overlapped mesh partition is used by `GridapDistributed.jl`, among others, to exchange data among nearest neighbors, and to glue together global Degrees of Freedom (DoFs) which are sitting on the interface among subdomains. Following this design principle, `GridapDistributed.jl` provides scalable parallel data structures and associated methods for simple grid handling (in particular, Cartesian-like meshes of arbitrary-dimensional, topologically n-cube domains), FE spaces setup, and distributed linear system assembly. It is in our future plans to provide highly scalable linear and nonlinear solvers tailored for the FE discretization of PDEs (e.g., linear and nonlinear matrix-free geometric multigrid and domain decomposition preconditioners). In the meantime, however, `GridapDistributed.jl` can be combined with other Julia packages in order to realize the full potential required in real-world applications. These packages and their relation with `GridapDistributed.jl` are overviewed in the next section. +This overlapped mesh partition is used by `GridapDistributed.jl`, among others, to exchange data among nearest neighbors, and to glue together global Degrees of Freedom (DoFs) which are sitting on the interface among subdomains. Following this design principle, `GridapDistributed.jl` provides scalable parallel data structures and associated methods for simple grid handling (in particular, Cartesian-like meshes of arbitrary-dimensional, topologically n-cube domains), FE spaces setup, and distributed linear system assembly. It is in our future plans to provide highly scalable linear and nonlinear solvers tailored for the FE discretization of PDEs (e.g., linear and nonlinear matrix-free geometric multigrid and domain decomposition preconditioners). In the meantime, however, `GridapDistributed.jl` can be combined with other Julia packages in order to realize the full potential required in real-world applications. These packages and their relation with `GridapDistributed.jl` are overviewed in the next section. ## Building blocks and composability The figure below depicts the relation among `GridapDistributed.jl` and other packages in the Julia package ecosystem. The interaction of `GridapDistributed.jl` and its dependencies is mainly designed with separation of concerns in mind towards high composability and modularity. On the one hand, `Gridap.jl` provides a rich set of abstract types/interfaces suitable for the FE solution of PDEs. It also provides realizations (implementations) of these abstractions tailored to serial/multi-threaded computing environments. `GridapDistributed.jl` **implements** these abstractions for parallel distributed-memory computing environments. 
To this end, `GridapDistributed.jl` also leverages (**uses**) the serial realizations in `Gridap.jl` and associated methods to handle the local portion on each parallel task. (See arrow labels in the figure below.) On the other hand, `GridapDistributed.jl` relies on `PartitionedArrays.jl` in order to handle the parallel execution model (e.g., message-passing via the Message Passing Interface (MPI)), global data distribution layout, and communication among tasks. `PartitionedArrays.jl` also provides a parallel implementation of partitioned global linear systems (i.e., linear algebra vectors and sparse matrices) as needed in grid-based numerical simulations. While `PartitionedArrays.jl` is an stand-alone package, segregated from `GridapDistributed.jl`, it was designed with parallel FE packages such as `GridapDistributed.jl` in mind. In any case, `GridapDistributed.jl` is designed so that a different distributed linear algebra library from `PartitionedArrays.jl` might be used as well, as far as it is able to provide the same functionality. -| ![fig:packages](./packages_sketchy.png) | +| ![fig:packages](./assets/packages_sketchy.png) | |:--:| |`GridapDistributed.jl` and its relation to other packages in the Julia package ecosystem. In this diagram, each rectangle represents a Julia package, while the (directed) arrows represent relations (dependencies) among packages. Both the direction of the arrow and the label attached to the arrows are used to denote the nature of the relation. Thus, e.g., `GridapDistributed.jl` depends on `Gridap.jl` and `PartitionedArrays.jl` , and GridapPETSc depends on `Gridap.jl` and `PartitionedArrays.jl` . Note that, in the diagram, the arrow direction is relevant, e.g., GridapP4est depends on `GridapDistributed.jl` but not conversely.| @@ -43,4 +40,4 @@ Here, one can find a list of resources to get started with this programming lang * First steps to learn Julia form the [Gridap wiki](https://github.com/gridap/Gridap.jl/wiki/Start-learning-Julia) page. * Official webpage [docs.julialang.org](https://docs.julialang.org/) -* Official list of learning resources [julialang.org/learning](https://julialang.org/learning/) \ No newline at end of file +* Official list of learning resources [julialang.org/learning](https://julialang.org/learning/) diff --git a/src/Algebra.jl b/src/Algebra.jl index e8fedd02..f7f28be8 100644 --- a/src/Algebra.jl +++ b/src/Algebra.jl @@ -9,6 +9,35 @@ function Algebra.allocate_vector(::Type{<:BlockPVector{V}},ids::BlockPRange) whe BlockPVector{V}(undef,ids) end +# Row/Col vector allocations for serial +function allocate_row_vector(A::AbstractMatrix{T}) where T + return zeros(T,size(A,1)) +end + +function allocate_col_vector(A::AbstractMatrix{T}) where T + return zeros(T,size(A,2)) +end + +# Row/Col vector allocations for parallel +function allocate_row_vector(A::PSparseMatrix) + T = eltype(A) + return pfill(zero(T),partition(axes(A,1))) +end + +function allocate_col_vector(A::PSparseMatrix) + T = eltype(A) + return pfill(zero(T),partition(axes(A,2))) +end + +# Row/Col vector allocations for blocks +function allocate_row_vector(A::AbstractBlockMatrix) + return mortar(map(Aii->allocate_row_vector(Aii),blocks(A)[:,1])) +end + +function allocate_col_vector(A::AbstractBlockMatrix) + return mortar(map(Aii->allocate_col_vector(Aii),blocks(A)[1,:])) +end + # This might go to Gridap in the future. We keep it here for the moment. 
function change_axes(a::Algebra.ArrayCounter,axes) @notimplemented @@ -303,6 +332,7 @@ end struct FullyAssembledRows end struct SubAssembledRows end +struct FEConsistentAssembly end # For the moment we use COO format even though # it is quite memory consuming. @@ -549,6 +579,29 @@ function _sa_create_from_nz_with_callback(callback,async_callback,a) return A, b end +function Algebra.create_from_nz(a::DistributedAllocationCOO{<:FEConsistentAssembly}) + f(x) = nothing + A, = _feca_create_from_nz_with_callback(f,a) + return A +end + +function Algebra.create_from_nz(a::ArrayBlock{<:DistributedAllocationCOO{<:FEConsistentAssembly}}) + f(x) = nothing + A, = _feca_create_from_nz_with_callback(f,a) + return A +end + +function _feca_create_from_nz_with_callback(callback,a) + I,J,V = get_allocations(a) + rows = _setup_prange(get_test_gids(a),I;ghost=false,ax=:rows) + cols = get_trial_gids(a) + b = callback(rows) + + asys = change_axes(a,(rows,cols)) + values = map(create_from_nz,local_views(asys)) + A = _setup_matrix(values,rows,cols) + return A, b +end # PVector assembly @@ -595,6 +648,12 @@ function Arrays.nz_allocation(a::PVectorCounter{<:FullyAssembledRows}) PVectorAllocationTrackOnlyValues(a.par_strategy,values,dofs) end +function Arrays.nz_allocation(a::PVectorCounter{<:FEConsistentAssembly}) + dofs = a.test_dofs_gids_prange + values = map(nz_allocation,a.counters) + PVectorAllocationTrackOnlyValues(a.par_strategy,values,dofs) +end + struct PVectorAllocationTrackOnlyValues{A,B,C} par_strategy::A values::B @@ -615,6 +674,11 @@ function Algebra.create_from_nz(a::PVectorAllocationTrackOnlyValues{<:FullyAssem _rhs_callback(a,rows) end +function Algebra.create_from_nz(a::PVectorAllocationTrackOnlyValues{<:FEConsistentAssembly}) + rows = _setup_prange_without_ghosts(a.test_dofs_gids_prange) + _rhs_callback(a,rows) +end + function Algebra.create_from_nz(a::PVectorAllocationTrackOnlyValues{<:SubAssembledRows}) # This point MUST NEVER be reached. 
If reached there is an inconsistency # in the parallel code in charge of vector assembly @@ -673,6 +737,18 @@ function Algebra.create_from_nz( return A,b end +function Algebra.create_from_nz( + a::DistributedAllocationCOO{<:FEConsistentAssembly}, + c_fespace::PVectorAllocationTrackOnlyValues{<:FEConsistentAssembly}) + + function callback(rows) + _rhs_callback(c_fespace,rows) + end + + A,b = _feca_create_from_nz_with_callback(callback,a) + return A,b +end + struct PVectorAllocationTrackTouchedAndValues{A,B,C} allocations::A values::B diff --git a/src/FESpaces.jl b/src/FESpaces.jl index cd922ddd..72f2c9ef 100644 --- a/src/FESpaces.jl +++ b/src/FESpaces.jl @@ -646,6 +646,14 @@ function local_assembly_strategy(::FullyAssembledRows,rows,cols) row->rows_local_to_ghost[row]==0, col->true) end +function local_assembly_strategy(::FEConsistentAssembly,rows,cols) + rows_local_to_ghost = local_to_ghost(rows) + GenericAssemblyStrategy( + identity, + identity, + row->rows_local_to_ghost[row]==0, + col->true) +end # Assembler high level constructors function FESpaces.SparseMatrixAssembler( diff --git a/src/Geometry.jl b/src/Geometry.jl index d80d9cb9..ed9bb1b6 100644 --- a/src/Geometry.jl +++ b/src/Geometry.jl @@ -517,6 +517,14 @@ function filter_cells_when_needed( remove_ghost_cells(trian,cell_gids) end +function filter_cells_when_needed( + portion::FEConsistentAssembly, + cell_gids::AbstractLocalIndices, + trian::Triangulation) + + trian +end + function remove_ghost_cells(trian::Triangulation,gids) model = get_background_model(trian) Dt = num_cell_dims(trian) diff --git a/src/GridapDistributed.jl b/src/GridapDistributed.jl index 27f4fc02..827cdac5 100644 --- a/src/GridapDistributed.jl +++ b/src/GridapDistributed.jl @@ -33,6 +33,7 @@ import Gridap.ODEs.ODETools: ∂t, ∂tt export FullyAssembledRows export SubAssembledRows +export FEConsistentAssembly export get_cell_gids export get_face_gids @@ -40,6 +41,7 @@ export get_face_gids export local_views, get_parts export with_ghost, no_ghost +export allocate_col_vector, allocate_row_vector export redistribute include("BlockPartitionedArrays.jl") diff --git a/test/FESpacesTests.jl b/test/FESpacesTests.jl index bf1e78e6..f11a63f1 100644 --- a/test/FESpacesTests.jl +++ b/test/FESpacesTests.jl @@ -73,6 +73,7 @@ end function main(distribute,parts) main(distribute,parts,SubAssembledRows()) main(distribute,parts,FullyAssembledRows()) + main(distribute,parts,FEConsistentAssembly()) end function main(distribute,parts,das) @@ -199,4 +200,8 @@ function main(distribute,parts,das) end +with_debug() do distribute + main(distribute,(2,2)) +end + end # module
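For reviewers, here is a minimal usage sketch of how the pieces added in this patch fit together: picking an assembly strategy when building the assembler, and allocating vectors compatible with the resulting matrix via the new `allocate_row_vector`/`allocate_col_vector` helpers. It is an illustrative sketch, not part of the patch: the Poisson-like weak form, mesh sizes, and the debug backend are assumptions, and the `SparseMatrixAssembler`/`AffineFEOperator` constructors are used in their existing high-level Gridap/GridapDistributed form.

```julia
# Illustrative sketch (not part of the patch): assemble a distributed system
# with an explicit assembly strategy and allocate layout-compatible vectors.
using Gridap, GridapDistributed, PartitionedArrays
using SparseArrays, LinearAlgebra

with_debug() do distribute
  ranks = distribute(LinearIndices((4,)))
  model = CartesianDiscreteModel(ranks,(2,2),(0,1,0,1),(8,8))

  reffe = ReferenceFE(lagrangian,Float64,1)
  V = TestFESpace(model,reffe;dirichlet_tags="boundary")
  U = TrialFESpace(V,0.0)

  Ω  = Triangulation(model)
  dΩ = Measure(Ω,2)
  biform(u,v) = ∫(∇(u)⋅∇(v))*dΩ  # assumed Poisson-like forms
  liform(v)   = ∫(1.0*v)*dΩ

  # Choose the strategy: SubAssembledRows() (default), FullyAssembledRows(),
  # or the FEConsistentAssembly() introduced in this patch.
  strategy = FullyAssembledRows()
  assem = SparseMatrixAssembler(SparseMatrixCSC{Float64,Int},Vector{Float64},U,V,strategy)
  op = AffineFEOperator(biform,liform,U,V,assem)
  A,b = get_matrix(op), get_vector(op)

  # New helpers: PVectors partitioned like the columns/rows of A.
  x = allocate_col_vector(A)  # input of A*x
  y = allocate_row_vector(A)  # output of A*x
  fill!(x,1.0)
  consistent!(x) |> wait      # sync ghost column entries with their owners
  mul!(y,A,x)                 # with FullyAssembledRows, only owned rows are computed
end
```

The `consistent!`/`mul!` pattern mirrors the pros and cons listed in `AssemblyStrategies.md`: with `SubAssembledRows` the product already fills ghost rows, while with the other two strategies only owned rows are produced.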
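The patch also adds serial (`AbstractMatrix`) and block (`AbstractBlockMatrix`) methods for the same helpers. A minimal serial sketch, assuming only the exports added in `GridapDistributed.jl`:

```julia
# Serial sketch: the AbstractMatrix methods simply return zero vectors
# conforming to the matrix dimensions.
using GridapDistributed  # exports allocate_row_vector / allocate_col_vector
using SparseArrays, LinearAlgebra

A = sprand(4,6,0.5)
x = allocate_col_vector(A)  # zeros(Float64, size(A,2))
y = allocate_row_vector(A)  # zeros(Float64, size(A,1))
fill!(x,1.0)
mul!(y,A,x)                 # y now holds the row sums of A
```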