Use direct evaluation of kernel functions on GPU #39

Merged: 52 commits merged into master from eval-direct, Oct 25, 2024

Commits (52)
5c45c7d
Start work on shared-memory implementation
jipolanco Oct 17, 2024
6be3d8e
WIP
jipolanco Oct 17, 2024
6784317
Interpolation now works (in 3D, ntransforms = 1)
jipolanco Oct 18, 2024
cec2a65
Minor changes
jipolanco Oct 18, 2024
0ac8886
Optimisations
jipolanco Oct 18, 2024
ecef038
Further optimisations
jipolanco Oct 18, 2024
e4aa4ca
Generalise interpolation to all dimensions
jipolanco Oct 18, 2024
7d1e4f6
Use get_inds_vals_gpu in spreading
jipolanco Oct 18, 2024
2a34d0b
Spreading with shmem works but is slow
jipolanco Oct 18, 2024
87f050c
SM kernels now work with KA CPU backends
jipolanco Oct 18, 2024
46acbda
Test SM kernels on CPU
jipolanco Oct 18, 2024
3fd04b9
Fix kernel compilation on CUDA
jipolanco Oct 19, 2024
6b0fad6
Reorganise some code
jipolanco Oct 20, 2024
867b73e
[WIP] avoid atomics in shared-memory arrays
jipolanco Oct 20, 2024
801eb5e
Avoid atomic operations on shared memory
jipolanco Oct 21, 2024
24f5956
Minor improvement
jipolanco Oct 21, 2024
67e4baa
Make window_vals a matrix
jipolanco Oct 21, 2024
765b73e
Implement hybrid parallelisation in SM spreading
jipolanco Oct 21, 2024
5bf7ae4
More optimisations
jipolanco Oct 21, 2024
f9e7931
Try to fix tests (on CPU)
jipolanco Oct 21, 2024
e1c2daf
Shared memory array can be complex
jipolanco Oct 22, 2024
4898e6e
Update interpolation based on spreading changes
jipolanco Oct 23, 2024
df0690f
Simplify setting workgroupsize
jipolanco Oct 23, 2024
2b28729
Minor changes
jipolanco Oct 23, 2024
d0f439f
Fix CPU tests
jipolanco Oct 23, 2024
2b0b4b6
Remove unused functions
jipolanco Oct 23, 2024
6bbeb8e
Add tests for multiple transforms
jipolanco Oct 23, 2024
7bc11bc
Simplify atomic adds with complex data
jipolanco Oct 23, 2024
05edddf
point_to_cell now also returns x/Δx
jipolanco Oct 23, 2024
1f6c02a
Remove direct evaluation functions (for now)
jipolanco Oct 23, 2024
e8ea63d
Update CHANGELOG [skip ci]
jipolanco Oct 23, 2024
92356d3
Add documentation
jipolanco Oct 23, 2024
c9311e0
Update docs and comments
jipolanco Oct 23, 2024
7010a61
Define direct evaluation of KB kernels
jipolanco Oct 23, 2024
31f7975
Merge branch 'master' into eval-direct
jipolanco Oct 24, 2024
1f3efa1
Fix direct evaluation
jipolanco Oct 24, 2024
3316471
Minor optimisations
jipolanco Oct 24, 2024
7661594
Use direct evaluation in GPU kernels
jipolanco Oct 24, 2024
c3b4ed2
Avoid division by zero in BKB kernel
jipolanco Oct 25, 2024
d5295e9
Simplify window evaluation
jipolanco Oct 25, 2024
5c7d2a3
Add comment on besseli0
jipolanco Oct 25, 2024
543f3b3
Add empty CUDA extension
jipolanco Sep 22, 2024
3c705eb
KB kernel: call CUDA version of besseli0
jipolanco Oct 25, 2024
d39375d
Update comment
jipolanco Oct 25, 2024
aa98132
Define direct evaluation for GaussianKernel
jipolanco Oct 25, 2024
a29d615
Define direct evaluation for BSplineKernel
jipolanco Oct 25, 2024
6f9603d
Allow different default kernel per backend
jipolanco Oct 25, 2024
ff78364
Update comments
jipolanco Oct 25, 2024
ade1a5f
Add tests
jipolanco Oct 25, 2024
72972b7
Update CHANGELOG
jipolanco Oct 25, 2024
7ddeaab
Remove old comment
jipolanco Oct 25, 2024
b10d7cd
Update tests
jipolanco Oct 25, 2024
CHANGELOG.md (10 changes: 9 additions & 1 deletion)

@@ -13,12 +13,20 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Changed

- - [BREAKING] Change default precision of transforms.
+ - **BREAKING**: Change default precision of transforms.
By default, transforms on `Float64` or `ComplexF64` now have a relative precision of the order of $10^{-7}$.
This corresponds to setting `m = HalfSupport(4)` and oversampling factor `σ = 2.0`.
Previously, the default was `m = HalfSupport(8)` and `σ = 2.0`, corresponding
to a relative precision of the order of $10^{-14}$.

- On GPUs, we now use direct evaluation of kernel functions (e.g.
Kaiser-Bessel) instead of polynomial approximations, as this seems to be
faster and uses far fewer GPU registers.

- On CUDA, the default kernel is now `KaiserBesselKernel` instead of `BackwardsKaiserBesselKernel`.
The former seems to be a bit faster when using Bessel functions defined in CUDA.
Accuracy doesn't change much since both kernels have similar precisions.

## [v0.5.6](https://github.com/jipolanco/NonuniformFFTs.jl/releases/tag/v0.5.6) - 2024-10-17

### Changed
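For reference, a minimal self-contained sketch of what "direct evaluation" means for the Kaiser-Bessel window (names and the cell convention here are illustrative, not the package's actual API): the window φ(y) = I₀(β√(1 − y²)) is evaluated from its definition at the 2M grid nodes surrounding each point, instead of through a precomputed piecewise-polynomial fit.

using Bessels: besseli0

# Evaluate the KB window at the 2M nodes around x, directly from its definition.
function kb_window_direct(x::T, Δx::T, β::T, ::Val{M}) where {T, M}
    r = x / Δx             # normalised coordinate
    i = floor(Int, r) + 1  # cell index: node i is the closest node below x
    X = r - T(i - 1)       # fractional offset, 0 ≤ X < 1
    vals = ntuple(Val(2M)) do j
        y = T(M - j + X) / M         # normalised distance to node, in (-1, 1]
        besseli0(β * sqrt(1 - y^2))  # φ(y); 1 - y² ≥ 0 holds by construction
    end
    (; i, vals)
end

kb_window_direct(0.3, 2π / 256, 10.0, Val(4))  # window values around x = 0.3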
Project.toml (7 changes: 7 additions & 0 deletions)

@@ -20,13 +20,20 @@ StaticArrays = "90137ffa-7385-5640-81b9-e52037218182"
StructArrays = "09ab397b-f2b6-538f-b94a-2f83cf4a842a"
TimerOutputs = "a759f4b9-e2f1-59dc-863e-4aeb61b1ea8f"

[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"

[extensions]
NonuniformFFTsCUDAExt = ["CUDA"]

[compat]
AbstractFFTs = "1.5.0"
AbstractNFFTs = "0.8.2"
Adapt = "4.0.4"
Atomix = "0.1.0"
Bessels = "0.2"
Bumper = "0.6, 0.7"
CUDA = "5.5"
FFTW = "1.7"
GPUArraysCore = "0.1.6, 0.2"
KernelAbstractions = "0.9.25"
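The [weakdeps]/[extensions] pair above relies on Julia's package extension mechanism (Julia ≥ 1.9): NonuniformFFTsCUDAExt is only compiled and loaded once CUDA.jl is present in the environment, so CUDA stays out of the package's hard dependencies. In user code this looks like:

using NonuniformFFTs  # CUDA extension not loaded yet
using CUDA            # triggers loading of NonuniformFFTsCUDAExt, which installs
                      # the device besseli0 override and the CUDA default kernel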
ext/NonuniformFFTsCUDAExt.jl (21 changes: 21 additions & 0 deletions)

@@ -0,0 +1,21 @@
module NonuniformFFTsCUDAExt

using NonuniformFFTs
using NonuniformFFTs.Kernels: Kernels
using CUDA
using CUDA: @device_override

# This is currently not wrapped in CUDA.jl, probably because besseli0 is not defined by
# SpecialFunctions.jl either (the more general besseli is defined though).
# See:
# - https://docs.nvidia.com/cuda/cuda-math-api/index.html for available functions
# - https://github.com/JuliaGPU/CUDA.jl/blob/master/ext/SpecialFunctionsExt.jl for functions wrapped in CUDA.jl
@device_override Kernels._besseli0(x::Float64) = ccall("extern __nv_cyl_bessel_i0", llvmcall, Cdouble, (Cdouble,), x)
@device_override Kernels._besseli0(x::Float32) = ccall("extern __nv_cyl_bessel_i0f", llvmcall, Cfloat, (Cfloat,), x)

# Set KaiserBesselKernel as default backend on CUDA.
# It's slightly faster than BackwardsKaiserBesselKernel when using Bessel functions from CUDA (wrapped above).
# The difference is not huge though.
NonuniformFFTs.default_kernel(::CUDABackend) = KaiserBesselKernel()

end
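A hypothetical smoke test of the override (Kernels._besseli0 is internal, so this is purely illustrative): mapping the function over a CuArray compiles it for the device, where @device_override substitutes the libdevice call, while the host path still uses Bessels.jl.

using CUDA, NonuniformFFTs
using NonuniformFFTs.Kernels: Kernels

x = CuArray(Float32[0.0, 1.0, 2.0])
y_gpu = map(Kernels._besseli0, x)         # device code → __nv_cyl_bessel_i0f
y_cpu = map(Kernels._besseli0, Array(x))  # host code → Bessels.besseli0
@assert Array(y_gpu) ≈ y_cpu              # both paths should agree closely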
src/Kernels/Kernels.jl (7 changes: 7 additions & 0 deletions)

@@ -85,6 +85,13 @@ end
# Note: evaluate_kernel_func generates a function which is callable from GPU kernels.
@inline evaluate_kernel(g::AbstractKernelData, x₀) = evaluate_kernel_func(g)(x₀)

@inline function evaluate_kernel_direct(g::AbstractKernelData, x)
Δx = gridstep(g)
i, r = point_to_cell(x, Δx)
values = _evaluate_kernel_direct(g, i, r)
(; i, values,)
end

@inline function kernel_indices(i, ::AbstractKernelData{K, M}, args...) where {K, M}
kernel_indices(i, HalfSupport(M), args...)
end
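evaluate_kernel_direct relies on the point_to_cell helper, which (per commit 05edddf) now also returns the normalised coordinate r = x/Δx alongside the cell index. The helper itself is not shown in this diff; a plausible sketch, assumed from how i and r are used above:

# Assumed sketch of point_to_cell: returns (i, r) with r = x/Δx and i the cell
# index such that i - 1 ≤ r < i (x assumed non-negative after periodic folding).
@inline function point_to_cell(x, Δx)
    r = x / Δx
    i = unsafe_trunc(Int, r) + 1
    (i, r)
end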
src/Kernels/bspline.jl (8 changes: 8 additions & 0 deletions)

@@ -107,6 +107,14 @@ function evaluate_kernel_func(g::BSplineKernelData{M}) where {M}
end
end

function _evaluate_kernel_direct(
::BSplineKernelData{M}, i::Integer, r::T,
) where {M, T}
x′ = i - r # normalised coordinate, 0 < x′ ≤ 1
k = 2M # B-spline order
bsplines_evaluate_all(x′, Val(k), T)
end

function evaluate_fourier_func(g::BSplineKernelData{M}) where {M}
(; Δt,) = g
function (k)
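bsplines_evaluate_all is not shown in this diff either; it presumably computes the k non-zero uniform B-splines of order k = 2M at the normalised position via the standard Cox-de Boor recurrence. A standalone sketch of that computation, for reference:

# Returns v with v[i + 1] = N_{k-1}(x + i) for i = 0:(k - 1), where N_{k-1} is the
# uniform B-spline of degree k - 1 (order k) and x ∈ [0, 1).
function uniform_bsplines_all(x::T, k::Integer) where {T <: AbstractFloat}
    v = zeros(T, k)
    v[1] = one(T)                  # degree 0: indicator function of [0, 1)
    for j in 1:(k - 1)             # raise the degree one step at a time
        for i in j:-1:0            # downwards, so lower entries are still old values
            left  = i < j ? v[i + 1] : zero(T)  # N_{j-1}(x + i)
            right = i > 0 ? v[i]     : zero(T)  # N_{j-1}(x + i - 1)
            v[i + 1] = ((x + i) * left + (j + 1 - x - i) * right) / j
        end
    end
    v
end

sum(uniform_bsplines_all(0.3, 8)) ≈ 1  # true: B-splines form a partition of unity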
src/Kernels/gaussian.jl (13 changes: 13 additions & 0 deletions)

@@ -131,6 +131,19 @@ function evaluate_kernel_func(g::GaussianKernelData{M}) where {M}
end
end

# Note: direct evaluation seems to be faster than Gaussian gridding on GPU (tested on A100).
@inline function _evaluate_kernel_direct(
g::GaussianKernelData{M}, i::Integer, r::T,
) where {M, T}
(; τ, Δx,) = g
X = r - T(i - 1) # = (x - x[i]) / Δx = x / Δx - (i - 1)
# @assert 0 ≤ X < 1
js = SVector(ntuple(identity, Val(2M)))
ys = @. T(M - js + X) * Δx
vals = @. exp(-ys^2 / τ)
Tuple(vals)
end

@inline function gaussian_gridding(a, b, cs::NTuple{M}) where {M}
if @generated
v₀ = Symbol(:v_, M)
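The comment above contrasts direct evaluation with (fast) Gaussian gridding. For context, the classic fast Gaussian gridding trick of Greengard & Lee factorises the 2M exponentials so that only two exp calls are needed per point; a self-contained comparison with illustrative parameter values:

τ, Δx, M, X = 0.1, 0.01, 4, 0.3        # illustrative values only
ys = [(M - j + X) * Δx for j in 1:2M]  # distances to the 2M surrounding nodes

direct = [exp(-y^2 / τ) for y in ys]   # direct: 2M exponentials per point

# Fast gridding: exp(-(y₁ - (j - 1)Δx)²/τ) = E₁ ⋅ E₂^(j-1) ⋅ E₃[j]
y₁ = (M - 1 + X) * Δx
E₁ = exp(-y₁^2 / τ)                              # per point
E₂ = exp(2 * y₁ * Δx / τ)                        # per point
E₃ = [exp(-((j - 1) * Δx)^2 / τ) for j in 1:2M]  # point-independent, precomputed once
fast = [E₁ * E₂^(j - 1) * E₃[j] for j in 1:2M]

@assert direct ≈ fast  # identical up to floating-point roundoff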
src/Kernels/kaiser_bessel.jl (34 changes: 29 additions & 5 deletions)

@@ -1,10 +1,10 @@
export KaiserBesselKernel

- using Bessels: besseli0, besseli1
+ using Bessels: Bessels

# This is the equivalent variance of a KB window with width w = 1.
# Should be compared to the variance of a Gaussian window.
- kb_equivalent_variance(β) = besseli0(β) / (β * besseli1(β))
+ kb_equivalent_variance(β) = Bessels.besseli0(β) / (β * Bessels.besseli1(β))

@doc raw"""
KaiserBesselKernel <: AbstractKernel
@@ -118,7 +118,7 @@ struct KaiserBesselKernelData{
gk = KA.allocate(backend, T, 0)
Npoly = M + 4 # degree of polynomial is d = Npoly - 1
cs = solve_piecewise_polynomial_coefficients(T, Val(M), Val(Npoly)) do x
- besseli0(β * sqrt(1 - x^2))
+ Bessels.besseli0(β * sqrt(1 - x^2))
end
KaiserBesselKernelData{M}(Δx, σ, w, β, β², cs, gk)
end
@@ -166,9 +166,33 @@ function evaluate_kernel_func(g::KaiserBesselKernelData{M, T}) where {M, T}
(; Δx, cs,) = g
function (x)
i, r = point_to_cell(x, Δx)
- X = (r - T(i - 1)) / M # source position relative to xs[i]
- # @assert 0 ≤ X < 1 / M
+ X = r - T(i - 1) # = (x - x[i]) / Δx = x / Δx - (i - 1)
+ # @assert 0 ≤ X < 1
values = evaluate_piecewise(X, cs)
(; i, values,)
end
end

# Not sure Bessels.besseli0 is optimised/overridden for GPUs. It works but I'm not sure it's
# optimal, since it uses the implementation in Bessels.jl which is not made for
# GPUs. An alternative would be to use functions from SpecialFunctions.jl, for which CUDA.jl
# redirects to CUDA functions. However, CUDA.jl currently doesn't wrap the cyl_bessel_i0
# function needed here (and also, SpecialFunctions doesn't provide a besseli0 function, but
# only a besseli function for arbitrary order).
# So, on CUDA, we use a package extension (ext/NonuniformFFTsCUDAExt.jl) to override this
# function, and call the CUDA-defined version instead.
_besseli0(x) = Bessels.besseli0(x)

@inline function _evaluate_kernel_direct(
g::KaiserBesselKernelData{M, T}, i::Integer, r::T,
) where {M, T}
(; β,) = g
X = r - T(i - 1) # = (x - x[i]) / Δx = x / Δx - (i - 1)
# @assert 0 ≤ X < 1
js = SVector(ntuple(identity, Val(2M)))
ys = @. T(M - js + X) / M
zs = @. 1 - ys^2
s = @fastmath sqrt.(zs) # @fastmath skips the z ≥ 0 domain check that plain sqrt performs (a negative z would give NaN instead of throwing)
vals = _besseli0.(β * s)
Tuple(vals)
end
src/Kernels/kaiser_bessel_backwards.jl (23 changes: 21 additions & 2 deletions)

@@ -137,9 +137,28 @@ function evaluate_kernel_func(g::BackwardsKaiserBesselKernelData{M, T}) where {M
(; Δx, cs,) = g
function (x)
i, r = point_to_cell(x, Δx) # r = x / Δx
- X = (r - T(i - 1)) / M # source position relative to xs[i]
- # @assert 0 ≤ X < 1 / M
+ X = r - T(i - 1) # = (x - x[i]) / Δx = x / Δx - (i - 1)
+ # @assert 0 ≤ X < 1
values = evaluate_piecewise(X, cs)
(; i, values,)
end
end

@inline function _evaluate_kernel_direct(
g::BackwardsKaiserBesselKernelData{M, T}, i::Integer, r::T,
) where {M, T}
(; β,) = g
X = r - T(i - 1) # = (x - x[i]) / Δx = x / Δx - (i - 1)
# @assert 0 ≤ X < 1
js = SVector(ntuple(identity, Val(2M)))
ys = @. T(M - js + X) / M
zs = @. 1 - ys^2
s = @fastmath sqrt.(zs) # @fastmath skips the z ≥ 0 domain check that plain sqrt performs (a negative z would give NaN instead of throwing)
vals = map(s) do s
@inline
# In rare cases `s` can be zero (if the point x lies exactly on a grid node, so that the fractional offset X = 0).
# The kernel is well defined at s = 0 since sinh(x)/x → 1 when x → 0 (same as sinc).
ifelse(iszero(s), one(s), sinh(β * s) / (β * s)) * (β / T(π))
end
Tuple(vals)
end
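A quick standalone check of the removable singularity handled above, sinh(βs)/(βs) → 1 as s → 0:

β = 10.0
w(s) = ifelse(iszero(s), one(s), sinh(β * s) / (β * s)) * (β / π)
w(0.0) ≈ β / π    # true: the limiting value at s = 0
w(1e-8) ≈ w(0.0)  # true: w is continuous through s = 0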
src/Kernels/piecewise_polynomial.jl (5 changes: 2 additions & 3 deletions)

@@ -74,9 +74,8 @@ function solve_piecewise_polynomial_coefficients(
end

function evaluate_piecewise(δ, cs::NTuple)
- L = length(first(cs))
- # @assert 0 ≤ δ < 2/L == 1/M
- x = L * δ - 1 # in [-1, 1]
+ # @assert 0 ≤ δ < 1
+ x = 2δ - 1 # in [-1, 1]
evaluate_horner(x, cs)
end

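evaluate_horner is not shown in this diff; a generic sketch of Horner evaluation at the rescaled coordinate (the package presumably evaluates all 2M node polynomials at once, but the idea per polynomial is the same):

# Horner evaluation of c₁ + c₂x + c₃x² + … using fused multiply-adds.
horner(x, coefs::Tuple) = foldr((c, acc) -> muladd(x, acc, c), coefs)

δ = 0.3
x = 2δ - 1                   # rescale δ ∈ [0, 1) to x ∈ [-1, 1)
horner(x, (1.0, 0.5, 0.25))  # = 1.0 + 0.5x + 0.25x²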
src/NonuniformFFTs.jl (3 changes: 2 additions & 1 deletion)

@@ -45,7 +45,8 @@ export
exec_type1!,
exec_type2!

- default_kernel() = BackwardsKaiserBesselKernel()
+ default_kernel(::KA.Backend) = BackwardsKaiserBesselKernel()
+ default_kernel() = default_kernel(CPU())

default_block_size(::Dims, ::CPU) = 4096 # in number of linear elements
default_block_size(::Dims, ::GPU) = 1024 # except in 2D and 3D (see below)
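The default window is now resolved by dispatch on the KernelAbstractions backend, which is what lets the CUDA extension override it. Usage (the CUDABackend line assumes CUDA.jl, and thus the extension, is loaded):

using NonuniformFFTs
using KernelAbstractions: CPU

NonuniformFFTs.default_kernel()       # BackwardsKaiserBesselKernel(), via the CPU() fallback
NonuniformFFTs.default_kernel(CPU())  # same
# With CUDA.jl loaded:
# NonuniformFFTs.default_kernel(CUDABackend()) == KaiserBesselKernel()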
src/abstractNFFTs.jl (8 changes: 4 additions & 4 deletions)

@@ -54,7 +54,7 @@
invoke(AbstractNFFTs.nodes!, Tuple{PlanNUFFT, AbstractMatrix{T}}, p, xp)
end

- Base.@constprop :aggressive function convert_window_function(w::Symbol)
+ Base.@constprop :aggressive function convert_window_function(w::Symbol, backend)
if w === :gauss
GaussianKernel()
elseif w === :spline
@@ -67,7 +67,7 @@
elseif w === :kaiser_bessel
BackwardsKaiserBesselKernel()
else
- default_kernel()
+ default_kernel(backend)
end
end

@@ -81,7 +81,7 @@
Base.@constprop :aggressive function PlanNUFFT(
xp::AbstractMatrix{Tr}, Ns::Dims;
fftflags = FFTW.ESTIMATE, blocking = true, sortNodes = false,
- window = default_kernel(),
+ window = default_kernel(KA.get_backend(xp)),
fftshift = true, # for compatibility with NFFT.jl
synchronise = false,
kwargs...,
@@ -93,7 +93,7 @@
backend = KA.get_backend(xp) # e.g. use GPU backend if xp is a GPU array
sort_points = sortNodes ? True() : False() # this is type-unstable (unless constant propagation happens)
block_size = blocking ? default_block_size(Ns, backend) : nothing # also type-unstable
- kernel = window isa AbstractKernel ? window : convert_window_function(window)
+ kernel = window isa AbstractKernel ? window : convert_window_function(window, backend)
p = PlanNUFFT(
Complex{Tr}, Ns, HalfSupport(m);
backend, σ = Tr(σ), sort_points, fftshift, block_size,
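In the AbstractNFFTs-style constructor, the backend is inferred from the node array, so the symbol-based window selection (and its default) now picks the right kernel per device. A sketch of GPU usage (node values illustrative; normalisation conventions omitted):

using NonuniformFFTs, CUDA

xp = CUDA.rand(Float64, 2, 1000)  # 2 × Np matrix of nonuniform nodes on the GPU
p  = PlanNUFFT(xp, (256, 256))    # no window given ⇒ default_kernel(CUDABackend())
p2 = PlanNUFFT(xp, (256, 256); window = :kaiser_bessel)  # explicit symbol, as in NFFT.jl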
src/gpu_common.jl (2 changes: 1 addition & 1 deletion)

@@ -81,7 +81,7 @@ end

@inline function get_inds_vals_gpu(g::AbstractKernelData, points::AbstractVector, N::Integer, j::Integer)
x = @inbounds points[j]
- gdata = Kernels.evaluate_kernel(g, x)
+ gdata = Kernels.evaluate_kernel_direct(g, x)
vals = gdata.values # kernel values
M = Kernels.half_support(g)
i₀ = gdata.i - M # active region is (i₀ + 1):(i₀ + 2M) (up to periodic wrapping)
src/interpolation/gpu.jl (2 changes: 1 addition & 1 deletion)

@@ -256,7 +256,7 @@ end
indvals = ntuple(Val(D)) do d
@inline
x = @inbounds points[d][j]
- gdata = Kernels.evaluate_kernel(gs[d], x)
+ gdata = Kernels.evaluate_kernel_direct(gs[d], x)
local i₀ = gdata.i - ishifts_sm[d]
local vals = gdata.values # kernel values
# @assert i₀ ≥ 0
src/plan.jl (5 changes: 3 additions & 2 deletions)

@@ -413,11 +413,12 @@ end
function PlanNUFFT(
::Type{T}, Ns::Dims, h::HalfSupport;
ntransforms = Val(1),
- kernel::AbstractKernel = default_kernel(),
+ backend = CPU(),
+ kernel::AbstractKernel = default_kernel(backend),
σ::Real = real(T)(2), kws...,
) where {T <: Number}
R = real(T)
- _PlanNUFFT(T, kernel, h, R(σ), Ns, to_static(ntransforms); kws...)
+ _PlanNUFFT(T, kernel, h, R(σ), Ns, to_static(ntransforms); backend, kws...)
end

@inline to_static(ntrans::Val) = ntrans
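With backend now an explicit keyword (placed before kernel so the kernel default can depend on it), plan construction looks like this sketch (the GPU line assumes CUDA.jl is loaded):

using NonuniformFFTs

# CPU plan: kernel defaults to default_kernel(CPU()) = BackwardsKaiserBesselKernel()
p_cpu = PlanNUFFT(ComplexF64, (256, 256), HalfSupport(4))

# GPU plan: kernel defaults to default_kernel(CUDABackend()) = KaiserBesselKernel()
# p_gpu = PlanNUFFT(ComplexF64, (256, 256), HalfSupport(4); backend = CUDABackend())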
src/spreading/gpu.jl (2 changes: 1 addition & 1 deletion)

@@ -318,7 +318,7 @@ end
p, d = Tuple(inds[n])
g = gs[d]
x = points_sm[d, p]
- gdata = Kernels.evaluate_kernel(g, x)
+ gdata = Kernels.evaluate_kernel_direct(g, x)
ishift = ishifts_sm[d]
inds_start[d, p] = gdata.i - ishift
local vals = gdata.values
test/approx_window_functions.jl (35 changes: 35 additions & 0 deletions)

@@ -0,0 +1,35 @@
# Test direct ("exact") and approximate evaluation of window functions.
# This is particularly useful for testing approximations based on piecewise polynomials.

using NonuniformFFTs
using NonuniformFFTs.Kernels
using StaticArrays
using Test

function test_kernel(kernel; σ = 1.5, m = HalfSupport(4), N = 256)
backend = CPU()
Δx = 2π / N
g = Kernels.optimal_kernel(kernel, m, Δx, σ; backend)
xs = range(0.8, 2.2; length = 1000) .* Δx

for x ∈ xs
a = Kernels.evaluate_kernel(g, x)
b = Kernels.evaluate_kernel_direct(g, x)
@test a.i == b.i # same bin
@test SVector(a.values) ≈ SVector(b.values) rtol=1e-7 # TODO: rtol should depend on (M, σ, kernel)
end

nothing
end

@testset "Kernel approximations" begin
kernels = (
BSplineKernel(),
GaussianKernel(),
KaiserBesselKernel(),
BackwardsKaiserBesselKernel(),
)
@testset "$(nameof(typeof(kernel)))" for kernel ∈ kernels
test_kernel(kernel)
end
end
test/pseudo_gpu.jl (9 changes: 7 additions & 2 deletions)

@@ -103,8 +103,13 @@ function compare_with_cpu(::Type{T}, dims; Np = prod(dims), ntransforms::Val{Nc}
r_gpu = run_plan(p_gpu, xp_init, vp_init)

for c ∈ 1:Nc
- @test r_cpu.us[c] ≈ r_gpu.us[c] # output of type-1 transform
- @test r_cpu.wp[c] ≈ r_gpu.wp[c] # output of type-2 transform
+ # Differences of the order of 1e-7 (roughly the expected accuracy given the
+ # chosen parameters) are explained by the fact that the CPU uses a polynomial
+ # approximation of the KB kernel, while the GPU evaluates it "exactly" from its
+ # definition (based on Bessel functions for KB).
+ rtol = Tr === Float64 ? 1e-7 : Tr === Float32 ? 1f-5 : nothing
+ @test r_cpu.us[c] ≈ r_gpu.us[c] rtol=rtol # output of type-1 transform
+ @test r_cpu.wp[c] ≈ r_gpu.wp[c] rtol=rtol # output of type-2 transform
end

nothing
test/runtests.jl (1 change: 1 addition & 0 deletions)

@@ -20,6 +20,7 @@ end

@testset "NonuniformFFTs.jl" begin
@includetest "errors.jl"
@includetest "approx_window_functions.jl"
@includetest "accuracy.jl"
@includetest "multidimensional.jl"
@includetest "pseudo_gpu.jl"