Avoid GPU allocation in CUDA type-2 NUFFTs #45

Merged 2 commits on Nov 17, 2024
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## Unreleased
 
+### Changed
+
+- Avoid large GPU allocation in type-2 transforms when using the CUDA backend.
+  The allocation was due to CUDA.jl creating a copy of the input in complex-to-real FFTs
+  (see [CUDA.jl#2249](https://github.com/JuliaGPU/CUDA.jl/issues/2249)).
+
 ## [v0.6.2](https://github.com/jipolanco/NonuniformFFTs.jl/releases/tag/v0.6.1) - 2024-11-04
 
 ### Changed
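For context, the allocation described in the changelog entry above can be reproduced with CUDA.jl alone. The following is a minimal sketch, not part of the PR; the array sizes, the hand-built `plan_brfft` plan and the `CUDA.@time` check are illustrative assumptions:

```julia
using CUDA, CUDA.CUFFT, LinearAlgebra

N = 256
u = CUDA.zeros(Float64, N, N, N)             # real-space output
û = CUDA.zeros(ComplexF64, N ÷ 2 + 1, N, N)  # Fourier coefficients (r2c layout)
plan_bw = plan_brfft(û, N)                   # backward complex-to-real (c2r) plan

# On CUDA.jl versions affected by issue #2249, the c2r `mul!` path first copies û,
# so this reports an extra GPU allocation roughly the size of û:
CUDA.@time mul!(u, plan_bw, û)
```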
14 changes: 14 additions & 0 deletions ext/NonuniformFFTsCUDAExt.jl
@@ -3,6 +3,7 @@ module NonuniformFFTsCUDAExt
 using NonuniformFFTs
 using NonuniformFFTs.Kernels: Kernels
 using CUDA
+using CUDA.CUFFT: CUFFT
 using CUDA: @device_override
 
 # This is currently not wrapped in CUDA.jl, probably because besseli0 is not defined by
@@ -46,4 +47,17 @@ end
 
 NonuniformFFTs.groupsize_interp_gpu_shmem(::CUDABackend) = 64
 
+# Override usual `mul!` to avoid GPU allocations.
+# See https://github.com/JuliaGPU/CUDA.jl/issues/2249
+# This is adapted from https://github.com/JuliaGPU/CUDA.jl/blob/a1db081cbc3d20fa3cb28a9f419b485db03a250f/lib/cufft/fft.jl#L308-L317
+# but without the copy.
+function NonuniformFFTs._fft_c2r!(
+        y::DenseCuArray{T}, p, x::DenseCuArray{Complex{T}},
+    ) where {T}
+    # Perform plan (this may modify not only y, but also the input x)
+    CUFFT.assert_applicable(p, x, y)
+    CUFFT.unsafe_execute_trailing!(p, x, y)
+    y
+end
+
 end
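Assuming the arrays and plan from the earlier sketch, and a NonuniformFFTs build that includes this PR with the CUDA extension loaded, the override can be exercised directly. This is a hedged check rather than intended usage; `_fft_c2r!` is internal API and is normally called from `_type2_fft!`:

```julia
using NonuniformFFTs

# u and û are DenseCuArrays, so this dispatches to the extension method defined above,
# which executes the backward plan without copying û. Note that, as with any c2r
# transform, the input û may be overwritten.
CUDA.@time NonuniformFFTs._fft_c2r!(u, plan_bw, û)
```

Compared to the plain `mul!` call, this should report no large GPU allocation.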
10 changes: 8 additions & 2 deletions src/NonuniformFFTs.jl
@@ -271,12 +271,18 @@ end
 function _type2_fft!(data::RealNUFFTData)
     (; us, ûs, plan_bw,) = data
     for (u, û) ∈ zip(us, ûs)
-        # TODO: can we avoid big GPU allocation on CUDA.jl? (https://github.com/JuliaGPU/CUDA.jl/issues/2249)
-        mul!(u, plan_bw, û) # perform inverse r2c FFT
+        _fft_c2r!(u, plan_bw, û) # perform inverse r2c FFT
     end
     us
 end
 
+# Perform inverse r2c FFT.
+# This function is overridden by the CUDA extension to avoid GPU allocations.
+# See https://github.com/JuliaGPU/CUDA.jl/issues/2249
+function _fft_c2r!(u::AbstractArray{T}, plan_bw, û::AbstractArray{Complex{T}}) where {T}
+    mul!(u, plan_bw, û)
+end
+
 function _type2_fft!(data::ComplexNUFFTData)
     (; us, plan_bw,) = data
     for u ∈ us
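On non-CUDA backends the new `_fft_c2r!` fallback is just the original `mul!` call. A minimal CPU sketch, assuming an FFTW backward plan built by hand (the plan construction and sizes are illustrative, not how NonuniformFFTs sets up its plans internally):

```julia
using FFTW, LinearAlgebra, NonuniformFFTs

N = 64
u = zeros(Float64, N)
û = zeros(ComplexF64, N ÷ 2 + 1)
û[1] = 1                     # keep only the constant (DC) mode
plan_bw = plan_brfft(û, N)   # backward c2r plan (FFTW)

# The generic fallback simply forwards to mul!(u, plan_bw, û):
NonuniformFFTs._fft_c2r!(u, plan_bw, û)
@assert all(≈(1.0), u)       # unnormalised backward transform of the constant mode
```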