Use direct evaluation of kernel functions on GPU #39

jipolanco · 2024-10-25T11:09:09Z

Direct evaluation of the kernel functions seems to be slightly faster on GPU (and uses far fewer GPU registers) than approximate evaluation based on piecewise polynomials.
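
To make the two evaluation strategies concrete, here is a minimal sketch in plain Python. It is not the package's Julia/GPU code. It assumes a Kaiser-Bessel window of the form I0(beta * sqrt(1 - x^2)) / I0(beta) on the normalised support |x| <= 1, and every name in it (besseli0, kb_direct, build_piecewise, eval_piecewise), as well as the values of beta, n_pieces and degree, is hypothetical. It only compares the values produced by the two approaches; it says nothing about GPU speed or register usage.

```python
import math


def besseli0(x, terms=60):
    """Modified Bessel function I0(x) from its power series sum_k ((x/2)^k / k!)^2."""
    total = 0.0
    term = 1.0  # holds (x/2)^k / k!, starting at k = 0
    for k in range(terms):
        total += term * term
        term *= (x / 2.0) / (k + 1)
    return total


def kb_direct(x, beta):
    """Direct evaluation of the assumed Kaiser-Bessel window on |x| <= 1."""
    if abs(x) > 1.0:
        return 0.0
    s = beta * math.sqrt(max(0.0, 1.0 - x * x))
    return besseli0(s) / besseli0(beta)


def build_piecewise(f, n_pieces, degree):
    """Precompute one interpolating polynomial (Newton form) per subinterval of [-1, 1]."""
    pieces = []
    h = 2.0 / n_pieces
    for p in range(n_pieces):
        a = -1.0 + p * h
        mid = a + 0.5 * h
        # Chebyshev nodes mapped to the subinterval [a, a + h]
        nodes = [
            mid + 0.5 * h * math.cos(math.pi * (2 * j + 1) / (2 * (degree + 1)))
            for j in range(degree + 1)
        ]
        coefs = [f(t) for t in nodes]
        # In-place Newton divided differences
        for order in range(1, degree + 1):
            for j in range(degree, order - 1, -1):
                coefs[j] = (coefs[j] - coefs[j - 1]) / (nodes[j] - nodes[j - order])
        pieces.append((nodes, coefs))
    return pieces


def eval_piecewise(pieces, x):
    """Evaluate the precomputed approximation with a Horner-like nested scheme."""
    if abs(x) > 1.0:
        return 0.0
    n = len(pieces)
    p = min(int((x + 1.0) * n / 2.0), n - 1)  # index of the subinterval containing x
    nodes, coefs = pieces[p]
    result = coefs[-1]
    for j in range(len(coefs) - 2, -1, -1):
        result = result * (x - nodes[j]) + coefs[j]
    return result


beta = 10.0  # hypothetical shape parameter
pieces = build_piecewise(lambda t: kb_direct(t, beta), n_pieces=8, degree=7)
xs = [-1.0 + 2.0 * i / 1000 for i in range(1001)]
max_err = max(abs(eval_piecewise(pieces, x) - kb_direct(x, beta)) for x in xs)
print(f"max |piecewise - direct| = {max_err:.2e}")
```

The piecewise variant trades a special-function call for a table of coefficients that has to be stored and indexed for every evaluation, while the direct variant needs no table at all.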

Also, specifically on CUDA, the default is now kernel = KaiserBesselKernel() instead of BackwardsKaiserBesselKernel(), as it seems to be a bit faster. Accuracy is not affected, since both kernels have equivalent precision.
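
For readers unfamiliar with the second kernel, the sketch below evaluates a "backwards" Kaiser-Bessel window directly, again in plain Python rather than the package's Julia code. It assumes the commonly used shape sinh(s) / s with s = beta * sqrt(1 - x^2), normalised here so that the value at x = 0 is 1; the package's actual normalisation may differ. The function name bkb_direct and the threshold 1e-8 are hypothetical. The guard at s close to 0 illustrates the kind of issue that the commit "Avoid division by zero in BKB kernel" in this PR refers to, without claiming to reproduce that fix.

```python
import math


def bkb_direct(x, beta):
    """Direct evaluation of an assumed backwards Kaiser-Bessel window on |x| <= 1."""
    if abs(x) > 1.0:
        return 0.0
    s = beta * math.sqrt(max(0.0, 1.0 - x * x))
    # sinh(s) / s tends to 1 as s -> 0 (that is, at x = +/-1), so use the limit
    # there instead of dividing 0 by 0.
    ratio = 1.0 if s < 1e-8 else math.sinh(s) / s
    return ratio / (math.sinh(beta) / beta)


beta = 10.0  # hypothetical shape parameter
for x in (0.0, 0.5, 1.0):
    print(x, bkb_direct(x, beta))
```

At the edge of the support the window takes the finite value beta / sinh(beta) rather than producing a NaN.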


codecov · 2024-10-25T16:51:53Z

Codecov Report

Attention: Patch coverage is 92.59259% with 4 lines in your changes missing coverage. Please review.

Project coverage is 92.32%. Comparing base (ae8f012) to head (b10d7cd).
Report is 1 commit behind head on master.

Files with missing lines	Patch %	Lines
ext/NonuniformFFTsCUDAExt.jl	0.00%	3 Missing ⚠️
src/abstractNFFTs.jl	66.66%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##           master      #39   +/-   ##
=======================================
  Coverage   92.31%   92.32%           
=======================================
  Files          18       19    +1     
  Lines        1614     1654   +40     
=======================================
+ Hits         1490     1527   +37     
- Misses        124      127    +3


jipolanco added 30 commits October 18, 2024 11:25

Start work on shared-memory implementation

5c45c7d

WIP

6be3d8e

Interpolation now works (in 3D, ntransforms = 1)

6784317

Minor changes

cec2a65

Optimisations

0ac8886

Further optimisations

ecef038

Generalise interpolation to all dimensions

e4aa4ca

Use get_inds_vals_gpu in spreading

7d1e4f6

Spreading with shmem works but is slow

2a34d0b

Atomic add on shared memory is very slow. Is it Atomix's fault?

SM kernels now work with KA CPU backends

87f050c

Test SM kernels on CPU

46acbda

Fix kernel compilation on CUDA

3fd04b9

Reorganise some code

6b0fad6

[WIP] avoid atomics in shared-memory arrays

867b73e

Avoid atomic operations on shared memory

801eb5e

Minor improvement

24f5956

Make window_vals a matrix

67e4baa

Implement hybrid parallelisation in SM spreading

765b73e

Much faster!

More optimisations

5bf7ae4

Try to fix tests (on CPU)

f9e7931

Shared memory array can be complex

e1c2daf

Update interpolation based on spreading changes

4898e6e

Simplify setting workgroupsize

df0690f

Minor changes

2b28729

Fix CPU tests

d0f439f

Remove unused functions

2b0b4b6

Add tests for multiple transforms

6bbeb8e

Simplify atomic adds with complex data

7bc11bc

Doesn't change performance.

point_to_cell now also returns x/Δx

05edddf

Remove direct evaluation functions (for now)

1f6c02a

jipolanco added 22 commits October 23, 2024 17:00

Update CHANGELOG [skip ci]

e8ea63d

Add documentation

92356d3

Update docs and comments

c9311e0

Define direct evaluation of KB kernels

7010a61

Merge branch 'master' into eval-direct

31f7975

Fix direct evaluation

1f3efa1

Minor optimisations

3316471

Use direct evaluation in GPU kernels

7661594

It is the same or faster than polynomial approximation (and more accurate!). Especially for shared-memory interpolation, it seems to improve performance by a lot.

Avoid division by zero in BKB kernel

c3b4ed2

Simplify window evaluation

d5295e9

Doesn't really affect performance, on GPU at least.

Add comment on besseli0

5c7d2a3

Add empty CUDA extension

543f3b3

KB kernel: call CUDA version of besseli0

3c705eb

Update comment

d39375d

Define direct evaluation for GaussianKernel

aa98132

Define direct evaluation for BSplineKernel

a29d615

Allow different default kernel per backend

6f9603d

Update comments

ff78364

Add tests

ade1a5f

Update CHANGELOG

72972b7

Remove old comment

7ddeaab

Update tests

b10d7cd

jipolanco merged commit dd4d842 into master Oct 25, 2024
6 checks passed

jipolanco deleted the eval-direct branch October 25, 2024 19:45
