Performance overhead when running on a single process #35

Open · musoke opened this issue Jun 22, 2021 · 2 comments

musoke commented Jun 22, 2021

I have found some odd benchmarks when using PencilFFTs with a single process. It's unclear to me whether this is expected behavior.

The following script compares FFTW and PencilFFTs with three different FFTW flags.

using BenchmarkTools
using FFTW
using LinearAlgebra
using MPI
using PencilFFTs
using Random

# Apply the precomputed plan out of place, writing the transform of `a` into the preallocated `b`.
function fft_inplace!(a, b, fft_plan)
    mul!(b, fft_plan, a)
end

function runtests()
    suite = BenchmarkGroup()

    if !MPI.Initialized()
        MPI.Init()
    end

    n = 128
    dims = (n, n, n)

    suite["PencilFFTs"] = BenchmarkGroup()
    for (flag_name, flags) in [
                           ("FFTW.ESTIMATE", FFTW.ESTIMATE),
                           ("FFTW.MEASURE", FFTW.MEASURE),
                           ("FFTW.PATIENT", FFTW.PATIENT),
                          ]
        @info flag_name

        # Use all processes in MPI.COMM_WORLD (a single process in this test).
        comm = MPI.COMM_WORLD
        Nproc = MPI.Comm_size(comm)

        # Let MPI_Dims_create choose the decomposition.
        proc_dims = let pdims = zeros(Int, 2)
            MPI.Dims_create!(Nproc, pdims)
            pdims[1], pdims[2]
        end

        transform = Transforms.FFT()

        fft_plan = PencilFFTPlan(dims, transform, proc_dims, comm; fftw_flags=flags)

        # Allocate and initialise input data, and apply transform.
        a_pen = allocate_input(fft_plan)
        b_pen = allocate_output(fft_plan)
        randn!(a_pen)
        randn!(b_pen)

        suite["PencilFFTs"][flag_name] = @benchmarkable fft_inplace!($a_pen, $b_pen, $fft_plan)
    end

    suite["FFTW"] = BenchmarkGroup()
    for (flag_name, flags) in [
                           ("FFTW.ESTIMATE", FFTW.ESTIMATE),
                           ("FFTW.MEASURE", FFTW.MEASURE),
                           ("FFTW.PATIENT", FFTW.PATIENT),
                          ]
        a_arr = Array{ComplexF64}(undef, dims)
        b_arr = Array{ComplexF64}(undef, dims)
        rand!(a_arr)
        rand!(b_arr)

        fft_plan = plan_fft(a_arr; flags=flags)

        suite["FFTW"][flag_name] = @benchmarkable fft_inplace!($a_arr, $b_arr, $fft_plan)
    end

    @show tune!(suite)

    results = run(suite)

    @show res_min = minimum(results)
end

runtests()

Typical output for me is

tune!(suite) = 2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "PencilFFTs" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "FFTW.PATIENT" => Benchmark(evals=1, seconds=5.0, samples=10000)
	  "FFTW.ESTIMATE" => Benchmark(evals=1, seconds=5.0, samples=10000)
	  "FFTW.MEASURE" => Benchmark(evals=1, seconds=5.0, samples=10000)
  "FFTW" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "FFTW.PATIENT" => Benchmark(evals=1, seconds=5.0, samples=10000)
	  "FFTW.ESTIMATE" => Benchmark(evals=1, seconds=5.0, samples=10000)
	  "FFTW.MEASURE" => Benchmark(evals=1, seconds=5.0, samples=10000)
res_min = minimum(results) = 2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "PencilFFTs" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "FFTW.PATIENT" => TrialEstimate(68.955 ms)
	  "FFTW.ESTIMATE" => TrialEstimate(69.221 ms)
	  "FFTW.MEASURE" => TrialEstimate(69.830 ms)
  "FFTW" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "FFTW.PATIENT" => TrialEstimate(37.748 ms)
	  "FFTW.ESTIMATE" => TrialEstimate(75.944 ms)
	  "FFTW.MEASURE" => TrialEstimate(38.887 ms)

With FFTW.ESTIMATE, FFTW and PencilFFTs are roughly on par. However, while FFTW.MEASURE or FFTW.PATIENT makes FFTW about twice as fast, PencilFFTs is unaffected.

At first it seemed that this might be a problem with how the flags are passed to FFTW, but I don't see an obvious bug there. The correct flags do seem to be passed through to _make_1d_fft_plan.
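
For what it's worth, an indirect way to check this from outside the package is to time plan creation: if FFTW.MEASURE actually reaches FFTW, building the plan should take noticeably longer than with FFTW.ESTIMATE. A rough sketch (same dims as above, single process; not part of the benchmark itself):

using FFTW
using MPI
using PencilFFTs

MPI.Initialized() || MPI.Init()
comm = MPI.COMM_WORLD
dims = (128, 128, 128)
proc_dims = (1, 1)  # single-process "grid"

for (flag_name, flags) in [("FFTW.ESTIMATE", FFTW.ESTIMATE), ("FFTW.MEASURE", FFTW.MEASURE)]
    @info flag_name
    # Planning with MEASURE should be visibly slower than with ESTIMATE if the flag is honoured.
    @time PencilFFTPlan(dims, Transforms.FFT(), proc_dims, comm; fftw_flags = flags)
end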

Perhaps FFTW does some clever optimization of 3D FFTs that isn't possible when the transform is done one axis at a time?
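
One way to probe that hypothesis, independently of PencilFFTs, is to compare a single 3D plan against 1D plans applied one axis at a time (again a rough sketch, not part of the benchmark above):

using BenchmarkTools
using FFTW
using LinearAlgebra

dims = (128, 128, 128)
a = rand(ComplexF64, dims)
b = similar(a)

p3d = plan_fft(a; flags = FFTW.MEASURE)    # a single 3D plan
p1 = plan_fft(a, 1; flags = FFTW.MEASURE)  # separate 1D plans, one per axis
p2 = plan_fft(a, 2; flags = FFTW.MEASURE)
p3 = plan_fft(a, 3; flags = FFTW.MEASURE)

@btime mul!($b, $p3d, $a)  # all three axes at once
@btime begin               # axis by axis (no transpositions, unlike PencilFFTs)
    mul!($b, $p1, $a)
    mul!($a, $p2, $b)      # reuses `a` as scratch space; fine for timing purposes
    mul!($b, $p3, $a)
end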

jipolanco (Owner) commented Jun 22, 2021

Thanks for noticing this issue.

After doing some tests, it seems that the performance issue is not related to the FFTW flags, which are properly passed, but to the overhead caused by data copies and by local data transpositions (basically permutedims!).

Here is an updated example showing this:

using BenchmarkTools
using FFTW
using LinearAlgebra
using MPI
using PencilFFTs
using Random
using TimerOutputs
using PencilArrays

TimerOutputs.enable_debug_timings(PencilFFTs)
TimerOutputs.enable_debug_timings(PencilArrays)
TimerOutputs.enable_debug_timings(Transpositions)

function fft_inplace!(a, b, fft_plan)
    mul!(b, fft_plan, a)
end

function runtests()
    suite = BenchmarkGroup()

    if !MPI.Initialized()
        MPI.Init()
    end

    n = 128
    dims = (n, n, n)

    timers = Dict{String,TimerOutput}()

    suite["PencilFFTs"] = BenchmarkGroup()
    for (flag_name, flags) in [
                           ("FFTW.ESTIMATE", FFTW.ESTIMATE),
                           ("FFTW.MEASURE", FFTW.MEASURE),
                           # ("FFTW.PATIENT", FFTW.PATIENT),
                          ]
        @info "PencilFFTs -- $flag_name"

        # Use all processes in MPI.COMM_WORLD (a single process in this test).
        comm = MPI.COMM_WORLD
        Nproc = MPI.Comm_size(comm)

        # Let MPI_Dims_create choose the decomposition.
        proc_dims = let pdims = zeros(Int, 2)
            MPI.Dims_create!(Nproc, pdims)
            pdims[1], pdims[2]
        end

        transform = Transforms.FFT()

        fft_plan = PencilFFTPlan(
            dims, transform, proc_dims, comm;
            fftw_flags = flags,
            permute_dims = Val(true),
        )

        timers[flag_name] = timer(fft_plan)

        # Allocate and initialise input data, and apply transform.
        a_pen = allocate_input(fft_plan)
        b_pen = allocate_output(fft_plan)
        randn!(a_pen)
        randn!(b_pen)

        suite["PencilFFTs"][flag_name] = @benchmarkable fft_inplace!($a_pen, $b_pen, $fft_plan) seconds=1.0 samples=100
    end

    suite["FFTW"] = BenchmarkGroup()
    for (flag_name, flags) in [
                           ("FFTW.ESTIMATE", FFTW.ESTIMATE),
                           ("FFTW.MEASURE", FFTW.MEASURE),
                           # ("FFTW.PATIENT", FFTW.PATIENT),
                          ]
        @info "FFTW -- $flag_name"
        a_arr = Array{ComplexF64}(undef, dims)
        b_arr = Array{ComplexF64}(undef, dims)
        rand!(a_arr)
        rand!(b_arr)

        # FFTW.set_num_threads(1)
        fft_plan = plan_fft(a_arr; flags=flags)

        suite["FFTW"][flag_name] = @benchmarkable fft_inplace!($a_arr, $b_arr, $fft_plan) seconds=1.0 samples=100
    end

    @show tune!(suite)

    for to in values(timers)
        reset_timer!(to)
    end

    results = run(suite)

    @show minimum(results)

    println(timers["FFTW.ESTIMATE"])

    nothing
end

runtests()

On my machine, the BenchmarkTools output gives the following:

  "PencilFFTs" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "FFTW.ESTIMATE" => TrialEstimate(63.527 ms)
          "FFTW.MEASURE" => TrialEstimate(60.602 ms)
  "FFTW" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "FFTW.ESTIMATE" => TrialEstimate(25.541 ms)
          "FFTW.MEASURE" => TrialEstimate(24.178 ms)

which suggests that the FFTW flags are not the issue. This is confirmed by the TimerOutputs timings:

 ───────────────────────────────────────────────────────────────────────────────
                                        Time                   Allocations      
                                ──────────────────────   ───────────────────────
        Tot / % measured:            7.83s / 13.3%            451KiB / 34.9%    

 Section                ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────
 PencilFFTs mul!            11    1.04s   100%  94.8ms    157KiB  100%   14.3KiB
   transpose!               22    564ms  54.1%  25.6ms   20.0KiB  12.7%     930B
     unpack data            22    336ms  32.2%  15.3ms   8.22KiB  5.23%     383B
       copy_permuted!       22    336ms  32.2%  15.3ms   7.39KiB  4.70%     344B
     pack data              22    228ms  21.9%  10.4ms   1.52KiB  0.96%    70.5B
       copy_range!          22    228ms  21.8%  10.4ms     0.00B  0.00%    0.00B
   FFT                      33    478ms  45.8%  14.5ms     0.00B  0.00%    0.00B
   MPI.Waitall!             22    172μs  0.02%  7.83μs   4.47KiB  2.84%     208B
 ───────────────────────────────────────────────────────────────────────────────

Half of the time is spent copying data around and permuting dimensions via permutedims!. It should be possible to avoid the unnecessary overhead when running on a single process...
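
For reference, a rough single-process sketch (outside the package) of what one local transposition of a 128³ array costs relative to a single 1D FFT pass, which is roughly the per-stage cost that shows up in the timer above:

using BenchmarkTools
using FFTW
using LinearAlgebra

a = rand(ComplexF64, 128, 128, 128)
b = similar(a)
p = plan_fft(a, 1; flags = FFTW.MEASURE)  # one 1D FFT pass, for scale

@btime permutedims!($b, $a, (2, 1, 3))  # a local transposition of the whole array
@btime mul!($b, $p, $a)                 # one FFT pass along the first dimension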

jipolanco changed the title from "FFTW flags not used" to "Performance overhead when running on a single process" on Jun 22, 2021

musoke (Author) commented Jun 22, 2021 via email
