Performance overhead when running on a single process #35

Open · musoke opened this issue Jun 22, 2021 · 2 comments

musoke commented Jun 22, 2021

I have found some odd benchmarks when using PencilFFTs with a single process. It's unclear to me whether this is expected behavior.

The following script compares FFTW and PencilFFTs with three different FFTW flags.

using BenchmarkTools
using FFTW
using LinearAlgebra
using MPI
using PencilFFTs
using Random

# Apply the precomputed plan out of place, writing the transform of `a` into the preallocated `b`.
function fft_inplace!(a, b, fft_plan)
    mul!(b, fft_plan, a)
end

function runtests()
    suite = BenchmarkGroup()

    if !MPI.Initialized()
        MPI.Init()
    end

    n = 128
    dims = (n, n, n)

    suite["PencilFFTs"] = BenchmarkGroup()
    for (flag_name, flags) in [
                           ("FFTW.ESTIMATE", FFTW.ESTIMATE),
                           ("FFTW.MEASURE", FFTW.MEASURE),
                           ("FFTW.PATIENT", FFTW.PATIENT),
                          ]
        @info flag_name

        # Use all processes in MPI.COMM_WORLD (a single process in this test).
        comm = MPI.COMM_WORLD
        Nproc = MPI.Comm_size(comm)

        # Let MPI_Dims_create choose the decomposition.
        proc_dims = let pdims = zeros(Int, 2)
            MPI.Dims_create!(Nproc, pdims)
            pdims[1], pdims[2]
        end

        transform = Transforms.FFT()

        fft_plan = PencilFFTPlan(dims, transform, proc_dims, comm; fftw_flags=flags)

        # Allocate and initialise input data, and apply transform.
        a_pen = allocate_input(fft_plan)
        b_pen = allocate_output(fft_plan)
        randn!(a_pen)
        randn!(b_pen)

        suite["PencilFFTs"][flag_name] = @benchmarkable fft_inplace!($a_pen, $b_pen, $fft_plan)
    end

    suite["FFTW"] = BenchmarkGroup()
    for (flag_name, flags) in [
                           ("FFTW.ESTIMATE", FFTW.ESTIMATE),
                           ("FFTW.MEASURE", FFTW.MEASURE),
                           ("FFTW.PATIENT", FFTW.PATIENT),
                          ]
        a_arr = Array{ComplexF64}(undef, dims)
        b_arr = Array{ComplexF64}(undef, dims)
        rand!(a_arr)
        rand!(b_arr)

        fft_plan = plan_fft(a_arr; flags=flags)

        suite["FFTW"][flag_name] = @benchmarkable fft_inplace!($a_arr, $b_arr, $fft_plan)
    end

    @show tune!(suite)

    results = run(suite)

    @show res_min = minimum(results)
end

runtests()

Typical output for me is

tune!(suite) = 2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "PencilFFTs" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "FFTW.PATIENT" => Benchmark(evals=1, seconds=5.0, samples=10000)
	  "FFTW.ESTIMATE" => Benchmark(evals=1, seconds=5.0, samples=10000)
	  "FFTW.MEASURE" => Benchmark(evals=1, seconds=5.0, samples=10000)
  "FFTW" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "FFTW.PATIENT" => Benchmark(evals=1, seconds=5.0, samples=10000)
	  "FFTW.ESTIMATE" => Benchmark(evals=1, seconds=5.0, samples=10000)
	  "FFTW.MEASURE" => Benchmark(evals=1, seconds=5.0, samples=10000)
res_min = minimum(results) = 2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "PencilFFTs" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "FFTW.PATIENT" => TrialEstimate(68.955 ms)
	  "FFTW.ESTIMATE" => TrialEstimate(69.221 ms)
	  "FFTW.MEASURE" => TrialEstimate(69.830 ms)
  "FFTW" => 3-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "FFTW.PATIENT" => TrialEstimate(37.748 ms)
	  "FFTW.ESTIMATE" => TrialEstimate(75.944 ms)
	  "FFTW.MEASURE" => TrialEstimate(38.887 ms)

With FFTW.ESTIMATE, FFTW and PencilFFTs are roughly on par. However, while FFTW.MEASURE or FFTW.PATIENT makes FFTW about twice as fast, PencilFFTs is unaffected.

At first it seemed that this might be a problem with how the flags are passed to FFTW, but I don't see an obvious bug there. The correct flags do seem to be passed through to _make_1d_fft_plan.
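
For what it's worth, an indirect way to check this from outside the package is to time plan creation: if FFTW.MEASURE actually reaches FFTW, building the plan should take noticeably longer than with FFTW.ESTIMATE. A rough sketch (same dims as above, single process; not part of the benchmark itself):

using FFTW
using MPI
using PencilFFTs

MPI.Initialized() || MPI.Init()
comm = MPI.COMM_WORLD
dims = (128, 128, 128)
proc_dims = (1, 1)  # single-process "grid"

for (flag_name, flags) in [("FFTW.ESTIMATE", FFTW.ESTIMATE), ("FFTW.MEASURE", FFTW.MEASURE)]
    @info flag_name
    # Planning with MEASURE should be visibly slower than with ESTIMATE if the flag is honoured.
    @time PencilFFTPlan(dims, Transforms.FFT(), proc_dims, comm; fftw_flags = flags)
end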

Perhaps FFTW does some clever optimization of 3D FFTs that isn't possible when the transform is done one axis at a time?
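
One way to probe that hypothesis, independently of PencilFFTs, is to compare a single 3D plan against 1D plans applied one axis at a time (again a rough sketch, not part of the benchmark above):

using BenchmarkTools
using FFTW
using LinearAlgebra

dims = (128, 128, 128)
a = rand(ComplexF64, dims)
b = similar(a)

p3d = plan_fft(a; flags = FFTW.MEASURE)    # a single 3D plan
p1 = plan_fft(a, 1; flags = FFTW.MEASURE)  # separate 1D plans, one per axis
p2 = plan_fft(a, 2; flags = FFTW.MEASURE)
p3 = plan_fft(a, 3; flags = FFTW.MEASURE)

@btime mul!($b, $p3d, $a)  # all three axes at once
@btime begin               # axis by axis (no transpositions, unlike PencilFFTs)
    mul!($b, $p1, $a)
    mul!($a, $p2, $b)      # reuses `a` as scratch space; fine for timing purposes
    mul!($b, $p3, $a)
end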

jipolanco (Owner) commented Jun 22, 2021

Thanks for noticing this issue.

After doing some tests, it seems that the performance issue is not related to the FFTW flags, which are properly passed, but to the overhead caused by data copies and by local data transpositions (basically permutedims!).

Here is an updated example showing this:

using BenchmarkTools
using FFTW
using LinearAlgebra
using MPI
using PencilFFTs
using Random
using TimerOutputs
using PencilArrays

TimerOutputs.enable_debug_timings(PencilFFTs)
TimerOutputs.enable_debug_timings(PencilArrays)
TimerOutputs.enable_debug_timings(Transpositions)

function fft_inplace!(a, b, fft_plan)
    mul!(b, fft_plan, a)
end

function runtests()
    suite = BenchmarkGroup()

    if !MPI.Initialized()
        MPI.Init()
    end

    n = 128
    dims = (n, n, n)

    timers = Dict{String,TimerOutput}()

    suite["PencilFFTs"] = BenchmarkGroup()
    for (flag_name, flags) in [
                           ("FFTW.ESTIMATE", FFTW.ESTIMATE),
                           ("FFTW.MEASURE", FFTW.MEASURE),
                           # ("FFTW.PATIENT", FFTW.PATIENT),
                          ]
        @info "PencilFFTs -- $flag_name"

        # Use all processes in MPI.COMM_WORLD (a single process in this test).
        comm = MPI.COMM_WORLD
        Nproc = MPI.Comm_size(comm)

        # Let MPI_Dims_create choose the decomposition.
        proc_dims = let pdims = zeros(Int, 2)
            MPI.Dims_create!(Nproc, pdims)
            pdims[1], pdims[2]
        end

        transform = Transforms.FFT()

        fft_plan = PencilFFTPlan(
            dims, transform, proc_dims, comm;
            fftw_flags = flags,
            permute_dims = Val(true),
        )

        timers[flag_name] = timer(fft_plan)

        # Allocate and initialise input data, and apply transform.
        a_pen = allocate_input(fft_plan)
        b_pen = allocate_output(fft_plan)
        randn!(a_pen)
        randn!(b_pen)

        suite["PencilFFTs"][flag_name] = @benchmarkable fft_inplace!($a_pen, $b_pen, $fft_plan) seconds=1.0 samples=100
    end

    suite["FFTW"] = BenchmarkGroup()
    for (flag_name, flags) in [
                           ("FFTW.ESTIMATE", FFTW.ESTIMATE),
                           ("FFTW.MEASURE", FFTW.MEASURE),
                           # ("FFTW.PATIENT", FFTW.PATIENT),
                          ]
        @info "FFTW -- $flag_name"
        a_arr = Array{ComplexF64}(undef, dims)
        b_arr = Array{ComplexF64}(undef, dims)
        rand!(a_arr)
        rand!(b_arr)

        # FFTW.set_num_threads(1)
        fft_plan = plan_fft(a_arr; flags=flags)

        suite["FFTW"][flag_name] = @benchmarkable fft_inplace!($a_arr, $b_arr, $fft_plan) seconds=1.0 samples=100
    end

    @show tune!(suite)

    for to in values(timers)
        reset_timer!(to)
    end

    results = run(suite)

    @show minimum(results)

    println(timers["FFTW.ESTIMATE"])

    nothing
end

runtests()

On my machine, the BenchmarkTools output gives the following:

  "PencilFFTs" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "FFTW.ESTIMATE" => TrialEstimate(63.527 ms)
          "FFTW.MEASURE" => TrialEstimate(60.602 ms)
  "FFTW" => 2-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "FFTW.ESTIMATE" => TrialEstimate(25.541 ms)
          "FFTW.MEASURE" => TrialEstimate(24.178 ms)

which suggests that the FFTW flags are not the issue. This is confirmed by the TimerOutputs timings:

 ───────────────────────────────────────────────────────────────────────────────
                                        Time                   Allocations      
                                ──────────────────────   ───────────────────────
        Tot / % measured:            7.83s / 13.3%            451KiB / 34.9%    

 Section                ncalls     time   %tot     avg     alloc   %tot      avg
 ───────────────────────────────────────────────────────────────────────────────
 PencilFFTs mul!            11    1.04s   100%  94.8ms    157KiB  100%   14.3KiB
   transpose!               22    564ms  54.1%  25.6ms   20.0KiB  12.7%     930B
     unpack data            22    336ms  32.2%  15.3ms   8.22KiB  5.23%     383B
       copy_permuted!       22    336ms  32.2%  15.3ms   7.39KiB  4.70%     344B
     pack data              22    228ms  21.9%  10.4ms   1.52KiB  0.96%    70.5B
       copy_range!          22    228ms  21.8%  10.4ms     0.00B  0.00%    0.00B
   FFT                      33    478ms  45.8%  14.5ms     0.00B  0.00%    0.00B
   MPI.Waitall!             22    172μs  0.02%  7.83μs   4.47KiB  2.84%     208B
 ───────────────────────────────────────────────────────────────────────────────

Half of the time is spent copying data around and permuting dimensions via permutedims!. It should be possible to avoid the unnecessary overhead when running on a single process...
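
For reference, a rough single-process sketch (outside the package) of what one local transposition of a 128³ array costs relative to a single 1D FFT pass, which is roughly the per-stage cost that shows up in the timer above:

using BenchmarkTools
using FFTW
using LinearAlgebra

a = rand(ComplexF64, 128, 128, 128)
b = similar(a)
p = plan_fft(a, 1; flags = FFTW.MEASURE)  # one 1D FFT pass, for scale

@btime permutedims!($b, $a, (2, 1, 3))  # a local transposition of the whole array
@btime mul!($b, $p, $a)                 # one FFT pass along the first dimension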

jipolanco changed the title from "FFTW flags not used" to "Performance overhead when running on a single process" on Jun 22, 2021

musoke (Author) commented Jun 22, 2021 via email
