Performance overhead when running on a single process #35
I have found some odd benchmarks when using PencilFFTs with a single process. It's unclear to me whether this is expected behavior.
The following script compares FFTW and PencilFFTs with three different FFTW flags. Typical output for me shows that, with FFTW.ESTIMATE, FFTW and PencilFFTs are on par. However, while using FFTW.MEASURE or FFTW.PATIENT makes FFTW twice as fast, PencilFFTs is unaffected.
At first it seemed that this might be a problem with how the flags are passed to FFTW, but I don't see an obvious bug there. The correct flags do seem to be passed through to _make_1d_fft_plan.
Perhaps FFTW does some clever optimization of 3D FFTs that isn't possible when the transform is done one axis at a time?
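One way to probe that last hypothesis, independently of PencilFFTs, is to compare a single 3D FFTW plan against three 1D plans applied one axis at a time. The sketch below is not part of the original issue and deliberately ignores the MPI transposes and data copies that a real pencil decomposition performs; it only isolates the "one axis at a time" aspect:
using BenchmarkTools
using FFTW
using LinearAlgebra
using Random

n = 128
a = Array{ComplexF64}(undef, n, n, n)
b = similar(a)
c = similar(a)

# Plan before filling the arrays: planning with FFTW.MEASURE may overwrite
# the contents of the array passed to the planner.
p3d = plan_fft(a; flags = FFTW.MEASURE)      # one plan for the full 3D transform
p1  = plan_fft(a, 1; flags = FFTW.MEASURE)   # separate 1D plans, one per axis
p2  = plan_fft(a, 2; flags = FFTW.MEASURE)
p3  = plan_fft(a, 3; flags = FFTW.MEASURE)

randn!(a)

@btime mul!($b, $p3d, $a)        # single 3D plan

@btime begin                     # same transform, one dimension at a time
    mul!($b, $p1, $a)            # FFT along dimension 1
    mul!($c, $p2, $b)            # FFT along dimension 2
    mul!($b, $p3, $c)            # FFT along dimension 3
end
If the two timings are close, FFTW is not gaining much from planning the 3D transform as a whole, which would point away from this hypothesis.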
Thanks for noticing this issue. After doing some tests, it seems that the performance issue is not related to the FFTW flags, which are properly passed, but to the overhead caused by data copies and by local data transpositions (basically permutedims!). Here is an updated example showing this:
using BenchmarkTools
using FFTW
using LinearAlgebra
using MPI
using PencilFFTs
using Random
using TimerOutputs
using PencilArrays
TimerOutputs.enable_debug_timings(PencilFFTs)
TimerOutputs.enable_debug_timings(PencilArrays)
TimerOutputs.enable_debug_timings(Transpositions)
function fft_inplace!(a, b, fft_plan)
mul!(b, fft_plan, a)
end
function runtests()
suite = BenchmarkGroup()
if !MPI.Initialized()
MPI.Init()
end
n = 128
dims = (n, n, n)
timers = Dict{String,TimerOutput}()
suite["PencilFFTs"] = BenchmarkGroup()
for (flag_name, flags) in [
("FFTW.ESTIMATE", FFTW.ESTIMATE),
("FFTW.MEASURE", FFTW.MEASURE),
# ("FFTW.PATIENT", FFTW.PATIENT),
]
@info "PencilFFTs -- $flag_name"
# Get the MPI communicator and the number of processes.
comm = MPI.COMM_WORLD
Nproc = MPI.Comm_size(comm)
# Let MPI_Dims_create choose the decomposition.
proc_dims = let pdims = zeros(Int, 2)
MPI.Dims_create!(Nproc, pdims)
pdims[1], pdims[2]
end
transform = Transforms.FFT()
fft_plan = PencilFFTPlan(
dims, transform, proc_dims, comm;
fftw_flags = flags,
permute_dims = Val(true),
)
timers[flag_name] = timer(fft_plan)
# Allocate and initialise the input and output arrays; the transform itself is applied in the benchmark below.
a_pen = allocate_input(fft_plan)
b_pen = allocate_output(fft_plan)
randn!(a_pen)
randn!(b_pen)
suite["PencilFFTs"][flag_name] = @benchmarkable fft_inplace!($a_pen, $b_pen, $fft_plan) seconds=1.0 samples=100
end
suite["FFTW"] = BenchmarkGroup()
for (flag_name, flags) in [
("FFTW.ESTIMATE", FFTW.ESTIMATE),
("FFTW.MEASURE", FFTW.MEASURE),
# ("FFTW.PATIENT", FFTW.PATIENT),
]
@info "FFTW -- $flag_name"
a_arr = Array{ComplexF64}(undef, dims)
b_arr = Array{ComplexF64}(undef, dims)
rand!(a_arr)
rand!(b_arr)
# FFTW.set_num_threads(1)
fft_plan = plan_fft(a_arr; flags=flags)
suite["FFTW"][flag_name] = @benchmarkable fft_inplace!($a_arr, $b_arr, $fft_plan) seconds=1.0 samples=100
end
@show tune!(suite)
for to in values(timers)
reset_timer!(to)
end
results = run(suite)
@show minimum(results)
println(timers["FFTW.ESTIMATE"])
nothing
end
runtests()
On my machine, the BenchmarkTools output gives the following:
"PencilFFTs" => 2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "FFTW.ESTIMATE" => TrialEstimate(63.527 ms)
  "FFTW.MEASURE" => TrialEstimate(60.602 ms)
"FFTW" => 2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "FFTW.ESTIMATE" => TrialEstimate(25.541 ms)
  "FFTW.MEASURE" => TrialEstimate(24.178 ms)
which suggests that the FFTW flags are not the issue. This is confirmed by the TimerOutputs timings:
────────────────────────────────────────────────────────────────────────────────
                                           Time                   Allocations
                                   ──────────────────────   ───────────────────────
          Tot / % measured:            7.83s / 13.3%            451KiB / 34.9%

 Section                  ncalls     time   %tot      avg     alloc   %tot      avg
────────────────────────────────────────────────────────────────────────────────
 PencilFFTs mul!              11    1.04s   100%    94.8ms   157KiB   100%   14.3KiB
   transpose!                 22    564ms  54.1%    25.6ms  20.0KiB  12.7%      930B
     unpack data              22    336ms  32.2%    15.3ms  8.22KiB  5.23%      383B
       copy_permuted!         22    336ms  32.2%    15.3ms  7.39KiB  4.70%      344B
     pack data                22    228ms  21.9%    10.4ms  1.52KiB  0.96%     70.5B
       copy_range!            22    228ms  21.8%    10.4ms    0.00B  0.00%     0.00B
   FFT                        33    478ms  45.8%    14.5ms    0.00B  0.00%     0.00B
   MPI.Waitall!               22    172μs  0.02%   7.83μs   4.47KiB  2.84%      208B
────────────────────────────────────────────────────────────────────────────────
Half of the time is spent copying data around and permuting dimensions via permutedims!. It should be possible to avoid the unnecessary overhead when running on a single process...
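As a rough, machine-dependent sanity check of those numbers (not part of the original comment), one can time a single permutedims! of an array of the same size next to the full 3D transform. The real copy_permuted! operates on pencil-shaped blocks rather than on the whole array, so this is only a crude proxy for the cost of one local transposition:
using BenchmarkTools
using FFTW
using LinearAlgebra
using Random

n = 128
a = Array{ComplexF64}(undef, n, n, n)
b = similar(a)
p = plan_fft(a; flags = FFTW.MEASURE)
randn!(a)

@btime mul!($b, $p, $a)                   # full 3D FFT
@btime permutedims!($b, $a, (2, 1, 3))    # one local transposition of the same data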
Ok, this makes sense, thanks for the explanation!
I was misled by the times with FFTW.ESTIMATE coincidentally being the same.
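For completeness, here is a user-level workaround sketch, not something discussed in the thread: when only one MPI process is in use, one can fall back to a plain FFTW plan and skip the pencil machinery entirely. The helper name make_fft_plan is hypothetical, and the two branches return plans acting on different array types (a plain Array versus the arrays returned by allocate_input), so the calling code has to allocate its buffers accordingly:
using FFTW
using MPI
using PencilFFTs

# Hypothetical helper: use PencilFFTs only when the data is actually distributed.
function make_fft_plan(dims, comm; flags = FFTW.MEASURE)
    if MPI.Comm_size(comm) == 1
        # Single process: a plain 3D FFTW plan avoids the pack/unpack copies
        # and the permutedims! calls entirely.
        a = Array{ComplexF64}(undef, dims)
        return plan_fft(a; flags = flags)
    else
        # Several processes: let MPI choose the 2D decomposition, as in the
        # script above, and build a distributed plan.
        proc_dims = let pdims = zeros(Int, 2)
            MPI.Dims_create!(MPI.Comm_size(comm), pdims)
            (pdims[1], pdims[2])
        end
        return PencilFFTPlan(dims, Transforms.FFT(), proc_dims, comm;
                             fftw_flags = flags)
    end
end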