
Removed allocation from filter_pixel() #656

Closed
wants to merge 3 commits

Conversation

ripytide
Member

@ripytide ripytide commented May 27, 2024

Achieved by always computing with 4 channels, using an array of length 4, since const generic expressions are unstable.
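The idea can be sketched roughly like this (a hypothetical, simplified sketch, not the PR's actual code): a fixed-size accumulator of length 4 covers every channel count up to Rgba, so no per-pixel `Vec` allocation is needed.

```rust
// Hypothetical sketch of the approach described above: accumulate the
// weighted sum into a stack-allocated [i32; 4] (4 = max channel count,
// e.g. Rgba) instead of heap-allocating a Vec per pixel. A true
// [i32; CHANNELS] would need unstable const generic expressions.
fn weighted_sum_rgb(pixels: &[[i32; 3]], weights: &[i32]) -> [i32; 4] {
    let mut acc = [0i32; 4]; // always 4 slots, even for 3-channel Rgb
    for (px, w) in pixels.iter().zip(weights) {
        for c in 0..3 {
            acc[c] += px[c] * w;
        }
    }
    acc // callers would read back only the first CHANNELS entries
}
```

For Rgb only the first three entries are meaningful; the unused fourth slot is wasted work, which is the trade-off for avoiding the allocation.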

@cospectrum
Contributor

Still 4 times slower for me

@ripytide
Member Author

ripytide commented May 27, 2024

Can you push a branch with the code you are running to get that result so I can investigate your use-case?

@ripytide
Member Author

Okay, now that #654 has been merged, this comparison becomes simpler.
In the generic case (kernels other than exactly 3x3), this version reduces code duplication and is simply faster.
[image: benchmark results]
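For context, the general shape of a clamped filter over a single-channel image might look like this (a simplified, hypothetical sketch under assumed conventions, not imageproc's actual `filter_clamped`):

```rust
// Hypothetical single-channel clamped filter: the image is a flat slice in
// row-major order, and sampling coordinates outside the image are clamped
// to the nearest edge pixel. This mirrors the idea behind filter_clamped,
// not imageproc's real implementation.
fn clamped_filter_1ch(img: &[i32], w: i64, h: i64, kernel: &[i32], kw: i64, kh: i64) -> Vec<i32> {
    let mut out = vec![0i32; (w * h) as usize];
    for y in 0..h {
        for x in 0..w {
            let mut acc = 0i32;
            for ky in 0..kh {
                for kx in 0..kw {
                    // Clamp sampling coordinates to the image bounds.
                    let sy = (y + ky - kh / 2).clamp(0, h - 1);
                    let sx = (x + kx - kw / 2).clamp(0, w - 1);
                    acc += img[(sy * w + sx) as usize] * kernel[(ky * kw + kx) as usize];
                }
            }
            out[(y * w + x) as usize] = acc;
        }
    }
    out
}
```

With a 3x3 identity kernel (a single 1 in the center) this returns the input image unchanged, which is a convenient sanity check.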

@cospectrum
Contributor

cospectrum commented May 27, 2024

> Can you push a branch with the code you are running to get that result so I can investigate your use-case?

The same bench from #654 (comment) using your branch

@ripytide
Member Author

ripytide commented May 27, 2024

@cospectrum can you share your implementation of filter3x3() as well, so I can try to reproduce your results? Or are you using filter3x3_reference()?

@cospectrum
Contributor

> @cospectrum can you share your implementation of filter3x3() as well, so I can try to reproduce your results? Or are you using filter3x3_reference()?

It's not "my" implementation. It's from 0.25

@ripytide
Member Author

ripytide commented May 27, 2024

Okay, I think I've tracked down the discrepancy between the imageproc benchmarks and your external benchmarks in a separate crate.

By default I get the same results as you. But if I add the same build config as imageproc to the external crate, I get the same results as the imageproc benchmarks; I think that difference is what is causing your 4x regression.

Compile Settings

[profile.release]
opt-level = 3
debug = true

[profile.bench]
opt-level = 3
debug = true
rpath = false
lto = false
debug-assertions = false
codegen-units = 1

Benchmarked Code

#![feature(test)]

extern crate test;

use std::hint::black_box;

use test::Bencher;

const SIZE: u32 = 300;

fn main() {}

#[bench]
fn bench_025(b: &mut Bencher) {
    let image = black_box(imageproc_025::image::RgbImage::new(SIZE, SIZE));

    let kernel: Vec<i32> = (0..3 * 3).collect();

    b.iter(|| {
        let filtered = imageproc_025::filter::filter3x3::<_, _, i16>(&image, &kernel);
        black_box(filtered);
    });
}
#[bench]
fn bench_master(b: &mut Bencher) {
    let image = black_box(imageproc_master::image::RgbImage::new(SIZE, SIZE));

    let kernel: Vec<i32> = (0..3 * 3).collect();
    let kernel = imageproc_master::kernel::Kernel::new(&kernel, 3, 3);

    b.iter(|| {
        let filtered = imageproc_master::filter::filter_clamped::<_, _, i16>(&image, kernel);
        black_box(filtered);
    });
}

Results

Master No Settings

test bench_025    ... bench:   3,831,394 ns/iter (+/- 10,242)
test bench_master ... bench:   3,902,486 ns/iter (+/- 13,441)

PR No Settings

test bench_025    ... bench:   2,179,298 ns/iter (+/- 16,869)
test bench_master ... bench:   7,558,955 ns/iter (+/- 30,916)

Master With Settings

test bench_025    ... bench:   2,003,191 ns/iter (+/- 19,839)
test bench_master ... bench:   3,912,089 ns/iter (+/- 8,554)

PR With Settings

test bench_025    ... bench:   1,990,703 ns/iter (+/- 10,605)
test bench_master ... bench:   3,253,459 ns/iter (+/- 11,167)

@ripytide
Member Author

ripytide commented May 27, 2024

Now I suppose I just need to track down which specific build setting is making such a big difference, and why, in the "Master No Settings" run, there is a difference between bench_025 and bench_master.

@ripytide
Member Author

ripytide commented May 27, 2024

When using debug = false, the two implementations are now both around 3,300,000 ns/iter, but master_025 is still around 2,000,000, which doesn't make any sense to me at all.

@cospectrum
Contributor

cospectrum commented May 28, 2024

If I `cargo run --release` my bench from #654 (comment) with 0.25 against master, master has the same settings as your branch; master has the same speed as 0.25, while yours is slower (by 4x).

@theotherphil
Contributor

I'll try this PR vs the current master on my MacBook when I have time within the next few days.

@theotherphil
Contributor

@cospectrum what machine are you benchmarking on? I appear to be seeing much larger regressions on an Apple ARM chip than @ripytide is on their Ryzen x86.

@cospectrum
Contributor

> @cospectrum what machine are you benchmarking on? I appear to be seeing much larger regressions on an Apple ARM chip than @ripytide is on their Ryzen x86.

M1

@ripytide
Member Author

ripytide commented May 28, 2024

I think the setting that determines which implementation is faster (on my machine) is codegen-units:

Here I've benchmarked the default compile setting:

[profile.release]
codegen-units = 16

vs codegen-units=1 which is used by the benchmarks:

[profile.release]
codegen-units = 1

Results

test master_codegen=1 ... bench:   2,375,138 ns/iter (+/- 3,180)
test pr_codegen=1 ... bench:   1,974,065 ns/iter (+/- 6,905)
test master_codegen=16 ... bench:   2,489,427 ns/iter (+/- 10,247)
test pr_codegen=16 ... bench:   4,585,778 ns/iter (+/- 27,787)

@theotherphil
Contributor

I’ve not had much time in the last couple of weeks but I still intend to benchmark this branch against master on my MacBook.

@theotherphil
Contributor

git checkout master
git pull
cargo +nightly bench filter_clamped_rgb >> bench_master.txt

git fetch origin pull/656/head:pr656
git checkout pr656
cargo +nightly bench filter_clamped_rgb >> bench_pr656.txt

Results on master:

test filter::benches::bench_filter_clamped_rgb_3x3                                    ... bench:   1,190,189.60 ns/iter (+/- 33,991.58)
test filter::benches::bench_filter_clamped_rgb_5x5                                    ... bench:   2,903,974.95 ns/iter (+/- 10,175.80)
test filter::benches::bench_filter_clamped_rgb_7x7                                    ... bench:   5,348,075.00 ns/iter (+/- 38,132.12)

Results for this PR:

test filter::benches::bench_filter_clamped_rgb_3x3                                    ... bench:   1,762,945.85 ns/iter (+/- 57,415.80)
test filter::benches::bench_filter_clamped_rgb_5x5                                    ... bench:   3,940,433.40 ns/iter (+/- 10,009.45)
test filter::benches::bench_filter_clamped_rgb_7x7                                    ... bench:   7,041,208.30 ns/iter (+/- 22,404.16)

This still has large regressions, at least on an Apple ARM chip, so I'll close this PR.

I'll need to look at filter performance a bit anyway before the next release of this crate, to resolve #664.
