
Removed allocation from filter_pixel() #656

Closed
wants to merge 3 commits

Conversation

ripytide
Member

@ripytide ripytide commented May 27, 2024

Achieved by always computing with 4 channels, using an array of length 4, since const generic expressions are unstable.
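The idea can be sketched roughly like this (a hypothetical, simplified sketch, not the PR's actual code): a fixed-size accumulator of length 4 covers every channel count up to Rgba, so no per-pixel `Vec` allocation is needed.

```rust
// Hypothetical sketch of the approach described above: accumulate the
// weighted sum into a stack-allocated [i32; 4] (4 = max channel count,
// e.g. Rgba) instead of heap-allocating a Vec per pixel. A true
// [i32; CHANNELS] would need unstable const generic expressions.
fn weighted_sum_rgb(pixels: &[[i32; 3]], weights: &[i32]) -> [i32; 4] {
    let mut acc = [0i32; 4]; // always 4 slots, even for 3-channel Rgb
    for (px, w) in pixels.iter().zip(weights) {
        for c in 0..3 {
            acc[c] += px[c] * w;
        }
    }
    acc // callers would read back only the first CHANNELS entries
}
```

For Rgb only the first three entries are meaningful; the unused fourth slot is wasted work, which is the trade-off for avoiding the allocation.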

@cospectrum
Contributor

Still 4 times slower for me

@ripytide
Member Author

ripytide commented May 27, 2024

Can you push a branch with the code you are running to get that result so I can investigate your use-case?

@ripytide
Member Author

Okay, now that #654 has been merged, this comparison becomes simpler.
In the generic case (kernels other than exactly 3x3), this version reduces code duplication and is simply faster.
[image: benchmark results]
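For context, the general shape of a clamped filter over a single-channel image might look like this (a simplified, hypothetical sketch under assumed conventions, not imageproc's actual `filter_clamped`):

```rust
// Hypothetical single-channel clamped filter: the image is a flat slice in
// row-major order, and sampling coordinates outside the image are clamped
// to the nearest edge pixel. This mirrors the idea behind filter_clamped,
// not imageproc's real implementation.
fn clamped_filter_1ch(img: &[i32], w: i64, h: i64, kernel: &[i32], kw: i64, kh: i64) -> Vec<i32> {
    let mut out = vec![0i32; (w * h) as usize];
    for y in 0..h {
        for x in 0..w {
            let mut acc = 0i32;
            for ky in 0..kh {
                for kx in 0..kw {
                    // Clamp sampling coordinates to the image bounds.
                    let sy = (y + ky - kh / 2).clamp(0, h - 1);
                    let sx = (x + kx - kw / 2).clamp(0, w - 1);
                    acc += img[(sy * w + sx) as usize] * kernel[(ky * kw + kx) as usize];
                }
            }
            out[(y * w + x) as usize] = acc;
        }
    }
    out
}
```

With a 3x3 identity kernel (a single 1 in the center) this returns the input image unchanged, which is a convenient sanity check.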

@cospectrum
Contributor

cospectrum commented May 27, 2024

> Can you push a branch with the code you are running to get that result so I can investigate your use-case?

The same bench from #654 (comment) using your branch

@ripytide
Member Author

ripytide commented May 27, 2024

@cospectrum can you share your implementation of filter3x3() as well, so I can try to reproduce your results? Or are you using filter3x3_reference()?

@cospectrum
Contributor

> @cospectrum can you share your implementation of filter3x3() as well, so I can try to reproduce your results? Or are you using filter3x3_reference()?

It's not "my" implementation. It's from 0.25

@ripytide
Member Author

ripytide commented May 27, 2024

Okay, I think I've tracked down the discrepancy between the imageproc benchmarks and your external benchmarks in a separate crate.

By default I get the same results as you. But if I add the same build config as imageproc to the external crate, I get the same results as the imageproc benchmarks; I think that difference is what is causing your 4x regression.

Compile Settings

[profile.release]
opt-level = 3
debug = true

[profile.bench]
opt-level = 3
debug = true
rpath = false
lto = false
debug-assertions = false
codegen-units = 1

Benchmarked Code

#![feature(test)]

extern crate test;

use std::hint::black_box;

use test::Bencher;

const SIZE: u32 = 300;

fn main() {}

#[bench]
fn bench_025(b: &mut Bencher) {
    let image = black_box(imageproc_025::image::RgbImage::new(SIZE, SIZE));

    let kernel: Vec<i32> = (0..3 * 3).collect();

    b.iter(|| {
        let filtered = imageproc_025::filter::filter3x3::<_, _, i16>(&image, &kernel);
        black_box(filtered);
    });
}
#[bench]
fn bench_master(b: &mut Bencher) {
    let image = black_box(imageproc_master::image::RgbImage::new(SIZE, SIZE));

    let kernel: Vec<i32> = (0..3 * 3).collect();
    let kernel = imageproc_master::kernel::Kernel::new(&kernel, 3, 3);

    b.iter(|| {
        let filtered = imageproc_master::filter::filter_clamped::<_, _, i16>(&image, kernel);
        black_box(filtered);
    });
}

Results

Master No Settings

test bench_025    ... bench:   3,831,394 ns/iter (+/- 10,242)
test bench_master ... bench:   3,902,486 ns/iter (+/- 13,441)

PR No Settings

test bench_025    ... bench:   2,179,298 ns/iter (+/- 16,869)
test bench_master ... bench:   7,558,955 ns/iter (+/- 30,916)

Master With Settings

test bench_025    ... bench:   2,003,191 ns/iter (+/- 19,839)
test bench_master ... bench:   3,912,089 ns/iter (+/- 8,554)

PR With Settings

test bench_025    ... bench:   1,990,703 ns/iter (+/- 10,605)
test bench_master ... bench:   3,253,459 ns/iter (+/- 11,167)

@ripytide
Member Author

ripytide commented May 27, 2024

Now I suppose I just need to track down which specific build setting is making such a big difference, and why, in the "Master No Settings" run, there is a difference between bench_025 and bench_master.

@ripytide
Member Author

ripytide commented May 27, 2024

When using debug = false, the two implementations are now both around 3,300,000 ns/iter, but master_025 is still around 2,000,000, which doesn't make any sense to me at all.

@cospectrum
Contributor

cospectrum commented May 28, 2024

If I `cargo run --release` my bench from #654 (comment) with 0.25 against master, master has the same settings as your branch; master has the same speed as 0.25, while yours is slower (by 4x).

@theotherphil
Contributor

I'll try this PR vs the current master on my MacBook when I have time within the next few days.

@theotherphil
Contributor

@cospectrum what machine are you benchmarking on? I appear to be seeing much larger regressions on an Apple ARM chip than @ripytide is on their Ryzen x86.

@cospectrum
Contributor

> @cospectrum what machine are you benchmarking on? I appear to be seeing much larger regressions on an Apple ARM chip than @ripytide is on their Ryzen x86.

M1

@ripytide
Member Author

ripytide commented May 28, 2024

I think the setting that determines which implementation is faster (on my machine) is codegen-units:

Here I've benchmarked the default compile setting:

[profile.release]
codegen-units = 16

vs codegen-units=1 which is used by the benchmarks:

[profile.release]
codegen-units = 1

Results

test master_codegen=1 ... bench:   2,375,138 ns/iter (+/- 3,180)
test pr_codegen=1 ... bench:   1,974,065 ns/iter (+/- 6,905)
test master_codegen=16 ... bench:   2,489,427 ns/iter (+/- 10,247)
test pr_codegen=16 ... bench:   4,585,778 ns/iter (+/- 27,787)

@theotherphil
Contributor

I’ve not had much time in the last couple of weeks but I still intend to benchmark this branch against master on my MacBook.

@theotherphil
Contributor

git checkout master
git pull
cargo +nightly bench filter_clamped_rgb >> bench_master.txt

git fetch origin pull/656/head:pr656
git checkout pr656
cargo +nightly bench filter_clamped_rgb >> bench_pr656.txt

Results on master:

test filter::benches::bench_filter_clamped_rgb_3x3                                    ... bench:   1,190,189.60 ns/iter (+/- 33,991.58)
test filter::benches::bench_filter_clamped_rgb_5x5                                    ... bench:   2,903,974.95 ns/iter (+/- 10,175.80)
test filter::benches::bench_filter_clamped_rgb_7x7                                    ... bench:   5,348,075.00 ns/iter (+/- 38,132.12)

Results for this PR:

test filter::benches::bench_filter_clamped_rgb_3x3                                    ... bench:   1,762,945.85 ns/iter (+/- 57,415.80)
test filter::benches::bench_filter_clamped_rgb_5x5                                    ... bench:   3,940,433.40 ns/iter (+/- 10,009.45)
test filter::benches::bench_filter_clamped_rgb_7x7                                    ... bench:   7,041,208.30 ns/iter (+/- 22,404.16)

This still has large regressions, at least on an Apple ARM chip, so I'll close this PR.

I'll need to look at filter performance a bit anyway before the next release of this crate, to resolve #664.
