Performance regression materializing Selection #364

Closed
mgeplf opened this issue Jul 24, 2024 · 18 comments

mgeplf commented Jul 24, 2024

Two people independently contacted me about a large performance regression while materializing Selections.
They had to roll back to 0.1.24 to be able to materialize their selections.
It seems that operations that completed nearly instantly (when the data was cached) are now taking much longer.
I will try to make a reproducer.


mgeplf commented Jul 24, 2024

It seems that in both cases the selected IDs are distributed randomly over the complete population. Using the following script:

import libsonata
import numpy as np

np.random.seed(42)

sto = libsonata.NodeStorage('/home/gevaert/src/libsonata/sscx-nodes-sonata.h5')
pop = sto.open_population('All')

#ids = np.arange(0, 100000, 2)
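# note: this fraction was varied (0.001 and 0.01) between the timed runs shown below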
count = int(0.01*pop.size)
print(f'selecting {count} from {pop.size}')
ids = np.random.randint(0, pop.size, count)
sel = libsonata.Selection(ids)

print(np.mean(pop.get_attribute('x', sel)))

I get the following timings (note: FS caches should be hot):

v0.1.24:


# note: this is 0.001 of population size; seems fast
$ time python3 ../bench-selection.py
selecting 4234 from 4234929
3985.7162878066642
python3 ../bench-selection.py  0.28s user 0.96s system 1531% cpu 0.081 total

# note: this is 0.01 of population size; seems fast
$ time python3 ../bench-selection.py
selecting 42349 from 4234929
3973.392037563209
python3 ../bench-selection.py  0.32s user 1.05s system 1572% cpu 0.087 total

# note: this is 0.01 of population size; seems fast
$ time python3 ../bench-selection.py
selecting 42349 from 4234929
3973.392037563209
python3 ../bench-selection.py  0.30s user 1.09s system 1585% cpu 0.087 total

v0.1.25:

# note: this is 0.001 of population size; seems fast
selecting 4234 from 4234929
3985.7162878066642
python3 ../bench-selection.py  0.53s user 1.15s system 557% cpu 0.301 total

# note: this is 0.01 of population size; seems slow
$ time python3 ../bench-selection.py
selecting 42349 from 4234929
3973.392037563209
python3 ../bench-selection.py  44.39s user 1.17s system 103% cpu 44.177 total

# note: this is 0.01 of population size; seems slow
$ time python3 ../bench-selection.py
selecting 42349 from 4234929
3973.392037563209
python3 ../bench-selection.py  43.23s user 1.14s system 103% cpu 42.991 total

So it looks like the user time has gone up by ~40x


mgeplf commented Jul 24, 2024

These are the screenshots from one of the reports:

Bad: [screenshot]

Good: [screenshot]

Note that the IDs in the selection are different, but the number being pulled is the same. While this isn't an apples-to-apples comparison, I think the time difference is too large to be explained by the different IDs alone.


mgeplf commented Jul 24, 2024

git bisect is pointing me at:

git bisect start
# status: waiting for both good and bad commits
# bad: [a7d1453fadce88201d009ea3c843f562491238bf] Refactor edge_index::resolve. (#314)
git bisect bad a7d1453fadce88201d009ea3c843f562491238bf
# status: waiting for good commit(s), bad commit known
# good: [e49d49c2b844d6e8834e723ae376eae92f74c5dd] update changelog
git bisect good e49d49c2b844d6e8834e723ae376eae92f74c5dd
# good: [e88979a66bdf17304633e39b724085d3020c6be8] Simplify HighFive usage. (#304)
git bisect good e88979a66bdf17304633e39b724085d3020c6be8
# good: [5932272aeedcb89491242d3c04995bccd4ef5b85] Update HighFive to v2.8.0. (#318)
git bisect good 5932272aeedcb89491242d3c04995bccd4ef5b85
# good: [f6abd8bab6cd19da3b9fed2c73b31e73b1fe857d] update to require using cmake 3.16 (#321)
git bisect good f6abd8bab6cd19da3b9fed2c73b31e73b1fe857d
# bad: [c2e8b5eefc9b70e0245fb422d9ad6a01b70245fa] Refactor `_readSelection`. (#315)
git bisect bad c2e8b5eefc9b70e0245fb422d9ad6a01b70245fa
# good: [f65a0aec9197f6e74c8d75e8f78641326127dfce] Prevent pytest from picking up pybind11 tests. (#320)
git bisect good f65a0aec9197f6e74c8d75e8f78641326127dfce
# first bad commit: [c2e8b5eefc9b70e0245fb422d9ad6a01b70245fa] Refactor `_readSelection`. (#315)


1uc commented Jul 24, 2024

The likely issue is the following loop in HighFive:

        for (const auto& sel: selects) {
            if (sel.op == Op::None) {
                H5Sselect_none(space.getId());
            } else {
                auto error_code =
                    H5Sselect_hyperslab(space.getId(),
                                        convert(sel.op),
                                        sel.offset.empty() ? nullptr : sel.offset.data(),
                                        sel.stride.empty() ? nullptr : sel.stride.data(),
                                        sel.count.empty() ? nullptr : sel.count.data(),
                                        sel.block.empty() ? nullptr : sel.block.data());

                if (error_code < 0) {
                    HDF5ErrMapper::ToException<DataSpaceException>("Unable to select hyperslab");
                }
            }
        }

which has a runtime that is quadratic in the number of selections, because H5Sselect_hyperslab is linear in the number of selections already made. This can be optimized with a divide-and-conquer algorithm built on top of H5Scombine_select.
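
As a rough illustration (this is not HighFive's actual code): with pairwise merging, and assuming each merge costs time roughly proportional to the sizes of the two halves being combined, the total work drops from quadratic to O(n log n). In HighFive, H5Scombine_select would play the role of merge in a sketch like this:

# conceptual sketch only; `merge` stands in for H5Scombine_select
def combine(selections, merge):
    # union the selections pairwise instead of appending them one at a time
    if len(selections) == 1:
        return selections[0]
    mid = len(selections) // 2
    return merge(combine(selections[:mid], merge),
                 combine(selections[mid:], merge))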


mgeplf commented Jul 24, 2024

Nice find @1uc!
Thanks for looking into it.


mgeplf commented Jul 26, 2024

@1uc has created a change in HighFive that should fix this:
BlueBrain/HighFive#1031

I think it's worth waiting for that to be merged into HighFive and released; then we can upgrade libsonata and make a release.


mgeplf commented Jul 29, 2024

Fixed by #365

mgeplf closed this as completed Jul 29, 2024

mgeplf commented Oct 14, 2024

I've had a report that it's still slower:

libsonata 0.1.24:
Executed sample_nodes in process 179790 in 2.886 seconds
13.50user 9.57system 0:05.85elapsed 394%CPU (0avgtext+0avgdata 238960maxresident)k
0inputs+0outputs (0major+125862minor)pagefaults 0swaps

libsonata 0.1.28:
Executed sample_nodes in process 180903 in 22.635 seconds
32.54user 10.25system 0:25.54elapsed 167%CPU (0avgtext+0avgdata 493064maxresident)k
0inputs+0outputs (0major+159994minor)pagefaults 0swaps

However, I'm not sure we need to put effort into this at the moment; if we do, I'll reopen the ticket.


1uc commented Oct 14, 2024

Remember that libsonata itself didn't fix anything. The fix is in HighFive, and HighFive requires at least HDF5 1.10.x to implement it. Therefore, anything using an older version of those two dependencies will not see the improvement.


mgeplf commented Oct 14, 2024

Good point; it's the wheel version from PyPI, so the fix should be included.
In addition, the report (paraphrased) is that it's 7.4 times slower than 0.1.24, while 0.1.27 was much slower than that.


1uc commented Oct 14, 2024

Okay, so there's more to be fixed.


mgeplf commented Oct 15, 2024

I'm only seeing less than a 2x difference in my completely non-scientific benchmark:

  latest:
  5.41user 0.46system 0:03.69elapsed 159%CPU (0avgtext+0avgdata 270516maxresident)k
  4.99user 0.43system 0:03.52elapsed 153%CPU (0avgtext+0avgdata 270516maxresident)k
  5.00user 0.44system 0:03.49elapsed 156%CPU (0avgtext+0avgdata 270516maxresident)k
  4.18user 0.43system 0:03.52elapsed 130%CPU (0avgtext+0avgdata 270516maxresident)k

  0.1.24:
  2.46user 0.28system 0:00.92elapsed 298%CPU (0avgtext+0avgdata 104884maxresident)k
  2.59user 0.27system 0:00.85elapsed 335%CPU (0avgtext+0avgdata 104888maxresident)k
  3.15user 0.23system 0:00.74elapsed 457%CPU (0avgtext+0avgdata 104884maxresident)k

This is running on the headnode, with a virtualenv in the home directory, with this code:

import libsonata
import numpy as np

np.random.seed(42)

p = '/gpfs/bbp.cscs.ch/data/scratch/proj134/fullcircuit/full-synth-more-split-v2/nodes.h5'
sto = libsonata.NodeStorage(p)
pop = sto.open_population('root__neurons')

count = int(0.01*pop.size)
print(f'selecting {count} from {pop.size}')
ids = np.random.randint(0, pop.size, count)
sel = libsonata.Selection(ids)

print(np.mean(pop.get_attribute('x', sel)))


1uc commented Oct 15, 2024

So the regression is not related to any I/O, only to building the libsonata.Selection? Can pop.size be replaced with pop_size = 71_000_000 to get an equivalent benchmark?
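
For reference, a minimal sketch of what I mean, with the population size hard-coded so that no file needs to be opened and only the Selection construction is timed (the 1% sampling and the 71M size are taken from the numbers above):

import time

import numpy as np
import libsonata

np.random.seed(42)

pop_size = 71_000_000  # hard-coded instead of pop.size
count = int(0.01 * pop_size)
ids = np.random.randint(0, pop_size, count)

start = time.perf_counter()
sel = libsonata.Selection(ids)
print(f'built a Selection of {count} ids in {time.perf_counter() - start:.3f}s')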


mgeplf commented Oct 15, 2024

I'll get back to you, I just wanted to write this down.


mgeplf commented Oct 15, 2024

Ok, here are the flame graphs:

0.1.24: [flame graph]

latest: [flame graph]


1uc commented Oct 17, 2024

Using a 71M nodes.h5 we measure how long it takes to do:

pop.get_attribute('x', sel)

The selection is the union of n index ranges, each of length m and each starting at a random offset. To mimic files of different sizes, the index ranges were restricted to the first N GIDs. We vary n and N for m = 1, 10, 100, 1000.
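
As a rough sketch of how such a selection can be built (not necessarily the benchmark's exact code; it assumes libsonata.Selection accepts a list of (start, stop) ranges, with n, m and N as described above):

import numpy as np
import libsonata

def make_selection(n, m, N, seed=0):
    # n ranges of length m, each starting at a random offset within the first N GIDs
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, N - m, size=n)
    return libsonata.Selection([(int(s), int(s) + m) for s in starts])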

We measure several versions of libsonata:

  • 0.1.25-dirty refers to 0.1.24 with an updated version of pybind11.
  • 0.1.29 refers to master.
  • 0.1.29-bR_{k} refers to master with bulk reading, allowing aggregation of at most k pages.

To make it easier to see which measurements belong to which variant of libsonata, we offset them horizontally and treat block_size as a categorical variable. We measure on both an HDD and an SSD.

We created a PR for bR_1 in #374.

[plots: runtime_hdd, runtime_ssd, speedup_hdd, speedup_ssd]

@GianlucaFicarelli

Thanks @1uc, amazing improvement. I tried the new wheels on a BB5 allocation for the cells service use case (sampling 1% of 71M nodes) and I'm getting even better results than with 0.1.24.
With 4 repetitions I got these times (in seconds):

0.1.24 -> 3.111, 2.560, 2.813, 2.735
PR -> 2.493, 2.252, 1.990, 1.800


1uc commented Oct 17, 2024

@GianlucaFicarelli thank you for checking on BB5.
