Performance regression materializing Selection #364

Closed
mgeplf opened this issue Jul 24, 2024 · 18 comments

mgeplf commented Jul 24, 2024

Two people independently contacted me about a large performance regression while materializing Selections.
They had to roll back to 0.1.24 to be able to materialize their selections.
It seems that operations that completed nearly instantly (when the data was cached) are now taking much longer.
I will try to make a reproducer.


mgeplf commented Jul 24, 2024

It seems that in both cases the selected IDs are distributed randomly over the complete population. Using the following script:

import libsonata
import numpy as np

np.random.seed(42)

sto = libsonata.NodeStorage('/home/gevaert/src/libsonata/sscx-nodes-sonata.h5')
pop = sto.open_population('All')

#ids = np.arange(0, 100000, 2)
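# note: this fraction was varied (0.001 and 0.01) between the timed runs shown below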
count = int(0.01*pop.size)
print(f'selecting {count} from {pop.size}')
ids = np.random.randint(0, pop.size, count)
sel = libsonata.Selection(ids)

print(np.mean(pop.get_attribute('x', sel)))

I get the following timings (note: FS caches should be hot):

v0.1.24:


# note: this is 0.001 of population size; seems fast
$ time python3 ../bench-selection.py
selecting 4234 from 4234929
3985.7162878066642
python3 ../bench-selection.py  0.28s user 0.96s system 1531% cpu 0.081 total

# note: this is 0.01 of population size; seems fast
$ time python3 ../bench-selection.py
selecting 42349 from 4234929
3973.392037563209
python3 ../bench-selection.py  0.32s user 1.05s system 1572% cpu 0.087 total

# note: this is 0.01 of population size; seems fast
$ time python3 ../bench-selection.py
selecting 42349 from 4234929
3973.392037563209
python3 ../bench-selection.py  0.30s user 1.09s system 1585% cpu 0.087 total

v0.1.25:

# note: this is 0.001 of population size; seems fast
selecting 4234 from 4234929
3985.7162878066642
python3 ../bench-selection.py  0.53s user 1.15s system 557% cpu 0.301 total

# note: this is 0.01 of population size; seems slow
$ time python3 ../bench-selection.py
selecting 42349 from 4234929
3973.392037563209
python3 ../bench-selection.py  44.39s user 1.17s system 103% cpu 44.177 total

# note: this is 0.01 of population size; seems slow
$ time python3 ../bench-selection.py
selecting 42349 from 4234929
3973.392037563209
python3 ../bench-selection.py  43.23s user 1.14s system 103% cpu 42.991 total

So it looks like the user time has gone up by ~40x


mgeplf commented Jul 24, 2024

These are the screenshots from one of the reports:

Bad: [screenshot]

Good: [screenshot]

Note that the IDs in the selection are different, but the number being pulled is the same. While this isn't an apples-to-apples comparison, I think the time difference is too large to be explained by the different IDs alone.


mgeplf commented Jul 24, 2024

git bisect is pointing me at:

git bisect start
# status: waiting for both good and bad commits
# bad: [a7d1453fadce88201d009ea3c843f562491238bf] Refactor edge_index::resolve. (#314)
git bisect bad a7d1453fadce88201d009ea3c843f562491238bf
# status: waiting for good commit(s), bad commit known
# good: [e49d49c2b844d6e8834e723ae376eae92f74c5dd] update changelog
git bisect good e49d49c2b844d6e8834e723ae376eae92f74c5dd
# good: [e88979a66bdf17304633e39b724085d3020c6be8] Simplify HighFive usage. (#304)
git bisect good e88979a66bdf17304633e39b724085d3020c6be8
# good: [5932272aeedcb89491242d3c04995bccd4ef5b85] Update HighFive to v2.8.0. (#318)
git bisect good 5932272aeedcb89491242d3c04995bccd4ef5b85
# good: [f6abd8bab6cd19da3b9fed2c73b31e73b1fe857d] update to require using cmake 3.16 (#321)
git bisect good f6abd8bab6cd19da3b9fed2c73b31e73b1fe857d
# bad: [c2e8b5eefc9b70e0245fb422d9ad6a01b70245fa] Refactor `_readSelection`. (#315)
git bisect bad c2e8b5eefc9b70e0245fb422d9ad6a01b70245fa
# good: [f65a0aec9197f6e74c8d75e8f78641326127dfce] Prevent pytest from picking up pybind11 tests. (#320)
git bisect good f65a0aec9197f6e74c8d75e8f78641326127dfce
# first bad commit: [c2e8b5eefc9b70e0245fb422d9ad6a01b70245fa] Refactor `_readSelection`. (#315)


1uc commented Jul 24, 2024

The likely issue is the following loop in HighFive:

        for (const auto& sel: selects) {
            if (sel.op == Op::None) {
                H5Sselect_none(space.getId());
            } else {
                auto error_code =
                    H5Sselect_hyperslab(space.getId(),
                                        convert(sel.op),
                                        sel.offset.empty() ? nullptr : sel.offset.data(),
                                        sel.stride.empty() ? nullptr : sel.stride.data(),
                                        sel.count.empty() ? nullptr : sel.count.data(),
                                        sel.block.empty() ? nullptr : sel.block.data());

                if (error_code < 0) {
                    HDF5ErrMapper::ToException<DataSpaceException>("Unable to select hyperslab");
                }
            }
        }

which has a runtime that is quadratic in the number of selections, because H5Sselect_hyperslab is linear in the number of selections already made. This can be optimized with a divide-and-conquer algorithm built on top of H5Scombine_select.
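
As a rough illustration (this is not HighFive's actual code): with pairwise merging, and assuming each merge costs time roughly proportional to the sizes of the two halves being combined, the total work drops from quadratic to O(n log n). In HighFive, H5Scombine_select would play the role of merge in a sketch like this:

# conceptual sketch only; `merge` stands in for H5Scombine_select
def combine(selections, merge):
    # union the selections pairwise instead of appending them one at a time
    if len(selections) == 1:
        return selections[0]
    mid = len(selections) // 2
    return merge(combine(selections[:mid], merge),
                 combine(selections[mid:], merge))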


mgeplf commented Jul 24, 2024

Nice find @1uc!
Thanks for looking into it.


mgeplf commented Jul 26, 2024

@1uc has created a change in HighFive that should fix this:
BlueBrain/HighFive#1031

I think it's worth waiting for that to be merged into HighFive and released; then we can upgrade libsonata and make a release.


mgeplf commented Jul 29, 2024

Fixed by #365

mgeplf closed this as completed Jul 29, 2024

mgeplf commented Oct 14, 2024

I've had a report that it's still slower:

libsonata 0.1.24:
Executed sample_nodes in process 179790 in 2.886 seconds
13.50user 9.57system 0:05.85elapsed 394%CPU (0avgtext+0avgdata 238960maxresident)k
0inputs+0outputs (0major+125862minor)pagefaults 0swaps

libsonata 0.1.28:
Executed sample_nodes in process 180903 in 22.635 seconds
32.54user 10.25system 0:25.54elapsed 167%CPU (0avgtext+0avgdata 493064maxresident)k
0inputs+0outputs (0major+159994minor)pagefaults 0swaps

However, I'm not sure we need to put effort into this at the moment; if we do, I'll reopen the ticket.


1uc commented Oct 14, 2024

Remember that libsonata itself didn't fix anything. The fix is in HighFive, and HighFive requires at least HDF5 1.10.x to implement it. Therefore, anything using an older version of those two dependencies will not see the improvement.


mgeplf commented Oct 14, 2024

Good point; it's the wheel version from PyPI, so the fix should be included.
In addition, the report (paraphrased) is that it's 7.4 times slower than 0.1.24, while 0.1.27 was much slower than that.


1uc commented Oct 14, 2024

Okay, so there's more to be fixed.


mgeplf commented Oct 15, 2024

I'm only seeing less than a 2x difference in my completely non-scientific benchmark:

  latest:
  5.41user 0.46system 0:03.69elapsed 159%CPU (0avgtext+0avgdata 270516maxresident)k
  4.99user 0.43system 0:03.52elapsed 153%CPU (0avgtext+0avgdata 270516maxresident)k
  5.00user 0.44system 0:03.49elapsed 156%CPU (0avgtext+0avgdata 270516maxresident)k
  4.18user 0.43system 0:03.52elapsed 130%CPU (0avgtext+0avgdata 270516maxresident)k

  0.1.24:
  2.46user 0.28system 0:00.92elapsed 298%CPU (0avgtext+0avgdata 104884maxresident)k
  2.59user 0.27system 0:00.85elapsed 335%CPU (0avgtext+0avgdata 104888maxresident)k
  3.15user 0.23system 0:00.74elapsed 457%CPU (0avgtext+0avgdata 104884maxresident)k

This is running on the headnode, with a virtualenv in the home directory, with this code:

import libsonata
import numpy as np

np.random.seed(42)

p = '/gpfs/bbp.cscs.ch/data/scratch/proj134/fullcircuit/full-synth-more-split-v2/nodes.h5'
sto = libsonata.NodeStorage(p)
pop = sto.open_population('root__neurons')

count = int(0.01*pop.size)
print(f'selecting {count} from {pop.size}')
ids = np.random.randint(0, pop.size, count)
sel = libsonata.Selection(ids)

print(np.mean(pop.get_attribute('x', sel)))


1uc commented Oct 15, 2024

So the regression is not related to any I/O, only to building the libsonata.Selection? Can pop.size be replaced with pop_size = 71_000_000 to get an equivalent benchmark?
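
For reference, a minimal sketch of what I mean, with the population size hard-coded so that no file needs to be opened and only the Selection construction is timed (the 1% sampling and the 71M size are taken from the numbers above):

import time

import numpy as np
import libsonata

np.random.seed(42)

pop_size = 71_000_000  # hard-coded instead of pop.size
count = int(0.01 * pop_size)
ids = np.random.randint(0, pop_size, count)

start = time.perf_counter()
sel = libsonata.Selection(ids)
print(f'built a Selection of {count} ids in {time.perf_counter() - start:.3f}s')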


mgeplf commented Oct 15, 2024

I'll get back to you, I just wanted to write this down.


mgeplf commented Oct 15, 2024

Ok, here are the flame graphs:

0.1.24: [flame graph]

latest: [flame graph]


1uc commented Oct 17, 2024

Using a 71M nodes.h5 we measure how long it takes to do:

pop.get_attribute('x', sel)

The selection is the union of n index ranges, each of length m and each starting at a random offset. To mimic files of different sizes, the index ranges were restricted to the first N GIDs. We vary n and N for m = 1, 10, 100, 1000.
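
As a rough sketch of how such a selection can be built (not necessarily the benchmark's exact code; it assumes libsonata.Selection accepts a list of (start, stop) ranges, with n, m and N as described above):

import numpy as np
import libsonata

def make_selection(n, m, N, seed=0):
    # n ranges of length m, each starting at a random offset within the first N GIDs
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, N - m, size=n)
    return libsonata.Selection([(int(s), int(s) + m) for s in starts])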

We measure several versions of libsonata:

  • 0.1.25-dirty refers to 0.1.24 with an updated version of pybind11.
  • 0.1.29 refers to master.
  • 0.1.29-bR_{k} refers to master with bulk reading, allowing aggregation of at most k pages.

To make it easier to see which measurements belong to which variant of libsonata, we offset them horizontally and treat block_size as a categorical variable. We measure on both an HDD and an SSD.

We created a PR for bR_1 in #374.

[plots: runtime_hdd, runtime_ssd, speedup_hdd, speedup_ssd]

@GianlucaFicarelli

Thanks @1uc, amazing improvement. I tried the new wheels on a BB5 allocation for the cells service use case (sampling 1% of 71M nodes) and I'm getting even better results than with 0.1.24.
With 4 repetitions I got these times (in seconds):

0.1.24 -> 3.111, 2.560, 2.813, 2.735
PR -> 2.493, 2.252, 1.990, 1.800


1uc commented Oct 17, 2024

@GianlucaFicarelli thank you for checking on BB5.
