Improve MeshletMesh::from_mesh performance #13904

zeux · 2024-06-18T00:35:08Z

This change reworks find_connected_meshlets to scale more linearly with the mesh size, which significantly reduces the cost of building meshlet representations. As a small extra complexity reduction, it moves simplify_scale call out of the loop so that it's called once (it only depends on the vertex data => is safe to cache).

The new implementation of connectivity analysis builds edge=>meshlet list data structure, which allows us to only iterate through tuple_combinations of a (usually) small list. There is still some redundancy as if two meshlets share two edges, they will be represented in the meshlet lists twice, but it's overall much faster.

Since the hash traversal is non-deterministic, to keep this part of the algorithm deterministic for reproducible results we sort the output adjacency lists.

Overall this reduces the time to process bunny mesh from ~4.2s to ~1.7s when using release; in unoptimized builds the delta is even more significant.

This was tested by using #13431 and:

a) comparing the result of find_connected_meshlets using old and new code; they are equal in all steps of the clustering process
b) comparing the rendered result of the old code vs new code after making the rest of the algorithm deterministic: right now the loop that iterates through the result of group_meshlets() call executes in different order between program runs. This is orthogonal to this change and can be fixed separately.

Note: a future change can shrink the processing time further from ~1.7s to ~0.4s with a small diff but that requires an update to meshopt crate which is pending in gwihlidal/meshopt-rs#42. This change is independent.

This change reworks find_connected_meshlets to scale more linearly with the mesh size, which significantly reduces the cost of building meshlet representations. As a small extra complexity reduction, it moves simplify_scale call out of the loop so that it's called once (it only depends on the vertex data => is safe to cache). The new implementation of connectivity analysis builds edge=>meshlet list data structure, which allows us to only iterate through tuple_combinations of a (usually smaller) list. There is still some redundancy as if two meshes are sharing two edges, they will be represented in the meshlet lists twice, but it's overall much faster. Since the hash traversal is non-deterministic, to keep this part of the algorithm deterministic for reproducible results we sort the output adjacency lists. Overall this reduces the time to process bunny mesh from ~4.2s to ~1.7s when using release; in unoptimized builds the delta is even more significant.

github-actions · 2024-06-18T00:35:23Z

Welcome, new contributor!

Please make sure you've read our contributing guide and we look forward to reviewing your pull request shortly ✨

We usually just have a single element in the vector (for interior edges) or two elements (for shared edges); everything else is uncommon. SmallVec reduces the number of allocations here and makes processing a little faster.

JMS55

Thanks for the PR, and of course, meshoptimizer in the first place! Super stoked to see your involvement!

Could you please capitalize the first word of the new comments you wrote to keep it consistent with the rest of the file? Other than that and some small comments I left, PR looks good to me :)

Some optional notes, feel free to ignore them:

Technically most hashmaps are not necessary in the first place. We could use a Vec and index into it via something like meshlet_id - simplification_queue[0], since meshlet IDs are simply sequential integers. Do it if you feel like it, but also fine to leave it as-is. Similar for groups.
I'm unsure about making the entire preprocessing step deterministic in the first place. It would be great if we could, but iirc METIS has an internal RNG? Don't quote me on this. It sounds like you were able to make the whole thing fully deterministic though, which I would love a PR for if so.
Parallelizing processing clusters/groups is also an option to improve performance, via something like rayon and parallel batches. Another thing you can try experimenting with if you feel like it. There's also the option/alternative of parallelizing between MeshletMesh::from_mesh() calls when processing multiple meshes in a scene at once, when I build a higher level scene processor.
EDIT: Potentially also worth reusing collection memory instead of allocating for each iteration

crates/bevy_pbr/src/meshlet/from_mesh.rs

zeux · 2024-06-18T05:14:56Z

Thanks for review! Re: comments:

Technically most hashmaps are not necessary in the first place.

Yeah - true for everything that's meshlet id indexed; of course the edge-indexed or meshlet pair-indexed containers need to be maps. I tried to leave the code mostly as is for clarity, but it might make sense to rework some of these to Vec-based if only for deterministic traversal. I think that just affects groups? I might do a second pass over this in a separate PR to not clutter this one.

I'm unsure about making the entire preprocessing step deterministic in the first place.

I'd need to check what METIS is doing - I didn't need to change it but maybe that's just happenstance and it's initializing a global RNG to a stable seed (in which case you'd get determinism iff the asset set that's being processed is consistent between runs, which is true for the meshlets example but not generally true). Overall I prefer determinism when it's free or cheaply available, as it simplifies debugging and performance analysis, and in this change because I was changing the connectivity analysis which by itself happened to be deterministic, I tried to preserve it (the extra sorting necessary here is very cheap). I'll take a look at determinism separately, as if it's effectively free it would be nice, but not necessarily required as you suggest -- this PR just maintains status quo.

Parallelizing processing clusters/groups is also an option to improve performance.

Yeah I was focused on improving meshopt's simplification which I got requests for before, and one thing led to another. Once this is decently efficient in a single thread, multiple threads may be worthwhile to look into - I was expecting higher-level from_mesh parallelism as a side effect of asset processing eventually, but it might make sense to thread internally as well.

zeux · 2024-06-18T06:18:02Z

Just to wrap up the METIS comments, I looked into this a little... there's indeed an RNG. The good news is that the partitioning request has an optional seed; it defaults to 4321, and can be overridden using the set_option API. When the partitioning begins, the seed is set from the options - so repeat invocations of the same partition API should produce the same result unless there are other sources of non-determinism.

The bad news is that the seed is used to initialize the CRT RNG (via srand). On Windows, its state is stored in a thread-local variable so everything should be fine. On Linux, its state is stored in a global that is protected with a mutex; as such, running two METIS operations concurrently, or running METIS code concurrently with anything that uses CRT srand/rand APIs, will change the results. On macOS, the state is stored in an unsynchronized global so running two operations concurrently runs into a race condition which probably reduces the quality of the random generation somewhat.

... so this explains why I saw deterministic results in practice once hash issues were resolved, but it does mean the behavior from parallel execution will likely be unusual. I don't know if METIS has other global state besides CRT RNG to worry about, let's hope it does not.

superdump · 2024-06-18T08:29:02Z

Welcome Arseny, thanks for your contribution, for meshoptimizer, and for your Niagara renderer video series. :)

We can see how things perform with the higher-level parallelisation across processing of many assets first. I do like the optimise-single-threaded-solutions-first approach. :)

This is a followup to #13904 based on the discussion there, and switches two HashMaps that used meshlet ids as keys to Vec. In addition to a small further performance boost for `from_mesh` (1.66s => 1.60s), this makes processing deterministic modulo threading issues wrt CRT rand described in the linked PR. This is valuable for debugging, as you can visually or programmatically inspect the meshlet distribution before/after making changes that should not change the output, whereas previously every asset rebuild would change the meshlet structure. Tested with #13431; after this change, the visual output of meshlets is consistent between asset rebuilds, and the MD5 of the output GLB file does not change either, which was not the case before.

This change reworks `find_connected_meshlets` to scale more linearly with the mesh size, which significantly reduces the cost of building meshlet representations. As a small extra complexity reduction, it moves `simplify_scale` call out of the loop so that it's called once (it only depends on the vertex data => is safe to cache). The new implementation of connectivity analysis builds edge=>meshlet list data structure, which allows us to only iterate through `tuple_combinations` of a (usually) small list. There is still some redundancy as if two meshlets share two edges, they will be represented in the meshlet lists twice, but it's overall much faster. Since the hash traversal is non-deterministic, to keep this part of the algorithm deterministic for reproducible results we sort the output adjacency lists. Overall this reduces the time to process bunny mesh from ~4.2s to ~1.7s when using release; in unoptimized builds the delta is even more significant. This was tested by using #13431 and: a) comparing the result of `find_connected_meshlets` using old and new code; they are equal in all steps of the clustering process b) comparing the rendered result of the old code vs new code *after* making the rest of the algorithm deterministic: right now the loop that iterates through the result of `group_meshlets()` call executes in different order between program runs. This is orthogonal to this change and can be fixed separately. Note: a future change can shrink the processing time further from ~1.7s to ~0.4s with a small diff but that requires an update to meshopt crate which is pending in gwihlidal/meshopt-rs#42. This change is independent.

This is a followup to #13904 based on the discussion there, and switches two HashMaps that used meshlet ids as keys to Vec. In addition to a small further performance boost for `from_mesh` (1.66s => 1.60s), this makes processing deterministic modulo threading issues wrt CRT rand described in the linked PR. This is valuable for debugging, as you can visually or programmatically inspect the meshlet distribution before/after making changes that should not change the output, whereas previously every asset rebuild would change the meshlet structure. Tested with #13431; after this change, the visual output of meshlets is consistent between asset rebuilds, and the MD5 of the output GLB file does not change either, which was not the case before.

IceSentry requested a review from JMS55 June 18, 2024 00:43

IceSentry added the A-Rendering Drawing game state to the screen label Jun 18, 2024

alice-i-cecile added C-Performance A change motivated by improving speed, memory usage or compile times S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Jun 18, 2024

Use SmallVec/2 instead of Vec for edges_to_meshlets

d23b050

We usually just have a single element in the vector (for interior edges) or two elements (for shared edges); everything else is uncommon. SmallVec reduces the number of allocations here and makes processing a little faster.

JMS55 suggested changes Jun 18, 2024

View reviewed changes

crates/bevy_pbr/src/meshlet/from_mesh.rs Outdated Show resolved Hide resolved

crates/bevy_pbr/src/meshlet/from_mesh.rs Show resolved Hide resolved

crates/bevy_pbr/src/meshlet/from_mesh.rs Outdated Show resolved Hide resolved

zeux added 3 commits June 17, 2024 22:01

Capitalize comments and expand vec.last() comment

01591a9

Use unstable sort: the meshlet id is unique so we don't need stability.

7b3348a

Rename shared_counts to shared_edge_count

25f2e1b

JMS55 approved these changes Jun 18, 2024

View reviewed changes

JMS55 added this to the 0.15 milestone Jun 18, 2024

IceSentry approved these changes Jun 18, 2024

View reviewed changes

IceSentry added S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it and removed S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Jun 18, 2024

superdump approved these changes Jun 18, 2024

View reviewed changes

superdump added this pull request to the merge queue Jun 18, 2024

Merged via the queue into bevyengine:main with commit 001cc14 Jun 18, 2024
28 checks passed

zeux deleted the meshlet-opt-edges branch June 18, 2024 15:08

zeux mentioned this pull request Jun 18, 2024

Make meshlet processing deterministic #13913

Merged

JMS55 modified the milestones: 0.15, 0.14 Jun 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve MeshletMesh::from_mesh performance #13904

Improve MeshletMesh::from_mesh performance #13904

zeux commented Jun 18, 2024 •

edited

Loading

github-actions bot commented Jun 18, 2024

JMS55 left a comment •

edited

Loading

zeux commented Jun 18, 2024 •

edited

Loading

zeux commented Jun 18, 2024 •

edited

Loading

superdump commented Jun 18, 2024

Improve MeshletMesh::from_mesh performance #13904

Improve MeshletMesh::from_mesh performance #13904

Conversation

zeux commented Jun 18, 2024 • edited Loading

github-actions bot commented Jun 18, 2024

JMS55 left a comment • edited Loading

Choose a reason for hiding this comment

zeux commented Jun 18, 2024 • edited Loading

zeux commented Jun 18, 2024 • edited Loading

superdump commented Jun 18, 2024

zeux commented Jun 18, 2024 •

edited

Loading

JMS55 left a comment •

edited

Loading

zeux commented Jun 18, 2024 •

edited

Loading

zeux commented Jun 18, 2024 •

edited

Loading