Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve MeshletMesh::from_mesh performance #13904

Merged
merged 5 commits into from
Jun 18, 2024
Merged

Improve MeshletMesh::from_mesh performance #13904

merged 5 commits into from
Jun 18, 2024

Conversation

zeux
Copy link
Contributor

@zeux zeux commented Jun 18, 2024

This change reworks find_connected_meshlets to scale more linearly with the mesh size, which significantly reduces the cost of building meshlet representations. As a small extra complexity reduction, it moves simplify_scale call out of the loop so that it's called once (it only depends on the vertex data => is safe to cache).

The new implementation of connectivity analysis builds edge=>meshlet list data structure, which allows us to only iterate through tuple_combinations of a (usually) small list. There is still some redundancy as if two meshlets share two edges, they will be represented in the meshlet lists twice, but it's overall much faster.

Since the hash traversal is non-deterministic, to keep this part of the algorithm deterministic for reproducible results we sort the output adjacency lists.

Overall this reduces the time to process bunny mesh from ~4.2s to ~1.7s when using release; in unoptimized builds the delta is even more significant.

This was tested by using #13431 and:

a) comparing the result of find_connected_meshlets using old and new code; they are equal in all steps of the clustering process
b) comparing the rendered result of the old code vs new code after making the rest of the algorithm deterministic: right now the loop that iterates through the result of group_meshlets() call executes in different order between program runs. This is orthogonal to this change and can be fixed separately.

Note: a future change can shrink the processing time further from ~1.7s to ~0.4s with a small diff but that requires an update to meshopt crate which is pending in gwihlidal/meshopt-rs#42. This change is independent.

This change reworks find_connected_meshlets to scale more linearly with
the mesh size, which significantly reduces the cost of building meshlet
representations. As a small extra complexity reduction, it moves
simplify_scale call out of the loop so that it's called once (it only
depends on the vertex data => is safe to cache).

The new implementation of connectivity analysis builds edge=>meshlet
list data structure, which allows us to only iterate through
tuple_combinations of a (usually smaller) list. There is still some
redundancy as if two meshes are sharing two edges, they will be
represented in the meshlet lists twice, but it's overall much faster.

Since the hash traversal is non-deterministic, to keep this part of the
algorithm deterministic for reproducible results we sort the output
adjacency lists.

Overall this reduces the time to process bunny mesh from ~4.2s to ~1.7s
when using release; in unoptimized builds the delta is even more significant.
Copy link
Contributor

Welcome, new contributor!

Please make sure you've read our contributing guide and we look forward to reviewing your pull request shortly ✨

@IceSentry IceSentry requested a review from JMS55 June 18, 2024 00:43
@IceSentry IceSentry added the A-Rendering Drawing game state to the screen label Jun 18, 2024
@alice-i-cecile alice-i-cecile added C-Performance A change motivated by improving speed, memory usage or compile times S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Jun 18, 2024
We usually just have a single element in the vector (for interior edges)
or two elements (for shared edges); everything else is uncommon. SmallVec
reduces the number of allocations here and makes processing a little
faster.
Copy link
Contributor

@JMS55 JMS55 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, and of course, meshoptimizer in the first place! Super stoked to see your involvement!

Could you please capitalize the first word of the new comments you wrote to keep it consistent with the rest of the file? Other than that and some small comments I left, PR looks good to me :)


Some optional notes, feel free to ignore them:

  • Technically most hashmaps are not necessary in the first place. We could use a Vec and index into it via something like meshlet_id - simplification_queue[0], since meshlet IDs are simply sequential integers. Do it if you feel like it, but also fine to leave it as-is. Similar for groups.
  • I'm unsure about making the entire preprocessing step deterministic in the first place. It would be great if we could, but iirc METIS has an internal RNG? Don't quote me on this. It sounds like you were able to make the whole thing fully deterministic though, which I would love a PR for if so.
  • Parallelizing processing clusters/groups is also an option to improve performance, via something like rayon and parallel batches. Another thing you can try experimenting with if you feel like it. There's also the option/alternative of parallelizing between MeshletMesh::from_mesh() calls when processing multiple meshes in a scene at once, when I build a higher level scene processor.
  • EDIT: Potentially also worth reusing collection memory instead of allocating for each iteration

crates/bevy_pbr/src/meshlet/from_mesh.rs Outdated Show resolved Hide resolved
crates/bevy_pbr/src/meshlet/from_mesh.rs Show resolved Hide resolved
crates/bevy_pbr/src/meshlet/from_mesh.rs Outdated Show resolved Hide resolved
@JMS55 JMS55 added this to the 0.15 milestone Jun 18, 2024
@zeux
Copy link
Contributor Author

zeux commented Jun 18, 2024

Thanks for review! Re: comments:

Technically most hashmaps are not necessary in the first place.

Yeah - true for everything that's meshlet id indexed; of course the edge-indexed or meshlet pair-indexed containers need to be maps. I tried to leave the code mostly as is for clarity, but it might make sense to rework some of these to Vec-based if only for deterministic traversal. I think that just affects groups? I might do a second pass over this in a separate PR to not clutter this one.

I'm unsure about making the entire preprocessing step deterministic in the first place.

I'd need to check what METIS is doing - I didn't need to change it but maybe that's just happenstance and it's initializing a global RNG to a stable seed (in which case you'd get determinism iff the asset set that's being processed is consistent between runs, which is true for the meshlets example but not generally true). Overall I prefer determinism when it's free or cheaply available, as it simplifies debugging and performance analysis, and in this change because I was changing the connectivity analysis which by itself happened to be deterministic, I tried to preserve it (the extra sorting necessary here is very cheap). I'll take a look at determinism separately, as if it's effectively free it would be nice, but not necessarily required as you suggest -- this PR just maintains status quo.

Parallelizing processing clusters/groups is also an option to improve performance.

Yeah I was focused on improving meshopt's simplification which I got requests for before, and one thing led to another. Once this is decently efficient in a single thread, multiple threads may be worthwhile to look into - I was expecting higher-level from_mesh parallelism as a side effect of asset processing eventually, but it might make sense to thread internally as well.

@zeux
Copy link
Contributor Author

zeux commented Jun 18, 2024

Just to wrap up the METIS comments, I looked into this a little... there's indeed an RNG. The good news is that the partitioning request has an optional seed; it defaults to 4321, and can be overridden using the set_option API. When the partitioning begins, the seed is set from the options - so repeat invocations of the same partition API should produce the same result unless there are other sources of non-determinism.

The bad news is that the seed is used to initialize the CRT RNG (via srand). On Windows, its state is stored in a thread-local variable so everything should be fine. On Linux, its state is stored in a global that is protected with a mutex; as such, running two METIS operations concurrently, or running METIS code concurrently with anything that uses CRT srand/rand APIs, will change the results. On macOS, the state is stored in an unsynchronized global so running two operations concurrently runs into a race condition which probably reduces the quality of the random generation somewhat.

... so this explains why I saw deterministic results in practice once hash issues were resolved, but it does mean the behavior from parallel execution will likely be unusual. I don't know if METIS has other global state besides CRT RNG to worry about, let's hope it does not.

@IceSentry IceSentry added S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it and removed S-Needs-Review Needs reviewer attention (from anyone!) to move forward labels Jun 18, 2024
@superdump
Copy link
Contributor

Welcome Arseny, thanks for your contribution, for meshoptimizer, and for your Niagara renderer video series. :)

We can see how things perform with the higher-level parallelisation across processing of many assets first. I do like the optimise-single-threaded-solutions-first approach. :)

@superdump superdump added this pull request to the merge queue Jun 18, 2024
Merged via the queue into bevyengine:main with commit 001cc14 Jun 18, 2024
28 checks passed
@zeux zeux deleted the meshlet-opt-edges branch June 18, 2024 15:08
github-merge-queue bot pushed a commit that referenced this pull request Jun 20, 2024
This is a followup to #13904
based on the discussion there, and switches two HashMaps that used
meshlet ids as keys to Vec.

In addition to a small further performance boost for `from_mesh` (1.66s
=> 1.60s), this makes processing deterministic modulo threading issues
wrt CRT rand described in the linked PR. This is valuable for debugging,
as you can visually or programmatically inspect the meshlet distribution
before/after making changes that should not change the output, whereas
previously every asset rebuild would change the meshlet structure.

Tested with #13431; after this
change, the visual output of meshlets is consistent between asset
rebuilds, and the MD5 of the output GLB file does not change either,
which was not the case before.
@JMS55 JMS55 modified the milestones: 0.15, 0.14 Jun 27, 2024
mockersf pushed a commit that referenced this pull request Jun 27, 2024
This change reworks `find_connected_meshlets` to scale more linearly
with the mesh size, which significantly reduces the cost of building
meshlet representations. As a small extra complexity reduction, it moves
`simplify_scale` call out of the loop so that it's called once (it only
depends on the vertex data => is safe to cache).

The new implementation of connectivity analysis builds edge=>meshlet
list data structure, which allows us to only iterate through
`tuple_combinations` of a (usually) small list. There is still some
redundancy as if two meshlets share two edges, they will be represented
in the meshlet lists twice, but it's overall much faster.

Since the hash traversal is non-deterministic, to keep this part of the
algorithm deterministic for reproducible results we sort the output
adjacency lists.

Overall this reduces the time to process bunny mesh from ~4.2s to ~1.7s
when using release; in unoptimized builds the delta is even more
significant.

This was tested by using #13431
and:

a) comparing the result of `find_connected_meshlets` using old and new
code; they are equal in all steps of the clustering process
b) comparing the rendered result of the old code vs new code *after*
making the rest of the algorithm deterministic: right now the loop that
iterates through the result of `group_meshlets()` call executes in
different order between program runs. This is orthogonal to this change
and can be fixed separately.

Note: a future change can shrink the processing time further from ~1.7s
to ~0.4s with a small diff but that requires an update to meshopt crate
which is pending in gwihlidal/meshopt-rs#42.
This change is independent.
mockersf pushed a commit that referenced this pull request Jun 27, 2024
This is a followup to #13904
based on the discussion there, and switches two HashMaps that used
meshlet ids as keys to Vec.

In addition to a small further performance boost for `from_mesh` (1.66s
=> 1.60s), this makes processing deterministic modulo threading issues
wrt CRT rand described in the linked PR. This is valuable for debugging,
as you can visually or programmatically inspect the meshlet distribution
before/after making changes that should not change the output, whereas
previously every asset rebuild would change the meshlet structure.

Tested with #13431; after this
change, the visual output of meshlets is consistent between asset
rebuilds, and the MD5 of the output GLB file does not change either,
which was not the case before.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Rendering Drawing game state to the screen C-Performance A change motivated by improving speed, memory usage or compile times S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants