Improve MeshletMesh::from_mesh performance #13904
Conversation
This change reworks `find_connected_meshlets` to scale more linearly with the mesh size, which significantly reduces the cost of building meshlet representations. As a small extra complexity reduction, it moves the `simplify_scale` call out of the loop so that it's called once (it only depends on the vertex data => is safe to cache). The new implementation of connectivity analysis builds an edge => meshlet list data structure, which allows us to only iterate through `tuple_combinations` of a (usually) small list. There is still some redundancy: if two meshlets share two edges, they will be represented in the meshlet lists twice, but it's overall much faster. Since the hash traversal is non-deterministic, to keep this part of the algorithm deterministic for reproducible results we sort the output adjacency lists. Overall this reduces the time to process the bunny mesh from ~4.2s to ~1.7s in release builds; in unoptimized builds the delta is even more significant.
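The connectivity analysis described above can be sketched as follows. This is a minimal illustration of the technique, not Bevy's actual code: the input shape (`meshlets` as lists of index triangles) is a hypothetical stand-in, and plain nested loops replace itertools' `tuple_combinations`.

```rust
use std::collections::HashMap;

// Sketch: for each meshlet, compute a sorted list of (neighbor_id, shared_edge_count).
// Input shape is hypothetical: each meshlet is a list of triangles (vertex index triples).
fn find_connected_meshlets(meshlets: &[Vec<[u32; 3]>]) -> Vec<Vec<(usize, usize)>> {
    // Step 1: map each undirected edge to the list of meshlets containing it.
    let mut edge_to_meshlets: HashMap<(u32, u32), Vec<usize>> = HashMap::new();
    for (meshlet_id, triangles) in meshlets.iter().enumerate() {
        for tri in triangles {
            for k in 0..3 {
                let (a, b) = (tri[k], tri[(k + 1) % 3]);
                // Canonicalize edge direction so (a, b) and (b, a) collide.
                let edge = (a.min(b), a.max(b));
                let list = edge_to_meshlets.entry(edge).or_default();
                // Skip duplicate entries when the same meshlet repeats an edge
                // (meshlets are visited in order, so a repeat is always last).
                if list.last() != Some(&meshlet_id) {
                    list.push(meshlet_id);
                }
            }
        }
    }

    // Step 2: every pair of meshlets sharing an edge is connected. The per-edge
    // lists are tiny (usually 1-2 entries), so the pairwise loop is cheap; the
    // real code uses itertools' tuple_combinations here.
    let mut shared_counts: HashMap<(usize, usize), usize> = HashMap::new();
    for list in edge_to_meshlets.values() {
        for i in 0..list.len() {
            for j in (i + 1)..list.len() {
                let pair = (list[i].min(list[j]), list[i].max(list[j]));
                *shared_counts.entry(pair).or_default() += 1;
            }
        }
    }

    // Step 3: expand into per-meshlet adjacency lists, sorted for determinism
    // (hash map iteration order is not stable across runs).
    let mut adjacency = vec![Vec::new(); meshlets.len()];
    for (&(a, b), &count) in &shared_counts {
        adjacency[a].push((b, count));
        adjacency[b].push((a, count));
    }
    for list in &mut adjacency {
        list.sort_unstable();
    }
    adjacency
}
```

The final sort is what makes the output reproducible despite the hash maps; the per-edge lists stay small because an interior edge belongs to one meshlet and a boundary edge typically to two.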
Welcome, new contributor! Please make sure you've read our contributing guide; we look forward to reviewing your pull request shortly ✨
We usually just have a single element in the vector (for interior edges) or two elements (for shared edges); everything else is uncommon. SmallVec reduces the number of allocations here and makes processing a little faster.
Thanks for the PR, and of course, meshoptimizer in the first place! Super stoked to see your involvement!
Could you please capitalize the first word of the new comments you wrote to keep it consistent with the rest of the file? Other than that and some small comments I left, PR looks good to me :)
Some optional notes, feel free to ignore them:
- Technically most hashmaps are not necessary in the first place. We could use a Vec and index into it via something like `meshlet_id - simplification_queue[0]`, since meshlet IDs are simply sequential integers. Do it if you feel like it, but it's also fine to leave as-is. Similar for groups.
- I'm unsure about making the entire preprocessing step deterministic in the first place. It would be great if we could, but iirc METIS has an internal RNG? Don't quote me on this. It sounds like you were able to make the whole thing fully deterministic though, which I would love a PR for if so.
- Parallelizing the processing of clusters/groups is also an option to improve performance, via something like rayon and parallel batches. Another thing you can try experimenting with if you feel like it. There's also the option/alternative of parallelizing between `MeshletMesh::from_mesh()` calls when processing multiple meshes in a scene at once, when I build a higher-level scene processor.
- EDIT: Potentially also worth reusing collection memory instead of allocating for each iteration.
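The Vec-indexing suggestion in the first bullet can be sketched like this. The names (`count_per_meshlet`, `simplification_queue`) are hypothetical stand-ins; the point is that sequential IDs let a Vec indexed by offset replace a HashMap keyed by meshlet ID, with the bonus that iteration order becomes deterministic.

```rust
// Sketch of the suggested HashMap -> Vec replacement. Meshlet IDs in the
// queue are sequential integers, so a Vec indexed by the offset from the
// first ID can stand in for a HashMap keyed by meshlet ID.
fn count_per_meshlet(simplification_queue: &[usize]) -> Vec<u32> {
    let first_id = simplification_queue[0];
    let mut per_meshlet_data = vec![0u32; simplification_queue.len()];
    for &meshlet_id in simplification_queue {
        // meshlet_id - first_id maps the sequential IDs onto 0..len.
        per_meshlet_data[meshlet_id - first_id] += 1;
    }
    per_meshlet_data
}
```

Besides skipping hashing entirely, iterating the Vec visits meshlets in ID order every run, unlike HashMap traversal.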
Thanks for the review! Re: comments:
Yeah, that's true for everything that's meshlet-id indexed; of course the edge-indexed or meshlet-pair-indexed containers need to be maps. I tried to leave the code mostly as-is for clarity, but it might make sense to rework some of these to be Vec-based, if only for deterministic traversal. I think that just affects groups? I might do a second pass over this in a separate PR to not clutter this one.
I'd need to check what METIS is doing - I didn't need to change it, but maybe that's just happenstance and it's initializing a global RNG with a stable seed (in which case you'd get determinism iff the asset set being processed is consistent between runs, which is true for the meshlets example but not generally true). Overall I prefer determinism when it's free or cheaply available, as it simplifies debugging and performance analysis; since this change touches the connectivity analysis, which by itself happened to be deterministic, I tried to preserve that (the extra sorting necessary here is very cheap). I'll take a look at determinism separately - if it's effectively free it would be nice, though not necessarily required as you suggest; this PR just maintains the status quo.
Yeah, I was focused on improving meshopt's simplification, which I got requests for before, and one thing led to another. Once this is decently efficient in a single thread, multiple threads may be worth looking into - I was expecting higher-level from_mesh parallelism as a side effect of asset processing eventually, but it might make sense to thread internally as well.
Just to wrap up the METIS comments, I looked into this a little... there's indeed an RNG. The good news is that the partitioning request has an optional seed; it defaults to 4321 and can be overridden via an option. The bad news is that the seed is used to initialize the CRT RNG, so this explains why I saw deterministic results in practice once the hash issues were resolved, but it does mean the behavior under parallel execution will likely be unusual. I don't know if METIS has other global state besides the CRT RNG to worry about; let's hope it does not.
Welcome Arseny, and thanks for your contribution, for meshoptimizer, and for your Niagara renderer video series. :) We can see how things perform with the higher-level parallelisation across processing of many assets first. I do like the optimise-single-threaded-solutions-first approach. :)
This is a followup to #13904 based on the discussion there, and switches two HashMaps that used meshlet ids as keys to Vec. In addition to a small further performance boost for `from_mesh` (1.66s => 1.60s), this makes processing deterministic modulo threading issues wrt CRT rand described in the linked PR. This is valuable for debugging, as you can visually or programmatically inspect the meshlet distribution before/after making changes that should not change the output, whereas previously every asset rebuild would change the meshlet structure. Tested with #13431; after this change, the visual output of meshlets is consistent between asset rebuilds, and the MD5 of the output GLB file does not change either, which was not the case before.
This change reworks `find_connected_meshlets` to scale more linearly with the mesh size, which significantly reduces the cost of building meshlet representations. As a small extra complexity reduction, it moves the `simplify_scale` call out of the loop so that it's called once (it only depends on the vertex data => is safe to cache). The new implementation of connectivity analysis builds an edge => meshlet list data structure, which allows us to only iterate through `tuple_combinations` of a (usually) small list. There is still some redundancy: if two meshlets share two edges, they will be represented in the meshlet lists twice, but it's overall much faster. Since the hash traversal is non-deterministic, to keep this part of the algorithm deterministic for reproducible results we sort the output adjacency lists.

Overall this reduces the time to process the bunny mesh from ~4.2s to ~1.7s in release builds; in unoptimized builds the delta is even more significant.

This was tested by using #13431 and:
a) comparing the result of `find_connected_meshlets` using old and new code; they are equal in all steps of the clustering process
b) comparing the rendered result of the old code vs new code *after* making the rest of the algorithm deterministic: right now the loop that iterates through the result of the `group_meshlets()` call executes in a different order between program runs. This is orthogonal to this change and can be fixed separately.

Note: a future change can shrink the processing time further from ~1.7s to ~0.4s with a small diff, but that requires an update to the meshopt crate which is pending in gwihlidal/meshopt-rs#42. This change is independent.
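The "call `simplify_scale` once and cache it" point is a plain loop-invariant hoist. A minimal sketch, assuming hypothetical shapes: `simplify_scale` here is a stand-in (largest absolute coordinate as a scale proxy), not meshopt's actual function, and the per-LOD use of the scale is invented for illustration.

```rust
// Stand-in for a scale computation that depends only on the vertex data,
// so its result is safe to compute once and reuse across loop iterations.
fn simplify_scale(vertices: &[[f32; 3]]) -> f32 {
    // Hypothetical proxy: largest absolute coordinate in the mesh.
    vertices
        .iter()
        .flatten()
        .fold(0.0f32, |acc, &v| acc.max(v.abs()))
}

fn process(vertices: &[[f32; 3]], lod_levels: usize) -> Vec<f32> {
    // Hoisted out of the loop: computed once instead of once per LOD level.
    let scale = simplify_scale(vertices);
    (0..lod_levels)
        .map(|level| scale * 0.5f32.powi(level as i32)) // hypothetical per-LOD use
        .collect()
}
```

Before the change, the equivalent of `simplify_scale` ran inside the loop body on every iteration, paying the full vertex scan each time for an identical result.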