
Investigate why NUMA interleaving isn't reliable on Threadripper. #15

Mysticial opened this issue Oct 14, 2018 · 2 comments


For reference: https://www.overclockers.com/forums/showthread.php/792068-Marathon-Season-VII-October-y-cruncher-Pi-1b?p=8090577&viewfull=1#post8090577

And my response: https://www.overclockers.com/forums/showthread.php/792068-Marathon-Season-VII-October-y-cruncher-Pi-1b?p=8090579&viewfull=1#post8090579

Something is weird with the 1st and last runs.

It's very subtle, but if you look at the lines: "Working Memory... 5.06 GiB (locked, spread: 50%/2)"

...

It's very well distributed in the 2nd and 3rd runs, but poorly distributed in the 1st and 4th runs.

I can't explain why the 1st and 4th runs are so poor. The program tries its best to evenly spread out the memory, but this isn't always possible if one or more of the nodes is out of memory.

The curious thing here is that it's either 100% or 50%. Those correspond to perfect distribution and a 3-to-1 distribution across the nodes (3x more memory on one node than on the other).

This seems too "round" to be a coincidence. Running out of memory on one node wouldn't explain this.

I've never observed this on my dated quad-Opteron. And unfortunately, I do not have access to a Threadripper system, so this might take a while to track down.


Mysticial commented Nov 19, 2018

I had a discussion with Oliver Kruse. And while he wasn't able to reproduce it with a 1950X, he did bring up a point which seems to be the likely cause of this on the 2990WX. So huge thanks to him!

The screenshots on the forum post show that Windows (and thus y-cruncher) reads the hardware as 4 NUMA nodes despite there being only 2 memory domains.

Windows uses the CPU topology to define nodes. And since the 2990WX has 4 dies, it is reported as 4 nodes: 2 of them have memory, the other 2 don't.

y-cruncher reads the hardware as having 4 NUMA nodes and attempts to allocate memory evenly across the 4 nodes. These allocations are done using VirtualAllocExNuma().

However, 2 of the nodes have no memory. VirtualAllocExNuma() cannot honor the nndPreferred parameter when it points at one of those memory-less nodes. So instead, it silently (and seemingly randomly) binds the memory to one of the two nodes that do have memory.

If the allocations intended for the two empty nodes get bound to different memory nodes, the distribution will be perfect (100%/2). If they both get bound to the same node, the memory distribution will be 3-to-1, thus giving the 50%/2.
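
To make the failure mode concrete, here is a minimal sketch (not y-cruncher's code, just an illustration under the assumptions above) that allocates a buffer with each reported node as nndPreferred and then uses QueryWorkingSetEx() to check which node actually backs the pages. On a 2990WX, the buffers preferred to the two memory-less nodes would be expected to land on one of the two memory nodes:

```cpp
// Sketch only: allocate one buffer "preferred" to each NUMA node that Windows
// reports, then ask where the pages actually landed. Link with psapi.lib.
#include <windows.h>
#include <psapi.h>
#include <cstdio>
#include <cstring>

int main() {
    ULONG highest_node = 0;
    GetNumaHighestNodeNumber(&highest_node);

    const SIZE_T size = 64 * 1024 * 1024;   // 64 MiB per node

    for (ULONG node = 0; node <= highest_node; node++) {
        // nndPreferred is only a preference. If the node has no memory,
        // Windows silently backs the allocation from some other node.
        void* buffer = VirtualAllocExNuma(GetCurrentProcess(), nullptr, size,
                                          MEM_RESERVE | MEM_COMMIT,
                                          PAGE_READWRITE, node);
        if (buffer == nullptr) {
            printf("Node %lu: allocation failed (error %lu)\n",
                   node, GetLastError());
            continue;
        }

        // Physical pages are assigned on first touch, so fault them in.
        memset(buffer, 1, size);

        // Query the first page to see which node actually backs it.
        PSAPI_WORKING_SET_EX_INFORMATION info = {};
        info.VirtualAddress = buffer;
        if (QueryWorkingSetEx(GetCurrentProcess(), &info, sizeof(info)) &&
            info.VirtualAttributes.Valid) {
            printf("Preferred node %lu -> backed by node %u\n",
                   node, (unsigned)info.VirtualAttributes.Node);
        }

        VirtualFree(buffer, 0, MEM_RELEASE);
    }
    return 0;
}
```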


The solution is to exclude NUMA nodes that don't have memory. This should take care of Threadripper and other similar cases. But it won't solve the more general case of heterogeneous systems.

This is easy to do on Windows. But Linux will take some more investigation.
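
For the Windows side, the check could look something like the sketch below (my own illustration, not necessarily what y-cruncher does): enumerate nodes up to GetNumaHighestNodeNumber() and keep only those for which GetNumaAvailableMemoryNodeEx() reports a non-zero amount of memory.

```cpp
// Sketch only: build the list of NUMA nodes that actually have memory.
// Assumes "0 available bytes" means the node has no memory attached.
#include <windows.h>
#include <cstdio>
#include <vector>

std::vector<USHORT> nodes_with_memory() {
    std::vector<USHORT> nodes;

    ULONG highest_node = 0;
    if (!GetNumaHighestNodeNumber(&highest_node)) {
        return nodes;
    }

    for (ULONG node = 0; node <= highest_node; node++) {
        ULONGLONG available_bytes = 0;
        if (GetNumaAvailableMemoryNodeEx((USHORT)node, &available_bytes) &&
            available_bytes > 0) {
            nodes.push_back((USHORT)node);
        }
    }
    return nodes;
}

int main() {
    for (USHORT node : nodes_with_memory()) {
        printf("Node %u has memory\n", (unsigned)node);
    }
    return 0;
}
```

On Linux, libnuma's numa_node_size64() exposes per-node memory sizes and might serve a similar purpose, but that is part of what still needs investigation.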


A temporary work-around is to manually select the NUMA nodes in the memory allocator. You will need to experiment to see which 2 of the 4 nodes are the ones with memory.

@Mysticial

I rolled out v0.7.6.9488 yesterday, which disregards NUMA nodes that have no memory.

This has been tested on Windows using an artificial environment. If the cause of the bug is as described above, then this should be fixed for Windows.

On Linux, the fix remains completely untested, so I'm less confident that it works there.
