
Investigate why NUMA interleaving isn't reliable on Threadripper. #15

Mysticial opened this issue Oct 14, 2018 · 2 comments


For reference: https://www.overclockers.com/forums/showthread.php/792068-Marathon-Season-VII-October-y-cruncher-Pi-1b?p=8090577&viewfull=1#post8090577

And my response: https://www.overclockers.com/forums/showthread.php/792068-Marathon-Season-VII-October-y-cruncher-Pi-1b?p=8090579&viewfull=1#post8090579

Something is weird with the 1st and last runs.

It's very subtle, but if you look at the lines: "Working Memory... 5.06 GiB (locked, spread: 50%/2)"

...

It's very well distributed in the 2nd and 3rd runs, but poorly distributed in the 1st and 4th runs.

I can't explain why the 1st and 4th runs are so poor. The program tries its best to evenly spread out the memory, but this isn't always possible if one or more of the nodes is out of memory.

The curious thing here is that it's either 100% or 50%. Those correspond to perfect distribution and a 3-to-1 distribution across the nodes (3x more memory on one node than on the other).

This seems too "round" to be a coincidence. Running out of memory on one node wouldn't explain this.

I've never observed this on my dated quad-Opteron. And unfortunately, I do not have access to a Threadripper system, so this might take a while to track down.


Mysticial commented Nov 19, 2018

I had a discussion with Oliver Kruse. And while he wasn't able to reproduce it with a 1950X, he did bring up a point which seems to be the likely cause of this on the 2990WX. So huge thanks to him!

The screenshots on the forum post show that Windows (and thus y-cruncher) reads the hardware as 4 NUMA nodes despite there being only 2 memory domains.

Windows uses the CPU topology to define nodes. And since the 2990WX has 4 dies, it is reported as 4 nodes: 2 of them have memory, the other 2 don't.

y-cruncher reads the hardware as having 4 NUMA nodes and attempts to allocate memory evenly across the 4 nodes. These allocations are done using VirtualAllocExNuma().

However, 2 of the nodes have no memory. VirtualAllocExNuma() cannot honor the nndPreferred parameter when it points at one of those memory-less nodes. So instead, it silently (and seemingly randomly) binds the memory to one of the two nodes that do have memory.

If the allocations intended for the two empty nodes get bound to different memory nodes, the distribution will be perfect (100%/2). If they both get bound to the same node, the memory distribution will be 3-to-1, thus giving the 50%/2.
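
To make the failure mode concrete, here is a minimal sketch (not y-cruncher's code, just an illustration under the assumptions above) that allocates a buffer with each reported node as nndPreferred and then uses QueryWorkingSetEx() to check which node actually backs the pages. On a 2990WX, the buffers preferred to the two memory-less nodes would be expected to land on one of the two memory nodes:

```cpp
// Sketch only: allocate one buffer "preferred" to each NUMA node that Windows
// reports, then ask where the pages actually landed. Link with psapi.lib.
#include <windows.h>
#include <psapi.h>
#include <cstdio>
#include <cstring>

int main() {
    ULONG highest_node = 0;
    GetNumaHighestNodeNumber(&highest_node);

    const SIZE_T size = 64 * 1024 * 1024;   // 64 MiB per node

    for (ULONG node = 0; node <= highest_node; node++) {
        // nndPreferred is only a preference. If the node has no memory,
        // Windows silently backs the allocation from some other node.
        void* buffer = VirtualAllocExNuma(GetCurrentProcess(), nullptr, size,
                                          MEM_RESERVE | MEM_COMMIT,
                                          PAGE_READWRITE, node);
        if (buffer == nullptr) {
            printf("Node %lu: allocation failed (error %lu)\n",
                   node, GetLastError());
            continue;
        }

        // Physical pages are assigned on first touch, so fault them in.
        memset(buffer, 1, size);

        // Query the first page to see which node actually backs it.
        PSAPI_WORKING_SET_EX_INFORMATION info = {};
        info.VirtualAddress = buffer;
        if (QueryWorkingSetEx(GetCurrentProcess(), &info, sizeof(info)) &&
            info.VirtualAttributes.Valid) {
            printf("Preferred node %lu -> backed by node %u\n",
                   node, (unsigned)info.VirtualAttributes.Node);
        }

        VirtualFree(buffer, 0, MEM_RELEASE);
    }
    return 0;
}
```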


The solution is to exclude NUMA nodes that don't have memory. This should take care of Threadripper and other similar cases. But it won't solve the more general case of heterogeneous systems.

This is easy to do on Windows. But Linux will take some more investigation.
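
For the Windows side, the check could look something like the sketch below (my own illustration, not necessarily what y-cruncher does): enumerate nodes up to GetNumaHighestNodeNumber() and keep only those for which GetNumaAvailableMemoryNodeEx() reports a non-zero amount of memory.

```cpp
// Sketch only: build the list of NUMA nodes that actually have memory.
// Assumes "0 available bytes" means the node has no memory attached.
#include <windows.h>
#include <cstdio>
#include <vector>

std::vector<USHORT> nodes_with_memory() {
    std::vector<USHORT> nodes;

    ULONG highest_node = 0;
    if (!GetNumaHighestNodeNumber(&highest_node)) {
        return nodes;
    }

    for (ULONG node = 0; node <= highest_node; node++) {
        ULONGLONG available_bytes = 0;
        if (GetNumaAvailableMemoryNodeEx((USHORT)node, &available_bytes) &&
            available_bytes > 0) {
            nodes.push_back((USHORT)node);
        }
    }
    return nodes;
}

int main() {
    for (USHORT node : nodes_with_memory()) {
        printf("Node %u has memory\n", (unsigned)node);
    }
    return 0;
}
```

On Linux, libnuma's numa_node_size64() exposes per-node memory sizes and might serve a similar purpose, but that is part of what still needs investigation.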


A temporary work-around is to manually select the NUMA nodes in the memory allocator. You will need to experiment to see which 2 of the 4 nodes are the ones with memory.

@Mysticial

I rolled out v0.7.6.9488 yesterday, which disregards NUMA nodes that have no memory.

This has been tested on Windows using an artificial environment. If the cause of the bug is as described above, then this should be fixed for Windows.

On Linux, the fix remains completely untested, so I'm less confident that it works there.
