
Low performance for the AMD EPYC 7351P #91

Closed
zeno39 opened this issue Jun 28, 2019 · 20 comments
Labels: invalid (This doesn't seem right)

Comments

zeno39 commented Jun 28, 2019

Hi tevador, I have a question. I have an AMD EPYC 7351P (16 cores / 32 threads, 8 memory channels).
When I launch the RandomX benchmark I get only about 4000 H/s, 5000 at most.
I have 256 GB of DDR4, so I don't think memory is the problem. Do you have any idea? My configuration is probably not right.

tevador (Owner) commented Jun 28, 2019

Please post the full command line you use for testing.

zeno39 (Author) commented Jun 28, 2019

randomx-benchmark.exe --mine --init 16 --threads 32 --nonces 100000 --largePages --jit

@SChernykh (Collaborator)

EPYC CPUs have 4 NUMA nodes per socket, IIRC. You need to run 4 benchmark instances, each with 8 threads and each assigned to its corresponding NUMA node.

zeno39 (Author) commented Jun 28, 2019

Assigned where?

@SChernykh (Collaborator)

numactl on Linux; I don't know how it's done on Windows.
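For example (a sketch on my part, not from this thread), binding one benchmark instance to NUMA node 0 on Linux would look something like this, assuming the Linux binary is named ./randomx-benchmark:

# bind both the threads and the memory of one instance to node 0; repeat with 1, 2, 3 for the other nodes
numactl --cpunodebind=0 --membind=0 ./randomx-benchmark --mine --init 8 --threads 8 --nonces 10000 --largePages --jit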

zeno39 (Author) commented Jun 28, 2019

4 cores and 8 threads per instance?

tevador (Owner) commented Jun 28, 2019

Your CPU consists of 4 NUMA nodes, each node being 4 cores.

Unfortunately, the benchmark doesn't support running in NUMA mode at the moment (see issue #22), but you can estimate the performance by running only 1 node and multiplying by 4:

randomx-benchmark.exe --mine --init 4 --threads 4 --affinity 170 --nonces 10000 --largePages --jit

This may not give optimal performance either; it depends on how Windows allocates the memory.
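(Side note, my reading rather than anything stated in the thread: --affinity appears to take a decimal CPU bitmask, so 170 = 128 + 32 + 8 + 2 = binary 10101010, which selects logical CPUs 1, 3, 5 and 7, i.e. one SMT thread per physical core of the first node, assuming Windows numbers SMT siblings adjacently.)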

zeno39 (Author) commented Jun 28, 2019

I get just 681 h/s.

@SChernykh (Collaborator)

Try --affinity 255 and everything else the same.

tevador (Owner) commented Jun 28, 2019

Sorry, should be --affinity 170.

zeno39 (Author) commented Jun 28, 2019

1000 h/s

zeno39 (Author) commented Jun 28, 2019

--seed maybe? And I have 32 threads and 64 MB, so why am I using just 4 threads per launcher?

tevador (Owner) commented Jun 28, 2019

You can try --threads 8 --affinity 255

zeno39 (Author) commented Jun 28, 2019

1345

zeno39 (Author) commented Jun 28, 2019

I think I have to wait for the adapted scripts.

mistfpga commented Jul 3, 2019

To force an application onto a specific NUMA node on Windows Server / Windows 10 (I think; I haven't double-checked, but NUMA support has been part of Windows since Windows 7):

The most important part is to read the example of how to use Coreinfo.

The article also references SSAS, but don't worry about it; the Windows System Resource Manager (WSRM) and the hyper-x stuff in that article are not relevant.

https://techcommunity.microsoft.com/t5/DataCAT/Forcing-NUMA-Node-affinity-for-Analysis-Services-Tabular/ba-p/305188

Coreinfo - https://docs.microsoft.com/en-gb/sysinternals/downloads/coreinfo

It shows which cores are assigned to which NUMA nodes and roughly what it costs each CPU to access each memory bank. The source code might still be available from before Microsoft acquired Sysinternals.

Once you have done that, apply this hotfix if it applies to you, or skip this step if it doesn't:
https://support.microsoft.com/en-gb/help/2028687/you-cannot-specify-a-numa-node-when-you-create-a-process-by-using-the

Now you can run one benchmark instance per node with something like:

start /NODE [n] "Numa" cmd /k benchmark.exe --mine --jit --largePages --init 8 --threads 8 --nonces 10000

for each NUMA node you have. Also adjust --init and --threads for your hardware.
The /k parameter means the cmd prompt will stay open even after the benchmark has finished.

That might be all you need, but you will probably have to set the affinity too. You can work out which cores are on which nodes from Coreinfo.

Then use a command like:

start /AFFINITY [h] /NODE [n] "Numa [n]" cmd /k benchmark.exe --mine --jit --largePages --init 8 --threads 8 --nonces 10000

again for each node you want, where [h] is the hex affinity mask and [n] is the NUMA node.

The affinity hex mask is not a direct CPU number. To cover the full range of possible combinations on my 2c/4t processor, 0 to F are the acceptable values for the parameter; that represents the range 0001 to 1111, where 0001 = 1 core / 1 thread and 1111 = 2 cores / 4 threads.

Now you can put it all in a batch file with one line per node (see the sketch below).

Note: you might need to fiddle with your large pages setup too; not 100% sure on that yet.
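A sketch of what that batch file might look like on a 4-node EPYC, following the examples above (the benchmark.exe name is carried over from them, and the FF mask assumes the affinity mask is node-relative when /NODE is given, as the start documentation describes; check your Coreinfo output before relying on it):

:: hypothetical batch file: one benchmark instance per NUMA node
:: FF selects all 8 logical CPUs of the given node
start "Node 0" /NODE 0 /AFFINITY FF cmd /k benchmark.exe --mine --jit --largePages --init 8 --threads 8 --nonces 10000
start "Node 1" /NODE 1 /AFFINITY FF cmd /k benchmark.exe --mine --jit --largePages --init 8 --threads 8 --nonces 10000
start "Node 2" /NODE 2 /AFFINITY FF cmd /k benchmark.exe --mine --jit --largePages --init 8 --threads 8 --nonces 10000
start "Node 3" /NODE 3 /AFFINITY FF cmd /k benchmark.exe --mine --jit --largePages --init 8 --threads 8 --nonces 10000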

Jabroni commented Jul 12, 2019

Here are my benchmark results for a 7351P.

OS: Proxmox 5.3
Memory: 8× 16 GB 2400 MHz (so all slots are used)

Ran directly on the hypervisor OS.

sudo sysctl -w vm.nr_hugepages=4800
seq 0 3 | xargs -P 0 -I node numactl -N node ./randomx-benchmark --mine --largePages --jit --nonces 100000 --init 8 --threads 8
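To unpack that one-liner (my reading of it): the sysctl line reserves 4800 × 2 MB = 9.6 GB of huge pages (assuming the default 2 MB huge page size), enough for four ~2 GB RandomX datasets plus working memory; seq 0 3 emits the node indices 0 through 3; xargs -P 0 launches all four instances in parallel; and numactl -N node binds each instance's threads to its node. Adding -m node, which was not part of the original command, would also pin each instance's memory to the same node:

seq 0 3 | xargs -P 0 -I node numactl -N node -m node ./randomx-benchmark --mine --largePages --jit --nonces 100000 --init 8 --threads 8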

My results were

Calculated result: d6660144e9a2e68bf47d7cc8afc206672e72f82dfff69fe0d974531e85f7504f
Performance: 2645.11 hashes per second
Calculated result: d6660144e9a2e68bf47d7cc8afc206672e72f82dfff69fe0d974531e85f7504f
Performance: 2646.72 hashes per second
Calculated result: d6660144e9a2e68bf47d7cc8afc206672e72f82dfff69fe0d974531e85f7504f
Performance: 2637.67 hashes per second
Calculated result: d6660144e9a2e68bf47d7cc8afc206672e72f82dfff69fe0d974531e85f7504f
Performance: 2173.49 hashes per second

So that's ~10,100 H/s if I sum the results of the NUMA jobs.

@russoj88

Any idea why the fourth node had worse performance?

Jabroni commented Jul 22, 2019

> Any idea why the fourth node had worse performance?

Could be because I had some VMs and Docker containers running while doing these benchmarks. It's my home server, so their load is low, but that could account for the difference.

@russoj88

Thanks for the reply. I assume measuring the performance of one node and then multiplying by the node count is accurate enough, but I want to make sure.

tevador added the invalid label on Aug 30, 2019
tevador closed this as completed on Sep 27, 2019