Testing performance #5
Xeon (Skylake, IBRS) @ 2.1 GHz (Ubuntu 18.04 LTS)

Intel Core i7-7820X @ 3.6 GHz (Windows 10). Built all binaries (release build) myself.

From precompiled binaries.

Intel Core i7-7820X @ 3.6 GHz (Windows 10). Rebuilt all binaries using profile-guided optimizations: 6.4% faster.
Edit: here are the precompiled binaries on the same PC.

Raspberry Pi 3 @ 1.2 GHz
(Note: to build for ARM, the currently used SSE2 version of Blake2b must be replaced by the reference or NEON-optimized version.)
Core i7-4700HQ @ 2.4 GHz, Windows 8.1, 16 GB RAM (five runs of `..\Wownero-Test>randomjs.exe`)
@tevador Your precompiled binaries crash with exception code 0xC000001D (illegal instruction) on a Pentium G5400 (Coffee Lake) because it doesn't support AVX. I had to recompile with SSE2.

Intel Core i5-3210M @ 2.9 GHz, precompiled binaries:
Intel Core i5-3210M @ 2.9 GHz, my binaries with PGO:
Intel Pentium G5400 @ 3.7 GHz, my binaries without PGO:
Intel Pentium G5400 @ 3.7 GHz, my binaries with PGO:
@SChernykh Yes, the precompiled binaries require AVX. I added a note to the original comment. I didn't realize there were still modern CPUs where Intel disables it. Do the profile-guided optimizations work in general, or just for these particular 1000 programs? You can run a different set of programs by modifying the block header template in
@tevador 1000 programs (~10 seconds of CPU time) is a big enough sample for PGO to work well. It optimizes the C/C++ code at a low level and only needs execution statistics: which branches are executed more often, which `if`s are taken and which are not, and so on. In my experience, PGO almost always gives some improvement.
Ryzen 1600 @ 3.6 GHz (Linux)
@tevador Was it a 64-bit or a 32-bit build? Are there any differences between 32-bit and 64-bit?
@SChernykh Ubuntu 16.04, armv7l (32-bit). I haven't tested a 64-bit build since the software support is still a bit lacking.
Ryzen 1700 @ 3.6 GHz, Windows 10 (original testing binaries)
Finally found a way to test it on 64-bit ARM - there are a number of cloud hosting providers that have ARM servers. OS: Ubuntu 18.04 LTS
Interesting. The performance per clock seems to be significantly lower than a Raspberry Pi 3 in 32-bit mode. Perhaps the CPU is an older model? Can you also test a 32-bit build of the executables? I think the compiler flag is
It's a cloud server, so performance is very unstable because it's 4 virtual cores on a 96-core server:

There are also providers that have bare-metal ARM servers; I'll try them tomorrow. I'll try to test a 32-bit build, but re-compiling Boost will take a few hours.
You can specify
No luck compiling 32-bit code.
From what I found online, not all armv8 (64-bit) CPUs are backwards compatible with armv7. Extra silicon is needed for this, so backwards compatibility is optional. It seems that the Cavium ThunderX CPU doesn't support the armv7 instruction set, so it cannot run in 32-bit mode.
Ubuntu 16.04, Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz, Intel(R) Core(TM) i5-3450 CPU @ 3.10GHz. Wow, a PoW where the GHz actually matter. These are with the precompiled binaries.
Ubuntu 18.04; Threadripper 1950x
Let's see if I can get some more performant binaries done.
Speaking of optimized versions:
Stock binaries do 92.7852 programs per second on the same PC (Ryzen 7 1700 @ 3.6 GHz). I only spent a couple of evenings to get 1.9x performance, and I have a few more changes in mind to get to 2x and beyond. @tevador Is this PoW completely finalized? I could spend some more time optimizing it then.
@SChernykh, are those optimizations to the code that will increase performance on any CPU, or are they compile-time optimizations for the specs of the Ryzen? Also, I assume these current binaries are for 1 thread. Is there any limit (i.e. CPU cache size) needed for each thread, or will it run per thread regardless of cache size?
@Gingeropolous These are algorithmic and C++ coding optimizations only. No assembly or compiler-flag magic. Edit: as I understand it, it's supposed to run 1 process per CPU core.
@Gingeropolous The limited testing I've done shows basically linear scaling with the number of cores. I haven't tested the impact of SMT, so I'm not sure if Ryzen will mine faster with 8 or 16 threads.

@SChernykh Are you optimizing the XS engine or the JS generator? The generator takes only about 6% of the time of one hash, so I'm not sure what optimizations can be done there. Can you push the code changes? The PoW is not final yet. I'm planning some changes to the EvalExpression to increase FPGA resistance, but I have a lot of work this summer, so I don't have time to continue at the moment.
@tevador Most optimizations are in the XS engine; the only thing I changed in the RandomJS binary is the IPC communication with the XS engine. I'm not sure I can make a pull request right now, since some of my optimizations are Windows-only for now (the memory allocator, for example).
I added #ifdef guards to my Windows-only pieces of code and could compile and run it in Ubuntu:
It's even faster on a virtual Linux machine without profile-guided optimizations than on real Windows with PGO. The GCC compiler is superior! I'll prepare pull requests later today.
XS pull request: tevador/moddable#9
@tevador, sounds awesome. Could you possibly write out a rough sketch of the idea so someone else could pick up where you left off?
@SChernykh I have merged your code. Btw, I think we could avoid the call to atoi by prepending the program size as a 4-byte integer at the beginning, rather than writing it in textual form.

@Gingeropolous It still requires a lot of research. The problem is that currently there is a large variation in the number of EvalExpressions executed per program (IIRC it varies from as little as 8 to hundreds). The goal is to have a narrow range. As I'm planning to remove the

The programs with a high number of EvalExpressions create long-running outliers, which is also undesirable. Throw/catch is slow and could be a potential optimization target.
i5-6600K @ 4.1 GHz , Windows 10, pre-compiled binaries
ryzen 2700x @ 4.0 GHz, Windows 10, pre-compiled binaries
New results with the latest optimizations by @SChernykh. The same Threadripper did ~123 pps before :)
When I try to compile the new code, I get segfaults... but only on one box. I got it to compile on a different box and it runs fine. The new optimizations won't compile on an Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz, even though I followed the same instructions. Interestingly, the precompiled binaries run on this machine, but not on the Opteron.
Do you get segfaults during compilation, or when you try to run the compiled binary?
@SChernykh, when I try to run the self-compiled binary (on the i7-7700K box), I get a segfault (but the precompiled binary you provided works fine on that box). When I try to run the precompiled binary on the Opteron 6172, I get
Hi, is it possible to provide binaries for older architectures, like Intel Core2 for example (`-march=core2` on GCC)? Or is AVX a must-have?
On Ubuntu 18.10 I compiled xst and randomjs... they compiled fine, but xst fails to run.
I have pushed a test release of the C++ generator.
You can build it yourself (instructions here) or download precompiled binaries from here. The precompiled binaries require the AVX instruction set, so if your CPU doesn't support it, you have to build your own.
It consists of two binaries: `randomjs` (the generator) and `xst` (the JavaScript engine). It generates and executes 1000 programs and calculates Blake2b hashes of all outputs.
Post your performance numbers and CPU specs.
Xeon E3-1245 @3.7 GHz (Debian 9)