This repository has been archived by the owner on Nov 8, 2021. It is now read-only.

Testing performance #5

Open
tevador opened this issue Jul 10, 2018 · 37 comments

@tevador
Owner

tevador commented Jul 10, 2018

I have pushed a test release of the C++ generator.

You can build it yourself (instructions here) or download precompiled binaries from here. The precompiled binaries require the AVX instruction set, so if your CPU doesn't support it, you will have to build your own.

It consists of two binaries: randomjs (the generator) and xst (the JavaScript engine).

It generates and executes 1000 programs and calculates Blake2b hashes of all outputs.

Post your performance numbers and CPU specs.

Xeon E3-1245 @3.7 GHz (Debian 9)

> ./randomjs
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 100.232 programs per second
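
For readers who want a feel for what the two reported numbers mean, here is a minimal Python sketch of a benchmark loop of this shape. It is not the randomjs implementation: `run_program` is a hypothetical stand-in for "generate a program and execute it in the engine", and the way outputs are folded into one cumulative digest (a Blake2b over the per-program Blake2b digests, 32 bytes to match the 64 hex characters shown above) is an assumption, since the thread does not describe the exact construction.

```python
import hashlib
import time


def run_program(seed: int) -> bytes:
    # Hypothetical stand-in: the real tool generates a JavaScript program
    # from a block template and nonce, then runs it in the xst engine.
    return f"output of program {seed}".encode()


def benchmark(count: int = 1000):
    # Assumed scheme: hash each output, then feed that digest into one
    # running Blake2b state to get a single cumulative hash.
    cumulative = hashlib.blake2b(digest_size=32)
    start = time.perf_counter()
    for seed in range(count):
        output = run_program(seed)
        cumulative.update(hashlib.blake2b(output, digest_size=32).digest())
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    return cumulative.hexdigest(), count / elapsed


digest, rate = benchmark()
print(f"Cumulative output hash: {digest}")
print(f"Performance: {rate:g} programs per second")
```

The cumulative hash is what makes results comparable across machines: identical hashes mean every machine executed the same programs and produced the same outputs, so only the rate differs.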
@ghost

ghost commented Jul 10, 2018

Xeon (Skylake, IBRS) @ 2.1 GHz (Ubuntu 18.04 LTS)
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 62.0579 programs per second

@SChernykh
Contributor

Intel Core i7-7820X @ 3.6 GHz (Windows 10). Built all binaries (release build) myself.

Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 129.94 programs per second

@JohannesLau

From precompiled binaries.
Ryzen 1700x @ 3.4GHz Windows 10
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 86.3199 programs per second

@SChernykh
Contributor

SChernykh commented Jul 10, 2018

Intel Core i7-7820X @ 3.6 GHz (Windows 10). Rebuilt all binaries using profile guided optimizations: 6.4% faster.

Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 138.256 programs per second

Edit: here are the precompiled binaries on the same PC

Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 130.962 programs per second

@tevador
Owner Author

tevador commented Jul 10, 2018

Raspberry Pi 3 @ 1.2 GHz

Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 13.9537 programs per second

(Note: To build for ARM, the currently used SSE2 version of Blake2b must be replaced by the reference or NEON optimized version.)

@Cactii1

Cactii1 commented Jul 10, 2018

Core i7-4700HQ @ 2.4 GHz Windows 8.1 with 16gigs RAM

..\Wownero-Test>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 96.2629 programs per second

..\Wownero-Test>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 95.4948 programs per second

..\Wownero-Test>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 96.447 programs per second

..\Wownero-Test>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 94.1722 programs per second

..\Wownero-Test>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 99.5576 programs per second

@SChernykh
Contributor

@tevador Your precompiled binaries crash with exception code 0xC000001D (illegal instruction) on Pentium G5400 (Coffee Lake) because it doesn't support AVX. I had to recompile it with SSE2.

Intel Core i5-3210M @ 2.9 GHz, precompiled binaries:

Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 75.0907 programs per second

Intel Core i5-3210M @ 2.9 GHz, my binaries with PGO:

Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 84.095 programs per second

Intel Pentium G5400 @ 3.7 GHz, my binaries without PGO:

Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 109.548 programs per second

Intel Pentium G5400 @ 3.7 GHz, my binaries with PGO:

Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 116.807 programs per second

@tevador
Owner Author

tevador commented Jul 10, 2018

@SChernykh Yes, the precompiled binaries require AVX. I added a note to the original comment. I didn't realize there were still modern CPUs where Intel disables it.

Do the profile-guided optimizations work in general or just for these particular 1000 programs? You can run a different set of programs by modifying the block header template in main.cpp. It would be best to read the block template and nonce count from the command line, but I haven't had time to implement it yet.

@SChernykh
Contributor

@tevador 1000 programs (~10 seconds of CPU time) is a big enough sample set for PGO to work well. It optimizes C/C++ code at a low level and only needs execution statistics: which branches are executed more often, which if statements are taken and which are not, and so on. In my experience, PGO almost always gives some improvement.

@zpalmtree

Ryzen 1600 @ 3.6 GHz (Linux)

Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 115.622 programs per second

@SChernykh
Contributor

> Raspberry Pi 3 @ 1.2 GHz

@tevador Was it a 64-bit or a 32-bit build? Are there any differences between 32-bit and 64-bit?

@tevador
Owner Author

tevador commented Jul 12, 2018

@SChernykh Ubuntu 16.04 armv7l (32bit). Haven't tested a 64bit build since the software support is still a bit lacking.

@SChernykh
Contributor

Ryzen 1700 @ 3.6 GHz, Windows 10 (original testing binaries)

Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 92.7852 programs per second

@SChernykh
Contributor

Finally found a way to test it on 64-bit ARM - there are a number of cloud hosting providers that have ARM servers.

OS: Ubuntu 18.04 LTS
Compiler: g++ (Ubuntu 8-20180414-1ubuntu2) 8.0.1 20180414 (experimental) [trunk revision 259383]
Processor: Cavium ThunderX 88XX @ 2GHz (aarch64)

Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 17.4324 programs per second

@tevador
Owner Author

tevador commented Jul 20, 2018

Interesting. The performance per clock seems to be significantly lower than Raspberry Pi 3 in 32-bit mode. Perhaps the CPU is an older model?

Can you also test a 32-bit build of the executables? I think the compiler flag is -march=armv7.
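
The per-clock comparison can be checked with the numbers posted in this thread. The sketch below divides each reported rate by the nominal clock given by the poster. Nominal clocks are only a rough proxy (turbo, throttling and shared cloud cores are ignored), so treat the result as an estimate.

```python
# (programs per second, nominal GHz) as reported earlier in this thread
results = {
    "Raspberry Pi 3 (armv7l, 32-bit)": (13.9537, 1.2),
    "Cavium ThunderX 88XX (aarch64)": (17.4324, 2.0),
    "Xeon E3-1245": (100.232, 3.7),
}


def per_ghz(programs_per_second: float, ghz: float) -> float:
    # Throughput normalised by clock speed.
    return programs_per_second / ghz


for name, (pps, ghz) in results.items():
    print(f"{name}: {per_ghz(pps, ghz):.2f} programs/s per GHz")
```

By this measure the Raspberry Pi 3 comes out at roughly 11.6 programs per second per GHz against roughly 8.7 for the ThunderX.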

@SChernykh
Contributor

It's a cloud server - very unstable performance because it's 4 virtual cores on a 96-core server:

Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 13.4565 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 12.1426 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 11.2156 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 13.0792 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 12.42 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 10.9402 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 12.3119 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 10.4637 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 10.3319 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 12.4766 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 11.9622 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 10.7777 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 11.278 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 12.9119 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 17.1197 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 16.4346 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 16.774 programs per second

There are also providers that have bare metal ARM servers, I'll try them tomorrow. I'll try to test 32-bit build, but re-compiling boost will take a few hours.
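
To put a number on "very unstable", one option is to parse the `Performance:` lines and look at the spread. This is a small sketch that relies only on the output format shown in this thread; the sample holds four of the runs posted above, and `parse_rates` and `summarize` are names made up for illustration.

```python
import re
from statistics import mean, stdev

# Matches the rate in lines such as "Performance: 13.4565 programs per second"
PERF_LINE = re.compile(r"Performance: ([0-9.]+) programs per second")


def parse_rates(text: str) -> list:
    return [float(value) for value in PERF_LINE.findall(text)]


def summarize(rates: list) -> dict:
    avg = mean(rates)
    spread = stdev(rates) if len(rates) > 1 else 0.0
    # cv = coefficient of variation (stdev relative to the mean)
    return {"runs": len(rates), "mean": avg, "stdev": spread, "cv": spread / avg}


sample = """\
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 13.4565 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 12.1426 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 11.2156 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 17.1197 programs per second
"""

stats = summarize(parse_rates(sample))
print(f"{stats['runs']} runs, mean {stats['mean']:.2f}, cv {stats['cv']:.1%}")
```

Piping the output of a repeated run into the same functions would make it easy to compare a noisy cloud instance with a dedicated machine.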

@tevador
Owner Author

tevador commented Jul 20, 2018

You can specify --with-libraries=system,filesystem when compiling boost. It speeds up compilation considerably (these are the only 2 libraries required by randomjs at the moment).

@SChernykh
Contributor

No luck in compiling 32-bit code

cc1plus: error: unknown value ‘armv7’ for -march
cc1plus: note: valid arguments are: armv8-a armv8.1-a armv8.2-a armv8.3-a armv8.4-a native; did you mean ‘armv8-a’?

@tevador
Owner Author

tevador commented Jul 20, 2018

From what I found online, not all armv8 (64bit) CPUs are backwards compatible with armv7. Extra silicon is needed for this, so backwards compatibility is optional.

It seems that the Cavium ThunderX CPU doesn't support the armv7 instruction set, so it cannot run in 32bit mode.

https://github.com/scaleway/image-debian/issues/86

@Gingeropolous

Gingeropolous commented Jul 28, 2018

Ubuntu 16.04

Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 157.207 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 156.94 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 157.211 programs per second

Intel(R) Core(TM) i5-3450 CPU @ 3.10GHz
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 98.8595 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 99.4327 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 99.3007 programs per second

wow. A PoW where the GHz actually matter. These are with the precompiled binaries.

@M5M400

M5M400 commented Jul 30, 2018

Ubuntu 18.04; Threadripper 1950x

root@TR4:/usr/local/src/RandomJS/src-cpp/bin# for i in {1..10}; do ./randomjs; done
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 123.49 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 120.954 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 123.555 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 123.58 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 122.938 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 122.905 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 123.402 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 122.556 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 122.872 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 123.415 programs per second

let's see if I can get some more performant binaries done

@SChernykh
Contributor

Speaking of optimized version

C:\Users\User\Downloads\RandomJS\src-cpp\x64\Release>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 175.757 programs per second

C:\Users\User\Downloads\RandomJS\src-cpp\x64\Release>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 176.123 programs per second

C:\Users\User\Downloads\RandomJS\src-cpp\x64\Release>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 176.713 programs per second

C:\Users\User\Downloads\RandomJS\src-cpp\x64\Release>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 175.023 programs per second

C:\Users\User\Downloads\RandomJS\src-cpp\x64\Release>randomjs.exe
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 176.187 programs per second

Stock binaries do 92.7852 programs per second on the same PC (Ryzen 7 1700 @ 3.6 GHz). I only spent a couple of evenings to get 1.9x performance and I have a few more changes in mind to get to 2x and more performance.

@tevador Is this PoW completely finalized? I could spend some more time optimizing it then.
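
The 1.9x figure can be reproduced from the numbers in this comment: the mean of the five optimized runs divided by the stock result on the same machine.

```python
from statistics import mean

# Rates posted above for the optimized build (Ryzen 7 1700 @ 3.6 GHz)
optimized_runs = [175.757, 176.123, 176.713, 175.023, 176.187]
# Rate of the stock binaries on the same PC
stock = 92.7852

speedup = mean(optimized_runs) / stock
print(f"Speedup: {speedup:.2f}x")
```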

@Gingeropolous

@SChernykh , are those optimizations to the code that will increase performance on any CPU? Or are they compile-time optimizations for the specs of the Ryzen?

Also, I assume these current binaries are for 1 thread. Is there any limit (i.e., cpu cache size) needed for each thread, or will it be able to run per thread regardless of cache size?

@SChernykh
Contributor

SChernykh commented Jul 30, 2018

@Gingeropolous These are algorithmic and C++ coding optimizations only. No assembly or compiler flags magic.

Edit: as I understood, it's supposed to run 1 process per CPU core.
Edit2: yes, my optimizations will work on any CPU.

@tevador
Owner Author

tevador commented Jul 30, 2018

@Gingeropolous The limited testing I've done shows basically linear scaling with the number of cores. I haven't tested the impact of SMT, so I'm not sure if Ryzen will mine faster with 8 or 16 threads.

@SChernykh Are you optimizing the XS engine or the JS generator? The generator takes only about 6% of the time of one hash, so I'm not sure what optimizations can be done there. Can you push the code changes?

The PoW is not final yet. I'm planning some changes to the EvalExpression to increase FPGA resistance, but I have a lot of work this summer, so I don't have time to continue at the moment.

@SChernykh
Contributor

@tevador Most optimizations are in the XS engine; the only thing that I changed in the RandomJS binary is the IPC with the XS engine. Not sure I can make a pull request right now since some of my optimizations are Windows-only for now (the memory allocator, for example).

@SChernykh
Contributor

I added #ifdef guards to my Windows-only pieces of code and could compile and run it in Ubuntu:

osboxes@osboxes:~/RandomJS/src-cpp/bin$ ./randomjs 
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 177.983 programs per second

It's even faster on a virtual Linux machine without profile-guided optimizations than on real Windows with PGO. The GCC compiler is superior! I'll prepare pull requests later today.

@SChernykh
Contributor

XS pull request: tevador/moddable#9
RandomJS pull request: #7

@Gingeropolous

@tevador ,

> The PoW is not final yet. I'm planning some changes to the EvalExpression to increase FPGA resistance, but I have a lot of work this summer, so I don't have time to continue at the moment.

Sounds awesome. Could you possibly write out a rough sketch of the idea so someone else could pick up where you left off?

@tevador
Owner Author

tevador commented Jul 30, 2018

@SChernykh I have merged your code. Btw, I think we could avoid the call to atoi by prepending the program size as a 4-byte integer at the beginning rather than writing it in textual form.
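
As an illustration of the size-prefix idea, here is a minimal Python sketch of length-prefixed framing. The actual IPC code is C/C++ and is not shown in this thread; the function names and the little-endian byte order are assumptions made for the example.

```python
import io
import struct


def frame(program: bytes) -> bytes:
    # 4-byte unsigned size header (little-endian is an assumption),
    # followed by the program text itself.
    return struct.pack("<I", len(program)) + program


def read_frame(stream) -> bytes:
    # The reader gets the size with a fixed 4-byte read, so no text
    # parsing (atoi) is needed on the receiving side.
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("stream ended inside the size header")
    (size,) = struct.unpack("<I", header)
    body = stream.read(size)
    if len(body) != size:
        raise EOFError("stream ended inside the program body")
    return body


pipe = io.BytesIO(frame(b"print(1 + 1)") + frame(b"print('second')"))
print(read_frame(pipe))
print(read_frame(pipe))
```

Both ends have to agree on the header width and byte order, which is the main thing a binary prefix gives up compared with a textual length.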

@Gingeropolous It still requires a lot of research. The problem is that currently, there is a large variation in the number of EvalExpressions executed per program (IIRC it varies from as few as 8 to hundreds). The goal is to have a narrow range.

As I'm planning to remove the = from the eval chars, the low number of EvalExpressions would enable some theoretical attacks by assuming all EvalExpressions throw a SyntaxError and thus avoiding eval altogether.

The programs with a high number of EvalExpressions create long-running outliers, which is also undesirable. Throw/catch is slow and could be a potential optimization target.

@miziel

miziel commented Jul 31, 2018

i5-6600K @ 4.1 GHz , Windows 10, pre-compiled binaries

Performance: 123.777 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 123.409 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 123.018 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 121.82 programs per second

ryzen 2700x @ 4.0 GHz, Windows 10, pre-compiled binaries

Performance: 102.023 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 101.123 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 102.32 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 100.887 programs per second

@M5M400

M5M400 commented Jul 31, 2018

New results with latest optimizations by @SChernykh

Same Threadripper did ~123 pps before :)

root@TR4:/usr/local/src/RandomJS/src-cpp/bin# for i in {1..10}; do ./randomjs; done
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 193.575 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 193.213 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 194.454 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 192.792 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 193.737 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 193.869 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 193.833 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 192.676 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 193.936 programs per second
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 194.189 programs per second

@Gingeropolous

When I try to compile the new code, I get segfaults... but only on one box. I got it to compile on a different box and it runs fine.

New optimizations:
model name : AMD Opteron(tm) Processor 6172
Cumulative output hash: bb3bffd3d0eef49066bf5c8e664c082c502fe021a7cdc0cd0c44fe1990560b79
Performance: 60.5453 programs per second

won't compile on model name : Intel(R) Core(TM) i7-7700K CPU @ 4.20GHz , even though I followed the same instructions. Interestingly, the precompiled binaries run on this machine, but not the opteron.

@SChernykh
Contributor

You get segfaults during compilation or when you try to run the compiled binary?

@Gingeropolous

Gingeropolous commented Aug 5, 2018

@SChernykh, when I tried to run the self-compiled binary (on the i7-7700K box), I get a segfault (but the precompiled binary you provided works fine on that box).

When I try to run the precompiled binary on the Opteron 6172, i get Illegal instruction (core dumped)
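
An "Illegal instruction" crash from the precompiled binaries is consistent with the AVX requirement mentioned earlier in the thread. On x86 Linux, AVX support can be checked before running them by reading the `flags` line of `/proc/cpuinfo`. The helper names below are made up for this sketch, and the check does not work on other operating systems or on ARM.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    # Return the feature flags from the first "flags" line of
    # /proc/cpuinfo-style text (x86 Linux format).
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def has_avx(path: str = "/proc/cpuinfo"):
    # True or False on x86 Linux; None if the file cannot be read.
    try:
        with open(path) as handle:
            return "avx" in cpu_flags(handle.read())
    except OSError:
        return None


print("AVX supported:", has_avx())
```

If this prints `False`, the binaries have to be rebuilt without AVX, as was done for the Pentium G5400 above.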

@OpticFlowX

Hi, is it possible to provide binaries for older architectures, such as Intel Core2 (-march=core2 on gcc)? Or is AVX a must-have?

@ouillepouille

ouillepouille commented Oct 4, 2018

On Ubuntu 18.10 I compiled xst and randomjs. Both compiled fine, but xst fails to run, so ./randomjs does nothing (you have to kill it).
I'm on a ThinkPad X240 with an i5.

./xst
Segmentation error (core dumped)
