Optimizations for TimeLord
Running a timelord is optional (disabled by default). It is not needed to run a fully functional node.
The blockchain as a whole needs at least one active timelord to move forward. A few more, spread around, are preferred for redundancy and security.
If you want to contribute by running one, check the requirements below. Enable timelord in WebGUI, or set `true` in the `config/local/timelord` file. Check that it is running, and its speed, in the NODE / LOG / TIMELORD tab in WebGUI. The speed will probably be lower than NODE / VDF Speed, unless you are the fastest timelord.
No more is needed. The standard Linux compile and Windows binaries give good performance for a timelord, unless you want to optimize for fastest timelord, or for the fun of it.
I want to run a fast timelord:
- Use Linux or Windows
- Compile with AVX baseline
- Have a GPU/iGPU verify VDF
- Clock CPU as high as possible
NOTE: Optimize and overclock at your own risk.
A timelord performs a very simple mathematical operation (SHA256, Secure Hash Algorithm). It is performed recursively and cannot be parallelized: the previous result is needed before the next round can start, and this is repeated as fast as possible. Think of it as a very specific single-threaded mathematical benchmark.
A timelord runs 3x of these operations, called VDF (verifiable delay function) streams, in parallel. But the workload of each individual stream cannot be parallelized further.
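To illustrate why this work is strictly serial, here is a minimal sketch of a hash chain using `sha256sum` from coreutils. The helper name `hash_chain`, the seed text and the round count are made up for illustration. For simplicity it hashes the hex text of the previous digest, which is not the byte-level format mmx-node uses.

```shell
#!/bin/sh
# Chain SHA256 a number of times: every round needs the digest of the round
# before it, so the loop cannot be split across CPU cores.
# NOTE: hashes the hex text of the previous digest for simplicity. This is an
# illustration only, not the byte-level VDF format used by mmx-node.
hash_chain() {
    digest="$1"   # starting value (arbitrary text)
    rounds="$2"   # number of sequential SHA256 rounds
    i=0
    while [ "$i" -lt "$rounds" ]; do
        digest=$(printf '%s' "$digest" | sha256sum | cut -d ' ' -f 1)
        i=$((i + 1))
    done
    printf '%s\n' "$digest"
}

hash_chain "seed" 100
```

Round N cannot start before round N-1 has finished, which is why only the speed of a single core matters for each VDF stream.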
A third VDF stream was introduced with testnet10 for on-chain timelord rewards. It can be turned off, leaving 2x VDF streams. It is still possible to be the fastest timelord that way, but no timelord rewards will be given.
Only the fastest timelord at any given time produces the VDF for the block being created, and can receive a timelord reward.
CPU: Intel or AMD (w/ SHA extensions).
Model: Intel 11th-gen (Rocket Lake), AMD Zen, or later (a few exceptions).
GPU/iGPU: Any compatible (offload verify VDF).
Linux: `grep -o 'sha_ni' /proc/cpuinfo`, empty if not available.
Windows: CPU-Z (Instructions) or HWiNFO64 (Features), look for `SHA`.
You can run a timelord on a CPU without SHA extensions. It will fall back to AVX2. In reality, SHA extensions are needed: SHA256 calculations are ~5-10x faster with SHA extensions than with AVX2.
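The Linux check above can be wrapped in a small script, for example to run before enabling timelord on a new machine. This is only a sketch: the helper name `has_sha_ni` and its optional file argument are made up for illustration and are not part of mmx-node.

```shell
#!/bin/sh
# Report whether the CPU flags in a cpuinfo-style file include SHA extensions.
# has_sha_ni is a hypothetical helper; the file argument defaults to /proc/cpuinfo.
has_sha_ni() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # -q: no output, only exit status; -w: match 'sha_ni' as a whole word
    grep -qw 'sha_ni' "$cpuinfo" 2>/dev/null
}

if has_sha_ni; then
    echo "SHA extensions available"
else
    echo "SHA extensions not found, timelord would fall back to AVX2"
fi
```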
Timelord logic will only use CPU, not GPU.
GPU is great for parallel SHA256 calculations, beating CPU in both speed and efficiency. GPU is used for verify VDF operation on a node, if available.
For a single SHA256 calculation, CPU's SHA extensions will beat GPU on speed. As timelord SHA256 workload is not parallelizable, CPU wins the serial SHA256 race.
Feedback welcome on other contenders. As of now, nothing observed beating a high-GHz CPU with SHA extensions (optimized silicon circuits inside CPU). Too low speed (GHz) on FPGA, work not parallelizable. Prohibitive cost to produce a high-GHz ASIC that beats Intel/AMD optimized silicon.
Standard Linux compile and Windows binaries give good performance for a timelord. It is still important to tune the surrounding environment, whether you are aiming for fastest timelord rewards or just the challenge.
Information and numbers in this article might be superseded. Sections below are known info at date of publish. To help get started or give ideas. Not the absolute answer. Probably angles not discovered yet, and other ways to go about it.
You do not need to complicate it like below. Try to run a timelord. Measure speed. Try out a tip. Measure if faster.
Discuss and share in the `#mmx-timelord` channel on Discord. No requirement to divulge all your secrets. But a good place to get tips, or kickstart new ideas.
NOTE: Optimize and overclock at your own risk.
CPU: Intel Ultra 200S series (15th-gen, Arrow Lake)
GPU: Nvidia RTX 3050 8GB (GA106)
OS: Ubuntu 24.10 (Oracular Oriole)
OS: Windows 10 (22H2)
mmx-node: v0.12.8 (+ AVX)
Most instructions below are transferable to AMD.
Timelord speed is measured in MH/s (million hashes per second).
Current blockchain network speed, fastest timelord, is found in NODE / VDF Speed in WebGUI. Your own timelord speed is found under NODE / LOG / TIMELORD tab in WebGUI.
To make it easier to measure your own improvements, establish baseline numbers, and compare with others, it is recommended to measure speed in MH/s per 0.1 GHz (MH/s/0.1GHz). Yes, absolute speed is the end goal. But timelord speed, at least as observed so far, scales linearly with CPU GHz. A 2-step process is recommended:
- Optimize for best possible MH/s/0.1GHz
- Clock 3x CPU cores as high as possible
In this case the Intel 15th-gen has had its E-cores locked to 3.1 GHz (could be lower/higher, not important). We already know from TimeLord Predictions that E-cores are the best option. That has not always been so. Previous generations had P-cores as best, with hyperthreading and E-cores disabled (more on that later). The goal is to make optimization measurements easier and controlled.
Intel 15th-gen (v0.12.8 + AVX):
Environment | Measured | Locked Speed | Measured Per Unit |
---|---|---|---|
Ubuntu/gcc14 | 40.77 MH/s | /31 (3.1 GHz) | 1.315 MH/s/0.1GHz |
Ubuntu/Clang19 | 39.94 MH/s | /31 (3.1 GHz) | 1.288 MH/s/0.1GHz |
Windows/VC++ | 38.44 MH/s | /31 (3.1 GHz) | 1.240 MH/s/0.1GHz |
These numbers represent speed per 0.1 GHz, given the environment and tuning. They are easy to compare against yourself or others. Top speed after that depends on how high you can clock 3x CPU cores (more on that later). In this case, 3x E-cores stable at 4.6 GHz would give a timelord speed of ~60.5 MH/s (1.315 x 46).
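The normalization and the projection are simple arithmetic. A minimal sketch with `awk`, assuming the linear scaling observed so far holds; the helper names `mhs_per_unit` and `projected_mhs` are made up for illustration.

```shell
#!/bin/sh
# Normalize a measured timelord speed to MH/s per 0.1 GHz.
# Arguments: measured MH/s, locked core clock in GHz.
mhs_per_unit() {
    awk -v mhs="$1" -v ghz="$2" 'BEGIN { printf "%.3f\n", mhs / (ghz * 10) }'
}

# Project absolute speed at a target clock, assuming linear scaling with GHz.
# Arguments: MH/s per 0.1 GHz, target core clock in GHz.
projected_mhs() {
    awk -v unit="$1" -v ghz="$2" 'BEGIN { printf "%.1f\n", unit * ghz * 10 }'
}

mhs_per_unit 40.77 3.1    # Ubuntu/gcc14 row from the table, prints 1.315
projected_mhs 1.315 4.6   # 3x E-cores at 4.6 GHz, prints 60.5
```

Measure at a locked clock first, compare the per-unit number after each change, and only then project to the clock you can hold stable.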
As a sidenote, P-core numbers for 12/13/14th-gen Intel give the exact same performance per 0.1 GHz. Basically, no IPC (instructions per clock) uplift for SHA extensions between them (for this specific use-case). But 14th-gen has the potential to clock highest.
Testing of AMD Zen4 cores (7000-series) and Zen5 cores (9000-series) gives 0.755 and 0.715 MH/s/0.1GHz, respectively. Yes, a degradation in efficiency from Zen4 to Zen5. Probably an architectural decision.
Intel's 15th-gen E-cores efficiency increase was a surprise (1.320, vs 0.975 before). Making it the best known choice for now. Previous generations had Intel/AMD much closer (for this specific use-case). In the end, an overclocking race.
Fastest timelord speeds observed on testnets (as of Dec2024):
Continuous | Peak |
---|---|
~66.7 MH/s | ~67.0 MH/s |
Numbers above, in previous releases and CPU generations, have switched between Linux or Windows binary being fastest. With current source code, Linux (gcc14/Clang19) looks to have the edge.
Instructions shown in sections below are done on Linux. But most aspects are applicable to Windows too.
Linux distribution and kernel often have an effect on different types of workloads. When it comes to timelord logic, not much observed. Logic for creation of a VDF stream is very small. A few instructions repeated in a CPU core.
Follow the default mmx-node installation for Linux (in this case Ubuntu, with default compiler gcc14). Get mmx-node up and running. Enable timelord in WebGUI, or set `true` in the `config/local/timelord` file.
Let it run for a while. Check average speed in NODE / LOG / TIMELORD tab in WebGUI. With this setup:
Environment | Measured | Locked Speed | Measured Per Unit |
---|---|---|---|
Ubuntu/gcc14 | 40.40 MH/s | /31 (3.1 GHz) | 1.303 MH/s/0.1GHz |
The compiler has an effect on how well source code is translated to binary objects. Default for Ubuntu 24.10 is GCC (GNU Compiler Collection), or gcc14. An alternative is Clang (LLVM), or Clang19. There are others. It has varied whether Clang or gcc does the better job.
One way to install, enable and compile with Clang19:
```shell
sudo apt install clang lld libomp-dev
export CC=/usr/bin/clang-19
export CPP=/usr/bin/clang-cpp-19
export CXX=/usr/bin/clang++-19
export LD=/usr/bin/ld.lld-19
./clean_all.sh
./make_devel.sh
```
NOTE: You need to run the `export` statements in the terminal environment before compiling, or gcc14 will be used.
NOTE: When you switch compiler, or compiler options, always run `./clean_all.sh` before a new compile.
NOTE: You will get a lot of unused `-fmax-errors=1` warnings. Just ignore them, or remove the option from the compiler options.
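To confirm which compiler a build actually picked up, the CMake cache can be inspected after compiling. A minimal sketch; the helper name `cache_compiler` is made up, and the default path `build/CMakeCache.txt` is an assumption about where the build directory lives, so adjust it to your checkout.

```shell
#!/bin/sh
# Print the C++ compiler recorded in a CMake cache file.
# cache_compiler is a hypothetical helper; the default path 'build/CMakeCache.txt'
# is an assumption about the build directory location.
cache_compiler() {
    cache="${1:-build/CMakeCache.txt}"
    # Matches only the CMAKE_CXX_COMPILER entry, not CMAKE_CXX_COMPILER_AR and similar
    sed -n 's/^CMAKE_CXX_COMPILER:[A-Z]*=//p' "$cache"
}

# Usage, from the mmx-node source directory after a compile:
# cache_compiler
```

If the `export` statements were active during the compile, the printed path should point at the Clang binary rather than the default compiler.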
Default Clang19 compile (`./make_devel.sh`):
Environment | Measured | Locked Speed | Measured Per Unit |
---|---|---|---|
Ubuntu/Clang19 | 40.25 MH/s | /31 (3.1 GHz) | 1.298 MH/s/0.1GHz |
A small but noticeable degradation from gcc14's 1.303 MH/s/0.1GHz. We'll switch back to gcc14 going forward.
Compiler options can have a big effect on how source code is transformed to a binary object. Often the focus is on speed vs size. Several options have an effect on timelord logic. Much has been tried with both gcc and Clang.
For now, gcc14 with default options in `./make_devel.sh` gives best performance.
Some elements to experiment with (`./make_devel.sh`):
- Switch between `Release` and `RelWithDebInfo` (`-DCMAKE_BUILD_TYPE`)
- Remove `-fno-omit-frame-pointer` (`-DCMAKE_CXX_FLAGS`)
- Add `-march=native` (`-DCMAKE_CXX_FLAGS`)
- Variants of `-O` optimization option (`-DCMAKE_CXX_FLAGS`)
There are others. Look up optimization in relevant compiler documentation.
NOTE: When you switch compiler, or compiler options, always run `./clean_all.sh` before a new compile.
One thing to be aware of is that we want to optimize a tiny part of the whole mmx-node, even a tiny subset of the whole timelord logic: the calculation of a VDF stream. It is performed through `hash_t TimeLord::compute(...)` (`/src/TimeLord.cpp`) calling `recursive_sha256_ni(...)` (`/src/sha256_ni_rec.cpp`). We do not care about the rest, as long as this part goes as fast as possible, unless surrounding elements have an effect. That has not been observed so far.
That part of source code is already written to be fast when translated to binary objects by compiler (inline, intrinsics, asm).
Several iterations were made for it to end up like that. Still, this is the place to adjust source code if you think there is a way to optimize it even more.
The current default compile combines the usage of SHA extensions and SSE4.2 instructions. Raising the SSE4.2 baseline to AVX instructions gives a ~1% boost on 15th-gen Intel. This has to do with the compiler using the AVX versions of certain SSE4.2 instructions. On 11th-gen Intel, though, this looks to degrade performance (better with SSE4.2).
To implement AVX vs SSE4.2 baseline (`./CMakeLists.txt`):
Change `-msse4.2` to `-mavx` on two lines (Linux compile part):
```cmake
set_source_files_properties(src/sha256_ni.cpp PROPERTIES COMPILE_FLAGS "-mavx -msha")
set_source_files_properties(src/sha256_ni_rec.cpp PROPERTIES COMPILE_FLAGS "-mavx -msha")
```
Add two lines (Windows compile part):
```cmake
set_source_files_properties(src/sha256_ni.cpp PROPERTIES COMPILE_FLAGS "/arch:AVX")
set_source_files_properties(src/sha256_ni_rec.cpp PROPERTIES COMPILE_FLAGS "/arch:AVX")
```
Easier to see the location in the closed PR#210 request.
Small, but real jump from gcc14's 1.303 MH/s/0.1GHz.
Environment | Measured | Locked Speed | Measured Per Unit |
---|---|---|---|
Ubuntu/gcc14 | 40.77 MH/s | /31 (3.1 GHz) | 1.315 MH/s/0.1GHz |
NOTE: For now, official releases have SSE4.2 as baseline.
At this stage we know what to expect for each 0.1 GHz: 1.315 MH/s. All testing observed so far has shown a linear increase with CPU GHz. Now it is time to clock the CPU as high as possible.
NOTE: Optimize and overclock at your own risk.
First a boring observation. Many elements surrounding raw GHz of CPU cores have been tested:
- RAM type/speed/latency/bandwidth
- HyperThreading on/off
- Virtualization (VT-d)
- Mitigations (Spectre/Meltdown)
- CPU cache/ring ratio
- CPU L1/L2/L3 cache size
- CPU core-to-core latency
Nothing looks to affect timelord speed, except CPU core clock (GHz). Remember, timelord logic for creation of VDF streams is very small. Not much outside a few instructions repeated in a CPU core.
Timelord logic has 3x process threads. Each wants 100% of 1x CPU core, to calculate a VDF stream. Goal is to create an environment that makes these 3x process threads run with high GHz continuously.
One way, and valid strategy, is to let the OS process scheduler do its job (Linux or Windows). Distribute and use resources as best possible, depending on requirements and state of system. Maybe tune some aspects of OS, together with BIOS adjustments to clock CPU as high as possible. Gives great results. All modern CPUs have logic to boost individual CPU cores in combination with OS scheduler, power management and other logic.
Another, more manual way, is to dedicate specific CPU cores to the 3x timelord process threads, locking the OS and other processes away from them. In this case, an Intel CPU with 8x P-cores, numbered 0-7, with E-cores following after that. We are going to dedicate E-cores 8,9,10 to the timelord process threads. One way to achieve it (Linux, in this case Ubuntu):
- Force the OS process scheduler to not use cores 8,9,10. Add `isolcpus=8,9,10` to `GRUB_CMDLINE_LINUX` (`/etc/default/grub`), then run `sudo update-grub` and reboot. Easily observed through `htop` and CPU core utilization.
- When timelord is up and running, you should have 5x process threads with command name of 'TimeLord':
  ```shell
  ps -A -T -o tid,comm,pcpu | grep 'TimeLord'
  ```
  In practice, the last three are the 100% CPU process threads creating the VDF streams. You can also find them with `htop`. Let's say they have pid(tid) 5111, 5112, 5113. Assign each of them an isolated CPU core:
  ```shell
  taskset -cp 8 5111
  taskset -cp 9 5112
  taskset -cp 10 5113
  ```
  Check the result through `htop`. Cores 8,9,10 should be at 100% all the time from the 3x VDF creation streams.
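The thread ids change on every restart, so the pinning step can be scripted. A minimal sketch, following the observation above that the last three 'TimeLord' threads are the VDF streams (verify this on your own system first). The helper name `pin_commands` is made up for illustration. It only prints the `taskset` commands, so they can be reviewed before anything is changed.

```shell
#!/bin/sh
# Print one taskset command per VDF thread, pairing the last three 'TimeLord'
# thread ids read from ps-style input (tid in the first column) with the given cores.
# pin_commands is a hypothetical helper; it only prints, it does not change affinity.
pin_commands() {
    cores="$1"   # space-separated core list, e.g. "8 9 10"
    awk '/TimeLord/ { print $1 }' | tail -n 3 | {
        for core in $cores; do
            read -r tid || break
            echo "taskset -cp $core $tid"
        done
    }
}

# Usage (review the printed commands, then append '| sh' to apply them):
# ps -A -T -o tid,comm,pcpu | pin_commands "8 9 10"
```

The affinity set this way is lost when the timelord restarts, so the script has to be run again after each restart.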
In previous CPU generations we disabled P-core hyperthreading and the E-cores themselves. No penalty was observed on timelord speed. Fewer complications, more overclocking potential. In this case, if the motherboard supports it, manually clock the P-cores lower (GHz), and the E-cores as high as possible.
Now it is a game of getting highest possible GHz, while keeping CPU cool and stable.
It is possible to be fastest timelord, produce VDF for block being created, with only 2x VDF streams (option in SETTINGS). No timelord rewards will be given, blockchain still 100% operational.
Mentioned because some CPUs boost (GHz) 2x favored cores higher than the others, if the workload is optimal. Usually these 2x favored cores have higher overclock potential. With the third VDF stream added for on-chain timelord rewards, 3x high-GHz cores are needed. You might be able to clock 2x cores higher than 3x, but with no rewards. Choices, choices.
First, timelord rewards in testnets are not incentivized, unlike block wins from testnet8 and later. Basically, no timelord rewards from testnets will transfer to mainnet.
On-chain timelord rewards were introduced with testnet10. They are now part of the blockchain logic. Before that, a temporary centralized solution existed.
How do you know if you are the fastest timelord? The ultimate indicator is very easy. There is a wallet address set up as the 'TimeLord Reward Address' target in SETTINGS in WebGUI. Timelord rewards will show up there as `0.01 MMX` of type `VDF_REWARD`. Not necessarily for all blocks. It depends on the farmer verifying the timelord reward, if close to the 5sec verify VDF limit.
Another indicator is looking for `Broadcasting VDF for height x` messages in the NODE / LOG / ROUTER tab in WebGUI. This does not guarantee that you are the fastest timelord. But you are close to the threshold, and broadcasting VDF.
Overtaking as fastest timelord is usually not instant, unless the current fastest timelord outright stops, or the new one is faster by a good margin. Your timelord starts behind because of network and verify VDF latency. This is not easy to quantify, given the internet itself and other nodes. It will not hurt to have a fast VDF verify at the start (GPU/iGPU), as that is the point your timelord starts calculating its VDF streams from. If you are faster (MH/s), you should get ahead in the end.
To illustrate. With a previous test environment, running +0.2 MH/s over perceived speed of fastest timelord (network VDF speed). It took a few minutes to get first timelord reward, overtaking as fastest timelord. Verify VDF (not a fast GPU) was 2.5sec at the time. Took about 36 blocks (6min) to overtake.
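A rough way to reason about catch-up time is sketched below. The model is an assumption, not taken from mmx-node: the new timelord starts one verify latency of work behind and closes that gap at its speed advantage. The helper name `catchup_seconds` is made up. The network speed of that earlier test is not stated above, so the 28.8 MH/s used here is a hypothetical value picked for illustration.

```shell
#!/bin/sh
# Rough catch-up estimate. Assumed model (not from mmx-node): the new timelord
# starts 'latency' seconds of work behind and closes the gap at 'advantage' MH/s.
# Arguments: verify latency (s), speed advantage (MH/s), network VDF speed (MH/s).
catchup_seconds() {
    awk -v lat="$1" -v adv="$2" -v net="$3" 'BEGIN { printf "%.0f\n", lat * net / adv }'
}

# 2.5 s verify latency and +0.2 MH/s are from the example above;
# 28.8 MH/s is a hypothetical network speed chosen for illustration.
catchup_seconds 2.5 0.2 28.8   # prints 360 (seconds, i.e. 6 minutes)
```

Under this model, halving the verify latency or doubling the speed advantage halves the time needed to overtake.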
Please contradict findings above, or tell of new ones. Use the `#mmx-timelord` channel on Discord.