
Optimizations for TimeLord

voidxno edited this page Dec 11, 2024 · 17 revisions

Running a TimeLord is optional (default is disabled). Not needed to run a fully functional node.

The blockchain, as a whole, needs at least one active timelord to move forward. A few more, spread around, are preferred for redundancy and security.

If you want to contribute by running one, check requirements below. Enable timelord in WebGUI, or set true in config/local/timelord file. Check that it is running, and its speed, in NODE / LOG / TIMELORD tab in WebGUI. Speed will probably be lower than NODE / VDF Speed, unless you are the fastest timelord.

No more is needed. Standard Linux compile and Windows binaries give good performance for a timelord. Go further only if you want to optimize for fastest timelord, or for the fun of it.

TLDR;

I want to run a fast timelord:

  • Use Linux or Windows
  • Compile with AVX baseline
  • Have a GPU/iGPU verify VDF
  • Clock CPU as high as possible

NOTE: Optimize and overclock at your own risk.

Logic

A timelord performs a very simple mathematical operation (SHA256, Secure Hash Algorithm). It is performed recursively and cannot be parallelized. Each round needs the previous result as input, so the next round cannot start before the previous one is done. Then repeat, as fast as possible. Like a very specific mathematical single-threaded benchmark.

Timelord runs 3x of these operations, VDF (verifiable delay function) streams, in parallel. But their individual workload cannot be parallelized more.
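To illustrate why the workload is serial, here is a minimal Python sketch of a hash chain. It is not the mmx-node implementation (which is C++ using SHA extensions). The function name hash_chain and the seed values are made up for this example.

```python
import hashlib


def hash_chain(seed: bytes, iterations: int) -> bytes:
    """Apply SHA256 repeatedly, feeding each digest into the next round.

    Round N needs the output of round N-1, so the loop cannot be split
    across cores. Only a faster single core finishes it sooner.
    """
    state = seed
    for _ in range(iterations):
        state = hashlib.sha256(state).digest()
    return state


# Three independent chains could run on three cores at the same time,
# but each chain on its own stays strictly sequential.
# The seeds below are hypothetical placeholders.
streams = [hash_chain(seed, 1000) for seed in (b"stream-1", b"stream-2", b"stream-3")]
```

The number of rounds a core completes per second is what the MH/s figure measures.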

A third VDF stream was introduced for on-chain timelord rewards with testnet10. Can be turned off, leaving 2x VDF streams. Still possible to be fastest timelord, but no timelord rewards will be given.

Only fastest timelord, at any time, produces VDF for block being created. And can receive a timelord reward.

Requirements

CPU: Intel or AMD (w/ SHA extensions).
Model: Intel 11th-gen (Rocket Lake), AMD Zen, or later (a few exceptions).
GPU/iGPU: Any compatible (offload verify VDF).

Linux: grep -o 'sha_ni' /proc/cpuinfo, empty if not available.
Windows: CPU-Z (Instructions) or HWiNFO64 (Features), look for SHA.
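On Linux the same check can be scripted. A small sketch, assuming the kernel reports the flag as sha_ni on the flags line of /proc/cpuinfo, as the grep command above does. The helper names has_sha_ni and check_this_machine are hypothetical.

```python
from pathlib import Path


def has_sha_ni(cpuinfo_text: str) -> bool:
    """Return True if any 'flags' line lists the sha_ni CPU flag."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and "sha_ni" in value.split():
            return True
    return False


def check_this_machine(path: str = "/proc/cpuinfo") -> bool:
    """Read the local cpuinfo file (Linux only) and test for sha_ni."""
    return has_sha_ni(Path(path).read_text())
```

Calling check_this_machine() on the timelord host returns False when SHA extensions are missing, the same case where the grep command prints nothing.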

You can run timelord on a CPU without SHA extensions. It will fall back to AVX2. In practice, SHA extensions are needed to be competitive. SHA256 calculations are ~5-10x faster with SHA vs AVX2.

Why not GPU/FPGA/ASIC

Timelord logic will only use CPU, not GPU.

GPU is great for parallel SHA256 calculations, beating CPU in both speed and efficiency. GPU is used for verify VDF operation on a node, if available.

For a single SHA256 calculation, CPU's SHA extensions will beat GPU on speed. As timelord SHA256 workload is not parallelizable, CPU wins the serial SHA256 race.

Feedback welcome on other contenders. As of now, nothing observed beating a high-GHz CPU with SHA extensions (optimized silicon circuits inside CPU). Too low speed (GHz) on FPGA, work not parallelizable. Prohibitive cost to produce a high-GHz ASIC that beats Intel/AMD optimized silicon.

Optimize TimeLord

Standard Linux compile and Windows binaries give good performance for a timelord. Still important to tune surrounding environment. Either aiming for fastest timelord rewards, or just the challenge.

Information and numbers in this article might be superseded. Sections below are known info at date of publish. To help get started or give ideas. Not the absolute answer. Probably angles not discovered yet, and other ways to go about it.

You do not need to complicate it like below. Try to run a timelord. Measure speed. Try out a tip. Measure if faster.

Discuss and share in #mmx-timelord channel on Discord. No requirement to divulge all your secrets. But a good place to get tips, or kickstart new ideas.

NOTE: Optimize and overclock at your own risk.

Test Environment

CPU: Intel Ultra 200S series (15th-gen, Arrow Lake)
GPU: Nvidia RTX 3050 8GB (GA106)
OS: Ubuntu 24.10 (Oracular Oriole)
OS: Windows 10 (22H2)
mmx-node: v0.12.8 (+ AVX)

Most instructions below are transferable to AMD.

Measuring Speed

Timelord speed is measured in MH/s (million hashes per second).

Current blockchain network speed, fastest timelord, is found in NODE / VDF Speed in WebGUI. Your own timelord speed is found under NODE / LOG / TIMELORD tab in WebGUI.

To make it easier to measure your own improvements, establish baseline numbers, and compare with others, it is recommended to measure MH/s speed per 0.1 GHz (MH/s/0.1GHz). Yes, absolute speed is the end goal. But timelord speed, at least as observed for now, scales linearly with CPU GHz. A 2-step process is recommended:

  1. Optimize for best possible MH/s/0.1GHz
  2. Clock 3x CPU cores as high as possible

In this case the Intel 15th-gen has had E-cores locked to 3.1 GHz (could be lower/higher, not important). We already know from TimeLord Predictions that E-cores are the best option. Has not always been so. Previous generations had P-cores as best, with hyperthreading and E-cores disabled (more on that later). Goal is to make optimization measurements easier and controlled.

Intel 15th-gen (v0.12.8 + AVX):

| Environment | Measured | Locked Speed | Measured Per Unit |
|---|---|---|---|
| Ubuntu/gcc14 | 40.77 MH/s | /31 (3.1 GHz) | 1.315 MH/s/0.1GHz |
| Ubuntu/Clang19 | 39.94 MH/s | /31 (3.1 GHz) | 1.288 MH/s/0.1GHz |
| Windows/VC++ | 38.44 MH/s | /31 (3.1 GHz) | 1.240 MH/s/0.1GHz |

These numbers represent absolute speed per 0.1 GHz, given the environment and tuning. Easy to compare against yourself or others. Top speed after that is dependent on how high you can clock 3x CPU cores (more on that later). In this case, 3x E-cores stable at 4.6 GHz, would give a timelord speed of ~60.5 MH/s.
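The per-unit arithmetic can be written down as a short Python sketch. The function names per_unit and projected_speed are made up for illustration. The projection assumes the linear scaling with GHz described above, which is an observation and not a guarantee.

```python
def per_unit(measured_mhs: float, locked_ghz: float) -> float:
    """MH/s per 0.1 GHz: measured speed divided by the number of 0.1 GHz steps."""
    return measured_mhs / (locked_ghz * 10)


def projected_speed(per_unit_mhs: float, target_ghz: float) -> float:
    """Estimated MH/s at a target clock, assuming speed is linear in GHz."""
    return per_unit_mhs * target_ghz * 10


# Ubuntu/gcc14 row: 40.77 MH/s measured with cores locked to 3.1 GHz.
print(f"{per_unit(40.77, 3.1):.3f} MH/s/0.1GHz")            # 1.315 MH/s/0.1GHz
# Same efficiency with 3x cores stable at 4.6 GHz.
print(f"~{projected_speed(1.315, 4.6):.1f} MH/s at 4.6 GHz")  # ~60.5 MH/s at 4.6 GHz
```

Measuring at a locked clock first, then projecting, keeps compiler and source code experiments separate from overclocking results.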

As a sidenote, P-core numbers for 12/13/14th-gen Intel give the exact same performance per 0.1 GHz. Basically, no IPC (instructions per clock) uplift for SHA extensions between them (for this specific use-case). But 14th-gen has potential to clock highest.

Testing of AMD Zen4-core (7000-series), and Zen5-core (9000-series) gives 0.755 and 0.715 MH/s/0.1GHz. Yes, a degradation in efficiency. Probably an architectural decision.

Intel's 15th-gen E-cores efficiency increase was a surprise (1.315, vs 0.975 before). Making it the best known choice for now. Previous generations had Intel/AMD much closer (for this specific use-case). In the end, an overclocking race.

Fastest timelord speeds observed on testnets (as of Dec2024):

| Continuous | Peak |
|---|---|
| ~66.7 MH/s | ~67.0 MH/s |

Linux or Windows

Which binary is fastest, Linux or Windows, has switched back and forth across previous releases and CPU generations. With current source code, Linux (gcc14/Clang19) looks to have the edge.

Instructions shown in sections below are done on Linux. But most aspects are applicable to Windows too.

Linux distribution and kernel often have an effect on different types of workloads. When it comes to timelord logic, not much observed. Logic for creation of a VDF stream is very small. A few instructions repeated in a CPU core.

Optimize: Establish defaults

Follow default mmx-node installation for Linux (in this case Ubuntu, with default compiler gcc14). Get mmx-node up and running. Enable timelord in WebGUI, or set true in config/local/timelord file.

Let it run for a while. Check average speed in NODE / LOG / TIMELORD tab in WebGUI. With this setup:

| Environment | Measured | Locked Speed | Measured Per Unit |
|---|---|---|---|
| Ubuntu/gcc14 | 40.40 MH/s | /31 (3.1 GHz) | 1.303 MH/s/0.1GHz |

Optimize: Compiler

Compiler has an effect on how well source code is translated to binary objects. Default for Ubuntu 24.10 is GCC (GNU Compiler Collection), or gcc14. An alternative is Clang (LLVM), or Clang19. There are others. It has varied whether Clang or gcc does a better job.

One way to install, enable and compile with Clang19:

sudo apt install clang lld libomp-dev
export CC=/usr/bin/clang-19
export CPP=/usr/bin/clang-cpp-19
export CXX=/usr/bin/clang++-19
export LD=/usr/bin/ld.lld-19
./clean_all.sh
./make_devel.sh

NOTE: You need to run the export statements in the same terminal environment before compile, or gcc14 will be used.
NOTE: When you switch compiler, or compiler options, always do ./clean_all.sh before new compile.
NOTE: You will get a lot of unused -fmax-errors=1 warnings. Just ignore, or remove from compiler options.

Default Clang19 compile (./make_devel.sh):

| Environment | Measured | Locked Speed | Measured Per Unit |
|---|---|---|---|
| Ubuntu/Clang19 | 40.25 MH/s | /31 (3.1 GHz) | 1.298 MH/s/0.1GHz |

Small, but noticeable degrade from gcc14's 1.303 MH/s/0.1GHz. We'll switch back to gcc14 going forward.

Optimize: Compiler options

Compiler options can have a big effect on how source code is transformed to a binary object. Often focus is on speed vs size. Several options have an effect on timelord logic. Much has been tried with both gcc and Clang.

For now, gcc14 with default options in ./make_devel.sh gives best performance.

Some elements to experiment with (./make_devel.sh):

  • Switch between Release and RelWithDebInfo (-DCMAKE_BUILD_TYPE)
  • Remove -fno-omit-frame-pointer (-DCMAKE_CXX_FLAGS)
  • Add -march=native (-DCMAKE_CXX_FLAGS)
  • Variants of -O optimization option (-DCMAKE_CXX_FLAGS)

There are others. Look up optimization in relevant compiler documentation.

NOTE: When you switch compiler, or compiler options, always do ./clean_all.sh before new compile.

Optimize: Source code

One thing to be aware of is that we want to optimize a tiny part of whole mmx-node. Even a tiny subset of whole timelord logic. The calculation of a VDF stream. Performed through hash_t TimeLord::compute(...) (/src/TimeLord.cpp) calling recursive_sha256_ni(...) (/src/sha256_ni_rec.cpp). We do not care about the rest. As long as this part goes as fast as possible. Unless surrounding elements have an effect. Not observed for now.

That part of source code is already written to be fast when translated to binary objects by compiler (inline, intrinsics, asm).

Several iterations were made for it to end up like that. Still, this is the place to adjust source code if you think there is a way to optimize it even more.

Optimize: Source code (AVX vs SSE4.2)

Current default compile combines the usage of SHA extensions and SSE4.2 instructions. Raising the SSE4.2 baseline to AVX instructions gives a ~1% boost on 15th-gen Intel. Has to do with compiler using identical AVX versions of certain SSE4.2 instructions. Though, on 11th-gen Intel this looks to degrade performance (better with SSE4.2).

To implement AVX vs SSE4.2 baseline (./CMakeLists.txt):

Change -msse4.2 to -mavx on two lines (Linux compile part):

set_source_files_properties(src/sha256_ni.cpp PROPERTIES COMPILE_FLAGS "-mavx -msha")
set_source_files_properties(src/sha256_ni_rec.cpp PROPERTIES COMPILE_FLAGS "-mavx -msha")

Add two lines (Windows compile part):

set_source_files_properties(src/sha256_ni.cpp PROPERTIES COMPILE_FLAGS "/arch:AVX")
set_source_files_properties(src/sha256_ni_rec.cpp PROPERTIES COMPILE_FLAGS "/arch:AVX")

Easier to see location in closed PR#210 request.

Small, but real jump from gcc14's 1.303 MH/s/0.1GHz.

| Environment | Measured | Locked Speed | Measured Per Unit |
|---|---|---|---|
| Ubuntu/gcc14 | 40.77 MH/s | /31 (3.1 GHz) | 1.315 MH/s/0.1GHz |
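When comparing builds, it can help to express differences as a percentage of the baseline. A minimal sketch using the per-unit numbers from this article. The helper name relative_change is made up.

```python
def relative_change(baseline: float, candidate: float) -> float:
    """Percent change of candidate over baseline (positive means faster)."""
    return (candidate - baseline) / baseline * 100


# gcc14 default (SSE4.2 baseline) vs gcc14 with AVX baseline.
print(f"AVX vs SSE4.2: {relative_change(1.303, 1.315):+.2f}%")    # +0.92%
# gcc14 default vs Clang19 default.
print(f"Clang19 vs gcc14: {relative_change(1.303, 1.298):+.2f}%")  # -0.38%
```

Differences this small are easy to lose in measurement noise, so let each build run for a while and compare average speeds.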

NOTE: For now, official releases have SSE4.2 as baseline.

Optimize: CPU speed

At this stage we know what to expect for each 0.1 GHz, 1.315 MH/s. All testing we have observed has given a linear increase with CPU GHz. Now it is time to clock CPU as high as possible.

NOTE: Optimize and overclock at your own risk.

First a boring observation. Many elements surrounding raw GHz of CPU cores have been tested:

  • RAM type/speed/latency/bandwidth
  • HyperThreading on/off
  • Virtualization (VT-d)
  • Mitigations (Spectre/Meltdown)
  • CPU cache/ring ratio
  • CPU L1/L2/L3 cache size
  • CPU core-to-core latency

Nothing looks to affect timelord speed, except CPU core clock (GHz). Remember, timelord logic for creation of VDF streams is very small. Not much outside a few instructions repeated in a CPU core.

Timelord logic has 3x process threads. Each wants 100% of 1x CPU core, to calculate a VDF stream. Goal is to create an environment that makes these 3x process threads run with high GHz continuously.

One way, and valid strategy, is to let the OS process scheduler do its job (Linux or Windows). Distribute and use resources as best possible, depending on requirements and state of system. Maybe tune some aspects of OS, together with BIOS adjustments to clock CPU as high as possible. Gives great results. All modern CPUs have logic to boost individual CPU cores in combination with OS scheduler, power management and other logic.

Another, more manual way, is to dedicate specific CPU cores to the 3x timelord process threads. Locking OS and other processes away from them. In this case an Intel CPU with 8x P-cores, numbered 0-7, with E-cores following after that. We are going to dedicate E-cores 8,9,10 to timelord process threads. One way to achieve it (Linux, in this case Ubuntu):

  • Force OS process scheduler to not use cores 8,9,10. Add isolcpus=8,9,10 to GRUB_CMDLINE_LINUX (/etc/default/grub), then run sudo update-grub and reboot. Easily observed through htop and CPU core utilization.
  • When timelord is up and running, you should have 5x process threads with command name of 'TimeLord':
    ps -A -T -o tid,comm,pcpu | grep 'TimeLord'
    In practice, the last three are the process threads creating VDF streams, each at 100% CPU. Can also find them with htop. Let's say they have thread IDs (tid) 5111, 5112, 5113. Assign each of them an isolated CPU core:
    taskset -cp 8 5111
    taskset -cp 9 5112
    taskset -cp 10 5113
    Check result through htop. Should have cores 8,9,10 at 100% all the time through the 3x VDF creation streams.
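Thread IDs change on every restart, so the lookup and pinning steps above can be scripted. A sketch in Python, assuming the ps output format of the command above (tid, comm, pcpu columns) and that the VDF threads are the busiest 'TimeLord' threads. The function names plan_pinning and apply_pinning are hypothetical. os.sched_setaffinity is Linux-only and is used here in place of taskset.

```python
import os


def plan_pinning(ps_output: str, cores: list[int], name: str = "TimeLord") -> list[tuple[int, int]]:
    """Pair the busiest threads matching `name` with the given cores.

    Expects text from: ps -A -T -o tid,comm,pcpu
    Returns (tid, core) pairs, lowest tid first.
    """
    threads = []
    for line in ps_output.splitlines():
        parts = line.split()
        # Skip the header line and anything that is not "tid comm pcpu".
        if len(parts) < 3 or not parts[0].isdigit():
            continue
        try:
            pcpu = float(parts[-1])
        except ValueError:
            continue
        comm = " ".join(parts[1:-1])
        if name in comm:
            threads.append((pcpu, int(parts[0])))
    # Keep as many of the highest-CPU threads as there are cores.
    busiest = sorted(threads, reverse=True)[: len(cores)]
    tids = sorted(tid for _, tid in busiest)
    return list(zip(tids, cores))


def apply_pinning(plan: list[tuple[int, int]]) -> None:
    """Pin each thread to its core. May need root for other users' processes."""
    for tid, core in plan:
        os.sched_setaffinity(tid, {core})
```

A possible use is to capture the ps output with subprocess.run(["ps", "-A", "-T", "-o", "tid,comm,pcpu"], capture_output=True, text=True).stdout, pass it to plan_pinning with cores [8, 9, 10], review the plan, then call apply_pinning. Let the timelord run for a bit first, so the VDF threads show up with high CPU usage.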

In previous CPU generations we disabled P-core hyperthreading and the E-cores themselves. No penalty was observed on timelord speed. Fewer complications, more overclocking potential. In this case, if motherboard supports it, manually clock P-cores lower (GHz), and E-cores as high as possible.

Now it is a game of getting highest possible GHz, while keeping CPU cool and stable.

It is possible to be fastest timelord, produce VDF for block being created, with only 2x VDF streams (option in SETTINGS). No timelord rewards will be given, blockchain still 100% operational.

Mentioned because some CPUs boost (GHz) 2x favored cores higher than others, if workload is optimal. Usually these 2x favored cores have higher overclock potential. By adding the third VDF stream for on-chain timelord rewards, 3x high-GHz cores are needed. You might be able to clock 2x cores higher than 3x, but then with no rewards. Choices, choices.

Fastest TimeLord

First. Timelord rewards in testnets are not incentivized. Unlike block wins from testnet8, and later. Basically, no timelord rewards from testnets will transfer to mainnet.

On-chain timelord rewards were introduced with testnet10. Now part of blockchain logic. Before that, a temporary centralized solution existed.

How do you know if you are fastest timelord? Ultimate indicator is very easy. There is a wallet address set up as 'TimeLord Reward Address' target in SETTINGS in WebGUI. Timelord rewards will show up as 0.01 MMX of type VDF_REWARD. Not necessarily for all blocks. Depends on farmer verifying timelord reward, if close to 5sec verify VDF limit.

Another indicator is looking for Broadcasting VDF for height x messages in NODE / LOG / ROUTER tab in WebGUI. These do not guarantee you are the fastest timelord. But you are close to the threshold, and broadcasting VDF.

Overtaking as fastest timelord is usually not instant. Unless current fastest timelord outright stops, or new one is faster by a good margin. Your timelord starts behind because of network and verify VDF latency. Not easy to quantify given internet itself and other nodes. It will not hurt to have a fast VDF verify at start (GPU/iGPU). That is the point your timelord starts calculating its VDF streams from. If you are faster (MH/s), you should get ahead in the end.

To illustrate. With a previous test environment, running +0.2 MH/s over perceived speed of fastest timelord (network VDF speed). It took a few minutes to get first timelord reward, overtaking as fastest timelord. Verify VDF (not a fast GPU) was 2.5sec at the time. Took about 36 blocks (6min) to overtake.
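A rough way to reason about overtake time is a simple catch-up model: you start some seconds behind, and close the gap at the rate of your speed advantage. The sketch below is that model only, not how mmx-node computes anything. The function names are made up, the 30 MH/s network speed is a hypothetical value (the article does not state it), and the 10 sec block interval is inferred from "36 blocks (6min)". Real overtake time also depends on network latency and other nodes.

```python
def catch_up_seconds(deficit_s: float, network_mhs: float, advantage_mhs: float) -> float:
    """Seconds needed to make up a head start of `deficit_s` seconds.

    The leader is ahead by deficit_s * network_mhs million hashes.
    The gap shrinks by advantage_mhs million hashes every second.
    """
    if advantage_mhs <= 0:
        raise ValueError("a timelord that is not faster never catches up")
    return deficit_s * network_mhs / advantage_mhs


def catch_up_blocks(deficit_s: float, network_mhs: float, advantage_mhs: float,
                    block_interval_s: float = 10.0) -> float:
    """Same estimate expressed in blocks."""
    return catch_up_seconds(deficit_s, network_mhs, advantage_mhs) / block_interval_s


# 2.5 sec verify VDF delay and +0.2 MH/s advantage, as in the example above.
# The network speed of 30 MH/s is an assumption for illustration.
seconds = catch_up_seconds(2.5, 30.0, 0.2)
print(f"~{seconds / 60:.1f} min, ~{catch_up_blocks(2.5, 30.0, 0.2):.0f} blocks")
```

Under this model, halving the starting delay (faster verify VDF) or doubling the speed advantage each halves the time to overtake.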

Feedback

Please contradict findings above, or tell of new ones. Use #mmx-timelord channel on Discord.