diff --git a/.gitignore b/.gitignore index 3eaa275b75..76683563e7 100644 --- a/.gitignore +++ b/.gitignore @@ -40,3 +40,4 @@ deps-bundle.tar.zst /book/.vitepress/cache /book/.vitepress/dist /book/node_modules +/book/bun.lockb diff --git a/book/.vitepress/config.mts b/book/.vitepress/config.mts index 1fe37d3539..be1b90d2ee 100644 --- a/book/.vitepress/config.mts +++ b/book/.vitepress/config.mts @@ -35,6 +35,10 @@ export default defineConfig({ { text: 'Getting Started', link: 'getting-started' }, { text: 'Configuring', link: 'configuring' }, { text: 'Initializing', link: 'initializing' }, + { text: 'Frequently Asked Questions', link: 'faq' }, + { text: 'Monitoring', link: 'monitoring' }, + { text: 'Troubleshooting', link: 'troubleshooting' }, + { text: 'Tuning', link: 'tuning' }, ] } ] }, diff --git a/book/guide/configuring.md b/book/guide/configuring.md index fd02ec0b26..24f1fc2bf1 100644 --- a/book/guide/configuring.md +++ b/book/guide/configuring.md @@ -50,7 +50,14 @@ for different commands may cause them to fail. ## Logging By default Firedancer will maintain two logs. One permanent log which is written to a file, and an ephemeral log for fast visual inspection which -is written to stderr. +is written to stderr. The Agave runtime and consensus components also +output logs, which are included in Firedancer's logs. You can increase +the verbosity of the ephemeral log in the configuration TOML. + +```toml +[log] + level_stderr = "INFO" +``` ## Layout One way that Firedancer is fast is that it pins a dedicated thread to @@ -73,10 +80,11 @@ should be started. 
```toml [layout] - affinity = "0-14" - net_tile_count = 4 + affinity = "1-18" + quic_tile_count = 2 verify_tile_count = 4 bank_tile_count = 4 + solana_labs_affinity = "19-31" ``` It is suggested to run as many tiles as possible and tune the tile diff --git a/book/guide/faq.md b/book/guide/faq.md new file mode 100644 index 0000000000..253037ebec --- /dev/null +++ b/book/guide/faq.md @@ -0,0 +1,65 @@ +# Frequently Asked Questions + +::: details What hardware do I need to run Frankendancer? + +The current Frankendancer hardware requirements are the same +as those of an Agave validator. Refer to the [Hardware](./getting-started.md#hardware-requirements) +section in the [Getting Started](./getting-started.md) guide +for more details. + +::: + +::: details How can I obtain the Frankendancer binaries? + +Frankendancer does not currently provide pre-built binaries. +It is recommended to build the binaries on the same host where +you are planning to run the validator. Frankendancer detects +system properties and tries to build a binary tuned for the +particular host. Take a look at the [getting started](./getting-started.md) +guide for requirements and instructions. + +::: + +::: details What branch or tag should I build from? + +You can always check out the `v0.1` tag, which will point to the +latest release. For more information, refer to the [releases](./getting-started.md#releases) +section. + +::: + +::: details How do I resolve errors encountered while starting up Frankendancer? + +The Frankendancer binary `fdctl` tries to provide helpful error +messages to identify the problem and sometimes even suggests +solutions. Take a look at the [troubleshooting](./troubleshooting.md) +guide for easy steps that can mitigate common issues. + +::: + +::: details Can Agave and Frankendancer use the same ledger and snapshots? + +Yes, Frankendancer is fully compatible with both the snapshot +and the ledger formats of the Agave validator. 
+ +::: + +::: details How can I monitor the status of my Frankendancer node? + +You can use most of the regular monitoring tools and commands +that you typically would use with an Agave validator to monitor +Frankendancer as well. Refer to the [monitoring](./monitoring.md) +guide for some helpful commands. + +::: + +::: details Why is my node still delinquent? + +There could be several reasons, such as the validator +being unable to catch up or the validator not voting properly. +Take a look at the [tuning](./tuning.md) guide for some +tips on how to configure Frankendancer to increase the performance +of the replay stage so the validator catches up faster. Also make +sure that your node is staked and the stake is active. + +::: diff --git a/book/guide/monitoring.md b/book/guide/monitoring.md new file mode 100644 index 0000000000..28cee66f2a --- /dev/null +++ b/book/guide/monitoring.md @@ -0,0 +1,69 @@ +# Monitoring + +The Frankendancer validator can be monitored in much the same way as an +Agave validator. + +## Pre-requisite + +Be sure to build the `solana` binary, i.e. specify `solana` as a +target to the `make` command. The binary should be in the same +directory as `fdctl`. If you have not added that directory to the +`PATH` environment variable, replace `solana` with the full path +to the binary in the following commands. + +::: tip NOTE + +The list of commands below is not exhaustive. Some commands may not +work without RPC enabled on your validator. Check out the +comments in the `rpc` section of the `default.toml` file to +configure it according to your needs. 
+ +::: + +## Solana Commands + +* Ensure the validator has joined gossip + +```sh [bash] +solana -ut gossip | grep +``` + +* Ensure the validator is caught up + +```sh [bash] +solana -ut catchup --our-localhost +``` + +* Ensure the validator is voting + +```sh [bash] +solana -ut validators | grep +``` + +* Ensure the validator is producing blocks + +```sh [bash] +solana -ut block-production | grep +``` + +::: tip NOTE + +You can also use the `agave-validator --ledger monitor` +command with Frankendancer. For that, you need to build the +`agave-validator` binary from the `agave` repository. + +::: + +## Frankendancer Metrics + +* Look at the Prometheus metrics (on the same host) + +```sh [bash] +curl http://localhost:7999/metrics +``` + +* Run the Frankendancer monitor + +```sh [bash] +fdctl monitor --config ~/config.toml +``` diff --git a/book/guide/troubleshooting.md b/book/guide/troubleshooting.md new file mode 100644 index 0000000000..d5db0fd2a9 --- /dev/null +++ b/book/guide/troubleshooting.md @@ -0,0 +1,73 @@ +# Troubleshooting + +This page collects common troubleshooting steps for operators who +encounter errors while building or running Frankendancer. If these do +not address the problem, send a message in the `#firedancer-operators` +channel on the Solana Tech Discord or file an issue on GitHub. + +## Building + +### General Recommendations + +* It is always a good idea to rebuild everything from scratch. +Do a fresh clone of the repository, following the instructions in the +[Getting Started](./getting-started.md#prerequisites) guide. Remember to +check that you're using a supported compiler and to run `./deps.sh`! + +* If you're updating an existing repository clone, be sure to update +the solana submodule _after_ pulling the latest changes. 
For example: + +```sh [bash] +~/firedancer $ git fetch +~/firedancer $ git checkout v0.1 +~/firedancer $ git submodule update +``` + +### Specific Errors + +* Missing `cargo` binary from Rust toolchain + +```sh [bash] +error: the 'cargo' binary, normally provided by the 'cargo' component, is not applicable to the '1.75.0-x86_64-unknown-linux-gnu' toolchain ++ exec cargo +1.75.0 build --profile=release-with-debug --lib -p solana-validator +error: the 'cargo' binary, normally provided by the 'cargo' component, is not applicable to the '1.75.0-x86_64-unknown-linux-gnu' toolchain +make: *** [src/app/fdctl/Local.mk:107: cargo-validator] Error 1 +``` + +This typically happens due to a race condition between trying to install the +correct version of the Rust toolchain and using it. Separately re-installing +the toolchain fixes it (replace `1.75.0` with the appropriate version): + +```sh [bash] +rustup toolchain uninstall 1.75.0-x86_64-unknown-linux-gnu +rustup toolchain install 1.75.0-x86_64-unknown-linux-gnu +``` + +## Configuring + +### General Recommendations + +* If there are errors during `fdctl configure init all --config +~/config.toml`, consider running `fdctl configure fini all --config +~/config.toml` to remove all existing configuration, then try the `init` +command again. You can also re-run a specific configure stage, for +example, `fdctl configure init workspace --config ~/config.toml`. + +* Make sure the `config.toml` specified during this command is the +same as the one specified with the `run` command. Also make sure +that the content is valid TOML. + +* Read the output of the command carefully; `fdctl` often prints out +a helpful message that contains suggestions on how to resolve some +errors. Be sure to try them out! + +## Running + +### General Recommendations + +* Always run `fdctl configure init all --config ~/config.toml` before +running `fdctl run --config ~/config.toml`. 
If using a systemd unit, +specify both commands together to start Frankendancer. + +* Make sure the `~/config.toml` being used is the same in the `configure` +and `run` commands. diff --git a/book/guide/tuning.md b/book/guide/tuning.md new file mode 100644 index 0000000000..70d6847859 --- /dev/null +++ b/book/guide/tuning.md @@ -0,0 +1,81 @@ +# Tuning + +## Tiles + +To stay caught up with the cluster, the replay stage needs enough +cores and processing power. If you see your validator falling +behind with the default configuration, consider trying out the +following: + +### Increase Shred Tiles + +Example Original Config: + +```toml +[layout] + affinity = "1-18" + quic_tile_count = 2 + verify_tile_count = 4 + bank_tile_count = 4 + solana_labs_affinity = "19-31" +``` + +Example New Config: + +```toml +[layout] + affinity = "1-18" + quic_tile_count = 2 + verify_tile_count = 5 + bank_tile_count = 2 + shred_tile_count = 2 + solana_labs_affinity = "19-31" +``` + +This takes one core from a `bank` tile (transaction execution) and +gives it to an additional `shred` tile (turbine and shred processing). It +takes a second core from another `bank` tile and gives it to a `verify` +(signature verification) tile. + +### Increase Cores for Solana Labs + +Example Original Config: + +```toml +[layout] + affinity = "1-18" + quic_tile_count = 2 + verify_tile_count = 5 + bank_tile_count = 2 + shred_tile_count = 2 + solana_labs_affinity = "19-31" +``` + +Example New Config: + +```toml +[layout] + affinity = "1-16" + quic_tile_count = 1 + verify_tile_count = 4 + bank_tile_count = 2 + shred_tile_count = 2 + solana_labs_affinity = "17-31" +``` + +This takes one core from a `quic` tile and another from a `verify` +tile, and gives them both to the Solana Labs threads (where the replay stage +runs). + +## QUIC + +There is a lot of QUIC traffic in the cluster. If the validator is +having a hard time establishing QUIC connections, it might end up +receiving fewer transactions. 
Two parameters that can be tuned to address +this are shown here (both must be set to the same value): + +```toml +[tiles.quic] + max_concurrent_connections = 2048 + max_concurrent_handshakes = 2048 +```