Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

supervisor: lightweight process manager for the storagenode #27

Draft
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

profclems
Copy link
Member

@profclems profclems commented Oct 10, 2024

Creating a lightweight process manager for the storagenode in docker, which autoupdates and restarts the storagenode process.

This supervisor becomes the main process, and forks the storagenode as a subprocess.

With this, we don't need the "python supervisor" and its verbose configuration, download both the storagenode and storagenode updater binaries and run them like we do currently.

But this custom supervisor is also a storagenode updater by itself.

TODO: provide steps on how to test with storj-up

NOTE: this is not an attempt to write something like supervisor. See #27 (comment)

@profclems profclems marked this pull request as draft October 10, 2024 18:12
@profclems profclems changed the title supervisor supervisor: lightweight process manager for the storagenode Oct 10, 2024
@profclems profclems force-pushed the supervisor branch 4 times, most recently from a93547b to 0f6b8e4 Compare October 12, 2024 15:46
@elek
Copy link
Contributor

elek commented Oct 14, 2024

Do we really need a new type of supervisor? Can we write sg. better then the good old supervisord?

Why don't we use a bash loop (with tiny) + exit from the main process when we downloaded the new binary.

I am worried, because looks like implementing a supervisor is not so easy, especially with all the corner cases of signal processing / handling. Tiny might be better in this game.

But didn't check the code (yet).

Also: do we have code linting in this repository? Do we plan to use this for systemd based setup? Wondering if storj/storj would be a better place for the code...

@profclems
Copy link
Member Author

profclems commented Oct 14, 2024

Do we really need a new type of supervisor? Can we write sg. better then the good old supervisord?

This isn't really an attempt to write something like the supervisord. I just named it supervisor but it's not it.
The functionality is completely different and it's just purposely to support auto-updating and restarting just the storagenode. I want to maintain the auto-updating feature.

Here is how it works:

  • We will build the docker images with this supervisor as the only binary.

  • At the first run, it checks if the binary store contains the storagenode binary and copies it to /app/bin or the specified binary location within the container. If it does not exist, it downloads the correct minimum version or suggested version (if node ID is a rollout candidate)

    • This makes sure the node doesn't downgrade if it was an early upgrader in the rollout process
  • Then it starts an updater loop (goroutine) that checks for new updates for the storagenode and updates the binary when necessary

  • It then execs the storagenode binary and restarts it if it fails.

  • When the binary is updated, the supervisor kills the node and restarts it

The shutdown process: Kill the node process -> stop the updater loop -> exit the supervisor (main process)

Also: do we have code linting in this repository? Do we plan to use this for systemd based setup? Wondering if storj/storj would be a better place for the code...

I added it to this repo because it's just specifically for the storagenode-docker for now. And I don't want to move it to storj/storj because we don't want to follow the release cycle there.

@profclems
Copy link
Member Author

I just want the review to be based on the idea, then we can see if we need to move the code out or add some linters here

@AlexeyALeonov
Copy link
Contributor

AlexeyALeonov commented Oct 15, 2024

ERROR: docker.io/arm32v5/golang:1.22: not found

sad

they now supports only linux/arm/v7..

Dockerfile Outdated Show resolved Hide resolved
Copy link
Contributor

@AlexeyALeonov AlexeyALeonov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docker run -it --rm -v storagenode:/app/storagenode -v identity:/app/identity storagenodedocker:latest
...
Error: Failed to check version: software outdated: outdated software version (v1.112.2), please update
        storj.io/storj/private/version/checker.(*Service).CheckVersion:80
        main.cmdRun:91
        main.newRunCmd.func1:33
        storj.io/common/process.cleanup.func1.4:392
        storj.io/common/process.cleanup.func1:410
        github.com/spf13/cobra.(*Command).execute:983
        github.com/spf13/cobra.(*Command).ExecuteC:1115
        github.com/spf13/cobra.(*Command).Execute:1039
        storj.io/common/process.ExecWithCustomOptions:112
        main.main:34
        runtime.main:271
2024/10/15 04:24:27 ERROR Process exited with error service=supervisor error="process: exit status 1\n\tstorj.io/storagenode-docker/supervisor.(*Process).wait:61\n\tstorj.io/storagenode-docker/supervisor.(*Supervisor).runProcess:98\n\tstorj.io/storagenode-docker/supervisor.(*Supervisor).Run.func2:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"
2024/10/15 04:24:27 INFO Too many retries, exiting supervisor service=supervisor
2024/10/15 04:24:36 INFO Download finished. service=supervisor From=https://github.com/storj/storj/releases/download/v1.113.2/storagenode_linux_amd64.zip To=/tmp/storagenode_linux_amd64.2953723089.zip
2024/10/15 04:24:36 INFO Unpacking archive. service=supervisor From=/tmp/storagenode_linux_amd64.2953723089.zip To=/app/bin/storagenode.1.113.2
2024/10/15 04:24:37 INFO Downloaded version. service=supervisor Version=v1.113.2
2024/10/15 04:24:37 INFO Copying binary to store. service=supervisor From=/app/bin/storagenode.1.113.2 To=/app/config/bin/storagenode.1.113.2
2024/10/15 04:24:38 INFO Binary copied to store. service=supervisor From=/app/bin/storagenode.1.113.2 To=/app/config/bin/storagenode.1.113.2

And in the loop. It downloads a new binary to the storage location, but seems didn't rename it to actually update.

@elek
Copy link
Contributor

elek commented Oct 15, 2024

Thanks to explain the details @profclems

I understand that the goal is to keep this docker image empty as before, and I like this goal. And it requires an initial download.

Still I have the two questions:

  1. order of processes.

Originally I was thinking about the same. Updater (or here the supervisor) can execute a sub-process, and restart when needed. But now I am wondering if need some more simple hierarchy, as we had the pain with shutting down storagenodes in docker.

Instead of supervisor --> storagenode

We can also use something like this as an entrypoint:

#!/usr/bin/bash
storagenode-updater --initial-download storagenode

while true; do
   storagenode run
   if $? !=123; then
      exit $?
   fi
done

123 supposed to be the exit code of storagenode when update is required, and binary is downloaded by the version chore.

I feel this one somewhat more safe, as we don't need to forward signals (bash already does it).

  1. Independent question: this supervisor seems to be very similar to storagenode-updater. I think we should maintain only one tool, and probably put all required functionality...

But I understand that you would like to include a binary here, probably for long term, so I am fine with moving storagenode-updater to here, just seems to be a duplicated effort.

@ifraixedes
Copy link
Member

I am fine with moving storagenode-updater to here, just seems to be a duplicated effort.

Is storagenode-upater only used for storagenonode docker?

@profclems profclems force-pushed the supervisor branch 2 times, most recently from ccfaf6f to e434d6a Compare October 24, 2024 03:35
@profclems
Copy link
Member Author

ERROR: docker.io/arm32v5/golang:1.22: not found

sad

they now supports only linux/arm/v7..

Yes @AlexeyALeonov. So I've switched to arm32v7 for the binary builder and maintained arm32v6 for the main image.

This updates and manages the storagenode process

Change-Id: If8704e91028d7fc86d947244e6e8a21d33c4ab7a
Change-Id: Ied701ed18df5c7d764879e391a435c2a7ddd9874
Change-Id: Ieea3ee9c164abf4d6d97bfa49a2c5b6eb1e06b6a
Change-Id: I958224e2cbdf736f9aac81d6f07e1a846130534e
@AlexeyALeonov
Copy link
Contributor

I am fine with moving storagenode-updater to here, just seems to be a duplicated effort.

Is storagenode-upater only used for storagenonode docker?

No, it's also used in the services setups like Windows/Linux GUI

@AlexeyALeonov
Copy link
Contributor

ERROR: docker.io/arm32v5/golang:1.22: not found

Yes @AlexeyALeonov. So I've switched to arm32v7 for the binary builder and maintained arm32v6 for the main image.

They should be ARM32/v5 I suppose, there is no v6

Copy link
Contributor

@AlexeyALeonov AlexeyALeonov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's still doesn't work properly with versions:

docker build -t storagenode:local . --no-cache
$ docker run -it --rm -v identity:/app/identity -v storagenode:/app/storagenode -e WALLET=0x0 storagenode:local

Running /app/bin/supervisor exec /app/bin/storagenode run  --config-dir config --identity-dir identity --version.server-address=https://version.storj.io --storage.allocated-disk-space=2.0TB --operator.wallet=0x0000000000000000000000 
2024/10/27 04:36:22 {15m0s 1m0s /app/bin/storagenode /app/config/bin https://version.storj.io 1111111111111111111111111111111112m1s9K /app/identity false false}
2024/10/27 04:36:22 Binary does not exist, downloading new binary
2024/10/27 04:36:22 INFO Download started. service=supervisor From=https://github.com/storj/storj/releases/download/v1.113.2/storagenode_linux_amd64.zip To=/tmp/storagenode_linux_amd64.3611398710.zip
2024/10/27 04:36:23 INFO Download finished. service=supervisor From=https://github.com/storj/storj/releases/download/v1.113.2/storagenode_linux_amd64.zip To=/tmp/storagenode_linux_amd64.3611398710.zip
2024/10/27 04:36:23 INFO Unpacking archive. service=supervisor From=/tmp/storagenode_linux_amd64.3611398710.zip To=/app/bin/storagenode.1.113.2
2024/10/27 04:36:25 INFO Downloaded version. service=supervisor Version=v1.113.2
2024/10/27 04:36:25 INFO Copying binary to store. service=supervisor From=/app/bin/storagenode.1.113.2 To=/app/config/bin/storagenode.1.113.2
2024/10/27 04:36:25 INFO Binary copied to store. service=supervisor From=/app/bin/storagenode.1.113.2 To=/app/config/bin/storagenode.1.113.2
2024/10/27 04:36:25 WARN Error removing backup binary service=supervisor !BADKEY="remove /app/bin/storagenode.old.v0.0.0: no such file or directory"
2024/10/27 04:36:25 INFO Starting process service=supervisor binary=/app/bin/storagenode
2024-10-27T04:36:25Z	INFO	Anonymized tracing enabled	{"Process": "storagenode"}
2024-10-27T04:36:25Z	WARN	Operator email address isn't specified.	{"Process": "storagenode"}

And the second run:

Running /app/bin/supervisor exec /app/bin/storagenode run  --config-dir config --identity-dir identity --version.server-address=https://version.storj.io --storage.allocated-disk-space=2.0TB --operator.wallet=0x0000000000000000000000 
2024/10/27 04:46:01 {15m0s 1m0s /app/bin/storagenode /app/config/bin https://version.storj.io 1111111111111111111111111111111112m1s9K /app/identity false false}
2024/10/27 04:46:01 Binary does not exist, downloading new binary
2024/10/27 04:46:02 INFO Download started. service=supervisor From=https://github.com/storj/storj/releases/download/v1.113.2/storagenode_linux_amd64.zip To=/tmp/storagenode_linux_amd64.2332384136.zip
2024/10/27 04:46:02 INFO Download finished. service=supervisor From=https://github.com/storj/storj/releases/download/v1.113.2/storagenode_linux_amd64.zip To=/tmp/storagenode_linux_amd64.2332384136.zip
2024/10/27 04:46:02 INFO Unpacking archive. service=supervisor From=/tmp/storagenode_linux_amd64.2332384136.zip To=/app/bin/storagenode.1.113.2
2024/10/27 04:46:04 INFO Downloaded version. service=supervisor Version=v1.113.2
2024/10/27 04:46:04 INFO Copying binary to store. service=supervisor From=/app/bin/storagenode.1.113.2 To=/app/config/bin/storagenode.1.113.2
2024/10/27 04:46:04 INFO Binary copied to store. service=supervisor From=/app/bin/storagenode.1.113.2 To=/app/config/bin/storagenode.1.113.2
2024/10/27 04:46:04 WARN Error removing backup binary service=supervisor !BADKEY="remove /app/bin/storagenode.old.v0.0.0: no such file or directory"
2024/10/27 04:46:04 INFO Starting process service=supervisor binary=/app/bin/storagenode
2024-10-27T04:46:04Z	INFO	Anonymized tracing enabled	{"Process": "storagenode"}
2024-10-27T04:46:04Z	WARN	Operator email address isn't specified.	{"Process": "storagenode"}

the content of /app/config/bin:

$ docker run -it --rm -v storagenode:/app/config ubuntu ls -l /app/config/bin
total 154748
-rwxr-xr-x 1 root root 78855672 Oct 27 04:46 storagenode.1.113.2
-rwxr-xr-x 1 root root 79600112 Oct 27 04:46 storagenode.1.114.6

Copy link
Contributor

@AlexeyALeonov AlexeyALeonov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried with docker compose, even more interesting.

  1. Start everything
docker compose up -d
  1. Make sure, that all working and node updated to 1.114.6
  2. Remove the storagenode container:
docker compose down storagenode
  1. Re-create storagenode container
docker compose up -d

It firstly downloaded a minimum version, then started the node. It's failed because of the database schema mismatch:

docker compose logs storagenode | grep -c "Error migrating"
3

Because it has a minimal version.

And after that it would update to the correct version.

Change-Id: I2810f1900b4bef3caa97f07494d23cc340333545
@profclems
Copy link
Member Author

Thanks @AlexeyALeonov for all the tests and feedback. So I made a mistake: I was copying the new binary with the version number attached before replacing the actual binary. This has been fixed

@profclems
Copy link
Member Author

It firstly downloaded a minimum version, then started the node. It's failed because of the database schema mismatch:

This is an interesting one. I've noticed that from my tests too. Are you using the compose file in this PR? @AlexeyALeonov

…not exist

This should help prevent any database version mismatch issues

Change-Id: I173fa79f7a19d79f6388821892e5163ded90f1bb
@profclems
Copy link
Member Author

It firstly downloaded a minimum version, then started the node. It's failed because of the database schema mismatch:

This is an interesting one. I've noticed that from my tests too. Are you using the compose file in this PR? @AlexeyALeonov

Okay I just found the issues

  1. this was related to supervisor: lightweight process manager for the storagenode #27 (comment). It should be fixed now by 525c637
  2. If the binary does not exist, the updater was downloading the minimum before checking for rollout candidacy to download the appropriate version. This should be fixed too by 5047b57

Change-Id: I59da78cfb9e106758b6ba9c0f9ba1b6b4335c396
Change-Id: I83a2dbc7ab42ef8036bccab74fb1e218d92647ce
Copy link
Contributor

@AlexeyALeonov AlexeyALeonov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this new version.

  1. Replaced the storagenode binary with 1.113.2 in the storage location
  2. Started the stack
$ docker compose up -d
  1. Got a lot of errors "Error: Failed to check version: software outdated: outdated software version (v1.113.2), please update" (132):
$ docker compose logs storagenode | grep -c "Failed to check version"
132
  1. But finally it's updated.

Change-Id: I4091f660fd51205505a09bd59b8037d8a1a86e7d
@profclems
Copy link
Member Author

Great!

Got a lot of errors "Error: Failed to check version: software outdated: outdated software version (v1.113.2), please update" (132):

These errors are directly from the storagenode because it also checks for updates. I've set the updater in this supervisor to start 30 seconds after the node starts.

Also, I just pushed another change to make sure we still check if the node qualifies for the current rollout and download that.

Copy link
Contributor

@AlexeyALeonov AlexeyALeonov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it do update to a minimum version, if it detects the old one in the storage location.
But it downloads a minimum version every restart, even if it's already in the storage location.

Change-Id: I570994e3198e24138db13618d8417c36c00b37af
Change-Id: I874e47ceb75a11199df2d057585efdad46667be0
Copy link
Contributor

@AlexeyALeonov AlexeyALeonov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the current version with the old binary in the storage location trying to start the old binary before the check of a new version

storagenode-1  | Running /app/bin/supervisor exec /app/bin/storagenode run  --config-dir config --identity-dir identity --version.server-address=https://version.storj.io --storage.allocated-disk-space=2.0TB 
storagenode-1  | 2024/11/08 03:35:57 INFO Copying backup binary service=supervisor backup=/app/config/bin/storagenode destination=/app/bin/storagenode
storagenode-1  | 2024/11/08 03:35:57 INFO Checking version service=supervisor
storagenode-1  | 2024/11/08 03:35:57 INFO Current binary version service=supervisor version=v1.113.2
storagenode-1  | 2024/11/08 03:35:57 INFO Starting process service=supervisor binary=/app/bin/storagenode
storagenode-1  | 2024-11-08T03:35:58Z   INFO    Configuration loaded    {"Process": "storagenode", "Location": "/app/config/config.yaml"}
storagenode-1  | 2024-11-08T03:35:58Z   INFO    Anonymized tracing enabled      {"Process": "storagenode"}
storagenode-1  | 2024-11-08T03:35:58Z   DEBUG   tracing collector       started {"Process": "storagenode"}
storagenode-1  | 2024-11-08T03:35:58Z   DEBUG   debug server listening on 127.0.0.1:37471       {"Process": "storagenode"}
storagenode-1  | 2024-11-08T03:35:58Z   INFO    Operator email  {"Process": "storagenode", "Address": "[email protected]"}
storagenode-1  | 2024-11-08T03:35:58Z   INFO    Operator wallet {"Process": "storagenode", "Address": "0x0123456789012345678901234567890123456789"}
storagenode-1  | 2024-11-08T03:35:58Z   DEBUG   Version info    {"Process": "storagenode", "Version": "1.113.2", "Commit Hash": "a08a0fa4c7ea14154fe32eb002960be4396d9963", "Build Timestamp": "2024-09-18 09:03:49 +0000 UTC", "Release Build": true}
storagenode-1  | 2024-11-08T03:35:58Z   INFO    server  kernel support for server-side tcp fast open remains disabled.  {"Process": "storagenode"}
storagenode-1  | 2024-11-08T03:35:58Z   INFO    server  enable with: sysctl -w net.ipv4.tcp_fastopen=3  {"Process": "storagenode"}
storagenode-1  | 2024-11-08T03:35:59Z   DEBUG   version Allowed minimum version from control server.    {"Process": "storagenode", "Minimum Version": "1.114.6"}
storagenode-1  | 2024-11-08T03:35:59Z   WARN    version version not allowed/outdated    {"Process": "storagenode", "current version": "1.113.2", "minimum allowed version": "v1.114.6"}
storagenode-1  | 2024-11-08T03:35:59Z   ERROR   failure during run      {"Process": "storagenode", "error": "Failed to check version: software outdated: outdated software version (v1.113.2), please update\n\tstorj.io/storj/private/version/checker.(*Service).CheckVersion:80\n\tmain.cmdRun:91\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271", "errorVerbose": "Failed to check version: software outdated: outdated software version (v1.113.2), please update\n\tstorj.io/storj/private/version/checker.(*Service).CheckVersion:80\n\tmain.cmdRun:91\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271\n\tmain.cmdRun:93\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271"}

it's failed 15 times before it's downloaded a minimum allowed version:

$ docker compose logs storagenode | grep -c "software outdated"
15

Copy link
Contributor

@AlexeyALeonov AlexeyALeonov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall it's working normally.

@profclems
Copy link
Member Author

it's failed 15 times before it's downloaded a minimum allowed version:

@AlexeyALeonov yeah right, the updater waits for the node to run for a while before checking for update. But if you did run with the docker-compose.yaml in this patch, I added STORJ_SUPERVISOR_DISABLE_UPDATE_BEFORE_FIRST_RUN: "true" which disables updating the node before running.

So Ideally, the supervisor will first check the binary to see if it should be updated, updates it and starts it; then the actual updater starts a few seconds after the node starts.

Thanks for all the tests!

Change-Id: I9535b03bbacde04d26d6bf8b29d7c195d2a48246
Change-Id: I2b27e0d01a780511e059aabae5114ac639b49e39
@AlexeyALeonov
Copy link
Contributor

I added STORJ_SUPERVISOR_DISABLE_UPDATE_BEFORE_FIRST_RUN: "true" which disables updating the node before running.

In my docker-compose.yaml it has value

      STORJ_SUPERVISOR_DISABLE_UPDATE_BEFORE_FIRST_RUN: ${STORJ_SUPERVISOR_DISABLE_UPDATE_BEFORE_FIRST_RUN:-"false"}

So, it's false, not true.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants