-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
supervisor: lightweight process manager for the storagenode #27
base: main
Are you sure you want to change the base?
Conversation
a93547b
to
0f6b8e4
Compare
Do we really need a new type of supervisor? Can we write sg. better then the good old supervisord? Why don't we use a bash loop (with tiny) + exit from the main process when we downloaded the new binary. I am worried, because looks like implementing a supervisor is not so easy, especially with all the corner cases of signal processing / handling. Tiny might be better in this game. But didn't check the code (yet). Also: do we have code linting in this repository? Do we plan to use this for systemd based setup? Wondering if storj/storj would be a better place for the code... |
This isn't really an attempt to write something like the supervisord. I just named it supervisor but it's not it. Here is how it works:
The shutdown process: Kill the node process -> stop the updater loop -> exit the supervisor (main process)
I added it to this repo because it's just specifically for the storagenode-docker for now. And I don't want to move it to storj/storj because we don't want to follow the release cycle there. |
I just want the review to be based on the idea, then we can see if we need to move the code out or add some linters here |
sad they now supports only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docker run -it --rm -v storagenode:/app/storagenode -v identity:/app/identity storagenodedocker:latest
...
Error: Failed to check version: software outdated: outdated software version (v1.112.2), please update
storj.io/storj/private/version/checker.(*Service).CheckVersion:80
main.cmdRun:91
main.newRunCmd.func1:33
storj.io/common/process.cleanup.func1.4:392
storj.io/common/process.cleanup.func1:410
github.com/spf13/cobra.(*Command).execute:983
github.com/spf13/cobra.(*Command).ExecuteC:1115
github.com/spf13/cobra.(*Command).Execute:1039
storj.io/common/process.ExecWithCustomOptions:112
main.main:34
runtime.main:271
2024/10/15 04:24:27 ERROR Process exited with error service=supervisor error="process: exit status 1\n\tstorj.io/storagenode-docker/supervisor.(*Process).wait:61\n\tstorj.io/storagenode-docker/supervisor.(*Supervisor).runProcess:98\n\tstorj.io/storagenode-docker/supervisor.(*Supervisor).Run.func2:77\n\tgolang.org/x/sync/errgroup.(*Group).Go.func1:78"
2024/10/15 04:24:27 INFO Too many retries, exiting supervisor service=supervisor
2024/10/15 04:24:36 INFO Download finished. service=supervisor From=https://github.com/storj/storj/releases/download/v1.113.2/storagenode_linux_amd64.zip To=/tmp/storagenode_linux_amd64.2953723089.zip
2024/10/15 04:24:36 INFO Unpacking archive. service=supervisor From=/tmp/storagenode_linux_amd64.2953723089.zip To=/app/bin/storagenode.1.113.2
2024/10/15 04:24:37 INFO Downloaded version. service=supervisor Version=v1.113.2
2024/10/15 04:24:37 INFO Copying binary to store. service=supervisor From=/app/bin/storagenode.1.113.2 To=/app/config/bin/storagenode.1.113.2
2024/10/15 04:24:38 INFO Binary copied to store. service=supervisor From=/app/bin/storagenode.1.113.2 To=/app/config/bin/storagenode.1.113.2
And in the loop. It downloads a new binary to the storage location, but seems didn't rename it to actually update.
Thanks to explain the details @profclems I understand that the goal is to keep this docker image empty as before, and I like this goal. And it requires an initial download. Still I have the two questions:
Originally I was thinking about the same. Updater (or here the supervisor) can execute a sub-process, and restart when needed. But now I am wondering if need some more simple hierarchy, as we had the pain with shutting down storagenodes in docker. Instead of We can also use something like this as an entrypoint:
123 supposed to be the exit code of storagenode when update is required, and binary is downloaded by the version chore. I feel this one somewhat more safe, as we don't need to forward signals (bash already does it).
But I understand that you would like to include a binary here, probably for long term, so I am fine with moving storagenode-updater to here, just seems to be a duplicated effort. |
Is storagenode-upater only used for storagenonode docker? |
ccfaf6f
to
e434d6a
Compare
Yes @AlexeyALeonov. So I've switched to arm32v7 for the binary builder and maintained arm32v6 for the main image. |
This updates and manages the storagenode process Change-Id: If8704e91028d7fc86d947244e6e8a21d33c4ab7a
Change-Id: Ied701ed18df5c7d764879e391a435c2a7ddd9874
Change-Id: Ieea3ee9c164abf4d6d97bfa49a2c5b6eb1e06b6a
No, it's also used in the services setups like Windows/Linux GUI |
They should be ARM32/v5 I suppose, there is no v6 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still doesn't work properly with versions:
docker build -t storagenode:local . --no-cache
$ docker run -it --rm -v identity:/app/identity -v storagenode:/app/storagenode -e WALLET=0x0 storagenode:local
Running /app/bin/supervisor exec /app/bin/storagenode run --config-dir config --identity-dir identity --version.server-address=https://version.storj.io --storage.allocated-disk-space=2.0TB --operator.wallet=0x0000000000000000000000
2024/10/27 04:36:22 {15m0s 1m0s /app/bin/storagenode /app/config/bin https://version.storj.io 1111111111111111111111111111111112m1s9K /app/identity false false}
2024/10/27 04:36:22 Binary does not exist, downloading new binary
2024/10/27 04:36:22 INFO Download started. service=supervisor From=https://github.com/storj/storj/releases/download/v1.113.2/storagenode_linux_amd64.zip To=/tmp/storagenode_linux_amd64.3611398710.zip
2024/10/27 04:36:23 INFO Download finished. service=supervisor From=https://github.com/storj/storj/releases/download/v1.113.2/storagenode_linux_amd64.zip To=/tmp/storagenode_linux_amd64.3611398710.zip
2024/10/27 04:36:23 INFO Unpacking archive. service=supervisor From=/tmp/storagenode_linux_amd64.3611398710.zip To=/app/bin/storagenode.1.113.2
2024/10/27 04:36:25 INFO Downloaded version. service=supervisor Version=v1.113.2
2024/10/27 04:36:25 INFO Copying binary to store. service=supervisor From=/app/bin/storagenode.1.113.2 To=/app/config/bin/storagenode.1.113.2
2024/10/27 04:36:25 INFO Binary copied to store. service=supervisor From=/app/bin/storagenode.1.113.2 To=/app/config/bin/storagenode.1.113.2
2024/10/27 04:36:25 WARN Error removing backup binary service=supervisor !BADKEY="remove /app/bin/storagenode.old.v0.0.0: no such file or directory"
2024/10/27 04:36:25 INFO Starting process service=supervisor binary=/app/bin/storagenode
2024-10-27T04:36:25Z INFO Anonymized tracing enabled {"Process": "storagenode"}
2024-10-27T04:36:25Z WARN Operator email address isn't specified. {"Process": "storagenode"}
And the second run:
Running /app/bin/supervisor exec /app/bin/storagenode run --config-dir config --identity-dir identity --version.server-address=https://version.storj.io --storage.allocated-disk-space=2.0TB --operator.wallet=0x0000000000000000000000
2024/10/27 04:46:01 {15m0s 1m0s /app/bin/storagenode /app/config/bin https://version.storj.io 1111111111111111111111111111111112m1s9K /app/identity false false}
2024/10/27 04:46:01 Binary does not exist, downloading new binary
2024/10/27 04:46:02 INFO Download started. service=supervisor From=https://github.com/storj/storj/releases/download/v1.113.2/storagenode_linux_amd64.zip To=/tmp/storagenode_linux_amd64.2332384136.zip
2024/10/27 04:46:02 INFO Download finished. service=supervisor From=https://github.com/storj/storj/releases/download/v1.113.2/storagenode_linux_amd64.zip To=/tmp/storagenode_linux_amd64.2332384136.zip
2024/10/27 04:46:02 INFO Unpacking archive. service=supervisor From=/tmp/storagenode_linux_amd64.2332384136.zip To=/app/bin/storagenode.1.113.2
2024/10/27 04:46:04 INFO Downloaded version. service=supervisor Version=v1.113.2
2024/10/27 04:46:04 INFO Copying binary to store. service=supervisor From=/app/bin/storagenode.1.113.2 To=/app/config/bin/storagenode.1.113.2
2024/10/27 04:46:04 INFO Binary copied to store. service=supervisor From=/app/bin/storagenode.1.113.2 To=/app/config/bin/storagenode.1.113.2
2024/10/27 04:46:04 WARN Error removing backup binary service=supervisor !BADKEY="remove /app/bin/storagenode.old.v0.0.0: no such file or directory"
2024/10/27 04:46:04 INFO Starting process service=supervisor binary=/app/bin/storagenode
2024-10-27T04:46:04Z INFO Anonymized tracing enabled {"Process": "storagenode"}
2024-10-27T04:46:04Z WARN Operator email address isn't specified. {"Process": "storagenode"}
the content of /app/config/bin
:
$ docker run -it --rm -v storagenode:/app/config ubuntu ls -l /app/config/bin
total 154748
-rwxr-xr-x 1 root root 78855672 Oct 27 04:46 storagenode.1.113.2
-rwxr-xr-x 1 root root 79600112 Oct 27 04:46 storagenode.1.114.6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried with docker compose, even more interesting.
- Start everything
docker compose up -d
- Make sure, that all working and node updated to 1.114.6
- Remove the storagenode container:
docker compose down storagenode
- Re-create storagenode container
docker compose up -d
It firstly downloaded a minimum version, then started the node. It's failed because of the database schema mismatch:
docker compose logs storagenode | grep -c "Error migrating"
3
Because it has a minimal version.
And after that it would update to the correct version.
Change-Id: I2810f1900b4bef3caa97f07494d23cc340333545
Thanks @AlexeyALeonov for all the tests and feedback. So I made a mistake: I was copying the new binary with the version number attached before replacing the actual binary. This has been fixed |
This is an interesting one. I've noticed that from my tests too. Are you using the compose file in this PR? @AlexeyALeonov |
…not exist This should help prevent any database version mismatch issues Change-Id: I173fa79f7a19d79f6388821892e5163ded90f1bb
Okay I just found the issues
|
Change-Id: I59da78cfb9e106758b6ba9c0f9ba1b6b4335c396
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this new version.
- Replaced the storagenode binary with 1.113.2 in the storage location
- Started the stack
$ docker compose up -d
- Got a lot of errors
"Error: Failed to check version: software outdated: outdated software version (v1.113.2), please update"
(132):
$ docker compose logs storagenode | grep -c "Failed to check version"
132
- But finally it's updated.
Change-Id: I4091f660fd51205505a09bd59b8037d8a1a86e7d
Great!
These errors are directly from the storagenode because it also checks for updates. I've set the updater in this supervisor to start 30 seconds after the node starts. Also, I just pushed another change to make sure we still check if the node qualifies for the current rollout and download that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it do update to a minimum version, if it detects the old one in the storage location.
But it downloads a minimum version every restart, even if it's already in the storage location.
Change-Id: I570994e3198e24138db13618d8417c36c00b37af
Change-Id: I874e47ceb75a11199df2d057585efdad46667be0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the current version with the old binary in the storage location trying to start the old binary before the check of a new version
storagenode-1 | Running /app/bin/supervisor exec /app/bin/storagenode run --config-dir config --identity-dir identity --version.server-address=https://version.storj.io --storage.allocated-disk-space=2.0TB
storagenode-1 | 2024/11/08 03:35:57 INFO Copying backup binary service=supervisor backup=/app/config/bin/storagenode destination=/app/bin/storagenode
storagenode-1 | 2024/11/08 03:35:57 INFO Checking version service=supervisor
storagenode-1 | 2024/11/08 03:35:57 INFO Current binary version service=supervisor version=v1.113.2
storagenode-1 | 2024/11/08 03:35:57 INFO Starting process service=supervisor binary=/app/bin/storagenode
storagenode-1 | 2024-11-08T03:35:58Z INFO Configuration loaded {"Process": "storagenode", "Location": "/app/config/config.yaml"}
storagenode-1 | 2024-11-08T03:35:58Z INFO Anonymized tracing enabled {"Process": "storagenode"}
storagenode-1 | 2024-11-08T03:35:58Z DEBUG tracing collector started {"Process": "storagenode"}
storagenode-1 | 2024-11-08T03:35:58Z DEBUG debug server listening on 127.0.0.1:37471 {"Process": "storagenode"}
storagenode-1 | 2024-11-08T03:35:58Z INFO Operator email {"Process": "storagenode", "Address": "[email protected]"}
storagenode-1 | 2024-11-08T03:35:58Z INFO Operator wallet {"Process": "storagenode", "Address": "0x0123456789012345678901234567890123456789"}
storagenode-1 | 2024-11-08T03:35:58Z DEBUG Version info {"Process": "storagenode", "Version": "1.113.2", "Commit Hash": "a08a0fa4c7ea14154fe32eb002960be4396d9963", "Build Timestamp": "2024-09-18 09:03:49 +0000 UTC", "Release Build": true}
storagenode-1 | 2024-11-08T03:35:58Z INFO server kernel support for server-side tcp fast open remains disabled. {"Process": "storagenode"}
storagenode-1 | 2024-11-08T03:35:58Z INFO server enable with: sysctl -w net.ipv4.tcp_fastopen=3 {"Process": "storagenode"}
storagenode-1 | 2024-11-08T03:35:59Z DEBUG version Allowed minimum version from control server. {"Process": "storagenode", "Minimum Version": "1.114.6"}
storagenode-1 | 2024-11-08T03:35:59Z WARN version version not allowed/outdated {"Process": "storagenode", "current version": "1.113.2", "minimum allowed version": "v1.114.6"}
storagenode-1 | 2024-11-08T03:35:59Z ERROR failure during run {"Process": "storagenode", "error": "Failed to check version: software outdated: outdated software version (v1.113.2), please update\n\tstorj.io/storj/private/version/checker.(*Service).CheckVersion:80\n\tmain.cmdRun:91\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271", "errorVerbose": "Failed to check version: software outdated: outdated software version (v1.113.2), please update\n\tstorj.io/storj/private/version/checker.(*Service).CheckVersion:80\n\tmain.cmdRun:91\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271\n\tmain.cmdRun:93\n\tmain.newRunCmd.func1:33\n\tstorj.io/common/process.cleanup.func1.4:392\n\tstorj.io/common/process.cleanup.func1:410\n\tgithub.com/spf13/cobra.(*Command).execute:983\n\tgithub.com/spf13/cobra.(*Command).ExecuteC:1115\n\tgithub.com/spf13/cobra.(*Command).Execute:1039\n\tstorj.io/common/process.ExecWithCustomOptions:112\n\tmain.main:34\n\truntime.main:271"}
it's failed 15 times before it's downloaded a minimum allowed version:
$ docker compose logs storagenode | grep -c "software outdated"
15
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall it's working normally.
@AlexeyALeonov yeah right, the updater waits for the node to run for a while before checking for update. But if you did run with the docker-compose.yaml in this patch, I added So Ideally, the supervisor will first check the binary to see if it should be updated, updates it and starts it; then the actual updater starts a few seconds after the node starts. Thanks for all the tests! |
Change-Id: I9535b03bbacde04d26d6bf8b29d7c195d2a48246
Change-Id: I2b27e0d01a780511e059aabae5114ac639b49e39
In my
So, it's |
Creating a lightweight process manager for the storagenode in docker, which autoupdates and restarts the storagenode process.
This supervisor becomes the main process, and forks the storagenode as a subprocess.
With this, we don't need the "python supervisor" and its verbose configuration, download both the storagenode and storagenode updater binaries and run them like we do currently.
But this custom supervisor is also a storagenode updater by itself.
TODO: provide steps on how to test with storj-up
NOTE: this is not an attempt to write something like supervisor. See #27 (comment)