Node add isn't consistently working #728

Open
gsstoykov opened this issue Oct 22, 2024 · 2 comments
Labels: Bug (an error that causes the feature to behave differently than what was expected based on design docs), HashSphere Requirement, P0 (an issue impacting production environments or impacting multiple releases or multiple individuals)

Comments

gsstoykov commented Oct 22, 2024

To Reproduce

Initialisation steps from #727 and:

npm run solo -- node add --gossip-keys true --tls-keys true --release-tag v0.54.0-alpha.4 --namespace solo-e2e

Describe the bug

◼ Finalize
node:internal/process/promises:289
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "#<ErrorEvent>".] {
  code: 'ERR_UNHANDLED_REJECTION'
}

Node.js v21.7.1

I've also seen it fail with the error from #727.
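
The reason `#<ErrorEvent>` in the message above indicates that the promise was rejected with an object that is not an `Error`, which is why Node prints no stack trace. Below is a minimal debugging sketch that logs more detail before exiting. It assumes the rejected object carries `type`, `message`, and `error` fields, which the log does not confirm. `describeRejection` is a hypothetical helper and not part of Solo.

```javascript
// Hypothetical helper: render an arbitrary rejection reason as readable text.
function describeRejection(reason) {
  if (reason instanceof Error) {
    return reason.stack || String(reason);
  }
  if (reason !== null && typeof reason === 'object') {
    const name = (reason.constructor && reason.constructor.name) || 'Object';
    const parts = [];
    // These field names are assumptions about the shape of the event object.
    if (reason.type !== undefined) parts.push(`type=${reason.type}`);
    if (reason.message !== undefined) parts.push(`message=${reason.message}`);
    if (reason.error instanceof Error) parts.push(`cause=${reason.error.message}`);
    return `${name} { ${parts.join(', ')} }`;
  }
  return String(reason);
}

// Log the details instead of letting Node abort with only "#<ErrorEvent>".
process.on('unhandledRejection', (reason) => {
  console.error('Unhandled rejection:', describeRejection(reason));
  process.exitCode = 1;
});
```

With a handler like this registered early in the entry point, the failing run should at least show which underlying error the event wraps, if it wraps one.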

Describe the expected behavior

The node is added and functioning. The failure does not happen on every run, but the command is not consistent enough for testing on our side.
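
Since the failure is intermittent, it may help to quantify how often it occurs. This is a minimal sketch of a repeat-and-count harness. `measureFlakiness` and `runOnce` are hypothetical names, and `runOnce` is assumed to perform one full attempt (including any re-initialisation the network needs) and return the process exit status.

```javascript
// Hypothetical harness: call runOnce() `runs` times and record non-zero exits.
function measureFlakiness(runOnce, runs) {
  const failures = [];
  for (let run = 1; run <= runs; run++) {
    const status = runOnce();
    // A null status (for example, killed by a signal) also counts as a failure.
    if (status !== 0) failures.push({ run, status });
  }
  return { runs, failed: failures.length, failures };
}
```

One possible `runOnce` uses Node's `child_process.spawnSync` and returns its `.status`, for example after running `npm run solo -- node add ...` with the flags from the reproduction step.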

Whole JUnit/CLI Logs

npm run solo -- node add --gossip-keys true --tls-keys true --release-tag v0.54.0-alpha.4 --namespace solo-e2e

> @hashgraph/solo@0.31.0 solo
> NODE_OPTIONS=--experimental-vm-modules node --no-deprecation solo.mjs node add --gossip-keys true --tls-keys true --release-tag v0.54.0-alpha.4 --namespace solo-e2e


******************************* Solo *********************************************
Version			: 0.31.0
Kubernetes Context	: kind-solo-e2e
Kubernetes Cluster	: kind-solo-e2e
Kubernetes Namespace	: solo-e2e
**********************************************************************************
✔ Initialize [0.1s]
✔ Check that PVCs are enabled
✔ Identify existing network nodes
  ✔ Check network pod: node1
✔ Determine new node account number
✔ Generate Gossip key [0.3s]
  ✔ Backup old files
  ✔ Gossip key for node: node2 [0.3s]
✔ Generate gRPC TLS key [0.4s]
  ✔ Backup old files
  ✔ TLS key for node: node2 [0.4s]
✔ Load signing key certificate
✔ Compute mTLS certificate hash
✔ Prepare gossip endpoints
✔ Prepare grpc service endpoints
✔ Prepare upgrade zip file for node upgrade process [2s]
✔ Check existing nodes staked amount [2s]
✔ Send node create transaction [2s]
✔ Send prepare upgrade transaction [4s]
✔ Send freeze upgrade transaction [2s]
✔ Download generated files from an existing node [0.5s]
✔ Prepare staging directory
  ✔ Copy Gossip keys to staging
  ✔ Copy gRPC TLS keys to staging
✔ Copy node keys to secrets [0.1s]
  ✔ Copy TLS keys [0.1s]
  ✔ Node: node1
    ✔ Copy Gossip keys
  ✔ Node: node2
    ✔ Copy Gossip keys
✔ Check network nodes are frozen [9s]
  ✔ Check network pod: node1  - status FREEZE_COMPLETE, attempt: 3/120 [9s]
✔ Get node logs and configs [2s]
✔ Deploy new network node [5s]
✔ Kill nodes to pick up updated configMaps
✔ Check node pods are running [58s]
  ✔ Check Node: node1
  ✔ Check Node: node2 [58s]
❯ Fetch platform software into all network nodes
  ⠇ Update node: node1 [ platformVersion = v0.54.0-alpha.4 ]
  ⠇ Update node: node2 [ platformVersion = v0.54.0-alpha.4 ]
◼ Download last state from an existing node
◼ Upload last saved state to new network node
◼ Setup new network node
◼ Start network nodes
◼ Enable port forwarding for JVM debugger
◼ Check all nodes are ACTIVE
◼ Check all node proxies are ACTIVE
◼ Stake new node
◼ Trigger stake weight calculate
◼ Finalize
node:internal/process/promises:289
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "#<ErrorEvent>".] {
  code: 'ERR_UNHANDLED_REJECTION'
}

Node.js v21.7.1

Additional Context

No response

gsstoykov added the Bug and Pending Triage labels on Oct 22, 2024

gsstoykov (Author) commented

I also tried the same flow using the C++ SDK's NodeCreateTransaction, followed by npm run solo -- node add-execute --input-dir context. The node pod appears to be created correctly, and the setup and start steps pass as well, but I got the following log:

npm run solo -- node add-execute --input-dir context

> @hashgraph/solo@0.31.0 solo
> NODE_OPTIONS=--experimental-vm-modules node --no-deprecation solo.mjs node add-execute --input-dir context


******************************* Solo *********************************************
Version			: 0.31.0
Kubernetes Context	: kind-solo-e2e
Kubernetes Cluster	: kind-solo-e2e
Kubernetes Namespace	: solo-e2e
**********************************************************************************
✔ Initialize [0.1s]
✔ Identify existing network nodes
  ✔ Check network pod: node1
✔ Load context data
✔ Download generated files from an existing node [0.4s]
✔ Prepare staging directory
  ✔ Copy Gossip keys to staging
  ✔ Copy gRPC TLS keys to staging
✔ Copy node keys to secrets
  ✔ Copy TLS keys
  ✔ Node: node1
    ✔ Copy Gossip keys
  ✔ Node: node2
    ✔ Copy Gossip keys
✔ Check network nodes are frozen [6s]
  ✔ Check network pod: node1  - status FREEZE_COMPLETE, attempt: 0/120 [6s]
✔ Get node logs and configs [8s]
✔ Deploy new network node [2s]
✔ Kill nodes to pick up updated configMaps
✔ Check node pods are running [30s]
  ✔ Check Node: node1
  ✔ Check Node: node2 [30s]
✔ Fetch platform software into all network nodes [5s]
  ✔ Update node: node1 [ platformVersion = v0.54.0-alpha.4 ] [5s]
  ✔ Update node: node2 [ platformVersion = v0.54.0-alpha.4 ] [5s]
✔ Download last state from an existing node [0.4s]
✔ Upload last saved state to new network node [0.4s]
✔ Setup new network node [0.1s]
  ✔ Node: node1 [0.1s]
    ✔ Set file permissions [0.1s]
  ✔ Node: node2
    ✔ Set file permissions
✔ Start network nodes [0.1s]
  ✔ Start node: node1
  ✔ Start node: node2
↓ Enable port forwarding for JVM debugger
❯ Check all nodes are ACTIVE
  ✔ Check network pod: node1  - status ACTIVE, attempt: 16/120 [24s]
  ✖ node 'node2' is not ACTIVE[ attempt = 120/120 ]
◼ Check all node proxies are ACTIVE
◼ Stake new node
◼ Trigger stake weight calculate
◼ Finalize
*********************************** ERROR *****************************************
Error in setting up nodes: node 'node2' is not ACTIVE[ attempt = 120/120 ]
***********************************************************************************
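
The `attempt = 120/120` message in the log above suggests a bounded polling loop that gives up after a fixed number of checks. This is a minimal sketch of that pattern, not Solo's actual implementation. `waitForStatus`, `checkStatus`, `maxAttempts`, and `delayMs` are hypothetical names, and the attempt numbering here starts at 1.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Poll checkStatus() until it returns `wanted` or the attempt budget runs out.
// Resolves with the attempt number that succeeded; rejects after maxAttempts.
async function waitForStatus(checkStatus, wanted, maxAttempts = 120, delayMs = 1000) {
  let last;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = await checkStatus();
    if (last === wanted) return attempt;
    // Skip the delay after the final attempt so failure is reported promptly.
    if (attempt < maxAttempts) await sleep(delayMs);
  }
  throw new Error(
    `status is ${last}, not ${wanted} [ attempt = ${maxAttempts}/${maxAttempts} ]`
  );
}
```

In a loop like this, exhausting every attempt means the node never reported the wanted status within the whole window, rather than being merely slow to start.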

jeromy-cannon added the P0 label and removed the Pending Triage label on Nov 18, 2024
jeromy-cannon self-assigned this on Nov 20, 2024

jeromy-cannon (Contributor) commented

We discovered that there is currently an issue in platform/services with NodeCreateTransaction. After the node has been added and one of the nodes goes into teach mode for the newly added node, the teacher gets JVM out-of-memory errors after it finishes teaching and reconnects to the network. I'm not sure of the exact amount, but Nathan quoted 22GB of memory (I'm not sure what this 22GB refers to). I think you might be able to work around this by setting the JVM memory settings very high, but we haven't configured Solo to do that by default.

We have disabled our E2E tests involving solo node add until this is resolved in a patch. I'm reaching out to find an issue that we can use to track this.
