Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allocation API: fix "no path to region" errors for non-global regions #24644

Merged
merged 1 commit into from
Dec 16, 2024

Conversation

tgross
Copy link
Member

@tgross tgross commented Dec 10, 2024

In #16872 we added support for unix domain sockets, but this required mutating the Config when parsing the address so as to remove the port number. In #23785 we fixed a bug where if the configuration was used across multiple clients (as in the autoscaler) that mutation would happen multiple times and the address would be incorrectly parsed.

When making alloc log, alloc fs, or alloc exec calls where we have line-of-sight to the client, we attempt to make a HTTP API call directly to the client node. So we create a new API client from the same configuration and then set the address. But in this case we copy the private url field and that causes the URL parsing to be skipped for the new client.

This results in the region always being set to the string literal "global" (because of mTLS handling code introduced all the way back in 4d3b75d), unless the user has set the region specifically. This fails with an error "no path to region" when the cluster isn't non-global and requests are sent to a non-leader.

Arguably the "right" way of fixing this would be for ClientConfig not to change the API client's region to "global" in the first place, but as this is a public API and extremely longstanding behavior, it could potentially be a breaking change for some downstream consumers. Instead, we'll avoid copying the private url field so that the new address is re-parsed.

Fixes: #24635
Fixes: #24609
Ref: #16872
Ref: #23785
Ref: 4d3b75d
Ref: https://hashicorp.atlassian.net/browse/NET-11858

Testing & Reproduction steps

To reproduce, stand up a cluster with region = "example" with at least one client, in an environment where you have line-of-sight to all nodes (ex. local development environment should work fine). Deploy a job to that client and run nomad alloc logs :alloc_id.

@tgross
Copy link
Member Author

tgross commented Dec 11, 2024

Another way I considered fixing this would be to have the Address field get paired with private scheme/host fields and add a (*Config) Copy() method that only copies the user-configured / public fields and erases the private fields. Callers would only ever use the internal fields.

That would prevent this class of bug in general at the cost of making the internals a lot more fiddly, so I thought that might not be the best approach. I'm definitely open to being argued otherwise though.

@tgross tgross requested a review from mismithhisler December 16, 2024 15:00
mismithhisler
mismithhisler previously approved these changes Dec 16, 2024
Copy link
Member

@mismithhisler mismithhisler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@mismithhisler mismithhisler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we update this comment?

Couldn't link to it in the review.

@tgross
Copy link
Member Author

tgross commented Dec 16, 2024

Should we update this comment?

That comment expressed the correct intent but incorrectly described the behavior before this bug fix. It's correct now.

In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address so as to remove the port number. In #23785
we fixed a bug where if the configuration was used across multiple clients that
mutation would happen multiple times and the address would be incorrectly
parsed.

When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have
line-of-sight to the client, we attempt to make a HTTP API call directly to the
client node. So we create a new API client from the same configuration and then
set the address. But in this case we copy the private `url` field and that
causes the URL parsing to be skipped for the new client.

This results in the region always being set to the string literal
`"global"` (because of mTLS handling code introduced all the way back in
4d3b75d), unless the user has set the region specifically. This fails with
an error "no path to region" when the cluster isn't non-global and requests are
sent to a non-leader.

Arguably the "right" way of fixing this would be for `ClientConfig` not to
change the API client's region to `"global"` in the first place, but as this is
a public API and extremely longstanding behavior, it could potentially be a
breaking change for some downstream consumers. Instead, we'll avoid copying the
private `url` field so that the new address is re-parsed.

Fixes: #24635
Fixes: #24609
Ref: #16872
Ref: #23785
Ref: 4d3b75d
@tgross
Copy link
Member Author

tgross commented Dec 16, 2024

Sorry @mismithhisler I didn't noticed that a file was in conflict because of a move. Can you 👍 again?

tgross added a commit that referenced this pull request Dec 16, 2024
In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address so as to remove the port number. In #23785
we fixed a bug where if the configuration was used across multiple clients that
mutation would happen multiple times and the address would be incorrectly
parsed.

When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have
line-of-sight to the client, we attempt to make a HTTP API call directly to the
client node. So we create a new API client from the same configuration and then
set the address. But in this case we copy the private `url` field and that
causes the URL parsing to be skipped for the new client.

This results in the region always being set to the string literal
`"global"` (because of mTLS handling code introduced all the way back in
4d3b75d), unless the user has set the region specifically. This fails with
an error "no path to region" when the cluster isn't non-global and requests are
sent to a non-leader.

Arguably the "right" way of fixing this would be for `ClientConfig` not to
change the API client's region to `"global"` in the first place, but as this is
a public API and extremely longstanding behavior, it could potentially be a
breaking change for some downstream consumers. Instead, we'll avoid copying the
private `url` field so that the new address is re-parsed.

Fixes: #24635
Fixes: #24609
Ref: #16872
Ref: #23785
Ref: 4d3b75d
tgross added a commit that referenced this pull request Dec 16, 2024
…ddress (#24644) (#24682)

In #16872 we added support for unix domain sockets, but this required mutating
the `Config` when parsing the address so as to remove the port number. In #23785
we fixed a bug where if the configuration was used across multiple clients that
mutation would happen multiple times and the address would be incorrectly
parsed.

When making `alloc log`, `alloc fs`, or `alloc exec` calls where we have
line-of-sight to the client, we attempt to make a HTTP API call directly to the
client node. So we create a new API client from the same configuration and then
set the address. But in this case we copy the private `url` field and that
causes the URL parsing to be skipped for the new client.

This results in the region always being set to the string literal
`"global"` (because of mTLS handling code introduced all the way back in
4d3b75d), unless the user has set the region specifically. This fails with
an error "no path to region" when the cluster isn't non-global and requests are
sent to a non-leader.

Arguably the "right" way of fixing this would be for `ClientConfig` not to
change the API client's region to `"global"` in the first place, but as this is
a public API and extremely longstanding behavior, it could potentially be a
breaking change for some downstream consumers. Instead, we'll avoid copying the
private `url` field so that the new address is re-parsed.

Fixes: #24635
Fixes: #24609
Ref: #16872
Ref: #23785
Ref: 4d3b75d

Co-authored-by: Tim Gross <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/ent/1.8.x+ent Changes are backported to 1.8.x+ent backport/1.9.x backport to 1.9.x release line theme/allocation API type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nomad logs no longer works without NOMAD_REGION set Nomad 1.8.4 failed to exec into task: No path to region
2 participants