VAULT-31409: trace postUnseal function #28895

Open
wants to merge 19 commits into
base: main

@bosouza bosouza commented Nov 13, 2024

Description

This PR adds 2 new settings to the top-level Vault server configuration: enable_post_unseal_trace and post_unseal_trace_dir. When enable_post_unseal_trace is enabled, Vault generates a Go trace during the execution of core.postUnseal and saves it to disk under the directory configured in post_unseal_trace_dir, defaulting to /tmp/vault-traces/ if the field isn't set. The resulting file can be inspected using the go tool trace command and will help us debug cases where the leadership transfer operation takes too long.

Most of the feature was split into a reusable StartDebugTrace function in a separate package. It could be used for similar debug purposes in the future but most importantly for now it avoids adding more code and dependencies to the already-bloated core package.
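For readers unfamiliar with Go execution traces, here is a minimal sketch of what a helper along these lines might look like. It is not the PR's StartDebugTrace: the name startDebugTrace, its signature, and the colon-free timestamp layout are assumptions made for illustration. Only runtime/trace's Start and Stop are the real standard-library API.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime/trace"
	"time"
)

// startDebugTrace is a hypothetical stand-in for the PR's StartDebugTrace.
// It starts a Go execution trace written to a new file inside dir and
// returns the file path plus a stop function that ends the trace and
// closes the file.
func startDebugTrace(dir, filePrefix string) (string, func() error, error) {
	// Colon-free timestamp layout: an assumption of this sketch so the
	// file name is valid on every operating system.
	name := fmt.Sprintf("%s-%s.trace", filePrefix, time.Now().Format("20060102T150405"))
	path := filepath.Join(dir, name)

	f, err := os.Create(path)
	if err != nil {
		return "", nil, fmt.Errorf("failed to create trace file %q: %w", path, err)
	}
	// trace.Start fails if another trace is already running in this process.
	if err := trace.Start(f); err != nil {
		f.Close()
		return "", nil, fmt.Errorf("failed to start trace: %w", err)
	}
	stop := func() error {
		trace.Stop() // flushes the remaining trace data to f
		return f.Close()
	}
	return path, stop, nil
}

// traceDemo runs a short traced section in a temporary directory and
// returns the size of the trace file that was produced.
func traceDemo() (int64, error) {
	dir, err := os.MkdirTemp("", "vault-traces")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path, stop, err := startDebugTrace(dir, "post-unseal")
	if err != nil {
		return 0, err
	}
	time.Sleep(10 * time.Millisecond) // stand-in for the work being traced
	if err := stop(); err != nil {
		return 0, err
	}

	info, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	size, err := traceDemo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("wrote a %d-byte trace; inspect such files with: go tool trace <file>\n", size)
}
```

Returning a stop function keeps the caller's side down to a single defer, which is the shape the postUnseal call site in this PR uses.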

Manually tested the functionality to ensure that it was possible to generate the traces by updating the new fields in the config, then sending a SIGHUP signal to the running Vault process and triggering a leadership election via vault operator step-down. Disabling the generation of the traces via the same SIGHUP process also works. Post-unseal only runs on the active node so restarting a follower won't generate the post-unseal traces.

Jira: VAULT-31409

TODO only if you're a HashiCorp employee

  • Backport Labels: If this PR is in the ENT repo and needs to be backported, backport
    to N, N-1, and N-2, using the backport/ent/x.x.x+ent labels. If this PR is in the CE repo, you should only backport to N, using the backport/x.x.x label, not the enterprise labels.
    • If this fixes a critical security vulnerability or severity 1 bug, it will also need to be backported to the current LTS versions of Vault. To ensure this, use all available enterprise labels.
  • ENT Breakage: If this PR either 1) removes a public function OR 2) changes the signature
    of a public function, even if that change is in a CE file, double check that
    applying the patch for this PR to the ENT repo and running tests doesn't
    break any tests. Sometimes ENT only tests rely on public functions in CE
    files.
  • Jira: If this change has an associated Jira, it's referenced either
    in the PR description, commit message, or branch name.
  • RFC: If this change has an associated RFC, please link it in the description.
  • ENT PR: If this change has an associated ENT PR, please link it in the
    description. Also, make sure the changelog is in this PR, not in your ENT PR.

@github-actions github-actions bot added the hashicorp-contributed-pr label Nov 13, 2024

github-actions bot commented Nov 13, 2024

CI Results: failed ❌
Failures:

| Test Type | Package | Test | Logs |
|---|---|---|---|
| race | command/server | TestConfig_Sanitized | view test results |
| race | http | TestSysConfigState_Sanitized | view test results |
| race | http | TestSysConfigState_Sanitized/inmem_storage,_no_HA_storage | view test results |
| race | http | TestSysConfigState_Sanitized/inmem_storage,_raft_HA_storage | view test results |
| race | http | TestSysConfigState_Sanitized/raft_storage | view test results |
| standard | command/server | TestConfig_Sanitized | view test results |
| standard | http | TestSysConfigState_Sanitized | view test results |
| standard | http | TestSysConfigState_Sanitized/inmem_storage,_no_HA_storage | view test results |
| standard | http | TestSysConfigState_Sanitized/inmem_storage,_raft_HA_storage | view test results |
| standard | http | TestSysConfigState_Sanitized/raft_storage | view test results |

Comment on lines 14 to 15
path := fmt.Sprintf("%s/%s_%s", os.TempDir(), filePrefix, time.Now().Format(time.RFC3339))
traceFile, err := os.Create(path)
Author

still want to make this more similar to VAULT_STACKTRACE_WRITE_TO_FILE for the sake of consistency

after talking to Kuba it sounds like it's a good idea to try to move stuff out of core, so even if there's no immediate need for a generic debug trace function it's still fair to add it
also some usability improvements from manual testing
@bosouza bosouza changed the title from "VAULT-31409: trace unsealInternal function" to "VAULT-31409: trace postUnseal function" Nov 18, 2024
}
}

traceFile, err := filepath.Abs(fmt.Sprintf("%s/%s-%s.trace", dir, filePrefix, time.Now().Format(time.RFC3339)))
Contributor

can we switch this to using filepath.Join to create the path, rather than Sprintf?
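A sketch of what the suggested change could look like. The function name traceFilePath and the fixed demoTime are hypothetical and only exist to make the example self-contained; filepath.Join inserts the OS-specific separator and cleans the result, so the directory string no longer needs to be glued to the file name by hand.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// demoTime is a fixed timestamp so the example produces a stable file name.
var demoTime = time.Date(2024, 11, 18, 12, 0, 0, 0, time.UTC)

// traceFilePath (hypothetical name) builds the absolute path
// "<dir>/<filePrefix>-<timestamp>.trace". Sprintf is still used for the
// file name itself, but the directory and the name are combined with
// filepath.Join instead of a hard-coded "/".
func traceFilePath(dir, filePrefix string, now time.Time) (string, error) {
	name := fmt.Sprintf("%s-%s.trace", filePrefix, now.Format(time.RFC3339))
	return filepath.Abs(filepath.Join(dir, name))
}

func main() {
	p, err := traceFilePath("/var/log/vault-traces/", "post-unseal", demoTime)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(p)
}
```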

@@ -115,6 +115,9 @@ type Config struct {
License string `hcl:"-"`
LicensePath string `hcl:"license_path"`
DisableSSCTokens bool `hcl:"-"`

EnablePostUnsealTrace bool `hcl:"enable_post_unseal_trace"`
PostUnsealTraceDir string `hcl:"post_unseal_trace_dir"`
Contributor

let's also add these fields to the (*Config).Merge function

Author

ahh good catch, wouldn't have noticed this unless I'd tested with multiple files. I wonder why we didn't go for a more generic implementation here (maybe didn't want to use reflect?) but we could at least have some semgrep validation to check that all fields are merged (or explicitly ignored)
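To illustrate the kind of check floated above, here is a rough sketch that uses reflect in a test-style helper rather than in Merge itself. Everything in it is hypothetical: config is a trimmed-down stand-in for the real Config, the merge functions take and return values rather than pointers, and only string and bool fields are handled.

```go
package main

import (
	"fmt"
	"reflect"
)

// config is a hypothetical, trimmed-down stand-in for the server Config.
type config struct {
	LicensePath           string
	EnablePostUnsealTrace bool
	PostUnsealTraceDir    string
}

// mergeIncomplete forgets the two new fields, like the first revision did.
func mergeIncomplete(a, b config) config {
	out := a
	if b.LicensePath != "" {
		out.LicensePath = b.LicensePath
	}
	return out
}

// mergeComplete also carries over the trace settings.
func mergeComplete(a, b config) config {
	out := mergeIncomplete(a, b)
	if b.EnablePostUnsealTrace {
		out.EnablePostUnsealTrace = true
	}
	if b.PostUnsealTraceDir != "" {
		out.PostUnsealTraceDir = b.PostUnsealTraceDir
	}
	return out
}

// unmergedFields sets every string and bool field of a config to a
// non-zero value, merges it into an empty config, and returns the names
// of the fields that are still zero afterwards.
func unmergedFields(merge func(a, b config) config) []string {
	var filled config
	v := reflect.ValueOf(&filled).Elem()
	for i := 0; i < v.NumField(); i++ {
		switch f := v.Field(i); f.Kind() {
		case reflect.String:
			f.SetString("set")
		case reflect.Bool:
			f.SetBool(true)
		}
	}

	merged := reflect.ValueOf(merge(config{}, filled))
	var missing []string
	for i := 0; i < merged.NumField(); i++ {
		if merged.Field(i).IsZero() {
			missing = append(missing, merged.Type().Field(i).Name)
		}
	}
	return missing
}

func main() {
	fmt.Println("incomplete merge misses:", unmergedFields(mergeIncomplete))
	fmt.Println("complete merge misses:", unmergedFields(mergeComplete))
}
```

A real version would also need an explicit ignore list for fields that are intentionally not merged, and handling for the remaining field kinds.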

if stopTrace := c.tracePostUnsealIfEnabled(); stopTrace != nil {
defer stopTrace()
}

defer metrics.MeasureSince([]string{"core", "post_unseal"}, time.Now())
Author

not sure what the best order is between this metric and the new trace, but I think it's fair to include the metric in the captured trace even though that will make the time spent on the tracing setup invisible to the metric
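The ordering trade-off follows from two Go rules: deferred calls run in last-in, first-out order, and the arguments of a deferred call are evaluated when the defer statement executes. The sketch below mimics the structure of the diff with hypothetical helpers (mark and record stand in for time.Now and metrics.MeasureSince) and records the order in which things happen.

```go
package main

import (
	"fmt"
	"strings"
)

// mark appends an event to the log and returns it, so it can also be used
// as an argument that is evaluated at defer time (like time.Now()).
func mark(events *[]string, e string) string {
	*events = append(*events, e)
	return e
}

// record stands in for metrics.MeasureSince: it runs when the deferred
// call fires, using a start value captured earlier.
func record(events *[]string, _ string) {
	*events = append(*events, "metric recorded")
}

// postUnseal mirrors the ordering in the diff: the trace is started and its
// stop function deferred first, then the metric is deferred.
func postUnseal(events *[]string) {
	mark(events, "trace started")
	defer func() { mark(events, "trace stopped") }()

	// The second argument is evaluated here, after the trace setup.
	defer record(events, mark(events, "metric timer started"))

	mark(events, "post-unseal work")
}

func main() {
	var events []string
	postUnseal(&events)
	fmt.Println(strings.Join(events, " -> "))
}
```

The metric timer starts after the trace setup, so setup time is not measured, and the metric is recorded before the trace stops, so the recording itself falls inside the trace.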

there were concerns about using the /tmp directory because of permissions, or having a default dir at all, so now it's required to set a dir in order to generate the traces.
sounds like it might be forbidden on Windows and possibly cause problems in some macOS applications.
@bosouza bosouza marked this pull request as ready for review November 22, 2024 15:49
@bosouza bosouza requested review from a team as code owners November 22, 2024 15:49
Build Results:
All builds succeeded! ✅

return "", nil, fmt.Errorf("trace directory %q does not exist", dir)
}

if !os.IsNotExist(err) && !d.IsDir() {
Contributor

do we need the !os.IsNotExist(err) portion of this condition?

Author

yes, I think we need it. If os.IsNotExist(err) evaluates to true then we don't want to evaluate d.IsDir() since d would be nil.

The alternative would be to use a nested if-else structure that interleaves error returns with additional operations (creation of default dir), which I don't like as much:

	d, err := os.Stat(dir)
	if err != nil {
		if os.IsNotExist(err) {
			if dirMustExist {
				return "", nil, fmt.Errorf("trace directory %q does not exist", dir)
			} else {
				if err := os.Mkdir(dir, 0o700); err != nil {
					return "", nil, fmt.Errorf("failed to create trace directory %q: %s", dir, err)
				}
			}
		} else {
			return "", nil, fmt.Errorf("failed to stat trace directory %q: %s", dir, err)
		}
	} else if !d.IsDir() {
		return "", nil, fmt.Errorf("trace directory %q is not a directory", dir)
	}
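A third option, sketched here only for comparison, is a flat switch that orders the cases so that d.IsDir() is reached only when os.Stat succeeded. The helper name ensureTraceDir, its reduced return type, and the dirDemo function are assumptions of this sketch and not code from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// ensureTraceDir (hypothetical) validates dir and, when allowed, creates it.
// The switch cases are ordered so that d is only used when err is nil.
func ensureTraceDir(dir string, dirMustExist bool) error {
	d, err := os.Stat(dir)
	switch {
	case err == nil && !d.IsDir():
		return fmt.Errorf("trace directory %q is not a directory", dir)
	case err == nil:
		return nil
	case !errors.Is(err, fs.ErrNotExist):
		return fmt.Errorf("failed to stat trace directory %q: %w", dir, err)
	case dirMustExist:
		return fmt.Errorf("trace directory %q does not exist", dir)
	}
	// The directory is missing and the caller allows creating it.
	if err := os.Mkdir(dir, 0o700); err != nil {
		return fmt.Errorf("failed to create trace directory %q: %w", dir, err)
	}
	return nil
}

// dirDemo exercises the three interesting cases in a temporary directory:
// missing with dirMustExist, missing without it, and a path that is a file.
func dirDemo() (mustExistErr, createErr, notDirErr, setupErr error) {
	base, err := os.MkdirTemp("", "trace-dir-demo")
	if err != nil {
		return nil, nil, nil, err
	}
	defer os.RemoveAll(base)

	missing := filepath.Join(base, "traces")
	mustExistErr = ensureTraceDir(missing, true)
	createErr = ensureTraceDir(missing, false)

	file := filepath.Join(base, "not-a-dir")
	if err := os.WriteFile(file, nil, 0o600); err != nil {
		return nil, nil, nil, err
	}
	notDirErr = ensureTraceDir(file, false)
	return mustExistErr, createErr, notDirErr, nil
}

func main() {
	mustExistErr, createErr, notDirErr, setupErr := dirDemo()
	if setupErr != nil {
		fmt.Fprintln(os.Stderr, setupErr)
		os.Exit(1)
	}
	fmt.Println("must exist:", mustExistErr)
	fmt.Println("create:", createErr)
	fmt.Println("regular file:", notDirErr)
}
```

Unlike the snippets above, this sketch wraps errors with %w so callers can still inspect the underlying cause with errors.Is.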

CI was complaining about missing comments on the new test function. It feels a bit silly to require this of tests but whatever XD
Labels
backport/1.18.x hashicorp-contributed-pr If the PR is HashiCorp (i.e. not-community) contributed
2 participants