Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow multiple private keys but use the latest one #1748

Open
wants to merge 67 commits into
base: main
Choose a base branch
from

Conversation

h2zh
Copy link
Collaborator

@h2zh h2zh commented Nov 15, 2024

Overview

In issue #561, we want to rotate out old private key when the new one is provided. This PR mainly changed the functions chain starting from GetIssuerPrivateJWK(), in order to support multiple private key files in a new directory specified by a new config param IssuerKeysDirectory, along with a new goroutine to monitor the changes there every 5 mins.

New Key

In this PR, admin can create new private key through either of the following two ways:

  1. Simply drop the private key into into the new directory as a .pem file (the location where config param IssuerKeysDirectory points to, which by default is /etc/pelican/issuer-keys), or modify any .pem there. Pelican program will pick up the last modified .pem file through a goroutine running every 5 mins, set the private key it contains as the active one, add it to an in-memory map, and update the public key in registry sqlite db.
  2. Hit the API endpoint “/api/v1.0/origin_ui/newIssuerKey”. Then the new .pem file will be created. Then it will wait for the same goroutine to be loaded (note: The web ui for this API is not included).

Security

How do we know the origin who updates the namespace's public key is the same origin registered this namespace? We authorize the origin on the registry side: require the public keys carried by the key update HTTP request (origin-->registry) must contain the public this origin previously registered, and check the proof of possession for both origin's current active private key and the previous private key, whose can prove the origin owns this namespace(prefix). The latter proof of possession verification uses a "previous private key's signature" sent by origin and verified by the public key stored in db on registry side.

Design thinking

Existing file hierarchy:

├── whatever-parent-dir/ (by default /etc/pelican)                             
│   ├── issuer.jwk            # Existing key location, specified by config param “IssuerKey”

To ensure backward compatibility, the new directory that stores the private keys (saved as .pem files) is mounted to a new config param IssuerKeysDirectory. User doesn’t need to change the value of IssuerKey, though all functions are not longer using it. The private key IssuerKey refers to will be migrated to the new directory specified by IssuerKeysDirectory. The new file hierarchy looks like this:

├── whatever-parent-dir/ (by default /etc/pelican)                             
│   └── issuer-keys/          # The new directory storing multiple private keys where “IssuerKeysDirectory” refers to
│       ├── pelican_generated_<timestamp_1>_<randomChars>.pem    # If no private key is provided, new .pem file will be created
│       ├── pelican_generated_<timestamp_2>_<randomChars>.pem         
│       ├── ...                 
│       ├── <system admin's fav name>.pem
│       └── migrated_<timestamp_2>_<randomChars>.pem    # previous key if it exists, migrated from the location where “IssuerKey” points to

Algorithm efficiency tradeoff

For a map using atomic.pointer, it requires to copy the entire map before update any key-value pair. In the loadPEMFiles func, because it runs every 5 minutes to update the in-memory key map from the file, which is not frequent, we think the memory use by a map using atomic.pointer is still acceptable.

Test

  • No private key file
    --> new key created: in-memory active key == public key of this origin's namespaces in registry db

  • One existing private key file, adding a new private key
    a) an existing private key file at config param IssuerKey (usually .../issuer.jwk), adding a new private key manually
    b) an existing private key file at config param IssuerKey (usually .../issuer.jwk), adding a new private key via API
    c) an existing private key file at config param IssuerKeysDirectory (usually .../issuer-keys), adding a new private key manually
    d) an existing private key file at config param IssuerKeysDirectory (usually .../issuer-keys), adding a new private key via API
    --> result of any scenario in this section: in-memory active key: second; public key of this origin's namespaces in registry db: second

  • Two existing private key files
    a) never run pelican before (empty registry db)
    --> in-memory active key: second; public key of this origin's namespaces in registry db: second
    b) first key is registered in registry db
    --> in-memory active key: second; public key of this origin's namespaces in registry db: second
    c) second key is registered in registry db
    --> in-memory active key: second; public key of this origin's namespaces in registry db: second
    d) second key is registered in registry db, adding the third key through API
    --> in-memory active key: third; public key of this origin's namespaces in registry db: third

Note: "first", "second" key are keys in chronological order of their private key file modification time.
In-memory active key can be copied at the top-right key icon in origin webUI; db public key of this namespace can be found at the registry.sqlite file, whose directory is specified by config param Registry.DbLocation, or in etc/pelican by default

Future work

At this moment, we use the most recent modified .pem file as the active private key. In the next step, we want to allow multiple active private keys in use. #1818

Eventually, when multiple origins have the same namespace, the admin of these origins can append a common private key to all. Then these origin servers will register the corresponding shared public key in the registry. Currently this operation needs to manually done by OSDF admin.

@h2zh h2zh requested a review from matyasselmeci November 19, 2024 14:40
@h2zh h2zh linked an issue Nov 19, 2024 that may be closed by this pull request
@h2zh h2zh added enhancement New feature or request origin Issue relating to the origin component registry Issue relating to the registry component go Pull requests that update Go code security labels Nov 26, 2024
@jhiemstrawisc
Copy link
Member

One other quick comment, now that I think we've settled on an answer -- wherever there's some JSON that gets passed back and forth via API calls, we should stick to camel case:
https://github.com/orgs/PelicanPlatform/discussions/1734#discussioncomment-11621001

For example, the new RegisteredPrefixUpdate struct from registry/registry_pubkey_update.go uses snake case when it defines JSON tags like json:"client_nonce". Unless we're worried about a specific backwards compatibility issue here, these should be defined like json:"clientNonce"

@h2zh h2zh force-pushed the multiple-private-keys branch from 19ce7bb to 85bc504 Compare December 20, 2024 19:30
Comment on lines 303 to 304
func() { assert.NoError(t, egrp.Wait()) }()
cancel()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is clearly the deadlock and looks like some bad copy/pasting without understanding what the original code does.

The original was:

	defer func() { require.NoError(t, egrp.Wait()) }()
	defer cancel()

Since defer is executed LIFO, this will invoke cancel() followed by the Wait() (to wait on the canceled functions to clean up).

The revised code is effectively this:

         defer func() {
	        func() { require.NoError(t, egrp.Wait()) }()
	        cancel()
          }

Which waits on the group to stop, then shuts it down, a clear deadlock.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out and fixing this lingering problem! Having said that, this is just an intermediate problem in the tests - the root problem is topology mock server doesn't response in TestRegistryKeyChainingOSDF and test timed out in TestRegistryKeyChaining (see the error logs in the tests result). These two errors were suppressed if the other two tests in client_commands_test.go are commented out. Justin figured out there is a test state leak because commenting out the other tests allowed both TestRegistryKeyChaining and TestRegistryKeyChainingOSDF to pass. I'm still looking for the leak and hoping fixing the problems mentioned in your new PR comments could potentially fix the leak.

Copy link
Collaborator Author

@h2zh h2zh Dec 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  Test result Test result
Comment Out ⬇️ TestRegistryKeyChainingOSDF TestRegistryKeyChaining
TestServeNamespaceRegistry
TestMultiPubKeysRegisteredOnNamespace
Both test funcs above ✔️ ✔️

The current code had the logic reversed -- first it waited for completion,
then it cancelled the running goroutines.
Copy link
Collaborator

@bbockelm bbockelm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's still a lot to do here. Some minor stylistic / convention comments in the review -- plus a few major requests:

  1. Redo the handshake between origin and registry to allow arbitrary replacement of the public keys.
  2. Move the key refresh logic into the config module and add unit tests. Make the updates atomic and have things in lexicographical order.
  3. Do not deprecate the existing issuer key (at least not in this version!) to allow external folks to adapt to the new setup. Do not delete a key that an admin has provided. Instead, behave as if the issuer key was found within the key directory.

@@ -39,7 +39,7 @@ func keygenMain(cmd *cobra.Command, args []string) error {
return errors.Wrap(err, "failed to get the current working directory")
}
if privateKeyPath == "" {
privateKeyPath = filepath.Join(wd, "issuer.jwk")
privateKeyPath = filepath.Join(wd, "issuer-keys")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work? Before, the variable was a file and now it's a directory. All the code below operates on this as if it was a directory.

On first glance, appears that the tool is broken, no?

Copy link
Collaborator Author

@h2zh h2zh Dec 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the private/public generation logic lies in config.GetIssuerPublicJWKS(), which was already updated to incorporate the new keys in IssuerKeysDirectory instead of only one key file at IssuerKey. The pelican generate keygen command works as expected (creating a key in ./issuer-keys).

config/init_server_creds.go Show resolved Hide resolved
}

// Rename the existing private key file and set destination path
fileName := fmt.Sprintf("migrated_%d_%s.pem",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the appropriate mkstemp to create a unique file. This is not sufficient.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I also use the mkstemp logic in another similar function GeneratePEM

newKey, err := loadIssuerPrivateJWK(issuerKeyFile)
issuerKeysDir := param.IssuerKeysDirectory.GetString()
currentIssuerKeysDir := getCurrentIssuerKeysDir()
// Handles runtime changes to the issuer keys directory (configured via "IssuerKeysDirectory" parameter).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the tests should take a different approach: reset the config state between the calls. Otherwise, it clutters up the runtime code.

config/init_server_creds.go Show resolved Hide resolved
// Check the origin is authorized to update (possessing the public key used for prefix initial registration)
// Parse all public keys of the sender into a JWKS
var clientKeySet jwk.Set
if data.AllPubkeys == nil { // backward compatibility - AllPubkeys only exists in the payload in Pelican 7.12 or later
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused by this.

data is from the client -- why is there backward compatibility code in an API call introduced in 7.12?

continue
}

if registryDbKid == clientKid {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unnecessary.

The client should prove it possess one of the existing keys.

However, this additionally forces the client to keep the key it used for the challenge in the updated key set. That seems unnecessary. We should simply see if the key used to sign is one of the existing known keys.

}

// Check if any key in `clientKeySet` matches a key in `registryDbKeySet`
registryDbKeysIter := registryDbKeySet.Keys(ctx)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These loops could be replaced by the LookupKeyID method.


// Update the public key of registered prefix(es) if the http request passed client and server verification for nonce.
// It returns the response data, and an error if any
func updateNsKeySignChallengeCommit(ctx *gin.Context, data *RegisteredPrefixUpdate) (map[string]interface{}, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the design here is still a bit confused. For example, it removes the previous key -- not clear that is matching the intent here.

Let me suggest something cleaner:

  1. At the start of handshake, the registry sends all known public keys for the namespace.
  2. The client receives the list of public keys and selects a corresponding private key from that list (if one is available) to sign the challenge.
  3. The client sends the server the updated set of public keys plus the proof of possession from one of the known keys.

Thus, at the end of the handshake:

  • The client has demonstrated it owns a private key from the existing known public keys.
  • The client's desired set of keys is synchronized with the registry; the client can remove or add any arbitrary keys in the call.

registry/client_commands_test.go Outdated Show resolved Hide resolved
@bbockelm
Copy link
Collaborator

Oh! One final item from the review -- please rebase and squash down the commits to a clean list.

Each commit should be self-contained (compiles, passes tests), covering a single piece of functionality (such as adding the key directory logic; or adding the API to update the keys), and have a detailed commit message. Please refer to https://osg-htc.org/technology/software/git-software-development/#making-good-pull-requests-the-art-of-good-commits for some best practices. Given the scope of the changes, I'd expect 4-5 commits to come out on this branch -- not 60.

I typically like keeping the git history in response to reviewer comments to help the reader understand the evolution of ideas. However, at this point, the history of the PR is so messy that it'd be impossible to glean that knowledge -- more value is in a clean branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request go Pull requests that update Go code origin Issue relating to the origin component registry Issue relating to the registry component security
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow origin issuer key to be rotated
4 participants