Distributed Lock/Lease Function #773

Open
3 tasks
waab76 opened this issue Mar 17, 2021 · 1 comment

waab76 commented Mar 17, 2021

To let polling inputs fail over when the primary node fails (among other use cases), Graylog needs a distributed lock/lease function.

Notes

Will need a single interface with a Mongo implementation and possibly a Cloud implementation. Need an atomic function like:

boolean claim(lease_id, client_id, duration) {
  if (lease exists) {
    if (client owns lease OR current time > expire time) {
      upsert lease record with client_id and new expire time
      return true
    } else {
      return false
    }
  } else {
    claim lease for client with expire time
    return true
  }
}
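
A minimal Java sketch of the claim semantics above, assuming an in-memory map stands in for the database. The names `LeaseService` and `InMemoryLeaseService` are hypothetical and not existing Graylog classes, and `now` is passed in explicitly only to keep the example deterministic. A Mongo implementation would have to get the same check-and-write atomicity from a single database operation.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical interface; name and signature are illustrative only.
interface LeaseService {
    boolean claim(String leaseId, String clientId, Duration duration, Instant now);
}

// In-memory stand-in for a database-backed implementation (an assumption).
final class InMemoryLeaseService implements LeaseService {
    // Immutable lease record: current owner and expiry time.
    private static final class Lease {
        final String owner;
        final Instant expiresAt;

        Lease(String owner, Instant expiresAt) {
            this.owner = owner;
            this.expiresAt = expiresAt;
        }
    }

    private final ConcurrentMap<String, Lease> leases = new ConcurrentHashMap<>();

    @Override
    public boolean claim(String leaseId, String clientId, Duration duration, Instant now) {
        Lease fresh = new Lease(clientId, now.plus(duration));
        // compute() performs the check and the write atomically for this key.
        Lease result = leases.compute(leaseId, (id, current) -> {
            boolean free = current == null || now.isAfter(current.expiresAt);
            boolean mine = current != null && current.owner.equals(clientId);
            return (free || mine) ? fresh : current;
        });
        // The claim succeeded only if our new record was stored.
        return result == fresh;
    }
}
```

With this sketch, a second client's claim fails until the current expiry time has passed, while the owner can re-claim at any time to extend the lease.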

Will probably need a unique index on the lease/lock collection (e.g. on the lease ID) so that two nodes cannot create the same lease concurrently. Need to consider how it will behave in a multi-node Mongo cluster.

Polling inputs will need to attempt to claim a lease before they can run. If the claim fails, they no-op.
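
As a rough sketch of that claim-or-no-op behavior, assuming the claim is exposed as a simple boolean call: `LeaseGuardedPoller` is a hypothetical name, and the `BooleanSupplier` stands in for whatever claim function ends up being implemented.

```java
import java.util.function.BooleanSupplier;

// Hypothetical wrapper: runs one poll cycle only when the lease claim succeeds.
final class LeaseGuardedPoller {
    private final BooleanSupplier claimLease; // stand-in for the real claim call
    private final Runnable poll;              // the input's actual polling work

    LeaseGuardedPoller(BooleanSupplier claimLease, Runnable poll) {
        this.claimLease = claimLease;
        this.poll = poll;
    }

    // Returns true if the poll ran, false if this cycle was a no-op.
    boolean runOnce() {
        if (!claimLease.getAsBoolean()) {
            return false; // another node holds the lease, so do nothing
        }
        poll.run();
        return true;
    }
}
```

Calling `runOnce()` on every scheduled cycle means a standby node takes over on its next cycle after the primary's lease expires.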

Input Criteria

  • TBD

Acceptance Criteria

  • TBD

Tasks

  • TBD

bernd commented Mar 18, 2021

@waab76 Thanks for starting this issue! 👍 We talked about locking in the job scheduler context about two years ago so I will add some thoughts from that discussion:

  • Use atomic database operations to avoid race conditions (we rely on the guarantees in MongoDB)
  • Nodes need to periodically update their leases to extend the expiration time and avoid other nodes claiming the lease
  • Use database time whenever possible instead of using the time of the database client (see MongoDB's $currentDate)
    • Reduces issues with out-of-sync clocks on client nodes
    • Even if database clock(s) is/are off, we at least use a single time source
    • Relies on the database clocks being in sync between database nodes
    • For the job scheduler we thought about using monotonic clocks instead of wall-clock time to avoid time sync issues
      • Each node updates a monotonic clock on its leases
      • Requires each Graylog node to monitor the monotonic clocks on all leases and remember the values to be able to detect when a node stops updating the lease
      • This makes it much more complicated, so we might get away with relying on the database's $currentDate to minimize wall clock issues
  • A node updating its lease needs to check that it's still the owner of that lease (must be an atomic database operation again)
    • Losing ownership might happen during long GC pauses, network partitions, etc.
    • If the lease has a new owner, stop the task that requires the lease
