In order to support polling inputs failing over when the primary node fails (and other use cases), we need a distributed lock/lease function supported in Graylog.
Notes
Will need a single interface with a Mongo implementation and possibly a Cloud implementation. Need an atomic function like:
boolean claim(lease_id, client_id, duration) {
    // the whole check-and-write must execute as one atomic operation
    if (lease exists) {
        if (client owns lease OR current time > expire time) {
            update lease record with client_id and new expire time (current time + duration)
            return true
        } else {
            return false
        }
    } else {
        insert lease record with client_id and expire time (current time + duration)
        return true
    }
}
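A minimal Java sketch of the claim semantics above, assuming an in-memory map stands in for the lease collection (Java because Graylog is written in Java). `InMemoryLeaseStore`, `Lease`, and the injected clock are hypothetical names, not existing Graylog APIs. `ConcurrentHashMap.compute` provides the per-key atomicity here; a MongoDB implementation would need the same check-and-write to happen in a single atomic database operation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical in-memory stand-in for the lease/lock collection.
final class InMemoryLeaseStore {
    // One record per lease ID; the map key plays the role of a unique index.
    record Lease(String clientId, long expiresAtMillis) {}

    private final Map<String, Lease> leases = new ConcurrentHashMap<>();
    private final LongSupplier clock; // injected time source, e.g. System::currentTimeMillis

    InMemoryLeaseStore(LongSupplier clock) {
        this.clock = clock;
    }

    /**
     * Claims or renews a lease. Returns true if clientId holds the lease
     * afterwards, false if another client holds an unexpired lease.
     */
    boolean claim(String leaseId, String clientId, long durationMillis) {
        long now = clock.getAsLong();
        // compute() runs the remapping function atomically for this key,
        // so two concurrent claims for the same lease cannot both win.
        Lease result = leases.compute(leaseId, (id, existing) -> {
            if (existing == null                                // no lease yet
                    || existing.clientId().equals(clientId)     // caller already owns it
                    || now > existing.expiresAtMillis()) {      // previous owner expired
                return new Lease(clientId, now + durationMillis);
            }
            return existing; // held by another client and not yet expired
        });
        return result.clientId().equals(clientId);
    }
}
```

Injecting the clock keeps the sketch testable and leaves room to substitute a single shared time source instead of each node's local clock.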
Will probably need a unique index on the lease ID in the lease/lock collection to enforce uniqueness. Need to consider how it will behave in a multi-node Mongo cluster.
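One way the unique index interacts with an atomic claim: if the claim is written as a single conditional upsert (match the lease ID plus "caller owns it OR it has expired", insert if nothing matches), then a lease held by another node makes the filter miss, the upsert tries to insert a second record with the same ID, and the unique index rejects it (MongoDB reports this as a duplicate-key error, code 11000). The claim must translate that error into `false`. The Java sketch below only models this behavior in memory; `UniqueLeaseCollection`, `conditionalUpsert`, and its nested `DuplicateKeyException` are hypothetical names, not MongoDB driver or Graylog APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a lease collection with a unique index on the lease ID.
final class UniqueLeaseCollection {
    record Lease(String owner, long expiresAt) {}

    // Stand-in for the database's duplicate-key error.
    static final class DuplicateKeyException extends RuntimeException {
        DuplicateKeyException(String key) {
            super("duplicate key: " + key);
        }
    }

    private final Map<String, Lease> docs = new HashMap<>();

    // Models a conditional upsert: update when the claim condition matches,
    // otherwise insert, which the unique index rejects if the ID already exists.
    synchronized void conditionalUpsert(String leaseId, String owner, long now, long duration) {
        Lease existing = docs.get(leaseId);
        if (existing != null
                && !existing.owner().equals(owner)
                && existing.expiresAt() >= now) {
            // Filter missed an existing record, so the implied insert collides.
            throw new DuplicateKeyException(leaseId);
        }
        docs.put(leaseId, new Lease(owner, now + duration));
    }

    // A lost race or a live foreign lease surfaces as a duplicate key: claim fails.
    boolean claim(String leaseId, String owner, long now, long duration) {
        try {
            conditionalUpsert(leaseId, owner, now, duration);
            return true;
        } catch (DuplicateKeyException e) {
            return false;
        }
    }
}
```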
Polling inputs will need to attempt to claim a lease before they can run. If the claim fails, they no-op.
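A sketch of how a polling input could gate each poll on the lease, assuming the input runs as a periodically scheduled task. `LeaseClaimer` and `LeaseGuardedPoller` are hypothetical names; the claimer could be backed by any implementation of the claim function above.

```java
// Hypothetical abstraction over whatever lease implementation backs the input.
interface LeaseClaimer {
    boolean claim(String leaseId, String clientId, long durationMillis);
}

// Hypothetical polling-input wrapper: each scheduled tick claims (or renews)
// the lease first and only polls when the claim succeeds.
final class LeaseGuardedPoller implements Runnable {
    private final LeaseClaimer leases;
    private final String inputId;    // used as the lease ID
    private final String nodeId;     // used as the client ID
    private final long leaseMillis;
    private final Runnable poll;     // the actual polling work

    LeaseGuardedPoller(LeaseClaimer leases, String inputId, String nodeId,
                       long leaseMillis, Runnable poll) {
        this.leases = leases;
        this.inputId = inputId;
        this.nodeId = nodeId;
        this.leaseMillis = leaseMillis;
        this.poll = poll;
    }

    @Override
    public void run() {
        if (!leases.claim(inputId, nodeId, leaseMillis)) {
            return; // another node holds the lease: no-op this tick
        }
        poll.run();
    }
}
```

Because the owner's own claim doubles as a renewal, the lease duration would presumably need to be comfortably longer than the polling interval so the active node renews before its lease expires.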
Input Criteria
TBD
Acceptance Criteria
TBD
Tasks
TBD
@waab76 Thanks for starting this issue! 👍 We talked about locking in the job scheduler context about two years ago, so I will add some thoughts from that discussion:
Use atomic database operations to avoid race conditions (we rely on the guarantees in MongoDB)
Nodes need to periodically update their leases to extend the expiration time and avoid other nodes claiming the lease
Use database time whenever possible instead of using the time of the database client (see MongoDB's $currentDate)
Reduces issues with out-of-sync clocks on client nodes
Even if the database clocks are off, we at least use a single time source
Relies on the database clocks to be in sync between database nodes
For the job scheduler we thought about using monotonic clocks instead of wall-clock time to avoid time sync issues
Each node updates a monotonic clock value on its leases
Requires each Graylog node to monitor the monotonic clock values on all leases and remember them to be able to detect when a node stops updating a lease
This makes it much more complicated, so we might get away with relying on the database's $currentDate to minimize wall clock issues
A node updating its lease needs to check that it's still the owner of that lease (must be an atomic database operation again)
Losing ownership might happen during long GC pauses, network partitions, etc.
If the lease has a new owner, stop the task that requires the lease
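The renewal and ownership rules in this comment can be sketched as follows, again as a hypothetical in-memory model in Java. The `LongSupplier` stands in for a single shared time source such as the database clock; `RenewableLeases`, `renewIfOwner`, and `heartbeat` are illustrative names only. The owner check and the extension happen in one atomic `computeIfPresent` call, mirroring the requirement that renewal be a single atomic database operation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical lease table keyed by lease ID.
final class RenewableLeases {
    record Lease(String owner, long expiresAt) {}

    final Map<String, Lease> leases = new ConcurrentHashMap<>(); // package-private for inspection
    private final LongSupplier dbClock; // single shared time source (assumption)

    RenewableLeases(LongSupplier dbClock) {
        this.dbClock = dbClock;
    }

    // Atomic "check owner, then extend": the lease is only extended if this
    // node still owns it. Returns false if it is gone or has a new owner.
    boolean renewIfOwner(String leaseId, String owner, long durationMillis) {
        long now = dbClock.getAsLong();
        Lease updated = leases.computeIfPresent(leaseId, (id, lease) ->
                lease.owner().equals(owner) ? new Lease(owner, now + durationMillis) : lease);
        return updated != null && updated.owner().equals(owner);
    }

    // Called periodically by the lease holder, well before the lease expires.
    void heartbeat(String leaseId, String owner, long durationMillis, Runnable stopTask) {
        if (!renewIfOwner(leaseId, owner, durationMillis)) {
            stopTask.run(); // ownership lost (GC pause, partition, ...): stop the task
        }
    }
}
```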