-
Notifications
You must be signed in to change notification settings - Fork 285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CA-394343: After clock jump the xapi assumed the host is HOST_OFFLINE #5700
CA-394343: After clock jump the xapi assumed the host is HOST_OFFLINE #5700
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅ Additional details and impacted files@@ Coverage Diff @@
## master #5700 +/- ##
========================================
+ Coverage 44.8% 45.5% +0.6%
========================================
Files 16 13 -3
Lines 2210 1624 -586
========================================
- Hits 992 739 -253
+ Misses 1218 885 -333 see 8 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. |
df9ca7b
to
205fb48
Compare
Robustness against clock changes is desirable but I question that it can be expected for large clock changes. That's like designing for power outages - it needs to be an explicit goal right from the start. |
Yes. The robustness is the major concern of this change. The automation testing relies on the large clock changes heavily, although there may not be common cases in production. But I think improving the robustness in xapi would resolve the potential issues in either case. |
a9b5612
to
01c3493
Compare
e418f5d
to
9603537
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The biggest is the drop of the retrieval of the timestamp from the metrics, I don't think that code was useful
Prior to this commit, the xapi on the coordinator host records the 'Unix.gettimeofday' as the timestamps of the heartbeat with other pool supporter hosts. When the system clock is updated with a huge jump forward, the timestamps would be far back into the past. This would cause the xapi assumes that the hosts are offline as long time no heartbeats. In this commit, the timestamps are changed to get from a monotonic clock. In this way, the system clock changes will not impact the heartbeats' timestamps any more. Additionally, Host_metrics.last_updated is only set when the object is created. It's useless in check_host_liveness at all. Signed-off-by: Ming Lu <[email protected]>
9603537
to
9291f21
Compare
Rebased with the master branch to resolve the conflict. |
| None -> | ||
Clock.Timer.start | ||
~duration:!Xapi_globs.host_assumed_dead_interval |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in case of None
, it means the host has never tickle heartbeat? Then we should consider it as expired?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
287 let start_db_gc_thread () =
288 Thread.create
289 (fun () ->
290 Debug.with_thread_named "db_gc"
291 (fun () ->
292 while true do
293 try Thread.delay db_GC_TIMER ; single_pass ()
294 with e ->
295 debug "Exception in DB GC thread: %s" (ExnHelper.string_of_exn e)
296 done
297 )
298 ()
299 )
300 ()
301
For
293 try Thread.delay db_GC_TIMER ; single_pass ()
As we use a timer to check heartbeat and a default timer for each host, I think here we should start single_pass
immediately.
I guess it was updated periodically with some time interval in some old version xapi which did not |
Is there any testing left to do? This seems ready to merge otherwise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 3 files reviewed, 9 unresolved discussions (waiting on @minglumlu and @psafont)
ocaml/xapi/db_gc.ml
line 95 at r1 (raw file):
Previously, gangj (Gang Ji) wrote…
287 let start_db_gc_thread () = 288 Thread.create 289 (fun () -> 290 Debug.with_thread_named "db_gc" 291 (fun () -> 292 while true do 293 try Thread.delay db_GC_TIMER ; single_pass () 294 with e -> 295 debug "Exception in DB GC thread: %s" (ExnHelper.string_of_exn e) 296 done 297 ) 298 () 299 ) 300 () 301
For
293 try Thread.delay db_GC_TIMER ; single_pass ()
As we use a timer to check heartbeat and a default timer for each host, I think here we should startsingle_pass
immediately.
How about this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @gangj and @minglumlu)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @gangj)
ocaml/xapi/db_gc.ml
line 95 at r1 (raw file):
Previously, gangj (Gang Ji) wrote…
How about this change?
I'd rather keep the logic the same. This can be enhanced in another PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r1.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @gangj)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @psafont)
ocaml/xapi/db_gc.ml
line 95 at r1 (raw file):
Previously, psafont (Pau Ruiz Safont) wrote…
I'd rather keep the logic the same. This can be enhanced in another PR
With this delay, the first time a host's liveness is checked after xapi startup is not !Xapi_globs.host_assumed_dead_interval, it is (db_GC_TIMER + !Xapi_globs.host_assumed_dead_interval). I agree it is a small issue.
May I know which logic do you mean to "keep the logic the same" as?
Prior to this commit, the xapi on the coordinator host records the 'Unix.gettimeofday' as the timestamps of the heartbeat with other pool supporter hosts. When the system clock is updated with a huge jump forward, the timestamps would be far back into the past. This would cause the xapi assumes that the hosts are offline as long time no heartbeats.
In this commit, the timestamps are changed to get from a monotonic clock. In this way, the system clock changes will not impact the heartbeats' timestamps any more.
Additionally, Host_metrics.last_updated is only set when the object is created. It's useless in check_host_liveness at all.
This change is