Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CA-394343: After clock jump the xapi assumed the host is HOST_OFFLINE #5700

Merged
merged 1 commit into from
Jul 23, 2024

Conversation

minglumlu
Copy link
Member

@minglumlu minglumlu commented Jun 18, 2024

Prior to this commit, the xapi on the coordinator host records the 'Unix.gettimeofday' as the timestamps of the heartbeat with other pool supporter hosts. When the system clock is updated with a huge jump forward, the timestamps would be far back into the past. This would cause the xapi assumes that the hosts are offline as long time no heartbeats.

In this commit, the timestamps are changed to get from a monotonic clock. In this way, the system clock changes will not impact the heartbeats' timestamps any more.

Additionally, Host_metrics.last_updated is only set when the object is created. It's useless in check_host_liveness at all.


This change is Reviewable

Copy link

codecov bot commented Jun 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #5700     +/-   ##
========================================
+ Coverage    44.8%   45.5%   +0.6%     
========================================
  Files          16      13      -3     
  Lines        2210    1624    -586     
========================================
- Hits          992     739    -253     
+ Misses       1218     885    -333     

see 8 files with indirect coverage changes

Flag Coverage Δ
python2.7 45.5% <ø> (ø)
python3.11 ?

Flags with carried forward coverage won't be shown. Click here to find out more.

@minglumlu minglumlu force-pushed the private/mingl/CA-394343 branch from df9ca7b to 205fb48 Compare June 18, 2024 12:55
ocaml/xapi/db_gc.ml Outdated Show resolved Hide resolved
ocaml/xapi/db_gc.ml Outdated Show resolved Hide resolved
ocaml/xapi/db_gc.ml Outdated Show resolved Hide resolved
ocaml/xapi/db_gc.ml Outdated Show resolved Hide resolved
ocaml/xapi/db_gc.ml Outdated Show resolved Hide resolved
@lindig
Copy link
Contributor

lindig commented Jun 19, 2024

Robustness against clock changes is desirable but I question that it can be expected for large clock changes. That's like designing for power outages - it needs to be an explicit goal right from the start.

@minglumlu
Copy link
Member Author

Robustness against clock changes is desirable but I question that it can be expected for large clock changes. That's like designing for power outages - it needs to be an explicit goal right from the start.

Yes. The robustness is the major concern of this change. The automation testing relies on the large clock changes heavily, although there may not be common cases in production. But I think improving the robustness in xapi would resolve the potential issues in either case.

@minglumlu minglumlu force-pushed the private/mingl/CA-394343 branch from a9b5612 to 01c3493 Compare June 21, 2024 07:44
@minglumlu minglumlu requested a review from psafont June 21, 2024 07:45
ocaml/xapi/xapi_globs.ml Outdated Show resolved Hide resolved
ocaml/xapi/xapi_globs.ml Outdated Show resolved Hide resolved
ocaml/xapi/xapi_globs.ml Outdated Show resolved Hide resolved
@minglumlu minglumlu force-pushed the private/mingl/CA-394343 branch 2 times, most recently from e418f5d to 9603537 Compare July 4, 2024 10:54
Copy link
Member

@psafont psafont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The biggest is the drop of the retrieval of the timestamp from the metrics, I don't think that code was useful

@coveralls
Copy link

Coverage Status

coverage: 44.887%. remained the same
when pulling 9603537 on minglumlu:private/mingl/CA-394343
into e61e0ac on xapi-project:master.

Prior to this commit, the xapi on the coordinator host records the
'Unix.gettimeofday' as the timestamps of the heartbeat with other pool
supporter hosts. When the system clock is updated with a huge jump
forward, the timestamps would be far back into the past. This would
cause the xapi assumes that the hosts are offline as long time no
heartbeats.

In this commit, the timestamps are changed to get from a monotonic
clock. In this way, the system clock changes will not impact the
heartbeats' timestamps any more.

Additionally, Host_metrics.last_updated is only set when the object is
created. It's useless in check_host_liveness at all.

Signed-off-by: Ming Lu <[email protected]>
@minglumlu minglumlu force-pushed the private/mingl/CA-394343 branch from 9603537 to 9291f21 Compare July 9, 2024 04:11
@minglumlu
Copy link
Member Author

Rebased with the master branch to resolve the conflict.

Comment on lines +93 to +95
| None ->
Clock.Timer.start
~duration:!Xapi_globs.host_assumed_dead_interval
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in case of None, it means the host has never tickle heartbeat? Then we should consider it as expired?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

287 let start_db_gc_thread () =
288   Thread.create
289     (fun () ->
290       Debug.with_thread_named "db_gc"
291         (fun () ->
292           while true do
293             try Thread.delay db_GC_TIMER ; single_pass ()
294             with e ->
295               debug "Exception in DB GC thread: %s" (ExnHelper.string_of_exn e)
296           done
297         )
298         ()
299     )
300     ()
301

For
293 try Thread.delay db_GC_TIMER ; single_pass ()
As we use a timer to check heartbeat and a default timer for each host, I think here we should start single_pass immediately.

@gangj
Copy link
Contributor

gangj commented Jul 15, 2024

Additionally, Host_metrics.last_updated is only set when the object is
created. It's useless in check_host_liveness at all.

I guess it was updated periodically with some time interval in some old version xapi which did not tickle_heartbeat yet, so it was considered as the heartbeat for the new added heartbeat mechanism, add then it was removed in the new heartbeat version.
As all hosts will tickle_heartbeat now, it is useless now.

@psafont
Copy link
Member

psafont commented Jul 22, 2024

Is there any testing left to do? This seems ready to merge otherwise

@gangj gangj requested a review from psafont July 23, 2024 03:00
Copy link
Contributor

@gangj gangj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 9 unresolved discussions (waiting on @minglumlu and @psafont)


ocaml/xapi/db_gc.ml line 95 at r1 (raw file):

Previously, gangj (Gang Ji) wrote…
287 let start_db_gc_thread () =
288   Thread.create
289     (fun () ->
290       Debug.with_thread_named "db_gc"
291         (fun () ->
292           while true do
293             try Thread.delay db_GC_TIMER ; single_pass ()
294             with e ->
295               debug "Exception in DB GC thread: %s" (ExnHelper.string_of_exn e)
296           done
297         )
298         ()
299     )
300     ()
301

For
293 try Thread.delay db_GC_TIMER ; single_pass ()
As we use a timer to check heartbeat and a default timer for each host, I think here we should start single_pass immediately.

How about this change?

Copy link
Member

@psafont psafont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @gangj and @minglumlu)

Copy link
Member

@psafont psafont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion (waiting on @gangj)


ocaml/xapi/db_gc.ml line 95 at r1 (raw file):

Previously, gangj (Gang Ji) wrote…

How about this change?

I'd rather keep the logic the same. This can be enhanced in another PR

Copy link
Member

@psafont psafont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @gangj)

Copy link
Contributor

@gangj gangj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @psafont)


ocaml/xapi/db_gc.ml line 95 at r1 (raw file):

Previously, psafont (Pau Ruiz Safont) wrote…

I'd rather keep the logic the same. This can be enhanced in another PR

With this delay, the first time a host's liveness is checked after xapi startup is not !Xapi_globs.host_assumed_dead_interval, it is (db_GC_TIMER + !Xapi_globs.host_assumed_dead_interval). I agree it is a small issue.

May I know which logic do you mean to "keep the logic the same" as?

@psafont psafont merged commit 090e846 into xapi-project:master Jul 23, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants