Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Publish timetable snapshot in background #5974

Merged

Conversation

vpaturet
Copy link
Contributor

@vpaturet vpaturet commented Jul 17, 2024

Summary

In the current implementation, the timetable snapshot can be committed:

Committing the snapshot from a reader thread should be avoided: it delays the API call and it makes the implementation more complex and fragile due to synchronization between threads.

This PR modifies the snapshot lazy-loading logic so that it does not trigger a snapshot commit from reader threads.

This allows also to remove synchronization on the snapshot creation process since only the graph writer thread can modify the snapshot.
There is still a need for synchronizing the publishing of the new snapshot to guarantee that all changes that lead to the creation of the snapshot happens-before the publication. In the current implementation this is guaranteed by a volatile field. This PR makes use of the ConcurrentPublished mechanism to align with the publication mechanism used in the TransitLayer.

In addition to this, it makes it possible to construct a TransitService that reflects the current state of the buffer rather than the state of the latest published snapshot.
This is useful when used in the context of an updater, so that the TransitService can see pending changes.

A downside of committing snapshots only when applying updates is that, in case of throttling, the pending changes will be deferred until the next real-time message arrives, after the throttling period. This is not an issue with updaters that receive a steady flow of messages, but could be problematic with updaters that receive messages in batches followed by silent periods.
To cover this edge case, a periodic task TimetableSnapshotFlush commits pending changes, if any. In addition the updaters do not commit a snapshot after an update anymore, leaving TimetableSnapshotFlush the only place where a snapshot can be created. The throttling logic is removed from TimetableSnapshotManager

Issue

No

Unit tests

No

Documentation

No

@vpaturet vpaturet added Technical Debt Real-Time Update The issue/PR is related to RealTime updates Skip Changelog labels Jul 17, 2024
@vpaturet vpaturet self-assigned this Jul 17, 2024
@vpaturet vpaturet force-pushed the publish_timetable_snapshot_in_background branch 2 times, most recently from 592d65a to 944b615 Compare July 17, 2024 15:11
Copy link

codecov bot commented Jul 17, 2024

Codecov Report

Attention: Patch coverage is 67.46032% with 41 lines in your changes missing coverage. Please review.

Project coverage is 69.64%. Comparing base (fa60b78) to head (cd8e29e).

Files Patch % Lines
...pplanner/updater/trip/TimetableSnapshotSource.java 74.32% 14 Missing and 5 partials ⚠️
...ripplanner/updater/spi/TimetableSnapshotFlush.java 0.00% 15 Missing ⚠️
...planner/updater/configure/UpdaterConfigurator.java 28.57% 4 Missing and 1 partial ⚠️
...pplanner/ext/siri/SiriTimetableSnapshotSource.java 90.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##             dev-2.x    #5974      +/-   ##
=============================================
- Coverage      69.66%   69.64%   -0.02%     
- Complexity     17163    17165       +2     
=============================================
  Files           1942     1943       +1     
  Lines          73784    73795      +11     
  Branches        7550     7552       +2     
=============================================
- Hits           51401    51395       -6     
- Misses         19753    19772      +19     
+ Partials        2630     2628       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@vpaturet vpaturet force-pushed the publish_timetable_snapshot_in_background branch 3 times, most recently from d3aaee0 to c586bd6 Compare July 19, 2024 08:35
@vpaturet
Copy link
Contributor Author

vpaturet commented Jul 19, 2024

@abyrd @leonardehrenfried From my perspective this PR is complete. There is still one open question: within an updater, should the TransitEditorService points to the latest published snapshot, to the snapshot buffer or none of them. This can be discussed in the next developer meeting.
I will rebase this PR and mark it as ready for review when #5973 get merged.

@leonardehrenfried leonardehrenfried requested a review from t2gran July 23, 2024 09:05
@vpaturet vpaturet force-pushed the publish_timetable_snapshot_in_background branch from b5c0104 to 707e01d Compare July 23, 2024 10:42
@vpaturet vpaturet force-pushed the publish_timetable_snapshot_in_background branch from 707e01d to 7fe6999 Compare July 24, 2024 08:02
@vpaturet vpaturet marked this pull request as ready for review July 24, 2024 08:45
@vpaturet vpaturet requested a review from a team as a code owner July 24, 2024 08:45
@vpaturet
Copy link
Contributor Author

@t2gran where would TimetableSnapshotFlush fit in the package structure? I don't see any obvious location.

@vpaturet
Copy link
Contributor Author

I added a couple of debug statements in TimetableSnapshotFlush to get an idea of the time required to create the snapshot and to index the TransitLayer. These debug statements should probably be removed at some point.

* duplicating a TripPattern -> Timetable map and indexing the new Timetables.
*/
private final CountdownTimer snapshotFrequencyThrottle;
private final ConcurrentPublished<TimetableSnapshot> snapshot = new ConcurrentPublished<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I already know the answer but want to ask anyway: the use of ConcurrentPublished adds another layer of safety even though it isn't strictly necessary as there are never two writer threads active at the same time. Correct?

Or is there a case where its synchronization features are useful?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synchronization has 2 properties: mutual exclusion and safe publication.
Mutual exclusion is not needed here since, like you said, there is only one writer thread and publishing the snapshot is an atomic operation (updating a reference).
What we need here is safe publication, that is: making sure that the reader thread sees all the modifications done by the writer thread. In practice it means purging the CPU caches and registers so that the reader thread does not use stale data.
Using a volatile field, as done in the current implementation, provides also a guarantee of safe publication, but it is a bit more obscure than using the ConcurrentPublished mechanism.

More details there: https://shipilev.net/blog/2014/safe-public-construction/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ConcurrentPublished is simple enough, but would the Java AtomicReference be a better alternative. There are pros/cons here, and in the end I think we will revisit this, so maybe leave it for now.

Copy link
Member

@abyrd abyrd Jul 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that AtomicReference is used when you need to do a compare-and-swap (check that the reference is the one you think it is before replacing it). Isn't the ConcurrentPublished synchronization doing something deeper here by imposing happens-before, and blocking reordering-based optimizations?

An example of the problem I think it's avoiding:
Code running on the single graph-writer thread changes some sub-object referenced by the snapshot, such as a TripTimes. It then updates the shared reference to the snapshot containing those TripTimes, but due to some reordering optimization (which would be harmless in a single threaded environment) the snapshot reference becomes visible before the new TripTimes sub-object becomes visible. Even though setting a reference to the snapshot would always be atomic (no word-tearing when seen from other threads), there's no guarantee that the reading thread sees the new snapshot reference after the changes to the TripTimes it contains are in place. Synchronization should establish a kind of reordering "fence" where everything appearing higher up in the source code is definitely visible to the reader.

We're talking about a place where there's likely to be read/write contention only once every 1-5 seconds so extremely rarely at the CPU-timescale. It's probably harmless to synchronize as ConcurrentPublished does, to get the benefit of strong ordering effects.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vpaturet checked the Javadoc in detail. For these purposes, volatile, AtomicReference (Javadoc on the atomic package rather than one class), and synchronization all work because they all establish happens-before relationships, and we don't have to deal with read/write contention here.

@leonardehrenfried
Copy link
Member

Since we only have a single write thread, how is the case of a very busy updater handled? How strong is the guarantee of a scheduleWithFixedDelay? Are the runnables processed in the order they are submitted?

@vpaturet
Copy link
Contributor Author

Since we only have a single write thread, how is the case of a very busy updater handled? How strong is the guarantee of a scheduleWithFixedDelay? Are the runnables processed in the order they are submitted?

The runnables are processed in the order they are submitted but the scheduled task skips the queue: it is delayed at most by the duration of the updater task running right before.
Testing locally with Entur's setup (SIRI PubSub updater receiving messages every 40ms), the periodic task gets scheduled as expected.

return snapshot.get();
}

public TimetableSnapshot getTimetableSnapshotBuffer() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add Javadoc under what circumstances you want to have the buffer rather than the snapshot?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

t2gran
t2gran previously approved these changes Jul 24, 2024
gtfsTimetableSnapshotSource.flushBuffer();
}
LOG.debug("Flushed timetable snapshot buffer");
} catch (Throwable t) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Throwable opens up for a potensial critical error state, where the OTP Server is in a corrupt state, but does not go down. But the pragmatic programmer in me tells me that we probably have bigger issues in the code, and that we should fokus on those. To fix it we would need to catch both RuntimeException and Throwable. Log both, but take down the application for Throwable. I think it is out-of-scope for this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree.

Is there ever a case where you can recover from an OutOfMemoryError? I think in such a case you really want to let the JVM die.

Copy link
Contributor Author

@vpaturet vpaturet Jul 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a real case in a web server: a request thread uses up all memory, triggers an OOME which makes its thread dies, freeing up the memory again. We had this case in our production environment with some problematic GraphQL queries returning gigabytes of data. If the periodic task was running at the same time as such a request, the OOME could be triggered from the periodic task thread and makes it stop forever.

As a general rule, web servers do not stop because a worker thread dies.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also Andrew's comment on that topic: #5974 (comment)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two separate concerns here: recovering from errors, and logging errors. Indeed you probably want to just let certain severe errors run their course even if that incapacitates the server. But you nonetheless want to make sure they are logged so you know what happened. If I recall correctly, an error in a Runnable on an Executor can cause the executor thread to die, but silently. Then the tasks just pile up in a queue for no apparent reason, which would be very confusing to debug.

An OOME within the executor might kill the executor thread. You definitely want to log that, as it will presumably be followed by tasks piling up and the JVM crashing and you'd want to know why.

* duplicating a TripPattern -> Timetable map and indexing the new Timetables.
*/
private final CountdownTimer snapshotFrequencyThrottle;
private final ConcurrentPublished<TimetableSnapshot> snapshot = new ConcurrentPublished<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ConcurrentPublished is simple enough, but would the Java AtomicReference be a better alternative. There are pros/cons here, and in the end I think we will revisit this, so maybe leave it for now.

@vpaturet vpaturet merged commit de56d00 into opentripplanner:dev-2.x Jul 25, 2024
5 checks passed
@vpaturet vpaturet deleted the publish_timetable_snapshot_in_background branch July 25, 2024 09:21
@t2gran t2gran added this to the 2.6 milestone Sep 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Real-Time Update The issue/PR is related to RealTime updates Skip Changelog Technical Debt
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants