Publish timetable snapshot in background #5974

vpaturet · 2024-07-17T12:51:17Z

Summary

In the current implementation, the timetable snapshot can be committed:

from the graph writer thread when applying updates (see

OpenTripPlanner/src/main/java/org/opentripplanner/updater/trip/TimetableSnapshotManager.java

Line 141 in 8391674

public void purgeAndCommit() {

)
from reader threads when lazy-loading a snapshot (see

OpenTripPlanner/src/main/java/org/opentripplanner/transit/service/DefaultTransitService.java

Line 518 in 359cb73

private TimetableSnapshot lazyGetTimeTableSnapShot() {

)

Committing the snapshot from a reader thread should be avoided: it delays the API call and it makes the implementation more complex and fragile due to synchronization between threads.

This PR modifies the snapshot lazy-loading logic so that it does not trigger a snapshot commit from reader threads.

This allows also to remove synchronization on the snapshot creation process since only the graph writer thread can modify the snapshot.
There is still a need for synchronizing the publishing of the new snapshot to guarantee that all changes that lead to the creation of the snapshot happens-before the publication. In the current implementation this is guaranteed by a volatile field. This PR makes use of the ConcurrentPublished mechanism to align with the publication mechanism used in the TransitLayer.

In addition to this, it makes it possible to construct a TransitService that reflects the current state of the buffer rather than the state of the latest published snapshot.
This is useful when used in the context of an updater, so that the TransitService can see pending changes.

A downside of committing snapshots only when applying updates is that, in case of throttling, the pending changes will be deferred until the next real-time message arrives, after the throttling period. This is not an issue with updaters that receive a steady flow of messages, but could be problematic with updaters that receive messages in batches followed by silent periods.
To cover this edge case, a periodic task TimetableSnapshotFlush commits pending changes, if any. In addition the updaters do not commit a snapshot after an update anymore, leaving TimetableSnapshotFlush the only place where a snapshot can be created. The throttling logic is removed from TimetableSnapshotManager

Issue

No

Unit tests

No

Documentation

No

codecov · 2024-07-17T15:18:17Z

Codecov Report

Attention: Patch coverage is 67.46032% with 41 lines in your changes missing coverage. Please review.

Project coverage is 69.64%. Comparing base (fa60b78) to head (cd8e29e).

Files	Patch %	Lines
...pplanner/updater/trip/TimetableSnapshotSource.java	74.32%	14 Missing and 5 partials ⚠️
...ripplanner/updater/spi/TimetableSnapshotFlush.java	0.00%	15 Missing ⚠️
...planner/updater/configure/UpdaterConfigurator.java	28.57%	4 Missing and 1 partial ⚠️
...pplanner/ext/siri/SiriTimetableSnapshotSource.java	90.00%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@              Coverage Diff              @@
##             dev-2.x    #5974      +/-   ##
=============================================
- Coverage      69.66%   69.64%   -0.02%     
- Complexity     17163    17165       +2     
=============================================
  Files           1942     1943       +1     
  Lines          73784    73795      +11     
  Branches        7550     7552       +2     
=============================================
- Hits           51401    51395       -6     
- Misses         19753    19772      +19     
+ Partials        2630     2628       -2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

src/main/java/org/opentripplanner/updater/spi/TimetableSnapshotFlushUpdater.java

vpaturet · 2024-07-19T09:44:37Z

@abyrd @leonardehrenfried From my perspective this PR is complete. There is still one open question: within an updater, should the TransitEditorService points to the latest published snapshot, to the snapshot buffer or none of them. This can be discussed in the next developer meeting.
I will rebase this PR and mark it as ready for review when #5973 get merged.

src/ext/java/org/opentripplanner/ext/siri/SiriTimetableSnapshotSource.java

src/main/java/org/opentripplanner/updater/spi/TimetableSnapshotFlush.java

…metableSnapshotSource

src/test/java/org/opentripplanner/updater/trip/RealtimeTestEnvironment.java

vpaturet · 2024-07-24T08:46:19Z

@t2gran where would TimetableSnapshotFlush fit in the package structure? I don't see any obvious location.

vpaturet · 2024-07-24T08:48:41Z

I added a couple of debug statements in TimetableSnapshotFlush to get an idea of the time required to create the snapshot and to index the TransitLayer. These debug statements should probably be removed at some point.

leonardehrenfried · 2024-07-24T09:06:05Z

src/main/java/org/opentripplanner/updater/trip/TimetableSnapshotManager.java

-   * duplicating a TripPattern -> Timetable map and indexing the new Timetables.
-   */
-  private final CountdownTimer snapshotFrequencyThrottle;
+  private final ConcurrentPublished<TimetableSnapshot> snapshot = new ConcurrentPublished<>();


I think I already know the answer but want to ask anyway: the use of ConcurrentPublished adds another layer of safety even though it isn't strictly necessary as there are never two writer threads active at the same time. Correct?

Or is there a case where its synchronization features are useful?

Synchronization has 2 properties: mutual exclusion and safe publication.
Mutual exclusion is not needed here since, like you said, there is only one writer thread and publishing the snapshot is an atomic operation (updating a reference).
What we need here is safe publication, that is: making sure that the reader thread sees all the modifications done by the writer thread. In practice it means purging the CPU caches and registers so that the reader thread does not use stale data.
Using a volatile field, as done in the current implementation, provides also a guarantee of safe publication, but it is a bit more obscure than using the ConcurrentPublished mechanism.

More details there: https://shipilev.net/blog/2014/safe-public-construction/

The ConcurrentPublished is simple enough, but would the Java AtomicReference be a better alternative. There are pros/cons here, and in the end I think we will revisit this, so maybe leave it for now.

My understanding is that AtomicReference is used when you need to do a compare-and-swap (check that the reference is the one you think it is before replacing it). Isn't the ConcurrentPublished synchronization doing something deeper here by imposing happens-before, and blocking reordering-based optimizations?

An example of the problem I think it's avoiding:
Code running on the single graph-writer thread changes some sub-object referenced by the snapshot, such as a TripTimes. It then updates the shared reference to the snapshot containing those TripTimes, but due to some reordering optimization (which would be harmless in a single threaded environment) the snapshot reference becomes visible before the new TripTimes sub-object becomes visible. Even though setting a reference to the snapshot would always be atomic (no word-tearing when seen from other threads), there's no guarantee that the reading thread sees the new snapshot reference after the changes to the TripTimes it contains are in place. Synchronization should establish a kind of reordering "fence" where everything appearing higher up in the source code is definitely visible to the reader.

We're talking about a place where there's likely to be read/write contention only once every 1-5 seconds so extremely rarely at the CPU-timescale. It's probably harmless to synchronize as ConcurrentPublished does, to get the benefit of strong ordering effects.

@vpaturet checked the Javadoc in detail. For these purposes, volatile, AtomicReference (Javadoc on the atomic package rather than one class), and synchronization all work because they all establish happens-before relationships, and we don't have to deal with read/write contention here.

leonardehrenfried · 2024-07-24T10:11:36Z

Since we only have a single write thread, how is the case of a very busy updater handled? How strong is the guarantee of a scheduleWithFixedDelay? Are the runnables processed in the order they are submitted?

vpaturet · 2024-07-24T11:43:12Z

Since we only have a single write thread, how is the case of a very busy updater handled? How strong is the guarantee of a scheduleWithFixedDelay? Are the runnables processed in the order they are submitted?

The runnables are processed in the order they are submitted but the scheduled task skips the queue: it is delayed at most by the duration of the updater task running right before.
Testing locally with Entur's setup (SIRI PubSub updater receiving messages every 40ms), the periodic task gets scheduled as expected.

leonardehrenfried · 2024-07-24T11:46:34Z

src/main/java/org/opentripplanner/updater/trip/TimetableSnapshotManager.java

+    return snapshot.get();
+  }
+
+  public TimetableSnapshot getTimetableSnapshotBuffer() {


Can you add Javadoc under what circumstances you want to have the buffer rather than the snapshot?

t2gran · 2024-07-24T12:14:51Z

src/main/java/org/opentripplanner/updater/spi/TimetableSnapshotFlush.java

+        gtfsTimetableSnapshotSource.flushBuffer();
+      }
+      LOG.debug("Flushed timetable snapshot buffer");
+    } catch (Throwable t) {


The Throwable opens up for a potensial critical error state, where the OTP Server is in a corrupt state, but does not go down. But the pragmatic programmer in me tells me that we probably have bigger issues in the code, and that we should fokus on those. To fix it we would need to catch both RuntimeException and Throwable. Log both, but take down the application for Throwable. I think it is out-of-scope for this PR.

I agree.

Is there ever a case where you can recover from an OutOfMemoryError? I think in such a case you really want to let the JVM die.

This is a real case in a web server: a request thread uses up all memory, triggers an OOME which makes its thread dies, freeing up the memory again. We had this case in our production environment with some problematic GraphQL queries returning gigabytes of data. If the periodic task was running at the same time as such a request, the OOME could be triggered from the periodic task thread and makes it stop forever.

As a general rule, web servers do not stop because a worker thread dies.

See also Andrew's comment on that topic: #5974 (comment)

There are two separate concerns here: recovering from errors, and logging errors. Indeed you probably want to just let certain severe errors run their course even if that incapacitates the server. But you nonetheless want to make sure they are logged so you know what happened. If I recall correctly, an error in a Runnable on an Executor can cause the executor thread to die, but silently. Then the tasks just pile up in a queue for no apparent reason, which would be very confusing to debug.

An OOME within the executor might kill the executor thread. You definitely want to log that, as it will presumably be followed by tasks piling up and the JVM crashing and you'd want to know why.

t2gran · 2024-07-24T12:23:12Z

src/main/java/org/opentripplanner/updater/trip/TimetableSnapshotManager.java

-   * duplicating a TripPattern -> Timetable map and indexing the new Timetables.
-   */
-  private final CountdownTimer snapshotFrequencyThrottle;
+  private final ConcurrentPublished<TimetableSnapshot> snapshot = new ConcurrentPublished<>();


The ConcurrentPublished is simple enough, but would the Java AtomicReference be a better alternative. There are pros/cons here, and in the end I think we will revisit this, so maybe leave it for now.

vpaturet added Technical Debt Real-Time Update The issue/PR is related to RealTime updates Skip Changelog labels Jul 17, 2024

vpaturet self-assigned this Jul 17, 2024

vpaturet force-pushed the publish_timetable_snapshot_in_background branch 2 times, most recently from 592d65a to 944b615 Compare July 17, 2024 15:11

abyrd reviewed Jul 18, 2024

View reviewed changes

src/main/java/org/opentripplanner/updater/spi/TimetableSnapshotFlushUpdater.java Outdated Show resolved Hide resolved

vpaturet force-pushed the publish_timetable_snapshot_in_background branch 3 times, most recently from d3aaee0 to c586bd6 Compare July 19, 2024 08:35

abyrd reviewed Jul 23, 2024

View reviewed changes

src/ext/java/org/opentripplanner/ext/siri/SiriTimetableSnapshotSource.java Outdated Show resolved Hide resolved

src/main/java/org/opentripplanner/updater/spi/TimetableSnapshotFlush.java Outdated Show resolved Hide resolved

leonardehrenfried requested a review from t2gran July 23, 2024 09:05

vpaturet force-pushed the publish_timetable_snapshot_in_background branch from b5c0104 to 707e01d Compare July 23, 2024 10:42

abyrd requested changes Jul 24, 2024

View reviewed changes

src/main/java/org/opentripplanner/updater/spi/TimetableSnapshotFlush.java Show resolved Hide resolved

vpaturet added 9 commits July 24, 2024 10:02

Remove lock on Timetable snapshot buffer

2c34a4e

Use current buffer from SiriTimetableSnapshotSource

892449a

use current buffer in TransitEditorService

78dd01f

Reduce visibility of commitTimetableSnapshot

2050550

Introduce timetable snapshot flush

79802f0

PUblish timetable snapshot with ConcurrentPublished

104ee66

Commit snapshot only from the TimetableSnapshotFlush periodic task.

0f77374

Document transit editor service in TimetableSnapshotSource and SiriTi…

850a2be

…metableSnapshotSource

Ignore exceptions in TimetableSnapshotFlush

7fe6999

vpaturet force-pushed the publish_timetable_snapshot_in_background branch from 707e01d to 7fe6999 Compare July 24, 2024 08:02

leonardehrenfried reviewed Jul 24, 2024

View reviewed changes

src/test/java/org/opentripplanner/updater/trip/RealtimeTestEnvironment.java Outdated Show resolved Hide resolved

Commit snapshot while applying trip updates in RealtimeTestEnvironment

d1d7f43

vpaturet requested review from abyrd and leonardehrenfried July 24, 2024 08:45

vpaturet marked this pull request as ready for review July 24, 2024 08:45

vpaturet requested a review from a team as a code owner July 24, 2024 08:45

leonardehrenfried reviewed Jul 24, 2024

View reviewed changes

t2gran previously approved these changes Jul 24, 2024

View reviewed changes

Added Javadoc

cd8e29e

vpaturet dismissed t2gran’s stale review via cd8e29e July 24, 2024 13:46

vpaturet requested review from leonardehrenfried and t2gran July 24, 2024 13:50

t2gran approved these changes Jul 24, 2024

View reviewed changes

abyrd approved these changes Jul 24, 2024

View reviewed changes

leonardehrenfried approved these changes Jul 25, 2024

View reviewed changes

vpaturet merged commit de56d00 into opentripplanner:dev-2.x Jul 25, 2024
5 checks passed

vpaturet deleted the publish_timetable_snapshot_in_background branch July 25, 2024 09:21

abyrd mentioned this pull request Jul 25, 2024

Additional SVG diagrams in updaters package.md #5936

Merged

t2gran added this to the 2.6 milestone Sep 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Publish timetable snapshot in background #5974

Publish timetable snapshot in background #5974

vpaturet commented Jul 17, 2024 •

edited

Loading

codecov bot commented Jul 17, 2024 •

edited

Loading

vpaturet commented Jul 19, 2024 •

edited

Loading

vpaturet commented Jul 24, 2024

vpaturet commented Jul 24, 2024

leonardehrenfried Jul 24, 2024

vpaturet Jul 24, 2024

t2gran Jul 24, 2024

abyrd Jul 24, 2024 •

edited

Loading

abyrd Jul 25, 2024

leonardehrenfried commented Jul 24, 2024

vpaturet commented Jul 24, 2024

leonardehrenfried Jul 24, 2024

vpaturet Jul 24, 2024

t2gran Jul 24, 2024

leonardehrenfried Jul 24, 2024

vpaturet Jul 24, 2024 •

edited

Loading

vpaturet Jul 24, 2024

abyrd Jul 24, 2024

t2gran Jul 24, 2024

Publish timetable snapshot in background #5974

Publish timetable snapshot in background #5974

Conversation

vpaturet commented Jul 17, 2024 • edited Loading

Summary

Issue

Unit tests

Documentation

codecov bot commented Jul 17, 2024 • edited Loading

Codecov Report

vpaturet commented Jul 19, 2024 • edited Loading

vpaturet commented Jul 24, 2024

vpaturet commented Jul 24, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

abyrd Jul 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

leonardehrenfried commented Jul 24, 2024

vpaturet commented Jul 24, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vpaturet Jul 24, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vpaturet commented Jul 17, 2024 •

edited

Loading

codecov bot commented Jul 17, 2024 •

edited

Loading

vpaturet commented Jul 19, 2024 •

edited

Loading

abyrd Jul 24, 2024 •

edited

Loading

vpaturet Jul 24, 2024 •

edited

Loading