-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Publish timetable snapshot in background #5974
Publish timetable snapshot in background #5974
Conversation
592d65a
to
944b615
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev-2.x #5974 +/- ##
=============================================
- Coverage 69.66% 69.64% -0.02%
- Complexity 17163 17165 +2
=============================================
Files 1942 1943 +1
Lines 73784 73795 +11
Branches 7550 7552 +2
=============================================
- Hits 51401 51395 -6
- Misses 19753 19772 +19
+ Partials 2630 2628 -2 ☔ View full report in Codecov by Sentry. |
src/main/java/org/opentripplanner/updater/spi/TimetableSnapshotFlushUpdater.java
Outdated
Show resolved
Hide resolved
d3aaee0
to
c586bd6
Compare
@abyrd @leonardehrenfried From my perspective this PR is complete. There is still one open question: within an updater, should the |
src/ext/java/org/opentripplanner/ext/siri/SiriTimetableSnapshotSource.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opentripplanner/updater/spi/TimetableSnapshotFlush.java
Outdated
Show resolved
Hide resolved
b5c0104
to
707e01d
Compare
src/main/java/org/opentripplanner/updater/spi/TimetableSnapshotFlush.java
Show resolved
Hide resolved
…metableSnapshotSource
707e01d
to
7fe6999
Compare
src/test/java/org/opentripplanner/updater/trip/RealtimeTestEnvironment.java
Outdated
Show resolved
Hide resolved
@t2gran where would |
I added a couple of debug statements in |
* duplicating a TripPattern -> Timetable map and indexing the new Timetables. | ||
*/ | ||
private final CountdownTimer snapshotFrequencyThrottle; | ||
private final ConcurrentPublished<TimetableSnapshot> snapshot = new ConcurrentPublished<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I already know the answer but want to ask anyway: the use of ConcurrentPublished
adds another layer of safety even though it isn't strictly necessary as there are never two writer threads active at the same time. Correct?
Or is there a case where its synchronization features are useful?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synchronization has 2 properties: mutual exclusion and safe publication.
Mutual exclusion is not needed here since, like you said, there is only one writer thread and publishing the snapshot is an atomic operation (updating a reference).
What we need here is safe publication, that is: making sure that the reader thread sees all the modifications done by the writer thread. In practice it means purging the CPU caches and registers so that the reader thread does not use stale data.
Using a volatile field, as done in the current implementation, provides also a guarantee of safe publication, but it is a bit more obscure than using the ConcurrentPublished
mechanism.
More details there: https://shipilev.net/blog/2014/safe-public-construction/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ConcurrentPublished
is simple enough, but would the Java AtomicReference be a better alternative. There are pros/cons here, and in the end I think we will revisit this, so maybe leave it for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that AtomicReference is used when you need to do a compare-and-swap (check that the reference is the one you think it is before replacing it). Isn't the ConcurrentPublished synchronization doing something deeper here by imposing happens-before, and blocking reordering-based optimizations?
An example of the problem I think it's avoiding:
Code running on the single graph-writer thread changes some sub-object referenced by the snapshot, such as a TripTimes. It then updates the shared reference to the snapshot containing those TripTimes, but due to some reordering optimization (which would be harmless in a single threaded environment) the snapshot reference becomes visible before the new TripTimes sub-object becomes visible. Even though setting a reference to the snapshot would always be atomic (no word-tearing when seen from other threads), there's no guarantee that the reading thread sees the new snapshot reference after the changes to the TripTimes it contains are in place. Synchronization should establish a kind of reordering "fence" where everything appearing higher up in the source code is definitely visible to the reader.
We're talking about a place where there's likely to be read/write contention only once every 1-5 seconds so extremely rarely at the CPU-timescale. It's probably harmless to synchronize as ConcurrentPublished does, to get the benefit of strong ordering effects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vpaturet checked the Javadoc in detail. For these purposes, volatile, AtomicReference (Javadoc on the atomic
package rather than one class), and synchronization all work because they all establish happens-before relationships, and we don't have to deal with read/write contention here.
Since we only have a single write thread, how is the case of a very busy updater handled? How strong is the guarantee of a |
The runnables are processed in the order they are submitted but the scheduled task skips the queue: it is delayed at most by the duration of the updater task running right before. |
return snapshot.get(); | ||
} | ||
|
||
public TimetableSnapshot getTimetableSnapshotBuffer() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add Javadoc under what circumstances you want to have the buffer rather than the snapshot?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
gtfsTimetableSnapshotSource.flushBuffer(); | ||
} | ||
LOG.debug("Flushed timetable snapshot buffer"); | ||
} catch (Throwable t) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Throwable
opens up for a potensial critical error state, where the OTP Server is in a corrupt state, but does not go down. But the pragmatic programmer in me tells me that we probably have bigger issues in the code, and that we should fokus on those. To fix it we would need to catch both RuntimeException and Throwable. Log both, but take down the application for Throwable. I think it is out-of-scope for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
Is there ever a case where you can recover from an OutOfMemoryError
? I think in such a case you really want to let the JVM die.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a real case in a web server: a request thread uses up all memory, triggers an OOME which makes its thread dies, freeing up the memory again. We had this case in our production environment with some problematic GraphQL queries returning gigabytes of data. If the periodic task was running at the same time as such a request, the OOME could be triggered from the periodic task thread and makes it stop forever.
As a general rule, web servers do not stop because a worker thread dies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also Andrew's comment on that topic: #5974 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are two separate concerns here: recovering from errors, and logging errors. Indeed you probably want to just let certain severe errors run their course even if that incapacitates the server. But you nonetheless want to make sure they are logged so you know what happened. If I recall correctly, an error in a Runnable on an Executor can cause the executor thread to die, but silently. Then the tasks just pile up in a queue for no apparent reason, which would be very confusing to debug.
An OOME within the executor might kill the executor thread. You definitely want to log that, as it will presumably be followed by tasks piling up and the JVM crashing and you'd want to know why.
* duplicating a TripPattern -> Timetable map and indexing the new Timetables. | ||
*/ | ||
private final CountdownTimer snapshotFrequencyThrottle; | ||
private final ConcurrentPublished<TimetableSnapshot> snapshot = new ConcurrentPublished<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ConcurrentPublished
is simple enough, but would the Java AtomicReference be a better alternative. There are pros/cons here, and in the end I think we will revisit this, so maybe leave it for now.
Summary
In the current implementation, the timetable snapshot can be committed:
OpenTripPlanner/src/main/java/org/opentripplanner/updater/trip/TimetableSnapshotManager.java
Line 141 in 8391674
OpenTripPlanner/src/main/java/org/opentripplanner/transit/service/DefaultTransitService.java
Line 518 in 359cb73
Committing the snapshot from a reader thread should be avoided: it delays the API call and it makes the implementation more complex and fragile due to synchronization between threads.
This PR modifies the snapshot lazy-loading logic so that it does not trigger a snapshot commit from reader threads.
This allows also to remove synchronization on the snapshot creation process since only the graph writer thread can modify the snapshot.
There is still a need for synchronizing the publishing of the new snapshot to guarantee that all changes that lead to the creation of the snapshot happens-before the publication. In the current implementation this is guaranteed by a volatile field. This PR makes use of the
ConcurrentPublished
mechanism to align with the publication mechanism used in theTransitLayer
.In addition to this, it makes it possible to construct a
TransitService
that reflects the current state of the buffer rather than the state of the latest published snapshot.This is useful when used in the context of an updater, so that the
TransitService
can see pending changes.A downside of committing snapshots only when applying updates is that, in case of throttling, the pending changes will be deferred until the next real-time message arrives, after the throttling period. This is not an issue with updaters that receive a steady flow of messages, but could be problematic with updaters that receive messages in batches followed by silent periods.
To cover this edge case, a periodic task
TimetableSnapshotFlush
commits pending changes, if any. In addition the updaters do not commit a snapshot after an update anymore, leavingTimetableSnapshotFlush
the only place where a snapshot can be created. The throttling logic is removed fromTimetableSnapshotManager
Issue
No
Unit tests
No
Documentation
No