-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
java.lang.InterruptedException race in chiming functions. #37
Comments
Thanks @dazld, will have a look 😄 |
No biggie! Was taking chime for a test drive and these couple of things popped out. While we’re talking, would you mind pushing a release of the version with the IPending implementation? That threw me as well, and noticed it’s fixed in master. |
Ah, I see what's going on here - I'm afraid this is (currently, at least) intended behaviour within Chime. If a schedule is explicitly closed, we assume the system is shutting down in some way, and interrupt the scheduling thread to give it a chance to cleanly close any resources it's opened. Interrupting a thread that's currently waiting on a With that in mind, we can repro it reliably by making the future One example of prior art here would be the difference between I'll consider what it'd involve to add something like that to Chime, and what it's impact would be. In the meantime, maybe either a lock or a shared (let [now (Instant/now)
close-lock (Object.)
chiming (chime/chime-at (chime/periodic-seq now (Duration/ofSeconds 1))
(fn [time]
(log/info :chime (str time))
(locking close-lock
(log/info :future-task @(future (Thread/sleep 500)))))
{:error-handler (fn [e]
(log/error :chime-failed e)
true)
:on-finished #(log/info :done)})]
(Thread/sleep 250)
(locking close-lock
(.close chiming))) Re the 0.3.3 release - I've pushed a Thanks again for raising this issue :) |
The behaviour of |
BTW, I tried out the approach described in my previous comment, and the only test that breaks is this (for obvious reasons): FAIL in (test-cancelling-overrunning-task) (core_test.clj:111)
expected: (instance? InterruptedException (clojure.core/deref !error))
actual: nil
Ran 18 tests containing 45 assertions.
1 failures, 0 errors.
Tests failed. |
A complementary approach would be to let consumers choose how the executor is shutdown (i.e. (let [pool (Executors/newSingleThreadScheduledExecutor thread-factory)
!latch (promise)
cancelled? (atom false)
shutdown* (case exec-shutdown
:orderly (fn []
(reset! cancelled? true)
(.shutdown pool))
:abruptly (fn [] (.shutdownNow pool)))] ;; default |
I retract my last two comments! There is already a (when-not (realized? !latch)
(f time)
true) This allows moving to [EDIT]: Actually it's not 'breaking', is it? The default error-handler (or any other that already mimics it) can stay as is - they just will never see an |
I'm trying to digest the following comment (seen a few messages above):
This is an implementation detail, and users shouldn't depend on it - in fact I'm struggling to see how they even could. Users of My point is that Looking back at the history of the project, I can see that Anyway, enough waffling...i think it's pretty clear how much I want to use |
Hey @jimpil, thanks for the analysis - you've clearly gone quite deep into this 😄 If I may, to summarise, it seems there's a couple of points here:
Have I understood it correctly? :) James |
Hi @jarohen, See my comments inline:
We may arrive at this conclusion in the end, but it is kind of premature to consider this at this stage. We first need to understand the following:
Assuming that we agree on chime's singular promise (i.e.
As I've already explained, when there is no global notion of
I think you'll agree that relying on interrupts is inherently racy, and therefore by definition a
There hasn't always been one, but the moment you added the Now, if after doing all that, you still feel that there are masochists out there who would actually prefer their core scheduling primitive to be racy/not-deterministic and centered around interrupts, despite
Yes indeed, if you go down the
Yes but this is kind of necessary. The real problem, given the current implementation is that it's not communicated to the users (e.g via the doc-string - it just mentions truthy VS falsy).
There is no better/clearer error-handler given the current implementation. It will ultimately be an Hope that helps in your thinking process 👍 |
Thanks An important meta point, first of all:
Please don't dismiss people with alternative opinions in this way on my repos. You have very valid points and suggestions that I'm more than willing to consider - there is no need to detract from them by using such language and tone. |
You're right - I apologise for my poor choice of words, but I'm not actually dismissing anyone. I'm basically concluding that there may ultimately be value in accommodating those people (regardless of their opinions, or how they might look to me). By the way, I'm not a native speaker, but I totally get your point...I re-read my comment and i could have easily expressed the same point using different words. Point taken 👍 |
I disagree, I'm afraid. Could you elaborate?
I disagree with these too - that we have a 'global notion of done?', that we have a 'fully reliable means of communication without interrupts', and hence that we don't need them. What happens if the user wants to shut down the schedule while the job is currently running? Openly, I'm a little discouraged by your outright dismissal of interrupts without an acknowledgement of under what circumstances they might be useful. It doesn't fill me with confidence that you've fully considered the benefits of the opposite position before rejecting it. Similarly, I am aware of their pitfalls - so please don't assume I introduced them without due thought and consideration, nor that I am not sufficiently experienced that I don't know of alternatives.
Likewise. I have been working on/with Chime for ~7 years now. While I don't pretend to have gotten everything right, please do consider that I may have put some thought into it over that time! :) |
The perils of the interrupt flag, what/how affects it, and the general pains of using it correctly have been discussed extensively, and I'd like to avoid focusing on that, as it's rather secondary to my main point. Their most useful feature is the fact that blocking operations like network IO,
A user who wants to stop a side-effecting task that has already started (e.g sending an email, writing a file etc), is by definition a user who doesn't care about how much of the side-effect actually happened. I know it seems like the opposite because he/she (seems to) wants for things to stop asap, but there is no way for them to actually know, unless they are the ones driving it. Moreover, what does it mean
It's rather sad to hear this, as it proves that I didn't quite get the message across. I'm not opposing, nor dismissing interrupts in the general sense. I just don't think they provide a sound foundation to carry the unscheduling semantics of
I don't know what idea you got, but I consider the work you've done with chime simply excellent - from the core idea down to the actual implementation, and ultimately I am utterly grateful for the hours you've put in. The discussion we're having here is not to have a go at you, or your work. Quite the contrary I would say - I'm itching to use Finally, I was already wondering in the back of my head, but given how much more you've re-enforced your position around interrupts and Thanks again... |
It just occurred to me that the Future
(cancel [_ interrupt?]
(when-let [^Future fut @current] ;; `current` is an atom which holds the current-job's future (as returned by `.schecule`)
(and (not (.isCancelled fut))
(.cancel fut interrupt?)))) Again, just a thought - I'll come back to this with a clean head tomorrow and re-evaluate... |
This is kind of neat and it also kind of addresses #30, if I'm understanding the request correctly: ScheduledFuture
(cancel [_ interrupt?] ;; expose interrupt facilities (opt-in)
(when-let [^ScheduledFuture fut @current]
(and (not (.isCancelled fut))
(.cancel fut interrupt?))))
(getDelay [_ time-unit] ;; expose remaining time until next chime (partly addresses #30)
(when-let [^ScheduledFuture fut @current]
(.getDelay fut time-unit))) |
I was just reading this this morning, after several years of reading the actual book. I am not going to copy-paste any sections in particular, as there is a lot of good information in there, but the key takeaway (for me at least), is that the right reaction to an interrupt, will ultimately depend on the actual method that is being interrupted. In other words, there is no one size fits all solution, and |
Hey, thanks for the clarifications.
Ah, ok - apologies, I thought you were suggesting that the interrupt flag had/caused race conditions (the original subject of this issue). I definitely agree that interrupts have their perils - if we can get closer to the textbooks this can only be a good thing. |
This is the essence of my original proposal - I recognise that there are two distinct use cases, and Chime currently only handles the
IMO it's not really for Chime to decide this. In terms of opinions, it's a scheduler that takes a seq of times and a function, that's it. Beyond that, it's for Chime's users to determine how to handle errors, how to shutdown safely, etc. If you're supplying a long-running function (say, an overnight batch - the original use-case that led to Chime) that is interrupt-aware, you should be able to interrupt it. Where I think we're short atm is the more graceful stop - the user should (but doesn't currently) have the option to stop the schedule after the currently running job. On balance, I'd say it'd be less intuitive to run the next schedule, even if you stopped it beforehand? In your examples with banks and Amazon, their systems may well have constraints which means they need to operate like this - they might have daily batch processing of DDs, or the parcel may well already be making its way through the logistics chain - but we don't need to impose such constraints on all Chime users, who may just be using the scheduler to send a daily email? If I stopped the 2am daily schedule at midday on the 30th, I wouldn't expect it to send an email at 2am on the 31st, say.
Hopefully the above clarifies my thoughts here? If it helps, I wouldn't go so far as to call it a 'design center-piece' - it's not the core value of Chime, and I'm certainly open to iterating on it. We do have a test for the current behaviour, which I think you've run into - |
I wasn't aware of this checklist - thanks :) |
Re
Maybe this could be a chime-specific protocol, but with semantics/naming taken from |
FWIW, it was about a handful of sentences in your original message that gave an overall impression of being a little dismissive and patronising. I appreciate the clarifications, thanks - won't dwell on them any further. Happy to file under 'the perils of text-only communication' and move on with improving Chime :) |
First of all, it's great to hear that there is some form of convergence. Secondly, both this comment of yours, and a later one make it sound like chime aimed/aspired to accommodate interrupt-aware loops from day one. If that is indeed the case, should this be mentioned somewhere in the README, or even better the doc-string?
It is great that you have this opinion, but unfortunately
I couldn't agree more! You should definitely be able to interrupt it, but you should be explicit about what you're trying to do. The semantics of
I'm not following 100% what you mean here so let's try to use a minimal example using 3 successive times (
Do you see any problems with the above? What do you find less intuitive?
The only purpose of these examples was to showcase how tricky/leaky it is to express things like
Again, I couldn't agree more, but it kinda worries me that you have to state this at all, so could you elaborate if you don't mind? Have I said, or proposed something that would cause this?
The only aspect of that test that fails is the
This book is a goldmine 👍
My understanding from the Java docs is that this is calculated on-the-fly every time you call it, so it's always correct. That said, I would need to look at the actual source to be 100% sure. To be clear, I'm not opposing what you suggested - I'm just saying that there is a good chance this does exactly what a consumer might want (to know how long until the next chime). After all, he/she does have access to the times (he/she provided them), and there is a utility fn (
No such dilemma introduced. If you can call both
I haven't explored this idea, simply because I didn't need to. You have already made the decision to implement |
Re |
Yep, let's add it to the doc-string of the abrupt shutdown 👍
Right - propose that we fix this by exposing a separate graceful shutdown, and take the knowledge of the
Agreed, While I remember, I haven't yet considered the impact of this on the core.async wrapper - TODO 🤔
Right - it's the 'job remains scheduled to run at
Ah, ok - yep. As a first pass, this test could check that the sleep is interrupted, rather than relying on the error handling behaviour. After that, there's probably better ways of expressing the desired behaviour too.
It's always correct, but only at the exact moment
We've fixed an issue previously (#10) around holding on to the head of the lazy seq - I wouldn't want to ask the user to do the same, especially if it's infinite. We have the tail of the sequence available, and could expose it - this would be particularly suitable if the user only wanted the next time, as it wouldn't require evaluating any more of the sequence. Let's take that one to #30 :)
Your point about calling |
Hi there, Don't be alarmed by the out-of-order responses - I just think architectural decisions will greatly inform the implementation, so I'd like to start with those.
From an architectural stsandpoint, what we are trying to do here is to separate the (macroscopic) notion of a schedule from the (microscopic) notion of an individual task, and ultimately provide the end-user sane semantics for managing both (depending on his needs which only he knows). Looking at
While the issue with lazy-seqs and holding their head is definitely there, and chime users should always be vigilante and aware, I don't think it's worth fixating on it, as holding the head of an infinite sequence is a bad idea in general - not just in
You've lost me here, but that could be down to me claiming that
It may be a long time, but it may not be...what does 'long-time' mean in this context anyway? And how informative this ultimately is? From a pure scheduling perspective, I'd argue that the thing that matters is that there are no more chimes. From a pure resource-management perspective, in theory you're right - the sooner we can get rid of resources that are not ultimately going to be used, the better (especially when talking about current heavy-weight Thread objejcts). However, I refuse to let that 'nice-to-have' pollute my thoughts of correctness. In fact, having seen the latest talks on Loom, I strongly believe that we will eventually all move away from this obsession to reuse as few threads as we possibly need etc etc. In such a world, where Threads are so cheap that pooling them becomes an anti-pattern, this whole resource-management concern you alluded to becomes literaly a non-issue. I'll come back to you with implementation ideas tomorrow...As to how this all interacts with |
While that is one valid interpretation, there is enough ambiguity here that I don't want to use these names. Let's go with
This again comes across dismissive and patronising, as if the only way I'm able to think is when I'm being guided by you. I've given you the benefit of the doubt on a number of occasions, but seriously, please make this the last time.
I'm suggesting to give the user the raw data, and letting them make calculations based on it should they wish. If you give the users the raw data, they're free to make the calculation immediately, or they're free to make it later. In any event, this isn't a blocker for this issue - let's either move it to #30 as requested or drop it for now.
In this case, I am the user - and I am choosing to make this calculation as close as possible to when it's used. I'd prefer to pass an
Agreed.
Are you suggesting that closing the thread pool when we know the schedule's shut down is somehow 'incorrect'? If not, assuming it doesn't introduce excessive complexity, why would we not clean up when we can?
I'm also looking forward to Loom, and agree that it'll change the way we all think. However, on the understanding that we're not precluding ourselves from moving that way in the future as and when it becomes widely available, I don't think it applies in our current context. Even when Loom does become widely available, we still have to consider that there'll be users of Chime who aren't in a position to make that migration for a while. |
Imagine for a second that
Out of curiosity, would you still consider the returned object semantically ambiguous, after reading this?
I honestly don't see how this comes off as dismissive and/or patronising. I'm admitting that I didn't quite follow your comment, but also that it could be my own fault. I have no problem apologising, but not for something I don't even understand. In fact, I'm starting to feel that you're the one being overly defensive and somewhat dismissive. Communicating such deep/complex matters is not easy, and I'm making a big effort to to be both polite and informative here. If you aren't finding this useful or informative we can stop at any time - just let me know. After all we're both spending valuable time from our lives...
Again, fair-enough if you think this is helpful. My point is that that raw-data the user provided in the first place. To say that you're going to expose what the user gave you, just sounds weird (to me).
Again out of curiosity, what makes you think that an end-user will have a dramatically different-use case (when calling
If there was a way to have a graceful shutdown AND immediate cleanup, that would be just great! However, that is part of the conundrum with
Sure, but it does highlight that this notion of resource-management (for threads), is only ever going to decrease in importance. People who haven't made the transition, should of course be accommodated, but we're not talking about breaking anyone's program here. We're talking about leaving a couple of objects in memory. As I mentioned, if you can come with a solution for that, I'm looking forward to hear about it. |
Am I understanding correctly that you propose that we change I agree that we need to make the doc-string absolutely semantically clear - I additionally think that unambiguous names would be preferable. |
This has the tradeoff of the user keeping hold of the either the original seq (or the seq generator, passing it around with the schedule.
I don't - I just think that passing back the delay means that users pretty much have to use it immediately; passing them the raw time gives them the option of using it immediately or later. Maybe I'm missing some value here regarding re-purposing the Similarly, if |
Just to confirm, we're not proposing Assuming we were to go with distinct |
I'm not planning to (personally) solve this at this point in time. I'm not saying it's not valuable - if there's a way of migrating Chime to be more Loom-friendly, I'm all ears - but it's not one I'll dedicate brain cycles to just yet. |
Re: the core-async integration: At first glance, I don't see how the stuff we're discussing breaks
I see no evidence of this statement. I suspect you meant to say (fn [time]
(when-not (a/>!! ch time)
(a/close ch))) With the risk of rushing to conclusions, i will say that this looks like another remnant of a time when |
Agreed. This is why I feel it important to remove such tension from the discussion where possible. I was aiming to feed back in a (hopefully) constructive manner, without judgement, how some of the things you've been saying over the course of this issue have genuinely come across - from the other side, it's not conducive to a good discussion. I don't intend to be defensive, and for that I'm sorry - I'd rather be open about it so that we can get on with improving Chime.
To be clear, I'm not saying this isn't useful - just harder than it needs to be. |
I am not suggesting anything concrete yet. If I'm honest, mentally I'm still in the design phase, and so I'd like to avoid mixing in existing baggage. In other words I'm coming at this from a clean-slate perspective (at least for now). I do realise that this is not super practical, but I am not great at focusing on multiple things :(.
Accounting for the fact that
I don't disagree per-ce, I just find that to be a perfectly reasonable expectation for such a niche feature.
This is the bit I truly struggle to understand what you mean...The user can call
The underlying construct that
Of course, but only what the semantics of
Absolutely! If you go back to my original comment on this you'll see that my first draft implemented
Do you recall my very first suggestion of a new
👍 |
Here is a rough sketch of the API I would personally like for (defn cancel-next!
"Cancels the next upcoming chime, potentially abruptly,
as it may have already started. The rest of the schedule
will remain unaffected, unless the interruption is handled
by the error-handler (i.e. `InterruptedException`), and it
returns falsey, or throws (the default one returns true)."
([sched]
(cancel-next! sched true)) ;; essentially the same as `future-cancel`
([^ScheduledFuture sched interrupt?]
(.cancel sched interrupt?)))
(defn until-next
"Returns the remaining time (in millis by default)
until the next chime (via `ScheduledFuture.getDelay()`)."
([sched]
(until-next sched TimeUnit/MILLISECONDS))
([^ScheduledFuture sched time-unit]
(.getDelay sched time-unit)))
(defn cancel-next?!
"Like `cancel-next!`, but only if the next task hasn't already started."
[^ScheduledFuture sched]
(when (pos? (until-next sched))
(cancel-next! sched)))
(defn shutdown!
"Gracefully closes the entire schedule (per `pool.shutdown()`).
If the next task hasn't started yet, it will be cancelled,
otherwise it will be allowed to finish."
[^AutoCloseable sched]
(-> (doto sched (.close))
cancel-next?!))
(defn shutdown-now!
"Attempts a graceful shutdown, but if the latest task
is already happening attempts to interrupt it.
Semantically equivalent to `pool.shutdownNow()`
when called with <interrupt?> = true (the default)."
([sched]
(shutdown-now! sched true))
([sched interrupt?]
(or (shutdown! sched)
(cancel-next! sched interrupt?))))
(defn wait-for!
"Blocking call for waiting until the schedule finishes,
or the provided <timeout-ms> has elapsed."
([sched]
(deref sched))
([sched timeout-ms timeout-val]
(deref sched timeout-ms timeout-val))) |
I've edited my last comment 3 times, but this this third iteration shows signs of maturity...For instance, notice how |
Thanks - and a Happy New Year to you too :) Plenty of food for thought here, thanks - some initial musings:
Incidentally, there's a bug in the (admittedly pseudo-code) implementation of
Thinking also of the use case of rendering the next time the schedule will run (stealing inspiration from #30), there's no way to do this reliably with That said, I'm coming around to the idea of additionally exposing
This outlines why we might adopt a similar API, I was wondering whether there were benefits to returning an object that implements this exact interface. Having had a few days to mull this over, though, the fact that Likewise, I'll think on this a bit more :) |
Hi again,
The arity is there more to match the
Not really, but in my view a higher level API like this should be instructive/informative to the user.
You're right this is a terrible name! Something like
I really don't want to dwell over this, but two points:
So as long as the user can live with the millisecond tolerance (999 microseconds left are considered too late), I don't see the bug. In fact, I used my google-fu to try to find what kind of use-cases
Indeed, I don't think it is possible to calculate the datetime of the next-chime and have it always be 100% (defn next-chime-at
"Returns the (future) `ZonedDateTime` when the next chime will occur,
or nil if it has already started (millisecond tolerance)."
(^ZonedDateTime [sched]
(next-chime-at sched times/*clock*))
(^ZonedDateTime [sched ^Clock clock]
(let [remaining (until-next-chime sched)]
(when (pos? remaining)
(-> (ZonedDateTime/now clock)
(.plus remaining ChronoUnit/MILLIS))))))
There is no way (that I can see) to use
👍 |
For reference, here is what I am able to do locally: (def sched (chime-at (times/every-n-seconds 2) println))
=> #'user/sched
#object[java.time.Instant 0x6d43f538 2021-01-03T22:52:38.986748Z]
#object[java.time.Instant 0x697c7c48 2021-01-03T22:52:40.986748Z]
(cancel-current?! sched)
=> true
#object[java.time.Instant 0x2c7eff03 2021-01-03T22:52:44.986748Z]
#object[java.time.Instant 0x6f801063 2021-01-03T22:52:46.986748Z]
#object[java.time.Instant 0x1d18d29e 2021-01-03T22:52:48.986748Z]
(cancel-current?! sched)
=> true
#object[java.time.Instant 0x170c6349 2021-01-03T22:52:52.986748Z]
#object[java.time.Instant 0x4c51004c 2021-01-03T22:52:54.986748Z]
(shutdown! sched)
=> true
;; no more chimes |
Let's perform the following thought experiment: Assume that a user has the ability to get both the milliseconds (from |
There seems to be a race between closing a chime and the chiming function invocation that can lead to a
java.lang.InterruptedException
being thrown when functions are awaiting something coming from, for example, afuture
, or have consumed a function that does this somewhere along the call stack.It'd be nice if either the chiming function was never called, or any inflight functions were allowed to complete before closing.
Example:
outputs:
Note, might have to run this a couple of times to see the exception, as it doesn't always happen.
The text was updated successfully, but these errors were encountered: