Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloneEnv intermittently tries to relink a deleted effect #243

Closed
michaelpj opened this issue Sep 10, 2024 · 15 comments · Fixed by #256
Closed

cloneEnv intermittently tries to relink a deleted effect #243

michaelpj opened this issue Sep 10, 2024 · 15 comments · Fixed by #256

Comments

@michaelpj
Copy link
Contributor

We've been running with master for a bit in order to get the fix for #237. I think we may have hit another bug?

I don't have a reproducer, but we're getting the following error:

undefined data
CallStack (from HasCallStack):
  error, called at src/Effectful/Internal/Env.hs:428:17 in effectful-core-2.4.0.0-BiuUQUZzpHcJ16lkaBGrSU:Effectful.Internal.Env
  undefinedData, called at src/Effectful/Internal/Env.hs:413:26 in effectful-core-2.4.0.0-BiuUQUZzpHcJ16lkaBGrSU:Effectful.Internal.Env

From a very cursory look, I guess this could occur if some parts of the environment are not initialized properly, which could conceivably have been introduced with the recent fix PRs? I can try and debug more, but it's not easy to reproduce. Perhaps a good step would just to be to add some more HasCallStacks?

@arybczak
Copy link
Member

arybczak commented Sep 10, 2024

Hmm, this is very weird. undefinedData, called at src/Effectful/Internal/Env.hs:413:26 is in deleteEffect. So that means that you're trying to access an effect that went out of scope. However, there are additional version checks (in getLocation) that should fail before.

Can you try with https://github.com/haskell-effectful/effectful/tree/hascallstack-runners? This should at least give you a location which deleted effect it tries to use.

@michaelpj
Copy link
Contributor Author

Great, that's helpful. I'll try and do it but it might be a bit since it only happens in production (:see_no_evil: ).

In case it helps, here are a few possibly relevant facts:

  1. The function that triggers this is being run for the second time. So it worked once without failure; I think something must have modified its environment?
  2. The function that triggers this is being run in parallel with a second thread. Both of them are running with cloned environments from the parent thread.

So conceivably something is going wrong with cloning?

Anyway, I'll try and run with your branch and report back.

@arybczak
Copy link
Member

arybczak commented Sep 10, 2024

since it only happens in production (:see_no_evil: ).

Ugh, my condolences.

Ok, if this is a concurrent code and is hard to reproduce, that explains it. It looks like you have two threads using Env that hasn't been cloned and thus uses the same underlying Storage 😕

import Control.Concurrent
import Control.Concurrent.Async
import Effectful
import Effectful.Reader.Static
import Effectful.Dispatch.Static.Unsafe

works :: IO ()
works = runEff $ do
  x <- runReader () $ do
    withEffToIO (ConcUnlift Ephemeral $ Limited 1) $ \unlift -> do
      x <- liftIO . async $ do
        threadDelay 1000000
        unlift ask
      threadDelay 500000
      pure x
  liftIO $ wait x

fails :: IO ()
fails = runEff $ do
  x <- runReader () $ do
    reallyUnsafeUnliftIO $ \unlift -> do
      x <- liftIO . async $ do
        threadDelay 1000000
        unlift ask
      threadDelay 500000
      pure x
  liftIO $ wait x
>>> works
>>> fails
*** Exception: version (2) /= storageVersion (0)
CallStack (from HasCallStack):
  error, called at src/Effectful/Internal/Env.hs:360:5 in effectful-core-2.4.0.0-inplace:Effectful.Internal.Env

fails tries to use the Reader after it was deleted by the parent thread and it fails, but it still fails on the version check (because it's reset to 0 when an effect is deleted). What I think is happening in your case is there is a race between a thread accessing the effect and the one deleting it:

  1. T1 tries to access the effect E, effectful does version check, it succeeds.
  2. T2 deletes E.
  3. T1 gets the effect from the Storage and it's no longer there since it was deleted by T2.

Very unfortunate. It's likely you use a static effect that wraps an IO-based concurrent API and one of the wrapper functions has a bug, i.e. doesn't properly clone Env when spawning a thread 🤔

@arybczak
Copy link
Member

arybczak commented Sep 10, 2024

Ok, I merged #244 with a bunch of HasCallStack constraint for better call stacks.

I also pushed https://github.com/haskell-effectful/effectful/tree/thread-sanitizer with a PoC of a primitive thread sanitizer.

master:

>>> fails
*** Exception: version (2) /= storageVersion (0)
CallStack (from HasCallStack):
  error, called at src/Effectful/Internal/Env.hs:360:5 in effectful-core-2.4.0.0-inplace:Effectful.Internal.Env
  getLocation, called at src/Effectful/Internal/Env.hs:309:14 in effectful-core-2.4.0.0-inplace:Effectful.Internal.Env
  getEnv, called at src/Effectful/Internal/Monad.hs:600:35 in effectful-core-2.4.0.0-inplace:Effectful.Internal.Monad
  getStaticRep, called at src/Effectful/Reader/Static.hs:53:15 in effectful-core-2.4.0.0-inplace:Effectful.Reader.Static
  ask, called at tests/Main.hs:53:16 in effectful-2.4.0.0-inplace-test:Main

thread-sanitizer:

>>> fails
*** Exception: Storage used from multiple threads: owner (54) /= current (55)
CallStack (from HasCallStack):
  error, called at src/Effectful/Internal/Env.hs:456:14 in effectful-core-2.4.0.0-inplace:Effectful.Internal.Env
  readStorage, called at src/Effectful/Internal/Env.hs:356:28 in effectful-core-2.4.0.0-inplace:Effectful.Internal.Env
  getLocation, called at src/Effectful/Internal/Env.hs:311:14 in effectful-core-2.4.0.0-inplace:Effectful.Internal.Env
  getEnv, called at src/Effectful/Internal/Monad.hs:600:35 in effectful-core-2.4.0.0-inplace:Effectful.Internal.Monad
  getStaticRep, called at src/Effectful/Reader/Static.hs:53:15 in effectful-core-2.4.0.0-inplace:Effectful.Reader.Static
  ask, called at tests/Main.hs:53:16 in effectful-2.4.0.0-inplace-test:Main

Basically, whenever you access a fresh Storage, it gets bound to the current thread, so if it ever gets accessed from a different thread, you'll immediately know. If my hypothesis from above post is correct, this should let you identify the problem quickly 🤔

@michaelpj
Copy link
Contributor Author

Hmmmmm. I don't think we're doing anything weird here. The two threads run together and the main thread waits for both, and I don't think anything changes the environment. But I think it must be something like that.

That thread sanitizer stuff looks amazing! I will try that out too!

@michaelpj
Copy link
Contributor Author

The thread sanitizer did not trigger. Here's a stack trace:

undefined data
CallStack (from HasCallStack):
  error, called at src/Effectful/Internal/Env.hs:433:17 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Env
  undefinedData, called at src/Effectful/Internal/Env.hs:418:26 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Env
  deleteEffect, called at src/Effectful/Internal/Env.hs:224:3 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Env
  unconsEnv, called at src/Effectful/Internal/Monad.hs:526:5 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Monad
  runHandler, called at src/Effectful/Dispatch/Dynamic.hs:465:20 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Dispatch.Dynamic
  reinterpret, called at src/Effectful/NonDet.hs:78:21 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.NonDet
  runNonDetRollback, called at src/Effectful/NonDet.hs:60:22 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.NonDet
  runNonDet, called at <our code>

So possibly a bug or gotcha in runNonDetRollback. I think I don't fundamentally need that, so I'm probably going to just try and get rid of it and see if that helps. But maybe somewhere to look? I think it's likely that this is happening when we hit an empty and trigger a rollback.

@arybczak
Copy link
Member

The thread sanitizer did not trigger

Well, that's one good news at least.

So possibly a bug or gotcha in runNonDetRollback.

Well, it turns out that stack traces that come from undefinedData are deceptive. Apparently, due to how HasCallStack on values works, it points to the place in the code where undefinedData was inserted into the Storage, not where it was accessed 😕 So looks like your code tries to access the NonDet effect that is not there, sigh.

However, I previously misread the trace. The code doesn't try to access the effect data, it tries to access its relinker. So looks like this is triggered inside cloneEnv somewhere that tries to relink a NonDet effect that's not there anymore. I still don't understand how this can happen however 😞

I'm going to merge #247 that will start producing expected call stacks when undefined relinker is called, hopefully this sheds more light on the situation.

@michaelpj
Copy link
Contributor Author

Okay, I'll try running with #247 too.

@michaelpj
Copy link
Contributor Author

New day, new stack trace!

undefined relinker
CallStack (from HasCallStack):
  error, called at src/Effectful/Internal/Env.hs:459:56 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Env
  relinker, called at src/Effectful/Internal/Env.hs:179:17 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Env
  cloneEnv, called at src/Effectful/Internal/Unlift.hs:148:17 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Unlift
  ephemeralConcUnlift, called at src/Effectful/Internal/Unlift.hs:122:3 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Unlift
  concUnlift, called at src/Effectful/Internal/Monad.hs:242:39 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Monad
  concUnliftIO, called at src/Effectful/Internal/Monad.hs:205:40 in effectful-core-2.4.0.0-4kVXPiFKBfI6XV8bIBZFIP:Effectful.Internal.Monad
  withEffToIO, called at <our code>

@michaelpj
Copy link
Contributor Author

So we call an unlifting function, but when we copy our environment, some of it is missing?

@michaelpj
Copy link
Contributor Author

I am also somewhat baffled to say that switching OnEmptyRollback to OnEmptyKeep seems to have made the problem go away. But it might well just be lurking, and we just don't happen to hit it.

@arybczak
Copy link
Member

arybczak commented Sep 12, 2024

So we call an unlifting function, but when we copy our environment, some of it is missing?

Yeah, basically. cloneEnv goes through all effects in Storage and relinks them. Apparently in this case there is an effect there that was already deleted, but I don't understand at all how this could've happened since deleteEffect always deletes from the top and decreases the size of Storage.

It would be interesting if this indeed disappeared with switching to OnEmptyKeep, because it might be some weird interaction caused by restoreEnv (thatOnEmptyRollback uses) which overwrites the Storage, so in principle the size of the storage and its contents might diverge when this function is used, but of course it has a sanity check that relevant Storage sizes are the same -_- So I genuinely have no idea what's going on.

I fixed a bug in OnEmptyRollback that wouldn't roll back when the whole action failed in 1da8876, but that's unrelated.

I don't think I can do much more without a reproducer and no access to your code :(

Now that you have a location of the NonDet effect that went out of scope and the location of withEffToIO that attempts to call its relinker, how do these locations relate to each other? Perhaps with this information you could try to write a reproducer?

FWIW, I also pushed d97c9db so that when an undefined relinker is called, it will show both call stacks of its creation and its call site (since both are important), thought we already got them.

@michaelpj
Copy link
Contributor Author

Hmm okay. I will see if I can make a reproducer.

Another question is: are there any functions exposed from effectful that don't do the right checks? We do have a bunch of effect definitions and it's possible that one of them does something unsafe.

@arybczak
Copy link
Member

arybczak commented Sep 13, 2024

Apart from stuff in Effectful.Dispatch.Static.Unsafe, no.

Of course you can always just pass an Env to a different thread such that there is more than one thread operating on the same Env simultaneously, but then presumably you would've got errors from the thread-sanitizer branch.

Btw, did you stop getting this error with OnEmptyKeep?

@michaelpj
Copy link
Contributor Author

Btw, did you stop getting this error with OnEmptyKeep?

Yes I did. Although to be honest, that may just mean that I've peturbed the situation enough that we happen not to trip over the problem, not that it's necessarily related. I'm planning to roll back that change occasionally to keep debugging 😅

@arybczak arybczak changed the title undefined data on effectful master cloneEnv intermittently tries to relink a deleted effect Sep 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants