
Could not remove pg_stat - unsatisfied constraints (Directory not empty) #251

Open · ocharles opened this issue Feb 5, 2020 · 12 comments

ocharles commented Feb 5, 2020

I have been happily using tmp-postgres locally, but on CI I got:

Exception: /tmp/tmp-postgres-data-4de8de8b7e43c151/pg_stat: removeDirectoryRecursive:removeContentsRecursive:removePathRecursive:removeContentsRecursive:removeDirectory: unsatisfied constraints (Directory not empty)

I've never seen this before, any idea what could have happened?


jfischoff commented Feb 5, 2020

ah crap.

This is a bug I'm really struggling to kill.

I think that while removeDirectoryRecursive is running, new files are being added to the directory.

This is the silly hack I have left in.

https://github.com/jfischoff/tmp-postgres/blob/master/src/Database/Postgres/Temp/Internal/Config.hs#L393

The other thing I have tried is renaming the folder before removing it, but that also did not work.

I guess I should try the idea I had of locking the directories ... though I'm not really sure what I was thinking, since I would be locking in the same process (I see now that it is postgres writing into the folder, so it would be a different process).

I could also just try deleting over and over again for some amount of time.
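Something like this, maybe (just a sketch, not what is in the library): retry removeDirectoryRecursive a bounded number of times, sleeping between attempts, in case postgres is still flushing files into the directory.

    import Control.Concurrent (threadDelay)
    import Control.Exception (IOException, throwIO, try)
    import System.Directory (removeDirectoryRecursive)

    -- Hypothetical helper: keep retrying the recursive delete, because new
    -- files may appear in the directory while the delete is running.
    retryRemove :: Int -> FilePath -> IO ()
    retryRemove attempts dir = do
      result <- try (removeDirectoryRecursive dir) :: IO (Either IOException ())
      case result of
        Right () -> pure ()
        Left e
          | attempts <= 1 -> throwIO e   -- out of retries, rethrow
          | otherwise -> do
              threadDelay 100000         -- wait 100 ms before the next attempt
              retryRemove (attempts - 1) dir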


ocharles commented Feb 5, 2020

Ah, you have experienced this too then?


jfischoff commented Feb 5, 2020

Not recently but yes. I was pretty sure I had not solved it.


ocharles commented Feb 21, 2021

We're hitting this a lot at the moment, so I've been doing some investigation. Some findings:

  • Disabling copy-on-write doesn't stop the problem from happening. More precisely, both copyOnWrite = False and copyOnWrite = cowCheck trigger the bug (I haven't confirmed what cowCheck evaluates to on the failing machine yet).

  • The tests are failing in CI builds, which are all Nix builds. Interestingly, when we don't have the Nix sandbox enabled (currently our default), tests fail with this error about 90% of the time. With the sandbox enabled, I have yet to see the bug happen. So it seems to be sensitive to a tainted environment.

  • Next, I note that find /tmp -name 'tmp-postgres-data-*' -type d | wc -l reports 2653, so there's a lot of junk lying around from previous runs!

  • I ran a test suite until it failed with this bug. I already had a list of all existing /tmp/tmp-postgres-data directories, and the directory that failed to be removed wasn't in the list - so it's not that the random directory names are colliding with existing directories.

  • I moved /tmp/tmp-postgres-* to /tmp/old-tmp-postgres and ran a failing test suite again. This test suite still failed, and I was left with 6 tmp-postgres-data directories. This is a somewhat surprising number - this test suite uses snapshots and opens 11 connections, so why are almost half of the directories still around? I only observed one exception, too! This makes me think that sometimes the cleanup function runs to completion and then PostgreSQL writes more files. So are we running the cleanup function too early?

  • I notice that it's only ever a complaint about the pg_stat directory. Reading https://www.postgresql.org/docs/current/monitoring-stats.html, we see

    When the server shuts down cleanly, a permanent copy of the statistics data is stored in the pg_stat subdirectory, so that statistics can be retained across server restarts.

    So I'm again wondering if we're doing the cleanup too early.

  • Looking at the cleanup code, I see that the cleanup routine is (probably) ultimately called from:

    Async.concurrently_ (stopPlan dbPostgresProcess) $ cleanupConfig dbResources
    Now I'm suspicious - we know that shutting the server down can write files. This change came in with parallel stop #214 #215, which landed before this issue was filed - so could it be the culprit? (A sketch of the sequential alternative is at the end of this comment.)

  • I tried reverting parallel stop #214 #215 (circuithub@b90ed91) and rebuilt using my fork.

  • With this change, I've managed to run the test suite 50 times in succession; 48 passed without exception. Something is still going wrong in the other two runs, but it appears to be a different problem (entirely unrelated to tmp-postgres).

I think my sandboxing observation is just luck: the sandbox happens to make the exception occur less frequently.

I suggest that #215 is reverted, or at least made configurable. I'll live with an 8 ms penalty if tests actually work reliably. Also, this time is spent per test, and we run tests in parallel anyway.
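For reference, the revert just sequences the two actions instead of running them with Async.concurrently_. A rough sketch of the ordering it restores, with placeholder arguments rather than the library's actual types:

    -- Sketch only: stop postgres first, and remove the temporary data
    -- directory only after the server has finished writing (e.g. the
    -- pg_stat dump that happens on a clean shutdown).
    stopThenCleanup :: IO () -> IO () -> IO ()
    stopThenCleanup stopPostgres cleanupDataDir = do
      stopPostgres     -- wait for the server to shut down cleanly
      cleanupDataDir   -- nothing is writing into the directory any more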

@jfischoff

Thanks @ocharles. I think you are probably right that #215 is the problem. I'll revert it.


the-dr-lazy commented Mar 14, 2021

As a temporary hack for anyone who has the same issue:

Just catch the close action.
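Something along these lines (only a sketch; closeAction stands for whatever stop/close call you already make against tmp-postgres):

    import Control.Exception (IOException, catch)

    -- Workaround sketch: swallow the "Directory not empty" IOException from
    -- cleanup so the test run doesn't fail on it.
    ignoringCleanupErrors :: IO () -> IO ()
    ignoringCleanupErrors closeAction = closeAction `catch` handler
      where
        handler :: IOException -> IO ()
        handler _ = pure ()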

If there is a better solution, I'd be happy to see it.

@ocharles

The better solution is to just revert b90ed91

@ocharles

@jfischoff Do you still plan to revert the above-mentioned commit? We haven't had any more problems since we reverted it (we've been running https://github.com/circuithub/tmp-postgres since my last comment).

@codygman

@jfischoff My team is also affected by this; it looks like our fix will be a temporary fork of tmp-postgres reverting b90ed91, since @ocharles said it's been working for some time.

This error started showing up for us seemingly out of nowhere.

codygman added a commit to EdutainmentLIVE/tmp-postgres that referenced this issue Sep 16, 2021

At least, from context that seems an accurate summary of what's happening. This was recommended by @ocharles and has been working for CircuitHub per: jfischoff#251 (comment)

ismaelbouyaf added a commit to ismaelbouyaf/docker-nix that referenced this issue Oct 14, 2021

The sandbox was (probably) disabled because of the .stack/shell.nix files: they made use of buildStackProject, which requires running outside of the sandbox for some reason ( https://github.com/NixOS/nixpkgs/blob/master/pkgs/development/haskell-modules/generic-stack-builder.nix#L25 ). Now that this file is not used anymore, we can re-enable the sandbox.

In particular, it should help fix this recent issue we're facing in CI: jfischoff/tmp-postgres#251

@ocharles

@jfischoff Polite ping. I still think reverting #215 is worth doing.


jfischoff commented Aug 15, 2022

@ocharles I haven't had time to work on this project recently, but I should have time this week. I'll take a look. Thanks for the ping.

@parsonsmatt

I just received our first report on this. Is there anything I can do to help diagnose or get the #215 revert moving along?
