Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Caching of file-set hashes by local path and mtimes #700

Merged
merged 36 commits into from
Mar 17, 2024

Conversation

tclose
Copy link
Contributor

@tclose tclose commented Sep 11, 2023

Types of changes

  • New feature (non-breaking change which adds functionality)

Summary

Will address #683 by

  • bytes_repr() overloads for FileSets to yield a "time-stamp" object consisting of a key (file path) and modification time as the first item in the generator.
  • hash_single() checks for these key/mtime pairs in a global "local hashes cache" dict
    • returns cached hash, if present
    • otherwise it proceeds through the remaining byte chunks, and calculates the hash
  • calculated hashes are saved into/loaded from the cache directory using a hash of the key/mtime

Checklist

  • I have added tests to cover my changes (if necessary)
  • I have updated documentation (if necessary)

Notes

I'm pretty happy with how this turned out with the exception of a few of wrinkles. Any suggestions would be most welcome

  • There isn't a clean way to specify the location of the persistent hash cache in the top-level code or put it in the cache_location path (my original plan) given that checksum and hash are object properties instead of methods
    • I have therefore just dumped it in ~/.pydra-hash-cache
    • not ideal to have to (effectively) hard-code this and having it user dependent
    • Having it in the user directory does mean that if the same cache directory is accessed from different machines (e.g. shared network drive) then local paths won't clash (although chance of mtimes being the same would be vanishingly small)
  • The keys of the caches themselves are hashed and stored as files in the persistent cache directory, and will therefore never be cleaned up.
    • This could be ok as they are very small files so if it grows as time goes on it probably won't have much impact
    • This is how pydra caches work in general
  • The resolution of mtime differs depending on OS, with Ubuntu and Windows having quite low resolution, i.e. order of seconds
    • I have had to put a sleep call inside the hash test to ensure the mtimes are different
    • Could cause issues if you were updating the contents of files within a loop and spinning off different workflows (not sure whether this would ever happen in practice)
    • No easy way to disable the mtime caching behaviour if this does become a problem

@tclose tclose added the enhancement New feature or request label Sep 11, 2023
@tclose tclose changed the title added code to handle "locally-persistent-ids" Caching of file-set hashes by local path and mtimes Sep 11, 2023
@codecov
Copy link

codecov bot commented Sep 11, 2023

Codecov Report

Attention: Patch coverage is 99.04762% with 1 lines in your changes are missing coverage. Please review.

Project coverage is 84.22%. Comparing base (ff01e4c) to head (921979c).

Files Patch % Lines
pydra/utils/hash.py 98.97% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #700      +/-   ##
==========================================
+ Coverage   83.93%   84.22%   +0.29%     
==========================================
  Files          24       25       +1     
  Lines        5029     5123      +94     
  Branches     1429     1449      +20     
==========================================
+ Hits         4221     4315      +94     
  Misses        802      802              
  Partials        6        6              
Flag Coverage Δ
unittests 84.22% <99.04%> (+0.29%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tclose tclose marked this pull request as ready for review September 12, 2023 23:58
@ghisvail
Copy link
Collaborator

I am not confident I understand enough of the previous and proposed caching methods to have an opinion and assess this PR.

AFAIK, the previous caching mechanism generated cache folders for each task (including the workflow) in a common cache location which was set to a temporary folder, unless overridden with the cache_dir argument.

Each task's cache folder was composed of its name and a hash value, the latter being computed from the task's input field values. The cache would be re-used, if the name and hash values did not change, and the same cache folder was specified as cache_dir. Otherwise, it would recompute everything in a new cache location somewhere in temp.

Could you confirm whether my summary is accurate and whether this PR is proposing to change the fundamentals of this mechanism?

@tclose
Copy link
Contributor Author

tclose commented Sep 17, 2023

I am not confident I understand enough of the previous and proposed caching methods to have an opinion and assess this PR.

No worries, I just put you all down as reviewers so you were notified. Don't feel like you need to contribute if the area isn't familiar

AFAIK, the previous caching mechanism generated cache folders for each task (including the workflow) in a common cache location which was set to a temporary folder, unless overridden with the cache_dir argument.

Each task's cache folder was composed of its name and a hash value, the latter being computed from the task's input field values. The cache would be re-used, if the name and hash values did not change, and the same cache folder was specified as cache_dir. Otherwise, it would recompute everything in a new cache location somewhere in temp.

Could you confirm whether my summary is accurate and whether this PR is proposing to change the fundamentals of this mechanism?

Yes, your understanding is correct. This is how the execution cache works, both currently and in this PR. This PR seeks to improve (or more accurately, restore) performance by caching the hashes of file/directory types themselves, so files/directories don't need to be rehashed (which can be an expensive operation) each time the checksum is accessed.

It does this by hashing the path and mtime of the file/directory (not to be confused with the hashing of the file/directory contents), and using it as a key to look up a "hash cache" (not to be confused with the execution cache) that contains previously computed file/directory hashes.

@effigies
Copy link
Contributor

  • I have had to put a sleep call inside the hash test to ensure the mtimes are different

Better to mock the call than to sleep. Here's an example where I've done that with time.time():
https://github.com/nipy/nibabel/blob/5f37398a2f8211c175b798eb42298b963c693ae0/nibabel/tests/test_openers.py#L457-L464

pydra/utils/hash.py Outdated Show resolved Hide resolved
@tclose
Copy link
Contributor Author

tclose commented Sep 19, 2023

  • I have had to put a sleep call inside the hash test to ensure the mtimes are different

Better to mock the call than to sleep. Here's an example where I've done that with time.time(): https://github.com/nipy/nibabel/blob/5f37398a2f8211c175b798eb42298b963c693ae0/nibabel/tests/test_openers.py#L457-L464

Interesting, I assumed that the value for the mtime was controlled by the file-system not Python.

Copy link
Collaborator

@ghisvail ghisvail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions and a question mark regarding the cache path in the home folder. Maybe also worth having a look to mtime mocking instead of sleep steps as suggested by Chris.

Looks solid otherwise 👍

pydra/utils/hash.py Outdated Show resolved Hide resolved
pydra/utils/hash.py Outdated Show resolved Hide resolved
@djarecka
Copy link
Collaborator

@tclose - you can try with new slurm testing workflow, should work now!

Copy link
Collaborator

@djarecka djarecka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! just left comments regarding the location of the cache directory

@tclose
Copy link
Contributor Author

tclose commented Sep 25, 2023

  • I have had to put a sleep call inside the hash test to ensure the mtimes are different

Better to mock the call than to sleep. Here's an example where I've done that with time.time(): https://github.com/nipy/nibabel/blob/5f37398a2f8211c175b798eb42298b963c693ae0/nibabel/tests/test_openers.py#L457-L464

@effigies I'm not sure this works on Windows (it seems to work on Ubuntu), see test failures

@tclose tclose force-pushed the local-cache-ids branch 2 times, most recently from 21540ad to 96dcc48 Compare September 26, 2023 03:24
@tclose
Copy link
Contributor Author

tclose commented Sep 26, 2023

@effigies, I think that I have addressed the outstanding issues with this PR so it just needs your review. However, I had to revert the mtime mocking as it doesn't appear to work for Windows (see https://github.com/nipype/pydra/actions/runs/6307685788/job/17124695852), unless you have some ideas on how to do it.

@djarecka
Copy link
Collaborator

@tclose - the recent commit to this PR comes from some rebasing?

@tclose
Copy link
Contributor Author

tclose commented Feb 28, 2024

@tclose - the recent commit to this PR comes from some rebasing?

I rebased it on top of master after the environment PR was merged and fixed up a few things related to those changes

@djarecka
Copy link
Collaborator

ok, trying to figure out if I should review it again, since I already accepted

@effigies effigies dismissed ghisvail’s stale review February 28, 2024 14:32

Changes addressed

pydra/utils/hash.py Outdated Show resolved Hide resolved
Copy link
Contributor

@effigies effigies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I've completely lost the plot on this PR. I don't understand what's going on, but there are some things that look wrong to me.

pydra/utils/hash.py Outdated Show resolved Hide resolved
pydra/utils/hash.py Show resolved Hide resolved
pydra/utils/hash.py Outdated Show resolved Hide resolved
@djarecka
Copy link
Collaborator

djarecka commented Mar 7, 2024

sorry, I'm sure we discussed it at some point, but I'm not sure about one important thing... looks like if I move file around my filesystem I have no way of reusing the previous tasks now, is that right?

@djarecka
Copy link
Collaborator

djarecka commented Mar 8, 2024

sorry, I was wrong, the task has correct hash when the file is the same with a different path.

Perhaps you can add this test: https://github.com/djarecka/pydra/blob/db782c20890797bd58eb1d52545b7104f7d41aa4/pydra/engine/tests/test_node_task.py#L1568

(I was planning to create a PR to you, but I must have merged to the branch more things to my branch)

@tclose
Copy link
Contributor Author

tclose commented Mar 8, 2024

sorry, I was wrong, the task has correct hash when the file is the same with a different path.

Perhaps you can add this test: https://github.com/djarecka/pydra/blob/db782c20890797bd58eb1d52545b7104f7d41aa4/pydra/engine/tests/test_node_task.py#L1568

(I was planning to create a PR to you, but I must have merged to the branch more things to my branch)

Ok, nice addition

@djarecka
Copy link
Collaborator

djarecka commented Mar 8, 2024

I've just realized that we should also have a similar test when the persistent cache is used, but realized that it's not enough to set PYDRA_HASH_CACHE. Sorry, for asking for more work, but can we have some test when the persistent cache is used together with running task or workflow

@tclose
Copy link
Contributor Author

tclose commented Mar 16, 2024

@djarecka I have added a new test to ensure that the persistent cache gets hit during the running of tasks. Let me know if there is anything else you need me to do

return super().contents


def test_task_files_persistentcache(tmp_path):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tclose - where are you setting the persistent cach path? i.e. PYDRA_HASH_CACHE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At @ghisvail's suggestion, the hash cache is stored in a system-dependent user cache directory using platformdirs.user_cache_dir by default (e.g. /Users/<username>/Library/Caches/pydra/<version-number> on MacOS).

I have just tweaked the code so that it is now put in a hashes subdirectory of that user cache dir (accessible in the pydra.utils.user_cache_dir variable) just in case any other cache data needs to be stored at some point in the future.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, got it! Sorry I missed that. I've just realized that I made a typo when I was setting PYDRA_HASH_CACHE and that's why it was not saving the hashes there, and was confused where this is being saved..

@djarecka
Copy link
Collaborator

@tclose - thanks so much for the work!

@djarecka djarecka merged commit 811dc45 into nipype:master Mar 17, 2024
43 checks passed
@tclose tclose deleted the local-cache-ids branch March 17, 2024 20:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants