-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create manifest creation script. #55
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some cosmetic changes
"directory_listing": {} | ||
} | ||
for directory in os.walk(base_dir): | ||
versions = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind changing this to something like version_list
? The one character difference between version
and versions
might be confusing (channeling my inner Russell Owen here)
Base release manifest dictionary with all directories, links, paths, | ||
and files populated. | ||
""" | ||
datasets = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optionally: same comment as before regarding just calling this the plural of dataset
. Maybe this could be dataset_lookup
. The need is not as acute because datasets
and data_set
are a little more distinguishable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to dataset_lookup and dataset.
ver_dict = ds_dict[data_kind] | ||
|
||
data_dir = os.path.join(base_dir, ver_dict['relative_path']) | ||
# print('---',data_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove commented-out line
(along with commented print
lines below)
return release | ||
|
||
|
||
def populate_paths_and_urls( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you give this function a name that more clearly indicates that this is for populating information at the level of directories/datasets, rather than individual files? I got a little tripped-up by the double implementation of bucket_prefix + relative_path
between this function and populate_datasets
. It took me a moment to realize this function was operating at a different level than the other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to populate_directories
datasets[data_set][data_kind][tag][norm]['files'] = {} | ||
datasets[data_set][data_kind][tag][norm]['files'][ | ||
ext] = {} | ||
datasets[data_set][data_kind][tag][norm]['files'][ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than individually set 'version'
, 'url'
, 'file_hash'
etc. in each section of this if/else block, can you define a file_stats
dict that has each of these elements before entering the if/else block, and then the place file_stats
under whatever chain of keys to need to based on the extension?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch.
from abc_atlas_access.abc_atlas_cache.utils import file_hash_from_path | ||
|
||
|
||
BUCKET_PREFIX = 'https://allen-brain-cell-atlas.s3.us-west-2.amazonaws.com/' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These global variables do not appear to be used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one comment on a docstring. This is good to merge, though.
@@ -280,7 +280,7 @@ def populate_datasets( | |||
"--datasets_to_skip", | |||
nargs='+', | |||
type=str, | |||
default=['releases', 'SEAAD', 'Zhuang-C57BL6J'], | |||
default=['releases', 'SEA', 'Zhuang-C57BL6J'], | |||
help="Skip a given project for all directories that start with the " | |||
"given pattern. (e.g. SEAD will exclude all directories that " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the example in the docstring be SEAAD
(two As), since that is what we call the actual dataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. Pretty sure there are also directories with SEA-AD and we also want to skip those.
Running a check of the manifest (downloading and verifying hashes). Once that's done I'll merge the changes unless there is anything else you'd like addressed. |
Add manifest builder function. Add doc strings. Add json and h5 file manifest creation. Hide mapmycells for now.
b7b3131
to
a7c5369
Compare
Add function, ported from Jupyter notebook to create manifests given a set of data.