-
Notifications
You must be signed in to change notification settings - Fork 220
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
enable .json formats for uploading and metadata editing #443
base: master
Are you sure you want to change the base?
Conversation
Checks if file ends in *.json, will be processed as json, else *.csv
|
||
def is_json(filename): | ||
# Checks whether filetype is .json, else fallsback to .csv file format | ||
return bool(re.fullmatch('^.*\.json$', filename)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use filename.endswith('.json')
instead of a regex. The bool will not be required either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maxz In this instance, the bool()
is appropriate because a one-line utility is_xyz()
function should return True
or False
. Without bool()
the function could return two different data types which is a bit confusing for the caller.
>>> import re
>>> type(re.fullmatch('^.*\.json$', "filename"))
<class 'NoneType'>
>>> type(re.fullmatch('^.*\.json$', "filename.json"))
<class 're.Match'>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cclauss I'm pretty sure they meant that the bool
isn't necessary with str.endswith
.
|
||
|
||
def is_json(filename): | ||
# Checks whether filetype is .json, else fallsback to .csv file format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"falls back" should be written as two words.
@@ -66,6 +66,7 @@ | |||
from internetarchive.cli.argparser import get_args_dict, get_args_dict_many_write,\ | |||
get_args_header_dict | |||
from internetarchive.exceptions import ItemLocateError | |||
from internetarchive.utils import (is_json) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parentheses should be removed.
@@ -78,7 +78,7 @@ | |||
from internetarchive.cli.argparser import get_args_dict, convert_str_list_to_unicode | |||
from internetarchive.session import ArchiveSession | |||
from internetarchive.utils import (InvalidIdentifierException, get_s3_xml_text, | |||
is_valid_metadata_key, validate_s3_identifier) | |||
is_valid_metadata_key, validate_s3_identifier, is_json) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move the new import into its proper alphabetical position, (between get_s3_xml_text and is_valid_metadata_key)
while minding the proper maximum line length of 79 characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tendency seems to be towards 88 chars max per line.
https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#line-length
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, 102 in internetarchive:
Line 4 in 2456376
max-line-length = 102 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the past I've shot for 90 chars max per line, personallly. If 88 is the tendency, I'm happy to change to that! much less than that tends to frustrate me personally though (creating a lot of weird multi-line statements).
@@ -398,3 +398,8 @@ def merge_dictionaries(dict0, dict1, keys_to_drop=None): | |||
# Items from `dict1` take precedence over items from `dict0`. | |||
new_dict.update(dict1) | |||
return new_dict | |||
|
|||
|
|||
def is_json(filename): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name should most likely be changed to has_json_extension
since it does not inspect and verify the content of a file and therefore could lead to future confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have dropped legacy Python, let's add type hints on the function parameters and return type.
Also, given that import os
is already done above, let's leverage that.
def is_json(file_path: str) -> bool:
return os.path.splitext(file_path)[1] == ".json"
https://docs.python.org/3/library/os.path.html#os.path.splitext
Of course, my favorite would be import pathlib
above and then:
def is_json(file_path: str) -> bool:
return Path(file_path).suffix == ".json"
https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.suffix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any value in doing either of those over file_path.endswith('.json')
in this case. They're exactly equivalent, just with extra machinery.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd love to make more use of pathlib now that we're PY3 only. With that in mind, it might be worth using it here vs. the simplified file_path.endswith('.json')
for the sake of consistency. I'm all for type hints, sounds great.
I hear your argument against it though, too, @JustAnotherArchivist. I'm fine with either, but would lean towards cclauss's suggestion above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion the os.path.splitext
solution is clearly inferior to path.endswith
.
Yet if me managed to change to paths where possible and over time consistently manged to use pathlib, that would be pretty nice e.g. if functions where it made sense would already take a path as input instead of a string.
Additionally to the previously suggested changes, you should definitely amend the first line of your commit to something more descriptive than I also don't really like these changes since a JSON file is not a spreadsheet. So the functionality should probably be separate from |
Thanks for the contribution @laptopsftw, and thanks for reviewing @maxz! I agree with all of @maxz's suggestions. Happy to merge once these are resolved. Thanks again. |
@@ -78,7 +78,7 @@ | |||
from internetarchive.cli.argparser import get_args_dict, convert_str_list_to_unicode | |||
from internetarchive.session import ArchiveSession | |||
from internetarchive.utils import (InvalidIdentifierException, get_s3_xml_text, | |||
is_valid_metadata_key, validate_s3_identifier) | |||
is_valid_metadata_key, validate_s3_identifier, is_json) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tendency seems to be towards 88 chars max per line.
https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#line-length
@@ -398,3 +398,8 @@ def merge_dictionaries(dict0, dict1, keys_to_drop=None): | |||
# Items from `dict1` take precedence over items from `dict0`. | |||
new_dict.update(dict1) | |||
return new_dict | |||
|
|||
|
|||
def is_json(filename): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we have dropped legacy Python, let's add type hints on the function parameters and return type.
Also, given that import os
is already done above, let's leverage that.
def is_json(file_path: str) -> bool:
return os.path.splitext(file_path)[1] == ".json"
https://docs.python.org/3/library/os.path.html#os.path.splitext
Of course, my favorite would be import pathlib
above and then:
def is_json(file_path: str) -> bool:
return Path(file_path).suffix == ".json"
https://docs.python.org/3/library/pathlib.html#pathlib.PurePath.suffix
|
||
def is_json(filename): | ||
# Checks whether filetype is .json, else fallsback to .csv file format | ||
return bool(re.fullmatch('^.*\.json$', filename)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maxz In this instance, the bool()
is appropriate because a one-line utility is_xyz()
function should return True
or False
. Without bool()
the function could return two different data types which is a bit confusing for the caller.
>>> import re
>>> type(re.fullmatch('^.*\.json$', "filename"))
<class 'NoneType'>
>>> type(re.fullmatch('^.*\.json$', "filename.json"))
<class 're.Match'>
if is_json(args['--spreadsheet']): | ||
spreadsheet = json.load(fp) | ||
else: | ||
spreadsheet = csv.DictReader(fp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is strange to call a dict
a spreadsheet
.
Does the for row in spreadsheet:
logic below really work for all .json files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The csv.DictReader
is an iterator producing a dict
for each row in the CSV file. You only get dict
s in the loop below.
All that means is that the JSON file must have a similar structure, i.e. an array of objects:
[
{"identifier": "item", "file": "./foobar"},
{"file": "./foobaz"}
]
This needs test coverage though and an example in the documentation.
Frankly, I do not understand how the recent review activity on this pull request was motivated. To me it seems like @laptopsftw does not have too much interest in integrating this pull request or else they would at least have notified us that they are working on the changes (unless something major in their life came up which hindered them). |
honestly i was still on the edge of actually integrating it on the mainline, since .csv was already a good workaround for this and it's just me being dum dum (unfamiliar) with the file format. as for me i have a separate local fork working for my case. would try to update a new pull request if i started working for 3.0.0 (i'm still uploading my files in 2.1.0 atm) |
Good to know, thanks for your reply. |
it's a small edit of mine: i was having problems uploading files since it has commas on it. instead of fixing it with double quotes, i just tried editing the code. just a small one.
what it does is it checks *.json in the filename, else it's a *.csv. that's pretty much it.