UnicodeDecodeError due to non-ASCII chars in key #72

Open
jakubgs opened this issue Mar 5, 2018 · 2 comments

jakubgs commented Mar 5, 2018

I've run into glacier-cli failing because git-annex mistakenly treats part of a filename as a file extension and appends it to the key when using the SHA256E backend. In other words, for certain files, characters that merely look like an extension end up in the key even though they are not part of the real extension.

Example:

 % ls 12.\ Change\ The\ World\ \(feat.\ 웅산\).mp3 
12. Change The World (feat. 웅산).mp3
 % git annex info 12.\ Change\ The\ World\ \(feat.\ 웅산\).mp3
file: 12. Change The World (feat. 웅산).mp3
size: 7.48 megabytes
key: SHA256E-s7479642--957208748ae03fe4fc8d7877b2c9d82b7f31be0726e4a3dec9063b84cc64cf09.웅산.mp3
present: true
 % git annex calckey 12.\ Change\ The\ World\ \(feat.\ 웅산\).mp3
SHA256E-s7479642--957208748ae03fe4fc8d7877b2c9d82b7f31be0726e4a3dec9063b84cc64cf09.웅산.mp3

I've opened an issue with git-annex here:
https://git-annex.branchable.com/bugs/git-annex_adds_unicode_characters_at_end_of_checksum/

There will be a fix for the case with brackets, but there are other cases in which a file extension might legitimately contain non-ASCII characters. When that happens, this is the result:

% git annex copy 12.\ Change\ The\ World\ \(feat.\ 웅산\).mp3 --to glacier
copy 12. Change The World (feat. 웅산).mp3 (checking glacier...) Traceback (most recent call last):
  File "/usr/local/bin/glacier", line 737, in <module>
    main() 
  File "/usr/local/bin/glacier", line 733, in main
    App().main()
  File "/usr/local/bin/glacier", line 719, in main
    self.args.func()
  File "/usr/local/bin/glacier", line 600, in archive_checkpresent
    self.args.vault, self.args.name)
  File "/usr/local/bin/glacier", line 161, in get_archive_last_seen
    result = self._get_archive_query_by_ref(vault, ref).one()
  File "/usr/local/bin/glacier", line 136, in _get_archive_query_by_ref
    if ref.startswith('id:'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xec in position 83: ordinal not in range(128)
(user error (glacier ["--region=eu-west-1","archive","checkpresent","music","--quiet","SHA256E-s7479642--957208748ae03fe4fc8d7877b2c9d82b7f31be0726e4a3dec9063b84cc64cf09.\50885\49328.mp3"] exited 1)) failed
git-annex: copy: 1 failed

As the bug report says, you can avoid this issue by changing your backend from SHA256E to SHA256, which does not add extensions to keys. But I think addressing this issue would be good anyway.

joeyh commented Mar 6, 2018

Note that on unix, filenames have no defined encoding. No matter how the locale is set up, any filename can contain most any series of bytes. It would be good to just treat the filename passed to glacier as a binary blob if you can.
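
A minimal sketch of that idea, assuming Python 3 and a hypothetical helper named `split_ref` (this is not glacier-cli's actual code, only an illustration of comparing against a bytes prefix so that no implicit text decoding can happen):

```python
import os

ID_PREFIX = b"id:"  # bytes literal: comparing it never triggers text decoding


def split_ref(ref):
    """Classify an archive reference without decoding it as text.

    Hypothetical helper for illustration; not glacier-cli's actual code.
    """
    # os.fsencode leaves bytes untouched. On POSIX it converts str using the
    # filesystem encoding with the surrogateescape error handler, so
    # undecodable bytes from the command line round-trip intact.
    raw = os.fsencode(ref)
    if raw.startswith(ID_PREFIX):
        return ("id", raw[len(ID_PREFIX):])
    return ("name", raw)


# The key from the report above, as the raw bytes git-annex would pass.
key = (
    "SHA256E-s7479642--"
    "957208748ae03fe4fc8d7877b2c9d82b7f31be0726e4a3dec9063b84cc64cf09"
    ".웅산.mp3"
).encode("utf-8")

kind, value = split_ref(key)
print(kind, len(value))
```

In this key, the byte at index 83 is 0xec (the first UTF-8 byte of 웅), which lines up with the position reported in the traceback.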

basak (Owner) commented Mar 6, 2018 via email
