Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance UTF-8 support for filename handling in downloads #19161

Merged
merged 6 commits into from
Nov 20, 2024

Conversation

arash77
Copy link
Collaborator

@arash77 arash77 commented Nov 19, 2024

Fixes step 3 of #18584
Improve filename handling in data downloads and zipstream headers to support UTF-8 encoding, ensuring proper display of filenames with special characters.

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

@galaxyproject-sentryintegration

🔍 Existing Issues For Review

Your pull request is modifying functions with the following pre-existing issues:

📄 File: lib/galaxy/datatypes/data.py

Function Unhandled Issue
_serve_raw FileNotFoundError: [Errno 2] No such file or directory: '' /datasets/11ac94870d0bb33ae88951f6779a9da...
Event Count: 1

Did you find this useful? React with a 👍 or 👎

@arash77 arash77 marked this pull request as ready for review November 19, 2024 15:12
@github-actions github-actions bot added this to the 25.0 milestone Nov 19, 2024
dataset,
to_ext,
hdca=kwd.get("hdca"),
element_identifier=kwd.get("element_identifier"),
filename_pattern=kwd.get("filename_pattern"),
)
headers["Content-Disposition"] = f'attachment; filename="{filename}"'
headers["Content-Disposition"] = (
f"attachment; filename=\"{filename}\"; filename*=UTF-8''{utf8_encoded_filename}"
Copy link
Contributor

@davelopez davelopez Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this to a utility function to reduce duplication?
Something like def to_content_disposition(filename: str) or something similar. The utf8 encoding should be inside too, so _download_filename just returns filename.

Copy link
Collaborator Author

@arash77 arash77 Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion!
Should I remove the escape from name in _download_filename?
Is it even necessary to escape the name anymore? Can we just use the quote function to escape?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is not relevant anymore I guess so? it seems to be used in hdca_name so check carefully if you can really drop it 👍

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped the escape function and also removed it from hdca_name too, as it didn’t seem necessary since the quote function handles it effectively.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The escape function limits input to 150 characters and excludes characters that aren't file system safe.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can limit the characters and I think that the quote function already excludes these kinds of characters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No it does not, it just encodes them and the user-agent then knows how to turn them back into the original form.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Content-Disposition#filename, which unfortunately doesn't go into detail on restricted characters

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we have to keep this but I will move this part into the new function.



def toContentDisposition(filename: str) -> str:
sanitized_filename = "".join(c in FILENAME_VALID_CHARS and c or "_" for c in filename)[0:150]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will also need to apply the file length limit and exclude certain characters inutf8_encoded_filename as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and the name should be to_content_disposition or similar, sorry my brain was in frontend mode when I suggested the change 😄

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, also the limit is applied to 150 characters shouldn't it be 255?

Copy link
Collaborator Author

@arash77 arash77 Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And by exclude certain characters you mean which characters?
Should this be excluded? /\?%*:|"<>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find any specific limitation on characters from the link but I found out that for a typical filename, the characters limitation is 256 characters and /\?%*:|"<> are not allowed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. I bet most browsers are putting proper restrictions in place anyway.

apply 255 characters limit to filename
exclude `/\?%*:|"<>` characters from encoded filename
@mvdbeek mvdbeek merged commit 1615f12 into galaxyproject:dev Nov 20, 2024
53 of 54 checks passed
@mvdbeek
Copy link
Member

mvdbeek commented Nov 20, 2024

Thanks @arash77!

@arash77 arash77 deleted the add-utf8-encoded-filename branch November 20, 2024 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants