Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bitstream download URLs based on the handle of the item and the bitstream filename don't work with accented characters #2727

Closed
Marwa1988-S opened this issue Jan 2, 2024 · 21 comments · Fixed by #3062
Assignees
Labels
bug claimed: Atmire Atmire team is working on this issue & will contribute back component: SEO Search Engine Optimization testathon Reported by a tester during Community Testathon
Milestone

Comments

@Marwa1988-S
Copy link

Marwa1988-S commented Jan 2, 2024

When trying to download a bitstream using the URL:
/bitstream/handle/[prefix]/[suffix]/[filename]
if the filename in the URL contain non-English characters, it redirects to 504 ERROR.

To Reproduce

  1. login to https://demo.dspace.org/home
  2. copy https://demo.dspace.org/bitstream/123456789/263/1/An_Interview_with_Sarah_Asebedo_ć.pdf in the browser

Expected behavior
The bitstream file should be open

Try another file with just English characters in the filename: https://demo.dspace.org/bitstream/123456789/263/1/An_Interview_with_Sarah_Asebedo.pdf, it will works.

@Marwa1988-S Marwa1988-S added bug needs triage New issue needs triage and/or scheduling labels Jan 2, 2024
@github-project-automation github-project-automation bot moved this to 🆕 Triage in DSpace Backlog Jan 2, 2024
@alanorth
Copy link
Contributor

alanorth commented Jan 3, 2024

I think this is an issue with the Amazon Cloudfront configuration for demo.dspace.org. I uploaded a bitstream with the same name to my DSpace 7.6 test repository and was able to download it.

Screenshot 2024-01-03 at 07-56-26 Morphology and growth of maize IITA research guide No  9-fs8

@alexandrevryghem
Copy link
Member

Atmire would like to claim this issue. Currently only these 2 legacy urls are supported:

  • /bitstream/handle/{prefix}/{suffix}/{filename}?sequence={sequence_id}
  • /bitstream/{prefix}/{suffix}/{sequence_id}/{filename}

But we found out that some of our clients also used this format /bitstream/handle/{prefix}/{suffix}/{sequence_id}/{filename}, which currently redirects you to that 404 page

@tdonohue
Copy link
Member

tdonohue commented Jan 8, 2024

@alexandrevryghem : I'm confused by your comment & this ticket description. This ticket description seems to say that the bug is with accented characters, but your comment implies it's not an issue with accented characters, but rather an issue of a missing legacy URL.

I'm OK with Atmire claiming this & will assign to you. But, please update the ticket description/title if the bug is actually a missing legacy URL. That way testers will understand what your (eventual) PR is trying to fix.

@tdonohue tdonohue added claimed: Atmire Atmire team is working on this issue & will contribute back and removed needs triage New issue needs triage and/or scheduling labels Jan 8, 2024
@tdonohue tdonohue removed this from DSpace Backlog Jan 8, 2024
@github-project-automation github-project-automation bot moved this to 📋 To Do in DSpace 8.0 Release Jan 8, 2024
@tdonohue tdonohue moved this from 📋 To Do to 🏗 In Progress in DSpace 8.0 Release Jan 8, 2024
@tdonohue tdonohue moved this to 🏗 In Progress in DSpace 8.x and 7.6.x Maintenance Jan 8, 2024
@alexandrevryghem
Copy link
Member

@tdonohue: I didn't create this ticket so I can't update it but like Alan mentioned it's not related to the accented characters. When you look at the example url that was given by @Marwa1988-S and the format he used /bitstream/handle/[prefix]/[suffix]/[filename] in his description these urls don't match. That's why I assume that this is related to that bug, but maybe it would be best to wait until @Marwa1988-S can confirm this 😅

@Marwa1988-S
Copy link
Author

Marwa1988-S commented Jan 9, 2024

I think this is the correct muster for my URL:
/bitstream/{handle-prefix}/{handle-suffix}/{sequence_id}/{filename}

As already described in my description:
if the URL does not have an accent character in the file name, you can access it directly, and if not it gives a 504 error.
other example: item with handle = 10673/1194 has 2 text files: Testing file.txt, file2_ć.txt
This link works:
https://demo.dspace.org/bitstream/10673/1194/1/Testing file.txt
while this doesn't:
https://demo.dspace.org/bitstream/10673/1194/1/file2_ć.txt

I'm not sure if this refers to an issue with legacy url or some configurations.

@alexandrevryghem
Copy link
Member

Thnx for the additional info, this is indeed a separate issue. The correct url for downloading that bitstream is actually:
https://demo.dspace.org/bitstream/10673/1194/3/file2_%C4%87.txt

Because the sequenceId of that bitstream is 3 (see here), but that also throws a 504 error

@alexandrevryghem alexandrevryghem removed the claimed: Atmire Atmire team is working on this issue & will contribute back label Jan 9, 2024
@alexandrevryghem alexandrevryghem removed their assignment Jan 9, 2024
@alexandrevryghem alexandrevryghem moved this from 🏗 In Progress to 📋 To Do in DSpace 8.x and 7.6.x Maintenance Jan 9, 2024
@MW3000
Copy link
Contributor

MW3000 commented Jan 9, 2024

What is the meaning of sequenceId? And the sequenceID is independent of the bug. The bug also occurs with the correct sequenceID.

@alexandrevryghem
Copy link
Member

It's a number that ensured that the URLs were unique for each bitstream because it is possible for two bitstreams on a certain item to have the same name. It indeed seems to be more of a configuration problem (but I didn't check it myself).

@tdonohue
Copy link
Member

tdonohue commented Jan 9, 2024

All, I dug into this a bit more today with help from the team hosting demo.dspace.org.

It looks (to me) like it might be a DSpace UI bug with the redirect from a legacy URL to a new one..
e.g. A legacy URL like this
[dspace.ui.url]/bitstream/[handle-prefix]/[handle-suffix]/[sequence]/[filename]
or
[dspace.ui.url]/bitstream/handle/[handle-prefix]/[handle-suffix]/[filename]
Will redirect to the new download URLs like this:
[dspace.ui.url]/bitstreams/[bitstream-uuid]/download

For instance, I notice that downloading files with non-English characters works perfectly on the demo site, if you visit the Item page first and click on the filename.

However, if instead you use a direct link using the legacy URL style (from DSpace 6.x), then you will encounter this 504 error. My current guess is this ticket may be related to #1242, which added automatic redirects for these legacy (DSpace 6.x) URLs. It's possible that redirect isn't working properly when it encounters non-English characters.

@alexandrevryghem : This analysis does mean that it might be possible that this ticket is loosely related to the legacy redirect issues you noticed with your clients. It's just that this ticket points out an additional problem...that the legacy redirects don't handle special characters well. Let me know if you'd like to claim this ticket or not.

@tdonohue tdonohue added component: SEO Search Engine Optimization help wanted Needs a volunteer to claim to move forward labels Jan 9, 2024
@alexandrevryghem
Copy link
Member

@tdonohue: I was able to reproduce this bug locally with the demo backend. This error is only reproducible when you run Angular in production mode and the fix is simply to encode the filename you send to the backend.

I can create a small PR for this, but maybe it would be cleaner to fix these encoding issues higher up (in RequestParam). I recently also fixed another issue which is also related to encoding parameters and encoding it higher up would fix both issues at once. If you agree that this is better I can still create this quick fix for 7.6.2, but then I could also create an additional fix for 8.0 where all RequestParams are also encoded in that constructor (I would then also remove some hardcoded parameter encoding in order to not encode those parameters twice for example here).

@tdonohue
Copy link
Member

tdonohue commented Jan 9, 2024

@alexandrevryghem : I have to admit, I don't have a strong preference here. Either fix is fine. So, I'd recommend using your best judgement....or you could ping @artlowel to see if he has a strong preference.

@tdonohue
Copy link
Member

tdonohue commented Mar 6, 2024

@alexandrevryghem : Are you still working on this ticket? (Somehow it is no longer assigned to anyone) Just noting this bug came up in a discussion on dspace-tech: https://groups.google.com/g/dspace-tech/c/FO2cPiQWhlA

@alexandrevryghem
Copy link
Member

@tdonohue: I can create fix for this if you want, I think that fixing it here for all the RequestParams would maybe be cleaner. The benefit of fixing it there would be that other functionlities would automatically also be fixed like #2724 and it would prevent new functionalities to have the same issue. The only disadvantage is that if you already fixed it by passing the encoded value (like here) you would need to remove the encodeURIComponent, so this might be a "breaking" change

@tdonohue
Copy link
Member

tdonohue commented Mar 6, 2024

@alexandrevryghem : If you have time, that'd be great. I would however like to backport this to 7.x if possible. Even if it requires removing encodeURIComponent calls, I think that's OK to backport...as long as we note that change.

I'll assign this to you officially.

@tdonohue tdonohue added claimed: Atmire Atmire team is working on this issue & will contribute back and removed help wanted Needs a volunteer to claim to move forward labels Mar 6, 2024
@tdonohue tdonohue moved this from 📋 To Do to 🏗 In Progress in DSpace 8.x and 7.6.x Maintenance Mar 6, 2024
@tdonohue tdonohue added the testathon Reported by a tester during Community Testathon label Apr 17, 2024
@pkmcnc
Copy link

pkmcnc commented May 7, 2024

@alexandrevryghem
Hi! Can you share the minimum fix for non-English characters in URL?
I have doubts about applying commit #2725 to 7.6.1
Thanks!

@tdonohue
Copy link
Member

After #2725 was merged, this bug still exists. However, the behavior has changed slightly. This is what I'm seeing when I run in production mode (yarn build:prod; yarn serve:ssr):

  1. First, create an item with two files with no access restrictions. Make sure one has zero accented characters and one has at least one. Here's an example:
    • One bitstream named test.doc
    • One bitstream named test_ć.pdf
  2. Determine the Item's handle (e.g. 123456789/10). Make sure that a basic redirect works properly like this:
  3. Now, try the "handle" style redirect for the first bitstream (test.doc). It should work.
  4. Finally, try the same redirect for the accented bitstream (test_ć.pdf) It will NOT work

So, after #2725, special characters in bitstreams no longer return an error. But, they still don't work properly if you attempt to access them via the /bitstream/handle/[prefix]/[suffix]/[bitstream-filename] URL. (They do work though if you just click the download link in the UI)

It's possible however that fixing #2963 will also now fix this bug, as the new behavior almost seems like a redirect failure? Pinging @artlowel and @alexandrevryghem to let them know these two issues might now be related (unverified though).

@alexandrevryghem
Copy link
Member

@tdonohue: I retested this myself today and for me the redirect now works correctly on sandbox, I think there might have been a confusion because there are 2 characters that look like a "c" with an accent but they have different unicodes so it may not look like it but: 'ć' !== 'ć' 😅

So currently:

  • curl --head https://sandbox.dspace.org/bitstream/handle/123456789/248/1test_ć.pdf → 302
  • curl --head https://sandbox.dspace.org/bitstream/handle/123456789/248/2test_ć.pdf → 302

With #2963:

  • curl --head http://localhost:4000/bitstream/handle/123456789/248/1test_ć.pdf → 301 Moved Permanently
  • curl --head http://localhost:4000/bitstream/handle/123456789/248/2test_ć.pdf → 301 Moved Permanently

@tdonohue
Copy link
Member

tdonohue commented May 20, 2024

@alexandrevryghem : I tried this again today, and it's still not working for me on Sandbox.

I created a new test Item to test with: https://sandbox.dspace.org/handle/10673/1131 It has three files, one with no special characters and the other two having the names you specified above.

Here's my results for curl --head for each:

$ curl --head https://sandbox.dspace.org/bitstream/handle/10673/1131/test_pdf.pdf
HTTP/2 302
date: Mon, 20 May 2024 15:09:20 GMT
content-type: text/plain; charset=utf-8
content-length: 120
location: https://sandbox.dspace.org/server/api/core/bitstreams/83ebc316-dc6b-45f6-b667-b76ad99a818c/content
x-powered-by: Express
x-ratelimit-limit: 500
x-ratelimit-remaining: 499
x-ratelimit-reset: 1716217800
cache-control: max-age=604800
link:
vary: Accept

$ curl --head https://sandbox.dspace.org/bitstream/handle/10673/1131/1test_ć.pdf
HTTP/2 200
date: Mon, 20 May 2024 15:09:45 GMT
content-type: text/html; charset=utf-8
content-length: 360440
x-powered-by: Express
x-ratelimit-limit: 500
x-ratelimit-remaining: 498
x-ratelimit-reset: 1716217800
cache-control: max-age=604800
link:
etag: W/"57ff8-PNVyJKGADoZ+xlKUSdZLlmLoy4A"
vary: Accept-Encoding

$ curl --head https://sandbox.dspace.org/bitstream/handle/10673/1131/2test_ć.pdf
HTTP/2 200
date: Mon, 20 May 2024 15:10:12 GMT
content-type: text/html; charset=utf-8
content-length: 360435
x-powered-by: Express
x-ratelimit-limit: 500
x-ratelimit-remaining: 499
x-ratelimit-reset: 1716217860
cache-control: max-age=604800
link:
etag: W/"57ff3-K0RDKeL2LMu4u9/nzyhTKTtisvk"
vary: Accept-Encoding

So, as you can see, on my end the curl only returns a 302 if there are no special characters. When special characters exist, I'm still seeing a 200 OK

@tdonohue
Copy link
Member

@alexandrevryghem : A follow-up. I did notice it works perfectly in my browser now.

If I copy either of these URLs into my browser window, then I briefly see the download page in the UI & I'm properly redirected to download the file.

https://sandbox.dspace.org/bitstream/handle/10673/1131/1test_ć.pdf
https://sandbox.dspace.org/bitstream/handle/10673/1131/2test_ć.pdf

It's odd to me though that the curl returns 200 OK, even though the redirect does occur in the browser.

So, this bug ticket appears to be partially fixed. These /bitstream/handle/[prefix]/[suffix]/[filename] download URLs are redirecting & working. It's just they are sometimes returning a 200 OK instead of the expected 302 Redirect

I'll try to test the fixes in #3062 later today to see if they have an impact on this & stop the 200 OK from being returned.

@alexandrevryghem
Copy link
Member

alexandrevryghem commented May 20, 2024

@tdonohue: I just retested it now too, but for me it still returns 302 for me so that's weird 😅

→ curl --head https://sandbox.dspace.org/bitstream/handle/10673/1131/test_pdf.pdf 

HTTP/2 302
date: Mon, 20 May 2024 15:42:21 GMT
content-type: text/plain; charset=utf-8
content-length: 120
location: https://sandbox.dspace.org/server/api/core/bitstreams/83ebc316-dc6b-45f6-b667-b76ad99a818c/content
x-powered-by: Express
x-ratelimit-limit: 500
x-ratelimit-remaining: 499
x-ratelimit-reset: 1716219780
cache-control: max-age=604800
vary: Accept

→ curl --head https://sandbox.dspace.org/bitstream/handle/10673/1131/1test_ć.pdf
HTTP/2 302
date: Mon, 20 May 2024 15:42:32 GMT
content-type: text/plain; charset=utf-8
content-length: 120
location: https://sandbox.dspace.org/server/api/core/bitstreams/55021e94-455e-4eab-b1b8-eff9f552b882/content
x-powered-by: Express
x-ratelimit-limit: 500
x-ratelimit-remaining: 498
x-ratelimit-reset: 1716219780
cache-control: max-age=604800
link:
vary: Accept

→ curl --head https://sandbox.dspace.org/bitstream/handle/10673/1131/2test_ć.pdf
HTTP/2 302
date: Mon, 20 May 2024 15:42:41 GMT
content-type: text/plain; charset=utf-8
content-length: 120
location: https://sandbox.dspace.org/server/api/core/bitstreams/de373b4a-9959-4c3e-82f5-e855926050d9/content
x-powered-by: Express
x-ratelimit-limit: 500
x-ratelimit-remaining: 499
x-ratelimit-reset: 1716219780
cache-control: max-age=604800
link:
vary: Accept

Maybe it has to do with our curl version, mine is on 8.6.0:

→ curl --version                                                                                                                                                                                           [17:42:41]
curl 8.6.0 (x86_64-apple-darwin23.0) libcurl/8.6.0 (SecureTransport) LibreSSL/3.3.6 zlib/1.2.12 nghttp2/1.61.0
Release-Date: 2024-01-31
Protocols: dict file ftp ftps gopher gophers http https imap imaps ipfs ipns ldap ldaps mqtt pop3 pop3s rtsp smb smbs smtp smtps telnet tftp
Features: alt-svc AsynchDNS GSS-API HSTS HTTP2 HTTPS-proxy IPv6 Kerberos Largefile libz MultiSSL NTLM NTLM_WB SPNEGO SSL threadsafe UnixSocketsc

@tdonohue
Copy link
Member

@alexandrevryghem You may be correct that it's something with curl. I seem to be running an older version::

curl 7.81.0 (x86_64-pc-linux-gnu) libcurl/7.81.0 OpenSSL/3.0.2 zlib/1.2.11 brotli/1.0.9 zstd/1.4.8 libidn2/2.3.2 libpsl/0.21.0 (+libidn2/2.3.2) libssh/0.9.6/openssl/zlib nghttp2/1.43.0 librtmp/2.3 OpenLDAP/2.5.14

In any case, I'm seeing your point that it appears this ticket might already be "solved". I've got some time in my afternoon to do some more testing, so I'll see if #3062 helps. Either way, if this is working for later versions of curl, that seems good enough for me. I'm hoping that would mean Google Scholar (and similar crawlers) will see this redirect properly instead of seeing the "200 OK".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug claimed: Atmire Atmire team is working on this issue & will contribute back component: SEO Search Engine Optimization testathon Reported by a tester during Community Testathon
Projects
Development

Successfully merging a pull request may close this issue.

6 participants