Fixing URL shortening for URLs with Arabic characters. #1651

Merged: 1 commit, Sep 13, 2023

Conversation

caiosba
Contributor

@caiosba caiosba commented Sep 13, 2023

Description

When a URL contains unescaped Arabic characters, it is not extracted correctly: extraction stops at the first Arabic character and only the part before it is returned. This happens not only with the library we use (twitter-text), but also with Ruby's uri library and the postrank-uri gem:

irb(main):009:0> PostRank::URI.extract('https://fatabyyano.net/هذا-المقطع-ليس-لاشتباكات-حديثة-بين-الج/')
=> ["https://fatabyyano.net/"]
irb(main):010:0> URI.extract('https://fatabyyano.net/هذا-المقطع-ليس-لاشتباكات-حديثة-بين-الج/')
=> ["https://fatabyyano.net/"]
irb(main):011:0> Twitter::TwitterText::Extractor.extract_urls('https://fatabyyano.net/هذا-المقطع-ليس-لاشتباكات-حديثة-بين-الج/')
=> ["https://fatabyyano.net/"]

The fix is to escape URLs that contain Arabic characters before passing them to the URL extraction method when shortening URLs.
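As a rough illustration of that idea (not the code merged in this PR), the sketch below percent-encodes the Arabic characters inside URL-like tokens before the text would reach an extractor. The helper name `escape_arabic_in_urls` and the `URL_TOKEN` pattern are hypothetical, and the pattern is deliberately simplistic: it assumes a URL is any whitespace-delimited token starting with `http://` or `https://`.

```ruby
# Hypothetical sketch, not the implementation from this PR.
# Assumption: a URL is any whitespace-delimited token starting with http(s)://.
URL_TOKEN = %r{https?://\S+}

# Percent-encode every run of Arabic characters found inside URL-like tokens,
# leaving Arabic text outside of URLs untouched.
def escape_arabic_in_urls(text)
  text.gsub(URL_TOKEN) do |url|
    url.gsub(/\p{Arabic}+/) do |run|
      # Encode each UTF-8 byte of the run as %XX.
      run.bytes.map { |byte| format('%%%02X', byte) }.join
    end
  end
end

escaped = escape_arabic_in_urls('https://fatabyyano.net/هذا-المقطع/')
puts escaped              # the URL, now with its path percent-encoded
puts escaped.ascii_only?  # true once no Arabic characters remain
```

Once the URL is ASCII-only, it no longer contains the characters at which the extractors above stop. Whether the real fix escapes only Arabic or all non-ASCII characters is not stated here, so treat the `\p{Arabic}` choice as an assumption.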

Fixes CV2-3690.

How has this been tested?

TDD. I added a unit test that reproduced the issue.

Things to pay attention to during code review

Resources, reports and newsletters are affected by this change. The fix should be applied only to input texts that contain Arabic characters.

Checklist

  • I have performed a self-review of my own code
  • I have added unit and feature tests, if the PR implements a new feature or otherwise would benefit from additional testing
  • I have added regression tests, if the PR fixes a bug
  • I have added logging, exception reporting, and custom tracing with any additional information required for debugging
  • I have commented my code in hard-to-understand areas, if any
  • I have made needed changes to the README
  • My changes generate no new warnings
  • If I added a third party module, I included a rationale for doing so and followed our current guidelines

@caiosba caiosba requested a review from melsawy as a code owner September 13, 2023 02:05

codeclimate bot commented Sep 13, 2023

Code Climate has analyzed commit 5e8e55d and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (100% is the threshold).

This pull request will bring the total coverage in the repository to 99.7% (0.1% change).

@caiosba caiosba merged commit 8a9a02b into develop Sep 13, 2023
8 checks passed
@caiosba caiosba deleted the fix/CV2-3690-shorten-arabic-urls branch September 13, 2023 12:25