Fixing URL shortening for URLs with Arabic characters. #1651

caiosba · 2023-09-13T02:05:01Z

Description

When URLs contain unescaped Arabic characters, they are not extracted correctly: the extracted URL is truncated at the first Arabic character. This happens not only with the library we use (twitter-text), but also with Ruby's uri library and the postrank-uri gem:

irb(main):009:0> PostRank::URI.extract('https://fatabyyano.net/هذا-المقطع-ليس-لاشتباكات-حديثة-بين-الج/')
=> ["https://fatabyyano.net/"]
irb(main):010:0> URI.extract('https://fatabyyano.net/هذا-المقطع-ليس-لاشتباكات-حديثة-بين-الج/')
=> ["https://fatabyyano.net/"]
irb(main):011:0> Twitter::TwitterText::Extractor.extract_urls('https://fatabyyano.net/هذا-المقطع-ليس-لاشتباكات-حديثة-بين-الج/')
=> ["https://fatabyyano.net/"]

The fix is to escape URLs that contain Arabic characters before passing them to the URL extraction method when shortening URLs.
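A minimal sketch of the idea, not the code merged in this PR: the helper name `escape_arabic` and the constant `ARABIC_RUN` are hypothetical, and the sketch assumes that percent-encoding the UTF-8 bytes of each run of Arabic-script characters is an acceptable form of escaping. Text without Arabic characters is returned unchanged.

```ruby
# Hypothetical helper for illustration only; names are assumptions.
# Matches one or more consecutive characters in the Arabic script.
ARABIC_RUN = /\p{Arabic}+/

def escape_arabic(text)
  # Leave input without Arabic characters exactly as it was.
  return text unless text.match?(ARABIC_RUN)

  # Percent-encode the UTF-8 bytes of each run of Arabic characters.
  # ASCII parts such as the scheme, host, slashes and hyphens are untouched.
  text.gsub(ARABIC_RUN) do |run|
    run.bytes.map { |byte| format('%%%02X', byte) }.join
  end
end

puts escape_arabic('https://fatabyyano.net/هذا-المقطع/')
```

The escaped string contains only ASCII characters, so it can then be handed to the extraction method in place of the original text.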

Fixes CV2-3690.

How has this been tested?

TDD. I added a unit test that reproduced the issue.

Things to pay attention to during code review

Resources, reports and newsletters are affected by this change. The fix should only be applied to input texts that contain Arabic characters.

Checklist

I have performed a self-review of my own code
I have added unit and feature tests, if the PR implements a new feature or otherwise would benefit from additional testing
I have added regression tests, if the PR fixes a bug
I have added logging, exception reporting, and custom tracing with any additional information required for debugging
I have commented my code in hard-to-understand areas, if any
I have made needed changes to the README
My changes generate no new warnings
If I added a third party module, I included a rationale for doing so and followed our current guidelines

codeclimate · 2023-09-13T03:46:43Z

Code Climate has analyzed commit 5e8e55d and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (100% is the threshold).

This pull request will bring the total coverage in the repository to 99.7% (0.1% change).

caiosba requested a review from melsawy as a code owner September 13, 2023 02:05

melsawy approved these changes Sep 13, 2023

caiosba merged commit 8a9a02b into develop Sep 13, 2023
8 checks passed

caiosba deleted the fix/CV2-3690-shorten-arabic-urls branch September 13, 2023 12:25