-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Source specifying for relative links is allowed #232
base: master
Are you sure you want to change the base?
feat: Source specifying for relative links is allowed #232
Conversation
Thanks for the pull request, @myhailo-chernyshov-rg! What's next?Please work through the following steps to get your changes ready for engineering review: 🔘 Get product approvalIf you haven't already, check this list to see if your contribution needs to go through the product review process.
🔘 Provide contextTo help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:
🔘 Get a green buildIf one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green. 🔘 Let us know that your PR is ready for review:Who will review my changes?This repository is currently unmaintained. To get help with finding a technical reviewer, tag the community contributions project manager for this PR in a comment and let them know that your changes are ready for review:
Where can I find more information?If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:
When can I expect my changes to be merged?Our goal is to get community contributions seen and reviewed as efficiently as possible. However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:
💡 As a result it may take up to several weeks or months to complete a review and merge your PR. |
Hi @myhailo-chernyshov-rg! Thanks for this contribution! It looks like you're contributing on behalf of Raccoon Gang - please have your manager reach out to [email protected] to have you added to Raccoon Gang's existing entity agreement with us. Thank you! |
src/cc2olx/olx.py
Outdated
@@ -250,6 +252,18 @@ def process_external_tools_link(item, html): | |||
html = html.replace(item, external_tool_url) | |||
return html | |||
|
|||
def process_extra_static_files(item, html): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a docstring here explaining what this function is for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
self.cartridge = cartridge | ||
self.doc = None | ||
self.link_file = link_file | ||
self.passport_file = passport_file | ||
self.relative_links_source = relative_links_source |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add some validation for the format of relative_links_source
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Validation is added on the argument parsing step.
src/cc2olx/olx.py
Outdated
if self.relative_links_source is None: | ||
return html | ||
|
||
for static_file in self.cartridge.extra_static_files: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How large can extra_static_files
get? This looks like it could get pretty expensive to loop through every entity in this list for each item.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
During the resource content collection, when the external file is found, it's added to this list. So, the list size isn't limited and depends on CC archive content. But we need to process each such link to make it absolute, so we do need to iterate over the list. I don't think it's too expensive because other exporter tasks require much more actions to perform. Do you have any ideas how we can handle all the links without iterating over them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be misunderstanding this at a basic level, but here is how I interpret this:
- There is a shared list of
extra_static_files
. - There are many items.
- This is lookup n^2, i.e. (number of items) * (number of extra static files).
- Each static file maps to the same OLX static path every time.
If that is accurate, then could create a dict that has the mapped values, where the keys are OLX_STATIC_PATH_TEMPLATE.format(static_filename=static_file)
and the values are the static_file
? That dict could be initialized once and stored on the object, so that the lookup here is just if item in {my lookup dict}
, i.e. the whole thing goes from n^2 to n.
The thing is, if extra_static_files
is small, this doesn't matter at all. In the past this sort of thing has been a problem when the list grew to thousands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
openedx/edx-platform#11095 (comment) is an example of something like this that has come up before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's refactored and the bug with making web_resources
static links absolute is fixed.
ff473b9
to
751e8ed
Compare
9da9855
to
1b67ba1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question about whether it's possible to make use of Django's built-in validation for URLs rather than rolling our own. Other than that, I think this is good to merge.
# IP patterns | ||
IPV4_REGEX = ( | ||
r"(?:0|25[0-5]|2[0-4][0-9]|1[0-9]?[0-9]?|[1-9][0-9]?)" | ||
r"(?:\.(?:0|25[0-5]|2[0-4][0-9]|1[0-9]?[0-9]?|[1-9][0-9]?)){3}" | ||
) | ||
IPV6_REGEX = r"\[[0-9a-f:.]+\]" # (simple regex, validated later) | ||
|
||
# Host patterns | ||
HOSTNAME_REGEX = rf"[a-z{UL}0-9](?:[a-z{UL}0-9-]{{0,61}}[a-z{UL}0-9])?" | ||
# Max length for domain name labels is 63 characters per RFC 1034 sec. 3.1 | ||
DOMAIN_REGEX = rf"(?:\.(?!-)[a-z{UL}0-9-]{{1,63}}(?<!-))*" | ||
TLD_REGEX = ( | ||
r"\." # dot | ||
r"(?!-)" # can't start with a dash | ||
rf"(?:[a-z{UL}-]{{2,63}}" # domain label | ||
r"|xn--[a-z0-9]{1,59})" # or punycode label | ||
r"(?<!-)" # can't end with a dash | ||
r"\.?" # may have a trailing dot | ||
) | ||
HOST_REGEX = "(" + HOSTNAME_REGEX + DOMAIN_REGEX + TLD_REGEX + "|localhost)" | ||
|
||
LINK_SOURCE_REGEX = ( | ||
r"^https?://" # scheme is validated separately | ||
r"(?:[^\s:@/]+(?::[^\s:@/]*)?@)?" # user:pass authentication | ||
rf"(?P<netloc>{IPV4_REGEX}|{IPV6_REGEX}|{HOST_REGEX})" | ||
r"(?::[0-9]{1,5})?" # port | ||
r"/?" # trailing slash | ||
r"\Z" | ||
) | ||
|
||
message = "Enter a valid URL." | ||
|
||
def __call__(self, value: str) -> str: | ||
if not (link_source_match := re.fullmatch(self.LINK_SOURCE_REGEX, value, re.IGNORECASE)): | ||
raise argparse.ArgumentTypeError(self.message) | ||
|
||
self._validate_ipv6_address(link_source_match.group("netloc")) | ||
|
||
return value | ||
|
||
def _validate_ipv6_address(self, netloc: str) -> None: | ||
""" | ||
Check netloc correctness if it's an IPv6 address. | ||
""" | ||
potential_ipv6_regex = r"^\[(.+)\](?::[0-9]{1,5})?$" | ||
if netloc_match := re.search(potential_ipv6_regex, netloc): | ||
potential_ip = netloc_match[1] | ||
if not is_valid_ipv6_address(potential_ip): | ||
raise argparse.ArgumentTypeError(self.message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there something that this function provides that would not be covered with Django's built-in UrlValidator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If thought it's not a good idea to add such huge dependency as Django to the project just for validation purpose. But I noticed that for other PR for this repo I need django.utils.module_loading.import_string
, so perhaps adding this dependency is becoming justified. According to your question, there is a difference between the Django UrlValidator and this one: Django UrlValidator allows resource path in the URL, but link source must not contain it. Also, 'ftp', 'ftps' shemes are forbidden in link source validator. We can subclass Django URLValidator and update URL regexp and allowed schemes.
How do you think it’s best to proceed?
Description
It is possible that the
.imscc
course dump contains files with relative links to resources for some reason not included into it (e.g. some static files of LMS that are not processed by its exporting tool). It can become a reason of broken images, not loaded styles etc.To handle such situations, it is allowed to specify the relative links source during converting.Usage instructions
Specify relative links source using
-s
flag:Deadline
"None"