Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Source specifying for relative links is allowed #232

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

myhailo-chernyshov-rg
Copy link
Contributor

Description

It is possible that the .imscc course dump contains files with relative links to resources for some reason not included into it (e.g. some static files of LMS that are not processed by its exporting tool). It can become a reason of broken images, not loaded styles etc.To handle such situations, it is allowed to specify the relative links source during converting.

Usage instructions

Specify relative links source using -s flag:

cc2olx -i <IMSCC_FILE> -s <RELATIVE_LINKS_SOURCE>

Deadline

"None"

@openedx-webhooks openedx-webhooks added the open-source-contribution PR author is not from Axim or 2U label Nov 21, 2024
@openedx-webhooks
Copy link

openedx-webhooks commented Nov 21, 2024

Thanks for the pull request, @myhailo-chernyshov-rg!

What's next?

Please work through the following steps to get your changes ready for engineering review:

🔘 Get product approval

If you haven't already, check this list to see if your contribution needs to go through the product review process.

  • If it does, you'll need to submit a product proposal for your contribution, and have it reviewed by the Product Working Group.
    • This process (including the steps you'll need to take) is documented here.
  • If it doesn't, simply proceed with the next step.

🔘 Provide context

To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can:

  • Dependencies

    This PR must be merged before / after / at the same time as ...

  • Blockers

    This PR is waiting for OEP-1234 to be accepted.

  • Timeline information

    This PR must be merged by XX date because ...

  • Partner information

    This is for a course on edx.org.

  • Supporting documentation
  • Relevant Open edX discussion forum threads

🔘 Get a green build

If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

🔘 Let us know that your PR is ready for review:

Who will review my changes?

This repository is currently unmaintained.

To get help with finding a technical reviewer, tag the community contributions project manager for this PR in a comment and let them know that your changes are ready for review:

  1. On the right-hand side of the PR, find the Contributions project, click the caret in the top right corner to expand it, and check the "Primary PM" field for the name of your PM.
  2. Find their GitHub handle here.

Where can I find more information?

If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources:

When can I expect my changes to be merged?

Our goal is to get community contributions seen and reviewed as efficiently as possible.

However, the amount of time that it takes to review and merge a PR can vary significantly based on factors such as:

  • The size and impact of the changes that it introduces
  • The need for product review
  • Maintenance status of the parent repository

💡 As a result it may take up to several weeks or months to complete a review and merge your PR.

@mphilbrick211
Copy link

Hi @myhailo-chernyshov-rg! Thanks for this contribution! It looks like you're contributing on behalf of Raccoon Gang - please have your manager reach out to [email protected] to have you added to Raccoon Gang's existing entity agreement with us. Thank you!

@mphilbrick211 mphilbrick211 added the needs test run Author's first PR to this repository, awaiting test authorization from Axim label Nov 26, 2024
@@ -250,6 +252,18 @@ def process_external_tools_link(item, html):
html = html.replace(item, external_tool_url)
return html

def process_extra_static_files(item, html):
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a docstring here explaining what this function is for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

self.cartridge = cartridge
self.doc = None
self.link_file = link_file
self.passport_file = passport_file
self.relative_links_source = relative_links_source
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add some validation for the format of relative_links_source.

Copy link
Contributor Author

@myhailo-chernyshov-rg myhailo-chernyshov-rg Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation is added on the argument parsing step.

if self.relative_links_source is None:
return html

for static_file in self.cartridge.extra_static_files:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How large can extra_static_files get? This looks like it could get pretty expensive to loop through every entity in this list for each item.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During the resource content collection, when the external file is found, it's added to this list. So, the list size isn't limited and depends on CC archive content. But we need to process each such link to make it absolute, so we do need to iterate over the list. I don't think it's too expensive because other exporter tasks require much more actions to perform. Do you have any ideas how we can handle all the links without iterating over them?

Copy link

@ormsbee ormsbee Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be misunderstanding this at a basic level, but here is how I interpret this:

  1. There is a shared list of extra_static_files.
  2. There are many items.
  3. This is lookup n^2, i.e. (number of items) * (number of extra static files).
  4. Each static file maps to the same OLX static path every time.

If that is accurate, then could create a dict that has the mapped values, where the keys are OLX_STATIC_PATH_TEMPLATE.format(static_filename=static_file) and the values are the static_file? That dict could be initialized once and stored on the object, so that the lookup here is just if item in {my lookup dict}, i.e. the whole thing goes from n^2 to n.

The thing is, if extra_static_files is small, this doesn't matter at all. In the past this sort of thing has been a problem when the list grew to thousands.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

openedx/edx-platform#11095 (comment) is an example of something like this that has come up before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's refactored and the bug with making web_resources static links absolute is fixed.

@mphilbrick211 mphilbrick211 removed the needs test run Author's first PR to this repository, awaiting test authorization from Axim label Dec 9, 2024
@myhailo-chernyshov-rg myhailo-chernyshov-rg force-pushed the myhailochernyshov/source-specifying-for-relative-links-is-allowed branch from ff473b9 to 751e8ed Compare December 18, 2024 21:39
@myhailo-chernyshov-rg myhailo-chernyshov-rg force-pushed the myhailochernyshov/source-specifying-for-relative-links-is-allowed branch from 9da9855 to 1b67ba1 Compare December 19, 2024 06:06
Copy link

@ormsbee ormsbee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question about whether it's possible to make use of Django's built-in validation for URLs rather than rolling our own. Other than that, I think this is good to merge.

Comment on lines +14 to +62
# IP patterns
IPV4_REGEX = (
r"(?:0|25[0-5]|2[0-4][0-9]|1[0-9]?[0-9]?|[1-9][0-9]?)"
r"(?:\.(?:0|25[0-5]|2[0-4][0-9]|1[0-9]?[0-9]?|[1-9][0-9]?)){3}"
)
IPV6_REGEX = r"\[[0-9a-f:.]+\]" # (simple regex, validated later)

# Host patterns
HOSTNAME_REGEX = rf"[a-z{UL}0-9](?:[a-z{UL}0-9-]{{0,61}}[a-z{UL}0-9])?"
# Max length for domain name labels is 63 characters per RFC 1034 sec. 3.1
DOMAIN_REGEX = rf"(?:\.(?!-)[a-z{UL}0-9-]{{1,63}}(?<!-))*"
TLD_REGEX = (
r"\." # dot
r"(?!-)" # can't start with a dash
rf"(?:[a-z{UL}-]{{2,63}}" # domain label
r"|xn--[a-z0-9]{1,59})" # or punycode label
r"(?<!-)" # can't end with a dash
r"\.?" # may have a trailing dot
)
HOST_REGEX = "(" + HOSTNAME_REGEX + DOMAIN_REGEX + TLD_REGEX + "|localhost)"

LINK_SOURCE_REGEX = (
r"^https?://" # scheme is validated separately
r"(?:[^\s:@/]+(?::[^\s:@/]*)?@)?" # user:pass authentication
rf"(?P<netloc>{IPV4_REGEX}|{IPV6_REGEX}|{HOST_REGEX})"
r"(?::[0-9]{1,5})?" # port
r"/?" # trailing slash
r"\Z"
)

message = "Enter a valid URL."

def __call__(self, value: str) -> str:
if not (link_source_match := re.fullmatch(self.LINK_SOURCE_REGEX, value, re.IGNORECASE)):
raise argparse.ArgumentTypeError(self.message)

self._validate_ipv6_address(link_source_match.group("netloc"))

return value

def _validate_ipv6_address(self, netloc: str) -> None:
"""
Check netloc correctness if it's an IPv6 address.
"""
potential_ipv6_regex = r"^\[(.+)\](?::[0-9]{1,5})?$"
if netloc_match := re.search(potential_ipv6_regex, netloc):
potential_ip = netloc_match[1]
if not is_valid_ipv6_address(potential_ip):
raise argparse.ArgumentTypeError(self.message)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there something that this function provides that would not be covered with Django's built-in UrlValidator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If thought it's not a good idea to add such huge dependency as Django to the project just for validation purpose. But I noticed that for other PR for this repo I need django.utils.module_loading.import_string, so perhaps adding this dependency is becoming justified. According to your question, there is a difference between the Django UrlValidator and this one: Django UrlValidator allows resource path in the URL, but link source must not contain it. Also, 'ftp', 'ftps' shemes are forbidden in link source validator. We can subclass Django URLValidator and update URL regexp and allowed schemes.
How do you think it’s best to proceed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
open-source-contribution PR author is not from Axim or 2U
Projects
Status: Ready for Review
Development

Successfully merging this pull request may close these issues.

4 participants