
Relative to absolute #32

Open · skytreader wants to merge 9 commits into master
Conversation

skytreader

Fixes #8.

Instead of passing around the domain, a cleaner way would be to create an object that abstracts the parsing of a document. That way, the domain can be extracted from the constructor into a field and then used by methods that need it. But I think this PR is large enough as it is for #8 so I guess I'll owe the refactor to another PR. :)
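
For illustration, a minimal sketch of what that refactor could look like (the class and method names below are hypothetical, not part of this PR):

from urllib.parse import urljoin

class DocumentParser:
    # Hypothetical wrapper: the domain is captured once in the constructor
    # instead of being passed around to every method that needs it.
    def __init__(self, html, domain=""):
        self.html = html
        self.domain = domain

    def absolutize(self, url_string):
        # Methods read the domain from the field rather than taking it
        # as an extra parameter.
        return urljoin(self.domain, url_string)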

@edsu (Owner) commented Jul 12, 2019

I apologize for not seeing this sooner. It seems that the pull request has conflicts now. If this is something you still want to get in, let me know if you can address the conflicts, and I promise to apply it sooner this time!

-    def __init__(self, string):
-        self.string = string
+    def __init__(self, string, domain=""):
+        if string.startswith("http://") or string.startswith("https://"):
Contributor

This feels like it's duplicating the stdlib urljoin semantics:

>>> from urlparse import urljoin
>>> urljoin('https://example.com/foo/bar', '/baaz')
'https://example.com/baaz'
>>> urljoin('https://example.com/foo/bar', 'http://quux/baaz')
'http://quux/baaz'

Author

A few questions:

  1. Judging by the .travis.yml, this project is officially on Python 3. As such, I should use urllib.parse, right?
  2. Sorry, I'm not sure I agree that this bit is duplicating urljoin semantics. This conditional just ensures that the string starts with either "http" or "https". urljoin does not do that:
Python 3.5.2 (default, Nov 12 2018, 13:43:14) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urljoin
>>> urljoin("github.com", "edsu")
'edsu'

Can you clarify if I'm missing something here?

Contributor

urljoin expects its first parameter to be a URL:

>>> from urllib.parse import urljoin
>>> urljoin("https://github.com", "edsu")
'https://github.com/edsu'

@@ -115,6 +127,15 @@ def __eq__(self, other):
     def __repr__(self):
         return self.string

+    @staticmethod
+    def get_domain(url_string):
Contributor

Perhaps this should be replaced with something using urlparse, which returns a handy named tuple:

>>> urlparse('https://example.com/foo/bar')
ParseResult(scheme='https', netloc='example.com', path='/foo/bar', params='', query='', fragment='')

Author

I used urlparse inside this method instead of manually splitting and joining, but urlparse can't be a drop-in replacement for what I want to achieve here:

>>> urlparse("github.com/edsu/microdata/pull/32/files")
ParseResult(scheme='', netloc='', path='github.com/edsu/microdata/pull/32/files', params='', query='', fragment='')

Contributor

urlparse expects a URL:

>>> urlparse("https://github.com/edsu/microdata/pull/32/files")
ParseResult(scheme='https', netloc='github.com', path='/edsu/microdata/pull/32/files', params='', query='', fragment='')

Author

I'm not really knowledgeable about this field, but my point is: are we sure we'll never encounter malformed URLs? I get that urllib expects well-formed URLs, but shouldn't a microdata parser be more forgiving? Hence I manually check and adapt in case the string leaves out the protocol part.
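
To make that concrete, here is a rough sketch of the forgiving behaviour being described (the helper name and the defaulted http:// scheme are assumptions for illustration, not code from this PR):

from urllib.parse import urljoin, urlparse

def make_absolute(base, url_string):
    # urljoin and urlparse only behave as expected when the base URL
    # carries a scheme, so prepend one (assuming http as the fallback)
    # if it was left out.
    if not urlparse(base).scheme:
        base = "http://" + base
    return urljoin(base, url_string)

With the malformed input from the example above:

>>> make_absolute("github.com", "edsu")
'http://github.com/edsu'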

@skytreader (Author)
Hey there. This is no longer relevant to my work, but let me just complete the job; besides, I know others who might benefit from it. That said, I'll get back to this over the weekend.

@skytreader (Author)
Just got back to this. Resolving conflicts and addressing comments took a while longer since I'm no longer familiar with the domain (knowledge). Anyway, I see that tests are failing :(. I'll address that next, but in the meantime, feel free to comment further on my clarifications.

Successfully merging this pull request may close these issues: make relative URLs absolute.