Relative to absolute #32
base: master
Conversation
I apologize for not seeing this sooner. It seems that the pull request has conflicts now. If this is something you still want to get in, let me know if you can address the conflicts and I promise to apply it sooner this time!
-    def __init__(self, string):
-        self.string = string
+    def __init__(self, string, domain=""):
+        if string.startswith("http://") or string.startswith("https://"):
This feels like it's duplicating the stdlib urljoin semantics:
>>> from urlparse import urljoin
>>> urljoin('https://example.com/foo/bar', '/baaz')
'https://example.com/baaz'
>>> urljoin('https://example.com/foo/bar', 'http://quux/baaz')
'http://quux/baaz'
A few questions:
- Judging by the .travis.yml, this project is officially on Python 3. As such I should use urllib.parse, right?
- Sorry, I'm not sure I agree with you that this bit is duplicating urljoin semantics. This conditional just ensures that the string starts with either "http://" or "https://". urljoin does not do that:
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urljoin
>>> urljoin("github.com", "edsu")
'edsu'
Can you clarify if I'm missing something here?
urljoin expects its first parameter to be a URL:
>>> from urllib.parse import urljoin
>>> urljoin("https://github.com", "edsu")
'https://github.com/edsu'
@@ -115,6 +127,15 @@ def __eq__(self, other):
     def __repr__(self):
         return self.string

     @staticmethod
     def get_domain(url_string):
Perhaps this should be replaced with something using urlparse, which returns a handy named tuple:
>>> urlparse('https://example.com/foo/bar')
ParseResult(scheme='https', netloc='example.com', path='/foo/bar', params='', query='', fragment='')
I used urlparse inside this method instead of manually splitting and joining, but urlparse can't be a drop-in replacement for what I want to achieve here:
>>> urlparse("github.com/edsu/microdata/pull/32/files")
ParseResult(scheme='', netloc='', path='github.com/edsu/microdata/pull/32/files', params='', query='', fragment='')
urlparse expects a URL:
>>> urlparse("https://github.com/edsu/microdata/pull/32/files")
ParseResult(scheme='https', netloc='github.com', path='/edsu/microdata/pull/32/files', params='', query='', fragment='')
I'm not really knowledgeable about the field here, but my point is: are we sure we'll never encounter malformed URLs? I get that urllib expects well-formed URLs, but shouldn't a microdata parser be more forgiving? Hence I manually check and adapt in case the string leaves out the protocol part.
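For illustration, the two approaches could be combined: normalize a scheme-less string first, then let urlparse/urljoin do the rest. This is only a sketch; the helper names (ensure_scheme, get_domain) and the choice of "http" as the default scheme are assumptions, not code from this PR.

```python
from urllib.parse import urljoin, urlparse

def ensure_scheme(url_string, default_scheme="http"):
    # Be forgiving: prepend a scheme if the string leaves out the protocol part.
    if not url_string.startswith(("http://", "https://")):
        return "%s://%s" % (default_scheme, url_string)
    return url_string

def get_domain(url_string):
    # Extract scheme + netloc so relative paths can be resolved against it.
    parts = urlparse(ensure_scheme(url_string))
    return "%s://%s" % (parts.scheme, parts.netloc)

print(get_domain("github.com/edsu/microdata/pull/32/files"))
# http://github.com
print(urljoin(get_domain("github.com/edsu"), "/microdata"))
# http://github.com/microdata
```

This keeps the "check and adapt" step in one place while still leaning on the stdlib for the actual parsing and joining.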
Hey there. This is not relevant to my work anymore, but let me just complete the job; besides, I know others who might benefit from this. That said, I'll get back to it this weekend.
Force-pushed from e03ef53 to 3602f31
Just got back to this. Resolving conflicts and addressing comments took a while longer since I'm no longer familiar with the domain (knowledge). Anyway, I see that tests are failing :(. I'll address that next, but in the meantime, feel free to comment further on my clarifications.
Fixes #8.
Instead of passing around the domain, a cleaner way would be to create an object that abstracts the parsing of a document. That way, the domain can be extracted from the constructor into a field and then used by methods that need it. But I think this PR is large enough as it is for #8 so I guess I'll owe the refactor to another PR. :)
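As a rough sketch of that refactor idea (the class name and method names here are hypothetical, not part of this PR): the base URL is captured once in the constructor, and every method that needs it reads it from the instance instead of taking a domain parameter.

```python
from urllib.parse import urljoin

class DocumentParser:
    """Sketch: hold the document's base URL once instead of threading a
    domain argument through every call."""

    def __init__(self, base_url):
        self.base_url = base_url

    def absolutize(self, url_string):
        # Relative references resolve against the stored base URL;
        # absolute URLs pass through unchanged.
        return urljoin(self.base_url, url_string)

parser = DocumentParser("https://example.com/items/")
print(parser.absolutize("photo.jpg"))  # https://example.com/items/photo.jpg
print(parser.absolutize("/about"))     # https://example.com/about
```

The design win is that the "what is the base URL?" question is answered exactly once, at construction time, rather than at every call site.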