
Relative to absolute #32

Open · skytreader wants to merge 9 commits into master
Conversation

skytreader

Fixes #8.

Instead of passing around the domain, a cleaner way would be to create an object that abstracts the parsing of a document. That way, the domain can be extracted from the constructor into a field and then used by methods that need it. But I think this PR is large enough as it is for #8 so I guess I'll owe the refactor to another PR. :)
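
For illustration, a minimal sketch of what that refactor could look like (the class and method names below are hypothetical, not part of this PR):

from urllib.parse import urljoin

class DocumentParser:
    # Hypothetical wrapper: the domain is captured once in the constructor
    # instead of being passed around to every method that needs it.
    def __init__(self, html, domain=""):
        self.html = html
        self.domain = domain

    def absolutize(self, url_string):
        # Methods read the domain from the field rather than taking it
        # as an extra parameter.
        return urljoin(self.domain, url_string)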

@edsu (Owner) commented Jul 12, 2019

I apologize for not seeing this sooner. It seems that the pull request has conflicts now. If this is something you still want to get in, let me know if you can address the conflicts, and I promise to apply it sooner this time!

-    def __init__(self, string):
-        self.string = string
+    def __init__(self, string, domain=""):
+        if string.startswith("http://") or string.startswith("https://"):
Contributor

This feels like it's duplicating the stdlib urljoin semantics:

>>> from urlparse import urljoin
>>> urljoin('https://example.com/foo/bar', '/baaz')
'https://example.com/baaz'
>>> urljoin('https://example.com/foo/bar', 'http://quux/baaz')
'http://quux/baaz'

Author

A few questions:

  1. Judging by the .travis.yml, this project is officially on Python 3. As such, I should use urllib.parse, right?
  2. Sorry, I'm not sure I agree that this bit is duplicating urljoin semantics. This conditional just ensures that the string starts with either "http" or "https". urljoin does not do that:
Python 3.5.2 (default, Nov 12 2018, 13:43:14) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import urljoin
>>> urljoin("github.com", "edsu")
'edsu'

Can you clarify if I'm missing something here?

Contributor

urljoin expects its first parameter to be a URL:

>>> from urllib.parse import urljoin
>>> urljoin("https://github.com", "edsu")
'https://github.com/edsu'

@@ -115,6 +127,15 @@ def __eq__(self, other):
     def __repr__(self):
         return self.string

+    @staticmethod
+    def get_domain(url_string):
Contributor

Perhaps this should be replaced with something using urlparse, which returns a handy named tuple:

>>> urlparse('https://example.com/foo/bar')
ParseResult(scheme='https', netloc='example.com', path='/foo/bar', params='', query='', fragment='')

Author

I used urlparse inside this method instead of manually splitting and joining, but urlparse can't be a drop-in replacement for what I want to achieve here:

>>> urlparse("github.com/edsu/microdata/pull/32/files")
ParseResult(scheme='', netloc='', path='github.com/edsu/microdata/pull/32/files', params='', query='', fragment='')

Contributor

urlparse expects a URL:

>>> urlparse("https://github.com/edsu/microdata/pull/32/files")
ParseResult(scheme='https', netloc='github.com', path='/edsu/microdata/pull/32/files', params='', query='', fragment='')

Author

I'm not really knowledgeable about this field, but my point is: are we sure we'll never encounter malformed URLs? I get that urllib expects well-formed URLs, but shouldn't a microdata parser be more forgiving? Hence I manually check and adapt in case the string leaves out the protocol part.
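
To make that concrete, here is a rough sketch of the forgiving behaviour being described (the helper name and the defaulted http:// scheme are assumptions for illustration, not code from this PR):

from urllib.parse import urljoin, urlparse

def make_absolute(base, url_string):
    # urljoin and urlparse only behave as expected when the base URL
    # carries a scheme, so prepend one (assuming http as the fallback)
    # if it was left out.
    if not urlparse(base).scheme:
        base = "http://" + base
    return urljoin(base, url_string)

With the malformed input from the example above:

>>> make_absolute("github.com", "edsu")
'http://github.com/edsu'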

@skytreader (Author)
Hey there. This is no longer relevant to my work, but let me just complete the job; besides, I know others who might benefit from it. That said, I'll get back to this over the weekend.

@skytreader (Author)
Just got back to this. Resolving conflicts and addressing comments took a while longer since I'm no longer familiar with the domain (knowledge). Anyway, I see that tests are failing :(. I'll address that next, but in the meantime, feel free to comment further on my clarifications.

Successfully merging this pull request may close these issues: make relative URLs absolute.