Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add join method to Url class #1378

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 12 additions & 0 deletions python/pydantic_core/_pydantic_core.pyi
Original file line number Diff line number Diff line change
Expand Up @@ -606,6 +606,18 @@ class Url(SupportsAllComparisons):
An instance of URL
"""

def join(self, path: str, trailing_slash: bool = True) -> Self:
davidhewitt marked this conversation as resolved.
Show resolved Hide resolved
"""
Parse a string `path` as an URL, using this URL as the base.

Args:
path: The string (typically a relative URL) to parse and join with the base URL.
trailing_slash: Whether to append a trailing slash at the end of the URL.

Returns:
A new `Url` instance
"""

class MultiHostUrl(SupportsAllComparisons):
"""
A URL type with support for multiple hosts, as used by some databases for DSNs, e.g. `https://foo.com,bar.com/path`.
Expand Down
28 changes: 28 additions & 0 deletions src/url.rs
Original file line number Diff line number Diff line change
Expand Up @@ -155,6 +155,14 @@ impl PyUrl {
(self.__str__(),)
}

fn __truediv__(&self, other: &str) -> PyResult<Self> {
self.join(other, true)
}

fn __floordiv__(&self, other: &str) -> PyResult<Self> {
self.join(other, false)
}
Comment on lines +158 to +164
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, sorry I missed these in the last round of review. I think the difference between the / and // operators here is subtle and hard to document.

I think better we just have /, and make it so that it matches the default of append_trailing_slash=False. This will also simplify testing, I think.

Suggested change
fn __truediv__(&self, other: &str) -> PyResult<Self> {
self.join(other, true)
}
fn __floordiv__(&self, other: &str) -> PyResult<Self> {
self.join(other, false)
}
fn __truediv__(&self, other: &str) -> PyResult<Self> {
self.join(other, false)
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay __floordiv__ can be removed but I feel the __truediv__ should have append_trailing_slash=True because this overloaded operator would likely be used to join multiple paths in shorter code. This behaviour would feel familiar to Python users, as it resembles pathlib's path joining.
For example,

a = Url("http://a")
print(a / "b" / "c" / "d")
# http://a/b/c/d/
a = Url("file:///home/user/")
print(a / "music" / "pop")
# file:///home/user/music/pop/

With append_trailing_slash=False it would instead result in http://a/d and file:///home/user/pop which I think is not what the user would expect.
I chose to add __floordiv__ too because it would simplify adding files at the end.

print(a / "dir" / "dir" / "dir" // "file.txt") # file:///home/user/dir/dir/dir/file.txt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. Yikes, there are so many subleties here!

It seems to me that our .join() method really works like urllib.parse.urljoin when it comes to semantics, e.g.

>>> urllib.parse.urljoin("https://foo.com/a", "b")
'https://foo.com/b'

versus pathlib's

>>> pathlib.Path("/foo/a").joinpath("b")
PosixPath('/foo/a/b')

Given these are inconsistent, I think we should perhaps back away from trying to have pathlib-like semantics at all.

Would you be open to the idea of dropping the operators from the PR completely, so we can get .join() merged? We could then open a pydantic issue to discuss the design of the operators and move forward with an implementation when there's consensus?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively we could also have joinpath() which works like Pathlib and doesn't accept query string or fragments as the whole input?

And then could have / operator work like joinpath? 🤔

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

joinpath() would certainly make things cleaner. Should I implement joinpath() in this PR, or should we drop the operators for now and discuss it in the issues instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question. I think I'd prefer we just had .join() here and worried about .joinpath() and the operators later. That said, there's potentially a desire to agree a sketch of the follow ups here. @pydantic/oss - any ideas?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think without comment from anyone else, let's just do .join() here and then follow-up with an issue in the main pydantic repo where we can discuss .joinpath() and operators.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be great to have more time to discuss the semantics (does it need to match urllib? What about other libraries like furl? Should be double check with the current RFCs? We should also check what was said in this discussion).


#[classmethod]
#[pyo3(signature=(*, scheme, host, username=None, password=None, port=None, path=None, query=None, fragment=None))]
#[allow(clippy::too_many_arguments)]
Expand Down Expand Up @@ -190,6 +198,26 @@ impl PyUrl {
}
cls.call1((url,))
}

#[pyo3(signature=(path, trailing_slash=true))]
pub fn join(&self, path: &str, trailing_slash: bool) -> PyResult<Self> {
let mut new_url = self
.lib_url
.join(path)
.map_err(|err| PyValueError::new_err(err.to_string()))?;

if !trailing_slash || new_url.query().is_some() || new_url.fragment().is_some() || new_url.cannot_be_a_base() {
return Ok(PyUrl::new(new_url));
}

new_url
.path_segments_mut()
.map_err(|()| PyValueError::new_err("Url cannot be a base"))?
.pop_if_empty()
.push("");
davidhewitt marked this conversation as resolved.
Show resolved Hide resolved

Ok(PyUrl::new(new_url))
}
}

#[pyclass(name = "MultiHostUrl", module = "pydantic_core._pydantic_core", subclass, frozen)]
Expand Down
151 changes: 151 additions & 0 deletions tests/validators/test_url.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,10 @@

from ..conftest import Err, PyAndJson

SIMPLE_BASE = 'http://a/b/c/d'
QUERY_BASE = 'http://a/b/c/d;p?q'
QUERY_FRAGMENT_BASE = 'http://a/b/c/d;p?q#f'


def test_url_ok(py_and_json: PyAndJson):
v = py_and_json(core_schema.url_schema())
Expand Down Expand Up @@ -1299,3 +1303,150 @@ def test_url_build() -> None:
)
assert url == Url('postgresql://testuser:[email protected]:5432/database?sslmode=require#test')
assert str(url) == 'postgresql://testuser:[email protected]:5432/database?sslmode=require#test'


@pytest.mark.parametrize(
'base_url,join_path,expected_with_slash,expected_without_slash',
[
# Tests are based on the URL specification from https://url.spec.whatwg.org/
# Joining empty path with or without trailing slash should not affect the base url.
('http://example.com/', '', 'http://example.com/', 'http://example.com/'),
('svn://pathtorepo/dir1', 'dir2', 'svn://pathtorepo/dir2/', 'svn://pathtorepo/dir2'),
('svn+ssh://pathtorepo/dir1', 'dir2', 'svn+ssh://pathtorepo/dir2/', 'svn+ssh://pathtorepo/dir2'),
('ws://a/b', 'g', 'ws://a/g/', 'ws://a/g'),
('wss://a/b', 'g', 'wss://a/g/', 'wss://a/g'),
('http://a/b/c/de', ';x', 'http://a/b/c/;x/', 'http://a/b/c/;x'),
# Non-RFC-defined tests, covering variations of base and trailing
# slashes
('http://a/b/c/d/e/', '../../f/g/', 'http://a/b/c/f/g/', 'http://a/b/c/f/g/'),
('http://a/b/c/d/e', '../../f/g/', 'http://a/b/f/g/', 'http://a/b/f/g/'),
('http://a/b/c/d/e/', '/../../f/g/', 'http://a/f/g/', 'http://a/f/g/'),
('http://a/b/c/d/e', '/../../f/g/', 'http://a/f/g/', 'http://a/f/g/'),
('http://a/b/c/d/e/', '../../f/g', 'http://a/b/c/f/g/', 'http://a/b/c/f/g'),
('http://a/b/', '../../f/g/', 'http://a/f/g/', 'http://a/f/g/'),
(SIMPLE_BASE, 'g:h', 'g:h', 'g:h'),
(SIMPLE_BASE, 'g', 'http://a/b/c/g/', 'http://a/b/c/g'),
(SIMPLE_BASE, './g', 'http://a/b/c/g/', 'http://a/b/c/g'),
(SIMPLE_BASE, 'g/', 'http://a/b/c/g/', 'http://a/b/c/g/'),
(SIMPLE_BASE, '/g', 'http://a/g/', 'http://a/g'),
(SIMPLE_BASE, '//g', 'http://g/', 'http://g/'),
(SIMPLE_BASE, '?y', 'http://a/b/c/d?y', 'http://a/b/c/d?y'),
(SIMPLE_BASE, 'g?y', 'http://a/b/c/g?y', 'http://a/b/c/g?y'),
(SIMPLE_BASE, 'g?y/./x', 'http://a/b/c/g?y/./x', 'http://a/b/c/g?y/./x'),
(SIMPLE_BASE, '.', 'http://a/b/c/', 'http://a/b/c/'),
(SIMPLE_BASE, './', 'http://a/b/c/', 'http://a/b/c/'),
(SIMPLE_BASE, '..', 'http://a/b/', 'http://a/b/'),
(SIMPLE_BASE, '../', 'http://a/b/', 'http://a/b/'),
(SIMPLE_BASE, '../g', 'http://a/b/g/', 'http://a/b/g'),
(SIMPLE_BASE, '../..', 'http://a/', 'http://a/'),
(SIMPLE_BASE, '../../g', 'http://a/g/', 'http://a/g'),
(SIMPLE_BASE, './../g', 'http://a/b/g/', 'http://a/b/g'),
(SIMPLE_BASE, './g/.', 'http://a/b/c/g/', 'http://a/b/c/g/'),
(SIMPLE_BASE, 'g/./h', 'http://a/b/c/g/h/', 'http://a/b/c/g/h'),
(SIMPLE_BASE, 'g/../h', 'http://a/b/c/h/', 'http://a/b/c/h'),
(SIMPLE_BASE, 'http:g', 'http://a/b/c/g/', 'http://a/b/c/g'),
(SIMPLE_BASE, 'http:g?y', 'http://a/b/c/g?y', 'http://a/b/c/g?y'),
(SIMPLE_BASE, 'http:g?y/./x', 'http://a/b/c/g?y/./x', 'http://a/b/c/g?y/./x'),
(SIMPLE_BASE + '/', 'foo', SIMPLE_BASE + '/foo/', SIMPLE_BASE + '/foo'),
(QUERY_BASE, '?y', 'http://a/b/c/d;p?y', 'http://a/b/c/d;p?y'),
(QUERY_BASE, ';x', 'http://a/b/c/;x/', 'http://a/b/c/;x'),
(QUERY_BASE, 'g:h', 'g:h', 'g:h'),
(QUERY_BASE, 'g', 'http://a/b/c/g/', 'http://a/b/c/g'),
(QUERY_BASE, './g', 'http://a/b/c/g/', 'http://a/b/c/g'),
(QUERY_BASE, 'g/', 'http://a/b/c/g/', 'http://a/b/c/g/'),
(QUERY_BASE, '/g', 'http://a/g/', 'http://a/g'),
(QUERY_BASE, '//g', 'http://g/', 'http://g/'),
(QUERY_BASE, '?y', 'http://a/b/c/d;p?y', 'http://a/b/c/d;p?y'),
(QUERY_BASE, 'g?y', 'http://a/b/c/g?y', 'http://a/b/c/g?y'),
(QUERY_BASE, '#s', 'http://a/b/c/d;p?q#s', 'http://a/b/c/d;p?q#s'),
(QUERY_BASE, 'g#s', 'http://a/b/c/g#s', 'http://a/b/c/g#s'),
(QUERY_BASE, 'g?y#s', 'http://a/b/c/g?y#s', 'http://a/b/c/g?y#s'),
(QUERY_BASE, ';x', 'http://a/b/c/;x/', 'http://a/b/c/;x'),
(QUERY_BASE, 'g;x', 'http://a/b/c/g;x/', 'http://a/b/c/g;x'),
(QUERY_BASE, 'g;x?y#s', 'http://a/b/c/g;x?y#s', 'http://a/b/c/g;x?y#s'),
(QUERY_BASE, '', 'http://a/b/c/d;p?q', 'http://a/b/c/d;p?q'),
(QUERY_BASE, '.', 'http://a/b/c/', 'http://a/b/c/'),
(QUERY_BASE, './', 'http://a/b/c/', 'http://a/b/c/'),
(QUERY_BASE, '..', 'http://a/b/', 'http://a/b/'),
(QUERY_BASE, '../', 'http://a/b/', 'http://a/b/'),
(QUERY_BASE, '../g', 'http://a/b/g/', 'http://a/b/g'),
(QUERY_BASE, '../..', 'http://a/', 'http://a/'),
(QUERY_BASE, '../../', 'http://a/', 'http://a/'),
(QUERY_BASE, '../../g', 'http://a/g/', 'http://a/g'),
(QUERY_BASE, '../../../g', 'http://a/g/', 'http://a/g'),
# Abnormal Examples
(QUERY_BASE, '../../../g', 'http://a/g/', 'http://a/g'),
(QUERY_BASE, '../../../../g', 'http://a/g/', 'http://a/g'),
(QUERY_BASE, '/./g', 'http://a/g/', 'http://a/g'),
(QUERY_BASE, '/../g', 'http://a/g/', 'http://a/g'),
(QUERY_BASE, 'g.', 'http://a/b/c/g./', 'http://a/b/c/g.'),
(QUERY_BASE, '.g', 'http://a/b/c/.g/', 'http://a/b/c/.g'),
(QUERY_BASE, 'g..', 'http://a/b/c/g../', 'http://a/b/c/g..'),
(QUERY_BASE, '..g', 'http://a/b/c/..g/', 'http://a/b/c/..g'),
(QUERY_BASE, './../g', 'http://a/b/g/', 'http://a/b/g'),
(QUERY_BASE, './g/.', 'http://a/b/c/g/', 'http://a/b/c/g/'),
(QUERY_BASE, 'g/./h', 'http://a/b/c/g/h/', 'http://a/b/c/g/h'),
(QUERY_BASE, 'g/../h', 'http://a/b/c/h/', 'http://a/b/c/h'),
(QUERY_BASE, 'g;x=1/./y', 'http://a/b/c/g;x=1/y/', 'http://a/b/c/g;x=1/y'),
(QUERY_BASE, 'g;x=1/../y', 'http://a/b/c/y/', 'http://a/b/c/y'),
(QUERY_BASE, 'g?y/./x', 'http://a/b/c/g?y/./x', 'http://a/b/c/g?y/./x'),
(QUERY_BASE, 'g?y/../x', 'http://a/b/c/g?y/../x', 'http://a/b/c/g?y/../x'),
(QUERY_BASE, 'g#s/./x', 'http://a/b/c/g#s/./x', 'http://a/b/c/g#s/./x'),
(QUERY_BASE, 'g#s/../x', 'http://a/b/c/g#s/../x', 'http://a/b/c/g#s/../x'),
(QUERY_BASE, 'http:g', 'http://a/b/c/g/', 'http://a/b/c/g'),
# Test with empty (but defined) components.
(QUERY_FRAGMENT_BASE, '', 'http://a/b/c/d;p?q', 'http://a/b/c/d;p?q'),
(QUERY_FRAGMENT_BASE, '#', 'http://a/b/c/d;p?q#', 'http://a/b/c/d;p?q#'),
(QUERY_FRAGMENT_BASE, '#z', 'http://a/b/c/d;p?q#z', 'http://a/b/c/d;p?q#z'),
(QUERY_FRAGMENT_BASE, '?', 'http://a/b/c/d;p?', 'http://a/b/c/d;p?'),
(QUERY_FRAGMENT_BASE, '?#z', 'http://a/b/c/d;p?#z', 'http://a/b/c/d;p?#z'),
(QUERY_FRAGMENT_BASE, '?y', 'http://a/b/c/d;p?y', 'http://a/b/c/d;p?y'),
(QUERY_FRAGMENT_BASE, ';', 'http://a/b/c/;/', 'http://a/b/c/;'),
(QUERY_FRAGMENT_BASE, ';?y', 'http://a/b/c/;?y', 'http://a/b/c/;?y'),
(QUERY_FRAGMENT_BASE, ';#z', 'http://a/b/c/;#z', 'http://a/b/c/;#z'),
(QUERY_FRAGMENT_BASE, ';x', 'http://a/b/c/;x/', 'http://a/b/c/;x'),
(QUERY_FRAGMENT_BASE, '/w', 'http://a/w/', 'http://a/w'),
(QUERY_FRAGMENT_BASE, '//;x', 'http://;x/', 'http://;x/'),
(QUERY_FRAGMENT_BASE, '//v', 'http://v/', 'http://v/'),
# For backward compatibility with RFC1630, the scheme name is allowed
# to be present in a relative reference if it is the same as the base
# URI scheme.
(QUERY_FRAGMENT_BASE, 'http:', 'http://a/b/c/d;p?q', 'http://a/b/c/d;p?q'),
(QUERY_FRAGMENT_BASE, 'http:#', 'http://a/b/c/d;p?q#', 'http://a/b/c/d;p?q#'),
(QUERY_FRAGMENT_BASE, 'http:#z', 'http://a/b/c/d;p?q#z', 'http://a/b/c/d;p?q#z'),
(QUERY_FRAGMENT_BASE, 'http:?', 'http://a/b/c/d;p?', 'http://a/b/c/d;p?'),
(QUERY_FRAGMENT_BASE, 'http:?#z', 'http://a/b/c/d;p?#z', 'http://a/b/c/d;p?#z'),
(QUERY_FRAGMENT_BASE, 'http:?y', 'http://a/b/c/d;p?y', 'http://a/b/c/d;p?y'),
(QUERY_FRAGMENT_BASE, 'http:;', 'http://a/b/c/;/', 'http://a/b/c/;'),
(QUERY_FRAGMENT_BASE, 'http:;?y', 'http://a/b/c/;?y', 'http://a/b/c/;?y'),
(QUERY_FRAGMENT_BASE, 'http:;#z', 'http://a/b/c/;#z', 'http://a/b/c/;#z'),
(QUERY_FRAGMENT_BASE, 'http:;x', 'http://a/b/c/;x/', 'http://a/b/c/;x'),
(QUERY_FRAGMENT_BASE, 'http:/w', 'http://a/w/', 'http://a/w'),
(QUERY_FRAGMENT_BASE, 'http://;x', 'http://;x/', 'http://;x/'),
(QUERY_FRAGMENT_BASE, 'http:///w', 'http://w/', 'http://w/'),
(QUERY_FRAGMENT_BASE, 'http://v', 'http://v/', 'http://v/'),
# Different scheme is not ignored.
(QUERY_FRAGMENT_BASE, 'https:;', 'https://;/', 'https://;/'),
(QUERY_FRAGMENT_BASE, 'https:;x', 'https://;x/', 'https://;x/'),
],
)
def test_url_join(base_url, join_path, expected_with_slash, expected_without_slash) -> None:
"""Tests are based on
https://github.com/python/cpython/blob/3a0e7f57628466aedcaaf6c5ff7c8224f5155a2c/Lib/test/test_urlparse.py
and the URL specification from https://url.spec.whatwg.org/
"""
url = Url(base_url)
assert str(url.join(join_path, trailing_slash=True)) == expected_with_slash
assert str(url.join(join_path, trailing_slash=False)) == expected_without_slash


def test_url_join_operators() -> None:
url = Url('http://a/b/c/d')
assert str(url / 'e' / 'f') == 'http://a/b/c/e/f/'
assert str(url / 'e' // 'f') == 'http://a/b/c/e/f'
assert str(url // 'e' // 'f') == 'http://a/b/c/f'
assert str(url / 'e' / '?x=1') == 'http://a/b/c/e/?x=1'
assert str(url / 'e' / '?x=1' / '#y') == 'http://a/b/c/e/?x=1#y'
assert str(url / 'e' / '?x=1' // '#y') == 'http://a/b/c/e/?x=1#y'
assert str(url / 'e' // '?x=1' / '#y') == 'http://a/b/c/e/?x=1#y'
assert str(url // 'e' / '?x=1' / '#y') == 'http://a/b/c/e?x=1#y'