Skip to content

[FEAT] [Join Optimizations] Add sort-merge join. #3497

[FEAT] [Join Optimizations] Add sort-merge join.

[FEAT] [Join Optimizations] Add sort-merge join. #3497

Triggered via pull request December 22, 2023 21:40
Status Success
Total duration 21s
Artifacts

release-drafter.yml

on: pull_request
update_release_draft
5s
update_release_draft
Fit to window
Zoom out
Zoom in

Annotations

2 errors
update_release_draft
Validation Failed: {"resource":"Release","code":"invalid","field":"target_commitish"} { name: 'HttpError', id: '7304059046', status: 422, response: { url: 'https://api.github.com/repos/Eventual-Inc/Daft/releases/134864929', status: 422, headers: { 'access-control-allow-origin': '*', 'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset', connection: 'close', 'content-length': '195', 'content-security-policy': "default-src 'none'", 'content-type': 'application/json; charset=utf-8', date: 'Fri, 22 Dec 2023 21:40:50 GMT', 'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', server: 'GitHub.com', 'strict-transport-security': 'max-age=31536000; includeSubdomains; preload', vary: 'Accept-Encoding, Accept, X-Requested-With', 'x-accepted-github-permissions': 'contents=write', 'x-content-type-options': 'nosniff', 'x-frame-options': 'deny', 'x-github-api-version-selected': '2022-11-28', 'x-github-media-type': 'github.v3; format=json', 'x-github-request-id': '781F:8B36:925D712:12E9DA21:65860262', 'x-ratelimit-limit': '1000', 'x-ratelimit-remaining': '986', 'x-ratelimit-reset': '1703284782', 'x-ratelimit-resource': 'core', 'x-ratelimit-used': '14', 'x-xss-protection': '0' }, data: { message: 'Validation Failed', errors: [ { resource: 'Release', code: 'invalid', field: 'target_commitish' } ], documentation_url: 'https://docs.github.com/rest/releases/releases#update-a-release' } }, request: { method: 'PATCH', url: 'https://api.github.com/repos/Eventual-Inc/Daft/releases/134864929', headers: { accept: 'application/vnd.github.v3+json', 'user-agent': 'probot/12.2.5 octokit-core.js/3.5.1 Node.js/16.20.2 (linux; x64)', authorization: 'token [REDACTED]', 'content-type': 'application/json; charset=utf-8' }, body: '{"body":"## Changes\\n\\n* No changes\\n","draft":true,"prerelease":false,"make_latest":"true","name":"v0.2.9","tag_name":"v0.2.9","target_commitish":"refs/pull/1755/merge"}', request: {} }, event: { id: '7304059046', name: 'pull_request', payload: { action: 'edited', changes: { body: { from: 'This PR adds a sort-merge join implementation as a new join strategy, where each side of the join is sorted on the join keys, and then the sorted tables are merged. The sort-merge join strategy is chosen automatically by the query planner if it is expected to be faster than the hash join and the broadcast join.\r\n' + '\r\n' + 'Similar to Spark\'s ability to specify a join algorithm hint, this PR also exposes a new optional `strategy` arg for `df.join()`, which allows users (and our Python-level tests) to manually specify a join algorithm; currently `"hash"`, `"sort_merge"`, and `"broadcast"` are supported, with the default `None` resulting in the query planner choosing a join algorithm automatically.\r\n' + '\r\n' + '```python\r\n' + 'df = left.join(right, on="foo", strategy="sort_merge")\r\n' + '```\r\n' + '\r\n' + '## Query Planning\r\n' + '\r\n' + 'The query planner chooses the sort-merge join as its join strategy if the larger side of the join is range-partitioned, or if the smaller side of the join is range-partitioned and the larger side is not partitioned. In the future, we will want to do a sort-merge join:\r\n' + '1. If a downstream operation requires the table to be sorted on the join key.\r\n' + '2. If neither sides of the join are partitioned AND we determine that the sor
update_release_draft
HttpError: Validation Failed: {"resource":"Release","code":"invalid","field":"target_commitish"} at /home/runner/work/_actions/release-drafter/release-drafter/v5/dist/index.js:8462:21 at processTicksAndRejections (node:internal/process/task_queues:96:5) at async Job.doExecute (/home/runner/work/_actions/release-drafter/release-drafter/v5/dist/index.js:30793:18) { name: 'AggregateError', event: { id: '7304059046', name: 'pull_request', payload: { action: 'edited', changes: { body: { from: 'This PR adds a sort-merge join implementation as a new join strategy, where each side of the join is sorted on the join keys, and then the sorted tables are merged. The sort-merge join strategy is chosen automatically by the query planner if it is expected to be faster than the hash join and the broadcast join.\r\n' + '\r\n' + 'Similar to Spark\'s ability to specify a join algorithm hint, this PR also exposes a new optional `strategy` arg for `df.join()`, which allows users (and our Python-level tests) to manually specify a join algorithm; currently `"hash"`, `"sort_merge"`, and `"broadcast"` are supported, with the default `None` resulting in the query planner choosing a join algorithm automatically.\r\n' + '\r\n' + '```python\r\n' + 'df = left.join(right, on="foo", strategy="sort_merge")\r\n' + '```\r\n' + '\r\n' + '## Query Planning\r\n' + '\r\n' + 'The query planner chooses the sort-merge join as its join strategy if the larger side of the join is range-partitioned, or if the smaller side of the join is range-partitioned and the larger side is not partitioned. In the future, we will want to do a sort-merge join:\r\n' + '1. If a downstream operation requires the table to be sorted on the join key.\r\n' + '2. If neither sides of the join are partitioned AND we determine that the sort-merge join is faster on unpartitioned data the the hash join (pending benchmarking).\r\n' + '\r\n' + '## Query Scheduling\r\n' + '\r\n' + 'All partitions for both sides of the join are materialized, upon which we calculate sort boundaries on samples from both sides of the join. These combined sort boundaries are used to sort each side of the join. Once each side is sorted with the same sort boundaries and the same number of partitions, we perform a merge join, which merges pairs of partitions (which should contain the same range of values since they were partitioned using the same boundaries).\r\n' + '\r\n' + '## TODOs\r\n' + '\r\n' + '- [x] Test coverage\r\n' + '- [ ] Benchmarking to validate + tweak the heuristics used by the query planner to choose whether to use the sort-merge join.\r\n' + "- [ ] Misc. clean up (e.g. cleaning up the comparator's null handling)." } }, number: 1755, organization: { avatar_url: 'https://avatars.githubusercontent.com/u/98941975?v=4', description: 'Eventual Computing', events_url: 'https://api.github.com/orgs/Eventual-Inc/events', hooks_url: 'https://api.github.com/orgs/Eventual-Inc/hooks', id: 98941975, issues_url: 'https://api.github.com/orgs/Eventual-Inc/issues', login: 'Eventual-Inc', members_url: 'https://api.github.com/orgs/Eventual-Inc/members{/member}', node_id: 'O_kgDOBeW8Fw', public_members_url: 'https://api.github.com/orgs/Eventual-Inc/public_members{/member}', repos_url: 'https://api.github.com/orgs/Eventual-Inc/repos', url: 'https://api.github.com/orgs/Eventual-Inc' }, pull_request: { _links: { comments: { href: 'https://api.github.com/repos/Eventual-Inc/Daft/issues/1755/comments' }, commits: { href: 'https://api.github.com/repos/Eventual-Inc/Daft/pulls/1755/commits' },