[FEAT] [Join Optimizations] Add sort-merge join. · Eventual-Inc/Daft@62f87e1

Triggered via pull request December 22, 2023 21:40

clarkzinzow

edited #1755

clark/sort-merge-join

Status Success

Total duration 21s

Artifacts –

release-drafter.yml

on: pull_request

Annotations

2 errors

update_release_draft

Validation Failed: {"resource":"Release","code":"invalid","field":"target_commitish"} { name: 'HttpError', id: '7304059046', status: 422, response: { url: 'https://api.github.com/repos/Eventual-Inc/Daft/releases/134864929', status: 422, headers: { 'access-control-allow-origin': '*', 'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset', connection: 'close', 'content-length': '195', 'content-security-policy': "default-src 'none'", 'content-type': 'application/json; charset=utf-8', date: 'Fri, 22 Dec 2023 21:40:50 GMT', 'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', server: 'GitHub.com', 'strict-transport-security': 'max-age=31536000; includeSubdomains; preload', vary: 'Accept-Encoding, Accept, X-Requested-With', 'x-accepted-github-permissions': 'contents=write', 'x-content-type-options': 'nosniff', 'x-frame-options': 'deny', 'x-github-api-version-selected': '2022-11-28', 'x-github-media-type': 'github.v3; format=json', 'x-github-request-id': '781F:8B36:925D712:12E9DA21:65860262', 'x-ratelimit-limit': '1000', 'x-ratelimit-remaining': '986', 'x-ratelimit-reset': '1703284782', 'x-ratelimit-resource': 'core', 'x-ratelimit-used': '14', 'x-xss-protection': '0' }, data: { message: 'Validation Failed', errors: [ { resource: 'Release', code: 'invalid', field: 'target_commitish' } ], documentation_url: 'https://docs.github.com/rest/releases/releases#update-a-release' } }, request: { method: 'PATCH', url: 'https://api.github.com/repos/Eventual-Inc/Daft/releases/134864929', headers: { accept: 'application/vnd.github.v3+json', 'user-agent': 'probot/12.2.5 octokit-core.js/3.5.1 Node.js/16.20.2 (linux; x64)', authorization: 'token [REDACTED]', 'content-type': 'application/json; charset=utf-8' }, body: '{"body":"## Changes\\n\\n* No changes\\n","draft":true,"prerelease":false,"make_latest":"true","name":"v0.2.9","tag_name":"v0.2.9","target_commitish":"refs/pull/1755/merge"}', request: {} }, event: { id: '7304059046', name: 'pull_request', payload: { action: 'edited', changes: { body: { from: 'This PR adds a sort-merge join implementation as a new join strategy, where each side of the join is sorted on the join keys, and then the sorted tables are merged. The sort-merge join strategy is chosen automatically by the query planner if it is expected to be faster than the hash join and the broadcast join.\r\n' + '\r\n' + 'Similar to Spark\'s ability to specify a join algorithm hint, this PR also exposes a new optional `strategy` arg for `df.join()`, which allows users (and our Python-level tests) to manually specify a join algorithm; currently `"hash"`, `"sort_merge"`, and `"broadcast"` are supported, with the default `None` resulting in the query planner choosing a join algorithm automatically.\r\n' + '\r\n' + '```python\r\n' + 'df = left.join(right, on="foo", strategy="sort_merge")\r\n' + '```\r\n' + '\r\n' + '## Query Planning\r\n' + '\r\n' + 'The query planner chooses the sort-merge join as its join strategy if the larger side of the join is range-partitioned, or if the smaller side of the join is range-partitioned and the larger side is not partitioned. In the future, we will want to do a sort-merge join:\r\n' + '1. If a downstream operation requires the table to be sorted on the join key.\r\n' + '2. If neither sides of the join are partitioned AND we determine that the sor

update_release_draft

HttpError: Validation Failed: {"resource":"Release","code":"invalid","field":"target_commitish"} at /home/runner/work/_actions/release-drafter/release-drafter/v5/dist/index.js:8462:21 at processTicksAndRejections (node:internal/process/task_queues:96:5) at async Job.doExecute (/home/runner/work/_actions/release-drafter/release-drafter/v5/dist/index.js:30793:18) { name: 'AggregateError', event: { id: '7304059046', name: 'pull_request', payload: { action: 'edited', changes: { body: { from: 'This PR adds a sort-merge join implementation as a new join strategy, where each side of the join is sorted on the join keys, and then the sorted tables are merged. The sort-merge join strategy is chosen automatically by the query planner if it is expected to be faster than the hash join and the broadcast join.\r\n' + '\r\n' + 'Similar to Spark\'s ability to specify a join algorithm hint, this PR also exposes a new optional `strategy` arg for `df.join()`, which allows users (and our Python-level tests) to manually specify a join algorithm; currently `"hash"`, `"sort_merge"`, and `"broadcast"` are supported, with the default `None` resulting in the query planner choosing a join algorithm automatically.\r\n' + '\r\n' + '```python\r\n' + 'df = left.join(right, on="foo", strategy="sort_merge")\r\n' + '```\r\n' + '\r\n' + '## Query Planning\r\n' + '\r\n' + 'The query planner chooses the sort-merge join as its join strategy if the larger side of the join is range-partitioned, or if the smaller side of the join is range-partitioned and the larger side is not partitioned. In the future, we will want to do a sort-merge join:\r\n' + '1. If a downstream operation requires the table to be sorted on the join key.\r\n' + '2. If neither sides of the join are partitioned AND we determine that the sort-merge join is faster on unpartitioned data the the hash join (pending benchmarking).\r\n' + '\r\n' + '## Query Scheduling\r\n' + '\r\n' + 'All partitions for both sides of the join are materialized, upon which we calculate sort boundaries on samples from both sides of the join. These combined sort boundaries are used to sort each side of the join. Once each side is sorted with the same sort boundaries and the same number of partitions, we perform a merge join, which merges pairs of partitions (which should contain the same range of values since they were partitioned using the same boundaries).\r\n' + '\r\n' + '## TODOs\r\n' + '\r\n' + '- [x] Test coverage\r\n' + '- [ ] Benchmarking to validate + tweak the heuristics used by the query planner to choose whether to use the sort-merge join.\r\n' + "- [ ] Misc. clean up (e.g. cleaning up the comparator's null handling)." } }, number: 1755, organization: { avatar_url: 'https://avatars.githubusercontent.com/u/98941975?v=4', description: 'Eventual Computing', events_url: 'https://api.github.com/orgs/Eventual-Inc/events', hooks_url: 'https://api.github.com/orgs/Eventual-Inc/hooks', id: 98941975, issues_url: 'https://api.github.com/orgs/Eventual-Inc/issues', login: 'Eventual-Inc', members_url: 'https://api.github.com/orgs/Eventual-Inc/members{/member}', node_id: 'O_kgDOBeW8Fw', public_members_url: 'https://api.github.com/orgs/Eventual-Inc/public_members{/member}', repos_url: 'https://api.github.com/orgs/Eventual-Inc/repos', url: 'https://api.github.com/orgs/Eventual-Inc' }, pull_request: { _links: { comments: { href: 'https://api.github.com/repos/Eventual-Inc/Daft/issues/1755/comments' }, commits: { href: 'https://api.github.com/repos/Eventual-Inc/Daft/pulls/1755/commits' },

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEAT] [Join Optimizations] Add sort-merge join. #3497

Summary

[FEAT] [Join Optimizations] Add sort-merge join. #3497

Jobs

Run details

release-drafter.yml

Annotations