[FEAT] [Join Optimizations] Add broadcast join. #3282

Triggered via pull request December 7, 2023 21:04
Status Success
Total duration 24s

release-drafter.yml

on: pull_request
update_release_draft
7s

Annotations

2 errors
update_release_draft
Validation Failed: {"resource":"Release","code":"invalid","field":"target_commitish"} { name: 'HttpError', id: '7133978798', status: 422, response: { url: 'https://api.github.com/repos/Eventual-Inc/Daft/releases/132770221', status: 422, headers: { 'access-control-allow-origin': '*', 'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset', connection: 'close', 'content-length': '195', 'content-security-policy': "default-src 'none'", 'content-type': 'application/json; charset=utf-8', date: 'Thu, 07 Dec 2023 21:04:23 GMT', 'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', server: 'GitHub.com', 'strict-transport-security': 'max-age=31536000; includeSubdomains; preload', vary: 'Accept-Encoding, Accept, X-Requested-With', 'x-accepted-github-permissions': 'contents=write', 'x-content-type-options': 'nosniff', 'x-frame-options': 'deny', 'x-github-api-version-selected': '2022-11-28', 'x-github-media-type': 'github.v3; format=json', 'x-github-request-id': 'CBC7:1441:126947:2661AB:65723357', 'x-ratelimit-limit': '1000', 'x-ratelimit-remaining': '954', 'x-ratelimit-reset': '1701985799', 'x-ratelimit-resource': 'core', 'x-ratelimit-used': '46', 'x-xss-protection': '0' }, data: { message: 'Validation Failed', errors: [ { resource: 'Release', code: 'invalid', field: 'target_commitish' } ], documentation_url: 'https://docs.github.com/rest/releases/releases#update-a-release' } }, request: { method: 'PATCH', url: 'https://api.github.com/repos/Eventual-Inc/Daft/releases/132770221', headers: { accept: 'application/vnd.github.v3+json', 'user-agent': 'probot/12.2.5 octokit-core.js/3.5.1 Node.js/16.20.2 (linux; x64)', authorization: 'token [REDACTED]', 'content-type': 'application/json; 
charset=utf-8' }, body: '{"body":"## Changes\\n\\n## ✨ New Features\\n\\n- [FEAT] [JSON Reader] Add native streaming + parallel JSON reader. @clarkzinzow (#1679)\\n\\n## 🚀 Performance Improvements\\n\\n- [PERF] Enable Predicates in Parquet Reader @samster25 (#1702)\\n\\n## 📖 Documentation\\n\\n- [DOCS] Add notebooks used for pydata global 2023 presentation @jaychia (#1703)\\n","draft":true,"prerelease":false,"make_latest":"true","name":"v0.2.7","tag_name":"v0.2.7","target_commitish":"refs/pull/1706/merge"}', request: {} }, event: { id: '7133978798', name: 'pull_request', payload: { action: 'edited', changes: { body: { from: 'This PR adds a broadcast join implementation as a new join strategy, where all partitions of a small table are broadcasted to each partition in the larger table, such that we do a local (hash) join of the entire small table with each individual partition of the larger table.\r\n' + '\r\n' + '## Query Planning\r\n' + '\r\n' + 'The query planner chooses the broadcast join as its join strategy if one of the sides of the join is smaller than a preconfigured broadcasting threshold (set to 10 MiB by default, but is user-configurable).\r\n' + '\r\n' + "If the smaller side of the join is the right side, we invert the join for planning and scheduling simplicity so we can always broadcast the left side; we then swap back to the correct join ordering when performing the local joins. This means that we always form the probe table on the left side of the join; a future optimization (applicable to both the broadcast join and the hash join) would be to have local joins build the probe table on the smaller side while preserving the expected column ordering. We woul
update_release_draft
HttpError: Validation Failed: {"resource":"Release","code":"invalid","field":"target_commitish"} at /home/runner/work/_actions/release-drafter/release-drafter/v5/dist/index.js:8462:21 at processTicksAndRejections (node:internal/process/task_queues:96:5) at async Job.doExecute (/home/runner/work/_actions/release-drafter/release-drafter/v5/dist/index.js:30793:18) { name: 'AggregateError', event: { id: '7133978798', name: 'pull_request', payload: { action: 'edited', changes: { body: { from: 'This PR adds a broadcast join implementation as a new join strategy, where all partitions of a small table are broadcasted to each partition in the larger table, such that we do a local (hash) join of the entire small table with each individual partition of the larger table.\r\n' + '\r\n' + '## Query Planning\r\n' + '\r\n' + 'The query planner chooses the broadcast join as its join strategy if one of the sides of the join is smaller than a preconfigured broadcasting threshold (set to 10 MiB by default, but is user-configurable).\r\n' + '\r\n' + "If the smaller side of the join is the right side, we invert the join for planning and scheduling simplicity so we can always broadcast the left side; we then swap back to the correct join ordering when performing the local joins. This means that we always form the probe table on the left side of the join; a future optimization (applicable to both the broadcast join and the hash join) would be to have local joins build the probe table on the smaller side while preserving the expected column ordering. We would still need to always build the probe table on the left side of the join if we need to preserve the row-ordering of the right side of the join, e.g. if the right side of the join is range-partitioned AND we're doing a broadcast join.\r\n" + '\r\n' + '## Query Scheduling\r\n' + '\r\n' + 'All partitions for the broadcasting side of the join are first materialized. 
Then, as each partition on the receiving side of the join materialize, we dispatch a hash join task joining all broadcaster partitions with that single receiving-side partition.\r\n' + '\r\n' + '## TODOs\r\n' + '\r\n' + '- [x] Test coverage.\r\n' + '- [ ] (Follow-up?) TPC-H benchmarking demonstrating speedup due to use of broadcast join.\r\n' + '- [ ] (Follow-up) In local joins, build the probe table on the smaller side of the join.\r\n' + '- [ ] (Follow-up) Add table size approximations for operators that affect cardinality.' } }, number: 1706, organization: { avatar_url: 'https://avatars.githubusercontent.com/u/98941975?v=4', description: 'Eventual Computing', events_url: 'https://api.github.com/orgs/Eventual-Inc/events', hooks_url: 'https://api.github.com/orgs/Eventual-Inc/hooks', id: 98941975, issues_url: 'https://api.github.com/orgs/Eventual-Inc/issues', login: 'Eventual-Inc', members_url: 'https://api.github.com/orgs/Eventual-Inc/members{/member}', node_id: 'O_kgDOBeW8Fw', public_members_url: 'https://api.github.com/orgs/Eventual-Inc/public_members{/member}', repos_url: 'https://api.github.com/orgs/Eventual-Inc/repos', url: 'https://api.github.com/orgs/Eventual-Inc' }, pull_request: { _links: { comments: { href: 'https://api.github.com/repos/Eventual-Inc/Daft/issues/1706/comments' }, commits: { href: 'https://api.github.com/repos/Eventual-Inc/Daft/pulls/1706/commits' }, html: { href: 'https://github.com/Eventual-Inc/Daft/pull/1706' }, issue: { href: 'https://api.github.com/repos/Eventual-Inc/Daft/issues/1706' }, review_comment: { href: 'https://api.github.com/repos/Eventual-Inc/Daft/pulls/comme