[FEAT] Add streaming + parallel CSV reader, with decompression support. #2305

Triggered via pull request on October 18, 2023, 17:54
Status: Success
Total duration: 27s
Artifacts

release-drafter.yml

on: pull_request
update_release_draft
6s

Annotations

2 errors
update_release_draft
Validation Failed: {"resource":"Release","code":"invalid","field":"target_commitish"}
{
  name: 'HttpError',
  id: '6564704347',
  status: 422,
  response: {
    url: 'https://api.github.com/repos/Eventual-Inc/Daft/releases/124512520',
    status: 422,
    headers: { 'access-control-allow-origin': '*', 'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset', connection: 'close', 'content-length': '195', 'content-security-policy': "default-src 'none'", 'content-type': 'application/json; charset=utf-8', date: 'Wed, 18 Oct 2023 17:55:05 GMT', 'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', server: 'GitHub.com', 'strict-transport-security': 'max-age=31536000; includeSubdomains; preload', vary: 'Accept-Encoding, Accept, X-Requested-With', 'x-accepted-github-permissions': 'contents=write', 'x-content-type-options': 'nosniff', 'x-frame-options': 'deny', 'x-github-api-version-selected': '2022-11-28', 'x-github-media-type': 'github.v3; format=json', 'x-github-request-id': 'A148:4420:90717F:124E5DF:65301BF9', 'x-ratelimit-limit': '1000', 'x-ratelimit-remaining': '950', 'x-ratelimit-reset': '1697652395', 'x-ratelimit-resource': 'core', 'x-ratelimit-used': '50', 'x-xss-protection': '0' },
    data: {
      message: 'Validation Failed',
      errors: [ { resource: 'Release', code: 'invalid', field: 'target_commitish' } ],
      documentation_url: 'https://docs.github.com/rest/releases/releases#update-a-release'
    }
  },
  request: {
    method: 'PATCH',
    url: 'https://api.github.com/repos/Eventual-Inc/Daft/releases/124512520',
    headers: { accept: 'application/vnd.github.v3+json', 'user-agent': 'probot/12.2.5 octokit-core.js/3.5.1 Node.js/16.20.2 (linux; x64)', authorization: 'token [REDACTED]', 'content-type': 'application/json; charset=utf-8' },
    body: '{"body":"## Changes\\n\\n## ✨ New Features\\n\\n- [FEAT] IOStats for Native Reader @samster25 (#1493)\\n\\n## 🚀 Performance Improvements\\n\\n- [PERF] Micropartition, lazy loading and Column Stats @samster25 (#1470)\\n- [PERF] Use pyarrow table for pickling rather than ChunkedArray @samster25 (#1488)\\n- [PERF] Use region from system and leverage cached credentials when making new clients @samster25 (#1490)\\n- [PERF] Update default max\\\\_connections 64->8 because it is now per-io-thread @jaychia (#1485)\\n- [PERF] Pass-through multithreaded\\\\_io flag in read\\\\_parquet @jaychia (#1484)\\n\\n## 👾 Bug Fixes\\n\\n- [BUG] Fix handling of special characters in S3LikeSource @jaychia (#1495)\\n- [BUG] Fix local globbing of current directory @jaychia (#1494)\\n- [BUG] fix script to upload file 1 at a time @samster25 (#1492)\\n- [CHORE] Add tests and fixes for Azure globbing @jaychia (#1482)\\n\\n## 🧰 Maintenance\\n\\n- [CHORE] Better logging for physical plan @jaychia (#1499)\\n- [CHORE] Refactor logging @jaychia (#1489)\\n- [CHORE] Add Workflow to build artifacts and upload to S3 @samster25 (#1491)\\n- [CHORE] Update default num\\\\_tries on S3Config to 25 @jaychia (#1487)\\n- [CHORE] Add tests and fixes for Azure globbing @jaychia (#1482)\\n","draft":true,"prerelease":false,"make_latest":"true","name":"v0.1.21","tag_name":"v0.1.21","target_commitish":"refs/pull/1501/merge"}',
    request: {}
  },
  event: {
    id: '6564704347',
    name: 'pull_request',
    payload: {
      action: 'edited',
      changes: {
        body: {
          from: 'This PR adds streaming + parallel CSV reading and parsing, along with support for streaming decompression. In particular, this PR:\r\n' + '- Adds support for streaming decompression for brotli, bz, deflate, gzip,
update_release_draft
HttpError: Validation Failed: {"resource":"Release","code":"invalid","field":"target_commitish"}
    at /home/runner/work/_actions/release-drafter/release-drafter/v5/dist/index.js:8462:21
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async Job.doExecute (/home/runner/work/_actions/release-drafter/release-drafter/v5/dist/index.js:30793:18)
{
  name: 'AggregateError',
  event: {
    id: '6564704347',
    name: 'pull_request',
    payload: {
      action: 'edited',
      changes: {
        body: {
          from: 'This PR adds streaming + parallel CSV reading and parsing, along with support for streaming decompression. In particular, this PR:\r\n' +
            '- Adds support for streaming decompression for brotli, bz, deflate, gzip, lzma, xz, zlib, and zstd.\r\n' +
            '- Performs chunk-based streaming CSV reads, filling up a small buffer of unparsed records.\r\n' +
            '- Pipelines chunk-based CSV parsing with reading by spawning Tokio + rayon parsing tasks.\r\n' +
            '- Performances chunk parsing, as well as column parsing within a chunk, in parallel on the rayon threadpool.\r\n' +
            '- Changes schema inference to involve an (at most) 1 MiB file peak rather than a full file read.\r\n' +
            '- Gathers a mean row size in bytes estimate during schema inference and propagates this estimate back to the reader.\r\n' +
            '- Unifies local and cloud reads + schema inference.\r\n' +
            '- Adds thorough Rust-side local + cloud test coverage.\r\n' +
            '\r\n' +
            'The streaming + parallel reading leads to a 4-8x speed up over the pyarrow reader and the previous non-parallel reader when benchmarking large file (~1 GB) reads, while also resulting in lower memory utilization due to the streaming reading + parsing.\r\n' +
            '\r\n' +
            '## TODOs (follow-up PRs)\r\n' +
            '\r\n' +
            '- [ ] Add snappy decompression support (need to essentially do something like [this](https://github.com/belltoy/tokio-snappy/blob/master/src/lib.rs))'
        }
      },
      number: 1501,
      organization: { avatar_url: 'https://avatars.githubusercontent.com/u/98941975?v=4', description: 'Eventual Computing', events_url: 'https://api.github.com/orgs/Eventual-Inc/events', hooks_url: 'https://api.github.com/orgs/Eventual-Inc/hooks', id: 98941975, issues_url: 'https://api.github.com/orgs/Eventual-Inc/issues', login: 'Eventual-Inc', members_url: 'https://api.github.com/orgs/Eventual-Inc/members{/member}', node_id: 'O_kgDOBeW8Fw', public_members_url: 'https://api.github.com/orgs/Eventual-Inc/public_members{/member}', repos_url: 'https://api.github.com/orgs/Eventual-Inc/repos', url: 'https://api.github.com/orgs/Eventual-Inc' },
      pull_request: {
        _links: { comments: { href: 'https://api.github.com/repos/Eventual-Inc/Daft/issues/1501/comments' }, commits: { href: 'https://api.github.com/repos/Eventual-Inc/Daft/pulls/1501/commits' }, html: { href: 'https://github.com/Eventual-Inc/Daft/pull/1501' }, issue: { href: 'https://api.github.com/repos/Eventual-Inc/Daft/issues/1501' }, review_comment: { href: 'https://api.github.com/repos/Eventual-Inc/Daft/pulls/comments{/number}' }, review_comments: { href: 'https://api.github.com/repos/Eventual-Inc/Daft/pulls/1501/comments' }, self: { href: 'https://api.github.com/repos/Eventual-Inc/Daft/pulls/1501' }, statuses: { href: 'https://api.github.com/repos/Eventual-Inc/Daft/statuses/d0cd093357b690e3461bc457af21064c9a14ee6e' } },
        active_lock_reason: null,
        additions: 1616,
        assignee: null,
        assignees: [],
        author_association: 'CONTRIBUTOR',
        auto_merge: null,
        base:
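
Both error annotations report the same root cause: the drafter issued a PATCH to the update-a-release endpoint for release 124512520 with target_commitish set to "refs/pull/1501/merge" (the merge ref of the triggering pull request), and the API rejected it with 422 Validation Failed, since target_commitish is expected to name a branch or commit SHA (see the documentation_url in the response data above). The sketch below is a minimal, hypothetical standalone reproduction of that call using the same @octokit/core client that appears in the request's user-agent; the owner, repo, release id, and field values are copied from the log, while the GITHUB_TOKEN variable and function name are assumptions for illustration.

```ts
import { Octokit } from "@octokit/core";

// Hypothetical standalone reproduction of the PATCH release-drafter issued.
// The GITHUB_TOKEN env var is an assumption; the rest is taken from the log above.
const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

async function updateDraftRelease(): Promise<void> {
  await octokit.request("PATCH /repos/{owner}/{repo}/releases/{release_id}", {
    owner: "Eventual-Inc",
    repo: "Daft",
    release_id: 124512520,
    name: "v0.1.21",
    tag_name: "v0.1.21",
    draft: true,
    prerelease: false,
    // A pull request merge ref is neither a branch nor a commit SHA, so the API
    // rejects this field with 422 Validation Failed.
    target_commitish: "refs/pull/1501/merge",
  });
}

updateDraftRelease().catch((err) => {
  // octokit surfaces the failure as an HttpError with status 422.
  console.error(err.status, err.message);
});
```

A common way to avoid this class of failure is to run the drafter on push events to the default branch, or otherwise pin the release target to a real branch, so the commitish never resolves to a pull request merge ref.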