
[FEAT][1/2] Support Iceberg renaming of columns #4187

Triggered via pull request March 5, 2024 01:50
Status Success
Total duration 23s
Artifacts

release-drafter.yml

on: pull_request
update_release_draft
5s

Annotations

2 errors and 1 warning
update_release_draft
Validation Failed: {"resource":"Release","code":"invalid","field":"target_commitish"} { name: 'HttpError', id: '8149824390', status: 422, response: { url: 'https://api.github.com/repos/Eventual-Inc/Daft/releases/144734030', status: 422, headers: { 'access-control-allow-origin': '*', 'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset', connection: 'close', 'content-length': '195', 'content-security-policy': "default-src 'none'", 'content-type': 'application/json; charset=utf-8', date: 'Tue, 05 Mar 2024 01:50:27 GMT', 'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', server: 'GitHub.com', 'strict-transport-security': 'max-age=31536000; includeSubdomains; preload', vary: 'Accept-Encoding, Accept, X-Requested-With', 'x-accepted-github-permissions': 'contents=write', 'x-content-type-options': 'nosniff', 'x-frame-options': 'deny', 'x-github-api-version-selected': '2022-11-28', 'x-github-media-type': 'github.v3; format=json', 'x-github-request-id': 'F07F:72F8:34FC253:6951E11:65E67A63', 'x-ratelimit-limit': '5000', 'x-ratelimit-remaining': '4894', 'x-ratelimit-reset': '1709605361', 'x-ratelimit-resource': 'core', 'x-ratelimit-used': '106', 'x-xss-protection': '0' }, data: { message: 'Validation Failed', errors: [ { resource: 'Release', code: 'invalid', field: 'target_commitish' } ], documentation_url: 'https://docs.github.com/rest/releases/releases#update-a-release' } }, request: { method: 'PATCH', url: 'https://api.github.com/repos/Eventual-Inc/Daft/releases/144734030', headers: { accept: 'application/vnd.github.v3+json', 'user-agent': 'probot/12.2.5 octokit-core.js/3.5.1 Node.js/16.20.2 (linux; x64)', authorization: 'token [REDACTED]', 'content-type': 'application/json; charset=utf-8' }, body: '{"body":"## Changes\\n\\n## ✨ New Features\\n\\n- [FEAT][2/2] Support Iceberg renaming of \\\\*\\\\*nested\\\\*\\\\* columns @jaychia (#1956)\\n","draft":true,"prerelease":false,"make_latest":"true","name":"v0.2.18","tag_name":"v0.2.18","target_commitish":"refs/pull/1937/merge"}', request: {} }, event: { id: '8149824390', name: 'pull_request', payload: { action: 'edited', changes: { body: { from: '# Summary\r\n' + '\r\n' + 'Support field_id renaming of Parquet files along the codepath:\r\n' + '\r\n' + '1. `IcebergScanOperator`\r\n' + '2. Generates `ScanTasks`, each containing the `field_id_mapping: Arc<{i32: Field}>`\r\n' + '3. Propagated to workers through the `ScanWithTask` instruction object\r\n' + '4. Micropartitions are created with `MicroPartition::from_scan_task`\r\n' + '5. This then calls into `read_parquet_into_micropartition`\r\n' + ' a. If statistics are available, it will create an unloaded MicroPartition by creating a new ScanTask (hydrated with statistics) and then calling `MicroPartition::new_unloaded(new_scan_task)`. \r\n' + ' b. Otherwise, it falls back into `read_parquet_bulk`, which has been modified to correctly handle `field_id_mapping`\r\n' + '\r\n' + 'This PR ensures that when data/statistics are read from Parquet files, we correctly apply renaming according to `field_id_mapping`.\r\n' + '\r\n' + '## Notes\r\n' + '\r\n' + 'A lot of the errors caught/triggered by this PR has to do with mismatches between the fields (names/metadata) on our schemas and on our Series objects.\r\n' + '\r\n'
update_release_draft
HttpError: Validation Failed: {"resource":"Release","code":"invalid","field":"target_commitish"} at /home/runner/work/_actions/release-drafter/release-drafter/v5/dist/index.js:8462:21 at processTicksAndRejections (node:internal/process/task_queues:96:5) at async Job.doExecute (/home/runner/work/_actions/release-drafter/release-drafter/v5/dist/index.js:30793:18) { name: 'AggregateError', event: { id: '8149824390', name: 'pull_request', payload: { action: 'edited', changes: { body: { from: '# Summary\r\n' + '\r\n' + 'Support field_id renaming of Parquet files along the codepath:\r\n' + '\r\n' + '1. `IcebergScanOperator`\r\n' + '2. Generates `ScanTasks`, each containing the `field_id_mapping: Arc<{i32: Field}>`\r\n' + '3. Propagated to workers through the `ScanWithTask` instruction object\r\n' + '4. Micropartitions are created with `MicroPartition::from_scan_task`\r\n' + '5. This then calls into `read_parquet_into_micropartition`\r\n' + ' a. If statistics are available, it will create an unloaded MicroPartition by creating a new ScanTask (hydrated with statistics) and then calling `MicroPartition::new_unloaded(new_scan_task)`. \r\n' + ' b. Otherwise, it falls back into `read_parquet_bulk`, which has been modified to correctly handle `field_id_mapping`\r\n' + '\r\n' + 'This PR ensures that when data/statistics are read from Parquet files, we correctly apply renaming according to `field_id_mapping`.\r\n' + '\r\n' + '## Notes\r\n' + '\r\n' + 'A lot of the errors caught/triggered by this PR has to do with mismatches between the fields (names/metadata) on our schemas and on our Series objects.\r\n' + '\r\n' + 'Keeping those two in sync is fairly challenging with the way our code is currently structured.\r\n' + '\r\n' + 'The approach taken to try and fix this is:\r\n' + '\r\n' + '1. Try to use the same logic for field_id renaming across Series and Schemas\r\n' + '2. When reading data from `Parquet -> arrow2 -> Daft Series/Schema`, perform a post-processing step to remove any field metadata that was retrieved from the Parquet files.\r\n' + '\r\n' + 'However I do think that this is a fairly error-prone situation. Not sure what the best approach is though.\r\n' + '\r\n' + '## Drive-By\r\n' + '\r\n' + 'Refactors to clean-up MicroPartitions/ScanTasks and schemas:\r\n' + '\r\n' + "1. Refactored `MicroPartition::new_unloaded`: it no longer accepts a `schema` argument; instead internally it will just use the ScanTask's `.materialized_schema()`\r\n" + '4. Refactored `read_parquet_into_micropartition` to significantly reduce code deduplication\r\n' + '\r\n' + '## Remaining todos:\r\n' + '\r\n' + '- [x] Fix logic with column pruning (need to apply column pruning after applying the field ID mappings)\r\n' + '- [x] Perform correct renaming for statistics parsing from Parquet metadata\r\n' + '- [x] Perform recursive renaming for Series and for Schema' } }, number: 1937, organization: { avatar_url: 'https://avatars.githubusercontent.com/u/98941975?v=4', description: 'Eventual Computing', events_url: 'https://api.github.com/orgs/Eventual-Inc/events', hooks_url: 'https://api.github.com/orgs/Eventual-Inc/hooks', id: 98941975, issues_url: 'https://api.github.com/orgs/Eventual-Inc/issues', login: 'Eventual-Inc', members_url: 'https://api.github.com/orgs/Eventual-Inc/members{/member}', node_id: 'O_kgDOBeW8Fw', public_members_url: 'https://api.github.com/orgs/Eventual-Inc/public_members{/member}', repos_url: 'https://
update_release_draft
Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: release-drafter/release-drafter@v5. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.
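
Both error annotations report the release update being rejected on `target_commitish`, and the logged request body shows the value sent was `refs/pull/1937/merge`. GitHub's release API describes that field as a branch or commit SHA, so the pull request merge ref is a plausible cause. Below is a minimal sketch of a guard, in Python for illustration only. The helper name `safe_target_commitish` and the `main` fallback branch are hypothetical assumptions, not part of release-drafter or of this repository's workflow.

```python
import re

# Hypothetical helper (not release-drafter code): matches the synthetic refs
# that pull_request-triggered runs check out, e.g. "refs/pull/1937/merge".
_PULL_REF = re.compile(r"^refs/pull/\d+/(merge|head)$")


def safe_target_commitish(ref: str, default_branch: str = "main") -> str:
    """Return a value to send as a release's target_commitish.

    Pull request refs are swapped for default_branch (an assumed fallback);
    any other ref is returned unchanged.
    """
    if _PULL_REF.match(ref):
        return default_branch
    return ref


# The value from the failing request body falls back to the default branch.
print(safe_target_commitish("refs/pull/1937/merge"))
```

Whether falling back to the default branch is the right behaviour depends on the workflow; the sketch only shows how the rejected ref can be detected before the request is made.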