Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Add smart planning of ScanTasks starting with merging by filesizes #1692

Merged
merged 13 commits into from
Dec 5, 2023

Conversation

jaychia
Copy link
Contributor

@jaychia jaychia commented Dec 4, 2023

Refactors/changes required on ScanTask itself:

  1. Added a ScanTask::merge
  2. Added a ScanTask::partition_spec()
  3. Added some validation in ScanTask::new to assert that all the underlying sources have the same partition spec

I then added a new module daft_scan::scan_task_iterators which contains functions that perform transformations on a Box<dyn Iterator<item = DaftResult<ScanTaskRef>>>.

TODO:

@github-actions github-actions bot added the enhancement New feature or request label Dec 4, 2023
Copy link

codecov bot commented Dec 4, 2023

Codecov Report

Merging #1692 (3834fa2) into main (2fbf885) will increase coverage by 0.16%.
Report is 6 commits behind head on main.
The diff coverage is n/a.

❗ Current head 3834fa2 differs from pull request most recent head 5285b12. Consider uploading reports for the commit 5285b12 to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1692      +/-   ##
==========================================
+ Coverage   84.89%   85.05%   +0.16%     
==========================================
  Files          55       55              
  Lines        5349     5354       +5     
==========================================
+ Hits         4541     4554      +13     
+ Misses        808      800       -8     

see 7 files with indirect coverage changes

@jaychia jaychia force-pushed the jay/scan-task-iter branch from ffb59a4 to 8b67cf3 Compare December 4, 2023 18:02
src/daft-scan/src/lib.rs Outdated Show resolved Hide resolved
src/daft-scan/src/lib.rs Outdated Show resolved Hide resolved
src/daft-scan/src/lib.rs Outdated Show resolved Hide resolved
src/daft-scan/src/scan_task_iters.rs Outdated Show resolved Hide resolved
Jay Chia and others added 2 commits December 5, 2023 12:48
…s globally (#1700)

1. Adds a `daft.context.set_config(**kwargs)` method for users to set
configuration parameters in Daft
2. Adds a new `common/daft-config` crate to house the `DaftConfig`
struct
3. Thread `PyDaftConfig {config: Arc<DaftConfig>}` through the pyo3
layer for any functions that require access to configuration values: use
`daft.context.get_context().daft_config` to retrieve the current
`PyDaftConfig` from Python


`.set_config()` in action:
<img width="1114" alt="image"
src="https://github.com/Eventual-Inc/Daft/assets/17691182/95b52ab9-a878-4ec3-abf6-f8950fd8009f">

> NOTE: Because we leverage our current `daft.context` Python global
singleton machinery, note that the configs are probably not being
propagated correctly in a multiprocessing/Ray worker environment to
separate Python processes.
>
> For this to work, we will need to correctly initialize any Ray workers
with the current DaftContext, or abandon the DaftContext machinery for
something more suitable.

---------

Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
@jaychia jaychia enabled auto-merge (squash) December 5, 2023 20:50
@jaychia jaychia merged commit a53cd51 into main Dec 5, 2023
38 checks passed
@jaychia jaychia deleted the jay/scan-task-iter branch December 5, 2023 22:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants