-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] Add sample function for Dataframe #1770
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1770 +/- ##
==========================================
+ Coverage 84.69% 84.84% +0.15%
==========================================
Files 55 55
Lines 5554 5584 +30
==========================================
+ Hits 4704 4738 +34
+ Misses 850 846 -4
|
daft/execution/physical_plan.py
Outdated
@@ -777,7 +777,7 @@ def sort( | |||
partial_metadatas=None, | |||
) | |||
.add_instruction( | |||
instruction=execution_step.Sample(sort_by=sort_by), | |||
instruction=execution_step.Sample(fraction=0.05, sort_by=sort_by), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose 0.05 as the fraction for sampling during sort (used to be a default value of 20), lmk if this is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the sampling code for sort, it would be better to be able to set a number like 20 rather than take a fraction. For example, if our partition is only 10 elements, we rather take all the elements than just sample 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Enabled sample by fraction and sample by size for table and micropartition, so sort will use sample by size.
pub struct Sample { | ||
// Upstream node. | ||
pub input: Arc<LogicalPlan>, | ||
pub fraction: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because of the trait bounds #[derive(Clone, Debug, PartialEq, Eq, Hash)]
i couldn't store fraction as a f64, so i stored it as a string instead, then later on in physical plan I parse it as a f64. Is there a better way to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to impl Eq for Sample
and set the behavior for comparing the f64.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it, did the impl for Eq and Hash
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very clean! Nice work! Just a few comments :)
daft/execution/physical_plan.py
Outdated
@@ -777,7 +777,7 @@ def sort( | |||
partial_metadatas=None, | |||
) | |||
.add_instruction( | |||
instruction=execution_step.Sample(sort_by=sort_by), | |||
instruction=execution_step.Sample(size=20, sort_by=sort_by), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now have some centralized configs, might be a good place to put this constant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added the constant: sample_size_for_sort
in DaftExecutionConfig here https://github.com/Eventual-Inc/Daft/pull/1770/files#diff-9feb783dcfde34fbbab50546b081d87cfb98ec050b89cfbe6ff9fe64e1404029R37 . I didn't add this as a parameter in `set_execution_config' though, I assume that this isn't something that the user would need to change. but lmk if I'm wrong!
Returns: | ||
DataFrame: DataFrame with a fraction of rows. | ||
""" | ||
builder = self._builder.sample(fraction, with_replacement, seed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should ensure that 0.0 <= fraction <= 1.0
somewhere on the dataframe side as well. Right now I only see that check on the execution side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added this check in the public method for the Dataframe API as well
src/daft-plan/src/physical_plan.rs
Outdated
@@ -186,7 +188,8 @@ impl PhysicalPlan { | |||
// TODO(Clark): Estimate row/column pruning to get a better size approximation. | |||
Self::Filter(Filter { input, .. }) | |||
| Self::Limit(Limit { input, .. }) | |||
| Self::Project(Project { input, .. }) => input.approximate_size_bytes(), | |||
| Self::Project(Project { input, .. }) | |||
| Self::Sample(Sample { input, .. }) => input.approximate_size_bytes(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for sample you can probably do input.approximate_size_bytes() * fraction
|
||
# no arguments | ||
with pytest.raises(ValueError, match="Must specify either `fraction` or `size`"): | ||
daft_table.sample() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
check for when fraction < 0.0 and when > 1.0
tests/dataframe/test_sample.py
Outdated
def test_sample_over_sample(make_df, valid_data: list[dict[str, float]]) -> None: | ||
df = make_df(valid_data) | ||
|
||
df = df.sample(fraction=2.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should throw an error if the fraction is over 1.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Nice work :)
Closes #1759
Added new sample function based on https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.sample.html
Changes: