[FEAT] Add sample function for Dataframe #1770

colin-ho · 2024-01-09T17:31:06Z

Added new sample function based on https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.sample.html

Changes:

Modified existing sampling logic in table and micropartition to instead accept fraction, with_replacement, and seed.
Added logical + physical ops for sample
Added end to end tests

codecov · 2024-01-09T17:40:28Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (8f33256) 84.69% compared to head (c48958e) 84.84%.
Report is 1 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1770      +/-   ##
==========================================
+ Coverage   84.69%   84.84%   +0.15%     
==========================================
  Files          55       55              
  Lines        5554     5584      +30     
==========================================
+ Hits         4704     4738      +34     
+ Misses        850      846       -4

Files	Coverage Δ
daft/dataframe/dataframe.py	`87.42% <100.00%> (+0.14%)`	⬆️
daft/execution/execution_step.py	`93.19% <100.00%> (+0.08%)`	⬆️
daft/execution/physical_plan.py	`93.43% <100.00%> (+0.03%)`	⬆️
daft/execution/rust_physical_plan_shim.py	`93.25% <100.00%> (+1.30%)`	⬆️
daft/logical/builder.py	`89.56% <100.00%> (+0.27%)`	⬆️
daft/table/micropartition.py	`90.04% <100.00%> (+0.30%)`	⬆️
daft/table/table.py	`84.53% <100.00%> (+0.31%)`	⬆️

... and 1 file with indirect coverage changes

colin-ho · 2024-01-09T17:46:04Z

daft/execution/physical_plan.py

@@ -777,7 +777,7 @@ def sort(
                partial_metadatas=None,
            )
            .add_instruction(
-                instruction=execution_step.Sample(sort_by=sort_by),
+                instruction=execution_step.Sample(fraction=0.05, sort_by=sort_by),


I chose 0.05 as the fraction for sampling during sort (used to be a default value of 20), lmk if this is ok.

For the sampling code for sort, it would be better to be able to set a number like 20 rather than take a fraction. For example, if our partition is only 10 elements, we rather take all the elements than just sample 1.

Enabled sample by fraction and sample by size for table and micropartition, so sort will use sample by size.

colin-ho · 2024-01-09T17:47:30Z

src/daft-plan/src/logical_ops/sample.rs

+pub struct Sample {
+    // Upstream node.
+    pub input: Arc<LogicalPlan>,
+    pub fraction: String,


because of the trait bounds #[derive(Clone, Debug, PartialEq, Eq, Hash)] i couldn't store fraction as a f64, so i stored it as a string instead, then later on in physical plan I parse it as a f64. Is there a better way to do this?

You should be able to impl Eq for Sample and set the behavior for comparing the f64.

got it, did the impl for Eq and Hash

samster25

Very clean! Nice work! Just a few comments :)

samster25 · 2024-01-10T00:08:42Z

daft/execution/physical_plan.py

@@ -777,7 +777,7 @@ def sort(
                partial_metadatas=None,
            )
            .add_instruction(
-                instruction=execution_step.Sample(sort_by=sort_by),
+                instruction=execution_step.Sample(size=20, sort_by=sort_by),


We now have some centralized configs, might be a good place to put this constant.

added the constant: sample_size_for_sort in DaftExecutionConfig here https://github.com/Eventual-Inc/Daft/pull/1770/files#diff-9feb783dcfde34fbbab50546b081d87cfb98ec050b89cfbe6ff9fe64e1404029R37 . I didn't add this as a parameter in `set_execution_config' though, I assume that this isn't something that the user would need to change. but lmk if I'm wrong!

samster25 · 2024-01-10T00:15:07Z

daft/dataframe/dataframe.py

+        Returns:
+            DataFrame: DataFrame with a fraction of rows.
+        """
+        builder = self._builder.sample(fraction, with_replacement, seed)


We should ensure that 0.0 <= fraction <= 1.0 somewhere on the dataframe side as well. Right now I only see that check on the execution side.

added this check in the public method for the Dataframe API as well

samster25 · 2024-01-10T00:16:35Z

src/daft-plan/src/physical_plan.rs

@@ -186,7 +188,8 @@ impl PhysicalPlan {
            // TODO(Clark): Estimate row/column pruning to get a better size approximation.
            Self::Filter(Filter { input, .. })
            | Self::Limit(Limit { input, .. })
-            | Self::Project(Project { input, .. }) => input.approximate_size_bytes(),
+            | Self::Project(Project { input, .. })
+            | Self::Sample(Sample { input, .. }) => input.approximate_size_bytes(),


for sample you can probably do input.approximate_size_bytes() * fraction

samster25 · 2024-01-10T00:18:04Z

tests/table/test_sorting.py

+
+    # no arguments
+    with pytest.raises(ValueError, match="Must specify either `fraction` or `size`"):
+        daft_table.sample()



check for when fraction < 0.0 and when > 1.0

samster25 · 2024-01-10T00:18:47Z

tests/dataframe/test_sample.py

+def test_sample_over_sample(make_df, valid_data: list[dict[str, float]]) -> None:
+    df = make_df(valid_data)
+
+    df = df.sample(fraction=2.0)


we should throw an error if the fraction is over 1.0

samster25

Looks great! Nice work :)

github-actions bot added the enhancement New feature or request label Jan 9, 2024

colin-ho commented Jan 9, 2024

View reviewed changes

colin-ho requested review from samster25, jaychia and clarkzinzow January 9, 2024 18:32

colin-ho added 4 commits January 9, 2024 14:51

add sample function

c7e7245

error handling for fraction parsing

3a296d0

use thread_rng

67ce5fc

add eq and hash impl, add sample by fraction + size

228422c

colin-ho force-pushed the colin/sample branch from ccee7ad to 228422c Compare January 9, 2024 22:52

add tests for arguments

c945441

samster25 reviewed Jan 10, 2024

View reviewed changes

test above 1.0 and add sample size to config

c48958e

colin-ho requested a review from samster25 January 10, 2024 22:50

samster25 approved these changes Jan 11, 2024

View reviewed changes

colin-ho merged commit c4bd1b3 into main Jan 11, 2024
42 checks passed

colin-ho deleted the colin/sample branch January 11, 2024 00:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEAT] Add sample function for Dataframe #1770

[FEAT] Add sample function for Dataframe #1770

colin-ho commented Jan 9, 2024 •

edited

Loading

codecov bot commented Jan 9, 2024 •

edited

Loading

colin-ho Jan 9, 2024

samster25 Jan 9, 2024

colin-ho Jan 9, 2024

colin-ho Jan 9, 2024

samster25 Jan 9, 2024

colin-ho Jan 9, 2024

samster25 left a comment

samster25 Jan 10, 2024

colin-ho Jan 10, 2024

samster25 Jan 10, 2024

colin-ho Jan 10, 2024

samster25 Jan 10, 2024

samster25 Jan 10, 2024

samster25 Jan 10, 2024

samster25 left a comment

[FEAT] Add sample function for Dataframe #1770

[FEAT] Add sample function for Dataframe #1770

Conversation

colin-ho commented Jan 9, 2024 • edited Loading

codecov bot commented Jan 9, 2024 • edited Loading

Codecov Report

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

samster25 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

samster25 left a comment

Choose a reason for hiding this comment

colin-ho commented Jan 9, 2024 •

edited

Loading

codecov bot commented Jan 9, 2024 •

edited

Loading