[FEAT] Use cached preview from `df.collect()` in `df.show()`. #1651

clarkzinzow · 2023-11-21T18:11:12Z

This PR ensures that a preview cached by df.collect() is reused by df.show(), rather than recomputing the output tables from scratch.

clarkzinzow · 2023-11-21T18:15:33Z

daft/dataframe/dataframe.py

-            dataframe_num_rows=None,
-        )
-        return DataFrameDisplay(preview, self.schema(), num_rows=n)
-


Moved df.show() so that it is defined next to df.collect(), since having them far apart was a bit annoying. 😛

codecov · 2023-11-21T18:23:38Z

Codecov Report

Merging #1651 (e2fb236) into main (ea19d28) will increase coverage by 0.12%.
Report is 3 commits behind head on main.
The diff coverage is 96.29%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1651      +/-   ##
==========================================
+ Coverage   84.88%   85.01%   +0.12%     
==========================================
  Files          55       55              
  Lines        5287     5306      +19     
==========================================
+ Hits         4488     4511      +23     
+ Misses        799      795       -4

Files	Coverage Δ
daft/dataframe/dataframe.py	`87.20% <96.29%> (+0.13%)`	⬆️

... and 2 files with indirect coverage changes

jaychia · 2023-11-21T19:16:21Z

daft/dataframe/dataframe.py

+        preview_partition = self._preview.preview_partition
+        total_rows = self._preview.dataframe_num_rows
+        if preview_partition is None or (
+            n > len(preview_partition) and (total_rows is None or len(preview_partition) < total_rows)


Why is (total_rows is None or len(preview_partition) < total_rows) necessary?

IIRC total_rows is just used to display the message Showing <N> of <total_rows> rows, but .show() doesn't show the total rows anyways so shouldn't this be sufficient?

    if preview_partition is None or n > len(preview_partition):
        # materialize preview partition
        ...
    elif n < len(preview_partition):
        # slice preview partition
        ...
    else:
        assert n == len(preview_partition)
        # return existing preview partition

This is an obvious optimization that handles the case when the length of the full DataFrame is less than n, in which case the DataFrame preview will have length less than n but we'll still want to use the cached preview, since it already contains the entire DataFrame.

E.g. if we have a DataFrame with 5 total rows that we've already materialized with df.collect(), and we then call df.show(), we'll have a preview of the full DataFrame with 5 rows; this is technically less than the default show() limit (8), but we obviously don't want to recompute the DataFrame in this case, since we already have the full DataFrame as a preview!
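The condition under discussion can be checked in isolation. The following is a minimal sketch, assuming the cached preview is represented only by its length (`preview_len`, or `None` when nothing is cached); the function name `needs_recompute` is hypothetical and is not part of Daft.

```python
from typing import Optional


def needs_recompute(preview_len: Optional[int], total_rows: Optional[int], n: int) -> bool:
    """Mirror of the condition in the diff, with lengths standing in for partitions."""
    if preview_len is None:
        # Nothing has been cached yet.
        return True
    # The cached preview is too short only if it might not already hold every row.
    return n > preview_len and (total_rows is None or preview_len < total_rows)


# A 5-row DataFrame fully materialized by collect(), shown with a limit of 8.
print(needs_recompute(preview_len=5, total_rows=5, n=8))       # False
# Same preview length, but the DataFrame is known to be larger.
print(needs_recompute(preview_len=5, total_rows=100, n=8))     # True
# Total row count unknown, so the preview cannot be trusted to be complete.
print(needs_recompute(preview_len=5, total_rows=None, n=8))    # True
```

Without the `total_rows` clause, the first case would return `True` and trigger a recomputation even though the preview already holds the whole DataFrame.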

I see! Small request on my end to invert the if/elses, which might help make the conditions easier to reason about

    # Cached path: try to see if we can return from the cached preview_partition
    if preview_partition is not None:
        if len(preview_partition) == n:
            # Preview partition exists and is exactly the size we need so we return it
            return preview_partition
        elif len(preview_partition) > n:
            # Preview partition exists, but is too large so we slice it
            return preview_partition.slice(n)
        elif (
            total_rows is not None
            and n > total_rows
            and len(preview_partition) == total_rows
        ):
            # Preview partition exists and the requested n is larger than total_rows
            # If the preview partition is already all the rows, we can just return it.
            return preview_partition

    # No choice but to materialize preview partition:
    return materialize_preview()

I ended up truncating n to total_rows, which made all of the logic a good bit more straightforward while keeping us at a single level of conditionals!
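The merged code is not shown in this thread, so the following is only a sketch of what truncating n to total_rows could look like, not the actual Daft implementation. It assumes the preview is a plain Python list; `show_rows` and the `materialize` callback are hypothetical names introduced for illustration.

```python
from typing import Callable, List, Optional


def show_rows(
    preview: Optional[List[int]],
    total_rows: Optional[int],
    n: int,
    materialize: Callable[[int], List[int]],
) -> List[int]:
    # Clamp n to the known DataFrame length, so a complete preview always satisfies it.
    if total_rows is not None:
        n = min(n, total_rows)
    if preview is None or len(preview) < n:
        # No usable cache: compute the first n rows (hypothetical callback).
        return materialize(n)
    if len(preview) > n:
        # Cache holds more rows than requested: slice it.
        return preview[:n]
    # Cache is exactly the right size.
    return preview


calls: List[int] = []


def materialize(k: int) -> List[int]:
    calls.append(k)  # record each recomputation
    return list(range(k))


full = [0, 1, 2, 3, 4]
print(show_rows(full, 5, 8, materialize))  # [0, 1, 2, 3, 4]
print(show_rows(full, 5, 3, materialize))  # [0, 1, 2]
print(calls)                               # []
```

After clamping, the three cases (too short, too long, exact) are enough, which matches the single level of conditionals described above.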

Used cached preview from df.collect() in df.show().

4285fa3

clarkzinzow requested review from samster25 and jaychia November 21, 2023 18:11

github-actions bot added the enhancement New feature or request label Nov 21, 2023

Add unused cached preview test.

4fbd341

clarkzinzow commented Nov 21, 2023

jaychia reviewed Nov 21, 2023

jaychia approved these changes Nov 21, 2023

Logic clean up, and make num_rows more accurate when the DataFrame length is less than the limit.

e2fb236

clarkzinzow merged commit c25f644 into main Nov 21, 2023
39 checks passed

clarkzinzow deleted the clark/show-cached branch November 21, 2023 22:31

jaychia mentioned this pull request Nov 22, 2023

[FEAT] Print the results of a df.show() to stdout if running in non-interactive mode #1655

Merged
