-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEAT] Use cached preview from df.collect()
in df.show()
.
#1651
Conversation
dataframe_num_rows=None, | ||
) | ||
return DataFrameDisplay(preview, self.schema(), num_rows=n) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved df.show()
to be defined next to df.collect()
since them being far away from each other was a bit annoying. 😛
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #1651 +/- ##
==========================================
+ Coverage 84.88% 85.01% +0.12%
==========================================
Files 55 55
Lines 5287 5306 +19
==========================================
+ Hits 4488 4511 +23
+ Misses 799 795 -4
|
daft/dataframe/dataframe.py
Outdated
preview_partition = self._preview.preview_partition | ||
total_rows = self._preview.dataframe_num_rows | ||
if preview_partition is None or ( | ||
n > len(preview_partition) and (total_rows is None or len(preview_partition) < total_rows) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is (total_rows is None or len(preview_partition) < total_rows)
necessary?
IIRC total_rows is just used to display the message Showing <N> of <total_rows> rows
, but .show()
doesn't show the total rows anyways so shouldn't this be sufficient?
if preview_partition is None or or n > len(preview_partition):
# materialize preview partition
elif n < len(preview_partition):
# slice preview partition
else:
assert n == len(preview_partition)
# return existing preview partition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an obvious optimization that handles the case when the length of the full DataFrame is less than n
, in which case the DataFrame preview will have length less than n
but we'll still want to use the cached preview, since it already contains the entire DataFrame
.
E.g. if we have a DataFrame with 5 total rows that we've already materialized with df.collect()
, and we then call df.show()
, we'll have a preview of the full DataFrame with 5 rows; this is technically less than the default show()
limit (8), but we obviously don't want to recompute the DataFrame in this case, since we already have the full DataFrame as a preview!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see! Small request on my end to invert the if/elses, which might help make the conditions easier to reason about
# Cached path: try to see if we can return from the cached preview_partition
if preview_partition is not None:
if len(preview_partition) == n:
# Preview partition exists and is exactly the size we need so we return it
return preview_partition
elif len(preview_partition) > n :
# Preview partition exists, but is too large so we slice it
return preview_partition.slice(n)
elif (
total_rows is not None and
n > total_rows and
len(preview_partition) == total_rows
):
# Preview partition exists and the requested n is larger than total_rows
# If the preview partition is already all the rows , we can just return it.
return preview_partition
# No choice but to materialize preview partition:
return materialize_preview()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up truncating n
to total_rows
, which made all of the logic a good bit more straightforward while keeping us at a single level of conditionals!
daft/dataframe/dataframe.py
Outdated
preview_partition = self._preview.preview_partition | ||
total_rows = self._preview.dataframe_num_rows | ||
if preview_partition is None or ( | ||
n > len(preview_partition) and (total_rows is None or len(preview_partition) < total_rows) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see! Small request on my end to invert the if/elses, which might help make the conditions easier to reason about
# Cached path: try to see if we can return from the cached preview_partition
if preview_partition is not None:
if len(preview_partition) == n:
# Preview partition exists and is exactly the size we need so we return it
return preview_partition
elif len(preview_partition) > n :
# Preview partition exists, but is too large so we slice it
return preview_partition.slice(n)
elif (
total_rows is not None and
n > total_rows and
len(preview_partition) == total_rows
):
# Preview partition exists and the requested n is larger than total_rows
# If the preview partition is already all the rows , we can just return it.
return preview_partition
# No choice but to materialize preview partition:
return materialize_preview()
…ngth is less than the limit.
This PR ensures that a preview cached via a
df.collect()
is used indf.show()
, rather than recomputing output tables from scratch.