Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Use cached preview from df.collect() in df.show(). #1651

Merged
merged 3 commits into from
Nov 21, 2023

Conversation

clarkzinzow
Copy link
Contributor

This PR ensures that a preview cached via a df.collect() is used in df.show(), rather than recomputing output tables from scratch.

@github-actions github-actions bot added the enhancement New feature or request label Nov 21, 2023
dataframe_num_rows=None,
)
return DataFrameDisplay(preview, self.schema(), num_rows=n)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved df.show() to be defined next to df.collect() since them being far away from each other was a bit annoying. 😛

Copy link

codecov bot commented Nov 21, 2023

Codecov Report

Merging #1651 (e2fb236) into main (ea19d28) will increase coverage by 0.12%.
Report is 3 commits behind head on main.
The diff coverage is 96.29%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1651      +/-   ##
==========================================
+ Coverage   84.88%   85.01%   +0.12%     
==========================================
  Files          55       55              
  Lines        5287     5306      +19     
==========================================
+ Hits         4488     4511      +23     
+ Misses        799      795       -4     
Files Coverage Δ
daft/dataframe/dataframe.py 87.20% <96.29%> (+0.13%) ⬆️

... and 2 files with indirect coverage changes

preview_partition = self._preview.preview_partition
total_rows = self._preview.dataframe_num_rows
if preview_partition is None or (
n > len(preview_partition) and (total_rows is None or len(preview_partition) < total_rows)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is (total_rows is None or len(preview_partition) < total_rows) necessary?

IIRC total_rows is just used to display the message Showing <N> of <total_rows> rows, but .show() doesn't show the total rows anyways so shouldn't this be sufficient?

if preview_partition is None or or n > len(preview_partition):
    # materialize preview partition
elif n < len(preview_partition):
    # slice preview partition
else:
    assert n == len(preview_partition)
    # return existing preview partition

Copy link
Contributor Author

@clarkzinzow clarkzinzow Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an obvious optimization that handles the case when the length of the full DataFrame is less than n, in which case the DataFrame preview will have length less than n but we'll still want to use the cached preview, since it already contains the entire DataFrame.

E.g. if we have a DataFrame with 5 total rows that we've already materialized with df.collect(), and we then call df.show(), we'll have a preview of the full DataFrame with 5 rows; this is technically less than the default show() limit (8), but we obviously don't want to recompute the DataFrame in this case, since we already have the full DataFrame as a preview!

Copy link
Contributor

@jaychia jaychia Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see! Small request on my end to invert the if/elses, which might help make the conditions easier to reason about

# Cached path: try to see if we can return from the cached preview_partition
if preview_partition is not None:
    if len(preview_partition) == n:
        # Preview partition exists and is exactly the size we need so we return it
        return preview_partition
    elif len(preview_partition) > n :
        # Preview partition exists, but is too large so we slice it
        return preview_partition.slice(n)
    elif (
        total_rows is not None and
        n > total_rows and
        len(preview_partition) == total_rows
    ):
        # Preview partition exists and the requested n is larger than total_rows
        # If the preview partition is already all the rows , we can just return it.
        return preview_partition

# No choice but to materialize preview partition:
return materialize_preview()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up truncating n to total_rows, which made all of the logic a good bit more straightforward while keeping us at a single level of conditionals!

preview_partition = self._preview.preview_partition
total_rows = self._preview.dataframe_num_rows
if preview_partition is None or (
n > len(preview_partition) and (total_rows is None or len(preview_partition) < total_rows)
Copy link
Contributor

@jaychia jaychia Nov 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see! Small request on my end to invert the if/elses, which might help make the conditions easier to reason about

# Cached path: try to see if we can return from the cached preview_partition
if preview_partition is not None:
    if len(preview_partition) == n:
        # Preview partition exists and is exactly the size we need so we return it
        return preview_partition
    elif len(preview_partition) > n :
        # Preview partition exists, but is too large so we slice it
        return preview_partition.slice(n)
    elif (
        total_rows is not None and
        n > total_rows and
        len(preview_partition) == total_rows
    ):
        # Preview partition exists and the requested n is larger than total_rows
        # If the preview partition is already all the rows , we can just return it.
        return preview_partition

# No choice but to materialize preview partition:
return materialize_preview()

@clarkzinzow clarkzinzow merged commit c25f644 into main Nov 21, 2023
39 checks passed
@clarkzinzow clarkzinzow deleted the clark/show-cached branch November 21, 2023 22:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants