-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CHORE] Populate previews only when show() or __repr__() is called #1889
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #1889 +/- ##
=======================================
Coverage 85.55% 85.55%
=======================================
Files 55 55
Lines 6194 6216 +22
=======================================
+ Hits 5299 5318 +19
- Misses 895 898 +3
|
@@ -24,7 +24,7 @@ def test_decimal_parquet_roundtrip() -> None: | |||
df.write_parquet(dirname) | |||
df_readback = daft.read_parquet(dirname).collect() | |||
|
|||
assert str(df.to_pydict()["decimal128"]) == str(df_readback.to_pydict()["decimal128"]) | |||
assert str(df.to_pydict()["decimal128"]) == str(df_readback.to_pydict()["decimal128"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was weird, if i didn't do the indent, it would fail with:
../../daft/api_annotations.py:31: in _wrap
return timed_method(*args, **kwargs)
../../daft/analytics.py:185: in tracked_method
return method(*args, **kwargs)
../../daft/dataframe/dataframe.py:1316: in to_pydict
return result.to_pydict()
../../daft/runners/partitioning.py:214: in to_pydict
merged_partition = self._get_merged_vpartition()
../../daft/runners/pyrunner.py:52: in _get_merged_vpartition
return MicroPartition.concat([part for id, part in ids_and_partitions])
@classmethod
def concat(cls, to_merge: list[MicroPartition]) -> MicroPartition:
micropartitions = []
for t in to_merge:
if not isinstance(t, MicroPartition):
raise TypeError(f"Expected a MicroPartition for concat, got {type(t)}")
micropartitions.append(t._micropartition)
> return MicroPartition._from_pymicropartition(_PyMicroPartition.concat(micropartitions))
E ValueError: DaftError::External Internal IO Error when Opening: /var/folders/nl/4tv7_psn7fq3n8sn6b4cpvw80000gn/T/tmputa6b7lc/ce958d3e-9fdc-48c3-9c8b-1917639037cd-0.parquet:
E Details:
E No such file or directory (os error 2)
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh i see, .collect() creates an unloaded micropartition, so it didn't do any actual reading of the parquet data
@@ -24,48 +24,3 @@ def test_show_some(make_df, valid_data, data_source): | |||
elif variant == "arrow": | |||
assert df_display.preview.dataframe_num_rows == len(valid_data) | |||
assert df_display.num_rows == 1 | |||
|
|||
|
|||
def test_show_from_cached_collect(make_df, valid_data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
concept of 'cached_collect' doesn't exist anymore with this PR so i removed these tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess "cached" would be if we called __repr__()
twice in a row now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This will give us a nice speedup for cases where we were being really dumb about calculating the preview partitions :)
@@ -24,48 +24,3 @@ def test_show_some(make_df, valid_data, data_source): | |||
elif variant == "arrow": | |||
assert df_display.preview.dataframe_num_rows == len(valid_data) | |||
assert df_display.num_rows == 1 | |||
|
|||
|
|||
def test_show_from_cached_collect(make_df, valid_data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess "cached" would be if we called __repr__()
twice in a row now?
c6c30a5
to
38fa51b
Compare
Closes #1858
Closes #1859
Changes:
.collect()
_populate_preview
method thats called duringrepr
orshow()