-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CHORE] Improve error messages when calling aggregation methods on dataframe without input columns #1587
[CHORE] Improve error messages when calling aggregation methods on dataframe without input columns #1587
Conversation
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #1587 +/- ##
==========================================
- Coverage 84.90% 82.89% -2.01%
==========================================
Files 54 54
Lines 5162 5222 +60
==========================================
- Hits 4383 4329 -54
- Misses 779 893 +114
|
logger.warning( | ||
"No columns specified; performing count on all columns. Specify columns using df.count('col1', 'col2', ...) or use df.count_rows() for row counts." | ||
) | ||
cols = tuple(self.columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @samster25 ,
Could you take a look at this and lmk if this aligns with what you expected? Once it's all good with you I'll modify the rest of the agg methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me! Though I wonder if we should corner-case the .count()
aggregation method to just forward the call to .count_rows()
instead 😛 . I believe that in PostgreSQL for example, COUNT(*)
is actually a row count of the entire result set vs COUNT(c)
which does a null-aware count of a column.
These semantics sound good for all the other aggregation-type methods though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, though .count_rows()
returns an int whereas .count()
should return a dataframe, so instead of simply forwarding the call we could do return DataFrame(self._builder.count())
? which would result in a dataframe that looks like:
+--------+
| count |
| UInt64 |
+--------+
| 3 |
+--------+
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes good point. The current semantics of this PR (broadcasting the count operation on all columns) makes perfect sense then 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sweet, thanks! just made changes for the rest of the aggregation methods in latest commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, some of the integration tests are failing with Error: Credentials could not be loaded, please check your action inputs: Could not load credentials from any providers
, example. Do you know what's the cause 😅 ?
the release drafter label test too: [Error: Resource not accessible by integration](https://github.com/Eventual-Inc/Daft/actions/runs/6829508921/job/18575800562?pr=1587#step:2:25)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes that's security related because these CI tests are actually running on your fork, not on the main repo 😛
We need to figure out a (secure) way to run our integration tests on incoming PRs, or maybe a better policy here around accepting external contributions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @colin-ho,
Thanks for making this PR and taking on the issue. One issue I see is that we should use the python warning module for propagating these user warnings rather than the python logger which will be noisy.
warnings.warn("...")
Also don't worry about the integration tests
for now, currently we rely on an identity provider that requires you to be in the eventual github org to run those correctly.
Got it, replaced the logger with |
Approved! Looks like it's failing style lint checks though: You can automatically apply lints locally by installing |
@jaychia good to merge? |
Merged, congrats on first PR @colin-ho :) |
Fixes #1583.
When a user does not specify columns in df aggregation methods, e.g.
df.count()
: