-
Notifications
You must be signed in to change notification settings - Fork 175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PERF] Micropartition, lazy loading and Column Stats #1470
Conversation
5ca95e1
to
87dd2bd
Compare
impl DaftCompare<&ColumnRangeStatistics> for ColumnRangeStatistics { | ||
type Output = crate::Result<ColumnRangeStatistics>; | ||
fn equal(&self, rhs: &ColumnRangeStatistics) -> Self::Output { | ||
// lower_bound: do they exactly overlap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this correct? Even in the case when ranges exactly overlap, I think the result is stilI a maybe:
Lhs = [0, 1, 2, 3] -> (min=0, max=3)
Rhs = [2, 1, 0, 3] -> (min=0, max=3)
Result = [0, 1, 0, 1] -> (min=0, max=1)
Shouldn't the correct logic be:
(False, False) // no overlap at all
(False, True) // some overlap
impl ColumnRangeStatistics { | ||
pub fn new(lower: Option<Series>, upper: Option<Series>) -> Result<Self> { | ||
match (lower, upper) { | ||
//TODO: also need to check dtype and length==1, and upper > lower. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just throw an assert in here? Would be pretty nasty if this assumption isn't met
let lower = | ||
Utf8Array::from(("lower", [lower.as_str()].as_slice())).into_series(); | ||
let upper = | ||
Utf8Array::from(("upper", [upper.as_str()].as_slice())).into_series(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh just realized too that if we ever do any arithmetic on the series, they might get really weird with the names if we don't slot them in the correct spots (e.g. lower + upper
will be called lower
)
Doesn't really matter though?
.as_ref() | ||
.map(|v| v.iter().map(|s| s.as_ref()).collect::<Vec<_>>()); | ||
let urls = params.urls.iter().map(|s| s.as_str()).collect::<Vec<_>>(); | ||
daft_parquet::read::read_parquet_bulk( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Big money move right here 😍
if predicate.is_empty() { | ||
return Ok(Self::new( | ||
self.schema.clone(), | ||
TableState::Loaded(vec![].into()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is empty predicate a full filter instead of a no-op?
E.g. if we perform some predicate pushdown and somehow end up with a filter([])
, shouldn't that be a no-op?
} | ||
} | ||
|
||
fn read_parquet_into_micropartition( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this only used in tests? Maybe we can move into the test
block below
@@ -0,0 +1,45 @@ | |||
use parquet2::{schema::types::TimeUnit, types::int96_to_i64_ns}; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should add references to where we got these functions
Gt => lhs.gt(&rhs), | ||
Plus => &lhs + &rhs, | ||
Minus => &lhs - &rhs, | ||
_ => todo!(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we return a Missing
here instead of todo? Ditto for the todo at the bottom.
No description provided.