Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use LogicalType for TypeSignature Numeric and String, Coercible #13240

Merged
merged 14 commits into from
Nov 6, 2024
6 changes: 6 additions & 0 deletions datafusion/common/src/types/logical.rs
Original file line number Diff line number Diff line change
Expand Up @@ -98,6 +98,12 @@ impl fmt::Debug for dyn LogicalType {
}
}

impl std::fmt::Display for dyn LogicalType {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
write!(f, "{self:?}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can make a distinction between Debug and Display. Currently, we print something like

        let int: Box<dyn LogicalType> = Box::new(Int32);
        println!(format!("{}", int));

------- output ------
LogicalType(Native(Int32), Int32))

I imagine we can print Int32 toInt32, print JSON to JSON ... 🤔
However, I think it may not be the point of this PR. We can do it in another PR.

}
}

impl PartialEq for dyn LogicalType {
fn eq(&self, other: &Self) -> bool {
self.signature().eq(&other.signature())
Expand Down
37 changes: 34 additions & 3 deletions datafusion/common/src/types/native.rs
Original file line number Diff line number Diff line change
Expand Up @@ -348,6 +348,12 @@ impl LogicalType for NativeType {
// mapping solutions to provide backwards compatibility while transitioning from
// the purely physical system to a logical / physical system.

impl From<&DataType> for NativeType {
fn from(value: &DataType) -> Self {
value.clone().into()
}
}

impl From<DataType> for NativeType {
fn from(value: DataType) -> Self {
use NativeType::*;
Expand Down Expand Up @@ -392,8 +398,33 @@ impl From<DataType> for NativeType {
}
}

impl From<&DataType> for NativeType {
fn from(value: &DataType) -> Self {
value.clone().into()
impl NativeType {
#[inline]
pub fn is_numeric(&self) -> bool {
use NativeType::*;
matches!(
self,
UInt8
| UInt16
| UInt32
| UInt64
| Int8
| Int16
| Int32
| Int64
| Float16
| Float32
| Float64
| Decimal(_, _)
)
}

#[inline]
pub fn is_integer(&self) -> bool {
use NativeType::*;
matches!(
self,
UInt8 | UInt16 | UInt32 | UInt64 | Int8 | Int16 | Int32 | Int64
)
}
}
33 changes: 29 additions & 4 deletions datafusion/expr-common/src/signature.rs
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@
//! and return types of functions in DataFusion.

use arrow::datatypes::DataType;
use datafusion_common::types::LogicalTypeRef;

/// Constant that is used as a placeholder for any valid timezone.
/// This is used where a function can accept a timestamp type with any
Expand Down Expand Up @@ -109,7 +110,7 @@ pub enum TypeSignature {
/// For example, `Coercible(vec![DataType::Float64])` accepts
/// arguments like `vec![DataType::Int32]` or `vec![DataType::Float32]`
/// since i32 and f32 can be casted to f64
goldmedal marked this conversation as resolved.
Show resolved Hide resolved
Coercible(Vec<DataType>),
Coercible(Vec<LogicalTypeRef>),
/// Fixed number of arguments of arbitrary types
/// If a function takes 0 argument, its `TypeSignature` should be `Any(0)`
Any(usize),
Expand All @@ -123,8 +124,19 @@ pub enum TypeSignature {
/// Specifies Signatures for array functions
ArraySignature(ArrayFunctionSignature),
/// Fixed number of arguments of numeric types.
/// See <https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#method.is_numeric> to know which type is considered numeric
/// See [`NativeType::is_numeric`] to know which type is considered numeric
///
/// [`NativeType::is_numeric`]: datafusion_common
Numeric(usize),
/// Fixed number of arguments of numeric types.
/// See [`NativeType::is_numeric`] to know which type is considered numeric
/// This signature accepts numeric string
/// Example of functions In Postgres that support numeric string
/// 1. Mathematical Functions, like `abs`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i confirm this works in PostgreSQL (was testing with PostgreSQL v 17)

select abs('-123');

https://www.postgresql.org/docs/17/typeconv-oper.html#id-1.5.9.7.8 suggests why this works
'...' is not a varchar literal in PostgreSQL.
it's "unknown-type" literal, which gets interpreted as float8

Here the system has implicitly resolved the unknown-type literal as type float8 before applying the chosen operator.

indeed, select pg_typeof(abs('-123')); returns double precision

However, this doesn't work in PostgreSQL

select abs(CAST('-123' AS varchar));

This fails with "Query Error: function abs(character varying) does not exist"
Indicating that there is no special coercion rules from varchar to numbers when calling functions like abs().


BTW, coercion rules can be retrieved from PostgreSQL from https://www.postgresql.org/docs/17/catalog-pg-cast.html

PostgreSQL doesn't declare any implicit coercions between (selected) numeric types and (selected) varchar type, and abs() function resolution behavior as described above matches that.

with selected_types(name) AS (VALUES ('int4'), ('int8'), ('float8'), ('varchar'))
select 
	src.typname src_type,
    dst.typname dst_type,
    castcontext,
    case castcontext
    	when 'e' then 'only explicit'
        when 'a' then 'explicit | implicitly in assignment'
        when 'i' then 'implicitly in expressions, as well as the other cases'
        else '???' -- undocumented @ https://www.postgresql.org/docs/17/catalog-pg-cast.html
    end castcontext_explained
from pg_cast
join pg_type src on pg_cast.castsource = src.oid
join pg_type dst on pg_cast.casttarget = dst.oid
where true
and src.oid != dst.oid
and src.typname in (select name from selected_types)
and dst.typname in (select name from selected_types)
order by src.typname, dst.typname
;
src_type dst_type castcontext castcontext_explained
float8 int4 a explicit | implicitly in assignment
float8 int8 a explicit | implicitly in assignment
int4 float8 i implicitly in expressions, as well as the other cases
int4 int8 i implicitly in expressions, as well as the other cases
int8 float8 i implicitly in expressions, as well as the other cases
int8 int4 a explicit | implicitly in assignment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @findepi.
It makes sense to me. NumericAndNumericString allows abs('-1') but also allows abs('-1'::varchar), which isn't allowed by Postgres.

A similar behavior in DuckDB is with to_timestamp.

D select to_timestamp('-1');
┌──────────────────────────┐
│    to_timestamp('-1')    │
│ timestamp with time zone │
├──────────────────────────┤
│ 1970-01-01 07:59:59+08   │
└──────────────────────────┘
D select to_timestamp('-1'::varchar);
Binder Error: No function matches the given name and argument types 'to_timestamp(VARCHAR)'. You might need to add explicit type casts.
	Candidate functions:
	to_timestamp(DOUBLE) -> TIMESTAMP WITH TIME ZONE

LINE 1: select to_timestamp('-1'::varchar);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to follow PostgreSQL behavior, we would need to model '...' literals as unknown type literals. This would be parser / frontend work. It should have no impact on projects that do not use the DF parser (#12723).
Not sure it's worthwhile to do. For now, let's just focus on coercions and make them reasonably correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. We need to model unknown type literal and avoid numeric string to string coercion

/// 2. `to_timestamp`
///
/// [`NativeType::is_numeric`]: datafusion_common
NumericAndNumericString(usize),
/// Fixed number of arguments of all the same string types.
/// The precedence of type from high to low is Utf8View, LargeUtf8 and Utf8.
/// Null is considerd as `Utf8` by default
Expand Down Expand Up @@ -201,7 +213,13 @@ impl TypeSignature {
TypeSignature::Numeric(num) => {
vec![format!("Numeric({num})")]
}
TypeSignature::Exact(types) | TypeSignature::Coercible(types) => {
TypeSignature::NumericAndNumericString(num) => {
vec![format!("NumericAndNumericString({num})")]
}
TypeSignature::Coercible(types) => {
vec![Self::join_types(types, ", ")]
}
TypeSignature::Exact(types) => {
vec![Self::join_types(types, ", ")]
}
TypeSignature::Any(arg_count) => {
Expand Down Expand Up @@ -288,6 +306,13 @@ impl Signature {
}
}

pub fn numeric_and_numeric_string(arg_count: usize, volatility: Volatility) -> Self {
Self {
type_signature: TypeSignature::NumericAndNumericString(arg_count),
volatility,
}
}

/// A specified number of numeric arguments
pub fn string(arg_count: usize, volatility: Volatility) -> Self {
Self {
Expand Down Expand Up @@ -322,7 +347,7 @@ impl Signature {
}
}
/// Target coerce types in order
pub fn coercible(target_types: Vec<DataType>, volatility: Volatility) -> Self {
pub fn coercible(target_types: Vec<LogicalTypeRef>, volatility: Volatility) -> Self {
Self {
type_signature: TypeSignature::Coercible(target_types),
volatility,
Expand Down
Loading