-
Notifications
You must be signed in to change notification settings - Fork 173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(connect): printSchema
#3617
base: main
Are you sure you want to change the base?
Conversation
CodSpeed Performance ReportMerging #3617 will not alter performanceComparing Summary
|
58c8a1e
to
81f2540
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3617 +/- ##
==========================================
+ Coverage 77.84% 77.88% +0.04%
==========================================
Files 718 720 +2
Lines 88250 88600 +350
==========================================
+ Hits 68696 69008 +312
- Misses 19554 19592 +38
|
src/daft-connect/src/display.rs
Outdated
DataType::FixedShapeImage(_, _, _) => "fixed_shape_image".to_string(), | ||
DataType::Tensor(_) => "tensor".to_string(), | ||
DataType::FixedShapeTensor(_, _) => "fixed_shape_tensor".to_string(), | ||
DataType::SparseTensor(_) => "sparse_tensor".to_string(), | ||
DataType::FixedShapeSparseTensor(_, _) => "fixed_shape_sparse_tensor".to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think these exist in spark (along with unsized ints). We should check if spark connect has a standard around extension or user defined types. If they don't I'd at least want something in the display to indicate that these are not native spark types, but in fact daft datatypes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmmmm https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/UserDefinedType.html
can people even define UDTs outside of Java? I don't see a pyspark example
src/daft-connect/src/display.rs
Outdated
DataType::Binary => "binary".to_string(), | ||
DataType::FixedSizeBinary(_) => "fixed_size_binary".to_string(), | ||
DataType::Utf8 => "string".to_string(), | ||
DataType::FixedSizeList(_, _) => "array".to_string(), // Spark calls them arrays |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would represent this as a custom type similar to the other non native dtypes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there is no standard thoughts on something like daft[fixed_size_list]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since there seems to be no standard, I'd prefer to separate them into 2 categories.
arrow native datatypes:
for arrow native datatypes such as unsigned integers,fsl, etc, lets go with arrow.<datatype>
such as
- u64 ->
arrow.uint64
, - fsl(u8, 1) ->
arrow.fixed_size_list(1)\n --- element: arrow.u8
(this resembles how arrow does extension types)
custom daft datatypes:
for non arrow native ones such as SparseTensor, Image, and so on, let's prefix them with daft.
- image ->
daft.image(<image_mode>)
ex:daft.image(RGB)
- sparsetensor(u8) ->
daft.sparse_tensor\n --- element: arrow.u8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be done
DataType::FixedSizeBinary(_) => "arrow.fixed_size_binary".to_string(), | ||
DataType::Utf8 => "string".to_string(), | ||
DataType::FixedSizeList(_, _) => "arrow.fixed_size_list".to_string(), | ||
DataType::List(_) => "arrow.list".to_string(), | ||
DataType::Struct(_) => "struct".to_string(), | ||
DataType::Map { .. } => "map".to_string(), | ||
DataType::Extension(_, _, _) => "daft.extension".to_string(), | ||
DataType::Embedding(_, _) => "daft.embedding".to_string(), | ||
DataType::Image(_) => "daft.image".to_string(), | ||
DataType::FixedShapeImage(_, _, _) => "daft.fixed_shape_image".to_string(), | ||
DataType::Tensor(_) => "daft.tensor".to_string(), | ||
DataType::FixedShapeTensor(_, _) => "daft.fixed_shape_tensor".to_string(), | ||
DataType::SparseTensor(_) => "daft.sparse_tensor".to_string(), | ||
DataType::FixedShapeSparseTensor(_, _) => "daft.fixed_shape_sparse_tensor".to_string(), | ||
#[cfg(feature = "python")] | ||
DataType::Python => "daft.python".to_string(), | ||
DataType::Unknown => "unknown".to_string(), | ||
DataType::UInt8 => "arrow.ubyte".to_string(), | ||
DataType::UInt16 => "arrow.ushort".to_string(), | ||
DataType::UInt32 => "arrow.uint".to_string(), | ||
DataType::UInt64 => "arrow.ulong".to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry if I was unclear in my previous comment, but this is still not right.
arrow types should just be called what they are
DataType::UInt8 => "arrow.uint8".to_string(),
DataType::UInt16 => "arrow.uint16".to_string(),
DataType::UInt32 => "arrow.uint32".to_string(),
DataType::UInt64 => "arrow.uint64".to_string(),
and nested datatypes should match how spark does them
for example, lists have the inner rendered as "element"
data = [{"a": [1,2,3], "b": "hello"}]
spark.createDataFrame(data).printSchema()
root
|-- a: array (nullable = true)
| |-- element: long (containsNull = true)
|-- b: string (nullable = true)
and for structs Struct{ints: i64, strings: utf8}
root
|-- struct: struct (nullable = true)
| |-- ints: integer (nullable = true)
| |-- strings: string (nullable = true)
We'll also want to capture the parameters on them such as FixedSizeList(Int64, 1)
root
|-- a: arrow.fixed_size_list (size = 1, nullable = true)
| |-- element: long (containsNull = true)
or on Image(ImageMode::RGB)
root
|-- a: daft.image (mode = RGB, nullable = true)
TODO
TreeDisplay
?unwrap
sDaft/src/common/display/src/tree.rs
Line 3 in 56e872c
should we make our own?
Example own impl that would need to be tested (don't look at seriously!)