
feat(connect): printSchema #3617

Open · wants to merge 11 commits into base: main
Conversation

@andrewgazelka (Member) commented Dec 19, 2024

TODO

  • Should we reuse the existing `pub trait TreeDisplay`, or make our own?
  • Remove unwraps.

Example of our own impl that would need to be tested (don't look at it seriously!):

use std::fmt::Write;

pub fn to_tree_string(schema: &Schema) -> eyre::Result<String> {
    let mut output = String::new();
    // Start with the root line, matching Spark's printSchema output
    writeln!(&mut output, "root")?;
    // Now print each top-level field
    for (name, field) in &schema.fields {
        print_field(&mut output, name, &field.dtype, /*nullable*/ true, 1)?;
    }
    Ok(output)
}

// A helper function to print a field at a given level of indentation.
// level=1 means a single " |-- " prefix, level=2 means
// " |    |-- " and so on, mimicking Spark's indentation style.
fn print_field(
    w: &mut String, 
    field_name: &str, 
    dtype: &DataType, 
    nullable: bool, 
    level: usize
) -> eyre::Result<()> {
    // Construct the Spark-style indentation prefix:
    // level 1: " |-- "
    // level 2: " |    |-- "
    // level n: " |   " repeated (n-1) times, then " |-- "
    let indent = format!("{} |-- ", " |   ".repeat(level - 1));

    // Get a user-friendly string for dtype
    let dtype_str = type_to_string(dtype);

    writeln!(
        w,
        "{}{}: {} (nullable = {})",
        indent, field_name, dtype_str, nullable
    )?;

    // If the dtype is a struct, we must print its child fields with increased indentation.
    if let DataType::Struct(fields) = dtype {
        for field in fields {
            print_field(w, &field.name, &field.dtype, true, level + 1)?;
        }
    }

    Ok(())
}

fn type_to_string(dtype: &DataType) -> String {
    // We want a nice, human-readable type string.
    // Spark generally prints something like "integer", "string", etc.
    // We'll follow a similar style here:
    match dtype {
        DataType::Null => "null".to_string(),
        DataType::Boolean => "boolean".to_string(),
        DataType::Int8
        | DataType::Int16
        | DataType::Int32
        | DataType::Int64
        | DataType::UInt8
        | DataType::UInt16
        | DataType::UInt32
        | DataType::UInt64 => "integer".to_string(), // Spark doesn't differentiate sizes
        DataType::Float32 | DataType::Float64 => "double".to_string(), // Spark calls all floats double for printing
        DataType::Decimal128(_, _) => "decimal".to_string(),
        DataType::Timestamp(_, _) => "timestamp".to_string(),
        DataType::Date => "date".to_string(),
        DataType::Time(_) => "time".to_string(),
        DataType::Duration(_) => "duration".to_string(),
        DataType::Interval => "interval".to_string(),
        DataType::Binary => "binary".to_string(),
        DataType::FixedSizeBinary(_) => "fixed_size_binary".to_string(),
        DataType::Utf8 => "string".to_string(),
        DataType::FixedSizeList(_, _) => "array".to_string(), // Spark calls them arrays
        DataType::List(_) => "array".to_string(),
        DataType::Struct(_) => "struct".to_string(),
        DataType::Map { .. } => "map".to_string(),
        DataType::Extension(_, _, _) => "extension".to_string(),
        DataType::Embedding(_, _) => "embedding".to_string(),
        DataType::Image(_) => "image".to_string(),
        DataType::FixedShapeImage(_, _, _) => "fixed_shape_image".to_string(),
        DataType::Tensor(_) => "tensor".to_string(),
        DataType::FixedShapeTensor(_, _) => "fixed_shape_tensor".to_string(),
        DataType::SparseTensor(_) => "sparse_tensor".to_string(),
        DataType::FixedShapeSparseTensor(_, _) => "fixed_shape_sparse_tensor".to_string(),
        #[cfg(feature = "python")]
        DataType::Python => "python_object".to_string(),
        DataType::Unknown => "unknown".to_string(),
    }
}
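
To see the indentation logic in isolation, here is a minimal, self-contained sketch. The `Dtype` enum, `type_name`, and the field tuples below are simplified stand-ins for daft's real `Schema`/`DataType` (which carry many more variants), not daft's actual API:

```rust
use std::fmt::Write;

// Simplified stand-in for daft's `DataType`, for illustration only.
enum Dtype {
    Int64,
    Utf8,
    Struct(Vec<(String, Dtype)>),
}

fn type_name(dtype: &Dtype) -> &'static str {
    match dtype {
        Dtype::Int64 => "integer",
        Dtype::Utf8 => "string",
        Dtype::Struct(_) => "struct",
    }
}

// Spark-style prefix: " |-- " at level 1, " |    |-- " at level 2, and so on.
fn print_field(w: &mut String, name: &str, dtype: &Dtype, level: usize) {
    let indent = format!("{} |-- ", " |   ".repeat(level - 1));
    writeln!(w, "{indent}{name}: {} (nullable = true)", type_name(dtype)).unwrap();
    // Structs recurse with one extra indentation level.
    if let Dtype::Struct(fields) = dtype {
        for (child_name, child_dtype) in fields {
            print_field(w, child_name, child_dtype, level + 1);
        }
    }
}

fn to_tree_string(fields: &[(String, Dtype)]) -> String {
    let mut out = String::from("root\n");
    for (name, dtype) in fields {
        print_field(&mut out, name, dtype, 1);
    }
    out
}

fn main() {
    let fields = vec![
        ("b".to_string(), Dtype::Utf8),
        (
            "s".to_string(),
            Dtype::Struct(vec![("ints".to_string(), Dtype::Int64)]),
        ),
    ];
    let out = to_tree_string(&fields);
    print!("{out}");
    assert!(out.contains(" |-- b: string (nullable = true)"));
    assert!(out.contains(" |    |-- ints: integer (nullable = true)"));
}
```

Note that nested fields get a `" |   "` segment per ancestor, so the second level renders as `" |    |-- "`, matching Spark's style.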

@github-actions bot added the feat label Dec 19, 2024

andrewgazelka (Member, Author):

This stack of pull requests is managed by Graphite. Learn more about stacking.

codspeed-hq bot commented Dec 19, 2024

CodSpeed Performance Report

Merging #3617 will not alter performance

Comparing andrew/print-schema (75d1e43) with main (07f6b2c)

Summary

✅ 27 untouched benchmarks

@andrewgazelka marked this pull request as ready for review December 19, 2024 16:31

codecov bot commented Dec 19, 2024

Codecov Report

Attention: Patch coverage is 86.83386% with 42 lines in your changes missing coverage. Please review.

Project coverage is 77.88%. Comparing base (ae74c10) to head (75d1e43).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/daft-connect/src/display.rs 87.63% 35 Missing ⚠️
src/daft-connect/src/lib.rs 80.00% 7 Missing ⚠️

@@            Coverage Diff             @@
##             main    #3617      +/-   ##
==========================================
+ Coverage   77.84%   77.88%   +0.04%     
==========================================
  Files         718      720       +2     
  Lines       88250    88600     +350     
==========================================
+ Hits        68696    69008     +312     
- Misses      19554    19592      +38     
Files with missing lines Coverage Δ
src/daft-connect/src/translation/schema.rs 100.00% <100.00%> (ø)
src/daft-connect/src/lib.rs 65.01% <80.00%> (+2.13%) ⬆️
src/daft-connect/src/display.rs 87.63% <87.63%> (ø)

... and 8 files with indirect coverage changes

(Resolved review threads on tests/connect/test_print_schema.py and src/daft-connect/src/display.rs.)
Comment on lines 81 to 85
DataType::FixedShapeImage(_, _, _) => "fixed_shape_image".to_string(),
DataType::Tensor(_) => "tensor".to_string(),
DataType::FixedShapeTensor(_, _) => "fixed_shape_tensor".to_string(),
DataType::SparseTensor(_) => "sparse_tensor".to_string(),
DataType::FixedShapeSparseTensor(_, _) => "fixed_shape_sparse_tensor".to_string(),
Contributor:

I don't think these exist in Spark (along with unsigned ints). We should check whether Spark Connect has a standard around extension or user-defined types. If they don't, I'd at least want something in the display to indicate that these are not native Spark types, but in fact Daft datatypes.

Member Author:

hmmmmm https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/types/UserDefinedType.html

can people even define UDTs outside of Java? I don't see a pyspark example

(Two more resolved review threads on src/daft-connect/src/display.rs.)
DataType::Binary => "binary".to_string(),
DataType::FixedSizeBinary(_) => "fixed_size_binary".to_string(),
DataType::Utf8 => "string".to_string(),
DataType::FixedSizeList(_, _) => "array".to_string(), // Spark calls them arrays
Contributor:

I would represent this as a custom type, similar to the other non-native dtypes.

Member Author:

If there is no standard, thoughts on something like daft[fixed_size_list]?

Contributor:

Since there seems to be no standard, I'd prefer to separate them into two categories.

Arrow-native datatypes:

For Arrow-native datatypes such as unsigned integers, FSL, etc., let's go with arrow.<datatype>, such as:

  • u64 -> arrow.uint64,
  • fsl(u8, 1) -> arrow.fixed_size_list(1)\n --- element: arrow.u8

(This resembles how Arrow does extension types.)

Custom Daft datatypes:

For non-Arrow-native ones such as SparseTensor, Image, and so on, let's prefix them with daft.

  • image -> daft.image(<image_mode>) ex: daft.image(RGB)
  • sparsetensor(u8) -> daft.sparse_tensor\n --- element: arrow.u8
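
The two-category scheme above could be sketched roughly like this. `SimpleType` and `display_name` are hypothetical stand-ins for illustration, not daft's real `DataType` or the code in this PR:

```rust
// Hypothetical stand-in enum; daft's real `DataType` has many more variants.
enum SimpleType {
    Utf8,
    UInt64,
    FixedSizeList,
    Image,
    SparseTensor,
}

fn display_name(t: &SimpleType) -> &'static str {
    match t {
        // Types Spark has natively keep their Spark names.
        SimpleType::Utf8 => "string",
        // Arrow-native types that Spark lacks get an `arrow.` prefix.
        SimpleType::UInt64 => "arrow.uint64",
        SimpleType::FixedSizeList => "arrow.fixed_size_list",
        // Daft-specific types get a `daft.` prefix.
        SimpleType::Image => "daft.image",
        SimpleType::SparseTensor => "daft.sparse_tensor",
    }
}

fn main() {
    assert_eq!(display_name(&SimpleType::UInt64), "arrow.uint64");
    assert_eq!(display_name(&SimpleType::Image), "daft.image");
}
```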

Member Author:

should be done

Comment on lines +101 to +121
DataType::FixedSizeBinary(_) => "arrow.fixed_size_binary".to_string(),
DataType::Utf8 => "string".to_string(),
DataType::FixedSizeList(_, _) => "arrow.fixed_size_list".to_string(),
DataType::List(_) => "arrow.list".to_string(),
DataType::Struct(_) => "struct".to_string(),
DataType::Map { .. } => "map".to_string(),
DataType::Extension(_, _, _) => "daft.extension".to_string(),
DataType::Embedding(_, _) => "daft.embedding".to_string(),
DataType::Image(_) => "daft.image".to_string(),
DataType::FixedShapeImage(_, _, _) => "daft.fixed_shape_image".to_string(),
DataType::Tensor(_) => "daft.tensor".to_string(),
DataType::FixedShapeTensor(_, _) => "daft.fixed_shape_tensor".to_string(),
DataType::SparseTensor(_) => "daft.sparse_tensor".to_string(),
DataType::FixedShapeSparseTensor(_, _) => "daft.fixed_shape_sparse_tensor".to_string(),
#[cfg(feature = "python")]
DataType::Python => "daft.python".to_string(),
DataType::Unknown => "unknown".to_string(),
DataType::UInt8 => "arrow.ubyte".to_string(),
DataType::UInt16 => "arrow.ushort".to_string(),
DataType::UInt32 => "arrow.uint".to_string(),
DataType::UInt64 => "arrow.ulong".to_string(),
@universalmind303 (Contributor) commented Dec 19, 2024:

Sorry if I was unclear in my previous comment, but this is still not right.

Arrow types should just be called what they are:

        DataType::UInt8 => "arrow.uint8".to_string(),
        DataType::UInt16 => "arrow.uint16".to_string(),
        DataType::UInt32 => "arrow.uint32".to_string(),
        DataType::UInt64 => "arrow.uint64".to_string(),

And nested datatypes should match how Spark does them.
For example, lists have the inner type rendered as "element":

data = [{"a": [1,2,3], "b": "hello"}]
spark.createDataFrame(data).printSchema()
root
 |-- a: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- b: string (nullable = true)

And for structs, e.g. Struct{ints: i64, strings: utf8}:

root
 |-- struct: struct (nullable = true)
 |    |-- ints: integer (nullable = true)
 |    |-- strings: string (nullable = true)

We'll also want to capture the parameters on them, such as FixedSizeList(Int64, 1):

root
 |-- a: arrow.fixed_size_list (size = 1, nullable = true)
 |    |-- element: long (containsNull = true)

or on Image(ImageMode::RGB)

root
 |-- a: daft.image (mode = RGB, nullable = true)

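A minimal sketch of rendering type parameters inline, following the examples in the review comment above. The `ParamType` enum and its field names are hypothetical stand-ins, not daft's actual API:

```rust
// Hypothetical parameterized types; daft's real `DataType` variants differ.
enum ParamType {
    FixedSizeList { size: usize, element: &'static str },
    Image { mode: &'static str },
}

fn render(name: &str, t: &ParamType) -> String {
    match t {
        ParamType::FixedSizeList { size, element } => {
            // Parent line carries the parameter; child line renders the
            // element type, mirroring Spark's "element" convention.
            let head =
                format!(" |-- {name}: arrow.fixed_size_list (size = {size}, nullable = true)");
            let child = format!(" |    |-- element: {element} (containsNull = true)");
            format!("{head}\n{child}")
        }
        ParamType::Image { mode } => {
            format!(" |-- {name}: daft.image (mode = {mode}, nullable = true)")
        }
    }
}

fn main() {
    let fsl = render("a", &ParamType::FixedSizeList { size: 1, element: "long" });
    assert!(fsl.contains("(size = 1, nullable = true)"));
    let img = render("a", &ParamType::Image { mode: "RGB" });
    assert_eq!(img, " |-- a: daft.image (mode = RGB, nullable = true)");
}
```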
2 participants