[BUG] Use schema_hints as hints instead of definitive schema #1636
Conversation
Codecov Report

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #1636      +/-   ##
==========================================
- Coverage   85.07%   85.06%   -0.01%
==========================================
  Files          55       55
  Lines        5345     5350       +5
==========================================
+ Hits         4547     4551       +4
- Misses        798      799       +1
Amazing work, and nice tests!! Two main questions on my end:
"Added option to pass in schema into read_parquet_into_micropartition (this was necessary because the schema created from the scan operator was not passed in)"

I think read_parquet_into_micropartition should stay agnostic to any external schema information. Could we perform schema hint application and coercion after performing a "naive read" in read_parquet_into_micropartition? See: PR comment.
"For read_csv, I added a test case to ensure that if has_headers=false, then the schema_hints should be used as definitive schema."

Any reason why you decided to go with these semantics? Wouldn't the code be simpler and easier to reason about if we also allowed partial schema hints for has_headers=false? Are there any fundamental limitations preventing us from doing this?
daft/io/common.py (outdated)

# If CSV and no headers, then use the schema hint as the schema
if isinstance(file_format_config.config, CsvSourceConfig) and file_format_config.config.has_headers == False:
    if len(schema) != len(schema_hint):
        raise ValueError(
Why did you decide to enforce this invariant here? Wouldn't the code still work naively if we provided partial hints like "column_0": DataType.string(), and those hints were applied as per the rest of the PR?
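For illustration, a minimal sketch of what a partial hint on a headerless CSV could look like under those semantics (the file path is hypothetical; headerless columns are assumed to get the default names column_0, column_1, ...):

import daft
from daft import DataType

# Hypothetical headerless CSV: schema inference assigns default column
# names ("column_0", "column_1", ...). The partial hint below would only
# override the inferred type of "column_0"; every other column keeps its
# inferred type and default name.
df = daft.read_csv(
    "data.csv",  # hypothetical path
    has_headers=False,
    schema_hints={"column_0": DataType.string()},
)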
Based on this existing test case (Daft/tests/dataframe/test_creation.py, lines 484 to 511 at 06c2ccf):
def test_create_dataframe_csv_specify_schema_no_headers(
    valid_data: list[dict[str, float]], use_native_downloader
) -> None:
    with create_temp_filename() as fname:
        with open(fname, "w") as f:
            header = list(valid_data[0].keys())
            writer = csv.writer(f, delimiter="\t")
            writer.writerows([[item[col] for col in header] for item in valid_data])
            f.flush()

            df = daft.read_csv(
                fname,
                delimiter="\t",
                schema_hints={
                    "sepal_length": DataType.float64(),
                    "sepal_width": DataType.float64(),
                    "petal_length": DataType.float64(),
                    "petal_width": DataType.float64(),
                    "variety": DataType.string(),
                },
                has_headers=False,
                use_native_downloader=use_native_downloader,
            )
            assert df.column_names == COL_NAMES
            pd_df = df.to_pandas()
            assert list(pd_df.columns) == COL_NAMES
            assert len(pd_df) == len(valid_data)
I thought maybe it makes sense for schema_hints to be the definitive schema when the CSV has no headers, as a way to provide named columns instead of the defaults "column_0", "column_1", etc., and this would only work if hints for all columns are provided.
But I also agree that it would be simpler and more consistent to remove this invariant and let the user realize that column names will default to "column_0" etc., so they can name their schema hints accordingly. I also realize that column names can be changed with .alias 😅
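For instance, a hypothetical sketch of renaming the default columns with .alias afterwards (file path and target names are illustrative):

import daft
from daft import col

# Headerless read: columns come back as "column_0", "column_1", ...
df = daft.read_csv("data.csv", has_headers=False)  # hypothetical path

# Rename the default columns after the fact.
df = df.select(
    col("column_0").alias("sepal_length"),
    col("column_1").alias("sepal_width"),
)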
Removed these checks in the latest commit.
src/daft-core/src/schema.rs (outdated)

@@ -86,6 +86,17 @@ impl Schema {
        }
    }

    pub fn apply_hints(&self, hints: &Schema) -> DaftResult<Schema> {
        let mut fields = IndexMap::new();
        for (name, field) in self.fields.iter() {
Nice! This preserves the ordering of the original schema as well, which is important.
This is completely fine as-is, but if you wanted, you could use Rust iterators instead, which would avoid needing the intermediate mut fields variable. I think IndexMaps can be "collected" from an iterator, and something like this might work:

let applied_fields = self
    .fields
    .iter()
    .map(|(name, field)| match hints.fields.get(name) {
        None => (name.clone(), field.clone()),
        Some(hint_field) => (name.clone(), hint_field.clone()),
    })
    .collect::<IndexMap<String, Field>>();
Ok(Schema { fields: applied_fields })
Yup! It works, made the changes. I like it a lot better too; it's more concise and expressive (and more performant? Not sure though, I'll need to learn more about Rust).
Re: "more performant" -- maybe, depending on how the compiler chooses to optimize it! Iterators are pretty idiomatic in Rust :)
@@ -615,6 +616,7 @@ pub(crate) fn read_csv_into_micropartition(

pub(crate) fn read_parquet_into_micropartition(
    uris: &[&str],
    columns: Option<&[&str]>,
    schema: Option<SchemaRef>,
Parquet reads differ a little from CSV reads: for Parquet, the file format itself contains a schema, so no external schema information is required when reading the file. Therefore, for read_parquet_into_micropartition, we probably do not want to pass in the schema (unlike the CSV reads!).
Instead, we can let read_parquet_into_micropartition perform its own schema inference/data parsing, and then later coerce the resultant MicroPartition into the inferred schema. The overall flow would look something like:
// Naively read Parquet file(s) into a MicroPartition, with no schema coercion applied.
// Note that this all happens lazily, because MicroPartition is a lazy-loading abstraction.
let mp = read_parquet_into_micropartition(...);
let applied_schema = mp.schema().apply(schema_hints);
let mp = mp.cast_to_schema(&applied_schema);
Ah got it, made this change in the latest commit.
Awesome, just fixed the merge conflict!
Schema hint documentation was out of date after #1636. This PR fixes our docs. Co-authored-by: Jay Chia <[email protected]@users.noreply.github.com>
Addresses #1599
Instead of using schema_hints as a definitive schema, use them as 'hints' for the intended datatype of each column. This is implemented by running schema inference first, then applying the 'hints' onto the inferred schema.
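For example, a rough sketch of the intended behaviour (the file and column name are hypothetical; only the hinted column's inferred type is overridden):

import daft
from daft import DataType

# Schema inference runs first; the hint then overrides only the inferred
# type of "id". All other columns keep their inferred names and types.
df = daft.read_parquet("data.parquet", schema_hints={"id": DataType.int64()})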
Tests:
Feedback greatly appreciated! Let me know if this is the correctly intended behaviour, and also whether the code can be optimized/refactored, since this is my first time writing Rust!