Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Schema hints not working properly for json reads #1845

Merged
merged 5 commits into from
Feb 9, 2024

Conversation

colin-ho
Copy link
Contributor

@colin-ho colin-ho commented Feb 6, 2024

Schema hints were not being propagated, leading to fields being dropped.

This stemmed from an issue when reading large jsons from s3 where the fields changed only late into the file, so schema inference doesn't pick it up.

@github-actions github-actions bot added the bug Something isn't working label Feb 6, 2024
Copy link

codecov bot commented Feb 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (caa7bab) 85.62% compared to head (c0116e5) 85.53%.
Report is 9 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1845      +/-   ##
==========================================
- Coverage   85.62%   85.53%   -0.10%     
==========================================
  Files          55       55              
  Lines        6102     6147      +45     
==========================================
+ Hits         5225     5258      +33     
- Misses        877      889      +12     

see 12 files with indirect coverage changes

@colin-ho colin-ho force-pushed the colin/fix-json-reads branch from a9fa929 to e58d3ef Compare February 7, 2024 10:43
@colin-ho colin-ho requested review from jaychia and samster25 February 8, 2024 13:05
@@ -797,6 +797,45 @@ def test_create_dataframe_json_schema_hints_ignore_random_hint(valid_data: list[
assert len(pd_df) == len(valid_data)


def test_create_dataframe_json_schema_hints_large_file() -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice test 😵

Another maybe easier way could be to have 2 files (Schema inference is only performed on one file by default at the moment):

# File 1
{"foo": 1}

# File 2
{"bar": 2}

Then we pass in schema hints: {"foo": Int32, "bar": Int32} and we should see the correct result for both rows

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yeah i like that a lot better! will do this instead of generating a 1mb json haha

@colin-ho colin-ho merged commit 573ea12 into main Feb 9, 2024
42 checks passed
@colin-ho colin-ho deleted the colin/fix-json-reads branch February 9, 2024 09:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants