Robust against non-utf8 CSVs #8

Closed
mccalluc opened this issue Sep 26, 2024 · 1 comment · Fixed by #29

@mccalluc

See

Polars has a problem with non-utf8 CSVs.

There's an argument for the library to be more fussy, but a user-facing application should just try to do the least frustrating thing. We might also experiment with approaches here, and then port them back to OpenDP.

mccalluc commented Oct 1, 2024

Some options:

  • User's responsibility to reencode to UTF-8 first
  • Use encoding='utf8-lossy', so any bytes that can't be decoded show up as �.
    (Since we're not doing much string processing, and strings are mostly used for grouping, this is not a bad option.)
  • Prompt the user for file encoding... but it's unlikely that they know.
  • DP Creator sniffs file, and includes encoding in generated code.
    • generated code has a helper function that loads lazyframe from CSV with given encoding.
    • OpenDP has a helper function that loads lazyframe from CSV with given encoding.
  • DP Creator includes a sniffer function in generated code.
  • OpenDP has a sniffer function that DP Creator will use.
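
To make the "sniffer" options above concrete, here is a minimal sketch using only the standard library. The function name `sniff_encoding`, the candidate list, and the sample size are all assumptions for illustration; nothing like this exists in DP Creator or OpenDP as far as this thread says. It simply tries strict decoding with each candidate in order, which is much cruder than a statistical detector.

```python
import codecs


def sniff_encoding(path, candidates=("utf-8", "cp1252", "latin-1"), sample_bytes=65536):
    """Return the first candidate encoding that decodes a sample of the file.

    Hypothetical helper: the candidate order is a guess, and latin-1 accepts
    any byte sequence, so it only works as a last-resort fallback.
    """
    with open(path, "rb") as f:
        sample = f.read(sample_bytes)
    # If we stopped at the sample limit, the last character may be cut in
    # half, so only insist on a complete final character when the whole
    # file fit in the sample.
    complete = len(sample) < sample_bytes
    for encoding in candidates:
        decoder = codecs.getincrementaldecoder(encoding)()
        try:
            decoder.decode(sample, final=complete)
        except UnicodeDecodeError:
            continue
        return encoding
    return None
```

A successful decode only shows that the bytes are valid in that encoding, not that the encoding is the right one, so the result is best treated as a default the user can override.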

Any approach that temporarily re-encodes the file as UTF-8 will be a little awkward: we'd want to write to a temporary file, but scan_csv is lazy, so that file has to stay around until the query is actually collected. How and when do we get rid of the re-encoded file (which may contain private data)?
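
One possible answer to the cleanup question is to tie the temporary file's lifetime to a context manager, and require that the lazy query be collected inside the block. This is only a sketch under that assumption; `reencoded_as_utf8` is a hypothetical name, not an existing DP Creator or OpenDP function.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def reencoded_as_utf8(path, encoding):
    """Yield the path of a temporary UTF-8 copy of `path`, deleted on exit.

    Hypothetical helper. mkstemp creates the file readable only by the
    current user, which matters if the CSV holds private data.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".csv")
    try:
        # Take ownership of the descriptor first so it is closed even if
        # opening the source file fails.
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as dst, \
                open(path, "r", encoding=encoding, newline="") as src:
            shutil.copyfileobj(src, dst)
        yield tmp_path
    finally:
        os.remove(tmp_path)


# Intended use with Polars (not run here): the query must be collected
# before the block exits, because the file is gone afterwards.
#
# with reencoded_as_utf8("data.csv", "cp1252") as tmp:
#     df = pl.scan_csv(tmp).collect()
```

The tradeoff is that a LazyFrame cannot be returned out of the block and collected later, which limits how far the laziness can be carried through generated code.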
