Guess encoding if default does not work #114

Closed
mslw wants to merge 5 commits

Conversation

Contributor

@mslw commented Nov 13, 2023

If reading a TSV file with the default encoding fails, roll out a cannon (charset-normalizer) and try to guess the encoding to use.

By default, `Path.open()` uses `locale.getencoding()` when reading a file (which means that we implicitly use UTF-8, at least on Linux). This would fail when reading files with non-ASCII characters prepared (with not-uncommon settings) on Windows. There is no perfect way to learn the encoding of a plain-text file, but existing tools seem to do a good job.

This PR refactors the tabby loader, introducing `_parse_tsv_single` and `_parse_tsv_many` functions that take an optional encoding argument, which allows us to use a guessed encoding (but only after the default fails). Closes #112

charset-normalizer documentation: https://charset-normalizer.readthedocs.io
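
A minimal sketch of the pattern described above (function name and structure are illustrative, not the PR's actual diff):

```python
from pathlib import Path

from charset_normalizer import from_path


def read_tsv_text(path: Path) -> str:
    """Read a text file, guessing the encoding only if the default fails."""
    try:
        # locale-dependent default encoding (typically utf-8 on Linux)
        return path.read_text()
    except UnicodeDecodeError:
        # the cannon: let charset-normalizer inspect the raw bytes
        best = from_path(path).best()
        if best is None:
            # no plausible guess; re-raise the original error
            raise
        return path.read_text(encoding=best.encoding)
```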
Contributor

@mih left a comment
Thanks! I would prefer to have the try/except wrap the loader only, and not the extend/append.

This narrows the try/except down to wrap the loader only, and not the
extend/append, making it clearer what is being tried.
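
An illustrative sketch of the narrowing (names are hypothetical): only the parsing call sits inside the try, so errors from the surrounding bookkeeping are not mistaken for decoding failures:

```python
def collect_rows(paths, parse_tsv, guess_encoding):
    rows = []
    for path in paths:
        try:
            # only the loader is guarded by the try/except
            parsed = parse_tsv(path)
        except UnicodeDecodeError:
            parsed = parse_tsv(path, encoding=guess_encoding(path))
        # the extend stays outside, so it is clear what is being tried
        rows.extend(parsed)
    return rows
```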
Contributor Author

@mslw commented Nov 13, 2023

Updated the try/except; I agree that it looks cleaner.

Unfortunately, I just got a reminder that a guess is only a guess, and guessing is hard for dataset authors' tables, as there may be only one non-ASCII character in an entire file. Out of the 2 files I had (both with German encoding), one got misclassified and an (intended) "ü" was misread. Maybe what I need is a manual specification (or cp1250/cp1252 as a second priority)...
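
A sketch of that second-priority idea (the list and its order are assumptions, not part of the PR): try the default, then a fixed set of Windows code pages instead of a statistical guess:

```python
# assumed priority order; cp1252 covers German-language Windows settings
FALLBACK_ENCODINGS = ("cp1252", "cp1250")


def read_with_fallbacks(path):
    try:
        # platform default first
        return path.read_text()
    except UnicodeDecodeError:
        for enc in FALLBACK_ENCODINGS:
            try:
                return path.read_text(encoding=enc)
            except UnicodeDecodeError:
                continue
        raise
```

Note that cp1252 accepts almost any byte sequence, so a successful decode is not proof that the fallback was the right choice; the order of the list matters.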

mslw added 3 commits November 21, 2023 12:20
When an encoding is explicitly specified, it will be used.

Otherwise, the default encoding used by `Path.open` will be tried, and
charset_normalizer will be used to guess one if that fails.
Because load functions are used recursively (when load statements are
found in a tabby file), it would be too much hassle to pass the
encoding parameter around; better to use `self._encoding`.
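
A minimal sketch of that idea (class and method names are hypothetical): the encoding lives on the loader instance, so recursive loads pick it up without extra parameters:

```python
class TabbyLoader:
    def __init__(self, encoding=None):
        # None means: use the platform default, as Path.open() would
        self._encoding = encoding

    def load(self, path):
        text = path.read_text(encoding=self._encoding)
        for subpath in self._find_load_statements(text):
            # a recursive load reuses self._encoding implicitly
            self.load(subpath)
        return text

    def _find_load_statements(self, text):
        # placeholder: the real loader would detect load statements
        # in the parsed table and return the referenced paths
        return []
```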
Contributor Author

@mslw commented Nov 21, 2023

I've added the possibility to specify the encoding as an argument to `load_tabby`. However, building on top of the previously proposed logic, this led to "if an encoding is given, use it; otherwise try the default, then guess", and the resulting code seems fairly ugly to me.
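
For illustration, the three-way logic just described (a hypothetical shape, not the actual diff):

```python
from charset_normalizer import from_path


def _read_text(path, encoding=None):
    if encoding is not None:
        # 1. an explicitly given encoding wins
        return path.read_text(encoding=encoding)
    try:
        # 2. otherwise try the platform default
        return path.read_text()
    except UnicodeDecodeError:
        # 3. and only then fall back to a guess
        best = from_path(path).best()
        if best is None:
            raise
        return path.read_text(encoding=best.encoding)
```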

Given my 50/50 success with guessing on the two files that prompted this PR (which I suppose was because there were too few non-ASCII characters for a good guess), I now think that an explicit declaration is more useful than guessing, so I'll close this PR and propose a new one with just the encoding argument added.

Successfully merging this pull request may close these issues.

load_tabby is locale-dependent w.r.t. file encoding (#112)