Guess encoding if default does not work #114
Conversation
If reading a tsv file with default encoding fails, roll out a cannon (charset-normalizer) and try to guess the encoding to use.

By default, `Path.open()` will use `locale.getencoding()` when reading a file, which means that we implicitly use utf-8, at least on Linux. This would fail when reading files with non-ascii characters prepared (with not-uncommon settings) on Windows. There is no perfect way to learn the encoding from a plain text file, but existing tools seem to do a good job.

This commit refactors the tabby loader, makes it use the guessed encoding (but only after the default fails), and closes psychoinformatics-de#112.

https://charset-normalizer.readthedocs.io
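A minimal sketch of that fallback, assuming charset-normalizer's `from_path()`/`best()` API and a hypothetical `read_tsv` helper (not the actual loader code):

```python
from pathlib import Path

from charset_normalizer import from_path


def read_tsv(path: Path) -> list[list[str]]:
    """Read a TSV file, guessing the encoding only if the default fails."""
    try:
        # uses the platform default encoding, typically utf-8 on Linux
        text = path.read_text()
    except UnicodeDecodeError:
        best = from_path(path).best()  # charset-normalizer's best guess
        if best is None:
            raise
        text = path.read_text(encoding=best.encoding)
    return [line.split("\t") for line in text.splitlines()]
```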
Thanks! I would prefer to have the try/except wrap the loader only, and not the extend/append.
This narrows down the try/except to wrap the loader only, and not the extend/append. It is clearer what is being tried.
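A small sketch of the narrowing being suggested here; `load_one` and `guess_encoding` are hypothetical stand-ins for the actual loader and the charset-normalizer call:

```python
def load_all(paths, load_one, guess_encoding):
    """Wrap only the loader call in try/except, not the list bookkeeping."""
    records = []
    for path in paths:
        try:
            record = load_one(path)
        except UnicodeDecodeError:
            # only the load itself is retried with a guessed encoding
            record = load_one(path, encoding=guess_encoding(path))
        # extend/append stays outside the try block, so it is clear
        # what is actually being "tried"
        records.append(record)
    return records
```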
Updated the try/except, agree that it looks cleaner. Unfortunately, I just got a reminder that a guess is only a guess, and a dataset authors table is a hard case, as there may be only one non-ascii character in the entire file. Out of the 2 files I had (both with German encoding), one got misclassified and an (intended) "ü" was misread. Maybe what I need is a manual specification (or cp1250/1252 as second priority)...
When an encoding is explicitly specified, it will be used. Otherwise, the default encoding used by `Path.open` will be tried, and charset_normalizer will be used to guess the encoding if that fails.
Because the load functions are used recursively (when load statements are found in a tabby file), it would be too much hassle to pass an encoding parameter around; it is better to use `self._encoding`.
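A sketch of that instance-attribute approach, with a hypothetical loader class and a made-up `load` statement format (names here are illustrative, not the actual tabby API):

```python
from pathlib import Path


class TabbyLoader:
    """Illustrative loader that keeps the encoding on the instance."""

    def __init__(self, encoding=None):
        # None means: use the platform default encoding
        self._encoding = encoding

    def load(self, path: Path):
        text = path.read_text(encoding=self._encoding)
        for line in text.splitlines():
            fields = line.split("\t")
            if len(fields) > 1 and fields[0] == "load":
                # recursive loads pick up self._encoding automatically;
                # no encoding parameter needs to be passed around
                yield from self.load(path.parent / fields[1])
            else:
                yield fields
```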
I've added the possibility to specify the encoding as an argument. Given my 50/50 success with guessing on the two files that prompted the PR (which I suppose was because there were too few non-ascii characters for a good guess), I now think that an explicit declaration is more useful than guessing, so I think I'll close this PR and propose a new one with just the encoding argument added.
If reading a tsv file with default encoding fails, roll out a cannon (charset-normalizer) and try to guess the encoding to use.

By default, `Path.open()` uses `locale.getencoding()` when reading a file, which means that we implicitly use utf-8, at least on Linux. This would fail when reading files with non-ascii characters prepared (with not-uncommon settings) on Windows. There is no perfect way to learn the encoding from a plain text file, but existing tools seem to do a good job.

This PR refactors the tabby loader, introducing `_parse_tsv_single` and `_parse_tsv_many` functions that take an optional encoding argument, which allows us to use a guessed encoding (but only after the default fails). Closes #112
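A minimal sketch of what such a helper could look like, combining the explicit-argument, default, and guessed-encoding priorities; the signature and the charset-normalizer calls are assumptions, not the PR's actual code:

```python
from pathlib import Path

from charset_normalizer import from_path


def _parse_tsv_single(path: Path, encoding: str | None = None) -> dict[str, str]:
    """Parse a two-column TSV file into a dict, optionally with an explicit encoding."""
    try:
        # encoding=None falls back to the platform default
        text = path.read_text(encoding=encoding)
    except UnicodeDecodeError:
        guess = from_path(path).best()
        if guess is None:
            raise
        text = path.read_text(encoding=guess.encoding)
    return dict(
        line.split("\t", maxsplit=1)
        for line in text.splitlines()
        if "\t" in line
    )
```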