Guess encoding if default does not work #114

Closed
mslw wants to merge 5 commits

Conversation

Contributor

@mslw commented Nov 13, 2023

If reading a TSV file with the default encoding fails, roll out a cannon (charset-normalizer) and try to guess the encoding to use.

By default, `Path.open()` uses `locale.getencoding()` when reading a file (which means that we implicitly use UTF-8, at least on Linux). This would fail when reading files with non-ASCII characters prepared (with not-uncommon settings) on Windows. There is no perfect way to learn the encoding of a plain-text file, but existing tools seem to do a good job.

This PR refactors the tabby loader, introducing `_parse_tsv_single` and `_parse_tsv_many` functions that take an optional encoding argument, which allows us to use a guessed encoding (but only after the default fails). Closes #112

charset-normalizer documentation: https://charset-normalizer.readthedocs.io
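
A minimal sketch of the pattern described above (function name and structure are illustrative, not the PR's actual diff):

```python
from pathlib import Path

from charset_normalizer import from_path


def read_tsv_text(path: Path) -> str:
    """Read a text file, guessing the encoding only if the default fails."""
    try:
        # locale-dependent default encoding (typically utf-8 on Linux)
        return path.read_text()
    except UnicodeDecodeError:
        # the cannon: let charset-normalizer inspect the raw bytes
        best = from_path(path).best()
        if best is None:
            # no plausible guess; re-raise the original error
            raise
        return path.read_text(encoding=best.encoding)
```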
Contributor

@mih left a comment
Thanks! I would prefer to have the try/except wrap the loader only, and not the extend/append.

This narrows the try/except down to wrap the loader only, and not the
extend/append, making it clearer what is being tried.
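
An illustrative sketch of the narrowing (names are hypothetical): only the parsing call sits inside the try, so errors from the surrounding bookkeeping are not mistaken for decoding failures:

```python
def collect_rows(paths, parse_tsv, guess_encoding):
    rows = []
    for path in paths:
        try:
            # only the loader is guarded by the try/except
            parsed = parse_tsv(path)
        except UnicodeDecodeError:
            parsed = parse_tsv(path, encoding=guess_encoding(path))
        # the extend stays outside, so it is clear what is being tried
        rows.extend(parsed)
    return rows
```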
Contributor Author

@mslw commented Nov 13, 2023

Updated the try/except; I agree that it looks cleaner.

Unfortunately, I just got a reminder that a guess is only a guess, and guessing is hard for dataset authors' tables, as there may be only one non-ASCII character in an entire file. Out of the 2 files I had (both with German encoding), one got misclassified and an (intended) "ü" was misread. Maybe what I need is a manual specification (or cp1250/cp1252 as a second priority)...
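
A sketch of that second-priority idea (the list and its order are assumptions, not part of the PR): try the default, then a fixed set of Windows code pages instead of a statistical guess:

```python
# assumed priority order; cp1252 covers German-language Windows settings
FALLBACK_ENCODINGS = ("cp1252", "cp1250")


def read_with_fallbacks(path):
    try:
        # platform default first
        return path.read_text()
    except UnicodeDecodeError:
        for enc in FALLBACK_ENCODINGS:
            try:
                return path.read_text(encoding=enc)
            except UnicodeDecodeError:
                continue
        raise
```

Note that cp1252 accepts almost any byte sequence, so a successful decode is not proof that the fallback was the right choice; the order of the list matters.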

mslw added 3 commits November 21, 2023 12:20
When an encoding is explicitly specified, it will be used.

Otherwise, the default encoding used by `Path.open` will be tried, and
charset_normalizer will be used to guess one if that fails.
Because load functions are used recursively (when load statements are
found in a tabby file), it would be too much hassle to pass the
encoding parameter around; better to use `self._encoding`.
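
A minimal sketch of that idea (class and method names are hypothetical): the encoding lives on the loader instance, so recursive loads pick it up without extra parameters:

```python
class TabbyLoader:
    def __init__(self, encoding=None):
        # None means: use the platform default, as Path.open() would
        self._encoding = encoding

    def load(self, path):
        text = path.read_text(encoding=self._encoding)
        for subpath in self._find_load_statements(text):
            # a recursive load reuses self._encoding implicitly
            self.load(subpath)
        return text

    def _find_load_statements(self, text):
        # placeholder: the real loader would detect load statements
        # in the parsed table and return the referenced paths
        return []
```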
Contributor Author

@mslw commented Nov 21, 2023

I've added the possibility to specify the encoding as an argument to `load_tabby`. However, building on top of the previously proposed logic, this led to "if an encoding is given, use it; otherwise try the default, then guess", and the resulting code seems fairly ugly to me.
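
For illustration, the three-way logic just described (a hypothetical shape, not the actual diff):

```python
from charset_normalizer import from_path


def _read_text(path, encoding=None):
    if encoding is not None:
        # 1. an explicitly given encoding wins
        return path.read_text(encoding=encoding)
    try:
        # 2. otherwise try the platform default
        return path.read_text()
    except UnicodeDecodeError:
        # 3. and only then fall back to a guess
        best = from_path(path).best()
        if best is None:
            raise
        return path.read_text(encoding=best.encoding)
```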

Given my 50/50 success with guessing on the two files that prompted this PR (which I suppose was because there were too few non-ASCII characters for a good guess), I now think that an explicit declaration is more useful than guessing, so I'll close this PR and propose a new one with just the encoding argument added.

Successfully merging this pull request may close these issues.

load_tabby is locale-dependent w.r.t. file encoding (#112)