CWM data contains double-encoded unicode #96

wu-lee · 2024-12-04T17:15:42Z

Problem

Some of the data in the delhi demo has double-encoded text.

For instance, see "Cooperativa de Ahorro y CrÃ©ditos Norte Grande", where "CrÃ©ditos" should read "Créditos".
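A minimal sketch of how this kind of garbling typically arises, assuming the text was encoded as UTF-8 and then wrongly decoded as Latin-1 (or Windows-1252) somewhere upstream. That cause is an assumption, not something confirmed for this dataset, but it reproduces the junk characters seen here:

```python
# Assumption: the upstream pipeline read UTF-8 bytes as Latin-1.
original = "Créditos"

# "é" is the two bytes C3 A9 in UTF-8; read as Latin-1 they become "Ã" and "©".
garbled = original.encode("utf-8").decode("latin-1")

# Reversing the two steps recovers the intended text.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(repaired)
```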

We would need to sweep the data for mojibake like this. Each case should be traced back to its source (probably upstream) and corrected there.

We might also add an automated check for known mojibake cases. These should be identifiable by characteristic junk characters like those above.
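One possible shape for such a check, as a rough sketch only. It assumes the garbling is UTF-8 mis-decoded as Windows-1252 or Latin-1, and flags any string that survives the reverse round trip and comes out different. The names `repair_mojibake` and `find_mojibake` are hypothetical, not part of any existing code here:

```python
from typing import Iterable, List, Optional, Tuple


def repair_mojibake(text: str) -> Optional[str]:
    """Return a suggested repair if text looks double-encoded, else None.

    Assumption: the damage is UTF-8 bytes mis-decoded as cp1252 or Latin-1.
    """
    for codec in ("cp1252", "latin-1"):
        try:
            candidate = text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not representable in this codec, or the bytes are not valid
            # UTF-8: under this assumption the text is not mojibake.
            continue
        if candidate != text:
            return candidate
    return None


def find_mojibake(values: Iterable[str]) -> List[Tuple[int, str, str]]:
    """List (index, original, suggestion) for every suspicious value."""
    hits = []
    for index, value in enumerate(values):
        suggestion = repair_mojibake(value)
        if suggestion is not None:
            hits.append((index, value, suggestion))
    return hits


if __name__ == "__main__":
    names = [
        "Cooperativa de Ahorro y CrÃ©ditos Norte Grande",
        "Cooperativa de Ahorro y Créditos Norte Grande",
    ]
    for index, value, suggestion in find_mojibake(names):
        print(f"{index}: {value!r} -> {suggestion!r}")
```

Correctly encoded accented text is not flagged, because its Windows-1252 bytes are not valid UTF-8. A string that mixes garbled and correct accented characters would be missed by this whole-string test, so the output is best treated as a list of candidates to trace upstream rather than as an automatic fix.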