Different encodings in Jacy files #57

goodmami · 2017-12-12T19:23:40Z

There is a mix of encodings in Jacy's files:

~/grammars/jacy$ find . -type f -exec file -b --mime-encoding {} \; | sort | uniq -c
    206 binary
     20 iso-8859-1
    289 us-ascii
    432 utf-8

The iso-8859-1 ones are probably EUC-JP and not Latin-1. The nkf utility (probably not installed by default: apt install nkf) can guess this for us. Now, just looking at TDL files:

~/grammars/jacy$ find . -name \*.tdl -exec nkf -g {} \; | sort | uniq -c
     19 ASCII
     18 EUC-JP
      1 Shift_JIS
     16 UTF-8

There's even a Shift-JIS one in there. Here are the files that are neither UTF-8 nor ASCII:

~/grammars/jacy$ find . -name \*.tdl -exec echo -en {} "\t" \; -exec nkf -g {} \; | sort -k2 | grep -v 'ASCII$\|UTF-8$'
./lex/adjadv-lex.tdl 	EUC-JP
./lex/ambiguous-lex.tdl 	EUC-JP
./lex/aux-stem-lex.tdl 	EUC-JP
./lex/funct-lex.tdl 	EUC-JP
./lex/idiom-kanyouku-lex.tdl 	EUC-JP
./lex/idiom-lex.tdl 	EUC-JP
./lex/light-verbs-lex.tdl 	EUC-JP
./lex/noun-lex.tdl 	EUC-JP
./lex/numbers-lex.tdl 	EUC-JP
./lex/oldlexicon.tdl 	EUC-JP
./lex/p-lex.tdl 	EUC-JP
./lex/pn-lex.tdl 	EUC-JP
./lex/v-ends-lex.tdl 	EUC-JP
./lex/verbstem-lex.tdl 	EUC-JP
./lex/vn-lex.tdl 	EUC-JP
./tmr/class.tdl 	EUC-JP
./tmr/ne2.tdl 	EUC-JP
./tmt.tdl 	EUC-JP
./tmr/ne1.tdl 	Shift_JIS

We should convert all of these to UTF-8 (ASCII is fine for files with no Japanese or other special characters).
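
One way to do the re-encoding is with iconv, which is usually available without installing anything extra. The following is only a sketch and was not run against the Jacy repository: the function name to_utf8 is made up for illustration, and the source encoding has to be passed in by hand rather than trusted from an automatic guess.

```shell
# Hypothetical helper: re-encode FILE from ENCODING to UTF-8 in place.
# Usage: to_utf8 EUC-JP lex/noun-lex.tdl
to_utf8() {
    from=$1
    file=$2
    tmp=$(mktemp) || return 1
    # iconv exits non-zero if the input is not valid in the source
    # encoding; in that case the original file is left untouched.
    if iconv -f "$from" -t UTF-8 "$file" > "$tmp"; then
        cat "$tmp" > "$file"   # overwrite the contents, keeping the file's permissions
        rm -f "$tmp"
    else
        rm -f "$tmp"
        echo "to_utf8: $file is not valid $from" >&2
        return 1
    fi
}
```

Writing to a temporary file first means a failed conversion cannot truncate the original. Any Emacs mode line in a converted file that still names the old coding system would need to be updated by hand.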


goodmami · 2017-12-12T19:29:59Z

It looks like only tmt.tdl is actually being used in the grammar. The rest are commented out. If they are unused, maybe we could remove them. But if they still have value they should be re-encoded.

goodmami · 2017-12-12T19:38:13Z

And actually tmt.tdl was miscategorized by nkf. It appears to be UTF-8 or even ASCII. Same with the Shift-JIS one, tmr/ne1.tdl. Many of the lex/*.tdl files are in fact EUC-JP, though; their mode lines even specify them as such.
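
Because a guesser can misclassify a file, a stricter check is to try decoding each file as UTF-8 and see whether that succeeds. This is a sketch that assumes iconv is installed; is_utf8 and list_non_utf8 are hypothetical helper names. Pure ASCII files also pass, since ASCII is a subset of UTF-8.

```shell
# Hypothetical helper: succeed if FILE decodes cleanly as UTF-8.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Hypothetical helper: print every .tdl file under DIR that is NOT valid UTF-8.
list_non_utf8() {
    find "$1" -name '*.tdl' -type f | while IFS= read -r f; do
        is_utf8 "$f" || printf '%s\n' "$f"
    done
}
```

Running list_non_utf8 . from the grammar directory would then list only the files that really need re-encoding, whatever their actual legacy encoding is.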

fcbond · 2017-12-13T05:09:21Z

I think we are not loading any of the lex/*.tdl files, so rather than converting them we should probably delete them.

-- Francis Bond <http://www3.ntu.edu.sg/home/fcbond/> Division of Linguistics and Multilingual Studies Nanyang Technological University

goodmami · 2017-12-14T17:56:37Z

We are still importing tanaka-unknowns.tdl, which has almost 7k lexical entries. It is UTF-8.

;;; from japanese.tdl:
:begin :instance :status lex-entry.
   :include "lexicon.tdl".
   :include "lex/tanaka-unknowns.tdl".  ; <-- here
;  :include "lex/adjadv-lex.tdl".
;  :include "lex/aux-stem-lex.tdl".
;  :include "lex/funct-lex.tdl".
;  :include "lex/idiom-lex.tdl".
;  :include "lex/light-verbs-lex.tdl".
;  :include "lex/noun-lex.tdl".
;  :include "lex/numbers-lex.tdl".
;  :include "lex/p-lex.tdl".
;  :include "lex/pn-lex.tdl".
;  :include "lex/verbstem-lex.tdl".
;  :include "lex/vn-lex.tdl".
;  :include "lex/v-ends-lex.tdl".
;  :include "lex/ambiguous-lex.tdl".
:end :instance.
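
To see at a glance which lexicon files are no longer loaded, the commented-out :include lines can be pulled out of a TDL file. This is a rough sketch: commented_includes is a hypothetical helper, and it only recognizes lines commented with a leading ";" as in the excerpt above, not other comment styles.

```shell
# Hypothetical helper: print the file names from :include lines that are
# commented out with a leading ";" in the given TDL file.
commented_includes() {
    sed -n 's/^[[:space:]]*;[;[:space:]]*:include[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}
```

For example, commented_includes japanese.tdl would print one path per line, which can be compared against the list of non-UTF-8 files before deciding what to delete.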

There are also .rev, .blacklist, and a few other file types. I'll let you deal with deleting the files since I'm not sure what is valuable to keep.
