Different encodings in Jacy files #57
Comments
It looks like only |
And actually |
I think we are not loading any of the lex/*.tdl files, so rather than
converting them we should probably delete them.
On Wed, Dec 13, 2017 at 3:38 AM, Michael Wayne Goodman wrote:
And actually tmt.tdl was miscategorized by nkf. It appears to be UTF-8 or
even ASCII. Same with the Shift-JIS one, tmr/ne1.tdl. Many of the
lex/*.tdl files are in fact EUC-JP, though; their mode lines even specify
them as such.
--
Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University
We are still importing lex/tanaka-unknowns.tdl:
;;; from japanese.tdl:
:begin :instance :status lex-entry.
:include "lexicon.tdl".
:include "lex/tanaka-unknowns.tdl". ; <-- here
; :include "lex/adjadv-lex.tdl".
; :include "lex/aux-stem-lex.tdl".
; :include "lex/funct-lex.tdl".
; :include "lex/idiom-lex.tdl".
; :include "lex/light-verbs-lex.tdl".
; :include "lex/noun-lex.tdl".
; :include "lex/numbers-lex.tdl".
; :include "lex/p-lex.tdl".
; :include "lex/pn-lex.tdl".
; :include "lex/verbstem-lex.tdl".
; :include "lex/vn-lex.tdl".
; :include "lex/v-ends-lex.tdl".
; :include "lex/ambiguous-lex.tdl".
:end :instance.
There are also |
There is a mix of encodings in Jacy's files:
The iso-8859-1 ones are probably EUC-JP and not Latin-1. The nkf utility (probably not installed by default: apt install nkf) can guess this for us. Now, just looking at TDL files:
There's even a Shift-JIS one in there. Here are the non-UTF-8 and non-ASCII files:
We should convert all of these to UTF-8 (ASCII is fine where a file has no Japanese or special characters).
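
As a rough illustration of the conversion proposed here, the following Python sketch guesses each file's encoding by trial decoding and rewrites non-UTF-8 files in place. It is not what was actually run on Jacy. The candidate list and its order, the helper names `guess_encoding` and `convert_to_utf8`, and the `*.tdl` glob are all assumptions made for the example. Trial decoding is a heuristic (as is nkf's guess), so the results should be checked by eye before committing.

```python
from pathlib import Path
from typing import Optional

# Candidate encodings, tried in order (an assumption for this sketch).
# ASCII and UTF-8 are strict, so they are tried before the legacy encodings.
CANDIDATES = ("ascii", "utf-8", "euc_jp", "shift_jis")


def guess_encoding(data: bytes) -> Optional[str]:
    """Return the first candidate that decodes `data` without error, else None."""
    for enc in CANDIDATES:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None


def convert_to_utf8(path: Path) -> Optional[str]:
    """Rewrite `path` as UTF-8 if it is in a legacy encoding.

    Returns the guessed original encoding (None if no candidate matched,
    in which case the file is left untouched).
    """
    data = path.read_bytes()
    enc = guess_encoding(data)
    if enc not in (None, "ascii", "utf-8"):
        path.write_bytes(data.decode(enc).encode("utf-8"))
    return enc


if __name__ == "__main__":
    # Hypothetical usage: run from the root of the grammar checkout.
    for tdl in sorted(Path(".").rglob("*.tdl")):
        print(tdl, convert_to_utf8(tdl))
```

Files whose mode lines declare a coding (for example the EUC-JP ones under lex/) would also need those mode lines updated by hand after conversion; the sketch does not touch them.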