Duplicates in Tanaka data #56
Comments
There is some value in repetition of minor variants, although not, as you
say, when it is wrong.
On Tue, Oct 24, 2017 at 2:32 PM, Michael Wayne Goodman wrote:
There are a lot of English duplicates in the data, but only one Japanese
duplicate and zero translation pair duplicates (but see below).
The Japanese example is:
| test suite | i-id | japanese | english |
|---|---|---|---|
| tc-044 | 30066432 | それ は 申し分 ない 。 | That is all right. |
| tc-050 | 30075798 | それ は 申し分 ない 。 | That's all right. |
The difference in the English is really minimal (the ERG yields the same
results anyway).
For the English duplicates, there are 17,227 items affected (7,380
sentences with 9,847 redundant items). The disparity (compared to Japanese
dupes) is surely due to having multiple Japanese orthographies or other
slight variations with the same translation, which means that most of them
are very close to being duplicate translations as well.
| test suite | i-id | english | japanese |
|---|---|---|---|
| tc-006 | 30009607 | You should read such books as you consider important. | 君 は 自分 で 重要 だ と 思う 本 を 読む べき だ 。 |
| tc-024 | 30037259 | You should read such books as you consider important. | 彼 は 自分 で 重要 だ と 思う 本 を 読む べき だ 。 |
| tc-007 | 30010880 | You should take her illness into consideration. | あなた は 彼女 の 病気 を 考慮 す べき だ 。 |
| tc-016 | 30024950 | You should take her illness into consideration. | 彼女 が 病気 だ と 言う こと を 考慮 に 入れる べき です 。 |
| tc-039 | 30059213 | You should take her illness into consideration. | 彼女 の 病気 を 考慮 に 入れる べき だ 。 |
| tc-048 | 30072658 | You should take her illness into consideration. | あなた は 彼 の 病気 を 考慮 に 入れる べき だ 。 |
| tc-051 | 30077732 | You should take her illness into consideration. | あなた は 彼女 の 病気 を 考慮 に 入れる べき だ 。 |
Note that several of these are bad translations, too (e.g., "He should
read..." for 30037259 or "You should take his illness..." for 30072658).
Otherwise, maybe there's some value in having these slight variations, but
the (already small) corpus has even less variety than I thought
(9,847/150,342 ≈ 6.5% of the corpus is (near-)duplicates), which reduces
its value in training MT models.
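
For anyone who wants to reproduce counts like these, here is a minimal sketch of how the three figures (duplicated sentences, items affected, redundant items) could be computed. It is not the script used for the numbers above. The function name `duplicate_stats`, the `(i_id, english, japanese)` tuple layout, and the four-item sample are assumptions made for illustration; exact string equality is used as the duplicate criterion, so near-duplicates such as "That is all right." vs. "That's all right." are not merged.

```python
from collections import defaultdict


def duplicate_stats(items, key):
    """Return (duplicated_values, items_affected, redundant_items).

    items: iterable of (i_id, english, japanese) tuples (assumed layout).
    key:   function choosing the field(s) to compare for exact equality.
    """
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item[0])

    # Keep only values shared by more than one item.
    dup_groups = [ids for ids in groups.values() if len(ids) > 1]
    affected = sum(len(ids) for ids in dup_groups)
    # One item per group is the "original"; the rest are redundant.
    redundant = affected - len(dup_groups)
    return len(dup_groups), affected, redundant


# Tiny hand-built sample taken from the rows quoted in this thread.
sample = [
    (30009607, "You should read such books as you consider important.",
     "君 は 自分 で 重要 だ と 思う 本 を 読む べき だ 。"),
    (30037259, "You should read such books as you consider important.",
     "彼 は 自分 で 重要 だ と 思う 本 を 読む べき だ 。"),
    (30066432, "That is all right.", "それ は 申し分 ない 。"),
    (30075798, "That's all right.", "それ は 申し分 ない 。"),
]

english_dupes = duplicate_stats(sample, key=lambda it: it[1])
japanese_dupes = duplicate_stats(sample, key=lambda it: it[2])
pair_dupes = duplicate_stats(sample, key=lambda it: (it[1], it[2]))

# Share of the sample that is redundant on the English side.
english_redundant_share = english_dupes[2] / len(sample)

print("english :", english_dupes)
print("japanese:", japanese_dupes)
print("pairs   :", pair_dupes)
print("redundant share (english): {:.1%}".format(english_redundant_share))
```

With these definitions, items affected is always duplicated values plus redundant items, which matches the relationship between the figures reported above (7,380 + 9,847 = 17,227).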
--
Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University