Duplicates in Tanaka data #56

Open
goodmami opened this issue Oct 24, 2017 · 1 comment

Comments

@goodmami
Member

There are a lot of English duplicates in the data, but only one Japanese duplicate and zero translation pair duplicates (but see below).

The Japanese example is:

| test suite | i-id | japanese | english |
| --- | --- | --- | --- |
| tc-044 | 30066432 | それ は 申し分 ない 。 | That is all right. |
| tc-050 | 30075798 | それ は 申し分 ない 。 | That's all right. |

The difference in the English is really minimal (the ERG yields the same results anyway).

For the English duplicates, 17,227 items are affected: 7,380 distinct sentences plus 9,847 redundant copies of them. The disparity with the Japanese dupes is surely due to multiple Japanese orthographies or other slight variations sharing the same translation, which means that most of them are very close to being duplicate translation pairs as well.

| test suite | i-id | english | japanese |
| --- | --- | --- | --- |
| tc-006 | 30009607 | You should read such books as you consider important. | 君 は 自分 で 重要 だ と 思う 本 を 読む べき だ 。 |
| tc-024 | 30037259 | You should read such books as you consider important. | 彼 は 自分 で 重要 だ と 思う 本 を 読む べき だ 。 |
| tc-007 | 30010880 | You should take her illness into consideration. | あなた は 彼女 の 病気 を 考慮 す べき だ 。 |
| tc-016 | 30024950 | You should take her illness into consideration. | 彼女 が 病気 だ と 言う こと を 考慮 に 入れる べき です 。 |
| tc-039 | 30059213 | You should take her illness into consideration. | 彼女 の 病気 を 考慮 に 入れる べき だ 。 |
| tc-048 | 30072658 | You should take her illness into consideration. | あなた は 彼 の 病気 を 考慮 に 入れる べき だ 。 |
| tc-051 | 30077732 | You should take her illness into consideration. | あなた は 彼女 の 病気 を 考慮 に 入れる べき だ 。 |

Note that several of these are bad translations, too: the Japanese for 30037259 actually says "He should read..." and the Japanese for 30072658 says "You should take his illness...". Otherwise, maybe there's some value in having these slight variations, but the (already small) corpus has even less variety than I thought: 9,847/150,342 ≈ 6.5% of the corpus is (near-)duplicates, which reduces its value for training MT models.
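For anyone who wants to reproduce this kind of count, here is a minimal sketch of how "affected", "distinct", and "redundant" relate. It is not the script used for the numbers above; the function name `duplicate_stats` and the exact-string matching are assumptions, and the sample list just reuses the English sentences quoted in this issue.

```python
from collections import Counter


def duplicate_stats(sentences):
    """Return (affected, distinct, redundant) for exact-match duplicates.

    affected  = items whose sentence occurs more than once
    distinct  = number of different sentences among those items
    redundant = affected - distinct (the extra copies)
    """
    counts = Counter(sentences)
    repeated = {s: n for s, n in counts.items() if n > 1}
    affected = sum(repeated.values())
    distinct = len(repeated)
    return affected, distinct, affected - distinct


# English sides of the items quoted in this issue (hypothetical mini-corpus).
english = [
    "That is all right.",
    "That's all right.",
    "You should read such books as you consider important.",
    "You should read such books as you consider important.",
    "You should take her illness into consideration.",
    "You should take her illness into consideration.",
    "You should take her illness into consideration.",
    "You should take her illness into consideration.",
    "You should take her illness into consideration.",
]

affected, distinct, redundant = duplicate_stats(english)
print(affected, distinct, redundant)  # 7 2 5
print(f"redundant share: {redundant / len(english):.1%}")
```

Because matching is on exact strings, "That is all right." and "That's all right." are not counted as duplicates of each other; catching those would need some normalization step first.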

@fcbond
Member

fcbond commented Oct 24, 2017 via email
