Duplicates in Tanaka data #56

goodmami · 2017-10-24T04:32:02Z

There are a lot of English duplicates in the data, but only one Japanese duplicate and zero translation pair duplicates (but see below).

The Japanese example is:

test suite	i-id	japanese	english
tc-044	30066432	それは申し分ない。	That is all right.
tc-050	30075798	それは申し分ない。	That's all right.

The difference in the English is really minimal (the ERG yields the same results anyway).

For the English duplicates, there are 17,227 items affected (7,380 distinct sentences, so 9,847 of those items are redundant). The disparity (compared to Japanese dupes) is surely due to having multiple Japanese orthographies or other slight variations with the same translation, which means that most of them are very close to being duplicate translations as well.

test suite	i-id	english	japanese
tc-006	30009607	You should read such books as you consider important.	君は自分で重要だと思う本を読むべきだ。
tc-024	30037259	You should read such books as you consider important.	彼は自分で重要だと思う本を読むべきだ。
tc-007	30010880	You should take her illness into consideration.	あなたは彼女の病気を考慮すべきだ。
tc-016	30024950	You should take her illness into consideration.	彼女が病気だと言うことを考慮に入れるべきです。
tc-039	30059213	You should take her illness into consideration.	彼女の病気を考慮に入れるべきだ。
tc-048	30072658	You should take her illness into consideration.	あなたは彼の病気を考慮に入れるべきだ。
tc-051	30077732	You should take her illness into consideration.	あなたは彼女の病気を考慮に入れるべきだ。

Note that several of these are bad translations, too (e.g., the Japanese says "He should read..." for 30037259 and "You should take his illness..." for 30072658). Otherwise, maybe there's some value in having these slight variations, but the (already small) corpus has even less variety than I thought (9,847/150,342 ≈ 6.5% of the corpus is (near-)duplicates), which reduces its value in training MT models.
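For reference, here is a minimal sketch of how counts like these can be computed. It is not the script used for the numbers above; the `duplicate_stats` function and the `(i-id, english, japanese)` tuple layout are assumptions for illustration, and the sample rows are just the items quoted in this comment.

```python
from collections import defaultdict


def duplicate_stats(items, key_index):
    """Count duplicates of one field across items.

    items     -- iterable of tuples (hypothetical layout: i-id, english, japanese)
    key_index -- index of the field to compare (1 = english, 2 = japanese)

    Returns (affected, distinct, redundant):
      affected  -- items whose field value occurs more than once
      distinct  -- number of distinct repeated values
      redundant -- affected - distinct (copies beyond the first)
    """
    groups = defaultdict(list)
    for item in items:
        groups[item[key_index]].append(item)

    # Keep only values shared by at least two items.
    repeated = [group for group in groups.values() if len(group) > 1]
    affected = sum(len(group) for group in repeated)
    distinct = len(repeated)
    return affected, distinct, affected - distinct


# Sample rows taken from the tables in this comment.
sample = [
    ("30066432", "That is all right.", "それは申し分ない。"),
    ("30075798", "That's all right.", "それは申し分ない。"),
    ("30009607", "You should read such books as you consider important.",
     "君は自分で重要だと思う本を読むべきだ。"),
    ("30037259", "You should read such books as you consider important.",
     "彼は自分で重要だと思う本を読むべきだ。"),
    ("30010880", "You should take her illness into consideration.",
     "あなたは彼女の病気を考慮すべきだ。"),
    ("30024950", "You should take her illness into consideration.",
     "彼女が病気だと言うことを考慮に入れるべきです。"),
    ("30059213", "You should take her illness into consideration.",
     "彼女の病気を考慮に入れるべきだ。"),
    ("30072658", "You should take her illness into consideration.",
     "あなたは彼の病気を考慮に入れるべきだ。"),
    ("30077732", "You should take her illness into consideration.",
     "あなたは彼女の病気を考慮に入れるべきだ。"),
]

english = duplicate_stats(sample, 1)   # (7, 2, 5) on this sample
japanese = duplicate_stats(sample, 2)  # (2, 1, 1) on this sample
print("english:", english)
print("japanese:", japanese)
```

The comparison here is exact string equality, so "That is all right." and "That's all right." count as different English sentences, which matches how the two tables above were built.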

fcbond · 2017-10-24T05:41:43Z

There is some value in repetition of minor variants, although not, as you say, when it is wrong.

…

-- Francis Bond <http://www3.ntu.edu.sg/home/fcbond/> Division of Linguistics and Multilingual Studies Nanyang Technological University
