Corpora deduplication #41
I completely agree with you that we should deduplicate more aggressively (hashes on both sides). We did that for our French system and it seemed to help training. Unfortunately, it seems that it is absolutely necessary to select the best matching translation out of all duplicates, because if we don't, this is what happens:
That sounds pretty similar to TF/IDF? My intuition would be that some repetition with small variations isn't bad, but that we amplify this too much because we already repeat the dataset during training, and these repetitions get repeated again. What if you gave every sentence a "uniqueness score" and used that to select which sentences to repeat during training? You'd keep the variation that's in the training data, but also get rid of the dominance of the duplicates? (Not that I'm proposing this as an alternative, more thinking out loud!)
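For illustration, a minimal sketch of that idea (not from the thread; tab-separated src/trg on stdin, the lower-cased source side as the duplicate key, and 1/count weights are all assumptions):

#!/usr/bin/env python3
# Sketch of the "uniqueness score" idea: weight how often a sentence pair is
# repeated during training by how common its source side is.
import sys
import random
from collections import Counter

lines = [line.rstrip('\n') for line in sys.stdin if '\t' in line]
counts = Counter(line.split('\t', 1)[0].lower() for line in lines)

# Uniqueness weight: a unique sentence gets weight 1, a sentence duplicated
# n times gets weight 1/n, so each duplicate cluster contributes about 1 in total.
weights = [1.0 / counts[line.split('\t', 1)[0].lower()] for line in lines]

# Sample an "extra epoch" proportionally to uniqueness, instead of repeating
# the whole (duplicate-heavy) dataset verbatim.
for line in random.choices(lines, weights=weights, k=len(lines)):
    print(line)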
I've been using this script to deduplicate. It essentially does bifixer's deduplication, but now that deduplication is not part of the trainer (the more I think about it, the more I agree with Nick that that was a bad idea) I'd like to integrate it someplace else in empty-train. And I'd love to experiment a bit with what good deduplication would be.
#!/usr/bin/env python3
import sys
import subprocess
from xxhash import xxh64
from unidecode import unidecode
from unicodedata import category as cat
from multiprocessing import Pool, cpu_count
from itertools import count, cycle

REMOVE_NON_ALPHA = str.maketrans('', '', ''.join(
    chr(i)
    for i in range(sys.maxunicode)
    if not cat(chr(i)).startswith('L')
))

def read_file(filename):
    # Lazy me not wanting to run `paste` first
    if ':' in filename:
        readers = [read_gzip(name) for name in filename.split(':')]
        for row in zip(*readers):
            yield '\t'.join(field.rstrip('\n') for field in row) + '\n'
    else:
        yield from read_gzip(filename)

def read_gzip(filename):
    child = subprocess.Popen(['pigz', '-cd', filename], stdout=subprocess.PIPE, encoding='utf-8', bufsize=2**16)
    yield from child.stdout
    child.wait()

def line_hash_rank(entry):
    lineno, line = entry
    src, trg = line.rstrip('\n').split('\t', maxsplit=1)
    normalized_src = unidecode(src.lower().translate(REMOVE_NON_ALPHA))
    normalized_trg = unidecode(trg.lower().translate(REMOVE_NON_ALPHA))
    # Deduplicating only on the src side since same src -> different translations
    # is detrimental to model, but different src -> same trg is fine.
    segment_hash = xxh64(normalized_src).hexdigest()
    #segment_hash = xxh64(normalized_src + "\t" + normalized_trg).hexdigest()
    charsum = sum(ord(ch) for ch in src + trg)
    ranking = round(charsum / (float(len(src + trg)) + 0.00001), 2)  # the 0.00001 is for the case that length is 0 :D
    return lineno, segment_hash, ranking

def rank_lines_in_file(filename):
    table = {}
    with Pool(8) as pool:
        for lineno, segment_hash, ranking in pool.imap_unordered(line_hash_rank, enumerate(read_file(filename)), chunksize=2**16):
            if segment_hash not in table or table[segment_hash][0] < ranking:
                table[segment_hash] = (ranking, lineno)
    return table

def rank_lines(files):
    table = {}
    for fileno, filename in enumerate(files):
        for segment_hash, (ranking, lineno) in rank_lines_in_file(filename).items():
            if segment_hash not in table or table[segment_hash][0] < ranking:
                table[segment_hash] = (ranking, fileno, lineno)
    return table

def print_lines(files, table):
    lines_per_file = [
        [lineno for _, fileno, lineno in table.values() if fileno == n]
        for n in range(len(files))
    ]
    for lines in lines_per_file:
        lines.sort()
    for fileno, filename in enumerate(files):
        lineno_it = iter(lines_per_file[fileno])
        next_line = next(lineno_it, None)
        lineno = -1
        if next_line is not None:
            for lineno, line in enumerate(read_file(filename)):
                if lineno == next_line:
                    yield line
                    next_line = next(lineno_it, None)
                    assert next_line is None or next_line > lineno
            assert next_line is None
        used_lines = len(lines_per_file[fileno])
        total_lines = lineno + 1
        percentage = used_lines / total_lines * 100 if total_lines > 0 else 0
        print(f"{filename}: used {used_lines} of {total_lines}: {percentage:0.3f}%", file=sys.stderr)

table = rank_lines(sys.argv[1:])
sys.stdout.writelines(print_lines(sys.argv[1:], table))
The ranking here is just the bifixer rank? Shouldn't we use whatever scoring method we used for filtering (LASER, Bicleaner AI)?
I didn't clean all of the datasets I used with LASER or Bicleaner AI, so I didn't have the scores for all of them. Ideally we would use those scores if we have them, but that would also force you to run all of your data through the same scoring method at the end. We could do something like that, maybe even re-use the scores that you used for thresholding for the datasets where you used bicleaner-ai or LASER as a filtering method. Maybe just by using the same caching methods I use for …
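As a sketch of what re-using an existing score column could look like (the src/trg/score column layout and this script are assumptions for illustration, not the caching machinery mentioned above): dedupe on a normalized source hash, but rank duplicates by the score carried in the third column instead of the bifixer-style character ranking.

#!/usr/bin/env python3
import sys
from unicodedata import category as cat
from unidecode import unidecode
from xxhash import xxh64

REMOVE_NON_ALPHA = str.maketrans('', '', ''.join(
    chr(i) for i in range(sys.maxunicode) if not cat(chr(i)).startswith('L')
))

best = {}  # normalized src hash -> (score, original line)
for line in sys.stdin:
    # Assumes at least three tab-separated columns: src, trg, score.
    src, trg, score = line.rstrip('\n').split('\t')[:3]
    key = xxh64(unidecode(src.lower().translate(REMOVE_NON_ALPHA))).hexdigest()
    if key not in best or best[key][0] < float(score):
        best[key] = (float(score), line)

sys.stdout.writelines(line for _, line in best.values())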
Posting @ZJaume's deduplicators here for posterity's sake:
#!/usr/bin/env python
from argparse import ArgumentParser, FileType
from unicodedata import category as cat
from unidecode import unidecode
from xxhash import xxh64
import sys

parser = ArgumentParser()
parser.add_argument('-a', '--aggressive', action='store_true', default=False)
args = parser.parse_args()

# Translate table to remove non alphabetic characters
tbl = [chr(i) for i in range(sys.maxunicode) if not cat(chr(i)).startswith('L')]
remove_non_alpha = str.maketrans('', '', ''.join(tbl))

def main():
    shashes, thashes = set(), set()
    for line in sys.stdin:
        sline = line.rstrip('\n')
        parts = sline.split('\t')
        src = parts[0]
        trg = parts[1]
        if args.aggressive:
            src = unidecode(src.lower().translate(remove_non_alpha))
            trg = unidecode(trg.lower().translate(remove_non_alpha))
        src_hash = xxh64(src).hexdigest()
        trg_hash = xxh64(trg).hexdigest()
        sys.stdout.write(f"{sline}\t{src_hash}\t{trg_hash}\n")

if __name__ == "__main__":
    main()
#!/usr/bin/env python
import sys

def main():
    shashes, thashes = set(), set()
    for line in sys.stdin:
        parts = line.rstrip("\n").split('\t')
        src_hash = parts[2]
        trg_hash = parts[3]
        if src_hash not in shashes and trg_hash not in thashes:
            sys.stdout.write(line)
            shashes.add(src_hash)
            thashes.add(trg_hash)

if __name__ == "__main__":
    main()

Usage:
Many corpora out there are full of duplicates. Here are some ideas and approaches I have been using.
The method that we used at ParaCrawl was to deduplicate by Bifixer hash and score. The aggressive dedup option produces a hash for each sentence pair. To compute it, we concatenate src and trg, remove all the punctuation and spaces, and unidecode the sentences. So we keep only one version of sentences that differ only in punctuation, accents or things like that. The Bifixer score gives, if I remember correctly, a higher score to sentences containing characters with higher Unicode values, so we end up keeping, for each hash, the sentence pair that is richest in punctuation, accents, diacritics etc., which is probably the best among the duplicates.
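For illustration, the aggressive pair hash described above could be computed roughly like this (a sketch following the description, not Bifixer's actual implementation):

import sys
from unicodedata import category as cat
from unidecode import unidecode
from xxhash import xxh64

# Keep only letters: everything whose Unicode category is not "L*" is stripped.
ONLY_LETTERS = str.maketrans('', '', ''.join(
    chr(i) for i in range(sys.maxunicode) if not cat(chr(i)).startswith('L')
))

def aggressive_pair_hash(src: str, trg: str) -> str:
    # Concatenate src and trg, lowercase, drop punctuation/spaces, strip accents, then hash.
    normalized = unidecode((src + trg).lower().translate(ONLY_LETTERS))
    return xxh64(normalized).hexdigest()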
Recently we discovered that an even more aggressive way of deduplication helps MT. Some experiments by a colleague in MaCoCu showed that it seems to be better to deduplicate with two hashes separately, instead of computing one for the pair. I think because it removes many translation inconsistencies (the same sentence aligned with many different ones).
COMET scores on the FLORES test set, translating from English with an NMT model trained on MaCoCu corpora:
What I have right now in my pipeline is a script to compute the hash by column and another script that throws away each sentence pair that has one hash or the other already seen (not the best possible approach, but it works).
I think the ideal solution would be one like this (dedup on both sides) but keeping, for each hash, the pair that has the better Bicleaner, LASER or whatever scoring method we use. So we end up removing inconsistencies while keeping the pair that is most likely to be good.
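A sketch of that idea (assuming a src/trg/score TSV where higher scores are better; it processes pairs best-score-first, so output order follows the score rather than the input):

import sys
from xxhash import xxh64

rows = []
for line in sys.stdin:
    # Assumes at least three tab-separated columns: src, trg, score.
    src, trg, score = line.rstrip('\n').split('\t')[:3]
    rows.append((float(score), xxh64(src).hexdigest(), xxh64(trg).hexdigest(), line))

# Greedily keep a pair only if neither its source hash nor its target hash
# has been used by a better-scored pair already.
seen_src, seen_trg = set(), set()
for score, src_hash, trg_hash, line in sorted(rows, key=lambda r: r[0], reverse=True):
    if src_hash not in seen_src and trg_hash not in seen_trg:
        seen_src.add(src_hash)
        seen_trg.add(trg_hash)
        sys.stdout.write(line)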
Another thing that I've been thinking of is how to remove the very repetitive stuff that does not get thrown away by this approach. Web corpora are full of this:
Maybe some kind of bigram- or trigram-based removal: if one sentence has all of its bigrams or trigrams repeated more than 20 times, throw it out. I'm pretty confident that removing this kind of repeated sentence won't harm quality and will speed up training.
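A rough two-pass sketch of that filter (whitespace tokenization, source-side trigrams and the 20x threshold are the assumptions here):

import sys
from collections import Counter

THRESHOLD = 20

def trigrams(text):
    tokens = text.split()
    return [tuple(tokens[i:i+3]) for i in range(len(tokens) - 2)]

# First pass: count all source-side trigrams in the corpus.
lines = list(sys.stdin)
counts = Counter()
for line in lines:
    counts.update(trigrams(line.split('\t', 1)[0]))

# Second pass: drop a pair only if every one of its trigrams is very frequent.
for line in lines:
    grams = trigrams(line.split('\t', 1)[0])
    # Keep short sentences (no trigrams) and anything with at least one
    # reasonably rare trigram.
    if not grams or any(counts[g] <= THRESHOLD for g in grams):
        sys.stdout.write(line)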
Just brainstorming here to not forget about it.