Corpora deduplication #41
I completely agree with you that we should deduplicate more aggressively (hashes on both sides). We did that for our French system and it seemed to help training. Unfortunately, it seems that it is absolutely necessary to select the best matching translation out of all duplicates, because if we don't, this is what happens:
That sounds pretty similar to TF/IDF? My intuition would be that some repetition with small variations isn't bad, but that we amplify this too much because we already repeat the dataset during training, and these repetitions get repeated again. What if you gave every sentence a "uniqueness score" and used that to select which sentences to repeat during training? You'd keep the variation that's in the training data, but also get rid of the dominance of the duplicates? (Not that I'm proposing this as an alternative, more thinking out loud!)
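For illustration, a minimal sketch of that idea (not from the thread; tab-separated src/trg on stdin, the lower-cased source side as the duplicate key, and 1/count weights are all assumptions):

#!/usr/bin/env python3
# Sketch of the "uniqueness score" idea: weight how often a sentence pair is
# repeated during training by how common its source side is.
import sys
import random
from collections import Counter

lines = [line.rstrip('\n') for line in sys.stdin if '\t' in line]
counts = Counter(line.split('\t', 1)[0].lower() for line in lines)

# Uniqueness weight: a unique sentence gets weight 1, a sentence duplicated
# n times gets weight 1/n, so each duplicate cluster contributes about 1 in total.
weights = [1.0 / counts[line.split('\t', 1)[0].lower()] for line in lines]

# Sample an "extra epoch" proportionally to uniqueness, instead of repeating
# the whole (duplicate-heavy) dataset verbatim.
for line in random.choices(lines, weights=weights, k=len(lines)):
    print(line)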
I've been using this script to deduplicate. It essentially does bifixer's deduplication, but now that deduplication is not part of the trainer (the more I think about it, the more I agree with Nick that that was a bad idea) I'd like to integrate it someplace else in empty-train. And I'd love to experiment a bit with what good deduplication would be.
#!/usr/bin/env python3
import sys
import subprocess
from xxhash import xxh64
from unidecode import unidecode
from unicodedata import category as cat
from multiprocessing import Pool, cpu_count
from itertools import count, cycle

REMOVE_NON_ALPHA = str.maketrans('', '', ''.join(
    chr(i)
    for i in range(sys.maxunicode)
    if not cat(chr(i)).startswith('L')
))

def read_file(filename):
    # Lazy me not wanting to run `paste` first
    if ':' in filename:
        readers = [read_gzip(name) for name in filename.split(':')]
        for row in zip(*readers):
            yield '\t'.join(field.rstrip('\n') for field in row) + '\n'
    else:
        yield from read_gzip(filename)

def read_gzip(filename):
    child = subprocess.Popen(['pigz', '-cd', filename], stdout=subprocess.PIPE, encoding='utf-8', bufsize=2**16)
    yield from child.stdout
    child.wait()

def line_hash_rank(entry):
    lineno, line = entry
    src, trg = line.rstrip('\n').split('\t', maxsplit=1)
    normalized_src = unidecode(src.lower().translate(REMOVE_NON_ALPHA))
    normalized_trg = unidecode(trg.lower().translate(REMOVE_NON_ALPHA))
    # Deduplicating only on the src side since same src -> different translations
    # is detrimental to model, but different src -> same trg is fine.
    segment_hash = xxh64(normalized_src).hexdigest()
    #segment_hash = xxh64(normalized_src + "\t" + normalized_trg).hexdigest()
    charsum = sum(ord(ch) for ch in src + trg)
    ranking = round(charsum / (float(len(src + trg)) + 0.00001), 2)  # the 0.00001 is for the case that length is 0 :D
    return lineno, segment_hash, ranking

def rank_lines_in_file(filename):
    table = {}
    with Pool(8) as pool:
        for lineno, segment_hash, ranking in pool.imap_unordered(line_hash_rank, enumerate(read_file(filename)), chunksize=2**16):
            if segment_hash not in table or table[segment_hash][0] < ranking:
                table[segment_hash] = (ranking, lineno)
    return table

def rank_lines(files):
    table = {}
    for fileno, filename in enumerate(files):
        for segment_hash, (ranking, lineno) in rank_lines_in_file(filename).items():
            if segment_hash not in table or table[segment_hash][0] < ranking:
                table[segment_hash] = (ranking, fileno, lineno)
    return table

def print_lines(files, table):
    lines_per_file = [
        [lineno for _, fileno, lineno in table.values() if fileno == n]
        for n in range(len(files))
    ]
    for lines in lines_per_file:
        lines.sort()
    for fileno, filename in enumerate(files):
        lineno_it = iter(lines_per_file[fileno])
        next_line = next(lineno_it, None)
        lineno = -1
        if next_line is not None:
            for lineno, line in enumerate(read_file(filename)):
                if lineno == next_line:
                    yield line
                    next_line = next(lineno_it, None)
                    assert next_line is None or next_line > lineno
            assert next_line is None
        used_lines = len(lines_per_file[fileno])
        total_lines = lineno + 1
        percentage = used_lines / total_lines * 100 if total_lines > 0 else 0
        print(f"{filename}: used {used_lines} of {total_lines}: {percentage:0.3f}%", file=sys.stderr)

table = rank_lines(sys.argv[1:])
sys.stdout.writelines(print_lines(sys.argv[1:], table))
The ranking here is just the bifixer rank? Shouldn't we use whatever scoring method we used for filtering (LASER, Bicleaner AI)?
I didn't clean all of the datasets I used with LASER or Bicleaner AI, so I didn't have the scores for all of them. Ideally we would use those scores if we have them, but that would also force you to run all of your data through the same scoring method at the end. We could do something like that, maybe even re-use the scores that you used for thresholding for the datasets where you used bicleaner-ai or LASER as a filtering method. Maybe just by using the same caching methods I use for …
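As a sketch of what re-using an existing score column could look like (the src/trg/score column layout and this script are assumptions for illustration, not the caching machinery mentioned above): dedupe on a normalized source hash, but rank duplicates by the score carried in the third column instead of the bifixer-style character ranking.

#!/usr/bin/env python3
import sys
from unicodedata import category as cat
from unidecode import unidecode
from xxhash import xxh64

REMOVE_NON_ALPHA = str.maketrans('', '', ''.join(
    chr(i) for i in range(sys.maxunicode) if not cat(chr(i)).startswith('L')
))

best = {}  # normalized src hash -> (score, original line)
for line in sys.stdin:
    # Assumes at least three tab-separated columns: src, trg, score.
    src, trg, score = line.rstrip('\n').split('\t')[:3]
    key = xxh64(unidecode(src.lower().translate(REMOVE_NON_ALPHA))).hexdigest()
    if key not in best or best[key][0] < float(score):
        best[key] = (float(score), line)

sys.stdout.writelines(line for _, line in best.values())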
Posting @ZJaume's deduplicators here for posterity's sake:
#!/usr/bin/env python
from argparse import ArgumentParser, FileType
from unicodedata import category as cat
from unidecode import unidecode
from xxhash import xxh64
import sys

parser = ArgumentParser()
parser.add_argument('-a', '--aggressive', action='store_true', default=False)
args = parser.parse_args()

# Translate table to remove non alphabetic characters
tbl = [chr(i) for i in range(sys.maxunicode) if not cat(chr(i)).startswith('L')]
remove_non_alpha = str.maketrans('', '', ''.join(tbl))

def main():
    shashes, thashes = set(), set()
    for line in sys.stdin:
        sline = line.rstrip('\n')
        parts = sline.split('\t')
        src = parts[0]
        trg = parts[1]
        if args.aggressive:
            src = unidecode(src.lower().translate(remove_non_alpha))
            trg = unidecode(trg.lower().translate(remove_non_alpha))
        src_hash = xxh64(src).hexdigest()
        trg_hash = xxh64(trg).hexdigest()
        sys.stdout.write(f"{sline}\t{src_hash}\t{trg_hash}\n")

if __name__ == "__main__":
    main()
#!/usr/bin/env python
import sys

def main():
    shashes, thashes = set(), set()
    for line in sys.stdin:
        parts = line.rstrip("\n").split('\t')
        src_hash = parts[2]
        trg_hash = parts[3]
        if src_hash not in shashes and trg_hash not in thashes:
            sys.stdout.write(line)
            shashes.add(src_hash)
            thashes.add(trg_hash)

if __name__ == "__main__":
    main()

Usage:
Many corpora out there are full of duplicates. Here are some ideas and approaches I have been using.
The method that we used at ParaCrawl was to deduplicate by Bifixer hash and score. The aggressive dedup option produces a hash for each sentence pair. To compute it, we concatenate src and trg, remove all the punctuation and spaces, and unidecode the sentences. So we keep only one version of sentences that differ only in punctuation, accents or things like that. The Bifixer score gives, if I remember correctly, a higher score to sentences containing characters with higher Unicode values, so we end up keeping, for each hash, the sentence pair that is richest in punctuation, accents, diacritics etc., which is probably the best among the duplicates.
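For illustration, the aggressive pair hash described above could be computed roughly like this (a sketch following the description, not Bifixer's actual implementation):

import sys
from unicodedata import category as cat
from unidecode import unidecode
from xxhash import xxh64

# Keep only letters: everything whose Unicode category is not "L*" is stripped.
ONLY_LETTERS = str.maketrans('', '', ''.join(
    chr(i) for i in range(sys.maxunicode) if not cat(chr(i)).startswith('L')
))

def aggressive_pair_hash(src: str, trg: str) -> str:
    # Concatenate src and trg, lowercase, drop punctuation/spaces, strip accents, then hash.
    normalized = unidecode((src + trg).lower().translate(ONLY_LETTERS))
    return xxh64(normalized).hexdigest()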
Recently we discovered that an even more aggressive way of deduplication helps MT. Some experiments by a colleague in MaCoCu showed that it seems to be better to deduplicate with two hashes separately, instead of computing one for the pair. I think because it removes many translation inconsistencies (the same sentence aligned with many different ones).
COMET scores on the FLORES test set, translating from English with an NMT model trained on MaCoCu corpora:
What I have right now in my pipeline is a script to compute the hash by column and another script that throws away each sentence pair that has one hash or the other already seen (not the best possible approach, but it works).
I think the ideal solution would be one like this (dedup on both sides) but keeping, for each hash, the pair that has the better Bicleaner, LASER or whatever scoring method we use. So we end up removing inconsistencies while keeping the pair that is most likely to be good.
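A sketch of that idea (assuming a src/trg/score TSV where higher scores are better; it processes pairs best-score-first, so output order follows the score rather than the input):

import sys
from xxhash import xxh64

rows = []
for line in sys.stdin:
    # Assumes at least three tab-separated columns: src, trg, score.
    src, trg, score = line.rstrip('\n').split('\t')[:3]
    rows.append((float(score), xxh64(src).hexdigest(), xxh64(trg).hexdigest(), line))

# Greedily keep a pair only if neither its source hash nor its target hash
# has been used by a better-scored pair already.
seen_src, seen_trg = set(), set()
for score, src_hash, trg_hash, line in sorted(rows, key=lambda r: r[0], reverse=True):
    if src_hash not in seen_src and trg_hash not in seen_trg:
        seen_src.add(src_hash)
        seen_trg.add(trg_hash)
        sys.stdout.write(line)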
Another thing that I've been thinking of is how to remove the very repetitive stuff that does not get thrown away by this approach. Web corpora are full of this:
Maybe some kind of bigram- or trigram-based removal: if one sentence has all of its bigrams or trigrams repeated more than 20 times, throw it out. I'm pretty confident that removing this kind of repeated sentence won't harm quality and will speed up training.
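A rough two-pass sketch of that filter (whitespace tokenization, source-side trigrams and the 20x threshold are the assumptions here):

import sys
from collections import Counter

THRESHOLD = 20

def trigrams(text):
    tokens = text.split()
    return [tuple(tokens[i:i+3]) for i in range(len(tokens) - 2)]

# First pass: count all source-side trigrams in the corpus.
lines = list(sys.stdin)
counts = Counter()
for line in lines:
    counts.update(trigrams(line.split('\t', 1)[0]))

# Second pass: drop a pair only if every one of its trigrams is very frequent.
for line in lines:
    grams = trigrams(line.split('\t', 1)[0])
    # Keep short sentences (no trigrams) and anything with at least one
    # reasonably rare trigram.
    if not grams or any(counts[g] <= THRESHOLD for g in grams):
        sys.stdout.write(line)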
Just brainstorming here to not forget about it.