
Corpora deduplication #41

Closed
ZJaume opened this issue Oct 7, 2022 · 6 comments · Fixed by #71
Labels
component:execution (Related to run.py, and running filters on the full dataset), enhancement (New feature or request)

Comments

@ZJaume
Collaborator

ZJaume commented Oct 7, 2022

Many corpora out there are full of duplicates. Here are some ideas and approaches I have been using.

The method we used at ParaCrawl was to deduplicate by Bifixer hash and score. The aggressive dedup option produces a hash for each sentence pair: to compute it, we concatenate src and trg, remove all punctuation and spaces, and unidecode the sentences. So we keep only one version of sentences that differ only in punctuation, accents or things like that. The Bifixer score gives, if I remember correctly, a higher score to sentences containing characters with higher Unicode values, so for each hash we end up keeping the sentence pair that is richer in punctuation, accents, diacritics etc., which is probably the best one among the duplicates.
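
A minimal sketch of that normalization and ranking, assuming xxhash and unidecode (the exact Bifixer implementation may differ slightly; the script posted further down in this thread does essentially the same):

import sys
from unicodedata import category
from unidecode import unidecode
from xxhash import xxh64

# Delete everything that is not a letter (punctuation, digits, spaces, symbols).
REMOVE_NON_ALPHA = str.maketrans('', '', ''.join(
    chr(i) for i in range(sys.maxunicode) if not category(chr(i)).startswith('L')
))

def pair_hash(src: str, trg: str) -> str:
    # One hash per pair: concatenate, lowercase, strip non-letters, strip accents.
    return xxh64(unidecode((src + trg).lower().translate(REMOVE_NON_ALPHA))).hexdigest()

def pair_rank(src: str, trg: str) -> float:
    # Higher average code point ~ richer in accents and diacritics; keep the highest-ranked pair per hash.
    text = src + trg
    return sum(ord(ch) for ch in text) / (len(text) + 1e-5)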

Recently we discovered that an even more aggressive way of deduplicating helps MT. Experiments by a colleague in MaCoCu showed that it seems to be better to deduplicate with two separate hashes, one per side, instead of one hash for the pair. I think that's because it removes many translation inconsistencies (the same sentence aligned with many different ones).

COMET scores on the Flores test set, translating from English with an NMT model trained on MaCoCu corpora:
[image: COMET score comparison]

What I have right now in my pipeline is a script that computes a hash per column and another script that throws away each sentence pair whose src or trg hash has already been seen (not the best possible approach, but it works).
I think the ideal solution would be one like this (dedup on both sides) but keeping, for each hash, the pair with the better Bicleaner, LASER or whatever scoring method we use, so we end up removing inconsistencies while keeping the pair that is most likely to be good.
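
A rough sketch of that idea, assuming tab-separated src/trg/score input where the score column already comes from Bicleaner, LASER or similar; the column layout and the in-memory approach are assumptions, not the final design:

#!/usr/bin/env python3
# Sketch: dedup on src and trg hashes separately, keeping for each hash the
# highest-scored pair. Assumes "src<TAB>trg<TAB>score" lines on stdin.
import sys
from xxhash import xxh64

rows = []
for line in sys.stdin:
    src, trg, score = line.rstrip('\n').split('\t')[:3]
    rows.append((src, trg, float(score), xxh64(src).hexdigest(), xxh64(trg).hexdigest()))

# For every src hash and every trg hash, remember which pair scored highest.
best = {}  # hash -> (score, row index)
for i, (_, _, score, shash, thash) in enumerate(rows):
    for h in (shash, thash):
        if h not in best or best[h][0] < score:
            best[h] = (score, i)

# Keep a pair only if it is the top candidate for both of its hashes, so no src or
# trg segment survives twice (some hashes may lose their only candidate this way).
for i, (src, trg, score, shash, thash) in enumerate(rows):
    if best[shash][1] == i and best[thash][1] == i:
        sys.stdout.write(f"{src}\t{trg}\t{score}\n")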

Another thing that I've been thinking of is how to remove the very repetitive stuff that does not get thrown away by this approach. Web corpora are full of this:

File extension.XMR  Extensao de arquivo.XMR
File extension.LLV  Extensao de arquivo.LLV
File extension.RFC  Extensao de arquivo.RFC
Apartments in San José  Apartamentos em San José
Apartments in Paxi  Apartamentos em Paxi

Maybe some kind of bigram- or trigram-based removal: if all of a sentence's bigrams or trigrams are repeated more than 20 times, throw it out. I'm pretty confident that removing this kind of repeated sentence won't harm quality and will speed up training.
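
A hedged sketch of what that could look like; the trigram choice and the threshold of 20 come from the idea above, everything else (counting each trigram once per sentence, filtering on the source side only) is an assumption:

#!/usr/bin/env python3
# Sketch: drop a sentence pair if every trigram of its source side occurs more than
# 20 times across the corpus. Two passes, with lines and counts kept in memory.
import sys
from collections import Counter

THRESHOLD = 20

def trigrams(text):
    tokens = text.split()
    # Fall back to the whole token sequence for very short sentences.
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)] or [tuple(tokens)]

lines = sys.stdin.readlines()

counts = Counter()
for line in lines:
    counts.update(set(trigrams(line.split('\t', 1)[0])))  # count each trigram once per sentence

for line in lines:
    if not all(counts[t] > THRESHOLD for t in trigrams(line.split('\t', 1)[0])):
        sys.stdout.write(line)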

Just brainstorming here to not forget about it.

@XapaJIaMnu
Collaborator

I completely agree with you that we should deduplicate more aggressively (hashes on both sides). We did that for our French system and it seemed to help training. Unfortunately, it seems absolutely necessary to select the best matching translation out of all the duplicates, because if we don't, this is what happens:
browsermt/students#67

@jelmervdl
Collaborator

jelmervdl commented Oct 7, 2022

> Maybe some kind of bigram or trigram based removal, like if one sentence has all of its bigrams or trigrams repeated more than 20 times, throw it out.

That sounds pretty similar to TF/IDF?

My intuition would be that some repetition with small variations isn't bad, but that we amplify it too much because we already repeat the dataset during training, so these repetitions get repeated again. What if you gave every sentence a "uniqueness score" and used that to select which sentences to repeat during training? You'd keep the variation that's in the training data, but also get rid of the dominance of the duplicates. (Not that I'm proposing this as an alternative, more thinking out loud!)
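
Purely to make that thought concrete (this is not an existing feature and everything here is made up): a per-sentence "uniqueness" weight could be something like the rarity of its trigrams, which a trainer could then use as a sampling weight when deciding how often to repeat each sentence:

# Illustrative only: weight each sentence by how rare its trigrams are, so a sampler
# could repeat unique sentences more often than near-duplicates.
import sys
from collections import Counter

def trigrams(text):
    tokens = text.split()
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)] or [tuple(tokens)]

lines = sys.stdin.readlines()
counts = Counter(t for line in lines for t in set(trigrams(line.split('\t', 1)[0])))

for line in lines:
    grams = trigrams(line.split('\t', 1)[0])
    uniqueness = sum(1.0 / counts[t] for t in grams) / len(grams)  # 1.0 = completely unique
    sys.stdout.write(f"{uniqueness:.4f}\t{line}")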

@jelmervdl mentioned this issue on Dec 5, 2022
@jelmervdl
Collaborator

I've been using this script to deduplicate. It essentially does Bifixer's deduplication, but now that deduplication is no longer part of the trainer (the more I think about it, the more I agree with Nick that that was a bad idea), I'd like to integrate it somewhere else in empty-train.

And I'd love to experiment a bit with what good deduplication would be.

#!/usr/bin/env python3
import sys
import subprocess
from xxhash import xxh64
from unidecode import unidecode
from unicodedata import category as cat
from multiprocessing import Pool

REMOVE_NON_ALPHA = str.maketrans('', '', ''.join(
	chr(i)
	for i in range(sys.maxunicode)
	if not cat(chr(i)).startswith('L')
))


def read_file(filename):
	# Lazy me not wanting to run `paste` first
	if ':' in filename:
		readers = [read_gzip(name) for name in filename.split(':')]
		for row in zip(*readers):
			yield '\t'.join(field.rstrip('\n') for field in row) + '\n'
	else:
		yield from read_gzip(filename)


def read_gzip(filename):
	child = subprocess.Popen(['pigz', '-cd', filename], stdout=subprocess.PIPE, encoding='utf-8', bufsize=2**16)
	yield from child.stdout
	child.wait()


def line_hash_rank(entry):
	lineno, line = entry

	src, trg = line.rstrip('\n').split('\t', maxsplit=1)

	normalized_src = unidecode(src.lower().translate(REMOVE_NON_ALPHA))
	normalized_trg = unidecode(trg.lower().translate(REMOVE_NON_ALPHA))

	# Deduplicating only on the src side since same src -> different translations
	# is detrimental to model, but different src -> same trg is fine.
	segment_hash = xxh64(normalized_src).hexdigest()
	#segment_hash = xxh64(normalized_src + "\t" + normalized_trg).hexdigest()

	charsum = sum(ord(ch) for ch in src + trg)
	ranking = round(charsum / (float(len(src + trg))+0.00001), 2) #the  0.00001 is for the case that length is 0 :D

	return lineno, segment_hash, ranking


def rank_lines_in_file(filename):
	table = {}

	with Pool(8) as pool:
		for lineno, segment_hash, ranking in pool.imap_unordered(line_hash_rank, enumerate(read_file(filename)), chunksize=2**16):
			if segment_hash not in table or table[segment_hash][0] < ranking:
				table[segment_hash] = (ranking, lineno)

	return table


def rank_lines(files):
	table = {}

	for fileno, filename in enumerate(files):
		for segment_hash, (ranking, lineno) in rank_lines_in_file(filename).items():
			if segment_hash not in table or table[segment_hash][0] < ranking:
				table[segment_hash] = (ranking, fileno, lineno)

	return table


def print_lines(files, table):
	lines_per_file = [
		[lineno for _, fileno, lineno in table.values() if fileno == n]
		for n in range(len(files))
	]

	for lines in lines_per_file:
		lines.sort()

	for fileno, filename in enumerate(files):
		lineno_it = iter(lines_per_file[fileno])
		next_line = next(lineno_it, None)

		lineno = -1
		if next_line is not None:
			for lineno, line in enumerate(read_file(filename)):
				if lineno == next_line:
					yield line
					next_line = next(lineno_it, None)
					assert next_line is None or next_line > lineno
		assert next_line is None

		used_lines = len(lines_per_file[fileno])
		total_lines = lineno + 1
		percentage = used_lines / total_lines * 100 if total_lines > 0 else 0
		print(f"{filename}: used {used_lines} of {total_lines}: {percentage:0.3f}%", file=sys.stderr)


table = rank_lines(sys.argv[1:])

sys.stdout.writelines(print_lines(sys.argv[1:], table))
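
Usage, roughly (each argument is one gzip-compressed dataset, with ':' joining the two sides into tab-separated pairs, and pigz needs to be installed; the script name and filenames here are hypothetical): ./dedup.py corpus.en.gz:corpus.fr.gz more.en.gz:more.fr.gz > deduped.tsv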

@jelmervdl added the enhancement and component:execution labels on Dec 12, 2022
@ZJaume
Collaborator Author

ZJaume commented Dec 13, 2022

Is the ranking here just the Bifixer rank? Should we use whatever scoring method we used for filtering (LASER, Bicleaner AI)?

@jelmervdl
Collaborator

I didn't clean all of the datasets I used with LASER or Bicleaner AI, so I didn't have the scores for all of it.

Ideally we would use those scores when we have them, but that would also force you to run all of your data through the same scoring method at the end. We could do something like that, maybe even re-use the scores you used for thresholding on the datasets where bicleaner-ai or LASER was the filtering method.

Maybe just use the same caching method I use for threshold.py (it stores (hash((src, trg)), score) tuples), so that if you change your dataset slightly because you applied a new filter, you don't have to rescore all sentence pairs, just the ones that were added or changed.
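
A sketch of that caching idea, assuming a simple JSON file that maps hash((src, trg)) to a score; the file format and function names are hypothetical, not threshold.py's actual code:

# Hypothetical sketch: re-use cached (hash((src, trg)), score) entries so only new or
# changed sentence pairs need re-scoring; not the actual threshold.py cache code.
import json
from xxhash import xxh64

def pair_key(src: str, trg: str) -> str:
    return xxh64(f"{src}\t{trg}").hexdigest()

def load_cache(path):
    try:
        with open(path) as fh:
            return json.load(fh)  # {pair_key: score}
    except FileNotFoundError:
        return {}

def score_with_cache(pairs, cache, scorer):
    # `scorer` stands in for bicleaner-ai or LASER: list of (src, trg) -> list of scores.
    missing = [(s, t) for s, t in pairs if pair_key(s, t) not in cache]
    for (s, t), score in zip(missing, scorer(missing)):
        cache[pair_key(s, t)] = score
    return [cache[pair_key(s, t)] for s, t in pairs]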

@jelmervdl mentioned this issue on Jan 10, 2023
@XapaJIaMnu
Collaborator

XapaJIaMnu commented Feb 21, 2023

Posting @ZJaume's deduplicators here for posterity's sake:
hash-seg.py

#!/usr/bin/env python
from argparse import ArgumentParser
from unicodedata import category as cat
from unidecode import unidecode
from xxhash import xxh64
import sys

parser = ArgumentParser()
parser.add_argument('-a', '--aggressive', action='store_true', default=False)
args = parser.parse_args()

# Translate table to remove non alphabetic characters
tbl = [chr(i) for i in range(sys.maxunicode) if not cat(chr(i)).startswith('L')]
remove_non_alpha = str.maketrans('', '', ''.join(tbl))

def main():
    # Append src and trg hashes to each line as extra tab-separated columns.
    for line in sys.stdin:
        sline = line.rstrip('\n')
        parts = sline.split('\t')
        src = parts[0]
        trg = parts[1]

        if args.aggressive:
            src = unidecode(src.lower().translate(remove_non_alpha))
            trg = unidecode(trg.lower().translate(remove_non_alpha))

        src_hash = xxh64(src).hexdigest()
        trg_hash = xxh64(trg).hexdigest()

        sys.stdout.write(f"{sline}\t{src_hash}\t{trg_hash}\n")


if __name__ == "__main__":
    main()

superdedup.py

#!/usr/bin/env python
import sys

def main():
    shashes, thashes = set(), set()
    for line in sys.stdin:
        parts = line.rstrip("\n").split('\t')

        src_hash = parts[2]
        trg_hash = parts[3]

        if src_hash not in shashes and trg_hash not in thashes:
            sys.stdout.write(line)
        shashes.add(src_hash)
        thashes.add(trg_hash)


if __name__ == "__main__":
    main()

Usage: cat parallel-data | hash-seg.py -a | superdedup.py
With -a, this should also remove near-duplicates that differ only in casing, numbers, punctuation or accents. At some point we should have a bicleaner-ai-score-based solution for deciding which pair to keep, but scoring all your datasets with bicleaner would be incredibly costly.
