Deduplication helpers for lexicon modification jobs #458
base: main
Changes from all commits: 5c4e2a6, dee0609, ba03170, 6f75911, db4e426, 73ff8e6
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -130,16 +130,22 @@ class MergeLexiconJob(Job): | |
will create a new lexicon that might be incompatible to previously generated alignments. | ||
""" | ||
|
||
def __init__(self, bliss_lexica, sort_phonemes=False, sort_lemmata=False, compressed=True): | ||
__sis_hash_exclude__ = {"deduplicate_lemmata": False} | ||
|
||
def __init__( | ||
self, bliss_lexica, sort_phonemes=False, sort_lemmata=False, compressed=True, deduplicate_lemmata=False | ||
): | ||
""" | ||
:param list[Path] bliss_lexica: list of bliss lexicon files (plain or gz) | ||
:param bool sort_phonemes: sort phoneme inventory alphabetically | ||
:param bool sort_lemmata: sort lemmata alphabetically based on first orth entry | ||
:param bool compressed: compress final lexicon | ||
:param bool deduplicate_lemmata: whether to deduplicate lemmata, only applied when sort_lemmata=True | ||
""" | ||
self.lexica = bliss_lexica | ||
self.sort_phonemes = sort_phonemes | ||
self.sort_lemmata = sort_lemmata | ||
self.deduplicate_lemmata = deduplicate_lemmata | ||
|
||
self.out_bliss_lexicon = self.output_path("lexicon.xml.gz" if compressed else "lexicon.xml") | ||
|
||
|
@@ -178,6 +184,10 @@ def run(self): | |
for lemma in lex.lemmata: | ||
# sort by first orth entry | ||
orth_key = lemma.orth[0] if lemma.orth else "" | ||
if self.deduplicate_lemmata: | ||
# don't add the lemma when there's already an equal lemma | ||
if len(lemma_dict[orth_key]) > 0 and lemma == lemma_dict[orth_key][0]: | ||
I realize, this

Yes, this is the use case I was referring to; I might not have made this clear. Right now I'm even more inclined to remove the In general, thinking about use cases as you said, I couldn't think of any actual use case for
||
continue | ||
lemma_dict[orth_key].append(lemma) | ||
merged_lex.lemmata = list(itertools.chain(*[lemma_dict[key] for key in sorted(lemma_dict.keys())])) | ||
else: | ||
|
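The deduplication step added to `run` above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the job's implementation: `SimpleLemma` and `merge_sorted` are hypothetical names that do not appear in the PR, and the stand-in class compares its fields in order via the dataclass-generated `__eq__`. As in the diff, a new lemma is only compared against the first lemma already stored under its orth key.

```python
import itertools
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleLemma:
    """Hypothetical stand-in for a lemma; the generated __eq__ compares fields in order."""

    orth: List[str] = field(default_factory=list)
    phon: List[str] = field(default_factory=list)


def merge_sorted(lemmata, deduplicate=False):
    """Group lemmata by first orth entry and return them sorted by that key."""
    lemma_dict = defaultdict(list)
    for lemma in lemmata:
        orth_key = lemma.orth[0] if lemma.orth else ""
        # Skip the lemma only if it equals the FIRST lemma stored under this key.
        if deduplicate and lemma_dict[orth_key] and lemma == lemma_dict[orth_key][0]:
            continue
        lemma_dict[orth_key].append(lemma)
    return list(itertools.chain(*[lemma_dict[key] for key in sorted(lemma_dict)]))


lemmata = [
    SimpleLemma(["b"], ["B"]),
    SimpleLemma(["a"], ["A"]),
    SimpleLemma(["b"], ["B"]),  # exact duplicate of the first entry
]
kept_all = merge_sorted(lemmata)
deduplicated = merge_sorted(lemmata, deduplicate=True)
print(len(kept_all), len(deduplicated))  # prints: 3 2
```

Because only the first entry per key is checked, a duplicate of a second or later lemma under the same orth key would still be appended in this sketch.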
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,10 +3,13 @@ | |
|
||
For format details visit: `https://www-i6.informatik.rwth-aachen.de/rwth-asr/manual/index.php/Lexicon`_ | ||
""" | ||
from __future__ import annotations | ||
|
||
__all__ = ["Lemma", "Lexicon"] | ||
|
||
from collections import OrderedDict | ||
from typing import Optional, List | ||
import itertools | ||
from typing import Optional, List, Set | ||
import xml.etree.ElementTree as ET | ||
|
||
from i6_core.util import uopen | ||
|
@@ -104,6 +107,42 @@ def from_element(cls, e): | |
synt = None if not synt else synt[0] | ||
return Lemma(orth, phon, synt, eval, special) | ||
|
||
def _equals(self, other: Lemma, *, same_order: bool = True) -> bool: | ||
""" | ||
Check for lemma equality. | ||
|
||
:param other: Other lemma to compare :param:`self` to. | ||
:param same_order: Whether the order in the different lemma elements matters or not. | ||
:return: Whether :param:`self` and :param:`other` are equal or not. | ||
""" | ||
if same_order: | ||
return ( | ||
self.orth == other.orth | ||
and self.phon == other.phon | ||
and self.special == other.special | ||
and self.synt == other.synt | ||
and self.eval == other.eval | ||
) | ||
else: | ||
if self.synt is not None and other.synt is not None: | ||
equal_synt = set(self.synt) == set(other.synt) | ||
else: | ||
equal_synt = self.synt == other.synt | ||
|
||
return ( | ||
set(self.orth) == set(other.orth) | ||
and set(self.phon) == set(other.phon) | ||
and self.special == other.special | ||
and equal_synt | ||
and set(itertools.chain(*self.eval)) == set(itertools.chain(*other.eval)) | ||
) | ||
|
||
def __eq__(self, other: Lemma) -> bool: | ||
return self._equals(other, same_order=False) | ||
Is there a reason you want It probably depends on the use case of

This should be discussed, but in my view a lemma is equal to another even if it has different orth order, or different pronunciation order. This is why it's set to Edit: the use case is basically the one specified in I think not enforcing a strict ordering is the best way of comparing two lemmas, but I would understand that some users might want strict ordering comparison, which is why I added the

I think it depends on the use case. Or if you say, only What actually is your use case?

My use case is the following: in However, I wanted to give the user the freedom to decide whether the order in their orths and prons matters or not. Maybe some users might want to store the orths and prons of a lexicon in lexicographical order, and thus the order would matter. Maybe some other users already have a specific ordering, and therefore the addition of a lemma with a different order helps them notice that something's wrong with their pipeline. Summarizing: If you think it won't be used and only

So, if you think there is no clear non-ambiguous definition of an equal-relation, thus having such a flag makes sense, then I'm not sure if defining Now

I agree, What do others think? Is it overkill to have At every comment I'm more inclined to remove

In general, we should not add code/logic for cases which are only hypothetical and not used currently, but only for things we really are using currently. Thus only the logic But let's wait for what others think about this.

I would consider how it is treated in RASR:
So for phon, synt, and eval it is (in my opinion and as far as my understanding goes) clear what to do. For orth, which is probably mostly what you are interested in, it is not immediately obvious if the order should be considered.
||
|
||
def __ne__(self, other: Lemma) -> bool: | ||
return not self.__eq__(other) | ||
|
||
|
||
class Lexicon: | ||
""" | ||
|
If you think there are possibly cases where `same_order=True` makes sense, maybe this should not just be a bool but sth like `deduplicate_special_lemmata_type` or `deduplicate_special_lemmata_opts`?
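The two comparison modes debated in the review of `_equals` can be illustrated with a small self-contained sketch. `MiniLemma` is a hypothetical, reduced class with only `orth` and `phon` (the PR's `Lemma` also compares `synt`, `eval` and `special`), and the phoneme strings are made-up example data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # no generated __eq__; the comparison mode is chosen explicitly
class MiniLemma:
    """Hypothetical reduced lemma with only orth and phon lists."""

    orth: List[str] = field(default_factory=list)
    phon: List[str] = field(default_factory=list)

    def equals(self, other: "MiniLemma", *, same_order: bool = True) -> bool:
        if same_order:
            # Strict mode: list comparison, so element order matters.
            return self.orth == other.orth and self.phon == other.phon
        # Relaxed mode: set comparison ignores order (and repeated entries).
        return set(self.orth) == set(other.orth) and set(self.phon) == set(other.phon)


a = MiniLemma(orth=["colour", "color"], phon=["k V l @"])
b = MiniLemma(orth=["color", "colour"], phon=["k V l @"])

strict = a.equals(b, same_order=True)    # False: the orth lists differ in order
relaxed = a.equals(b, same_order=False)  # True: same orth and phon sets
```

One side effect of comparing via `set` is that repeated entries collapse, so `["a", "a"]` and `["a"]` count as equal in the relaxed mode; whether that is acceptable depends on the use case raised in the discussion above.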