Alignment passthrough #26
Merged
Commits (38)
4ea23bb  wip (XapaJIaMnu)
a144ae2  Initial implementation of noise augmenters (XapaJIaMnu)
be4ebe5  Simplify code a bit (jelmervdl)
7dcc2a4  Fix tests (jelmervdl)
7a1189f  Fix possible bug in test (jelmervdl)
6da73ff  Add specific tests for the three modes (jelmervdl)
55b443c  Add alignment info to the simple tests (jelmervdl)
42b850e  Make placeholder modifier produce (corrected) alignment pairs (jelmervdl)
831d564  Make sure `get_placeholding_candidates` returns the original instances (jelmervdl)
2e3d29f  update tests (jelmervdl)
38e54de  Merge branch 'main' into alignment-passthrough (jelmervdl)
723760e  Attempt to improve the alignment fix-up (jelmervdl)
17733d4  Fix unit tests (jelmervdl)
5c768e4  Implement retokenize modifier (jelmervdl)
e0adec6  Merge remote-tracking branch 'origin/main' into alignment-passthrough (jelmervdl)
294a18d  Let PlaceholderModifier use Retokenizer implementation for now (jelmervdl)
d8b1b10  Add unittest for spm retokenize in placeholders (jelmervdl)
704bd65  Add test to confirm that even when no placeholder is added, retokeniz… (jelmervdl)
38a3cae  Efficiency: don't bother calculating candidates if prob = 0. (jelmervdl)
0c4868f  Add tests covering spaces tokenizer (jelmervdl)
aab72a4  Document the `spm_vocab` option of the `Tags` modifier (jelmervdl)
973906a  Be nicer about issues with the alignment info (jelmervdl)
6b4abe0  Explain the `StopIteration` bit (jelmervdl)
c200c9c  Remove unreachable else (jelmervdl)
126587d  Remove debug code (jelmervdl)
106d832  Document and rename methods (jelmervdl)
b9ad9f6  Skip trainer backtrace test for now (jelmervdl)
b822e8c  Only print alignment info when spm_vocab is passed in (jelmervdl)
ef3c780  Make `retokenize` a little less `O(n^2)` (jelmervdl)
7069872  Replace placeholder-specific end-to-end tests with specific test for … (jelmervdl)
6b62198  Use `Path` in type signature of modifiers to resolve relative paths (jelmervdl)
7a80d2c  Rewrite end-to-end tests (jelmervdl)
9603208  Rewrite DatasetReader to not always produce n+1 lines (jelmervdl)
a2248ad  Add option for batch size (jelmervdl)
2f72e76  Add some comments to the tests (jelmervdl)
4779dd6  Fix missing sentencepiece dependency (jelmervdl)
a17af46  Fix other pyproject.toml entries while we're at it (jelmervdl)
2479f09  Make trainer skip lines that can't be processed by modifier (jelmervdl)
(A binary file added in this PR is not shown.)
New file: the `opustrainer.alignments` module (25 added lines), which parses and formats alignment-pair strings and optionally validates the indices against the token lists:
```python
from typing import Optional, List
from opustrainer.types import Pair, TokenList


def parse_alignments(pairs:str, src_tokens:Optional[TokenList]=None, trg_tokens:Optional[TokenList]=None) -> List[Pair]:
    pairs = [
        Pair(int(a), int(b)) for a, b in
        (pair.split('-', maxsplit=1) for pair in pairs.split())
    ]

    if src_tokens is not None and trg_tokens is not None:
        invalid_pairs = [
            pair
            for pair in pairs
            if pair.src < 0 or pair.src >= len(src_tokens)
            or pair.trg < 0 or pair.trg >= len(trg_tokens)
        ]
        if invalid_pairs:
            raise ValueError('Out-of-bound alignment pairs: ' + ' '.join(map(repr, invalid_pairs)))

    return pairs


def format_alignments(pairs:List[Pair]) -> str:
    return ' '.join(f'{pair.src}-{pair.trg}' for pair in pairs)
```
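A quick usage sketch of the two helpers above (illustrative values; the import path follows from how `opustrainer.alignments` is referenced elsewhere in this PR):

```python
from opustrainer.alignments import parse_alignments, format_alignments

src = 'the striped cat'.split()       # 3 source tokens
trg = 'die gestreifte Katze'.split()  # 3 target tokens

pairs = parse_alignments('0-0 1-1 2-2', src, trg)  # [Pair(0, 0), Pair(1, 1), Pair(2, 2)]
print(format_alignments(pairs))                    # '0-0 1-1 2-2'

# With token lists supplied, out-of-bound pairs raise:
# parse_alignments('0-0 5-1', src, trg)  -> ValueError: Out-of-bound alignment pairs: ...
```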
Change to the modifiers package: the `Modifier` base class moves to `opustrainer.types`, and the package now just re-exports it (16 lines replaced by 1):
```diff
-from abc import ABC, abstractmethod
-from typing import Dict, Any, List
-
-
-class Modifier(ABC):
-    probability: float
-
-    def __init__(self, probability:float, **kwargs:Dict[str,Any]):
-        self.probability = probability
-
-    def validate(self, context:List['Modifier']) -> None:
-        pass
-
-    @abstractmethod
-    def __call__(self, line: str) -> str:
-        pass
+from opustrainer.types import Modifier
```
(One larger diff in this PR is not rendered by default and is omitted here.)

New file: the retokenize modifier (76 added lines). It detokenizes and re-tokenizes both sides of a sentence pair and remaps the alignment pairs onto the new tokenization:
```python
from typing import List, Protocol, Dict, NamedTuple, TypeVar, Callable, Union, Tuple, Optional, Any
from itertools import count

from opustrainer.types import Pair, TokenList, TokenMapping, Tokenizer, Detokenizer
from opustrainer.alignments import parse_alignments, format_alignments
from opustrainer.tokenizers import make_tokenizer, make_detokenizer
from opustrainer.modifiers import Modifier
from opustrainer import logger


def overlaps(r1:slice, r2:slice) -> bool:
    """True if slice 1 (partially or fully) overlaps with slice 2."""
    # (a,b), (x,y) = r1, r2
    #       [a       b]      | a < y | x < b
    # [x y]             = F  |   F   |   T
    #     [x y]         = T  |   T   |   T
    #         [x y]     = T  |   T   |   T
    #             [x y] = T  |   T   |   T
    #                [x y] = F |  T  |   F
    return r1.start < r2.stop and r2.start < r1.stop


class Retokenizer(NamedTuple):
    detokenizer: Detokenizer
    tokenizer: Tokenizer

    def retokenize(self, tokens:TokenList) -> Tuple[str,TokenList,TokenMapping]:
        detokenized, old_token_spans = self.detokenizer.detokenize(tokens)
        new_tokens, new_token_spans = self.tokenizer.tokenize(detokenized)

        old_to_new_mapping = [[] for _ in range(len(old_token_spans))]

        # TODO: This can be done much more efficiently
        for i, old_token_span in enumerate(old_token_spans):
            for j, new_token_span in enumerate(new_token_spans):
                if overlaps(old_token_span, new_token_span):
                    old_to_new_mapping[i].append(j)

        return detokenized, new_tokens, old_to_new_mapping


def make_retokenizer(spec:Dict[str,str]) -> Retokenizer:
    return Retokenizer(
        detokenizer=make_detokenizer(spec.get('detokenize', 'spaces')),
        tokenizer=make_tokenizer(spec.get('tokenize', 'spaces'))
    )


def compute_mapping(src_mapping:TokenMapping, trg_mapping:TokenMapping, alignments:List[Pair]) -> List[Pair]:
    remapped = set()
    for old_src_idx, old_trg_idx in alignments:
        for src_idx in src_mapping[old_src_idx]:
            for trg_idx in trg_mapping[old_trg_idx]:
                remapped.add(Pair(src_idx, trg_idx))
    return sorted(remapped)


class RetokenizeModifier(Modifier):
    src: Retokenizer
    trg: Retokenizer

    def __init__(self, probability: float=0.0, src:dict=dict(), trg:dict=dict()):
        super().__init__(probability)  # probability is very much ignored lol.
        self.src = make_retokenizer(src)
        self.trg = make_retokenizer(trg)

    def __call__(self, line:str) -> str:
        src, trg, alignments = line.split('\t')
        src_tokens = src.split()
        trg_tokens = trg.split()
        pairs = parse_alignments(alignments, src_tokens, trg_tokens)
        new_src, new_src_tokens, src_mapping = self.src.retokenize(src_tokens)
        new_trg, new_trg_tokens, trg_mapping = self.trg.retokenize(trg_tokens)
        remapped_pairs = compute_mapping(src_mapping, trg_mapping, pairs)
        return '\t'.join((new_src, new_trg, format_alignments(remapped_pairs)))
```
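A small worked sketch of the remapping step, with hand-constructed token mappings to show what `compute_mapping` does (the module path in the import is assumed, not stated in this diff):

```python
from opustrainer.types import Pair
from opustrainer.modifiers.retokenize import compute_mapping  # module path assumed

# Suppose retokenization split source token 1 into new tokens 1 and 2,
# and merged target tokens 0 and 1 into new token 0:
src_mapping = [[0], [1, 2]]    # old src token index -> new src token indices
trg_mapping = [[0], [0], [1]]  # old trg token index -> new trg token indices

alignments = [Pair(0, 0), Pair(1, 1), Pair(1, 2)]
print(compute_mapping(src_mapping, trg_mapping, alignments))
# [Pair(src=0, trg=0), Pair(src=1, trg=0), Pair(src=1, trg=1), Pair(src=2, trg=0), Pair(src=2, trg=1)]
```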
New file: the `opustrainer.tokenizers` module (118 added lines), with space, Moses and SentencePiece tokenizer/detokenizer implementations that all report character spans:
```python
import re
from typing import Tuple, List, TypeVar, Callable, Union, Dict, Optional

import sacremoses
from sentencepiece import SentencePieceProcessor

from opustrainer.types import TokenList, Tokenizer, Detokenizer


DETOKENIZERS = {
    'moses': lambda lang: MosesDetokenizer(lang),
    'spaces': lambda: SpaceDetokenizer(),
}

TOKENIZERS = {
    'moses': lambda lang: MosesTokenizer(lang),
    'spm': lambda vocab: SentencePieceTokenizer(vocab),
    'spaces': lambda: SpaceTokenizer(),
}


class SpaceTokenizer:
    def tokenize(self, text:str) -> Tuple[TokenList, List[slice]]:
        tokens: TokenList = []
        spans: List[slice] = []
        for match in re.finditer(r'[^\s]+', text):
            tokens.append(match.group(0))
            spans.append(slice(match.start(0), match.end(0)))
        return tokens, spans


class SpaceDetokenizer:
    def detokenize(self, tokens:TokenList) -> Tuple[str,List[slice]]:
        spans = []
        offset = 0
        for token in tokens:
            spans.append(slice(offset, offset + len(token)))
            offset += len(token) + 1  # space
        return ' '.join(tokens), spans


class MosesTokenizer:
    tokenizer: sacremoses.MosesTokenizer

    def __init__(self, lang:str, custom_nonbreaking_prefixes:Optional[str]=None):
        self.tokenizer = sacremoses.MosesTokenizer(lang, custom_nonbreaking_prefixes)

    def tokenize(self, text:str) -> Tuple[TokenList, List[slice]]:
        tokens = self.tokenizer.tokenize(text, escape=False)
        spans: List[slice] = []
        offset = 0
        for token in tokens:
            offset = text.find(token, offset)
            if offset == -1:
                raise RuntimeError(f"Could not find token '{token}' in original text")
            spans.append(slice(offset, offset + len(token)))
            offset += len(token)
        return tokens, spans


class MosesDetokenizer:
    detokenizer: sacremoses.MosesDetokenizer

    def __init__(self, lang:str):
        self.detokenizer = sacremoses.MosesDetokenizer(lang)

    def detokenize(self, tokens:TokenList) -> Tuple[str,List[slice]]:
        text = self.detokenizer.detokenize(tokens)
        spans = []
        offset = 0
        for token in tokens:
            offset = text.find(token, offset)
            if offset == -1:
                raise RuntimeError(f"Could not find token '{token}' in detokenized text")
            spans.append(slice(offset, offset + len(token)))
            offset += len(token)
        return text, spans


class SentencePieceTokenizer:
    spm: SentencePieceProcessor

    def __init__(self, vocab:str):
        self.spm = SentencePieceProcessor(vocab)

    def tokenize(self, text:str) -> Tuple[TokenList,List[slice]]:
        # Interestingly, piece.begin and piece.end are unicode offsets, not byte
        # offsets as the documentation would suggest. When byte-fallback happens,
        # there will be pieces where piece.begin and piece.end are the same value
        # but they are technically necessary to encode the following pieces.
        # e.g.:
        # > x.encode('🤣', out_type='immutable_proto').pieces
        # { piece: "▁" id: 275 surface: "" begin: 0 end: 0 }
        # { piece: "<0xF0>" id: 247 surface: "" begin: 0 end: 0 }
        # { piece: "<0x9F>" id: 166 surface: "" begin: 0 end: 0 }
        # { piece: "<0xA4>" id: 171 surface: "" begin: 0 end: 0 }
        # { piece: "<0xA3>" id: 170 surface: "🤣" begin: 0 end: 1 }
        # > x.decode([247,166,171,170])
        # '🤣'
        spans = [
            slice(piece.begin, piece.end)
            for piece in self.spm.encode(text.encode(), out_type='immutable_proto').pieces
        ]
        tokens = [text[span] for span in spans]
        return tokens, spans


T = TypeVar('T', bound=Union[Tokenizer,Detokenizer])

def _make(implementations: Dict[str,Callable[...,T]], spec:str) -> T:
    name, *args = spec.split(':')
    return implementations[name](*args)

def make_detokenizer(spec:str) -> Detokenizer:
    return _make(DETOKENIZERS, spec)

def make_tokenizer(spec:str) -> Tokenizer:
    return _make(TOKENIZERS, spec)
```
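A short usage sketch of the `'name:arg'` spec convention implied by `_make` (the SentencePiece vocabulary path is a hypothetical placeholder):

```python
from opustrainer.tokenizers import make_tokenizer, make_detokenizer

tok = make_tokenizer('spaces')        # SpaceTokenizer()
detok = make_detokenizer('moses:en')  # MosesDetokenizer('en'); args follow the ':'
# spm = make_tokenizer('spm:vocab.ende.spm')  # hypothetical vocab path

tokens, spans = tok.tokenize('Hello world')
print(tokens)                              # ['Hello', 'world']
print([(s.start, s.stop) for s in spans])  # [(0, 5), (6, 11)]

text, spans = detok.detokenize(['Hello', ',', 'world', '!'])
print(text)                                # 'Hello, world!'
```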
New file: the `opustrainer.types` module (56 added lines), with the shared type aliases, protocols and the `Modifier` base class:
```python
from abc import ABC, abstractmethod
from typing import NamedTuple, Dict, List, Tuple, Optional, Any, Protocol


TokenList = List[str]  # todo: bytes?

TokenMapping = List[List[int]]


class Tokenizer(Protocol):
    """Turns a string into a list of tokens"""
    def tokenize(self, text:str) -> Tuple[TokenList,List[slice]]:
        ...


class Detokenizer(Protocol):
    """Turns a list of tokens into a string"""
    def detokenize(self, tokens:TokenList) -> Tuple[str, List[slice]]:
        ...


class Pair(NamedTuple):
    """Alignment pair between source and target token indices"""
    src:int
    trg:int


class SentencePair(NamedTuple):
    """Semantic representation of a single line from a data source."""
    src: TokenList
    trg: TokenList

    # alignments is an empty list if alignment data is available in the dataset
    # but there are no aligned tokens in this pair. It is None if this dataset
    # does not have alignment info.
    alignments: Optional[List[Pair]]


class Modifier(ABC):
    """Line modifier"""
    probability: float

    def __init__(self, probability:float, **kwargs:Dict[str,Any]):
        self.probability = probability

    def validate(self, context:List['Modifier']) -> None:
        """Opportunity for the modifier to see where in the modifier list it is
        placed and flag any issues to the logger. E.g. if you place a modifier that
        inserts special tokens before an UpperCase modifier, the latter might
        modify those special tokens as well. Here you can shout about that.
        """
        pass

    @abstractmethod
    def __call__(self, line: str) -> str:
        pass
```
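As an illustration of the `Modifier` interface above (not code from this PR), a minimal sketch of a custom modifier; the class name and behaviour are invented for the example:

```python
from opustrainer.types import Modifier

class ShoutModifier(Modifier):
    """Illustrative only: upper-cases the source and target columns of a line."""
    def __call__(self, line: str) -> str:
        fields = line.split('\t')
        fields[:2] = [field.upper() for field in fields[:2]]  # leave alignment column alone
        return '\t'.join(fields)

mod = ShoutModifier(probability=0.1)  # probability is stored by the base class
print(mod('hello world\thallo welt\t0-0 1-1'))
# 'HELLO WORLD\tHALLO WELT\t0-0 1-1'
```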
Review conversation (on raising ValueError for out-of-bound alignment pairs):

- We don't want to raise during training; let's have all of those be a `warn_once`. I had a few training runs fail because a one-off error in the alignments kills a whole training run. We should have a defined failure mode for when alignments are not accurate. (And for some reason, fast align did sometimes produce invalid alignments...)

- Hmm, I agree. What should we do with lines that have out-of-bound pairs? I think we should just throw out all alignment pairs for that sentence, because there's no guarantee that the rest of them are not off by one or something. That would automatically also exclude them from the Tags modifier, and their alignment info wouldn't be passed on to Marian, so Marian would not be trained on broken data (or worse, bork because of an index error).

- I would throw out the lines, if we can do it cleanly. If alignment wasn't produced, the sentence is likely bad.

- Okay, I can do that, but it has to happen at the modifier-is-applied-to-batch level in trainer.py, because that's the only place we can remove lines at the moment. I'll change that. There's a discussion somewhere with Graeme about changing modifiers to return a list of lines, which could then add and remove lines, but that's probably too much to also add to this pull request.

- Or maybe merge as is and start another PR aimed at sanitizing the input.
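For context, the final commit ("Make trainer skip lines that can't be processed by modifier") resolves this at the batch level. A minimal sketch of that idea, assuming a hypothetical `apply_modifiers` helper and stdlib logging in place of OpusTrainer's own warn-once style logger; the actual trainer.py implementation may differ:

```python
import logging
from typing import List

from opustrainer.types import Modifier

def apply_modifiers(modifiers: List[Modifier], batch: List[str]) -> List[str]:
    """Apply each modifier to every line in the batch, dropping lines that raise
    (e.g. ValueError for out-of-bound alignment pairs) instead of letting a single
    bad line abort the whole training run."""
    for modifier in modifiers:
        survivors = []
        for line in batch:
            try:
                survivors.append(modifier(line))
            except Exception as exc:
                logging.warning('Skipping line that a modifier could not process: %r', exc)
        batch = survivors
    return batch
```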