Alignment passthrough #26

Merged
merged 38 commits into main from alignment-passthrough on Aug 14, 2023
Changes from 25 commits
Commits
38 commits
4ea23bb
wip
XapaJIaMnu Apr 3, 2023
a144ae2
Initial implementation of noise augmenters
XapaJIaMnu Apr 3, 2023
be4ebe5
Simplify code a bit
jelmervdl Apr 4, 2023
7dcc2a4
Fix tests
jelmervdl Apr 4, 2023
7a1189f
Fix possible bug in test
jelmervdl Apr 4, 2023
6da73ff
Add specific tests for the three modes
jelmervdl Apr 4, 2023
55b443c
Add alignment info to the simple tests
jelmervdl Apr 4, 2023
42b850e
Make placeholder modifier produce (corrected) alignment pairs
jelmervdl Apr 4, 2023
831d564
Make sure `get_placeholding_candidates` returns the original instances
jelmervdl Apr 5, 2023
2e3d29f
update tests
jelmervdl Apr 5, 2023
38e54de
Merge branch 'main' into alignment-passthrough
jelmervdl Jul 6, 2023
723760e
Attempt to improve the alignment fix-up
jelmervdl Jul 6, 2023
17733d4
Fix unit tests
jelmervdl Jul 7, 2023
5c768e4
Implement retokenize modifier
jelmervdl Jul 24, 2023
e0adec6
Merge remote-tracking branch 'origin/main' into alignment-passthrough
jelmervdl Jul 25, 2023
294a18d
Let PlaceholderModifier use Retokenizer implementation for now
jelmervdl Jul 27, 2023
d8b1b10
Add unittest for spm retokenize in placeholders
jelmervdl Jul 27, 2023
704bd65
Add test to confirm that even when no placeholder is added, retokeniz…
jelmervdl Jul 27, 2023
38a3cae
Efficiency: don't bother calculating candidates if prob = 0.
jelmervdl Jul 27, 2023
0c4868f
Add tests covering spaces tokenizer
jelmervdl Jul 27, 2023
aab72a4
Document the `spm_vocab` option of the `Tags` modifier
jelmervdl Jul 27, 2023
973906a
Be nicer about issues with the alignment info
jelmervdl Jul 28, 2023
6b4abe0
Explain the `StopIteration` bit
jelmervdl Jul 28, 2023
c200c9c
Remove unreachable else
jelmervdl Jul 28, 2023
126587d
Remove debug code
jelmervdl Jul 28, 2023
106d832
Document and rename methods
jelmervdl Jul 28, 2023
b9ad9f6
Skip trainer backtrace test for now
jelmervdl Jul 28, 2023
b822e8c
Only print alignment info when spm_vocab is passed in
jelmervdl Aug 7, 2023
ef3c780
Make `retokenize` a little less `O(n^2)`
jelmervdl Aug 7, 2023
7069872
Replace placeholder-specific end-to-end tests with specific test for …
jelmervdl Aug 7, 2023
6b62198
Use `Path` in type signature of modifiers to resolve relative paths
jelmervdl Aug 9, 2023
7a80d2c
Rewrite end-to-end tests
jelmervdl Aug 9, 2023
9603208
Rewrite DatasetReader to not always produce n+1 lines
jelmervdl Aug 9, 2023
a2248ad
Add option for batch size
jelmervdl Aug 9, 2023
2f72e76
Add some comments to the tests
jelmervdl Aug 9, 2023
4779dd6
Fix missing sentencepiece dependency
jelmervdl Aug 9, 2023
a17af46
Fix other pyproject.toml entries while we're at it
jelmervdl Aug 9, 2023
2479f09
Make trainer skip lines that can't be processed by modifier
jelmervdl Aug 14, 2023
9 changes: 8 additions & 1 deletion README.md
@@ -165,21 +165,28 @@ modifiers:
#### Tags
Adds a placeholder tag to the source sentence that can be used by the model to hint how it should translate that word. The word to hint is chosen at random from the target sentence. Only words with a 1-to-1 mapping between source and target are considered.

This modifier needs a third column in the training data with per-word alignment information.
This modifier needs a third column in the training data with per-word (technically: space separated token) alignment information.
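For illustration, a training line with this alignment column might look like the following (made-up sentence pair and alignments; the columns are tab-separated):

```
This is a test .	Dies ist ein Test .	0-0 1-1 2-2 3-3 4-4
```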

```yaml
- Tags: 0.05
custom_detok_src: null
custom_detok_trg: zh
spm_vocab: path/to/vocab.enzh.spm
template: "__source__ {src} __target__ {trg} __done__"
```

All options are optional.

You can specify custom detokenizer languages using `custom_detok_src` and `custom_detok_trg` if the dataset you're reading from has been tokenized by the Moses tokenizer. This can be helpful for languages that do not use spaces to delimit words. The default tokenisation strategy is splitting/joining by spaces.

The `spm_vocab` option can be used to recompute the alignment info to match the tokenisation from the sentencepiece vocabulary. This is mostly useful for Marian, which takes untokenised input but expects the alignment info to match the sentencepiece tokenisation it performs.

The format for telling the translation model the intention to translate a word in a certain way can be controlled by `template`. Here `{src}` and `{trg}` are replaced by the selected words from the source and target side of the sentence pair.

**Note**: Due to how most modifiers are implemented, they will have a normalising effect on spaces. Sequences of spaces will be collapsed into a single space. This is also true for the *Tags* modifier.

**Note**: Even if the probability of the *Tags* modifier is set to 0, it will still apply detokenisation (and, if `spm_vocab` is given, re-computation of the alignment) to every sentence pair, regardless of whether that pair was picked to be modified.

#### Prefix
Prepends a random subsection of the target sentence before the source sentence.

Binary file added contrib/test-data/vocab.zhen.spm
Binary file not shown.
25 changes: 25 additions & 0 deletions src/opustrainer/alignments.py
@@ -0,0 +1,25 @@
from typing import Optional, List
from opustrainer.types import Pair, TokenList


def parse_alignments(pairs:str, src_tokens:Optional[TokenList]=None, trg_tokens:Optional[TokenList]=None) -> List[Pair]:
pairs = [
Pair(int(a), int(b)) for a, b in
(pair.split('-', maxsplit=1) for pair in pairs.split())
]

if src_tokens is not None and trg_tokens is not None:
invalid_pairs = [
pair
for pair in pairs
if pair.src < 0 or pair.src >= len(src_tokens)
or pair.trg < 0 or pair.trg >= len(trg_tokens)
]
if invalid_pairs:
raise ValueError('Out-of-bound alignment pairs: ' + ' '.join(map(repr, invalid_pairs)))
Contributor:

We don't want to raise during training; let's have all of those be a warn_once. I had a few training runs fail because a one-off error in the alignments will kill a whole training run. We should have a fallback for when the alignments are not accurate. (And for some reason, sometimes, fast align did produce invalid alignments...)

Contributor Author:

Hmm, I agree. What should we do with lines that have out-of-bound pairs? I think in that case we should just throw out all alignment pairs for that sentence, because there's no guarantee that the rest of them are not off-by-one or something.

That would automatically also exclude them from the Tags modifier, and their alignment info wouldn't be passed on to Marian, so Marian would not be trained with broken data (or worse, bork because of an index error).

Contributor:

I would throw out the lines, if we can do it cleanly. If alignment wasn't produced, this likely means the sentence is bad.

Contributor Author:

Okay, euhm. I can do that, but it has to happen at the modifier-is-applied-to-batch level in trainer.py because that's the only place we can remove lines at the moment. I'll change that.

There's a discussion somewhere with Graeme about changing modifiers to return a list of lines, which could then just add and remove lines. But that's probably too much to also add into this pull request.

Contributor:

or maybe then merge as is and start another PR aiming at sanitizing input.


return pairs


def format_alignments(pairs:List[Pair]) -> str:
return ' '.join(f'{pair.src}-{pair.trg}' for pair in pairs)
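For illustration, a minimal usage sketch of these two helpers (the token lists and pairs are made up):

```python
from opustrainer.alignments import parse_alignments, format_alignments

src_tokens = ['Hello', ',', 'world']
trg_tokens = ['Hallo', ',', 'Welt']

# With both token lists given, any out-of-bound pair raises a ValueError.
pairs = parse_alignments('0-0 1-1 2-2', src_tokens, trg_tokens)

# Round-trips back into the "src-trg src-trg ..." string format.
assert format_alignments(pairs) == '0-0 1-1 2-2'
```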
17 changes: 1 addition & 16 deletions src/opustrainer/modifiers/__init__.py
@@ -1,16 +1 @@
from abc import ABC, abstractmethod
from typing import Dict, Any, List


class Modifier(ABC):
probability: float

def __init__(self, probability:float, **kwargs:Dict[str,Any]):
self.probability = probability

def validate(self, context:List['Modifier']) -> None:
pass

@abstractmethod
def __call__(self, line: str) -> str:
pass
from opustrainer.types import Modifier
225 changes: 138 additions & 87 deletions src/opustrainer/modifiers/placeholders.py

Large diffs are not rendered by default.

76 changes: 76 additions & 0 deletions src/opustrainer/modifiers/retokenize.py
@@ -0,0 +1,76 @@
from typing import List, Protocol, Dict, NamedTuple, TypeVar, Callable, Union, Tuple, Optional, Any
from itertools import count

from opustrainer.types import Pair, TokenList, TokenMapping, Tokenizer, Detokenizer
from opustrainer.alignments import parse_alignments, format_alignments
from opustrainer.tokenizers import make_tokenizer, make_detokenizer
from opustrainer.modifiers import Modifier
from opustrainer import logger


def overlaps(r1:slice, r2:slice) -> bool:
"""True if slice 1 (partially or fully) overlaps with slice 2."""
# (a,b), (x,y) = r1, r2
# [a b] | a < y | x < b
# [x y] = F | F | T
# [x y] = T | T | T
# [x y] = T | T | T
# [x y] = T | T | T
# [x y] = F | T | F
return r1.start < r2.stop and r2.start < r1.stop


class Retokenizer(NamedTuple):
detokenizer: Detokenizer
tokenizer: Tokenizer

def retokenize(self, tokens:TokenList) -> Tuple[str,TokenList,TokenMapping]:
detokenized, old_token_spans = self.detokenizer.detokenize(tokens)
new_tokens, new_token_spans = self.tokenizer.tokenize(detokenized)

old_to_new_mapping = [[] for _ in range(len(old_token_spans))]

#TODO: This can be done much more efficiently
for i, old_token_span in enumerate(old_token_spans):
for j, new_token_span in enumerate(new_token_spans):
if overlaps(old_token_span, new_token_span):
old_to_new_mapping[i].append(j)

return detokenized, new_tokens, old_to_new_mapping


def make_retokenizer(spec:Dict[str,str]) -> Retokenizer:
return Retokenizer(
detokenizer=make_detokenizer(spec.get('detokenize', 'spaces')),
tokenizer=make_tokenizer(spec.get('tokenize', 'spaces'))
)


def compute_mapping(src_mapping:TokenMapping, trg_mapping:TokenMapping, alignments:List[Pair]) -> List[Pair]:
remapped = set()
for old_src_idx, old_trg_idx in alignments:
for src_idx in src_mapping[old_src_idx]:
for trg_idx in trg_mapping[old_trg_idx]:
remapped.add(Pair(src_idx, trg_idx))
return sorted(remapped)


class RetokenizeModifier(Modifier):
src: Retokenizer
trg: Retokenizer

def __init__(self, probability: float=0.0, src:dict=dict(), trg:dict=dict()):
super().__init__(probability) # probability is very much ignored lol.
self.src = make_retokenizer(src)
self.trg = make_retokenizer(trg)

def __call__(self, line:str) -> str:
src, trg, alignments = line.split('\t')
src_tokens = src.split()
trg_tokens = trg.split()
pairs = parse_alignments(alignments, src_tokens, trg_tokens)
new_src, new_src_tokens, src_mapping = self.src.retokenize(src_tokens)
new_trg, new_trg_tokens, trg_mapping = self.trg.retokenize(trg_tokens)
remapped_pairs = compute_mapping(src_mapping, trg_mapping, pairs)
return '\t'.join((new_src, new_trg, format_alignments(remapped_pairs)))
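A hedged sketch of using the modifier on its own, assuming Moses-tokenised English-Chinese input and a placeholder vocab path (neither appears in this diff):

```python
from opustrainer.modifiers.retokenize import RetokenizeModifier

# Detokenise the Moses tokens back to plain text, then re-tokenise with the
# sentencepiece vocab so the alignment indices match what Marian will see.
retok = RetokenizeModifier(
    src={'detokenize': 'moses:en', 'tokenize': 'spm:path/to/vocab.zhen.spm'},
    trg={'detokenize': 'moses:zh', 'tokenize': 'spm:path/to/vocab.zhen.spm'},
)

line = 'Hello world\t你好 世界\t0-0 1-1'
# Returns "<detokenised src>\t<detokenised trg>\t<alignments remapped to spm pieces>".
new_line = retok(line)
```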

118 changes: 118 additions & 0 deletions src/opustrainer/tokenizers.py
@@ -0,0 +1,118 @@
import re
from typing import Tuple, List, TypeVar, Callable, Union, Dict, Optional

import sacremoses
from sentencepiece import SentencePieceProcessor

from opustrainer.types import TokenList, Tokenizer, Detokenizer


DETOKENIZERS = {
'moses': lambda lang: MosesDetokenizer(lang),
'spaces': lambda: SpaceDetokenizer(),
}

TOKENIZERS = {
'moses': lambda lang: MosesTokenizer(lang),
'spm': lambda vocab: SentencePieceTokenizer(vocab),
'spaces': lambda: SpaceTokenizer(),
}


class SpaceTokenizer:
def tokenize(self, text:str) -> Tuple[TokenList, List[slice]]:
tokens: TokenList = []
spans: List[slice] = []
for match in re.finditer(r'[^\s]+', text):
tokens.append(match.group(0))
spans.append(slice(match.start(0), match.end(0)))
return tokens, spans


class SpaceDetokenizer:
def detokenize(self, tokens:TokenList) -> Tuple[str,List[slice]]:
spans = []
offset = 0
for token in tokens:
spans.append(slice(offset, offset + len(token)))
offset += len(token) + 1 # space
return ' '.join(tokens), spans


class MosesTokenizer:
tokenizer: sacremoses.MosesTokenizer

def __init__(self, lang:str, custom_nonbreaking_prefixes:Optional[str]=None):
self.tokenizer = sacremoses.MosesTokenizer(lang, custom_nonbreaking_prefixes)

def tokenize(self, text:str) -> Tuple[TokenList, List[slice]]:
tokens = self.tokenizer.tokenize(text, escape=False)
spans: List[slice] = []
offset = 0
for token in tokens:
offset = text.find(token, offset)
if offset == -1:
raise RuntimeError(f"Could not find token '{token}' in original text")
spans.append(slice(offset, offset + len(token)))
offset += len(token)
return tokens, spans


class MosesDetokenizer:
detokenizer: sacremoses.MosesDetokenizer

def __init__(self, lang:str):
self.detokenizer = sacremoses.MosesDetokenizer(lang)

def detokenize(self, tokens:TokenList) -> Tuple[str,List[slice]]:
text = self.detokenizer.detokenize(tokens)
spans = []
offset = 0
for token in tokens:
offset = text.find(token, offset)
if offset == -1:
raise RuntimeError(f"Could not find token '{token}' in detokenized text")
spans.append(slice(offset, offset + len(token)))
offset += len(token)
return text, spans


class SentencePieceTokenizer:
spm: SentencePieceProcessor

def __init__(self, vocab:str):
self.spm = SentencePieceProcessor(vocab)

def tokenize(self, text:str) -> Tuple[TokenList,List[slice]]:
# interestingly, piece.begin and piece.end are unicode offsets, not byte
# offsets as the documentation would suggest. When byte-fallback happens,
# there will be pieces where piece.begin and piece.end are the same value
# but they are technically necessary to encode the following pieces.
# e.g:
# > x.encode('🤣', out_type='immutable_proto').pieces
# { piece: "▁" id: 275 surface: "" begin: 0 end: 0 }
# { piece: "<0xF0>" id: 247 surface: "" begin: 0 end: 0 }
# { piece: "<0x9F>" id: 166 surface: "" begin: 0 end: 0 }
# { piece: "<0xA4>" id: 171 surface: "" begin: 0 end: 0 }
# { piece: "<0xA3>" id: 170 surface: "🤣" begin: 0 end: 1 }
# > x.decode([247,166,171,170])
# '🤣'
spans = [
slice(piece.begin, piece.end)
for piece in self.spm.encode(text.encode(), out_type='immutable_proto').pieces
]
tokens = [text[span] for span in spans]
return tokens, spans


T = TypeVar('T', bound=Union[Tokenizer,Detokenizer])

def _make(implementations: Dict[str,Callable[...,T]], spec:str) -> T:
name, *args = spec.split(':')
return implementations[name](*args)

def make_detokenizer(spec:str) -> Detokenizer:
return _make(DETOKENIZERS, spec)

def make_tokenizer(spec:str) -> Tokenizer:
return _make(TOKENIZERS, spec)
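For illustration, the spec strings accepted by these factories are a name optionally followed by `:argument` (the vocab path below is a placeholder):

```python
from opustrainer.tokenizers import make_tokenizer, make_detokenizer

tok = make_tokenizer('moses:en')     # MosesTokenizer('en')
detok = make_detokenizer('spaces')   # SpaceDetokenizer()
spm = make_tokenizer('spm:path/to/vocab.zhen.spm')  # SentencePieceTokenizer(...)

# Both interfaces return tokens/text plus character spans into the text;
# retokenize.py uses the spans to map old token indices onto new ones.
tokens, spans = tok.tokenize('Hello, world!')
text, spans = detok.detokenize(tokens)
```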
14 changes: 7 additions & 7 deletions src/opustrainer/trainer.py
@@ -26,6 +26,7 @@
from opustrainer.modifiers.surface import UpperCaseModifier, TitleCaseModifier
from opustrainer.modifiers.placeholders import PlaceholderTagModifier
from opustrainer.modifiers.typos import TypoModifier
from opustrainer.modifiers.retokenize import RetokenizeModifier
from opustrainer import logger

def ignore_sigint():
@@ -36,18 +37,15 @@ def ignore_sigint():
signal.signal(signal.SIGINT, signal.SIG_IGN)


# Path to something that can shuffle data. Called with seed, output-path, input-files
# TODO: Ideally this also deduplicates the src side of the sentence pairs it shuffles ;)
PATH_TO_SHUFFLE = os.path.dirname(os.path.realpath(__file__)) + "/shuffle.py"

# Available batch modifiers
# TODO: Import these lazy, on demand?
MODIFIERS = {
'UpperCase': UpperCaseModifier,
'TitleCase': TitleCaseModifier,
'Tags': PlaceholderTagModifier,
'Typos': TypoModifier,
'Prefix': PrefixModifier
'Prefix': PrefixModifier,
'Retokenize': RetokenizeModifier,
}
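For context, a hedged sketch of how the newly registered `Retokenize` modifier might be configured in a training YAML, assuming its extra keys are passed through as keyword arguments in the same way as for `Tags` (this diff does not document it in the README):

```yaml
modifiers:
  - Retokenize: 0.0
    src:
      detokenize: moses:en
      tokenize: spm:path/to/vocab.zhen.spm
    trg:
      detokenize: moses:zh
      tokenize: spm:path/to/vocab.zhen.spm
```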

@dataclass(frozen=True)
@@ -171,7 +169,8 @@ def _open(self):
# feasible to just write to a named pipe (or even stdout) instead of
# a temporary file, and let the trainer read directly from that. Not
# sure if that has any performance or stability benefits/drawbacks.
subprocess.check_call([sys.executable, PATH_TO_SHUFFLE,
subprocess.check_call([sys.executable,
'-m', 'opustrainer.shuffle',
*(['--temporary-directory', self.tmpdir] if self.tmpdir else []),
*([] if self.shuffle else ['--no-shuffle']),
str(self.seed),
@@ -247,7 +246,8 @@ def _open_async(self, seed:int):
self._pending = ShuffledFile(
seed=seed,
file=cast(TextIO, fh),
proc=subprocess.Popen([sys.executable, PATH_TO_SHUFFLE,
proc=subprocess.Popen([sys.executable,
'-m', 'opustrainer.shuffle',
*(['--temporary-directory', self.tmpdir] if self.tmpdir else []),
*([] if self.shuffle else ['--no-shuffle']),
str(seed),
56 changes: 56 additions & 0 deletions src/opustrainer/types.py
@@ -0,0 +1,56 @@
from abc import ABC, abstractmethod
from typing import NamedTuple, Dict, List, Tuple, Optional, Any, Protocol


TokenList = List[str] # todo: bytes?

TokenMapping = List[List[int]]


class Tokenizer(Protocol):
"""Turns a string into a list of tokens"""
def tokenize(self, text:str) -> Tuple[TokenList,List[slice]]:
...


class Detokenizer(Protocol):
"""Turns a list of tokens into a string"""
def detokenize(self, tokens:TokenList) -> Tuple[str, List[slice]]:
...


class Pair(NamedTuple):
"""Alignment pair between source and target token indices"""
src:int
trg:int


class SentencePair(NamedTuple):
"""Semantic representation of a single line from a data source."""
src: TokenList
trg: TokenList

# alignments is an empty list if alignment data is available in the dataset
# but there are no aligned tokens in this pair. It is None if this dataset
# does not have alignment info.
alignments: Optional[List[Pair]]


class Modifier(ABC):
"""Line modifier"""
probability: float

def __init__(self, probability:float, **kwargs:Dict[str,Any]):
self.probability = probability

def validate(self, context:List['Modifier']) -> None:
"""Opportunity for the modifier to see where in the modifier list it is
placed and flag any issues to the logger. E.g. if you place a modifier that
inserts special tokens before an UpperCase modifier, the latter might
modify those special tokens as well. Here you can shout about that.
"""
pass

@abstractmethod
def __call__(self, line: str) -> str:
pass
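
For illustration, a minimal sketch of a modifier built on this ABC (not part of the PR; whether the roll against `probability` happens in the trainer or inside the modifier is not shown in this excerpt):

```python
from opustrainer.types import Modifier

class LowerCaseModifier(Modifier):
    """Hypothetical modifier that lowercases the whole line."""

    def __call__(self, line: str) -> str:
        # Transform unconditionally; self.probability is stored by the base
        # class for whichever component decides when to apply the modifier.
        return line.lower()

mod = LowerCaseModifier(0.05)
assert mod('HELLO\tWORLD') == 'hello\tworld'
```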