-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alignment passthrough #26
Conversation
Also fixes the order. Does change the tests though.
# Conflicts: # contrib/test-data/clean.enzh.ref.06.4.none # contrib/test-data/clean.enzh.ref.06.4.trg # contrib/test-data/clean.zhen.ref.06.4.04.04.src # contrib/test-data/clean.zhen.ref.06.4.src # contrib/test_enzh_config.yml # contrib/test_enzh_noise_config.yml # src/opustrainer/modifiers/placeholders.py # tests/test_placeholders.py
Not end-to-end tests yet
Should be mostly compatible, except that the end-to-end tests are failing? Did I change the order of random calls?
out = tokenizer('\t'.join([ | ||
'This is a simple test statement 🤣 .', | ||
#^0 ^1 ^2 ^3 ^4 ^5 ^6 ^7 | ||
'这 是 一个 简单 的 测试 语 句 🤣 。', | ||
#^0 ^1 ^2 ^3 ^4 ^5 ^6 ^7 ^8 ^9 | ||
'0-0 1-1 2-2 3-3 3-4 4-5 5-6 5-7 6-8 7-9', | ||
])) | ||
self.assertEqual(out, '\t'.join([ | ||
'This is a simple test statement 🤣.', | ||
#[This][ is][ a][ simple][ test][ statement][ ][] [] [] [🤣][.] | ||
#^0 ^1 ^2 ^3 ^4 ^5 ^6 ^7 ^8 ^9 ^10 ^11 | ||
'这是一个简单的测试语句 🤣 。', | ||
#[这][是][一][个][简][单][的][测][试][语][句] [ ] [] [] [] [🤣][ 。] | ||
#^0 ^1 ^2 ^3 ^4 ^5 ^6 ^7 ^8 ^9 ^10 ^11 ^12 ^13 ^14 ^15 ^16 | ||
'0-0 1-1 2-2 2-3 3-4 3-5 3-6 4-7 4-8 5-9 5-10 10-15 11-16', | ||
# 0-0 [This] [这] 0-0 | ||
# 1-1 [is] [是] 1-1 | ||
# 2-2 [a] [一个] 2-2 2-3 | ||
# 3-3 [simple] [简单] 3-4 3-5 | ||
# 3-4 [simple] [的] 3-6 | ||
# 4-5 [test] [测试] 4-7 4-8 | ||
# 5-6 [statement] [语] 5-9 | ||
# 5-7 [statement] [句] 5-10 (6-11) | ||
# 6-8 [🤣] [🤣] (7-12 8-13 9-14) 10-15 | ||
# 7-9 [.] [。] 11-16 | ||
])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably the best example of what this code does: Input is Moses tokens for both sides of the sentence pair.
- Both are detokenized using their own language-specific Moses detokenizers.
- That detokenized sentence is then returned.
- However, it also uses sentencepiece with the specified vocab to retokenize that detokenized sentence.
- It then computes how the original moses tokens and the new sentencepiece tokens overlap, and adjusts the alignment pairs.
In the assertEqual
comments above, the [..]
bits are SPM tokens. The table at the bottom is a mapping between old alignment pairs and their tokens to new alignment pairs that make up the line (34) above it.
Interesting finding while doing this: the byte-fallback necessary to encode the emoji cannot be expressed with a surface form or a slice since SPM operates on unicode in its Python interface. So tokens 7, 8 and 9 are zero-length. I chose to not add alignment info for those, hoping that only the final token of that byte-fallback sequence will be sufficient for the model to learn that the emoji faces align. Because I'm lazy. But in theory it should be feasible to detect this and add alignment pairs for all the parts of the byte-fallback sequence.
For other 1->N mappings (e.g. moses [一个]
maps to spm [一][个]
) I chose to generate alignment pairs for all combinations.
(I'd really, really like to replace the hidden end-to-end tests in TODO: Regenerate the end-to-end test files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly good. I would like to have example of the new modifiers in the end-to-end tests and the documentation.
One thing that I forgot to tell you is that i tested decoding with pre-spm'd text and it works, because we have one-to-one token match. This means that we should be able to apply spm modifications here and not inside marian during training and then we can solve the problem of the sentences that are getting marked as tainted due to modifications.
src/opustrainer/alignments.py
Outdated
or pair.trg < 0 or pair.trg >= len(trg_tokens) | ||
] | ||
if invalid_pairs: | ||
raise ValueError('Out-of-bound alignment pairs: ' + ' '.join(map(repr, invalid_pairs))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't want to Raise during training, let's have all of those be a warn_once
. I had a few training runs fail because of a one off error in the alignments will kill a whole training run. We should have a fail condition for alignments are no accurate. (And for some reason, sometimes, fast align did produce invalid alignments...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I agree. What should we do with lines that have out-of-bound pairs? I think we should just throw out all alignment pairs in that case for that sentence because there's no guarantee that the rest of them are not off-by-one or something.
That would automatically also exclude them from the Tags modifier, and their alignment info wouldn't be passed on to Marian, so Marian would not be trained with broken data (or worse, bork because of an index error).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would throw out the lines, if we can do it cleanly. If alignment wasn't produced, this likely means the sentence is bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, euhm. I can do that, but it has to happen at the modifier-is-applied-to-batch level in trainer.py because that's the only place we can remove lines at the moment. I'll change that.
There's a discussion somewhere with Graeme about changing modifiers to return a list of lines, which then can just add and remove lines. But that's probably too much also add into this pull request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or maybe then merge as is and start another PR aiming at sanitizing input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good; just some small comments beyond @XapaJIaMnu's review
Oh cool! How would you pass this to Marian? Do you just give it the spm token ids as numbers? I initially had much bigger plans for this change, but moved them over to #29 and tried to keep this change backwards compatible as much as possible. Having an option to actually produce spm token ids and specify things like spm sampling parameters would fit easily in that structure. |
I was overeager here, hmm. It works for simple sentences, but not for complex ones because |
I need a new modifier that actually crashes deterministically to test this.
Nicely backwards compatible!
This makes `spm_vocab` behave the same as the dataset paths in a yaml file without manual annotations.
These tests are more specifically for expected end-to-end behaviours
The epoch counter was increased one step too late for our code
When the batch size is bigger than the dataset, it will produce more data than the condition. Not sure whether we should treat this as a bug yet.
Sorry this pull request exploded a bit once I tried to fix the end-to-end tests and discovered some bugs thanks to them. I think it is mergeable now, assuming we still train marian with full text (re marian-nmt/marian-dev#1003) and we do the "oh this can be better" data type rewrite later (re #29). Comments are welcome. Also on the |
except Exception as exc: | ||
raise CurriculumLoaderError(f"could not initialize modifier '{name}': {exc!s}") from exc | ||
|
||
def _match_modifier_signature(self, fn:Type[Modifier], settings:Dict[str, Any], basepath:str) -> Dict[str,Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow, it's like you are writing C++ here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think I could do this in C++, not enough reflection APIs. But if I could do this with templates, I'd probably would have had better type guarantees than whatever my python LSP is coming up with.
self.assertEqual(process.returncode, returncode, msg=process.stderr) | ||
|
||
# Useful to generate the data, hehe | ||
if False and path.endswith('.expected.out'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, so we keep the output generating code in here and just change it to True
when the tests change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Or I can comment it out. Not sure why I did the if False thing, probably out of lazyness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we leave the produce spm pieces directly
bit for when the PR with marian is merged?
Also, after the rework of the tests, we get 110 sentences, as opposed to 200. It's still not an exact match of the number of lines in a batch, but closer. Should we just write this in the notes and call it a day?
How difficult would it be to drop invalid sentences with a warning? Last time i looked it got a bit hairy. Maybe just return ""
would just drop the sentence altogether?
I opened #30 for this.
Just implemented that. With backtraces for debugging and all :) |
try: | ||
yield fn(item) | ||
except Exception as exc: | ||
raise Exception(f'Exception while processing item {n}: {item!r}') from exc | ||
logger.log(f'Exception while processing line, skipping: {item!r}', 'WARNING', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should ideally be log_once
otherwise we would get spammed everytime we loop through the dataset. (I assume the exception kwards would be the same always)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be, but it might also not be hashable. I'll experiment.
This should add the missing functionality to train a student with alignment without having to use hacks.
Right now I'm implementing this as an extension of the Placeholder filter for compatibility reasons. But it can be used as its own stand-alone filter,
Retokenize
.The
tokenizer
parameter of that thing is a bit of a lie, as it always gives you the detokenized output. But it will use thetokenizer
parameter to tokenize that detokenized output so it can compute the correct alignment between the new tokens and their indices.