# Use `SentencePair` struct instead of `str` internally (#29)
#21 has a bit of a unique implementation in that it hides two training examples in one. What may be conceptually nicer is to have noise behave more like a dataset, since it is a source of 'data':

```yaml
- dataset: "clean-alt"
  columns:
    - path: data.tsv.gz
      column: 0
      type: moses:en
    - path: data.tsv.gz
      column: 1
      type: moses:de
    - path: alignments.gz
      column: 0
      type: alignments
- noise: "noise"
  ranges:
    - "Basic Latin"
    - "Emoji"

start:
  - clean-alt 0.99
  - noise 0.01
  - crawled 0.00
  - until noise X # until X 'epochs' of noise
```

This would require:

(It may also be useful to have modifiers emit multiple SentencePairs as well)
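For illustration, a noise "dataset" in this design would just be another infinite source of sentence pairs. A minimal sketch, assuming hypothetical names (`noise_reader`, `RANGES`) that are not the project's actual API:

```python
import random

# Hypothetical mapping from a configured range name to a Unicode codepoint span.
RANGES = {
    "Basic Latin": (0x0020, 0x007E),
    "Emoji": (0x1F600, 0x1F64F),
}

def noise_reader(ranges, seed=0):
    """Yield (src, trg) pairs of random characters, the way a dataset reader
    yields lines, so the sampler can mix it in like any other dataset."""
    rng = random.Random(seed)
    blocks = [RANGES[name] for name in ranges]
    while True:
        lo, hi = rng.choice(blocks)
        text = "".join(chr(rng.randint(lo, hi)) for _ in range(rng.randint(1, 10)))
        yield text, text  # same "sentence" on both sides

pairs = noise_reader(["Basic Latin"])
src, trg = next(pairs)
```

Because it is a generator with the same shape as a file-backed reader, the `start` distribution above could treat it identically to `clean-alt` or `crawled`.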
I proposed Noise as a dataset to Nick as well; I agree with you that this makes the most sense.
This would cover our current need for having modifiers emit multiple SentencePairs, but would not provide a solution for modifiers that remove SentencePairs (e.g. as the Tags modifier should do when it encounters bad alignment info). But that can also be solved by making modifiers be …

I'm trying to come up with a scenario in which you'd want your modifier to behave like …
Yes, keeping a mapping of non-modifiable tokens is more flexible. Perhaps just …

Obviously I found this hard too; maybe you're training a bi-directional model and you want e.g. …
Just to add to this, it may be useful to have a modifier ingest …
This comes mostly from me working on re-alignment. I'm moving the grand design I had in #26 to here, making that pull request about just supporting alignment info passthrough.

Additional niceties from this:

- `__source__` and `__target__`: tag and other modifiers can easily skip around those without having to have a complete list of tokens they can't touch.

So here's the plan:
- replace the bare `split()` and `' '.join()` calls used internally.
- a `tokens` strategy for the trainer. Same default, so when unspecified, tokens will be passed to the trainer as-is. However, you can use this to detokenize moses tokens, or retokenize them into marian's "I want plain text, but the alignment info needs to be on the spm token level".

Implementation wise:
- a `SentencePair` type (see below) which holds the tokens and alignment info.
- modifiers receive a `SentencePair`, and it will be easy to make sure alignment info stays valid while manipulating such a pair.
- a `Retokenize` modifier that can be used to change between tokenisations. This can for example help with tokenising Chinese in case it wasn't tokenised into words, or to add tokenisation to a dataset that didn't have it (but you should just update the file then, right?)
Current yaml:
... will be interpreted as
Supported types so far:

- `text`, which uses the `SpaceTokenizer`, which just does `text.split()` and `' '.join(tokens)`.
- `moses:{lang}` uses sacremoses.
- `spm:{vocab}` uses sentencepiece. Right now the actual tokens that come out of this aren't used, except for adjusting all the alignment indices so they map from moses tokens to what they would be if the text were spm tokens (no sampling though).
- `alignments` just parses `{m}-{n}` pairs into a `list[Pair]`.
- `optional-alignments` does the same, except it will return `None` if there is no third column. If there is an empty third column, it will return `[]`. (TODO: is this necessary? This is just here so we can still automatically deal with 2-col and 3-col data without having to specify it in the yaml explicitly.)

Possible alternative yaml (the above can be a shorthand for this):
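For illustration, parsing the `{m}-{n}` format described above could look like this (a sketch under my own function names; the actual parser may differ):

```python
from typing import List, Optional, Tuple

Pair = Tuple[int, int]

def parse_alignments(field: str) -> List[Pair]:
    """Parse space-separated '{m}-{n}' pairs, e.g. '0-0 1-2', into index pairs."""
    return [(int(m), int(n))
            for m, n in (pair.split("-") for pair in field.split())]

def parse_optional_alignments(field: Optional[str]) -> Optional[List[Pair]]:
    """Like parse_alignments, but None when the third column is absent
    entirely, and [] when the column is present but empty."""
    if field is None:
        return None
    return parse_alignments(field)  # '' splits to [], giving []
```

This captures the `alignments` vs. `optional-alignments` distinction from the list above: absent column → `None`, empty column → `[]`.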
Also something for specifying which tokens the trainer uses:
Sentence Pair structure: