I looked at it in the past, but I should probably look again.
I think we can integrate parts of OpusFilter into this project:
Readers
Support for more data formats would probably be very welcome?
I'm a bit sceptical here because at some point some of the wrapper tooling will need to be written in a more performant language, and if the component where it all begins, a dataset reader, is the complicated slow bit… well hmm.
Filters
Integrating these seems obvious, except that they are currently tightly coupled to the readers: the readers produce batches of dataframes (or dataframe-like objects), and the filters then operate on those dataframes.
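To make that coupling concrete (all names here are hypothetical illustrations, not the actual empty-train API), a reader/filter pair in this style might look like:

```python
import pandas as pd

def read_batches(path, batch_size=1000):
    """Hypothetical reader: yields batches of parallel sentences as dataframes."""
    for chunk in pd.read_csv(path, sep="\t", names=["src", "trg"],
                             chunksize=batch_size, quoting=3):
        yield chunk

def length_filter(batch: pd.DataFrame, max_len: int = 100) -> pd.DataFrame:
    """Hypothetical filter: operates on the dataframe batch, not on raw lines."""
    keep = (batch["src"].str.split().str.len() <= max_len) & \
           (batch["trg"].str.split().str.len() <= max_len)
    return batch[keep]
```

The filter only makes sense given the column layout the reader chose, which is the coupling problem described above.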
Also, filters are combined via the YAML file. Each filter step specifies its own inputs and outputs; there is no continuous flow from first to last step like empty-train assumes. It also has filter steps such as training a model on an external dataset.
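For reference, an OpusFilter configuration chains explicit steps with named inputs and outputs rather than one continuous stream; a minimal sketch (paths and parameter values are illustrative, and the exact field names should be checked against the OpusFilter docs) looks roughly like:

```yaml
common:
  output_directory: work

steps:
  # Each step names its own inputs and outputs; later steps read
  # files that earlier steps wrote.
  - type: filter
    parameters:
      inputs: [corpus.en.gz, corpus.de.gz]
      outputs: [filtered.en.gz, filtered.de.gz]
      filters:
        - LengthFilter:
            unit: word
            min_length: 1
            max_length: 100

  # A step can also train a model on an external dataset,
  # which breaks the simple first-to-last flow.
  - type: train_ngram
    parameters:
      data: external-corpus.en.gz
      model: lm.en.arpa
      parameters:
        norder: 3
```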
To integrate with the rest of the filters, each filter would need to be a subprocess communicating over stdin/stdout. That sounds slow? And the YAML configuration would need to be specified for each step. That sounds painful? This is the bit where I got stuck last time.
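A back-of-the-envelope sketch of what that subprocess wrapping could look like (the protocol and names are assumptions, not an existing interface): each filter is an executable that reads tab-separated sentence pairs on stdin and echoes the surviving ones on stdout, and the driver pipes data through the chain.

```python
import subprocess
import sys

def run_filter_chain(lines, commands):
    """Pipe lines through a chain of stdin/stdout filter subprocesses.

    `commands` is a list of argv lists; each process is assumed to read
    tab-separated sentence pairs on stdin and write the kept ones to stdout.
    """
    data = "".join(lines).encode()
    for argv in commands:
        # One process per filter step; each full pass re-serializes the data,
        # which is where the "that sounds slow?" concern comes from.
        proc = subprocess.run(argv, input=data, stdout=subprocess.PIPE, check=True)
        data = proc.stdout
    return data.decode().splitlines(keepends=True)
```

Running the whole batch through each process in turn keeps the sketch simple; a real implementation would stream between concurrent processes instead of buffering everything.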
I haven't looked closely at doing it the other way around: what if we integrated all of our filters into OpusFilter and used OpusFilter as the configuration format and overall runtime (i.e. it takes on the role of run.py)? It already supports running only the N-th step of a pipeline, and parallel processing.
I have no idea how its efficiency compares to how run.py works; data seems to flow through a lot more Python data structures in OpusFilter. But the easiest way to find out is just to experiment a bit with it.
I can't install a reasonably up-to-date version of opusfilter on macOS because of mingruimingrui/fast-mosestokenizer#3 and installing fast-mosestokenizer from source is a pain.
Edit: installing opusfilter with --no-deps and then manually installing all the other dependencies works. But now I'm running into the issue that some filters take complex arguments, so there is no easy way to use the existing parameter interface for some of the filters.
Also, earlier work that derived the parameters from the class definitions doesn't work with the newer version, because the type information is incomplete or too complex: some arguments are not just a float or a string, but a list of nested things.
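To make the introspection problem concrete (the filter class below is invented for illustration, not an OpusFilter class), inspecting a constructor only gets you so far when a parameter is a nested structure with no annotation:

```python
import inspect

class ExampleFilter:
    """Invented filter whose parameter is not a flat float/str."""
    def __init__(self, thresholds=(("en", (0.1, 0.9)), ("de", (0.2, 0.8)))):
        self.thresholds = thresholds

sig = inspect.signature(ExampleFilter.__init__)
param = sig.parameters["thresholds"]
# No annotation, and the default is a nested structure: there is no flat
# float/str type from which to derive a simple parameter widget.
print(param.annotation is inspect.Parameter.empty)  # True
print(param.default)
```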