Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opusfilters integration #40

Closed
jelmervdl opened this issue Oct 6, 2022 · 2 comments · Fixed by #47 · May be fixed by #43
Closed

Opusfilters integration #40

jelmervdl opened this issue Oct 6, 2022 · 2 comments · Fixed by #47 · May be fixed by #43
Assignees
Labels
help wanted Extra attention is needed pipe dream Someday…

Comments

@jelmervdl
Copy link
Collaborator

I looked at it in the past, but should look again?

I think we can integrate parts of Opusfilters into this project:

Readers

Support for more data formats would probably be very welcome?

I'm a bit sceptical here because at some point some of the wrapper tooling will need to be written in a more performant language, and if the component where it all begins, a dataset reader, is the complicated slow bit… well hmm.

Filters

Seems pretty obvious, except that currently these are pretty tightly connected to the readers in that the readers produce batches of dataframes (or things that look like dataframes) and then filters operate on those dataframes.

Also, filters are combined using the yaml file. Each filter step specifies inputs and outputs. There is no continuous flow from first to last step like in empty-train assumes. It has filter steps like training a model on an external dataset.

To integrate into the rest of the filters each filter would need to be a subprocess with stdin/stdout. That sounds slow? And the yaml configuration would need to be specified for each step. That sounds painful? This was the bit where I got stuck last time.

@jelmervdl jelmervdl added help wanted Extra attention is needed pipe dream Someday… labels Oct 6, 2022
@jelmervdl
Copy link
Collaborator Author

I haven't looked well at doing it the other way around: what if we integrate all of our filters into OpusFilter, and use OpusFilter as the configuration format/overall runtime (i.e. take on the role of run.py). It already has support for running only the N-th step in a pipeline and parallel processing.

I have no idea about efficiency compared to how run.py works, data seems to flow a lot more through various python data structures in OpusFilter. But easiest way to find out is just to experiment a bit with it.

@jelmervdl
Copy link
Collaborator Author

jelmervdl commented Oct 14, 2022

I can't install a reasonably up-to-date version of opusfilter on macOS because of mingruimingrui/fast-mosestokenizer#3 and installing fast-mosestokenizer from source is a pain.

Edit: installing opusfilter with --no-deps and then manually installing all the other deps manually works but. Now I'm running into the issue that some filters take complex arguments, so there is no easy way to use the existing parameter interface for some of the filters.

Also earlier work that derived the parameters from the class definitions doesn't work with the newer version because the type information is incomplete or too complex. I.e. some arguments are not just a float or a string, but a list of nested things.

@jelmervdl jelmervdl linked a pull request Oct 18, 2022 that will close this issue
@jelmervdl jelmervdl self-assigned this Nov 9, 2022
@jelmervdl jelmervdl removed a link to a pull request Nov 9, 2022
This was linked to pull requests Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed pipe dream Someday…
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant