Opusfilters integration #40

jelmervdl · 2022-10-06T13:47:44Z

I looked at it in the past, but should look again?

I think we can integrate parts of Opusfilters into this project:

Readers

Support for more data formats would probably be very welcome?

I'm a bit sceptical here because at some point some of the wrapper tooling will need to be written in a more performant language, and if the component where it all begins, a dataset reader, is the complicated slow bit… well hmm.

Filters

Seems pretty obvious, except that currently these are pretty tightly connected to the readers in that the readers produce batches of dataframes (or things that look like dataframes) and then filters operate on those dataframes.

Also, filters are combined using the yaml file. Each filter step specifies inputs and outputs. There is no continuous flow from first to last step like in empty-train assumes. It has filter steps like training a model on an external dataset.

To integrate into the rest of the filters each filter would need to be a subprocess with stdin/stdout. That sounds slow? And the yaml configuration would need to be specified for each step. That sounds painful? This was the bit where I got stuck last time.

jelmervdl · 2022-10-12T09:40:46Z

I haven't looked well at doing it the other way around: what if we integrate all of our filters into OpusFilter, and use OpusFilter as the configuration format/overall runtime (i.e. take on the role of run.py). It already has support for running only the N-th step in a pipeline and parallel processing.

I have no idea about efficiency compared to how run.py works, data seems to flow a lot more through various python data structures in OpusFilter. But easiest way to find out is just to experiment a bit with it.

jelmervdl · 2022-10-14T15:05:02Z

I can't install a reasonably up-to-date version of opusfilter on macOS because of mingruimingrui/fast-mosestokenizer#3 and installing fast-mosestokenizer from source is a pain.

Edit: installing opusfilter with --no-deps and then manually installing all the other deps manually works but. Now I'm running into the issue that some filters take complex arguments, so there is no easy way to use the existing parameter interface for some of the filters.

Also earlier work that derived the parameters from the class definitions doesn't work with the newer version because the type information is incomplete or too complex. I.e. some arguments are not just a float or a string, but a list of nested things.

jelmervdl added help wanted Extra attention is needed pipe dream Someday… labels Oct 6, 2022

jelmervdl linked a pull request Oct 18, 2022 that will close this issue

Opusfilter backing #43

Draft

jelmervdl self-assigned this Nov 9, 2022

jelmervdl removed a link to a pull request Nov 9, 2022

Opusfilter backing #43

Draft

This was linked to pull requests Nov 9, 2022

Opusfilter filters #47

Merged

Opusfilter backing #43

Draft

jelmervdl closed this as completed in #47 Mar 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Opusfilters integration #40

Opusfilters integration #40

jelmervdl commented Oct 6, 2022

jelmervdl commented Oct 12, 2022

jelmervdl commented Oct 14, 2022 •

edited

Loading

Opusfilters integration #40

Opusfilters integration #40

Comments

jelmervdl commented Oct 6, 2022

Readers

Filters

jelmervdl commented Oct 12, 2022

jelmervdl commented Oct 14, 2022 • edited Loading

jelmervdl commented Oct 14, 2022 •

edited

Loading