avoid translation of Named Entities #127
Comments
It's an issue we're aware of, but we don't have a solution for it yet. We're thinking along the same lines though! Our plan was to add support for placeholders, e.g. placeholders in the input sentence would be translated as is into the output sentence (but in the proper position). We could then replace some or all named entities, URLs, email addresses, etc. with placeholders and put them back in after translation. The problem with this approach is that the model has to be trained with placeholder support, so this won't work with our current models.

What you could try is to use the support for HTML translation that's in bergamot-translator (the library backing translateLocally). I just pushed a commit to the main branch to make that accessible from the command line. With that version, you should be able to do something like:

    echo "The train leaves for <span>London St. Pancras</span> at quarter past six." | ./translateLocally -m eng-fin-tiny --html
    Juna lähtee <span>Lontoo St. Pancras</span> neljännestä yli kuusi.

HTML support is not really meant for this, but it might get you at least halfway. You can add
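For illustration, the placeholder idea described in this comment could be sketched roughly as below. This is only a sketch under assumptions: the `__ENT0__` token format and the function names `mask_entities` and `unmask_entities` are hypothetical, and, as the comment notes, the current models are not trained to copy placeholders through, so the tokens may not survive translation intact.

```python
import re


def mask_entities(text, entities):
    """Replace the first occurrence of each entity with a placeholder token.

    The token format (__ENT0__, __ENT1__, ...) is a hypothetical choice,
    not something the translation models are known to support.
    """
    mapping = {}
    masked = text
    for i, entity in enumerate(entities):
        token = f"__ENT{i}__"
        masked = masked.replace(entity, token, 1)
        mapping[token] = entity
    return masked, mapping


def unmask_entities(text, mapping):
    """Put the original entities back wherever their tokens appear."""
    if not mapping:
        return text
    pattern = re.compile("|".join(re.escape(token) for token in mapping))
    return pattern.sub(lambda match: mapping[match.group(0)], text)
```

The masked text would be sent to the translator and the translated output passed to `unmask_entities` with the same mapping.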
Thank you. It is working with the HTML support in all languages except Estonian. The HTML support looks to be broken in the Estonian model. I'm doing the preprocessing with a spaCy model that is able to detect full names. I then wrap those names in a span and, after the translation, use a regex to put the originals back. I'm also seeing what looks like a memory leak, though it can be worked around by restarting the subprocess every 1000 iterations. I am still working on the scripts and will post an example when it is stable.
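A minimal sketch of what such a workflow might look like, not the poster's actual script. It assumes the entity strings are already known (for example from an NER model), that the text contains no other HTML-special characters, that spans come back in the same order they went in, and that translateLocally emits one output line per input line. The function names and the `batch_size` parameter are hypothetical; only the `-m` and `--html` flags and the `eng-fin-tiny` model name come from the thread. Starting a fresh process for every batch of 1000 lines mirrors the restart workaround mentioned above.

```python
import re
import subprocess

SPAN_RE = re.compile(r"<span>(.*?)</span>", re.DOTALL)


def wrap_entities(text, entities):
    """Wrap the first occurrence of each entity in <span> tags."""
    for entity in entities:
        text = text.replace(entity, f"<span>{entity}</span>", 1)
    return text


def restore_entities(translated, entities):
    """Replace the i-th span in the output with the i-th original entity.

    Spans beyond the number of known entities are simply unwrapped.
    """
    originals = iter(entities)
    return SPAN_RE.sub(lambda m: next(originals, m.group(1)), translated)


def translate_html(lines, model="eng-fin-tiny",
                   binary="./translateLocally", batch_size=1000):
    """Translate lines in batches, using a new process for each batch."""
    translated = []
    for start in range(0, len(lines), batch_size):
        batch = lines[start:start + batch_size]
        result = subprocess.run(
            [binary, "-m", model, "--html"],
            input="\n".join(batch) + "\n",
            capture_output=True, text=True, check=True,
        )
        translated.extend(result.stdout.splitlines())
    return translated
```

Restoring by position breaks down if the translation reorders entities or duplicates a span, so the output is worth checking before the originals are substituted back.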
Example of the Estonian issues. Hmm.
I think you're seeing the results of using alignment scores for inserting HTML, and why it isn't ideal for your use case. What it basically does is determine, for each output token, which source token aligns best according to some alignment model. There's no guarantee that this gives a 1-to-1 mapping, and the HTML reconstruction is allowed to duplicate elements if it thinks that a span in the input sentence got split up in the translated sentence. You might want to do some post-processing to decide which of the spans is the actual named entity.
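One possible form of that post-processing, sketched under assumptions: when a single entity was wrapped in the input and several spans come back, keep the span whose text is most similar to the original entity and unwrap the rest. The string-similarity heuristic (Python's `difflib.SequenceMatcher`) and the function name `pick_entity_span` are illustrative choices, not something bergamot-translator provides, and the heuristic can fail when the entity is translated beyond recognition.

```python
import re
from difflib import SequenceMatcher

SPAN_RE = re.compile(r"<span>(.*?)</span>", re.DOTALL)


def pick_entity_span(translated_html, entity):
    """Restore `entity` into the most similar span and unwrap the others."""
    spans = list(SPAN_RE.finditer(translated_html))
    if not spans:
        return translated_html

    def similarity(match):
        # Case-insensitive character-level similarity to the source entity.
        return SequenceMatcher(
            None, match.group(1).lower(), entity.lower()
        ).ratio()

    best = max(spans, key=similarity)
    pieces = []
    position = 0
    for match in spans:
        pieces.append(translated_html[position:match.start()])
        # The best span gets the original entity; others keep their text.
        pieces.append(entity if match is best else match.group(1))
        position = match.end()
    pieces.append(translated_html[position:])
    return "".join(pieces)
```

A similarity threshold could be added so that nothing is replaced when no span resembles the entity at all.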
I noticed that named entities like company names are getting translated.
I was thinking of running a preprocessing model, such as one from spacy.io, to flag all the named entities, and then avoid translating those.
I am wondering if there is an official way to prevent translation of parts of the text sent to translateLocally.
For example: China Nonferrous Gold Limited -> Kiina Non Iron Gold Limited (Finnish, from the Opus-mt student model)
Using the student models is the best solution for translating large amounts of text with limited computing power. I am playing around with translating a site I am building into many languages, but translating just 80k paragraphs was going to take months on a single computer. With the student models I can do it in one night.