Tokenizer Internationalization - German #4
Comments
I had a quick look at the tokenizer, and it looks like it should work with German if abbreviations_list gets adjusted and clitics get removed. Formal German doesn't have any clitics, so that part won't be needed (German dialects are a different topic ;) ).
Awesome! Happy to review any pull requests. I don't have a process in place for reviewing localizations in languages I don't understand, so this will likely be a bit of a process.
In German there are separable ("compound") verbs like "ausschalten" (to turn off) or "herunterladen" (to download), and the prefix separates from the stem when you conjugate them. For example: "Schalte das Licht aus" (turn the light off). Now I want to define an intent that listens for both "Schalte das Licht aus" and "Würdest du bitte das Licht ausschalten" (would you please turn off the light), so I'd like to define "ausschalten" as an entity, and also "schalte" + "aus". Is it possible to combine two separate words into a single entity with the current version of Adapt?
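One conceivable workaround (a minimal sketch, not part of Adapt's API: the function name, the prefix list, and the heuristic are all illustrative assumptions) is a preprocessing step that rejoins a trailing separable prefix onto the verb before entity matching, so "schalte … aus" and "ausschalten" can later be collapsed by stemming:

```python
# Hypothetical preprocessing sketch (not Adapt's actual pipeline):
# if a German utterance ends in a known separable prefix, glue it
# back onto the first token, which in imperative word order is
# usually the verb ("Schalte das Licht aus" -> "ausschalte ...").
SEPARABLE_PREFIXES = {"ab", "an", "auf", "aus", "ein", "herunter", "mit"}

def rejoin_separable_verb(tokens):
    """Naive heuristic: reattach a trailing separable prefix to the
    first token. Real German word order needs proper parsing."""
    if len(tokens) >= 2 and tokens[-1] in SEPARABLE_PREFIXES:
        return [tokens[-1] + tokens[0]] + tokens[1:-1]
    return tokens

print(rejoin_separable_verb("schalte das licht aus".split()))
# -> ['ausschalte', 'das', 'licht']
```

Note this yields the conjugated form "ausschalte", not the infinitive "ausschalten"; a German stemmer would still be needed to map both to a common stem.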
Any news on this?
Hey guys, sorry, this totally fell off my radar. It sounds like from @hinzundcode's post that the English tokenizer is potentially not sufficient for his case. If true, we'd need a couple of things to get this working.
I don't have the expertise to work on the former, so we'd be looking for support from the community here.
The same is true for Dutch. #13
All the other languages seem stuck; Spanish seems to have progressed the most. #5 Shouldn't there be a way to see that tokenizer.py is the EN variant? (Apart from the EnglishTokenizer class, the rest of the file contains top-level vars whose content needs translating.) E.g. I would expect something like: More than happy to create a PR with a NL tokenizer (and I might even be able to help with a German version), but without multilingual support it feels a bit useless. I might be missing some essential design clue; any pointers in the right direction are appreciated.
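The expected example above did not survive in this thread, but the general idea can be sketched as follows (a hypothetical design, not Adapt's actual API: the registry, `get_tokenizer`, and the toy `tokenize` method are all illustrative assumptions). Making the language an explicit parameter would make it obvious that only English is implemented:

```python
# Hypothetical sketch of a language-keyed tokenizer registry,
# instead of a single implicit English tokenizer.py.
class EnglishTokenizer:
    """Toy stand-in; the real tokenizer does more than whitespace splitting."""
    def tokenize(self, text):
        return text.split()

TOKENIZERS = {"en": EnglishTokenizer}

def get_tokenizer(lang="en"):
    """Return a tokenizer for the given language code, or fail loudly."""
    try:
        return TOKENIZERS[lang]()
    except KeyError:
        raise NotImplementedError(f"no tokenizer registered for {lang!r}")
```

With this shape, a Dutch or German contribution would just register a new class under "nl" or "de", and unsupported languages fail explicitly rather than silently using English rules.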
Also, there isn't a single word about translating this in the docs: https://mycroft.ai/documentation/adapt/ Not sure how to continue without (community) support. @clusterfudge: it would be nice to at least remove the "READY" labels, as they are somewhat confusing.
I'm also interested in a German tokenizer, as I'm localizing one of my applications to German at the moment. I'm forced to disable intent parsing for the German version, which is a pity because it does add value to my product. Looks like this project is dead, though; any alternatives? @acidjunk, could you solve it?
Sorry for the delayed followup here; removing the READY label is probably a good call, @acidjunk. In order to hit READY, what we likely need is a well-specified interface for Tokenizer. There's also likely a chunk of project-management work on my part to lay out the work for each language; I'll attempt to put something like that up in the next week. One field on this tracking table will indicate whether or not bag-of-words classification works for the language in question. This will require language fluency and a good comprehension of bag-of-words classification. If anyone feels like they meet the criteria for this, feel free to speak up!
I'm not very actively following Mycroft stuff anymore (mostly due to the lack of delivery of the MkII): my home automation has already been completely controllable by Siri in Dutch for the last 2 years. I'm fluent in English, Dutch, and German, so if there is work with a well-defined scope for improving language support, I can help. Just shoot.
We should test to see if the EnglishTokenizer impl is sufficient for German, and if not, add an additional tokenizer. EnglishTokenizer is based on the Porter stemmer.
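The Porter stemmer encodes English suffix rules, so the question is whether German inflectional variants collapse to a common stem. A quick way to see what a German-aware rule set would need is a toy suffix stripper (a minimal sketch, not the Snowball German stemmer or anything in Adapt: the suffix list and length guard are illustrative assumptions):

```python
# Naive German suffix stripper, for illustration only: strip a few
# common inflectional endings so conjugated forms share a stem.
GERMAN_SUFFIXES = ("en", "e", "st", "t")

def naive_german_stem(word):
    """Strip the longest matching suffix, keeping a minimum stem length."""
    for suffix in sorted(GERMAN_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

# The infinitive and a conjugated form reduce to the same stem,
# which is what bag-of-words matching over entities would need.
print(naive_german_stem("ausschalten"))  # -> 'ausschalt'
print(naive_german_stem("ausschalte"))   # -> 'ausschalt'
```

English-only Porter rules would not produce this collapse for German vocabulary, which suggests a separate German tokenizer/stemmer is likely needed rather than reusing EnglishTokenizer as-is.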