Tokenizer Internationalization - German #4
Comments
I had a quick look at the tokenizer, and it looks like it should work with German if abbreviations_list gets adjusted and clitics get removed. Formal German doesn't have any clitics, so that part won't be needed (German dialects are a different topic ;) ).
Awesome! Happy to review any pull requests. I don't have a process in place for reviewing localizations in languages I don't understand, so this will likely be a bit of a process.
In German there are separable ("compound") verbs like "ausschalten" (to turn off) or "herunterladen" (to download), and the prefix separates from the stem when you conjugate them. For example: "Schalte das Licht aus" (turn the light off). Now I want to define an intent that listens for both "Schalte das Licht aus" and "Würdest du bitte das Licht ausschalten" (would you please turn off the light), so I'd like to define "ausschalten" as an entity, and also "schalte" + "aus". Is it possible to combine two separate words into a single entity with the current version of Adapt?
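One conceivable workaround (a minimal sketch, not part of Adapt's API: the function name, the prefix list, and the heuristic are all illustrative assumptions) is a preprocessing step that rejoins a trailing separable prefix onto the verb before entity matching, so "schalte … aus" and "ausschalten" can later be collapsed by stemming:

```python
# Hypothetical preprocessing sketch (not Adapt's actual pipeline):
# if a German utterance ends in a known separable prefix, glue it
# back onto the first token, which in imperative word order is
# usually the verb ("Schalte das Licht aus" -> "ausschalte ...").
SEPARABLE_PREFIXES = {"ab", "an", "auf", "aus", "ein", "herunter", "mit"}

def rejoin_separable_verb(tokens):
    """Naive heuristic: reattach a trailing separable prefix to the
    first token. Real German word order needs proper parsing."""
    if len(tokens) >= 2 and tokens[-1] in SEPARABLE_PREFIXES:
        return [tokens[-1] + tokens[0]] + tokens[1:-1]
    return tokens

print(rejoin_separable_verb("schalte das licht aus".split()))
# -> ['ausschalte', 'das', 'licht']
```

Note this yields the conjugated form "ausschalte", not the infinitive "ausschalten"; a German stemmer would still be needed to map both to a common stem.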
Any news on this?
Hey guys, sorry, this totally fell off my radar. It sounds like from @hinzundcode's post that the English tokenizer is potentially not sufficient for his case. If true, we'd need a couple of things to get this working.
I don't have the expertise to work on the former, so we'd be looking for support from the community here.
The same is true for Dutch. #13
All the other languages seem stuck; Spanish seems to have progressed the most. #5 Shouldn't there be a way to see that tokenizer.py is the EN variant? (Apart from the EnglishTokenizer class, the rest of the file contains top-level vars whose content needs translating.) E.g. I would expect something like: More than happy to create a PR with a NL tokenizer (and I might even be able to help with a German version), but without multilingual support it feels a bit useless. I might be missing some essential design clue; any pointers in the right direction are appreciated.
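The expected example above did not survive in this thread, but the general idea can be sketched as follows (a hypothetical design, not Adapt's actual API: the registry, `get_tokenizer`, and the toy `tokenize` method are all illustrative assumptions). Making the language an explicit parameter would make it obvious that only English is implemented:

```python
# Hypothetical sketch of a language-keyed tokenizer registry,
# instead of a single implicit English tokenizer.py.
class EnglishTokenizer:
    """Toy stand-in; the real tokenizer does more than whitespace splitting."""
    def tokenize(self, text):
        return text.split()

TOKENIZERS = {"en": EnglishTokenizer}

def get_tokenizer(lang="en"):
    """Return a tokenizer for the given language code, or fail loudly."""
    try:
        return TOKENIZERS[lang]()
    except KeyError:
        raise NotImplementedError(f"no tokenizer registered for {lang!r}")
```

With this shape, a Dutch or German contribution would just register a new class under "nl" or "de", and unsupported languages fail explicitly rather than silently using English rules.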
Also, there isn't a single word about translating this in the docs: https://mycroft.ai/documentation/adapt/ Not sure how to continue without (community) support. @clusterfudge: it would be nice to at least remove the "READY" labels, as they are somewhat confusing.
I'm also interested in a German tokenizer, as I'm localizing one of my applications to German at the moment. I'm forced to disable intent parsing for the German version, which is a pity because it does add value to my product. Looks like this project is dead, though; any alternatives? @acidjunk, could you solve it?
Sorry for the delayed followup here; removing the READY label is probably a good call, @acidjunk. In order to hit READY, what we likely need is a well-specified interface for Tokenizer. There's also likely a chunk of project-management work on my part to lay out the work for each language; I'll attempt to put something like that up in the next week. One field on this tracking table will indicate whether or not bag-of-words classification works for the language in question. This will require language fluency and a good comprehension of bag-of-words classification. If anyone feels like they meet the criteria for this, feel free to speak up!
I'm not very actively following Mycroft stuff anymore (mostly due to the lack of delivery of the MkII): my home automation has already been completely controllable by Siri in Dutch for the last 2 years. I'm fluent in English, Dutch, and German, so if there is work with a well-defined scope for improving language support, I can help. Just shoot.
We should test to see if the EnglishTokenizer impl is sufficient for German, and if not, add an additional tokenizer. EnglishTokenizer is based on the Porter stemmer.
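The Porter stemmer encodes English suffix rules, so the question is whether German inflectional variants collapse to a common stem. A quick way to see what a German-aware rule set would need is a toy suffix stripper (a minimal sketch, not the Snowball German stemmer or anything in Adapt: the suffix list and length guard are illustrative assumptions):

```python
# Naive German suffix stripper, for illustration only: strip a few
# common inflectional endings so conjugated forms share a stem.
GERMAN_SUFFIXES = ("en", "e", "st", "t")

def naive_german_stem(word):
    """Strip the longest matching suffix, keeping a minimum stem length."""
    for suffix in sorted(GERMAN_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

# The infinitive and a conjugated form reduce to the same stem,
# which is what bag-of-words matching over entities would need.
print(naive_german_stem("ausschalten"))  # -> 'ausschalt'
print(naive_german_stem("ausschalte"))   # -> 'ausschalt'
```

English-only Porter rules would not produce this collapse for German vocabulary, which suggests a separate German tokenizer/stemmer is likely needed rather than reusing EnglishTokenizer as-is.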