Tokenizer Internationalization - Spanish #5

Open
clusterfudge opened this issue Jan 8, 2016 · 11 comments

Comments

@clusterfudge
Collaborator

We should test whether the EnglishTokenizer implementation is sufficient for Spanish and, if not, add an additional tokenizer. EnglishTokenizer is based on the Porter stemmer.

@ghost

ghost commented Jan 8, 2016

What is needed in order to test it? I am not familiar with Adapt's design, and I am reading the README.md at this moment. Should I translate the strings, or is something else needed?

@seanfitzgeraldsc

First, you'll need to validate whether the EnglishTokenizer is sufficient. I would do this by creating Spanish versions of the examples and playing with them. Specifically, the tokenizer is punctuation-aware and splits an utterance (a sentence or phrase) into individual tokens (usually words).
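As a rough illustration of what "punctuation-aware" splitting means for Spanish input, here is a minimal sketch. It is not Adapt's EnglishTokenizer: the `tokenize` function and its regular expression are hypothetical stand-ins, used only to show that accented letters should stay inside a word while marks such as `¿` and `?` become separate tokens.

```python
import re

# Hypothetical stand-in, NOT Adapt's EnglishTokenizer.
# \w is Unicode-aware for str patterns in Python 3, so accented letters
# stay inside a word; every other non-space character is its own token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(utterance):
    """Split an utterance into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(utterance)


spanish_examples = [
    "pon algo de m\u00fasica de los clash",
    "\u00bfqu\u00e9 tiempo hace en seattle?",
]

for utterance in spanish_examples:
    print(tokenize(utterance))
```

Running the same Spanish utterances through the real tokenizer and comparing the token lists would be one way to carry out the validation described above.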

If the English tokenizer does not work well, you'll need to look for a Spanish equivalent of the Porter Stemmer algorithm and implement it. That second step can be picked up by someone else if it's beyond your scope. Validating whether the existing tokenizer is sufficient is a great first step.
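To make the stemming idea concrete, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm and not a real Spanish stemmer; the suffix list, the `naive_stem` name, and the minimum stem length are all assumptions chosen for illustration. It only shows why language-specific suffixes matter: inflected forms of one word should reduce to the same stem.

```python
# Toy suffix list picked for illustration only; a real Spanish stemmer
# handles many more suffixes, accents, and exceptions.
SPANISH_SUFFIXES = sorted(
    ["aciones", "mente", "iendo", "ando", "es", "as", "os", "a", "o", "s"],
    key=len,
    reverse=True,  # try the longest suffix first
)


def naive_stem(word, min_stem=3):
    """Strip the longest matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SPANISH_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for word in ["casa", "casas", "escuchando"]:
    print(word, "->", naive_stem(word))
```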

Thanks!

@ghost

ghost commented Jan 8, 2016

I see. I am willing to do this, though I can't at this very moment; I will do some experiments later. Expect plenty of questions, because I am very likely to get lost!

cheers!

@mcicolella

Hi, if you need help reimplementing the Porter Stemmer algorithm for Spanish or other languages, take a look at https://github.com/OleanderSoftware/OleanderStemmingLibrary
It's a very good library.

@adocampo

I don't know if I did what I was supposed to do, but I've just modified the source code of multi_intent_parser.py to "understand" Spanish words.
http://pastebin.com/bEJqCKuj

You can try these sentences: "pon algo de música de los clash", "quiero escuchar algo de música de los clash", "qué tiempo hace en seattle", and it seems to return JSON.

Is that what's needed?

@clusterfudge
Collaborator Author

So, this is definitely some helpful work! I think we'd want to have samples per language, maybe separated into folders. To really verify that this works for Spanish, we'd need the unit tests translated to Spanish and, even better, localization work done on the unit tests so that the test code stays the same but loads different data files for different languages. That would give me high confidence that the language itself works with the tokenizer, but that may be an unrealistic goal. Can you try translating some of the engine tests?
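A minimal sketch of that localization idea follows, assuming one JSON data file per language. The file layout, the key names, the helper functions, and the sample utterances are all hypothetical; the check itself is a placeholder for whatever the real engine tests would assert.

```python
import json
import tempfile
from pathlib import Path


def load_case(data_dir, lang):
    """Read the test data for one language from <data_dir>/<lang>.json."""
    with open(Path(data_dir) / f"{lang}.json", encoding="utf-8") as handle:
        return json.load(handle)


def vocabulary_found(case):
    """Language-independent check: every vocabulary word occurs in the utterance."""
    words = case["utterance"].lower().split()
    return all(term in words for term in case["vocabulary"])


# Illustrative data only; real fixtures would live in the repository.
SAMPLE_DATA = {
    "en": {"utterance": "go to the tree house", "vocabulary": ["tree", "house"]},
    "es": {"utterance": "ve a la casa del \u00e1rbol", "vocabulary": ["casa", "\u00e1rbol"]},
}

with tempfile.TemporaryDirectory() as data_dir:
    # Write one data file per language, then run the same check on each.
    for lang, case in SAMPLE_DATA.items():
        Path(data_dir, f"{lang}.json").write_text(
            json.dumps(case, ensure_ascii=False), encoding="utf-8"
        )
    results = {
        lang: vocabulary_found(load_case(data_dir, lang)) for lang in SAMPLE_DATA
    }

print(results)
```

The point of the design is that adding a language means adding a data file, not another copy of the test logic.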

@clusterfudge
Collaborator Author

Thanks for contributing!

@adocampo

adocampo commented Mar 3, 2016

> Can you try translating some of the engine tests?

Of course I can... could you please point me to the engines? I only saw this one https://github.com/MycroftAI/adapt/blob/master/test/IntentEngineTest.py and I doubt I can do something with it...

@clusterfudge
Collaborator Author

That would be the test I was referencing. Swapping out the vocabulary/utterances for Spanish equivalents would be acceptable to me, but completely unverifiable (as I only took about 2 years of Spanish, 20 years ago).

@adocampo

adocampo commented Mar 3, 2016

OK, I only translated the utterance sentence (line 36) and the two expressions "tree" (line 34) and "house" (line 43):
http://pastebin.com/PkZJ4Gmq

I don't know if this is what you need, and perhaps the utterance sentence could be translated into Spanish differently depending on whether it is imperative (as I've translated it), infinitive, or another tense...

Hope it helps!

@drawveloper

Should I open a new issue for Portuguese?
