Tokenizer Internationalization - Spanish #5

clusterfudge · 2016-01-08T17:13:49Z

We should test whether the EnglishTokenizer implementation is sufficient for Spanish and, if not, add an additional tokenizer. EnglishTokenizer is based on the Porter stemmer.

ghost · 2016-01-08T20:47:19Z

What is needed in order to test it? I am not familiar with adapt's design... and I am reading the README.md at this moment... should I translate the strings, or is something else needed?

seanfitzgeraldsc · 2016-01-08T20:55:50Z

First, you'll need to validate whether or not the EnglishTokenizer is sufficient. I would do this by creating Spanish versions of the examples and playing with them. Specifically, the tokenizer is punctuation-aware and splits an utterance (a sentence or phrase) into individual tokens (usually words).
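To make "punctuation-aware tokenizing" concrete, here is a minimal sketch of the kind of check described above. It does not use adapt's EnglishTokenizer: the `tokenize` function and `TOKEN_PATTERN` are hypothetical stand-ins, and the sample utterances are only illustrative. It shows what one would want to see from any tokenizer given Spanish input: accented words kept whole, and punctuation such as "¿" and "?" split into separate tokens.

```python
import re
import unicodedata

# Hypothetical stand-in for a punctuation-aware tokenizer.
# This is NOT adapt's EnglishTokenizer; it only illustrates the expected behavior.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(utterance):
    """Split an utterance into word tokens and single-character punctuation tokens."""
    # Normalize to NFC so accented letters are single code points matched by \w.
    normalized = unicodedata.normalize("NFC", utterance)
    return TOKEN_PATTERN.findall(normalized)


# Illustrative Spanish utterances (assumed examples, not from the adapt test suite).
spanish_utterances = [
    "¿Qué tiempo hace en Seattle?",
    "pon algo de música de los clash",
]

for utterance in spanish_utterances:
    print(utterance, "->", tokenize(utterance))
```

Running the real tokenizer over the same utterances and comparing the token lists by eye would be one way to do the validation step.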

If the English tokenizer does not work well, you'll need to look for an equivalent to the Porter Stemmer algorithm for Spanish and implement it. The latter can be picked up by someone else if that's beyond your scope. Validating whether or not the existing tokenizer is sufficient is a great first step.

Thanks!

ghost · 2016-01-08T21:05:19Z

I see. I am willing to do this, but I can't at this very moment... I will do some experiments later. Expect to read many questions, because it's very likely I'll get lost!

cheers!

mcicolella · 2016-01-09T10:02:04Z

Hi, if you need help reimplementing the Porter Stemmer algorithm for Spanish or other languages, take a look at https://github.com/OleanderSoftware/OleanderStemmingLibrary
It's a very good lib.

adocampo · 2016-02-28T18:13:16Z

I do not know if I did what I was supposed to do, but I've just modified the source code of multi_intent_parser.py to "understand" Spanish words.
http://pastebin.com/bEJqCKuj

You can try these sentences: "pon algo de música de los clash", "quiero escuchar algo de música de los clash", "qué tiempo hace en seattle", and it seems to return JSON.

Is that what's needed?

clusterfudge · 2016-02-29T18:15:03Z

So, this is definitely some helpful work! I think we'd want to have samples per language, maybe separated by folders. To really verify that this stuff works for Spanish, we'd need the unit tests translated to Spanish, and even better, localization work done on the unit tests so that the language stays the same, but they load different data files for different languages. That would give me high confidence that the language itself works with the tokenizer, but that may be an unrealistic goal. Can you try translating some of the engine tests?
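As a rough sketch of the "different data files for different languages" idea: the test logic below is written once, and only the per-language data changes. Everything here is hypothetical. `FIXTURES`, `find_known_words`, and `run_language_case` are made-up names, the vocabulary and utterances are illustrative rather than taken from the adapt test suite, and the whitespace-based matcher is a toy stand-in for the real engine. In practice the fixtures would more likely live in one file per language.

```python
# Hypothetical per-language fixtures (illustrative data only).
# In a real setup these might be loaded from one data file per language.
FIXTURES = {
    "en": {
        "vocab": {"tree", "house"},
        "utterance": "go to the tree house",
        "expected": ["tree", "house"],
    },
    "es": {
        "vocab": {"árbol", "casa"},
        "utterance": "ve a la casa del árbol",
        "expected": ["casa", "árbol"],
    },
}


def find_known_words(utterance, vocab):
    """Toy matcher: return the words of the utterance that appear in the vocabulary."""
    return [word for word in utterance.lower().split() if word in vocab]


def run_language_case(lang):
    """Run the same check for any language; only the data differs."""
    case = FIXTURES[lang]
    return find_known_words(case["utterance"], case["vocab"]) == case["expected"]


for lang in FIXTURES:
    print(lang, "passes" if run_language_case(lang) else "fails")
```

Adding another language would then mean adding a fixture entry (or data file) without touching the test logic.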

clusterfudge · 2016-02-29T18:15:09Z

thanks for contributing!

adocampo · 2016-03-03T08:24:32Z

> Can you try translating some of the engine tests?

Of course I can... could you please point me to the engines? I only saw this one, https://github.com/MycroftAI/adapt/blob/master/test/IntentEngineTest.py, and I doubt I can do anything with it...

clusterfudge · 2016-03-03T08:26:08Z

That would be the test I was referencing. Swapping out the vocabulary/utterances for Spanish equivalents would be acceptable to me, but completely unverifiable (as I only took about 2 years of Spanish, 20 years ago).

adocampo · 2016-03-03T09:05:10Z

Ok, I only translated the utterance sentence (line 36) and the two expressions "tree" (line 34) and "house" (line 43)
http://pastebin.com/PkZJ4Gmq

I don't know if this is what you need, and perhaps the utterance sentence can be translated into Spanish differently depending on whether it is imperative (as I've translated it), infinitive, or another tense...

Hope it helps!

drawveloper · 2016-05-21T10:26:27Z

Should I open a new issue for Portuguese?

clusterfudge added the ready label Mar 22, 2016

acidjunk mentioned this issue Apr 15, 2016

What would be needed for tranlation to Dutch? #13

Open

acidjunk mentioned this issue Aug 9, 2019

Tokenizer Internationalization - German #4

Open

clusterfudge removed the ready label Jun 23, 2021

clusterfudge added the Deferred post-1.0 label Sep 16, 2021
