Use bigram for spell checking #110

ierezell · 2021-06-02T13:59:24Z

Hi, first of all, thanks for this very nice piece of software!

I'm using the symspellpy port and it's working perfectly.

However, in some cases (in French, for example) I have chat messages like
"randé vs" instead of "rendez vous", or even "je suit" instead of "je suis".

The correct form is always in my bigram dictionary, while the misspelled one is not.

So I was thinking about checking the input against all bigrams, to get better spell checking than with single words from the unigram list alone. Is some similar behaviour already in SymSpell, or is it planned?
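To make the idea concrete, here is a minimal sketch of what "checking against bigrams" could look like. It is not SymSpell's or symspellpy's implementation: the helper names (`candidates`, `best_bigram`) and the toy counts are invented for illustration, and it uses a brute-force Levenshtein scan instead of SymSpell's delete-based index.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]


def candidates(word, unigrams, max_distance=1):
    """All dictionary words within max_distance edits of word."""
    return [w for w in unigrams if edit_distance(word, w) <= max_distance]


def best_bigram(first, second, unigrams, bigrams, max_distance=1):
    """Return the candidate pair with the highest bigram count, or None."""
    pairs = product(candidates(first, unigrams, max_distance),
                    candidates(second, unigrams, max_distance))
    scored = [(bigrams[pair], pair) for pair in pairs if pair in bigrams]
    return max(scored)[1] if scored else None


# Toy data: these counts are made up, not taken from any real dictionary.
UNIGRAMS = {"je": 900, "il": 800, "peut": 500, "peux": 300,
            "suis": 400, "suit": 100}
BIGRAMS = {("je", "peux"): 120, ("je", "suis"): 200, ("il", "peut"): 150}

print(best_bigram("je", "peut", UNIGRAMS, BIGRAMS))
print(best_bigram("je", "suit", UNIGRAMS, BIGRAMS))
```

With this toy data, "je peut" is a pair of valid unigrams but not a known bigram, so the pair is replaced by the nearest pair that does exist in the bigram table.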

Thanks again,
Have a wonderful day

wolfgarbe · 2021-06-02T14:21:52Z

SymSpell.LookupCompound should do exactly this. It uses the optional bigram dictionary (load with symSpell.LoadBigramDictionary) in order to use sentence level context information for selecting the best spelling correction for multiple input terms. But I haven't tested it for French.

ierezell · 2021-06-02T14:45:31Z

Hi @wolfgarbe, thanks a lot for the fast answer!

I did exactly that (symSpell.LoadBigramDictionary) with symspellpy (maybe the implementation differs?).
I created my own bigram dictionary from the Google n-grams (by the way, I can share the Python code if needed).

However, some chatbot sentences (really badly written) are not corrected properly.

Here is an example; I hope it helps.

je peut pas recevoir mes 3 enfants avec leurs enfants cecqui fait 3 bukbes perce wue ils sont plus que 8 pas logique ni justeo
        |                                               |             |       |   |
je peut pas recevoir mes 3 enfants avec leurs enfants ce qui fait 3 bulbes perce que ils sont plus que 8 pas logique ni juste
        |                                               |             |       |   |
        |                                               |             |   "perce" exists, "que" exists but "perce que" is 
        |                                               |             |   not a bigram in the dict it should be "parce que"
        |                                               |             |
        |                                               |       Not the right word, but that's OK, I will handle it with custom logic
        |                                          Perfect
      "je" and "peut" are valid unigrams but "je peut" is not a bigram, it should be "je peux" which is in the bigrams.

Thanks again for your time,

Have a great day

wolfgarbe · 2021-06-02T15:04:34Z

If you attach the French frequency dictionary and the bigram dictionary files to the issue in plain text format, I could have a look at what goes wrong (in SymSpell and/or the port).

ierezell · 2021-06-02T15:25:52Z

Here are the bigram and unigram dictionaries, obtained with the script below.
bigram.txt
unigram.txt

Note that the script's extension is .txt because GitHub doesn't allow posting .py files.
I kept only the most recent count for each unigram or bigram. I also limited the dictionaries to the 80,000 most frequent unigrams and the 160,000 most frequent bigrams.
google_ngrams.txt
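
For readers who do not want to open the attachment, the selection described above can be sketched roughly as follows. This is not the attached script. It assumes tab-separated rows of the form `ngram<TAB>year<TAB>match_count<TAB>volume_count`, which may differ from the n-gram export actually used, and the function names are hypothetical. The output lines follow the space-separated `term count` layout of SymSpell's frequency dictionaries.

```python
import heapq


def latest_counts(lines):
    """Keep, for each n-gram, the match count of its most recent year."""
    latest = {}  # ngram -> (year, match_count)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip malformed rows
        ngram, year, count = parts[0], int(parts[1]), int(parts[2])
        if ngram not in latest or year > latest[ngram][0]:
            latest[ngram] = (year, count)
    return {ngram: count for ngram, (_, count) in latest.items()}


def top_n(counts, n):
    """The n most frequent entries as (ngram, count), highest count first."""
    return heapq.nlargest(n, counts.items(), key=lambda item: item[1])


def to_dictionary_lines(entries):
    """Format entries as 'term count' (or 'word1 word2 count') lines."""
    return [f"{ngram} {count}" for ngram, count in entries]


# Invented sample rows, for illustration only.
rows = [
    "je suis\t2008\t50\t10",
    "je suis\t2019\t70\t12",
    "je peux\t2019\t40\t9",
]
print(to_dictionary_lines(top_n(latest_counts(rows), 160_000)))
```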

Thanks again for your help!

ierezell · 2021-06-02T23:03:47Z

Sorry to bump again, but I played with it more to make it work like the English version.

I tried putting spaces in random places and realized that my first version of the bigrams and unigrams was too loose, so I made it stricter: no bigrams of space + word or word + space, and only words of at least one character that are in a French dictionary.

Even with that I cannot get the example above to work, but it fixed most of the random-space and random-splitting errors.
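
Roughly, the stricter rule amounts to a filter like the sketch below. The name `is_clean_ngram` and the `vocabulary` set (standing in for a French word list) are assumptions for illustration, not the code actually used.

```python
def is_clean_ngram(ngram, vocabulary):
    """True if every space-separated token is non-empty and a known word.

    Splitting on a single space turns a leading, trailing, or doubled space
    into an empty token, which rejects 'space + word' and 'word + space'.
    """
    return all(word and word in vocabulary for word in ngram.split(" "))


def filter_ngrams(counts, vocabulary):
    """Drop every n-gram that fails the cleanliness check."""
    return {ngram: count for ngram, count in counts.items()
            if is_clean_ngram(ngram, vocabulary)}


# Invented sample data, for illustration only.
vocabulary = {"je", "suis", "peux"}
raw = {"je suis": 70, " suis": 12, "je ": 9, "je xyz": 3}
print(filter_ngrams(raw, vocabulary))
```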

Once everything is fixed, I will replicate this for the other languages of the Google n-grams and give you the files, so that this framework can support more languages built-in.

Have a great week

wolfgarbe · 2021-06-03T09:59:20Z

> Once everything is fixed, I will replicate this for the other languages of the Google n-grams and give you the files, so that this framework can support more languages built-in.

That's great.

I will try to figure out why your examples above do not work, and whether there is a way to improve SymSpell to support such cases. But that could take a few days, as I'm currently quite busy with other projects.

ierezell · 2021-06-03T13:29:14Z

Hi @wolfgarbe,

> I will try to figure out why your examples above do not work, and whether there is a way to improve SymSpell to support such cases. But that could take a few days, as I'm currently quite busy with other projects.

Don't worry, and thanks a lot for your time and dedication, it's really nice!

I will think a bit more about how to clean and collect data for other languages, such as Russian, which has one-character words, or Chinese...

Here is another related sentence:

"When will she arrive": "quand va-t-elle arriver" was written as "quand va telleariver" and corrected to "quand va telle river".
My problem is that the word "arriver" is more frequent than "river", so I thought it would be chosen instead.
Also, "elle arriver" is a bigram and "telle river" is not, so correcting to "quand va elle arriver" would be perfect.

Otherwise it is harder to recover the real sentence (with a language model, for example).

For now, SymSpell is the most complete spell checker for my needs, but I may add a phonetic or POS layer (chatbot text is really awfully spelled). Do you plan to add this kind of improvement?

Have a great day

wolfgarbe · 2021-06-05T09:31:59Z

> I may add a phonetic or POS layer. Do you plan to add this kind of improvement?

Implementing a weighted edit distance giving a higher rank to character pairs which are close to each other on the keyboard layout or which sound similar (e.g. Soundex or other phonetic algorithms which identify different spellings of the same sound) would certainly be a good improvement. But I don't think that I will find time to implement this near-term.

But there are at least two SymSpell ports that have already implemented a weighted edit distance:

https://github.com/MighTguY/customized-symspell
https://github.com/searchhub/preDict
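
As a rough illustration of the keyboard-weighted idea described above, here is a minimal sketch. It is not taken from SymSpell or from either of the ports listed. The adjacency table is a tiny hypothetical excerpt, the 0.5 cost is an arbitrary choice, and it uses plain Levenshtein rather than the Damerau-Levenshtein distance, so transpositions are not handled.

```python
# Hypothetical, deliberately tiny adjacency table (QWERTY neighbours).
# A real table would cover the whole layout in use, e.g. AZERTY for French.
ADJACENT = {frozenset("qw"), frozenset("kl"), frozenset("er"), frozenset("as")}


def substitution_cost(a, b, near_cost=0.5):
    """0 for identical characters, near_cost for neighbouring keys, else 1."""
    if a == b:
        return 0.0
    return near_cost if frozenset((a, b)) in ADJACENT else 1.0


def weighted_distance(source, target, near_cost=0.5):
    """Levenshtein distance with cheaper substitutions for adjacent keys."""
    prev = [float(j) for j in range(len(target) + 1)]
    for i, cs in enumerate(source, start=1):
        cur = [float(i)]
        for j, ct in enumerate(target, start=1):
            cur.append(min(
                prev[j] + 1.0,                                    # deletion
                cur[j - 1] + 1.0,                                 # insertion
                prev[j - 1] + substitution_cost(cs, ct, near_cost),
            ))
        prev = cur
    return prev[-1]


# "wue" is closer to "que" (neighbouring keys) than to "vue".
print(weighted_distance("wue", "que"), weighted_distance("wue", "vue"))
```

Under these assumptions a typo such as "wue" ranks "que" ahead of "vue", although both are one plain edit away.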

ierezell · 2021-06-29T17:19:56Z

Hi @wolfgarbe, sorry to bump this thread again... Do you have any news about the possible improvements to the bigram corrections?

I would be glad to help if some contribution is possible or needed.

Have a great day

wolfgarbe · 2021-06-30T07:21:19Z

Unfortunately, I have not yet found the time, but it is still on my mind.

ierezell · 2021-07-22T08:33:54Z

Hi @wolfgarbe,
I'm really sorry to bump again... I'm sure you have a lot on your hands, so could you point me to the right place in the code so that I can debug this and open a PR with a fix?

It is becoming urgent for me, so I will put some hours into it :)

Thanks in advance and thanks again for this great library !
Have a great day

ierezell · 2022-08-01T18:39:10Z

Hello @wolfgarbe, as I also raised the issue on symspellpy, we might have found where it comes from, and there could be a fix.

mammothb/symspellpy#107

ierezell mentioned this issue Jul 22, 2021

Correction not using bi-grams mammothb/symspellpy#92

Open
