Use bigram for spell checking #110
Comments
SymSpell.LookupCompound should do exactly this. It uses the optional bigram dictionary (load it with symSpell.LoadBigramDictionary) to bring sentence-level context information into the choice of the best spelling correction for multiple input terms. But I haven't tested it for French.
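To make the idea of sentence-level context concrete, here is a minimal Python sketch. It is not SymSpell's actual implementation: the toy dictionaries, their counts, and the helper names (`edit_distance`, `candidates`, `correct_pair`) are all hypothetical. It collects dictionary words close to each of two input terms and prefers the pair with the highest bigram count, which is how a valid but wrong word such as "suit" can still be corrected.

```python
from itertools import product

# Hypothetical toy frequency dictionaries (all counts are made up).
UNIGRAMS = {"je": 1000, "suis": 500, "suit": 300, "sus": 50}
BIGRAMS = {("je", "suis"): 400, ("il", "suit"): 80}


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def candidates(term, max_distance=1):
    """Dictionary words within max_distance edits of term."""
    return [w for w in UNIGRAMS if edit_distance(term, w) <= max_distance]


def correct_pair(first, second, max_distance=1):
    """Pick the candidate pair ranked by bigram count, then unigram counts."""
    pairs = product(candidates(first, max_distance) or [first],
                    candidates(second, max_distance) or [second])
    return max(pairs, key=lambda p: (BIGRAMS.get(p, 0),
                                     UNIGRAMS.get(p[0], 0) + UNIGRAMS.get(p[1], 0)))


print(correct_pair("je", "suit"))  # ('je', 'suis')
```

Ranking by bigram count first and unigram counts second is a simplification chosen for this sketch; a real scorer would also have to weigh the edit distance and handle unseen bigrams.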
Hi @wolfgarbe, thanks a lot for the fast answer! I did exactly that. However, some chatbot sentences (really, really bad writing) are not corrected properly. Here is an example, I hope it helps.
Thanks again for your time. Have a great day!
If you attach the French frequency dictionary and the bigram dictionary files to the issue in plain text format, I could have a look at what goes wrong (in SymSpell and/or the port).
Here are the bigram and unigram dictionaries, obtained with the script below. Note that the extension is Thanks again a lot for your help!
Sorry to bump again, but I played with it more to make it work like the English version. I tried putting spaces in random places and realized that my first version of the bigrams and unigrams was too loose, so I made it a bit stricter (no bigrams of space + word or word + space, and only words of at least one character that are in a French dictionary). Even with that I cannot get the example above to work, but it fixed most of the random-space/random-splitting errors. Once everything is fixed, I will replicate this for the other languages of the Google n-grams and give you the files so that this framework can support more languages built in. Have a great week!
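The stricter filtering described above could look roughly like the following Python sketch. The function name `filter_bigrams`, the input format (a mapping from a space-separated bigram string to its count), and the lowercasing step are assumptions made for illustration; the actual script used in this thread is not shown here.

```python
def filter_bigrams(bigram_counts, vocabulary):
    """Keep only bigrams made of exactly two non-empty, in-vocabulary words.

    bigram_counts: dict mapping "word1 word2" to a count (assumed format).
    vocabulary: set of lowercase words from a reference dictionary.
    """
    kept = {}
    for bigram, count in bigram_counts.items():
        parts = bigram.split(" ")
        # Rejects "word ", " word" and anything that is not exactly two tokens.
        if len(parts) != 2 or not all(parts):
            continue
        words = [p.lower() for p in parts]  # assumption: merge case variants
        if all(w in vocabulary for w in words):
            key = " ".join(words)
            kept[key] = kept.get(key, 0) + count
    return kept


vocabulary = {"rendez", "vous", "je", "suis"}
raw = {"rendez vous": 10, "Rendez vous": 5, " vous": 7,
       "je ": 3, "je suis": 9, "je xyz": 2}
print(filter_bigrams(raw, vocabulary))
```

Summing the counts of case variants is one possible choice; keeping them separate would work as well if the lookup is case-sensitive.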
That's great. I will try to figure out why your examples above do not work, and whether there is a way to improve SymSpell to support such cases. But that could take some days, as I'm currently quite busy with some other projects.
Hi @wolfgarbe,
Don't worry, and thanks a lot for your time and dedication, it's really nice! I will think a bit more about how to clean and collect data for other languages, like Russian, which has one-character words, or Chinese... Also another related sentence: "when will she arrive" : Else it's harder to retrieve the real sentence (with a language model, for example). For now SymSpell is the most complete spellchecker for my needs, but I will maybe add a phonetic or POS layer (chatbot text is really awfully spelled). Do you plan to have this kind of improvement? Have a great day!
Implementing a weighted edit distance that gives a higher rank to character pairs which are close to each other on the keyboard layout or which sound similar (e.g. Soundex or other phonetic algorithms that identify different spellings of the same sound) would certainly be a good improvement. But I don't think that I will find time to implement this near-term. There are, however, at least two SymSpell ports that have already implemented a weighted edit distance: https://github.com/MighTguY/customized-symspell
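As a rough illustration of the keyboard-based variant, here is a small Python sketch of a Levenshtein distance in which substituting a neighbouring key costs less than substituting a distant one. Everything in it is an assumption for the sake of the example: the QWERTY rows (French users would more likely need AZERTY), the grid approximation that ignores row stagger, the 0.5 cost, and the names `build_neighbors` and `weighted_distance`. It is not taken from SymSpell or from the port linked above.

```python
# Assumed layout: the three letter rows of a QWERTY keyboard.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]


def build_neighbors(rows):
    """Map each key to keys at most one row and one column away (grid approximation)."""
    pos = {ch: (r, c) for r, row in enumerate(rows) for c, ch in enumerate(row)}
    neighbors = {ch: set() for ch in pos}
    for a, (ra, ca) in pos.items():
        for b, (rb, cb) in pos.items():
            if a != b and abs(ra - rb) <= 1 and abs(ca - cb) <= 1:
                neighbors[a].add(b)
    return neighbors


NEIGHBORS = build_neighbors(ROWS)


def weighted_distance(a, b, near_cost=0.5):
    """Levenshtein distance where substituting adjacent keys costs near_cost."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [float(i)]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                sub = 0.0
            elif cb in NEIGHBORS.get(ca, ()):
                sub = near_cost   # likely a slip to a neighbouring key
            else:
                sub = 1.0
            cur.append(min(prev[j] + 1.0,       # deletion
                           cur[j - 1] + 1.0,    # insertion
                           prev[j - 1] + sub))  # substitution
        prev = cur
    return prev[-1]


print(weighted_distance("suid", "suis"))  # 0.5, 'd' and 's' are adjacent
print(weighted_distance("suid", "suit"))  # 1.0, 'd' and 't' are not
```

With such a distance, the typo "suid" ranks "suis" ahead of "suit", whereas a plain edit distance would treat both as equally close.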
Hi @wolfgarbe, sorry to bump this thread again... Do you have any news about possible improvements to the bigram corrections? I would be glad to help if a contribution is possible or needed. Have a great day!
Unfortunately, I have not yet found the time, but it is still on my mind.
Hi @wolfgarbe, it's becoming urgent for me, so I will put some hours into it :) Thanks in advance, and thanks again for this great library!
Hello @wolfgarbe, as I also raised the issue on symspellpy, we might have found where it came from, and there could be a fix.
Hi, first of all thanks for this very nice piece of software !
I'm using the symspellpy port and it's working perfectly.
However, in some cases (in French, for example) I have chat messages like "randé vs" instead of "rendez vous", or even "je suit" instead of "je suis". The latter is always in my bigram dictionary and the former is not.
So I was thinking about checking against all bigrams to get better spell checking than with single words from the unigram list alone, and I was wondering whether similar behaviour already exists in SymSpell or is planned.
Thanks again,
Have a wonderful day