Some dictionaries are not compatible #34

Open
bartosz-antosik opened this issue Jan 13, 2017 · 4 comments

@bartosz-antosik

I figured out how to install dictionaries for other languages on my own, which turned out to be a waste of time, because you already give a quite detailed explanation of this in one of the issues. Never mind.

This led, however, to the conclusion that some dictionaries behave strangely.

Once you install the Polish dictionary (rename the files to pl_PL.aff and pl_PL.dic respectively and put them in the languages directory) and set "language" to "pl_PL" in the workspace's spellchecker.json, the following error message is displayed after reload:

"Extension host terminated unexpectedly. Please reload the window to recover."

I tried debugging, but it behaves strangely: the sub-window started to debug the extension disappears after some time, and there seems to be no information in the Debug Console. Breakpoints do not help either: they do catch execution during the extension's initialization, but they catch nothing around the time the window disappears.

I presume the Polish dictionary files are OK: they come from the GitHub repository you mentioned (https://github.com/wooorm/dictionaries/), and they also worked fine with Sublime Text's spell checker extension, which uses the same dictionary format.

I ran the same test with the French dictionary and it works fine.

Maybe you could have a look at this?

If there is anything I could do to help please let me know.

P.S. It is unrelated, I think, but the Spanish dictionary that comes with your extension seems to have some HTML at the top of each of its three files (es_ANY.aff, es_ANY.dic & es_ANY README.txt). I have no idea whether it disturbs its operation, but it seems strange compared to both English dictionaries (and the Polish dictionary as well).

@bartosz-antosik
Author

bartosz-antosik commented Feb 6, 2017

I went deeper with debugging on the above.

This extension uses the hunspell-spellchecker module (https://github.com/GitbookIO/hunspell-spellchecker). This module has a few inherent problems. It loads the whole dictionary into memory (into an associative table, a.k.a. a dictionary, or an object to be precise). It uses the affixes (.aff file) to create ALL variants of the words found in the dictionary (.dic file) and then stores them in memory.

  1. It takes a lot of time;
  2. It takes a lot of memory;
  3. Memory consumption causes hunspell-spellchecker (and in turn the extension) to crash with dictionaries for languages that have a richer inflection system.

Ad. 2. When running with the English dictionary ("en_US") that comes with the extension, memory consumption peaks at 500 MB and stays constantly above 250 MB. Far too much.
Ad. 3. It crashes with the Polish dictionary after memory consumption reaches about 1.5 GB (there are reports about other dictionaries as well), with a "JavaScript heap out of memory" message.
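
To make the cost of eager expansion concrete, here is a minimal sketch of the strategy described above. It is a toy model, not the actual hunspell-spellchecker code: the rule table, the flag "A", the sample entries and the function names are all hypothetical, and real .aff files carry conditions, prefixes and cross-products that are ignored here.

```typescript
// Toy model of eager affix expansion. All names and data are hypothetical
// and only illustrate the idea; this is not the module's implementation.
type ToySuffixRule = { strip: string; add: string };

// Hypothetical rule table: flag -> suffix rules, loosely modelled on an .aff file.
const toyRules: Record<string, ToySuffixRule[]> = {
  A: [
    { strip: "", add: "s" },
    { strip: "", add: "ed" },
    { strip: "", add: "ing" },
  ],
};

// Expand one "stem/FLAGS" entry into every surface form it can produce.
function expandEntry(entry: string): string[] {
  const slash = entry.indexOf("/");
  const stem = slash === -1 ? entry : entry.slice(0, slash);
  const flags = slash === -1 ? "" : entry.slice(slash + 1);
  const forms = [stem];
  for (const flag of flags.split("")) {
    for (const rule of toyRules[flag] ?? []) {
      if (stem.endsWith(rule.strip)) {
        forms.push(stem.slice(0, stem.length - rule.strip.length) + rule.add);
      }
    }
  }
  return forms;
}

// Eager strategy: materialise every form of every entry up front.
function buildEagerTable(entries: string[]): Set<string> {
  const table = new Set<string>();
  for (const entry of entries) {
    expandEntry(entry).forEach((form) => table.add(form));
  }
  return table;
}

// Three dictionary entries already yield nine stored forms.
const eagerTable = buildEagerTable(["walk/A", "talk/A", "cat"]);
```

The table grows with the number of stems multiplied by the number of forms each stem can take, so a language where a single stem carries many affix rules inflates the table far more than English does.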

There are other JavaScript implementations of spell checkers that use hunspell dictionaries (like Typo.js, https://github.com/cfinke/Typo.js/), but they share the same problem (see its issues list about the Portuguese language). There are a few nice JavaScript bindings to hunspell native binaries which use the dictionaries in a more sophisticated way (like node-spellchecker, used by the Atom editor, https://github.com/atom/node-spellchecker), but they are native and are not supported in an elegant way in VS Code.
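
One way to use a dictionary without storing every form is to keep only the stems and strip affixes at lookup time. The sketch below illustrates that idea under simplifying assumptions; it is not taken from node-spellchecker or hunspell, and the rules, stems and names are hypothetical.

```typescript
// Toy model of lazy lookup by suffix stripping. All names and data are
// hypothetical; a real .aff/.dic pair is far richer (conditions, prefixes).
type LazySuffixRule = { flag: string; strip: string; add: string };

const lazyRules: LazySuffixRule[] = [
  { flag: "A", strip: "", add: "s" },
  { flag: "A", strip: "", add: "ed" },
  { flag: "A", strip: "", add: "ing" },
];

// Only the stems are stored: stem -> flags it accepts.
const stems = new Map<string, string>([
  ["walk", "A"],
  ["talk", "A"],
  ["cat", ""],
]);

// Check a word by undoing each suffix rule and looking up the candidate stem.
function isCorrect(word: string): boolean {
  if (stems.has(word)) return true;
  for (const rule of lazyRules) {
    if (rule.add.length === 0 || !word.endsWith(rule.add)) continue;
    // Remove the added suffix and restore whatever the rule stripped.
    const stem = word.slice(0, word.length - rule.add.length) + rule.strip;
    const flags = stems.get(stem);
    if (flags !== undefined && flags.indexOf(rule.flag) !== -1) return true;
  }
  return false;
}
```

Here memory stays proportional to the number of stems, at the price of a few extra lookups per checked word.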

It seems that there is no good resolution for the above and so far VS Code cannot have a decent spell checker!

P.S. The Spanish dictionary contained in the extension IS faulty in the sense described above. It does get parsed into the associative array but the HTML part also gets parsed so the dictionary used by the extension is full of HTML elements treated as words to spell check...
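
A cheap sanity check before loading could catch files like these. The sketch below assumes the usual Hunspell convention that a .dic file opens with an (approximate) entry count on its first line; the function names are hypothetical and the HTML test is a rough heuristic, not a full parser.

```typescript
// Returns true when the text plausibly starts like a Hunspell .dic file.
function looksLikeDic(text: string): boolean {
  // Drop a byte order mark, then find the first non-empty line.
  const lines = text.replace(/^\uFEFF/, "").split(/\r?\n/);
  const first = lines.find((line) => line.trim().length > 0);
  if (first === undefined) return false;
  // Assumption: the first line holds only the entry count.
  return /^\d+$/.test(first.trim());
}

// Returns true when the text contains something that looks like an HTML tag.
function containsHtml(text: string): boolean {
  return /<\/?[a-zA-Z][^>]*>/.test(text);
}
```

For example, `looksLikeDic("3\nwalk/A\ntalk/A\ncat\n")` holds, while a file that begins with an HTML page header fails the first check and trips `containsHtml`.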

@wooorm

wooorm commented Mar 6, 2017

Have you tried nspell? I made it to work with those dictionaries, and to my knowledge it’s the most complete JS-only spell checker for Node!

@bartosz-antosik
Author

bartosz-antosik commented Mar 6, 2017

Yes, I did. I have described it in more detail here:

microsoft/vscode#20266

@swyphcosmo
Owner

> P.S. The Spanish dictionary contained in the extension IS faulty in the sense described above. It does get parsed into the associative array but the HTML part also gets parsed so the dictionary used by the extension is full of HTML elements treated as words to spell check...

@bartosz-antosik Thanks so much for this info. I'm sorry that I never read it closely enough until today to see the problems with the Spanish dictionary. I just fixed it in #79. I'll be releasing a new version soon.

I haven't caught up with microsoft/vscode#20266 completely, but there seems to be a lot of awesome conversation over there. When I originally wrote this extension, I hoped for a native solution to avoid the huge memory footprint of some languages, but shipping binaries was not easy at the time. I've been meaning to migrate this from hunspell to nspell for all of its improvements and the language bundles, but I never made the time to do it. I started doing less document writing in my day-to-day, so the current implementation worked well enough for me and, it seems, for many others. I'll try to look into the conversion again soon. There are a couple of other open suggestions that nspell already has solutions for and that the hunspell library does not support, e.g. adding a word to the dictionary rather than ignoring it.
