Allow local models to be used in spacy_initialize() #180
Comments
Great idea! Should be pretty straightforward to implement.
Great. If you need help with testing please let me know, as I'm happy to help. Just some notes to assist if a new chunk is needed in the documentation at some point. Most of what an R user needs to know to edit the spaCy pipeline and train their own model is in Chapters Three and Four of Ines Montani's free online course at https://course.spacy.io/. When creating an EntityRuler for the pipeline, I found the spaCy pattern matcher demo really useful for testing matches when writing patterns: https://explosion.ai/demos/matcher. It is also really easy to attach a domain-specific vector space model to a new spaCy model; details here: https://spacy.io/usage/vectors-similarity. I use fastText to do that as it is very easy to use (https://fasttext.cc/). There are a couple of R packages for fastText but it seems
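For anyone wanting to experiment from R before any spacyr support lands, here is a minimal sketch that drives spaCy directly through reticulate. It assumes spaCy v3 and en_core_web_sm are installed in the active Python environment; the entity label, pattern, and output path are placeholders, not anything from spacyr's API.

# add an EntityRuler to an existing pipeline and save the result to disk,
# so the saved directory could later serve as a local custom model
spacy <- reticulate::import("spacy")
nlp <- spacy$load("en_core_web_sm")
ruler <- nlp$add_pipe("entity_ruler", before = "ner")
ruler$add_patterns(list(
  list(label = "SPECIES", pattern = "Euphausia superba")
))
nlp$to_disk("my_custom_model")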
I think we need to give this a bit of thought. For example, the methods provided by
We don't plan to provide tools for modifying or training language models, but if a user has custom language models, we agree that spacyr should allow these to be used. Right now this does not seem to be working. I downloaded the model from https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-3.4.0:

(base) KB-MBP-14:spacyr kbenoit$ ls -l ~/tmp/en_core_web_sm-3.4.0.tar.gz
-rw-r--r--@ 1 kbenoit staff 12803030 1 Sep 18:17 /Users/kbenoit/tmp/en_core_web_sm-3.4.0.tar.gz

> library("spacyr")
> spacy_initialize(model = "~/tmp/en_core_web_sm-3.4.0.tar.gz")
Found 'spacy_condaenv'. spacyr will use this environment
Error in py_run_file_impl(file, local, convert) :
OSError: [E050] Can't find model '~/tmp/en_core_web_sm-3.4.0.tar.gz'. It doesn't seem to be a Python package or a valid path to a data directory.
> spacy_initialize(model = "~/tmp/en_core_web_sm")
Python space is already attached. If you want to switch to a different Python, please restart R.
Error in py_run_file_impl(file, local, convert) :
OSError: [E050] Can't find model '~/tmp/en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
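A possible workaround, offered only as a rough sketch: on the Python side, spacy.load() expects either an installed package name or a path to the extracted model data directory (the folder containing config.cfg and meta.json), not the .tar.gz archive itself, and the "~" in the path may not be expanded before it reaches Python. Something along these lines might work; the nested directory layout is an assumption about the archive contents and may need adjusting.

# extract the archive, then point spacy_initialize() at the model *data* directory
tarball <- path.expand("~/tmp/en_core_web_sm-3.4.0.tar.gz")
untar(tarball, exdir = path.expand("~/tmp"))
# the nesting below is a guess at how the archive unpacks
model_dir <- path.expand("~/tmp/en_core_web_sm-3.4.0/en_core_web_sm/en_core_web_sm-3.4.0")
spacy_initialize(model = model_dir)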
Hi. I wonder whether there might be any updates on this issue? I would like to use the PyMUSAS RuleBasedTagger within spacyr.
I have a suggestion for an enhancement. It is becoming very easy to create domain-specific models in spaCy with the Matcher, the PhraseMatcher and the EntityRuler. Prodigy is also turning out to be really powerful for improving the out-of-the-box NER matching and for classification in specific domains.
At present spacyr allows the user to download models from spaCy (e.g. the en model). Bearing in mind the discussion on full model names in #177, would it be possible to add an argument to install a model from a local path? I took a look inside the code, but as far as I could tell the calls to install are made through system commands(?).
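To make the request concrete, here is a rough sketch of what this could look like from R. The local-path call is hypothetical and not part of the current spacyr API, the model name and paths are placeholders, and the workaround at the end assumes the custom model has been packaged as a pip-installable archive that can go into spacyr's conda environment.

# current usage: download a published model by name
spacy_download_langmodel("en_core_web_sm")
spacy_initialize(model = "en_core_web_sm")

# hypothetical usage, if a local-path argument were added (not current API):
# spacy_download_langmodel("~/models/my_custom_model-1.0.0.tar.gz")

# possible workaround today: pip-install the local archive into the conda
# environment spacyr uses, then load the model by its package name
reticulate::conda_install(
  envname = "spacy_condaenv",
  packages = path.expand("~/models/my_custom_model-1.0.0.tar.gz"),
  pip = TRUE
)
spacy_initialize(model = "my_custom_model")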
My use case is that I used spacyr to extract noun phrases containing species names for the Antarctic and the Arctic from scientific and patent texts, to examine bioinnovation in the polar regions. However, that approach drops relevant results, so training a large model with an entity ruler etc. was necessary. It would be good to use that model inside spacyr.
I also think we might see R users wanting to use other spaCy models, such as those for biomedical text (https://github.com/allenai/scispacy), and the spaCy universe suggests further models, such as Blackstone for legal text classification. So I think adding this argument would open up a lot of flexibility for users.
Many thanks again for all the work on the package!