Allow local models to be used in spacy_initialize() #180

Open
poldham opened this issue Nov 12, 2019 · 5 comments
Labels: documentation, enhancement

Comments

poldham commented Nov 12, 2019

I have a suggestion for an enhancement. It is becoming very easy to create domain-specific models in spaCy with the Matcher, the PhraseMatcher and the EntityRuler. Prodigy is also turning out to be really powerful for improving the out-of-the-box NER matching and for classification in specific domains.

At present spacyr allows the user to download models from spaCy (e.g. the en model). Bearing in mind the discussion on full model names in #177, would it be possible to add an arg to install a model from a local path? I took a look inside the code, but as far as I could tell the calls to install are made through commands to the sys(?).

My use case is that I used spacyr to extract noun phrases containing species names for the Antarctic and the Arctic from scientific and patent texts, in order to examine bioinnovation in the polar regions. However, that approach drops relevant results, so it was necessary to train a large model with an entity ruler etc. It would be good to use that model inside spacyr.

I also think we might see R users wanting to use other spaCy models, such as scispacy for biomedical text (https://github.com/allenai/scispacy), and the spaCy Universe suggests other models, such as for legal text classification (Blackstone). So I think adding this arg would open up a lot of flexibility for users.

Many thanks again for all the work on the package!

Collaborator

kbenoit commented Nov 12, 2019

Great idea! Should be pretty straightforward to implement.

Author

poldham commented Nov 12, 2019

Great. If you need help with testing, please let me know; I am happy to help.

Just some notes to assist if a new chunk is needed in the documentation at some point. Most of what an R user needs to know to edit the spaCy pipeline and train their own model is in Chapters Three and Four of Ines Montani’s free online course at: https://course.spacy.io/ .

When creating an EntityRuler for the pipeline I found the jsonlite package in R very useful for writing out the patterns (https://github.com/jeroen/jsonlite).
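For readers working on the Python side, the same step can be sketched with the standard library alone. This is a minimal sketch, assuming the one-JSON-object-per-line (JSONL) pattern file format that spaCy's EntityRuler can read; the SPECIES label, the example terms and the helper names write_patterns and read_patterns are hypothetical and only for illustration.

```python
import json
from pathlib import Path

# Hypothetical example patterns; the label and terms are illustrative only.
patterns = [
    # Phrase pattern: a plain string that is matched exactly.
    {"label": "SPECIES", "pattern": "Euphausia superba"},
    # Token pattern: one dict per token, here matched case-insensitively.
    {"label": "SPECIES", "pattern": [{"LOWER": "antarctic"}, {"LOWER": "krill"}]},
]


def write_patterns(patterns, path):
    """Write the patterns as JSONL: one JSON object per line."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as fh:
        for pattern in patterns:
            fh.write(json.dumps(pattern, ensure_ascii=False) + "\n")
    return path


def read_patterns(path):
    """Read a JSONL pattern file back into a list of dicts, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Reading the file back with read_patterns is a quick check that every line is valid JSON before the file is handed to spaCy.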

The spaCy pattern matcher tool is really useful for testing matches when writing patterns: https://explosion.ai/demos/matcher

It is really easy to attach a domain-specific vector space model to a new spaCy model. Details here: https://spacy.io/usage/vectors-similarity . I use fastText to do that as it is very easy to use ( https://fasttext.cc/ ). There are a couple of R packages for fastText but it seems simpler to just do it directly. The word vectors tutorial is here: https://fasttext.cc/docs/en/unsupervised-tutorial.html . Users may of course prefer other vector models (word2vec etc.).

Collaborator

amatsuo commented Nov 12, 2019

I think we need to give this a bit of thought.

For example, the methods provided by scispacy require a step to add a pipeline to the workflow. I haven't thought about doing it in spacyr. There should be an intuitive way to call that from R/spacyr but I am not sure what the function should look like.

kbenoit changed the title from "possible enhancement: path to local model" to "Allow local models to be used in spacy_initialize()" on Sep 1, 2022
Collaborator

kbenoit commented Sep 1, 2022

We don't plan to provide tools for modifying or training language models, but if a user has custom language models, we agree that spacyr should allow these to be used. Right now, using the model argument to specify a path to a local model does not seem to work.

I downloaded this from https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-3.4.0:

(base) KB-MBP-14:spacyr kbenoit$ ls -l ~/tmp/en_core_web_sm-3.4.0.tar.gz 
-rw-r--r--@ 1 kbenoit  staff  12803030  1 Sep 18:17 /Users/kbenoit/tmp/en_core_web_sm-3.4.0.tar.gz
> library("spacyr")
> spacy_initialize(model = "~/tmp/en_core_web_sm-3.4.0.tar.gz")
Found 'spacy_condaenv'. spacyr will use this environment
 Error in py_run_file_impl(file, local, convert) : 
OSError: [E050] Can't find model '~/tmp/en_core_web_sm-3.4.0.tar.gz'. It doesn't seem to be a Python package or a valid path to a data directory. 
> spacy_initialize(model = "~/tmp/en_core_web_sm")
Python space is already attached.  If you want to switch to a different Python, please restart R.
 Error in py_run_file_impl(file, local, convert) : 
OSError: [E050] Can't find model '~/tmp/en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.
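
One possible reading of these errors, offered as a guess and not a diagnosis: the E050 message asks for a Python package or a data directory, which a .tar.gz archive is not, and Python does not expand a leading ~ in a path by itself. The sketch below uses only the standard library to expand the path, unpack an archive if one is given, and look for a directory containing config.cfg, which is assumed here to mark a spaCy v3 data directory. The names resolve_local_model and find_model_dir are hypothetical and are not part of spacyr or spaCy.

```python
import tarfile
from pathlib import Path


def find_model_dir(root):
    """Return the first directory at or below root that holds config.cfg, else None."""
    # Assumption: a spaCy v3 data directory is recognisable by its config.cfg.
    for cfg in sorted(Path(root).rglob("config.cfg")):
        return cfg.parent
    return None


def resolve_local_model(model, extract_to):
    """Turn a user-supplied model path into an absolute data directory path."""
    # Expand a leading "~"; Python leaves it untouched otherwise.
    path = Path(model).expanduser()
    if path.is_file() and tarfile.is_tarfile(str(path)):
        target = Path(extract_to).expanduser()
        target.mkdir(parents=True, exist_ok=True)
        # Only unpack archives from a trusted source.
        with tarfile.open(str(path)) as archive:
            archive.extractall(str(target))
        path = target
    if not path.is_dir():
        raise FileNotFoundError(f"No file or directory at {path}")
    model_dir = find_model_dir(path)
    if model_dir is None:
        raise FileNotFoundError(f"No config.cfg found under {path}")
    return model_dir.resolve()
```

The returned absolute path is the kind of value that spacy.load() accepts in Python. Whether passing it to spacy_initialize(model = ...) resolves the errors above has not been tested here.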

MarkMcGlashan commented Nov 25, 2024

Hi. I wonder whether there might be any updates on this issue?

I would like to use the PyMUSAS RuleBasedTagger within spacyr, but I understand that this requires loading a separate spaCy pipeline and adding it to the main pipeline (relevant to the reply from @amatsuo above).
