tfidf fitting much slower than expected #335
Hi. Code looks fine. Can you try a single process?
edit: the parallel one has now been going for 2 hours. It seems broken.

Just ran it. It took about 11 minutes on a single thread. Running the parallel version again; more than 20 minutes so far and still going. I forgot to add: when I run the parallel tokenizer, I get the following warnings every few seconds while it's running:
And earlier, when I interrupted R:
This means that workers (the processes that handle chunks of the input data) are dying for some reason and never deliver the results of their jobs. You might need to investigate why this happens.
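As a sanity check, swapping the parallel iterator for the serial one isolates the problem (a minimal sketch, assuming the input lives in a character vector called `texts`; `itoken` is the single-process counterpart of `itoken_parallel`):

```r
library(text2vec)

# Serial tokenization: one process, so no workers can silently die.
it <- itoken(texts, tokenizer = word_tokenizer, progressbar = FALSE)

vocab      <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
dtm        <- create_dtm(it, vectorizer)
```

If this finishes while the parallel run hangs, the workers themselves are the problem rather than the tf-idf fitting.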
Hi! I came across this package because I have a dataset of ~2 million text sequences (each <500 chars long), and I wanted faster performance than sklearn's tf-idf vectorizer while I play with different configurations. Sklearn's vectorizer is single-threaded and written in Python.
It takes about 5 minutes to vectorize and transform with sklearn in Python:
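Roughly like this (a minimal sketch of the call being described, assuming default TfidfVectorizer settings; `texts` is a placeholder for the 2M sequences):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the vocabulary and IDF weights, then produce the sparse tf-idf matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # scipy sparse matrix, ~2M rows
```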
I can see in top that this is only using a single thread.
With text2vec (I hope I'm using it right! I tried to follow the example at http://text2vec.org/vectorization.html#tf-idf):
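Along the lines of the tutorial's pipeline (a sketch, not the exact snippet; assuming `itoken_parallel` with a doParallel backend registered, and `texts` holding the 2M sequences):

```r
library(text2vec)
library(doParallel)
registerDoParallel(4)  # four worker processes

# Parallel tokenization over chunks of the input.
it <- itoken_parallel(texts, tokenizer = word_tokenizer, n_chunks = 8)

vocab      <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
dtm        <- create_dtm(it, vectorizer)

# tf-idf reweighting, as in the tutorial.
tfidf     <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)
```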
I've left it running on an AWS instance. I can see in top that 4 threads are going, but they've been running much, much longer than 5 minutes; I eventually had to kill the process. If I work on a smaller subset of a few thousand articles, it works fine.
Am I missing something? Or do I just lack patience? Thanks for your help.