tfidf fitting much slower than expected #335
Hi. Code looks fine. Can you try a single process?
edit: the parallel one has now been going for 2 hours. It seems broken.

Just ran it. It took about 11 minutes on a single thread. Running the parallel version again; more than 20 minutes so far and still going. I forgot to add: when I run the parallel tokenizer, I get the following warnings every few seconds while it's running:
And earlier, when I interrupted R:
This means that workers (the processes that handle chunks of the input data) are dying for some reason and never deliver the results of their jobs. You might need to investigate why this happens.
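As a sanity check, swapping the parallel iterator for the serial one isolates the problem (a minimal sketch, assuming the input lives in a character vector called `texts`; `itoken` is the single-process counterpart of `itoken_parallel`):

```r
library(text2vec)

# Serial tokenization: one process, so no workers can silently die.
it <- itoken(texts, tokenizer = word_tokenizer, progressbar = FALSE)

vocab      <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
dtm        <- create_dtm(it, vectorizer)
```

If this finishes while the parallel run hangs, the workers themselves are the problem rather than the tf-idf fitting.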
Hi! I came across this package because I have a dataset of ~2 million text sequences (each <500 chars long), and I wanted faster performance than sklearn's tf-idf vectorizer while I play with different configurations. Sklearn's vectorizer is single-threaded and written in Python.
It takes about 5 minutes to vectorize and transform with sklearn in Python:
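Roughly like this (a minimal sketch of the call being described, assuming default TfidfVectorizer settings; `texts` is a placeholder for the 2M sequences):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the vocabulary and IDF weights, then produce the sparse tf-idf matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # scipy sparse matrix, ~2M rows
```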
I can see in top that this is only using a single thread.
With text2vec (I hope I'm using it right! I tried to follow the example at http://text2vec.org/vectorization.html#tf-idf):
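Along the lines of the tutorial's pipeline (a sketch, not the exact snippet; assuming `itoken_parallel` with a doParallel backend registered, and `texts` holding the 2M sequences):

```r
library(text2vec)
library(doParallel)
registerDoParallel(4)  # four worker processes

# Parallel tokenization over chunks of the input.
it <- itoken_parallel(texts, tokenizer = word_tokenizer, n_chunks = 8)

vocab      <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)
dtm        <- create_dtm(it, vectorizer)

# tf-idf reweighting, as in the tutorial.
tfidf     <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)
```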
I've left it running on an AWS instance. I can see in top that 4 threads are going, but they've been running much, much longer than 5 minutes; I eventually had to kill the process. If I work on a smaller subset of a few thousand articles, it works fine.
Am I missing something? Or do I just lack patience? Thanks for your help.