
memory leak #18

Open
SolskGaer opened this issue Jan 30, 2018 · 4 comments
Comments

@SolskGaer

When passing a long text, around 20 MB, the script quickly runs out of the computer's memory. My machine has 16 GB of RAM, and it takes about 5 minutes to freeze.

@romanovzky

I notice this as well: a huge memory leak that does not improve even after invoking Python's garbage collector.

@fbarrios
Contributor

Hi! I will look into this, but the TextRank algorithm is really designed to summarise article-length texts. I don't know whether results from a 20 MB input can be trusted.

@romanovzky

Hi @fbarrios,
I've noticed the memory leak when summarising or extracting keywords from many small documents. It seems to retain things in memory that it shouldn't (previous graphs, maybe?).
Try it yourself: take a large corpus of small articles (news articles, for example) and extract keywords from all of them, and you'll very quickly see memory usage jump past 16 GB. A rough reproduction is sketched below.
Cheers
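
A rough reproduction sketch of what I mean (the `summa` import path and the use of `psutil` for measuring resident memory are assumptions on my side, not anything from this repo):

```python
# Rough reproduction: run keywords() over many small documents and watch RSS grow.
# Assumes the summa and psutil packages are installed; adjust imports if your
# local checkout differs.
import gc
import os

import psutil
from summa import keywords  # assumed package layout

process = psutil.Process(os.getpid())

docs = [f"Article number {i} talks about topic {i % 20} in a few short sentences. " * 40
        for i in range(5000)]

for i, doc in enumerate(docs):
    keywords.keywords(doc)
    if i % 500 == 0:
        gc.collect()  # forcing collection does not reclaim the memory for me
        rss_mb = process.memory_info().rss / 1024 ** 2
        print(f"after {i} documents: {rss_mb:.0f} MB resident")
```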

@Panaetius

Panaetius commented Feb 20, 2019

Profiling with cProfile when the issue occurs for me, execution seems to be stuck in graph.py:159:edge_weight(), with a huge number of calls. The relevant edge_weight call is in pagerank_weighted.py:50:build_adjacency_matrix().
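
Something along these lines is enough to collect such a profile (the `summa` import path and the input file name are assumptions, not from this repo; any sufficiently large text will do):

```python
# Sketch: profile a single keywords() call and print the top entries by cumulative time.
import cProfile
import pstats

from summa import keywords  # assumed package layout

with open("large_input.txt") as f:  # placeholder input file
    text = f.read()

cProfile.runctx("keywords.keywords(text)", globals(), locals(), "keywords.prof")
pstats.Stats("keywords.prof").sort_stats("cumulative").print_stats(20)
```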

My guess is that this is due to there being too many sentences and/or too many connections between sentences: since building the adjacency matrix is O(n^2) in the number of sentences, it can blow up when there are a lot of them.

A quick and dirty fix is to limit the number of sentences to some maximum in the keywords() method (see the sketch below), though of course you then won't be analysing the whole text anymore.
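
Done outside the library, the cap can look roughly like this (the naive sentence split and the `summa` import path are assumptions for illustration; a proper fix would reuse summa's own sentence splitting):

```python
# Sketch of the quick fix: cap the number of sentences fed into keywords().
# The naive split on ". " is only for illustration; anything past the cap is dropped.
from summa import keywords  # assumed package layout

MAX_SENTENCES = 500  # arbitrary cap, tune to your memory budget

def keywords_capped(text, max_sentences=MAX_SENTENCES):
    sentences = text.split(". ")
    truncated = ". ".join(sentences[:max_sentences])
    return keywords.keywords(truncated)
```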

Switching over to a sparse adjacency matrix and estimating likely candidate pairs, as described at https://cran.r-project.org/web/packages/textrank/vignettes/textrank.html (in the MinHash section), might make it feasible for larger documents. But I don't know your codebase or the TextRank algorithm well enough to implement it myself easily.
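
Very roughly, the idea from that vignette could look like the sketch below: only sentence pairs that an LSH index flags as likely similar get scored, and the weights go into a sparse matrix instead of a dense n x n one. datasketch and scipy are not dependencies of this repo, and the Jaccard estimate here just stands in for summa's actual edge weight.

```python
# Sketch of the MinHash idea: index sentences with LSH, score only candidate pairs,
# and store the result sparsely instead of building a dense adjacency matrix.
from datasketch import MinHash, MinHashLSH
from scipy.sparse import lil_matrix

def sparse_adjacency(sentences, threshold=0.3, num_perm=64):
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    signatures = []
    for i, sentence in enumerate(sentences):
        m = MinHash(num_perm=num_perm)
        for token in sentence.lower().split():
            m.update(token.encode("utf8"))
        lsh.insert(str(i), m)
        signatures.append(m)

    matrix = lil_matrix((len(sentences), len(sentences)))
    for i, m in enumerate(signatures):
        for key in lsh.query(m):  # candidate neighbours only, not all pairs
            j = int(key)
            if i != j:
                matrix[i, j] = m.jaccard(signatures[j])  # estimated similarity
    return matrix.tocsr()
```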
