Potential memory leak in SingleCell's `.merge_single_cells()` method #195
Comments
Thanks for documenting this @axiomcura - passing this over to @staylorx to see if he and his team can take a look.
Thank you @gwaygenomics and @axiomcura (really, well done on the docs). We'll take a look.
I've invited @d33bs to take a look, and he's looking into memory profiling now.
Hi @gwaygenomics and @axiomcura - as @staylorx mentioned, I'm looking into memory profiling using SQ00014613.sqlite and the code sample provided in this issue. When everything runs, something I'm noticing is that there are two executions of […]
Yep, this is correct. The naming of "right" vs. "left" might not be the best convention, and could actually add confusion. See here for an example of the […]
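For readers unfamiliar with the terminology, here is a small, hypothetical pandas example (not pycytominer's actual merge code; the tables and columns are made up) of where left/right naming and suffixes show up when two compartment tables are merged:

```python
# Illustrative only - not pycytominer's merge code. Shows how pandas'
# left/right terminology surfaces when two compartment tables are joined
# and why suffix naming can be confusing to read later.
import pandas as pd

cells = pd.DataFrame(
    {"ImageNumber": [1, 1], "ObjectNumber": [1, 2], "AreaShape_Area": [150, 210]}
)
cytoplasm = pd.DataFrame(
    {"ImageNumber": [1, 1], "ObjectNumber": [1, 2], "AreaShape_Area": [300, 420]}
)

# "cells" is the left table, "cytoplasm" the right; overlapping feature columns
# receive the suffixes below, so the reader has to remember which side was which.
merged = cells.merge(
    cytoplasm,
    on=["ImageNumber", "ObjectNumber"],
    how="inner",
    suffixes=("_left", "_right"),
)
print(merged.columns.tolist())
```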
Thank you for the clarification @gwaygenomics! In running a few tests regarding this issue, I found that one potential improvement to conserve memory revolves around changing a couple […]

Running the code which @axiomcura posted with the above modification seemed to improve (lower) memory consumption by around 7GB. That said, it's a bit of a tradeoff, as I also saw increased runtime durations of 8 or more minutes each (up from ~28 minutes to ~36 minutes or longer to complete). Would this tradeoff be something worthwhile to consider in the project?

I ran these tests using a GCP e2-highmem-8 machine (8 vCPU, 64 GB memory). I analyzed memory performance using Memray. Memray provides some helpful flamegraph charts which can narrow down performance issues. If you're interested in the results in flamegraph form, please take a look below. Note: these links unfortunately won't render as HTML directly from GitHub (they're converted to txt when attempting to view the raw contents).

Other Considerations

During my investigation of this issue, I noticed two other areas which were major draws on memory:

Pandas DataFrame Merges

There are multiple […]

Another consideration would be to use a Pandas-API-compatible dataframe library which is optimized for parallelization or in other ways. Modin, Mars, and Polars come to mind here as some options (in addition to Vaex mentioned below). These open the possibility to replace pandas for the library itself, or to replace pandas functionality in isolated/intensive functions. It's worth noting they will incur additional but different I/O if used in an isolated manner (many include SQLite reads from […]).
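As a concrete illustration of the Memray workflow mentioned above, here is a minimal, hypothetical sketch (the output file name and the profiled function are placeholders, not from this issue); the flamegraph HTML is then produced separately with the `memray flamegraph` CLI command:

```python
# Sketch of profiling a merge-heavy step with Memray's Python API.
# The file name and the stub function below are illustrative assumptions.
# After the run, `memray flamegraph sc_merge.bin` renders the flamegraph HTML.
from memray import Tracker

def merge_single_cells_stub():
    # Placeholder for the real pycytominer call being profiled.
    pass

with Tracker("sc_merge.bin"):
    merge_single_cells_stub()
```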
I think you've captured our issues well @d33bs - thanks! I'll respond point-by-point with my perspectives.

Modifying `pd.DataFrame.rename`
Probably, yes. Although, as you allude to in subsequent points, this is really just a band-aid for much larger issues.

Pandas merges
I think this is our best bet to resolve this issue without a major overhaul. One challenge here is that we're using three columns to merge most compartments ([…]).
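To make that concrete, here is a hedged sketch (not pycytominer's implementation; the key and feature column names are placeholders) of a multi-key compartment merge that trims each table to only the needed columns before merging, which keeps the intermediate memory footprint smaller:

```python
# Hypothetical sketch: merge two compartment tables on shared key columns,
# selecting only the required columns first so the merge holds less in memory
# than joining the full tables and dropping columns afterwards.
import pandas as pd

def merge_compartments(
    left: pd.DataFrame,
    right: pd.DataFrame,
    keys: list[str],
    feature_cols: list[str],
) -> pd.DataFrame:
    # Keep the merge keys plus only the features present in each table.
    left_slim = left[keys + [c for c in feature_cols if c in left.columns]]
    right_slim = right[keys + [c for c in feature_cols if c in right.columns]]
    return left_slim.merge(
        right_slim, how="inner", on=keys, suffixes=("_left", "_right")
    )
```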
As pycytominer is a pipeline tool, we are 100% on board with replacing pandas altogether. Version 0.1 and before required rapid development, and pandas was the right tool for the job at the time. It's no longer the right tool. However, this change might be beyond the scope of this specific issue. Of the various package options you listed, do you have a sense of how we should decide?

SQLite

I appreciate how you've broken down your response into these three core subtitles. You've really hit the nail on the head in terms of broader issues with the code base! Perhaps an even more dire need than replacing pandas is replacing SQLite. See some notes: cytomining/cytominer-database#96 and cytomining/cytominer-database#124. The solution for replacing SQLite likely resides in a separate python package: https://github.com/cytomining/cytominer-database or https://github.com/cytomining/cytominer-transport. Ideally, pycytominer should be able to handle multiple data input types. In addition to Vaex, I'd also like to point to https://github.com/scverse/anndata as another option; one that might be closer to our data-specific scientific needs.
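For context on the anndata suggestion, a small hypothetical sketch (not an existing pycytominer code path; all names are illustrative) of holding per-cell morphology features in an `AnnData` object, with metadata in `.obs` and the numeric feature matrix in `.X`:

```python
# Hypothetical sketch of an AnnData layout for single-cell morphology data.
# Names and values are illustrative assumptions, not pycytominer conventions.
import anndata as ad
import numpy as np
import pandas as pd

# Per-cell metadata (indexed by a string cell id) and a numeric feature matrix.
obs = pd.DataFrame(
    {"ImageNumber": [1, 1, 2], "ObjectNumber": [1, 2, 1]},
    index=["cell_0", "cell_1", "cell_2"],
)
var = pd.DataFrame(index=["Cells_AreaShape_Area", "Nuclei_AreaShape_Area"])
X = np.random.rand(3, 2)

adata = ad.AnnData(X=X, obs=obs, var=var)
print(adata)  # AnnData object with n_obs x n_vars = 3 x 2
```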
Next steps

Please consider my comments above, and then determine how you'd wish to proceed. I'd love to continue our discussion. We should be cognizant that this tool is used by other labs, and any discussions we have should be summarized back on GitHub. Thanks!
Thank you for the great feedback and thoughts @gwaygenomics! There's a ton of awesome work in this repo, including the Pandas data management aspects. The […]

There's a lot to consider with a dataframe management library change or migration. My feeling is that the choice would be best made based on the type of data involved and the environment in which pycytominer typically runs (resources, OS, etc.). That said, understanding what SQLite replacement(s) will be used will help navigate the change. For the purposes of this specific issue, I plan to make a PR with the […]

Further Thoughts on SQLite Replacement(s)

Appreciate you sharing the https://github.com/cytomining/cytominer-database and https://github.com/cytomining/cytominer-transport repos. Parquet can offer fantastic performance and compression of large datasets - it looks like there's some work already going on there towards this effect. https://github.com/scverse/anndata also looks interesting and may offer improvements here.

Many dataframe and other data management technologies are beginning to adopt Apache Arrow internally or as a deliverable. Because of Arrow's cross-compatibility and performance capabilities, it opens many options for data management which may not have been possible before. It also provides some flexibility in dependency selection now and in the future. Arrow might offer the capability to keep the data format compatible across various file-based and in-memory formats while navigating incremental change (see below).

```mermaid
flowchart LR
    sqlite[(SQLite Database)] --> |conversion required| arrow[Arrow Data]
    dataset[(Arrow-compatible\nDataset)] --> |direct| arrow[Arrow Data]
    arrow --> Pandas.DataFrame
    arrow --> Dataframe.X
    arrow --> Others["Others (language agnostic)"]
    Pandas.DataFrame --> |deprecate| Pycytominer.work
    Dataframe.X --> |future-facing\nfor performance| Pycytominer.work
    Others --> |integrations or\nadditional fxnality?| Pycytominer.work
```
Thinking about Arrow's possibilities here, and the SQLite file format more generally (a single-file RDBMS), made me think DuckDB might be useful here. This might balance portability with performance and the dependency flexibility I mentioned. I welcome your thoughts on all the above, and I'm excited to continue work towards these efforts!
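As a rough illustration of the DuckDB idea, a hedged sketch follows (it assumes DuckDB's sqlite extension can be installed in the environment and that a compartment table named `Cells` exists in the file; both are assumptions for illustration, not pycytominer code):

```python
# Sketch: read one compartment table out of a CellProfiler-style SQLite file
# with DuckDB, returning a pandas DataFrame (or an Arrow table via .arrow()).
import duckdb

con = duckdb.connect()
con.execute("INSTALL sqlite")
con.execute("LOAD sqlite")

# sqlite_scan(path, table) comes from DuckDB's sqlite extension.
cells = con.execute(
    "SELECT * FROM sqlite_scan('SQ00014613.sqlite', 'Cells')"
).df()

print(cells.shape)
```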
Thanks @d33bs - I feel you have a very strong grasp on several potential directions we could take, and perhaps the next step is to define two deprecation strategies, for pandas and SQLite. We and others currently use pycytominer version 0.1 in many projects, and in projects pre-0.1, we used good versioning hygiene with GitHub hashes. Therefore, once we define some deprecation strategies, we should be all set to begin this development on a […]

A couple other miscellaneous thoughts:
What would a helpful next step be?
Great points @gwaygenomics! I like the idea of working on a […]. Would it make sense to point #197 to this […]? I think, along with what's already been mentioned, we could focus on some general repo health enhancements as part of the […]. As next steps, how about the following?
My instinct is to work towards the SQLite file reads first, then work on the dataframe scalability. It may turn out that these are too tightly wound together to decouple, and maybe one will determine or resolve the other, or it would make sense to work the other way around.
I think it is ok to merge into master prior, so let's leave it as is. In regards to strategy, yes! I 💯 concur with the plan you've outlined. Three separate issues, SQLite first, determine how tightly wound SQLite and pandas are (my intuition is that they are not as tightly coupled as you might initially guess), and then start developing! 🛠️ Thank you!
Hi @gwaygenomics - I've begun adding some issues as per our conversations. I don't seem to have access to add branches to this repo - could I ask you, or someone who has access, to create the […]?
Great! I just sent an invite to collaborate on this repo. You should have access now.
@gwaygenomics Thank you! I now have the access I need for the branch.
It's a delight to read all the thoughts and ideas that have gone into this issue! Do I understand correctly that the plan here is to undertake the big overhaul (outlined in #195 (comment)), and that in turn will address this issue (with […])?

Can anyone comment on whether we already have a proposal for a stopgap solution to the original problem reported in this issue? If so, I can look around for help to implement that solution. I read through the issue, but I couldn't figure that out.
We do not currently have a stopgap solution, nor is anyone (to my knowledge at least) working on finding one. In other words, our only approach right now is to sidestep it with the big overhaul.
That sounds great!
Thanks for clarifying. @MarziehHaghighi had created a set of notebooks that capture her typical use cases for working with single-cell data. Specifically, this is how she loads all the single cell data from a SQLite file: […]

@MarziehHaghighi do you recollect how long it takes to load a typical ~10GB SQLite file, as well as the memory requirements, using […]?
I'm fairly confident that @bunnech's #219 solves the issue described in the title. However, I think that some notes here are worth preserving (somewhere other than a closed issue). Therefore, we could either: […]
@d33bs and/or @axiomcura - Do you have any intuitions here?
Hi @gwaybio - very excited about #219 (amazing work @bunnech)! I'd be in favor of closing the issue and continuing to reference it where appropriate. Below are some ideas for new issues based on the conversations here or in related PRs. Open to everyone's thoughts here.
Thanks for documenting all the separate threads discussed in this project @d33bs and @axiomcura - I've created six separate issues (and linked relevant ones). I think that it is safe to close this issue now, and reserve discussion on these topics in their respective threads.
Small update about the changes that were applied in #219: I ran a memory profile with the same code explained here to measure runtime and memory usage. Total runtime was 8 minutes for a 10 GB file:
Generated memory profile:
The resulting output from […]

Additional info

Machine Information
Thanks a lot for reporting this @axiomcura (PS - I love the perfect level of detail in your issues!). That's mind-blowing -- bravo (again) @bunnech 🎉! I'll head over to #205 for a related topic.
Currently the `.merge_single_cells()` method in the `SingleCell` class uses an extraordinarily high amount of memory for a 10GB file.

To replicate this: the data used in this example can be downloaded from here. Specifically, I have downloaded the smallest file, which is the `SQ00014613.sqlite` file.
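For reproduction, a minimal sketch of the kind of call being profiled follows (the constructor argument name and file path are assumptions, not taken verbatim from this issue, and the class may appear as `SingleCells` in current pycytominer releases; check the pycytominer docs for the exact signature):

```python
# Rough reproduction sketch: build a single-cell object from the downloaded
# SQLite file and run the memory-hungry merge. Argument names are assumptions.
from pycytominer.cyto_utils.cells import SingleCells

sc_p = SingleCells(sql_file="sqlite:///SQ00014613.sqlite")
merged_df = sc_p.merge_single_cells()
print(merged_df.shape)
```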
Memory benchmarking

To locate where the high memory consumption is occurring, I conducted a memory profile using memory_profiler. I copied the `.merge_single_cells()` source code and placed it into a script. Then, I replaced `self` with the `SingleCell` instance variable, which is `sc_p` in this example.

Below is the memory profile output: […]

The high consumption of memory is occurring within the `.load_compartment()` method (line 63), going from 298MB to 21792MB. Note that this profile was generated by only running one iteration (`exit()` statements at the bottom). The computer will start using all the memory when starting the next iteration.
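A minimal sketch of the memory_profiler setup described above (the profiled function here is a stand-in, not the copied `.merge_single_cells()` source):

```python
# Sketch of a line-by-line memory profile with memory_profiler. Running
# `python -m memory_profiler profile_merge.py` prints a per-line memory report
# like the one referenced in this issue. The body of run_merge() is a stand-in.
from memory_profiler import profile

@profile
def run_merge():
    # Stand-in for the copied .merge_single_cells() body with `self`
    # replaced by the sc_p instance variable.
    data = [list(range(100_000)) for _ in range(10)]
    return len(data)

if __name__ == "__main__":
    run_merge()
```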
Additional information

It seems several people have had similar issues using pandas to read database files:

- Add doc note on memory usage of read_sql with chunksize
- Loading SQL data into Pandas without running out of memory
- ENH: Memory usage of read_sql could be significantly reduced to ~25% of current memory usage (pandas-dev/pandas#40847)
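To illustrate the chunked-read idea from the first link above, a hedged sketch (the table name and chunk size are placeholders) of streaming a table out of the SQLite file in pieces rather than in one large read:

```python
# Illustrative only: stream rows from the SQLite file in chunks so that only
# one chunk is resident at a time, instead of materializing the whole table
# with a single read_sql call. Table name and chunk size are placeholders.
import sqlite3
import pandas as pd

conn = sqlite3.connect("SQ00014613.sqlite")

chunks = pd.read_sql_query("SELECT * FROM Cells", conn, chunksize=50_000)
total_rows = 0
for chunk in chunks:
    # Process/aggregate each chunk here rather than concatenating everything.
    total_rows += len(chunk)

conn.close()
print(total_rows)
```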
There is an article that explains potential methods for optimizing I/O and memory usage with Postgres files, which may be applicable to the SQLite files used in pycytominer.

Machine Information
@gwaygenomics