Processing of large images #1034

honzee · 2023-08-04T06:53:48Z

honzee
Aug 4, 2023

Hi,

I am also wondering, how would you approach analyzing large images, e.g. images generated from multiplexed imaging platforms. Our images have 40 000 x 40 000 pixels with ~30 channels. Do you think the pixie clustering/annotation approach could be applied?

Best,
Jan

cliu72 · 2023-08-04T16:44:01Z

cliu72
Aug 4, 2023
Collaborator

Hi Jan, we have used datasets of a similar size with no problem. How many pixels you can use to train the SOM depends on how much RAM your machine has. You can decrease the subset_proportion in the pixel clustering notebook to an appropriate amount for your machine. I would recommend testing with a smaller number of pixels and tracking the RAM usage on your machine to get an idea of how much your machine can handle.
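
For a rough sense of scale, the size of the subsampled pixel matrix can be estimated with simple arithmetic. The sketch below is only an illustration: the function name is hypothetical, it assumes 8 bytes per value (float64), and it counts only the subsampled matrix, not the memory needed to load and preprocess the full image.

```python
def estimate_subset_gb(height, width, n_channels, subset_proportion, bytes_per_value=8):
    """Rough size in GB of the subsampled pixel-by-channel matrix for one image."""
    n_pixels = height * width
    n_sampled = round(n_pixels * subset_proportion)
    return n_sampled * n_channels * bytes_per_value / 1e9


# Image size from the question above: 40,000 x 40,000 pixels, 30 channels
for prop in (0.1, 0.01, 0.001):
    size_gb = estimate_subset_gb(40_000, 40_000, 30, prop)
    print(f"subset_proportion={prop}: ~{size_gb:.2f} GB per image")
```

Measured RAM usage on a small test run remains the more reliable guide, since intermediate copies are not captured by this estimate.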

1 reply

honzee Aug 5, 2023
Author

Thank you for your helpful answer, I will try it out.

FloWuenne · 2023-08-09T07:20:21Z

FloWuenne
Aug 9, 2023

Hi Ark team,

we want to use Pixie on a highly-multiplexed imaging dataset from the Lunaphore COMET platform (Sequential Immunofluorescence). Our images are similar to what @honzee described, about 35k x 30k pixels but we only have 12 channels. Our current cohort consists of 9 such images, but we have other cohorts that are even bigger.

We are currently trying to run pixel clustering with 1% subsetting for SOM building, but it fails during preprocessing with an out-of-memory error when requesting 240 GB. I am currently rerunning it with 500 GB and will see if this works.

slurmstepd: error: Detected 1 oom-kill event(s) in StepId=1294683.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
============================= JOB FEEDBACK =============================
Job ID: 1294683
Cluster: helix
User/Group: hd_gr294/hd_hd
State: OUT_OF_MEMORY (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:31:05
CPU Efficiency: 58.54% of 00:53:06 core-walltime
Job Wall-clock time: 00:26:33
Memory Utilized: n/a
Memory Efficiency: 74.95% of 240.00 GB

Are there any other ways in the pipeline to reduce the memory footprint or speed up the calculations? I tried using multiprocessing, but that created some errors based on the multiprocess pool on our Slurm cluster (I am not very familiar with using multiprocess on SLURM).

Thanks for your help!

Best,
Florian

8 replies

ngreenwald Aug 10, 2023
Maintainer

Got it, thanks for the info. If you could add a couple print statements to your script to log which steps are succeeding, and when it fails, it will help us diagnose which are the memory-inefficient steps.
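
One possible way to do this kind of logging is a small context manager that prints when each step starts and ends, together with the peak memory of the process so far. This is a sketch under assumptions: `log_step` is a hypothetical helper, it relies on the Unix-only `resource` module, and it assumes Linux, where `ru_maxrss` is reported in kilobytes.

```python
import resource
import time
from contextlib import contextmanager


@contextmanager
def log_step(name):
    """Print when a step starts and ends, with elapsed time and peak RSS so far."""
    # flush=True so the line reaches the job log even if the process is killed later
    print(f"[start] {name}", flush=True)
    t0 = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - t0
        # ru_maxrss is in kilobytes on Linux (bytes on macOS)
        peak_gb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1e6
        print(f"[end]   {name}: {elapsed:.1f} s, peak RSS so far ~{peak_gb:.2f} GB", flush=True)


with log_step("placeholder step"):
    data = [0] * 1_000_000  # stand-in for a real pipeline step
```

If the job is killed by the out-of-memory handler, the last `[start]` line without a matching `[end]` line points to the step that was running.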

I think a fairly straightforward step would be to crop your images into smaller pieces and process them separately. Since the pipeline operates on a per-image basis, having 4 images that are each 1/4th the size would require 1/4th the memory.

As a longer-term fix, we can look into the specific steps that you identify as being the most memory inefficient and look into optimizing them. If it is indeed the pre-processing that is causing the job to fail, there's likely a lot of low-hanging fruit in there, since we've only done optimization on the training with the subsetting proportion.
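
To make the cropping arithmetic concrete, here is a minimal sketch that computes tile boundaries for a grid of crops. The function name is hypothetical, and the image is assumed to be 30,000 rows by 35,000 columns to match the sizes mentioned above.

```python
def tile_bounds(height, width, n_rows, n_cols):
    """Return (row_start, row_end, col_start, col_end) for an n_rows x n_cols grid of tiles."""
    row_edges = [round(i * height / n_rows) for i in range(n_rows + 1)]
    col_edges = [round(j * width / n_cols) for j in range(n_cols + 1)]
    return [
        (row_edges[i], row_edges[i + 1], col_edges[j], col_edges[j + 1])
        for i in range(n_rows)
        for j in range(n_cols)
    ]


bounds = tile_bounds(30_000, 35_000, 2, 2)
for r0, r1, c0, c1 in bounds:
    print(f"rows {r0}:{r1}, cols {c0}:{c1} -> {(r1 - r0) * (c1 - c0):,} pixels")
```

Each tuple can be used directly as slice bounds when cropping the image array.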

FloWuenne Aug 11, 2023

I had already put print statements in, and the key problem was indeed the preprocessing step in this case. Simply dialing up the RAM of the node worked, but that won't be feasible for larger datasets at some point.

I have since tried to run the same dataset with 5% and 10% subsampling, but got different errors (core dumped and segmentation fault). These also usually relate to memory issues, but I am wondering why SLURM reports them differently.

A few general comments and questions for datasets of this size, which will become more frequent as people get new equipment like CODEX, Lunaphore, Rarecyte, etc.:

1. Large, multi-channel OME-TIFFs are quite big (the size of our dataset or larger). If people have a cohort of images that they want to train and apply their SOM on, it seems to become infeasible around the size of dataset that we have.
   Question: Is it possible to use a collection of subset ROIs of images, representative of the dataset, to train the SOM and then project the full images onto the trained SOM?
2. Even if my suggestion above works, the preprocessing of the full images seems to cause issues with memory. What would be some steps to mitigate this, besides subsetting images into smaller FOVs?
3. We noticed that Pixie only ignores completely black pixels. We have some tissues that are smaller circles in a larger FOV (because Lunaphore acquires a large square, larger than our tissue). In these instances, we have now opted for manually annotating tissue regions and masking the rest as black pixels. Does Pixie have a max projection filter across channels or something similar? Otherwise, this might make sense for filtering low-information / background pixels.
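
A max projection filter of the kind described in point 3 could be sketched outside of Pixie as follows. This is an illustration only: `mask_background` and the threshold value are hypothetical, and the image is assumed to be a NumPy array of shape (channels, height, width).

```python
import numpy as np


def mask_background(image, threshold):
    """Zero out pixels whose maximum intensity across channels is at or below threshold.

    image: array of shape (channels, height, width)
    Returns the masked image and the boolean keep-mask of shape (height, width).
    """
    max_proj = image.max(axis=0)      # brightest channel value per pixel
    keep = max_proj > threshold
    return image * keep[np.newaxis, :, :], keep


# Toy example: a bright "tissue" square on a nearly empty background
image = np.zeros((3, 4, 4), dtype=np.uint16)
image[:, 1:3, 1:3] = 100
image[0, 0, 0] = 2                    # faint noise in one channel
masked, keep = mask_background(image, threshold=5)
print(keep.sum(), "of", keep.size, "pixels kept")
```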

As a general comment, I think it would be really worthwhile to consider how Pixie can scale to these larger datasets that will come out in the next year :).

Thanks for the help again Noah and great work with Pixie by the way, we really do like the approach, despite my extensive comments 😁!

cliu72 Aug 11, 2023
Collaborator

Hi Florian,

Thank you for your thoughts on this! Dealing with larger datasets is certainly something that we have been thinking about.

We discussed this issue during our team meeting today, and we think the simplest approach is to first crop the image into smaller chunks, and then stitch them back together at the end to generate a pixel phenotype map the size of the original image. Since Pixie doesn't utilize spatial information when training the SOM, cropping shouldn't make a difference. I have opened an issue for it here: #1038. In the next few weeks, our computational staff will work on creating a Jupyter notebook that crops the image into smaller chunks, which can then be run through Pixie normally, and then stitches the resulting pixel phenotype maps back together. In theory, a user will be able to use this notebook before Pixie, run the pipeline as is, and then use the notebook once again afterwards. If you would like a quick solution, I think this is the way to go. You are free to try writing something that does this yourself if you would like to test it out. Otherwise, we can let you know once our version has been created so you can test it.
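
The crop-then-stitch idea can be sketched with NumPy as below. This is not the notebook described above: all function names are hypothetical, the image is assumed to be an in-memory array of shape (channels, height, width), and `fake_pixel_clusters` is a stand-in for running Pixie on a tile.

```python
import numpy as np


def crop_tiles(image, tile_size):
    """Split a (channels, height, width) array into tiles keyed by (row, col) offset."""
    _, height, width = image.shape
    return {
        (r, c): image[:, r:r + tile_size, c:c + tile_size]
        for r in range(0, height, tile_size)
        for c in range(0, width, tile_size)
    }


def stitch_labels(label_tiles, height, width):
    """Reassemble per-tile 2D label maps into one (height, width) map."""
    full = np.zeros((height, width), dtype=np.int32)
    for (r, c), labels in label_tiles.items():
        h, w = labels.shape
        full[r:r + h, c:c + w] = labels
    return full


def fake_pixel_clusters(tile):
    # Stand-in for Pixie: label each pixel by its brightest channel (no spatial info used)
    return tile.argmax(axis=0).astype(np.int32) + 1


rng = np.random.default_rng(0)
image = rng.random((3, 10, 7))
tiles = crop_tiles(image, tile_size=4)
label_tiles = {key: fake_pixel_clusters(tile) for key, tile in tiles.items()}
stitched = stitch_labels(label_tiles, 10, 7)

# Because the labelling is purely per-pixel, tiling does not change the result
assert np.array_equal(stitched, fake_pixel_clusters(image))
```

Edge tiles are simply smaller than `tile_size`, so the image dimensions do not need to be a multiple of the tile size.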

To respond to your comments:

1. It is certainly possible - this is in spirit very similar to what we currently do in the pipeline. We take a subset of the pixels from each image to train the SOM and then project the full images onto the trained SOM. Taking a subset of ROIs could certainly be possible, but would be a longer-term project.
2. As I mentioned above, I think cropping is the fastest/most straightforward option. We discussed some other ways to do this, for example utilizing packages that can read in patches of images (and not the full image). However, this would require more extensive changes to the code, so we have opted for the cropping strategy. We will continue to brainstorm ways to handle large datasets.
3. In the newest version of Pixie, we added functionality to filter out any pixels that are below a certain threshold, where the threshold is determined using a percentile cutoff (default is 0.05) of the total signal across all channels for each FOV (relevant code is here: https://github.com/angelolab/ark-analysis/blob/main/src/ark/phenotyping/pixel_cluster_utils.py#L61-L104). This percentile cutoff could be tuned for your use case. Happy to discuss other options too.
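
The general idea of a percentile cutoff on total signal can be illustrated with a few lines of NumPy. This sketch is not the ark-analysis implementation (see the linked code for that): the function name is hypothetical, and it assumes that the cutoff of 0.05 is a fraction, i.e. the 5th percentile.

```python
import numpy as np


def filter_low_signal(image, percentile=0.05):
    """Flag pixels whose total signal across channels exceeds the given quantile.

    image: array of shape (channels, height, width)
    percentile: treated here as a fraction in [0, 1] (an assumption)
    Returns a boolean keep-mask of shape (height, width) and the cutoff value.
    """
    total = image.sum(axis=0, dtype=np.float64)   # total signal per pixel
    cutoff = np.quantile(total, percentile)
    return total > cutoff, cutoff


# Toy FOV: two identical channels with increasing intensity
channel = np.arange(100, dtype=np.float64).reshape(10, 10)
image = np.stack([channel, channel])
keep, cutoff = filter_low_signal(image, percentile=0.05)
print(f"cutoff={cutoff:.2f}, kept {keep.sum()} of {keep.size} pixels")
```

Raising the percentile removes more of the dim pixels, which is the tuning knob referred to above.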

FloWuenne Aug 14, 2023

Hi Candace,

thank you so much for the detailed response and insight into some of your design decisions!

I think your suggestion to crop images and stitch them back together later is, as you say, probably the easiest and best solution for the short term. We had been thinking about this as well, but managed to get Pixie running on full images with some small tweaks:

1. As mentioned above, since we didn't know this filtering functionality existed in the newest version, we set all pixels that are not on our tissue to 0. This did boost speed and performance a lot, since in our case, our tissue only makes up about 40% of our image.
2. Visualizing our pixel classification results in Mantis unfortunately is still quite slow, even though the pixel classification maps are only about 30 MB in size. We are loading them without segmentation and without other channels, since the actual fluorescence channels are about 2.7 GB per channel. Maybe implementing pyramids or tiles in Mantis would make sense in the long run, to enable loading larger pixel maps alongside channels and segmentations.
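
As an illustration of what a pyramid for a pixel classification map could look like, the sketch below builds progressively smaller copies by taking every second pixel. The function name and `min_size` parameter are hypothetical, and nothing here is specific to Mantis.

```python
import numpy as np


def label_pyramid(labels, min_size=256):
    """Return a list of progressively 2x-subsampled copies of a 2D label map."""
    levels = [labels]
    while min(levels[-1].shape) > min_size:
        # Taking every second pixel (nearest neighbour) keeps cluster IDs valid,
        # whereas averaging would create meaningless intermediate labels
        levels.append(levels[-1][::2, ::2])
    return levels


labels = np.zeros((1000, 800), dtype=np.uint8)
pyramid = label_pyramid(labels, min_size=256)
print([level.shape for level in pyramid])
```

A viewer that supports multiscale data could then load only the level matching the current zoom.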

We will continue experimenting with Pixie on our larger datasets and keep you posted. If I have some time, I might put together the CLI scripts in a nice way so that other people with large image data can utilize Pixie via the CLI on their HPCs.

Looking forward to what features you will bring to Pixie in the future!
Best,

Florian

cliu72 Aug 14, 2023
Collaborator

Hi Florian,

I'm glad you got it working! Thanks for letting us know about Mantis. Unfortunately, the organization that was developing Mantis (PICI) went through some organizational changes lately so we're not sure how often Mantis is being updated these days. We haven't used napari much but perhaps we will need to move towards it for larger images.

That would be awesome if you could share your CLI scripts! Thanks for sharing your experiences with Pixie - it will be super helpful in determining which features we will work on next.

FloWuenne · 2023-09-19T09:46:42Z

FloWuenne
Sep 19, 2023

Hi Pixie team,
congrats on the recent publication, good to see the method getting some recognition!

We have recently had a lot of internal discussion about a few key aspects of Pixie that limit our usage of it. Some of them were already mentioned above by @honzee and me. I will summarize the main aspects here again to make it easier to follow. I will preface this by saying that I really like the approach you have established in Pixie, and it has performed great on some of the data we have tested it on. However, here are the major features limiting our usage of it:

1. There is no implemented CLI version of Pixie that lets us easily integrate it into current workflows (like MCMICRO). Yes, I did write a very simple one, but the large parameter space makes this very time-consuming. Also, since I don't really know which functions might change in the future, it might in turn need maintenance and updating.
2. Large datasets are challenging to analyze with Pixie due to the high amounts of RAM required. As dataset sizes scale up for technologies other than MIBI, this might limit the usage of Pixie by the community. Are you working on ways to mitigate this (using different representations of the data via Dask or other frameworks, implementing approaches to perform clustering on subsets and merge these, etc.)?
3. Mantis viewer breaks for larger images and thus faces the same challenge as 2). I know you said that Mantis viewer isn't being actively maintained as much. Any plans to create Napari plugins or something similar? (Napari is becoming the tool to visualize larger images and 3D images, I would say, at the moment.)
4. Manual metaclustering is implemented really well using the GUI components, but it still requires manual interaction. Which approaches did you try to automatically merge the metaclusters into a predefined number of clusters?

The main reason I am asking these questions is to see what you have tried in the past to address some of these challenges, and whether it would be worth directing effort into implementing some of them.

Thanks again and cheers! 😊

4 replies

ngreenwald Sep 19, 2023
Maintainer

Thanks for the succinct summary!

1. Yes, I imagine this would be quite useful for integration with pipeline managers. For our own purposes we generally don't run Pixie in this format, but if you/your team would be interested in putting something together, we'd be happy to accept a PR or a separate repo. That's how the Mesmer CLI Docker is set up.
2. Yes, the very large WSI datasets definitely present a challenge. We already do a substantial amount of downsampling during the clustering process itself, since the extra pixels aren't useful for training. I don't think there'd be a need to do any merging of clusters, since we generate the same clusters across all of the data. The current challenge with large images is opening them in the first place to perform the downsampling. For large images, this is memory prohibitive, as you mentioned. We (meaning @alex-l-kong) are looking into Dask for piecewise loading of the image data, but this is still in the very early stages.
3. We haven't looked into Napari, but agreed regarding its takeover of Python! Mantis is currently used for manual inspection of the results, followed by updating the clustering, i.e., it's not actually linked to our pipeline in an automated way. As a result, it should be fairly straightforward to replace the Mantis-based visualization with Napari-based visualization. This is another area where if someone from your team is interested in helping out, it would certainly speed things up.
4. I'll let @cliu72 answer this one. She did a ton of work on this for the paper.
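
To illustrate what piecewise loading followed by downsampling could look like (point 2), the sketch below reads an image strip by strip and keeps a random subset of pixels from each strip. It swaps in a different technique from the one mentioned above: it uses NumPy memory mapping on an uncompressed `.npy` file rather than Dask on TIFF data, and the function name and file layout are hypothetical.

```python
import os
import tempfile

import numpy as np


def subsample_in_chunks(path, chunk_rows, proportion, seed=0):
    """Sample a proportion of pixels from a (channels, height, width) .npy file,
    reading only chunk_rows image rows into memory at a time."""
    image = np.load(path, mmap_mode="r")          # memory-mapped, not loaded yet
    n_channels, height, _ = image.shape
    rng = np.random.default_rng(seed)
    samples = []
    for r in range(0, height, chunk_rows):
        chunk = np.array(image[:, r:r + chunk_rows, :])   # load one strip
        pixels = chunk.reshape(n_channels, -1).T          # (n_pixels, channels)
        n_keep = round(pixels.shape[0] * proportion)
        idx = rng.choice(pixels.shape[0], size=n_keep, replace=False)
        samples.append(pixels[idx])
    return np.concatenate(samples, axis=0)


# Toy example: write a small random image to disk, then subsample 10% of its pixels
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "image.npy")
    np.save(path, np.random.default_rng(1).random((3, 100, 50)))
    sampled = subsample_in_chunks(path, chunk_rows=25, proportion=0.1)
print(sampled.shape)
```

Peak memory is then governed by the strip size rather than the full image size, at the cost of requiring an on-disk format that supports partial reads.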

Related to the issues with large datasets, we're currently doing the final troubleshooting of a deep-learning plugin that will integrate with Pixie for cell classification. We're rolling it out to beta testers. If you're interested in trying it out, we'd definitely welcome your feedback. This would only support cell classification, not the pixel-based analysis.

cliu72 Sep 21, 2023
Collaborator

Thanks for the suggestions! Always nice to hear from users.

Regarding number 4, we tried a bunch of methods for automated metaclustering, but in the end, every method required at least some amount of manual adjustment, which motivated us to create the GUI. Since some manual adjustment seemed to be necessary, we decided on consensus hierarchical clustering + GUI, as it was fast and allowed for manual curation easily. These are some of the methods we tried for automated metaclustering:

- K-means
- Using spatial information - for each of the SOM clusters, we calculated which other clusters were more often located next to it and used that information for metaclustering
- Using correlation - merging clusters that were most correlated with each other
- Using the SOM nodes - merging clusters that were closer in the SOM grid
- Determining the top 1-3 expressing markers for each cluster and merging those that had the same
- Iterative clustering - re-clustering pixels that were in the largest hierarchical clusters in an iterative way until we reached the "cleanest" clusters (defined by number of markers expressed in the cluster over a certain z score)
- Various data transformations for each of the above methods - ln, log2, log10, arcsinh, binarization, quantile normalization, etc.
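
As an illustration of one of these ideas, grouping clusters by their top-expressing markers could be sketched as follows. This is one possible reading of the approach, not the code used for the paper; the function name, marker names, and expression values are all made up for the example.

```python
from collections import defaultdict


def group_by_top_markers(cluster_means, top_n=2):
    """Group clusters that share the same set of top_n highest-expressing markers.

    cluster_means: {cluster_id: {marker: mean_expression}}
    Returns {frozenset_of_markers: [cluster_ids]}.
    """
    groups = defaultdict(list)
    for cluster_id, means in cluster_means.items():
        top = sorted(means, key=means.get, reverse=True)[:top_n]
        groups[frozenset(top)].append(cluster_id)
    return dict(groups)


# Made-up mean expression per cluster
cluster_means = {
    1: {"CD3": 0.9, "CD4": 0.8, "CD8": 0.1, "CD20": 0.0},
    2: {"CD3": 0.7, "CD4": 0.9, "CD8": 0.2, "CD20": 0.1},
    3: {"CD3": 0.8, "CD4": 0.1, "CD8": 0.9, "CD20": 0.0},
    4: {"CD3": 0.0, "CD4": 0.1, "CD8": 0.0, "CD20": 0.9},
}
groups = group_by_top_markers(cluster_means, top_n=2)
for markers, clusters in groups.items():
    print(sorted(markers), "->", clusters)
```

Cluster 4 in this toy example ends up defined partly by a marker it barely expresses, which hints at why a fixed top-N rule tends to need manual adjustment.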

If you have other ideas for automated metaclustering that you try out and find work well, we'd love to hear about it!

FloWuenne Feb 19, 2024

Hi @cliu72,

sorry for the long silence. I haven't actively used Pixie since applying it to our large dataset, but still had the thought of optimizing the algorithm for really large datasets even further and trying to automate metaclustering.

I was just wondering whether you had tried consensus clustering as an approach for metaclustering after generating 100 metaclusters?
Something like: https://bioconductor.org/packages/release/bioc/html/ConsensusClusterPlus.html
Sorry if this approach is included in one of the methods you mentioned above. I just wanted to see whether this was something you had tried with Pixie data.
Cheers!

cliu72 Feb 26, 2024
Collaborator

Hi @FloWuenne!

Yes, we actually do currently use consensus clustering for metaclustering. We use a Python implementation of the R package that you linked: https://github.com/angelolab/ark-analysis/blob/main/src/ark/phenotyping/cluster_helpers.py#L415-L566

I'm glad that you're still thinking of optimizing Pixie! Any ideas are much appreciated.
