JWST Pipeline Memory Leaks when run as a Subprocess #8404

Closed
TheSkyentist opened this issue Apr 2, 2024 · 4 comments

Comments

@TheSkyentist

I am running the Stage 1 Pipeline over ~1000 UNCAL images. To leverage multiple CPU cores, I am using the Python multiprocessing library to parallelize the operations, since each UNCAL image is independent.

However, when I do so, each Stage 1 Pipeline incurs sufficient memory leakage that by the end of running 1000 files, I have ~600GB of memory usage, forcing me to use high-memory nodes. This problem continues to scale the more files I have, appearing to incur ~500MB of leakage per file processed.

Here is a minimal working example:

# Import Packages
from multiprocessing import Process
from jwst.step import GroupScaleStep

# Detector 1 Pipeline Step
def cal(file):
    test = GroupScaleStep.call(file)
    return

# Run pipeline in parallel
if __name__ == '__main__':

    # Example file
    file = 'jw01571078001_03201_00001_nis_uncal.fits'

    # Spawn subprocesses
    p = Process(target=cal, args=(file,))
    p.start()
    p.join()
    p.close()

While this performs the expected behavior, it returns the warning:

UserWarning: resource_tracker: There appear to be 3 leaked semaphore objects to clean up at shutdown

And if this is run over 1000s of images, the memory leaks continue to pile up.

This only occurs if the relevant Pipeline step is imported at the top of the script (module level), rather than inside the function that runs in the subprocess. I.e. the following does not incur the same memory leakage:

# Import Packages
from multiprocessing import Process

# Detector 1 Pipeline Step
def cal(file):
    from jwst.step import GroupScaleStep #IMPORTED INSIDE FUNCTION
    test = GroupScaleStep.call(file)
    return

# Run pipeline in parallel
if __name__ == '__main__':

    # Example file
    file = 'jw01571078001_03201_00001_nis_uncal.fits'

    # Spawn subprocesses
    p = Process(target=cal, args=(file,))
    p.start()
    p.join()
    p.close()

It appears that when the JWST pipeline modules are copied to the new process, something funny is occurring that enables memory leaks. For now I am wrapping my JWST imports within the functions that are called by multiprocessing modules. Perhaps it should either be documented that this is how to avoid memory leaks or the root problem should be determined.

I have tested this problem on macOS (M2) and Linux (Rocky Linux, CentOS) and it appears in all cases.
Thanks to @jdavies-st for helping diagnose this issue.
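
The workaround described above, importing the jwst step inside the function that runs in the child process, can be sketched in a slightly more general form. This is a minimal sketch and not code from the thread: `run_deferred` and `run_all` are hypothetical helper names, and `maxtasksperchild=1` (which replaces each worker after a single task) is an added assumption about what may help cap per-worker memory, not something confirmed for this issue.

```python
import importlib
from multiprocessing import Pool


def run_deferred(task):
    """Worker: import the module here, in the child, then call the target."""
    module_name, attr_path, arg = task
    # The import happens inside the worker process, not in the parent.
    obj = importlib.import_module(module_name)
    # Walk a dotted path such as "GroupScaleStep.call".
    for name in attr_path.split("."):
        obj = getattr(obj, name)
    obj(arg)    # result is discarded; it may be large or not picklable
    return arg  # return only the input so the parent can track progress


def run_all(files, n_workers=4,
            module_name="jwst.step", attr_path="GroupScaleStep.call"):
    """Run the target once per file, one file per short-lived worker."""
    tasks = [(module_name, attr_path, f) for f in files]
    # Assumption: maxtasksperchild=1 with chunksize=1 retires each worker
    # after one file, so memory it held is returned to the OS on exit.
    with Pool(n_workers, maxtasksperchild=1) as pool:
        return pool.map(run_deferred, tasks, chunksize=1)
```

With jwst installed, `run_all(['jw01571078001_03201_00001_nis_uncal.fits'])` would call `GroupScaleStep.call` on each file without importing jwst in the parent process. It should be invoked under an `if __name__ == '__main__':` guard, as in the examples above.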

@braingram
Collaborator

Thanks for opening the issue and for sharing the minimal example.

What version of jwst are you using?

I tried to replicate this locally (macOS M1, jwst main) and I'm not seeing a memory leak for the following (which is slightly modified from the example you provided):

import os
import shutil
from jwst.step import GroupScaleStep

from multiprocessing import Pool

# Detector 1 Pipeline Step
def cal(file):
    test = GroupScaleStep.call(file)
    return

# Run pipeline in parallel
if __name__ == '__main__':
    # Example file
    file = 'jw01094001002_02107_00001_nis_uncal.fits'

    # make N copies of the file to run them in parallel
    N = 50
    files = []
    for i in range(N):
        dst = f"data/{i}_{file}"
        if not os.path.exists(dst):
            shutil.copyfile(file, dst)
        files.append(dst)

    with Pool(4) as pool:
        pool.map(cal, files)

Running the above, the memory usage climbs to ~240 MB per process (the size of the input file) and remains constant throughout the run. I do get the UserWarning: resource_tracker warnings you mentioned (they appear when run with main but not with #8343).
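
One way to check per-process numbers like these is to have each worker report its own peak resident memory. The sketch below uses the standard-library `resource` module (Unix only); `peak_rss_mb` and `with_memory_report` are hypothetical helpers, not part of jwst or of the script above.

```python
import os
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of the current process, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def with_memory_report(func, arg):
    """Run func(arg) in this process, then return (pid, peak RSS in MB)."""
    func(arg)
    return os.getpid(), peak_rss_mb()
```

Inside the `with Pool(4) as pool:` block, something like `pool.starmap(with_memory_report, [(cal, f) for f in files])` would return one `(pid, peak_mb)` pair per file. A peak that keeps rising for the same pid across files would point to growth inside the worker rather than in the parent.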

Is it possible the minimal example didn't capture the issue? Is it possible to share more of the code?

@TheSkyentist
Author

Thanks for the quick response!

I am pulling directly from the GitHub repo (jwst:1.14.1.dev2+gdd295809). Unfortunately, I believe the exact tag I am using no longer exists.

I may have been slightly hasty in diagnosing this issue. When I run the code you provided within an HPC cluster environment, the job reports hundreds of GB of memory use, even with just ~50 input files in parallel. This can cause the job to be cancelled when it exceeds the memory budget provided. I thought it was related to the leaking semaphore warning, but I have no proof of this.

I have also been able to see the same level of memory usage on macOS. Either CentOS is not freeing up resources related to the semaphores, or the issue is unrelated to the semaphores altogether. Or, as you suggested, there may be something related to how processes are created on different OSes (e.g. spawn vs fork vs forkserver).
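
To test whether the start method matters, the same workload can be run under an explicitly chosen method instead of the platform default (spawn on macOS since Python 3.8; historically fork on Linux). The following is a minimal sketch, not code from the thread; `methods_to_compare` and `run_with_start_method` are hypothetical helpers.

```python
import multiprocessing as mp


def methods_to_compare():
    """Start methods available on this platform, in a fixed order."""
    available = mp.get_all_start_methods()
    return [m for m in ("fork", "spawn", "forkserver") if m in available]


def run_with_start_method(method, func, items, n_workers=2):
    """Map func over items in a Pool created with the given start method."""
    # get_context returns an object with the multiprocessing API bound to
    # one start method, leaving the global default untouched.
    ctx = mp.get_context(method)
    with ctx.Pool(n_workers) as pool:
        return pool.map(func, items)
```

Under fork, children inherit whatever the parent has already imported; under spawn, each child starts a fresh interpreter and re-imports the main module. Calling, for example, `run_with_start_method("spawn", cal, files)` from under an `if __name__ == '__main__':` guard on both Linux and macOS would show whether memory use follows the start method or the operating system.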

I'm about to head on vacation, but will look more into this issue when I return. I am hoping that if the semaphore issue is solved, it will also solve my problem.

@TheSkyentist
Author

Closed as I believe the source of the problem is high memory usage for Stage 1 Pipeline processing. Perhaps related to #2144?

@jdavies-st
Collaborator

I suspect this is not related to #2144, as that is TSO data which by its nature has very large input files, and the solution there was to segment _uncal files into integration chunks.

That said, if there's a memory leak in Detector1Pipeline, it would be good to document it in this issue.
