
memory problem in Python interface with Mesh object #952

Open
the-hampel opened this issue Jul 31, 2024 · 2 comments

Comments

the-hampel (Member) commented Jul 31, 2024

A simple Python script that builds nested objects holding copies of triqs mesh objects shows a memory problem in the Python layer (probably not a memory leak) of TRIQS 3.3.x / unstable compared to 3.2.x!

Details

Consider the following script (it uses only triqs, psutil, and standard-library modules):

from copy import deepcopy
import os
import psutil
from triqs.gf.meshes import MeshImFreq

def process_memory():
    # resident set size (RSS) of this process in bytes
    process = psutil.Process(os.getpid())
    mem_info = process.memory_info()
    return mem_info.rss

class SumkDFT():
    def __init__(self, mesh):
        self.mesh = mesh

class Solver():
    def __init__(self, sum_k):
        self.sum_k = deepcopy(sum_k)

def cycle():
    mesh = MeshImFreq(beta=40.0, statistic='Fermion', n_iw=10025)
    # mesh = np.linspace(0.0, 1.0, 10025)  # numpy array instead of a mesh (requires numpy); avoids the issue
    sum_k = SumkDFT(mesh=mesh)
    solver = Solver(sum_k)
    return

# now loop many times and call cycle(); every time a Solver object is created the memory increases!
print('mem in MB\n')
for j in range(200):
    for i in range(1000):
        cycle()
    print(f'{process_memory()/1024**2:.2f}')

Running this with triqs 3.2.x and 3.3.x gives vastly different memory footprints:

[attached plot: memory footprint over iterations, triqs 3.2.x vs 3.3.x]

A few more observations:

  • removing the deepcopy call removes the problem, so the issue seems to come from copying a Python object that holds the mesh
  • using the commented-out numpy array instead of the mesh also removes the problem, so it is specific to triqs
  • memory is monitored here with psutil, but the numbers match what you can observe with top
  • the memory consumption clearly scales with the mesh size, so the mesh is the object causing the growth

compiler info

  • clang 16
  • MKL
  • Python 3.10 and 3.11

It would be great if someone else could verify this. The problem is pretty bad for larger objects holding many mesh objects. As you can guess from the naming of the mock classes here, the problem originally showed up in triqs/solid_dmft as a severe memory issue, making nodes run out of memory when dealing with larger objects.

Alex

the-hampel (Member, Author) commented Jul 31, 2024

I just noticed that this actually seems to be a problem with deepcopy itself, i.e. even this much simpler script shows the same strange memory behavior:

from copy import deepcopy
import os
import psutil
from triqs.gf.meshes import MeshImFreq

def process_memory():
    # resident set size (RSS) of this process in bytes
    process = psutil.Process(os.getpid())
    mem_info = process.memory_info()
    return mem_info.rss

def cycle():
    mesh = MeshImFreq(beta=40.0, statistic='Fermion', n_iw=10025)
    mesh_2 = deepcopy(mesh)
    return

# now loop many times and call cycle(); every time the mesh is deep-copied the memory increases!
print('mem in MB\n')
for j in range(200):
    for i in range(1000):
        cycle()
    print(f'{process_memory()/1024**2:.2f}')

I guess I should generally avoid using deepcopy (the mesh object has its own copy function), but I still have to identify where this happens in my original code.
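
For reference, a minimal sketch of that workaround, assuming the mesh's own copy method is simply called copy() (the method name is assumed here, not verified):

from triqs.gf.meshes import MeshImFreq

def cycle():
    mesh = MeshImFreq(beta=40.0, statistic='Fermion', n_iw=10025)
    # copy via the mesh's own method instead of deepcopy; this should
    # bypass the h5-based (de-)serialization path that deepcopy triggers
    mesh_2 = mesh.copy()
    return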

the-hampel (Member, Author) commented Aug 2, 2024

The issue has been identified. The problem lies in the creation of attributes as variable-length strings in the hdf5 library, which triqs uses for the (de-)serialization of objects, e.g. when calling deepcopy, mpi.bcast, etc. There seems to be a memory leak in hdf5 versions 1.12.3 and 1.14.x that has not yet been reported upstream. The issue can be avoided by using hdf5 version 1.10.11 or older. Independently of the leak, the de-serialization of tuple objects via h5 is horribly slow; this was introduced when triqs 3.2.x switched from boost serialization to h5. @Wentzell added a fix on the test branch https://github.com/TRIQS/triqs/tree/DEV_SERIALIZATION that reverts some of these changes for simple tuples without requiring boost, giving tremendous speed improvements (a factor of 10 or more) over the current version.

We are currently preparing an issue for the hdf5 library, but for now it is safer to avoid the newer hdf5 versions!
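
As a rough sketch of how one could probe this independently of triqs (not from the issue; the file and attribute names below are made up), repeatedly creating and deleting a variable-length string attribute via h5py while watching the RSS should show whether a given hdf5 build leaks:

import os
import psutil
import h5py

def rss_mb():
    # resident set size of this process in MB
    return psutil.Process(os.getpid()).memory_info().rss / 1024**2

with h5py.File('vlen_attr_probe.h5', 'w') as f:
    print('mem in MB\n')
    for j in range(200):
        for i in range(1000):
            # h5py stores Python strings as variable-length strings by default
            f.attrs['tag'] = 'a variable-length string attribute'
            del f.attrs['tag']
        print(f'{rss_mb():.2f}')

If the reported numbers grow steadily on 1.12.3 / 1.14.x but stay flat on 1.10.11, that would point at the library rather than at triqs.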
