Legate doesn't recognize/use GPUs on other nodes #949
Comments
Running on the latest 24.06 release, with some additional debugging output, I can confirm that this is indeed the case (I'm testing on a machine with 2 GPUs, running 1 vs 2 ranks, with 1 GPU per rank):
In the first case the entire domain …, I'm pretty sure.
Currently, Legion (the underlying technology that Legate is built on) splits its memory reservation between two pools: the "deferred" pool, used for allocating objects whose size is known ahead of time (e.g. most cuNumeric arrays), and the "eager" pool, used for allocations whose size is only known once a task actually runs. The `--eager-alloc-percentage` flag controls what percentage of the reservation is set aside for the eager pool.
We are working to remove this separation of pools, so hopefully this flag won't be relevant in the near future.
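To make the distinction concrete, here is a minimal sketch (my illustration, not from the thread), assuming cuNumeric is installed and the script is launched through `legate`:

```python
import cunumeric as np  # drop-in NumPy replacement from the cuNumeric project

# The shape (hence the size) of this array is known before any task runs,
# so its backing storage can come from the deferred pool:
a = np.zeros((10000, 10000))

# nonzero() produces output whose size depends on the data, so it is only
# known once the task executes; this is the kind of allocation the eager
# pool exists for (my inference from the description above):
idx = np.nonzero(a > 0.5)
```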
Thank you for your reply. After some additional testing, I found something interesting: earlier, I was not passing the … flag. I am trying out the …, which I found to be weird since I believe the example uses the identity matrix as the test case. I'm not sure if this has to do with my value for …. As another note, the … (if the follow-up questions regarding …).
I am trying to run the cunumeric `cholesky.py` example on multiple nodes. Each node has 3 A100 40GB GPUs. I was running into some out-of-memory errors, so I first tried the following test script (called it `memtest.py`) to see how memory was being allocated.
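(The script itself is not preserved in this thread; the following is a hypothetical reconstruction of what such a memory test might look like. The use of `sys.argv` and `np.zeros` is my assumption, not the author's code.)

```python
# memtest.py -- hypothetical reconstruction; the original script from the
# issue is not preserved here.
import sys

import cunumeric as np

n = int(sys.argv[1])

# One n x n float64 matrix costs n * n * 8 bytes of framebuffer memory;
# for n = 121000 that is roughly 117 GB (~109 GiB) in total.
a = np.zeros((n, n))

# Touch the array so the allocation actually materializes before the
# --mem-usage report is produced.
print(a.sum())
```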
I ran this script with `legate --nodes <num_nodes> --gpus 3 --fbmem 38000 --eager-alloc-percentage 1 --mem-usage ./memtest.py <n>`. Here is the output for `n = 121000` and `num_nodes = 1`:
This makes sense; the total memory usage corresponds to the size of the matrix. Now, if I run it with `num_nodes = 2`, I get the same output. I would assume that if the number of nodes (and hence GPUs) were doubled, the memory usage of each GPU would be halved. Is there something wrong with how I am running the program?
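As a back-of-the-envelope check of that expectation (my own arithmetic, assuming a float64 matrix partitioned evenly across all GPUs):

```python
# Expected per-GPU footprint if the n x n matrix is split evenly.
n = 121000
matrix_bytes = n * n * 8          # float64: ~117 GB in total
gpus_per_node = 3

for nodes in (1, 2):
    per_gpu_gib = matrix_bytes / (nodes * gpus_per_node) / 2**30
    print(f"{nodes} node(s): ~{per_gpu_gib:.1f} GiB per GPU")
# -> 1 node(s): ~36.4 GiB per GPU
# -> 2 node(s): ~18.2 GiB per GPU
```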
Also, what does the `--eager-alloc-percentage` flag actually do? I observed that if you make it higher, the program throws an out-of-memory error for smaller values of the matrix size. Is it OK to always keep this value at 1? Is 0 an allowed value? Any help is appreciated.
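Based on the pool split described in the reply above, here is a rough model (my own sketch; the exact accounting inside Legate may differ) of why raising the percentage causes OOMs at smaller matrix sizes: whatever the eager pool claims out of `--fbmem` is no longer available to the deferred pool that holds the matrix itself.

```python
# Rough model: the eager pool claims the given percentage of --fbmem,
# shrinking the deferred pool that the matrix has to fit into.
fbmem_mib = 38000
gpus_per_node = 3

for pct in (1, 10, 50):
    deferred_bytes = fbmem_mib * (100 - pct) / 100 * 2**20 * gpus_per_node
    # Largest n for an n x n float64 matrix on a single node:
    max_n = int((deferred_bytes / 8) ** 0.5)
    print(f"--eager-alloc-percentage {pct}: n up to ~{max_n}")
```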
Legate version and info: