LAMMPS Terminates with Torch error #29
I have pytorch==1.10.1 with CUDA 10.2 support.
I have CUDA 11.2 on my system.
I compiled LAMMPS without any additional packages, only NequIP support. There was no error during compilation.
I trained a NequIP model and have a model.pth file.
LAMMPS terminates with the following output:
My input file is as follows:
Could I get some help troubleshooting this?
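The input file itself did not survive this copy of the thread. As a stand-in, a minimal pair_nequip input typically looks something like the sketch below; the data file name and the species mapping are assumptions, not the reporter's actual setup:

```
# Sketch of a minimal pair_nequip input; file names and species are assumptions
units         metal
atom_style    atomic
newton        off            # newton off is typically required by pair_nequip

read_data     data.si        # hypothetical data file

pair_style    nequip
pair_coeff    * * model.pth Si

timestep      0.001
run           100
```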
Hi @mhsiron, is this definitely the full output, or just stdout without stderr? Just wondering if there is any more information.
Did you remember to run `nequip-deploy`?
I did, and there was no error from it. I used the resulting model.pth for the LAMMPS calculation. I also ran LAMMPS under gdb, and this was the output:
There are no other output files.
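For context, the `nequip-deploy` step asked about above typically looks like the following; the paths are assumptions, and the exact CLI syntax varies between nequip versions:

```bash
# Build a deployed TorchScript model from a finished training session.
# Paths are hypothetical; older nequip versions take the training directory
# as a positional argument instead of --train-dir.
nequip-deploy build --train-dir path/to/training_session/ model.pth
```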
This is really strange; I've never seen anything like it. If you load the deployed model in Python, does it error? And did you build your LAMMPS against libtorch, or against a conda/pip PyTorch installation?
The output looks normal -- it looks like the architecture of the neural network. I built using conda; however, I have also tried libtorch and received the same error.
@anjohan, have you ever seen something like this? I can't find any relevant information, but this looks like one of those strange internal PyTorch bugs that are usually resolved by upgrading. Can you try using PyTorch 1.11? You can also try disabling the JIT by setting the environment variable PYTORCH_JIT=0.
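For reference, that variable can be set for a single run; the binary and input file names below are assumptions:

```bash
# Run LAMMPS with the TorchScript JIT disabled, as suggested above
PYTORCH_JIT=0 ./lmp -in in.lammps
```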
Hi @Linux-cpp-lisp, my torch was 1.12. I uninstalled torch 1.12 and installed 1.11 built for CUDA 10.2 (I am running CUDA 11.2, but there is no Torch build for it). Same error, with or without PYTORCH_JIT=0.
Hm, I see... one other question: what branch of pair_nequip and what version of nequip are you using?
@Linux-cpp-lisp I am on the main branch of pair_nequip, with nequip version 0.5.5 installed via pip. Thanks for your help!
Not sure if this helps: I attempted to load the model using the ASE calculator in NequIP and received this warning:
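The warning text was not preserved in this copy. For reference, loading a deployed model through NequIP's ASE interface looks roughly like this; the path and device are assumptions:

```python
from nequip.ase import NequIPCalculator

# Hypothetical path; attaches the deployed TorchScript model as an ASE calculator
calc = NequIPCalculator.from_deployed_model(
    model_path="model.pth",
    device="cpu",  # whether "cuda" also works is the question at issue here
)
```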
Hm, those warnings are normal and are from our code; generally they can safely be ignored. They exist so that global state is never changed silently, to avoid hard-to-debug issues in client programs. Let's try to narrow this down a little more with some old-fashioned print debugging... can you edit your pair_nequip.cpp, adding a print statement just before the model is loaded,
and also, after line 207, adding another print (sketched below)?
You can then rebuild, rerun, and see which of the prints appear in the output.
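The exact print statements weren't preserved in this copy of the thread; the kind of edit being suggested is sketched below. The variable names and the placement within pair_nequip.cpp are assumptions:

```cpp
// Sketch only: bracket the TorchScript model load with prints so a crash
// can be localized to the load itself. Variable names are assumptions.
std::cout << "pair_nequip: about to load model..." << std::endl;
model = torch::jit::load(model_path);
std::cout << "pair_nequip: model loaded OK" << std::endl;
```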
This is the output:
It appears the error is definitely triggered by the model-loading call.
Hm, OK, I will ask around with the PyTorch people... In the meantime, I wonder if it has something to do with CUDA / CUDA versions?
Have you tried building without CUDA support (e.g. configuring with USE_CUDA=0 and USE_CUDNN=0, as sketched below)?
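A sketch of such a CUDA-free reconfigure; the directory layout and libtorch path are assumptions:

```bash
# Reconfigure LAMMPS + pair_nequip against libtorch without CUDA.
cd lammps/build
rm -rf CMakeCache.txt CMakeFiles/    # make sure no stale CUDA configuration survives
cmake ../cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch \
      -DUSE_CUDA=0 -DUSE_CUDNN=0
make -j 4
```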
Compiling with USE_CUDA=0 and USE_CUDNN=0 against libtorch leads to the following output:
It appears to be the same error.
That the error still mentions CUDA means it can't have been compiled without CUDA... did you fully remove your build directory and reconfigure from scratch?
Ah, I did not pay much attention to the CMake output:
If I try to compile on a device with CUDA installed (or a GPU), I receive the following error:
I did manage to get past that error by creating a new conda environment and pip-installing a CPU-only version of PyTorch, however. The USE_CUDA=0 and USE_CUDNN=0 flags were not used, but from the CMake output I do not see any signs of CUDA being utilized. Will update after the compile!
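A sketch of that workaround; the environment name and version pins are assumptions:

```bash
# Fresh environment with a CPU-only PyTorch wheel
conda create -n nequip-cpu python=3.9
conda activate nequip-cpu
pip install torch==1.11.0+cpu --extra-index-url https://download.pytorch.org/whl/cpu
pip install nequip
```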
Looks like without CUDA the model did load:
I added the debugging prints described above:
I guess the new question is: how can I troubleshoot this so that it works with CUDA?
I see, so CUDA is the issue (as suspected). Can you try again with the model loaded directly onto CUDA, as a more relevant Python test? (Sketched below.)
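The snippet itself was lost in this copy; a plausible minimal version of such a test, with the filename as an assumption:

```python
import torch

# Load the deployed TorchScript model directly onto the GPU;
# printing it shows the network architecture.
model = torch.jit.load("model.pth", map_location="cuda")
print(model)
```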
This leads to no errors -- it just outputs the model architecture.
So I haven't heard anything back from the PyTorch Slack on this... It's possible this comes from your use of CUDA 10.2 (which PyTorch recently deprecated); as far as I know, we have always tested with 11.*. You could also test your installation without LAMMPS or NequIP by following the PyTorch C++ export tutorial (https://pytorch.org/tutorials/advanced/cpp_export.html) and loading the model directly to CUDA, to see if it fails in your build environment.
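In the spirit of that tutorial, a minimal standalone loader targeting CUDA might look like the sketch below; the program name is an assumption, and the CMake setup against libtorch is as described in the tutorial:

```cpp
#include <torch/script.h>
#include <iostream>

// Minimal check: can this build environment load a TorchScript model onto CUDA?
int main(int argc, const char* argv[]) {
  if (argc != 2) {
    std::cerr << "usage: load_test <model.pth>" << std::endl;
    return 1;
  }
  try {
    // Deserialize the module and move it to the GPU in one step.
    torch::jit::script::Module module = torch::jit::load(argv[1], torch::kCUDA);
    std::cout << "Model loaded to CUDA without error." << std::endl;
  } catch (const c10::Error& e) {
    std::cerr << "Error loading the model:\n" << e.what() << std::endl;
    return 1;
  }
  return 0;
}
```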