LAMMPS Terminate with Torch error #29

Open
mhsiron opened this issue Sep 15, 2022 · 19 comments
Comments

@mhsiron

mhsiron commented Sep 15, 2022

I have pytorch==1.10.1 with CUDA 10.2 support.
I have CUDA 11.2 on my system.
I compiled LAMMPS without any additional packages, only NequIP support. There were no errors during compilation.

I trained a nequip model and have a model.pth file.

LAMMPS terminates with the following output:

LAMMPS (29 Sep 2021 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (11.408130 11.408130 19.495504)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  215 atoms
  read_data CPU = 0.001 seconds
Replicating atoms ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (34.224390 34.224390 58.486512)
  1 by 1 by 1 MPI processor grid
  5805 atoms
  replicate CPU = 0.001 seconds
NEQUIP is using device cuda
NequIP Coeff: type 1 is element Ti
NequIP Coeff: type 2 is element O
Loading model from model.pth
terminate called after throwing an instance of 'torch::jit::ErrorReport'
terminate called recursively

My input file is as follows:

units           metal
boundary        p p p

atom_style      atomic

read_data       data.meam


replicate 3 3 3

pair_style      nequip
pair_coeff      * * model.pth Ti O


minimize        1.0e-8 1.0e-8 1000 100000
min_style       cg

timestep 0.005
velocity all create 1000.0 454883 mom yes rot yes dist gaussian

thermo_style custom step pe ke etotal temp press density
thermo 100

fix 2 all npt temp 1000.00 2400.00 0.1 iso 1.0 1.0 1000.0


dump           1 all atom 10000 dump.meam

run             10000000

Could I get some help troubleshooting this?

@Linux-cpp-lisp
Collaborator

Linux-cpp-lisp commented Sep 15, 2022

Hi @mhsiron ,

Is this definitely the full output? Or just stdout without stderr? Just wondering if there is any more information.

I trained a nequip model and have a model.pth file.

Did you remember to run nequip-deploy? Was there anything unusual about your training or model?

@mhsiron
Author

mhsiron commented Sep 15, 2022

I did:
nequip-deploy build --train-dir results/tio2/anatase/ model.pth

and there was no error from this. I used this model.pth for the LAMMPS calculation.

I ran LAMMPS under gdb and this was the output:

Starting program: lmp -in in.script
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.22-100.15.4.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: File "/nfs/site/itools/em64t_SLES12SP5/pkgs/gcc/9.2.0/lib64/libstdc++.so.6.0.27-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
        add-auto-load-safe-path /nfs/site/itools/em64t_SLES12SP5/pkgs/gcc/9.2.0/lib64/libstdc++.so.6.0.27-gdb.py
line to your configuration file "$HOME/.gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "$HOME/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
LAMMPS (29 Sep 2021 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (11.408130 11.408130 19.495504)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  215 atoms
  read_data CPU = 0.001 seconds
Replicating atoms ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (34.224390 34.224390 58.486512)
  1 by 1 by 1 MPI processor grid
  5805 atoms
  replicate CPU = 0.001 seconds
Missing separate debuginfo for /usr/lib64/libcuda.so.1
Try: zypper install -C "debuginfo(build-id)=2f5b386bef4cbe74500eaa8ad1839ea4825315a8"
[New Thread 0x2aab39204700 (LWP 36273)]
NEQUIP is using device cuda
NequIP Coeff: type 1 is element Ti
NequIP Coeff: type 2 is element O
Loading model from model.pth
[New Thread 0x2aab39405700 (LWP 36275)]
terminate called after throwing an instance of 'torch::jit::ErrorReport'
terminate called recursively

Thread 1 "lmp" received signal SIGABRT, Aborted.
0x00002aab028c9fd7 in raise () from /lib64/libc.so.6

There are no other output files.

@Linux-cpp-lisp
Collaborator

This is really strange, I've never seen anything like it.

If you run

import torch
m = torch.jit.load("model.pth")
print(m)

does it error when trying to load the model in Python?

Did you build your LAMMPS using libtorch or PyTorch from your conda env?

@mhsiron
Author

mhsiron commented Sep 15, 2022

The output looks normal; it looks like the architecture of the neural network. I built using the conda PyTorch; however, I have also tried libtorch and received the same error.

@Linux-cpp-lisp
Collaborator

@anjohan ever seen something like this?

I can't find any relevant information, but this looks like one of those strange internal PyTorch bugs usually resolved by upgrading. Can you try using PyTorch 1.11?

You can also try disabling the JIT by setting PYTORCH_JIT=0, which can sometimes help by bypassing some of the more fragile machinery at the price of some performance.

@mhsiron
Author

mhsiron commented Sep 18, 2022

Hi @Linux-cpp-lisp, my torch was 1.12. I uninstalled torch 1.12 and installed 1.11 for CUDA 10.2 (I am running CUDA 11.2, but there is no Torch build for it). Same error, with or without PYTORCH_JIT=0.

@Linux-cpp-lisp
Collaborator

Hm, I see... one other question: what nequip and pair_nequip versions are you using, and are you using the same nequip version with which you trained the model?

@mhsiron
Author

mhsiron commented Sep 19, 2022

@Linux-cpp-lisp I am on the main branch of pair_nequip and nequip version: 0.5.5 installed via pip.

Thanks for your help!

@mhsiron
Author

mhsiron commented Sep 19, 2022

Not sure if this helps: I attempted to load the model using the ASE calculator in NequIP and received these warnings:

//nequip/utils/_global_options.py:58: UserWarning: Setting the GLOBAL value for jit fusion strategy to `[('DYNAMIC', 3)]` which is different than the previous value of `[('STATIC', 2), ('DYNAMIC', 10)]`
  warnings.warn(
///nequip/ase/nequip_calculator.py:73: UserWarning: Trying to use chemical symbols as NequIP type names; this may not be correct for your model! To avoid this warning, please provide `species_to_type_name` explicitly.
  warnings.warn(

@Linux-cpp-lisp
Collaborator

Hm, those warnings are normal and come from our code. Generally they can safely be ignored; they are there so that global state is never changed silently, which avoids hard-to-debug issues in client programs.

Let's try to narrow this down a little more with some old-fashioned print debugging... can you edit your LAMMPS/src/pair_nequip.cpp around line 165 so it reads:

  std::cout << "Loading model from " << arg[2] << "\n";

  std::unordered_map<std::string, std::string> metadata = {
    {"config", ""},
    {"nequip_version", ""},
    {"r_max", ""},
    {"n_species", ""},
    {"type_names", ""},
    {"_jit_bailout_depth", ""},
    {"_jit_fusion_strategy", ""},
    {"allow_tf32", ""}
  };
  std::cout << "TEST: loading\n";
  model = torch::jit::load(std::string(arg[2]), device, metadata);
  std::cout << "TEST: loaded\n";
  model.eval();
  std::cout << "TEST: eval mode on\n";

Also, after line 207, add another print:

    // Do it normally
      model = torch::jit::freeze(model);
    #endif
  }
  std::cout << "TEST: froze model\n";

You can then rebuild with make; it should be fast and only rebuild this one file. I'd like to confirm that the issue is really coming from torch::jit::load before I ask around with the PyTorch people.

@mhsiron
Author

mhsiron commented Sep 20, 2022

This is the output:

LAMMPS (29 Sep 2021 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (11.408130 11.408130 19.495504)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  215 atoms
  read_data CPU = 0.001 seconds
Replicating atoms ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (34.224390 34.224390 58.486512)
  1 by 1 by 1 MPI processor grid
  5805 atoms
  replicate CPU = 0.001 seconds
NEQUIP is using device cuda
NequIP Coeff: type 1 is element Ti
NequIP Coeff: type 2 is element O
Loading model from model.pth
TEST: loading
terminate called after throwing an instance of 'torch::jit::ErrorReport'
terminate called recursively

It appears the error is definitely triggered by model = torch::jit::load(std::string(arg[2]), device, metadata);
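
One way to see the actual ErrorReport message rather than just the abort would be to wrap that call in a try/catch (a debugging sketch against the snippet above, not part of the stock pair_nequip.cpp; model, arg, device, and metadata are the variables already in scope there):

  // Debugging sketch: catch the TorchScript exception and print its full
  // message (torch::jit::ErrorReport derives from std::exception) before
  // rethrowing, instead of letting the process terminate with no detail.
  try {
    model = torch::jit::load(std::string(arg[2]), device, metadata);
  } catch (const std::exception &e) {
    std::cerr << "torch::jit::load failed:\n" << e.what() << std::endl;
    throw;
  }

The printed message should indicate which part of the serialized TorchScript fails to load.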

@Linux-cpp-lisp
Collaborator

Linux-cpp-lisp commented Sep 21, 2022

Hm ok I will ask around with the PyTorch people...

In the meantime, I wonder if it has something to do with CUDA / CUDA versions?

I have pytorch==1.10.1 with cuda 10.2 support. I have cuda 11.2 on my system.

Have you tried building without CUDA support? (USE_CUDA=0 USE_CUDNN=0 cmake ..., using the base CPU-only libtorch; you will probably need a fresh build dir.)

@mhsiron
Author

mhsiron commented Sep 21, 2022

Compiling with USE_CUDA=0, USE_CUDNN=0 and libtorch leads to the following output:

LAMMPS (29 Sep 2021 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (11.408130 11.408130 19.495504)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  215 atoms
  read_data CPU = 0.001 seconds
Replicating atoms ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (34.224390 34.224390 58.486512)
  1 by 1 by 1 MPI processor grid
  5805 atoms
  replicate CPU = 0.001 seconds
NEQUIP is using device cuda
NequIP Coeff: type 1 is element Ti
NequIP Coeff: type 2 is element O
Loading model from model.pth
TEST: loading
terminate called after throwing an instance of 'torch::jit::ErrorReport'
terminate called recursively

It appears to be the same error.

@Linux-cpp-lisp
Collaborator

NEQUIP is using device cuda

means it can't have been compiled without CUDA... did you fully remove your build directory between different attempts?

@mhsiron
Author

mhsiron commented Sep 21, 2022

Ah, I did not pay much attention to the CMake output:

CMake Warning:
  Manually-specified variables were not used by the project:

    USE_CDNN
    USE_CUDA

If I try to compile on a device without CUDA installed (or a GPU), I receive the following error:

CUDA_TOOLKIT_ROOT_DIR not found or specified
-- Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY) 
CMake Warning at /nfs/site/disks/msironml/lammps/libtorch/share/cmake/Caffe2/public/cuda.cmake:31 (message):
  Caffe2: CUDA cannot be found.  Depending on whether you are building Caffe2
  or a Caffe2 dependent library, the next warning / error will give you more
  info.
Call Stack (most recent call first):
  /nfs/site/disks/msironml/lammps/libtorch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  /nfs/site/disks/msironml/lammps/libtorch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:922 (find_package)


CMake Error at /nfs/site/disks/msironml/lammps/libtorch/share/cmake/Caffe2/Caffe2Config.cmake:90 (message):
  Your installed Caffe2 version uses CUDA but I cannot find the CUDA
  libraries.  Please set the proper CUDA prefixes and / or install CUDA.
Call Stack (most recent call first):
  /nfs/site/disks/msironml/lammps/libtorch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:922 (find_package)

I did manage to get past that error by creating a new conda environment and pip installing a CPU-only version of PyTorch, however. The USE_CUDA=0 and USE_CUDNN=0 flags were not used, but from the CMake output I do not see any signs of CUDA being utilized. Will update after it compiles!

@mhsiron
Author

mhsiron commented Sep 21, 2022

Looks like without CUDA the model did load:

LAMMPS (29 Sep 2021 - Update 2)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (11.408130 11.408130 19.495504)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  215 atoms
  read_data CPU = 0.001 seconds
Replicating atoms ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (34.224390 34.224390 58.486512)
  1 by 1 by 1 MPI processor grid
  5805 atoms
  replicate CPU = 0.001 seconds
NEQUIP is using device cpu
NequIP Coeff: type 1 is element Ti
NequIP Coeff: type 2 is element O
Loading model from model.pth
TEST: loading
TEST: loaded
TEST: eval mode on
Freezing TorchScript model...
TEST: froze model
WARNING: Using 'neigh_modify every 1 delay 0 check yes' setting during minimization (src/min.cpp:188)
ERROR: Pair style NEQUIP requires newton pair off (src/pair_nequip.cpp:108)
Last command: minimize        1.0e-8 1.0e-8 1000 100000

I added newton off to the input script and it does appear to run. I removed the replicate command for speed and to test:

LAMMPS (29 Sep 2021 - Update 2)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (11.408130 11.408130 19.495504)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  215 atoms
  read_data CPU = 0.001 seconds
NEQUIP is using device cpu
NequIP Coeff: type 1 is element Ti
NequIP Coeff: type 2 is element O
Loading model from model.pth
TEST: loading
TEST: loaded
TEST: eval mode on
Freezing TorchScript model...
TEST: froze model
WARNING: Using 'neigh_modify every 1 delay 0 check yes' setting during minimization (src/min.cpp:188)
Neighbor list info ...
  update every 1 steps, delay 0 steps, check yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 9
  ghost atom cutoff = 9
  binsize = 4.5, bins = 3 3 5
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair nequip, perpetual
      attributes: full, newton off
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up cg style minimization ...
  Unit style    : metal
  Current step  : 0
Per MPI rank memory allocation (min/avg/max) = 4.260 | 4.260 | 4.260 Mbytes
Step Temp E_pair E_mol TotEng Press 
       0            0   -1919.9104            0   -1919.9104            0 

I guess the new question is: how can I troubleshoot this so that it works with CUDA?

@Linux-cpp-lisp
Collaborator

I see, so CUDA is the issue (as suspected). Can you try again:

import torch
m = torch.jit.load("model.pth", map_location="cuda")
print(m)

as a more relevant Python test?

@mhsiron
Author

mhsiron commented Sep 21, 2022

This leads to no errors; it just outputs the model architecture.

@Linux-cpp-lisp
Collaborator

So I haven't heard anything back from the PyTorch Slack on this...

It's possible this comes from your use of CUDA 10.2 (which has recently been deprecated by PyTorch); as far as I know, we have always tested with 11.*.

You could also try to test your installation without LAMMPS or NequIP using the PyTorch tutorial (https://pytorch.org/tutorials/advanced/cpp_export.html) and loading the model directly to CUDA to see if it fails in your build environment.
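
For reference, a minimal standalone loader in the spirit of that tutorial could look like the sketch below (the file name load_test.cpp and the build setup are assumptions; it uses only torch::jit::load and a CUDA torch::Device). A failure here would point at the libtorch/CUDA installation rather than at LAMMPS or the pair style.

// load_test.cpp: standalone libtorch check, loosely following the linked
// C++ export tutorial. Loads the deployed TorchScript model directly onto
// CUDA with no LAMMPS or NequIP involved.
#include <torch/script.h>

#include <iostream>
#include <string>

int main(int argc, char *argv[]) {
  if (argc != 2) {
    std::cerr << "usage: load_test <path-to-model.pth>\n";
    return 1;
  }
  try {
    torch::Device device(torch::kCUDA);
    // Same entry point pair_nequip uses; map the model straight to CUDA.
    torch::jit::script::Module module = torch::jit::load(std::string(argv[1]), device);
    module.eval();
    std::cout << "Model loaded onto CUDA without error\n";
  } catch (const std::exception &e) {
    std::cerr << "torch::jit::load failed:\n" << e.what() << std::endl;
    return 1;
  }
  return 0;
}

Built against the same libtorch used for the LAMMPS build (for example with the CMakeLists.txt from the tutorial), this isolates whether loading a TorchScript model onto CUDA works at all in that environment.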
