LAMMPS Terminate with Torch error #29

Open
mhsiron opened this issue Sep 15, 2022 · 19 comments
Comments

@mhsiron

mhsiron commented Sep 15, 2022

I have pytorch==1.10.1 with CUDA 10.2 support.
I have CUDA 11.2 on my system.
I compiled LAMMPS without any additional packages, only NequIP support. There were no errors during compilation.

I trained a nequip model and have a model.pth file.

LAMMPS terminates with the following output:

LAMMPS (29 Sep 2021 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (11.408130 11.408130 19.495504)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  215 atoms
  read_data CPU = 0.001 seconds
Replicating atoms ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (34.224390 34.224390 58.486512)
  1 by 1 by 1 MPI processor grid
  5805 atoms
  replicate CPU = 0.001 seconds
NEQUIP is using device cuda
NequIP Coeff: type 1 is element Ti
NequIP Coeff: type 2 is element O
Loading model from model.pth
terminate called after throwing an instance of 'torch::jit::ErrorReport'
terminate called recursively

My input file is as follows:

units           metal
boundary        p p p

atom_style      atomic

read_data       data.meam


replicate 3 3 3

pair_style      nequip
pair_coeff      * * model.pth Ti O


minimize        1.0e-8 1.0e-8 1000 100000
min_style       cg

timestep 0.005
velocity all create 1000.0 454883 mom yes rot yes dist gaussian

thermo_style custom step pe ke etotal temp press density
thermo 100

fix 2 all npt temp 1000.00 2400.00 0.1 iso 1.0 1.0 1000.0


dump           1 all atom 10000 dump.meam

run             10000000

Could I get some help troubleshooting this?

@Linux-cpp-lisp
Collaborator

Linux-cpp-lisp commented Sep 15, 2022

Hi @mhsiron ,

Is this definitely the full output? Or just stdout without stderr? Just wondering if there is any more information.

I trained a nequip model and have a model.pth file.

Did you remember to run nequip-deploy? Was there anything unusual about your training or model?

@mhsiron
Author

mhsiron commented Sep 15, 2022

I did:
nequip-deploy build --train-dir results/tio2/anatase/ model.pth

and there was no error from this. I used this model.pth for the LAMMPS calculation.

I ran LAMMPS under gdb and this was the output:

Starting program: lmp -in in.script
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.22-100.15.4.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: File "/nfs/site/itools/em64t_SLES12SP5/pkgs/gcc/9.2.0/lib64/libstdc++.so.6.0.27-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
        add-auto-load-safe-path /nfs/site/itools/em64t_SLES12SP5/pkgs/gcc/9.2.0/lib64/libstdc++.so.6.0.27-gdb.py
line to your configuration file "$HOME/.gdbinit".
To completely disable this security protection add
        set auto-load safe-path /
line to your configuration file "$HOME/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
        info "(gdb)Auto-loading safe path"
LAMMPS (29 Sep 2021 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (11.408130 11.408130 19.495504)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  215 atoms
  read_data CPU = 0.001 seconds
Replicating atoms ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (34.224390 34.224390 58.486512)
  1 by 1 by 1 MPI processor grid
  5805 atoms
  replicate CPU = 0.001 seconds
Missing separate debuginfo for /usr/lib64/libcuda.so.1
Try: zypper install -C "debuginfo(build-id)=2f5b386bef4cbe74500eaa8ad1839ea4825315a8"
[New Thread 0x2aab39204700 (LWP 36273)]
NEQUIP is using device cuda
NequIP Coeff: type 1 is element Ti
NequIP Coeff: type 2 is element O
Loading model from model.pth
[New Thread 0x2aab39405700 (LWP 36275)]
terminate called after throwing an instance of 'torch::jit::ErrorReport'
terminate called recursively

Thread 1 "lmp" received signal SIGABRT, Aborted.
0x00002aab028c9fd7 in raise () from /lib64/libc.so.6

There are no other output files.

@Linux-cpp-lisp
Collaborator

This is really strange, I've never seen anything like it.

If you run

import torch
m = torch.jit.load("model.pth")
print(m)

does it error when trying to load the model in Python?

Did you build your LAMMPS using libtorch or PyTorch from your conda env?

@mhsiron
Author

mhsiron commented Sep 15, 2022

The output looks normal; it looks like the architecture of the neural network. I built using the conda PyTorch; however, I have also tried libtorch and received the same error.

@Linux-cpp-lisp
Collaborator

@anjohan ever seen something like this?

I can't find any relevant information, but this looks like one of those strange internal PyTorch bugs usually resolved by upgrading. Can you try using PyTorch 1.11?

You can also try disabling the JIT by setting PYTORCH_JIT=0, which can sometimes help by bypassing some of the more fragile machinery at the price of some performance.

@mhsiron
Author

mhsiron commented Sep 18, 2022

Hi @Linux-cpp-lisp, my torch was 1.12. I uninstalled torch 1.12 and installed 1.11 for CUDA 10.2 (I am running CUDA 11.2, but there is no Torch build for it). Same error, with or without PYTORCH_JIT=0.

@Linux-cpp-lisp
Collaborator

Hm, I see... one other question: what nequip and pair_nequip versions are you using, and are you using the same nequip version with which you trained the model?

@mhsiron
Author

mhsiron commented Sep 19, 2022

@Linux-cpp-lisp I am on the main branch of pair_nequip and nequip version: 0.5.5 installed via pip.

Thanks for your help!

@mhsiron
Author

mhsiron commented Sep 19, 2022

Not sure if this helps: I attempted to load the model using the ASE calculator in NequIP and received these warnings:

//nequip/utils/_global_options.py:58: UserWarning: Setting the GLOBAL value for jit fusion strategy to `[('DYNAMIC', 3)]` which is different than the previous value of `[('STATIC', 2), ('DYNAMIC', 10)]`
  warnings.warn(
///nequip/ase/nequip_calculator.py:73: UserWarning: Trying to use chemical symbols as NequIP type names; this may not be correct for your model! To avoid this warning, please provide `species_to_type_name` explicitly.
  warnings.warn(

@Linux-cpp-lisp
Collaborator

Hm, those warnings are normal and come from our code. Generally they can safely be ignored; they are there so that global state is never changed silently, which avoids hard-to-debug issues in client programs.

Let's try to narrow this down a little more with some old-fashioned print debugging... can you edit your LAMMPS/src/pair_nequip.cpp around line 165 so it reads:

  std::cout << "Loading model from " << arg[2] << "\n";

  std::unordered_map<std::string, std::string> metadata = {
    {"config", ""},
    {"nequip_version", ""},
    {"r_max", ""},
    {"n_species", ""},
    {"type_names", ""},
    {"_jit_bailout_depth", ""},
    {"_jit_fusion_strategy", ""},
    {"allow_tf32", ""}
  };
  std::cout << "TEST: loading\n";
  model = torch::jit::load(std::string(arg[2]), device, metadata);
  std::cout << "TEST: loaded\n";
  model.eval();
  std::cout << "TEST: eval mode on\n";

Also, after line 207, add another print:

    // Do it normally
      model = torch::jit::freeze(model);
    #endif
  }
  std::cout << "TEST: froze model\n";

You can then rebuild with make; it should be fast and only rebuild this one file. I'd like to confirm that the issue is really coming from torch::jit::load before I ask around with the PyTorch people.

@mhsiron
Author

mhsiron commented Sep 20, 2022

This is the output:

LAMMPS (29 Sep 2021 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (11.408130 11.408130 19.495504)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  215 atoms
  read_data CPU = 0.001 seconds
Replicating atoms ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (34.224390 34.224390 58.486512)
  1 by 1 by 1 MPI processor grid
  5805 atoms
  replicate CPU = 0.001 seconds
NEQUIP is using device cuda
NequIP Coeff: type 1 is element Ti
NequIP Coeff: type 2 is element O
Loading model from model.pth
TEST: loading
terminate called after throwing an instance of 'torch::jit::ErrorReport'
terminate called recursively

It appears the error is definitely triggered by model = torch::jit::load(std::string(arg[2]), device, metadata);
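
One way to see the actual ErrorReport message rather than just the abort would be to wrap that call in a try/catch (a debugging sketch against the snippet above, not part of the stock pair_nequip.cpp; model, arg, device, and metadata are the variables already in scope there):

  // Debugging sketch: catch the TorchScript exception and print its full
  // message (torch::jit::ErrorReport derives from std::exception) before
  // rethrowing, instead of letting the process terminate with no detail.
  try {
    model = torch::jit::load(std::string(arg[2]), device, metadata);
  } catch (const std::exception &e) {
    std::cerr << "torch::jit::load failed:\n" << e.what() << std::endl;
    throw;
  }

The printed message should indicate which part of the serialized TorchScript fails to load.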

@Linux-cpp-lisp
Collaborator

Linux-cpp-lisp commented Sep 21, 2022

Hm ok I will ask around with the PyTorch people...

In the meantime, I wonder if it has something to do with CUDA / CUDA versions?

I have pytorch==1.10.1 with cuda 10.2 support. I have cuda 11.2 on my system.

Have you tried building without CUDA support? (USE_CUDA=0 USE_CUDNN=0 cmake ..., using the base CPU-only libtorch; you will probably need a fresh build dir.)

@mhsiron
Author

mhsiron commented Sep 21, 2022

Compiling with USE_CUDA=0, USE_CUDNN=0 and libtorch leads to the following output:

LAMMPS (29 Sep 2021 - Update 3)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (11.408130 11.408130 19.495504)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  215 atoms
  read_data CPU = 0.001 seconds
Replicating atoms ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (34.224390 34.224390 58.486512)
  1 by 1 by 1 MPI processor grid
  5805 atoms
  replicate CPU = 0.001 seconds
NEQUIP is using device cuda
NequIP Coeff: type 1 is element Ti
NequIP Coeff: type 2 is element O
Loading model from model.pth
TEST: loading
terminate called after throwing an instance of 'torch::jit::ErrorReport'
terminate called recursively

It appears to be the same error.

@Linux-cpp-lisp
Collaborator

NEQUIP is using device cuda

means it can't have been compiled without CUDA... did you fully remove your build directory between different attempts?

@mhsiron
Author

mhsiron commented Sep 21, 2022

Ah, I did not pay much attention to the CMake output:

CMake Warning:
  Manually-specified variables were not used by the project:

    USE_CDNN
    USE_CUDA

If I try to compile on a device without CUDA installed (or a GPU), I receive the following error:

CUDA_TOOLKIT_ROOT_DIR not found or specified
-- Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY) 
CMake Warning at /nfs/site/disks/msironml/lammps/libtorch/share/cmake/Caffe2/public/cuda.cmake:31 (message):
  Caffe2: CUDA cannot be found.  Depending on whether you are building Caffe2
  or a Caffe2 dependent library, the next warning / error will give you more
  info.
Call Stack (most recent call first):
  /nfs/site/disks/msironml/lammps/libtorch/share/cmake/Caffe2/Caffe2Config.cmake:88 (include)
  /nfs/site/disks/msironml/lammps/libtorch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:922 (find_package)


CMake Error at /nfs/site/disks/msironml/lammps/libtorch/share/cmake/Caffe2/Caffe2Config.cmake:90 (message):
  Your installed Caffe2 version uses CUDA but I cannot find the CUDA
  libraries.  Please set the proper CUDA prefixes and / or install CUDA.
Call Stack (most recent call first):
  /nfs/site/disks/msironml/lammps/libtorch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
  CMakeLists.txt:922 (find_package)

I did manage to get past that error by creating a new conda environment and pip installing a CPU-only version of PyTorch, however. The USE_CUDA=0 and USE_CUDNN=0 flags were not used, but from the CMake output I do not see any signs of CUDA being utilized. Will update after it compiles!

@mhsiron
Author

mhsiron commented Sep 21, 2022

Looks like without CUDA the model did load:

LAMMPS (29 Sep 2021 - Update 2)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (11.408130 11.408130 19.495504)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  215 atoms
  read_data CPU = 0.001 seconds
Replicating atoms ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (34.224390 34.224390 58.486512)
  1 by 1 by 1 MPI processor grid
  5805 atoms
  replicate CPU = 0.001 seconds
NEQUIP is using device cpu
NequIP Coeff: type 1 is element Ti
NequIP Coeff: type 2 is element O
Loading model from model.pth
TEST: loading
TEST: loaded
TEST: eval mode on
Freezing TorchScript model...
TEST: froze model
WARNING: Using 'neigh_modify every 1 delay 0 check yes' setting during minimization (src/min.cpp:188)
ERROR: Pair style NEQUIP requires newton pair off (src/pair_nequip.cpp:108)
Last command: minimize        1.0e-8 1.0e-8 1000 100000

I added newton off to the input script and it does appear to run. I removed the replicate command for speed and to test:

LAMMPS (29 Sep 2021 - Update 2)
OMP_NUM_THREADS environment is not set. Defaulting to 1 thread. (src/comm.cpp:98)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (11.408130 11.408130 19.495504)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  215 atoms
  read_data CPU = 0.001 seconds
NEQUIP is using device cpu
NequIP Coeff: type 1 is element Ti
NequIP Coeff: type 2 is element O
Loading model from model.pth
TEST: loading
TEST: loaded
TEST: eval mode on
Freezing TorchScript model...
TEST: froze model
WARNING: Using 'neigh_modify every 1 delay 0 check yes' setting during minimization (src/min.cpp:188)
Neighbor list info ...
  update every 1 steps, delay 0 steps, check yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 9
  ghost atom cutoff = 9
  binsize = 4.5, bins = 3 3 5
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair nequip, perpetual
      attributes: full, newton off
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up cg style minimization ...
  Unit style    : metal
  Current step  : 0
Per MPI rank memory allocation (min/avg/max) = 4.260 | 4.260 | 4.260 Mbytes
Step Temp E_pair E_mol TotEng Press 
       0            0   -1919.9104            0   -1919.9104            0 

I guess the new question is: how can I troubleshoot this so that it works with CUDA?

@Linux-cpp-lisp
Collaborator

I see, so CUDA is the issue (as suspected). Can you try again:

import torch
m = torch.jit.load("model.pth", map_location="cuda")
print(m)

as a more relevant Python test?

@mhsiron
Author

mhsiron commented Sep 21, 2022

This leads to no errors; it just outputs the model architecture.

@Linux-cpp-lisp
Collaborator

So I haven't heard anything back from the PyTorch Slack on this...

It's possible this comes from your use of CUDA 10.2 (which has recently been deprecated by PyTorch); as far as I know, we have always tested with 11.*.

You could also try to test your installation without LAMMPS or NequIP using the PyTorch tutorial (https://pytorch.org/tutorials/advanced/cpp_export.html) and loading the model directly to CUDA to see if it fails in your build environment.
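
For reference, a minimal standalone loader in the spirit of that tutorial could look like the sketch below (the file name load_test.cpp and the build setup are assumptions; it uses only torch::jit::load and a CUDA torch::Device). A failure here would point at the libtorch/CUDA installation rather than at LAMMPS or the pair style.

// load_test.cpp: standalone libtorch check, loosely following the linked
// C++ export tutorial. Loads the deployed TorchScript model directly onto
// CUDA with no LAMMPS or NequIP involved.
#include <torch/script.h>

#include <iostream>
#include <string>

int main(int argc, char *argv[]) {
  if (argc != 2) {
    std::cerr << "usage: load_test <path-to-model.pth>\n";
    return 1;
  }
  try {
    torch::Device device(torch::kCUDA);
    // Same entry point pair_nequip uses; map the model straight to CUDA.
    torch::jit::script::Module module = torch::jit::load(std::string(argv[1]), device);
    module.eval();
    std::cout << "Model loaded onto CUDA without error\n";
  } catch (const std::exception &e) {
    std::cerr << "torch::jit::load failed:\n" << e.what() << std::endl;
    return 1;
  }
  return 0;
}

Built against the same libtorch used for the LAMMPS build (for example with the CMakeLists.txt from the tutorial), this isolates whether loading a TorchScript model onto CUDA works at all in that environment.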
