Lammps failed with c10::error #28

Open
hhlim12 opened this issue Aug 31, 2022 · 1 comment

hhlim12 commented Aug 31, 2022

Hi, thank you very much for developing NequIP.
Training works without problems (on GPU), but I get an error when running the deployed model in LAMMPS.
The error is: terminate called after throwing an instance of 'c10::Error' what(): expected scalar type Float but found Byte, which is probably related to #25 (comment).
I used PyTorch 1.11 and LAMMPS 29 Sep 2021 as suggested.
I also tried libtorch 1.11 instead of the PyTorch package, but the same error occurred.
I installed NequIP 0.5.5 with PyTorch 1.11.
The output is below.

LAMMPS (29 Sep 2021 - Update 2)
  using 1 OpenMP thread(s) per MPI task
Reading data file ...
  orthogonal box = (0.0000000 0.0000000 0.0000000) to (30.000000 30.000000 30.000000)
  1 by 1 by 1 MPI processor grid
  reading atoms ...
  21 atoms
  read_data CPU = 0.001 seconds
NEQUIP is using device cuda
NequIP Coeff: type 1 is element H
NequIP Coeff: type 2 is element O
NequIP Coeff: type 3 is element C
Loading model from aspirin.pth
Freezing TorchScript model...
WARNING: Using 'neigh_modify every 1 delay 0 check yes' setting during minimization (src/min.cpp:188)
Neighbor list info ...
  update every 1 steps, delay 0 steps, check yes
  max neighbors/atom: 2000, page size: 100000
  master list distance cutoff = 5
  ghost atom cutoff = 5
  binsize = 2.5, bins = 12 12 12
  1 neighbor lists, perpetual/occasional/extra = 1 0 0
  (1) pair nequip, perpetual
      attributes: full, newton off
      pair build: full/bin/atomonly
      stencil: full/bin/3d
      bin: standard
Setting up cg style minimization ...
  Unit style    : real
  Current step  : 0
terminate called after throwing an instance of 'c10::Error'
  what():  expected scalar type Float but found Byte
Exception raised from data_ptr<float> at /opt/conda/conda-bld/pytorch_1646755903507/work/build/aten/src/ATen/core/TensorMethods.cpp:18 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x14d0984b31bd in /home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x68 (0x14d0984af838 in /home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: float* at::TensorBase::data_ptr<float>() const + 0xde (0x14d09a3abc3e in /home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so)
frame #3: at::TensorAccessor<float, 2ul, at::DefaultPtrTraits, long> at::TensorBase::accessor<float, 2ul>() const & + 0xcb (0x8bea4b in ./lmp)
frame #4: ./lmp() [0x8b66b2]
frame #5: ./lmp() [0x477689]
frame #6: ./lmp() [0x47be8e]
frame #7: ./lmp() [0x439995]
frame #8: ./lmp() [0x43799b]
frame #9: ./lmp() [0x41a416]
frame #10: __libc_start_main + 0xf3 (0x14d063f84493 in /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libc.so.6)
frame #11: ./lmp() [0x41a2ee]

[acc008:691367] *** Process received signal ***
[acc008:691367] Signal: Aborted (6)
[acc008:691367] Signal code:  (-6)
[acc008:691367] [ 0] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libpthread.so.0(+0x12c20)[0x14d0649dac20]
[acc008:691367] [ 1] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libc.so.6(gsignal+0x10f)[0x14d063f9837f]
[acc008:691367] [ 2] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libc.so.6(abort+0x127)[0x14d063f82db5]
[acc008:691367] [ 3] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libstdc++.so.6(+0x9009b)[0x14d06597a09b]
[acc008:691367] [ 4] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libstdc++.so.6(+0x9653c)[0x14d06598053c]
[acc008:691367] [ 5] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libstdc++.so.6(+0x96597)[0x14d065980597]
[acc008:691367] [ 6] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libstdc++.so.6(+0x967f8)[0x14d0659807f8]
[acc008:691367] [ 7] /home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/lib/libc10.so(_ZN3c106detail14torchCheckFailEPKcS2_jRKSs+0x93)[0x14d0984af863]
[acc008:691367] [ 8] /home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/lib/libtorch_cpu.so(_ZNK2at10TensorBase8data_ptrIfEEPT_v+0xde)[0x14d09a3abc3e]
[acc008:691367] [ 9] ./lmp(_ZNKR2at10TensorBase8accessorIfLm2EEENS_14TensorAccessorIT_XT0_ENS_16DefaultPtrTraitsElEEv+0xcb)[0x8bea4b]
[acc008:691367] [10] ./lmp[0x8b66b2]
[acc008:691367] [11] ./lmp[0x477689]
[acc008:691367] [12] ./lmp[0x47be8e]
[acc008:691367] [13] ./lmp[0x439995]
[acc008:691367] [14] ./lmp[0x43799b]
[acc008:691367] [15] ./lmp[0x41a416]
[acc008:691367] [16] /usr/lib/gcc/x86_64-redhat-linux/8/../../../../lib64/libc.so.6(__libc_start_main+0xf3)[0x14d063f84493]
[acc008:691367] [17] ./lmp[0x41a2ee]
[acc008:691367] *** End of error message ***
Aborted (core dumped)

Curiously, when I compile LAMMPS with PyTorch 1.12 (CPU only), the MD runs successfully.
I would appreciate any suggestions for solving this problem.

Below are more details on the system I am experimenting with. I'm sorry for the lengthy message.

  • System: I used minimal.yaml, found in the NequIP source directory, as the NequIP input. I then deployed the model with nequip-deploy to get a .pth file, which I use in LAMMPS.
  • Computer: NVIDIA A100 with CUDA 11.6 loaded
  • I installed PyTorch with the following command: conda install pytorch==1.11.0 cudatoolkit=11.3 -c pytorch
  • conda list for the environment I use:
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
_openmp_mutex             5.1                       1_gnu
asttokens                 2.0.5              pyhd3eb1b0_0
backcall                  0.2.0              pyhd3eb1b0_0
blas                      1.0                         mkl
ca-certificates           2022.07.19           h06a4308_0
certifi                   2022.6.15        py38h06a4308_0
cudatoolkit               11.3.1               h2bc3f7f_2
decorator                 5.1.1              pyhd3eb1b0_0
executing                 0.8.3              pyhd3eb1b0_0
intel-openmp              2022.0.1          h06a4308_3633
ipython                   8.4.0            py38h06a4308_0
jedi                      0.18.1           py38h06a4308_1
ld_impl_linux-64          2.38                 h1181459_1
libffi                    3.3                  he6710b0_2
libgcc-ng                 11.2.0               h1234567_1
libgomp                   11.2.0               h1234567_1
libstdcxx-ng              11.2.0               h1234567_1
libuv                     1.40.0               h7b6447c_0
matplotlib-inline         0.1.2              pyhd3eb1b0_2
mkl                       2022.0.1           h06a4308_117
mkl-include               2022.0.1           h06a4308_117
ncurses                   6.3                  h5eee18b_3
numpy                     1.23.2                   pypi_0    pypi
openssl                   1.1.1q               h7f8727e_0
parso                     0.8.3              pyhd3eb1b0_0
pexpect                   4.8.0              pyhd3eb1b0_3
pickleshare               0.7.5           pyhd3eb1b0_1003
pip                       22.1.2           py38h06a4308_0
prompt-toolkit            3.0.20             pyhd3eb1b0_0
ptyprocess                0.7.0              pyhd3eb1b0_2
pure_eval                 0.2.2              pyhd3eb1b0_0
pygments                  2.11.2             pyhd3eb1b0_0
python                    3.8.13               h12debd9_0
pytorch                   1.11.0          py3.8_cuda11.3_cudnn8.2.0_0    pytorch
pytorch-mutex             1.0                        cuda    pytorch
readline                  8.1.2                h7f8727e_1
setuptools                63.4.1           py38h06a4308_0
six                       1.16.0             pyhd3eb1b0_1
sqlite                    3.39.2               h5082296_0
stack_data                0.2.0              pyhd3eb1b0_0
tk                        8.6.12               h1ccaba5_0
traitlets                 5.1.1              pyhd3eb1b0_0
typing_extensions         4.3.0            py38h06a4308_0
wcwidth                   0.2.5              pyhd3eb1b0_0
wheel                     0.37.1             pyhd3eb1b0_0
xz                        5.2.5                h7f8727e_1
zlib                      1.2.12               h7f8727e_2
  • cmake output:
-- The CXX compiler identification is NVHPC 22.2.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /home/app/hpc_sdk/Linux_x86_64/22.2/compilers/bin/nvc++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /home/k0107/k010716/bin/git (found version "2.27.0")
-- Appending /home/app/openmpi/4.1.2/lib to CMAKE_LIBRARY_PATH: /home/app/openmpi/4.1.2/lib
-- Running check for auto-generated files from make-based build system
-- Found MPI_CXX: /home/app/openmpi/4.1.2/lib/libmpi.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- Looking for C++ include omp.h
-- Looking for C++ include omp.h - found
-- Found OpenMP_CXX: -mp
-- Found OpenMP: TRUE
-- Found JPEG: /usr/lib64/libjpeg.so (found version "62")
-- Found PNG: /usr/lib64/libpng.so (found version "1.6.34")
-- Found ZLIB: /usr/lib64/libz.so (found version "1.2.11")
-- Found GZIP: /bin/gzip
-- Could NOT find FFMPEG (missing: FFMPEG_EXECUTABLE)
-- Looking for C++ include cmath
-- Looking for C++ include cmath - found
-- Generating style headers...
-- Generating package headers...
-- Generating lmpinstalledpkgs.h...
-- Could NOT find ClangFormat (missing: ClangFormat_EXECUTABLE) (Required is at least version "8.0")
-- The following tools and libraries have been found and configured:
 * Git
 * MPI
 * OpenMP
 * JPEG
 * PNG
 * ZLIB

-- <<< Build configuration >>>
   Operating System: Linux Red Hat Enterprise Linux 8.5
   Build type:       RelWithDebInfo
   Install path:     /home/k0107/k010716/.local
   Generator:        Unix Makefiles using /bin/gmake
-- Enabled packages: <None>
-- <<< Compilers and Flags: >>>
-- C++ Compiler:     /home/app/hpc_sdk/Linux_x86_64/22.2/compilers/bin/nvc++
      Type:          NVHPC
      Version:       22.2.0
      C++ Flags:     -O2 -gopt
      Defines:       LAMMPS_SMALLBIG;LAMMPS_MEMALIGN=64;LAMMPS_OMP_COMPAT=4;LAMMPS_JPEG;LAMMPS_PNG;LAMMPS_GZIP
-- <<< Linker flags: >>>
-- Executable name:  lmp
-- Static library flags:
-- <<< MPI flags >>>
-- MPI_defines:      MPICH_SKIP_MPICXX;OMPI_SKIP_MPICXX;_MPICC_H
-- MPI includes:     /home/app/openmpi/4.1.2/include
-- MPI libraries:    /home/app/openmpi/4.1.2/lib/libmpi.so;
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found CUDA: /home/k0107/k010716/GPU/cuda/ (found version "11.6")
-- The CUDA compiler identification is NVIDIA 11.6.55
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /home/app/hpc_sdk/Linux_x86_64/22.2/compilers/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Caffe2: CUDA detected: 11.6
-- Caffe2: CUDA nvcc is: /home/k0107/k010716/GPU/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /home/k0107/k010716/GPU/cuda/
-- Caffe2: Header version is: 11.6
-- Found CUDNN: /home/k0107/k010716/GPU/cudnn/lib/libcudnn.so
-- Found cuDNN: v8.5.0  (include: /home/k0107/k010716/GPU/cudnn/include, library: /home/k0107/k010716/GPU/cudnn/lib/libcudnn.so)
-- /home/k0107/k010716/GPU/cuda/lib64/libnvrtc.so shorthash is 280a23f6
-- Autodetected CUDA architecture(s):  8.0 8.0 8.0 8.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_80,code=sm_80
CMake Warning at /home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
  static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
  /home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:127 (append_torchlib_if_found)
  CMakeLists.txt:922 (find_package)


-- Found Torch: /home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/lib/libtorch.so
-- Configuring done
-- Generating done
-- Build files have been written to: /home/k0107/k010716/LAMMPS/lammps-nequip4/build
  • After cmake, I ran make and got an executable, although some warnings were printed:
"/home/k0107/k010716/LAMMPS/lammps-nequip4/src/fmt/format.h", line 1156: warning: statement is unreachable
       return;
       ^
         detected during:
           instantiation of "void fmt::v7_lmp::detail::specs_setter<Char>::on_fill(fmt::v7_lmp::basic_string_view<Char>) [with Char=char]" at line 2823
           instantiation of "const Char *fmt::v7_lmp::detail::parse_align(const Char *, const Char *, Handler &&) [with Char=char, Handler=fmt::v7_lmp::detail::specs_checker<fmt::v7_lmp::detail::specs_handler<fmt::v7_lmp::basic_format_parse_context<char, fmt::v7_lmp::detail::error_handler>, fmt::v7_lmp::buffer_context<char>>> &]" at line 2883
           instantiation of "const Char *fmt::v7_lmp::detail::parse_format_specs(const Char *, const Char *, SpecHandler &&) [with Char=char, SpecHandler=fmt::v7_lmp::detail::specs_checker<fmt::v7_lmp::detail::specs_handler<fmt::v7_lmp::basic_format_parse_context<char, fmt::v7_lmp::detail::error_handler>, fmt::v7_lmp::buffer_context<char>>> &]" at line 3099
           instantiation of "const Char *fmt::v7_lmp::detail::format_handler<OutputIt, Char, Context>::on_format_specs(int, const Char *, const Char *) [with OutputIt=fmt::v7_lmp::detail::buffer_appender<char>, Char=char, Context=fmt::v7_lmp::buffer_context<char>]" at line 2975
           instantiation of "const Char *fmt::v7_lmp::detail::parse_replacement_field(const Char *, const Char *, Handler &&) [with Char=char, Handler=fmt::v7_lmp::detail::format_handler<fmt::v7_lmp::detail::buffer_appender<char>, char, fmt::v7_lmp::buffer_context<char>> &]" at line 2997
           instantiation of "void fmt::v7_lmp::detail::parse_format_string<IS_CONSTEXPR,Char,Handler>(fmt::v7_lmp::basic_string_view<Char>, Handler &&) [with IS_CONSTEXPR=false, Char=char, Handler=fmt::v7_lmp::detail::format_handler<fmt::v7_lmp::detail::buffer_appender<char>, char, fmt::v7_lmp::buffer_context<char>> &]" at line 3776
           instantiation of "void fmt::v7_lmp::detail::vformat_to(fmt::v7_lmp::detail::buffer<Char> &, fmt::v7_lmp::basic_string_view<Char>, fmt::v7_lmp::basic_format_args<fmt::v7_lmp::basic_format_context<fmt::v7_lmp::detail::buffer_appender<fmt::v7_lmp::type_identity_t<Char>>, fmt::v7_lmp::type_identity_t<Char>>>, fmt::v7_lmp::detail::locale_ref) [with Char=char]" at line 2752 of "/home/k0107/k010716/LAMMPS/lammps-nequip4/src/fmt/format-inl.h"
"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h", line 1669: warning: unknown attribute "fallthrough"
          C10_FALLTHROUGH;
          ^
"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h", line 1669: warning: unknown attribute "fallthrough"
          C10_FALLTHROUGH;
          ^
"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/ATen/core/ivalue_inl.h", line 296: warning: unknown attribute "fallthrough"
          C10_FALLTHROUGH;
          ^

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/ATen/core/ivalue_inl.h", line 299: warning: unknown attribute "fallthrough"
          C10_FALLTHROUGH;
          ^

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/ATen/core/ivalue_inl.h", line 296: warning: unknown attribute "fallthrough"
          C10_FALLTHROUGH;
          ^

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/ATen/core/ivalue_inl.h", line 299: warning: unknown attribute "fallthrough"
          C10_FALLTHROUGH;
          ^
"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/ordered_dict.h", line 360: warning: missing return statement at end of non-void function "torch::OrderedDict<Key, Value>::operator[](const Key &)"
  }
  ^

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/ordered_dict.h", line 368: warning: missing return statement at end of non-void function "torch::OrderedDict<Key, Value>::operator[](const Key &) const"
  }
  ^
"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/ordered_dict.h", line 360: warning: missing return statement at end of non-void function "torch::OrderedDict<Key, Value>::operator[](const Key &)"
  }
  ^

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/ordered_dict.h", line 368: warning: missing return statement at end of non-void function "torch::OrderedDict<Key, Value>::operator[](const Key &) const"
  }
  ^
"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/c10/core/TensorImpl.h", line 1669: warning: unknown attribute "fallthrough"
          C10_FALLTHROUGH;
          ^
"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/ATen/core/ivalue_inl.h", line 296: warning: unknown attribute "fallthrough"
          C10_FALLTHROUGH;
          ^

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/ATen/core/ivalue_inl.h", line 299: warning: unknown attribute "fallthrough"
          C10_FALLTHROUGH;
          ^
"/home/k0107/k010716/LAMMPS/lammps-nequip4/src/pair_nequip.cpp", line 390: warning: variable "jtype" was declared but never referenced
        int jtype = type[j];
            ^

"/home/k0107/k010716/LAMMPS/lammps-nequip4/src/pair_nequip.cpp", line 382: warning: variable "itype" was declared but never referenced
      int itype = type[i];
          ^

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/ordered_dict.h", line 360: warning: missing return statement at end of non-void function "torch::OrderedDict<Key, Value>::operator[](const Key &)"
  }
  ^

"/home/k0107/k010716/miniconda3/envs/lammps_nequip_rev3/lib/python3.8/site-packages/torch/include/torch/csrc/api/include/torch/ordered_dict.h", line 368: warning: missing return statement at end of non-void function "torch::OrderedDict<Key, Value>::operator[](const Key &) const"
  }
  ^
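
One detail from the listings above that may be worth ruling out: the conda PyTorch package was built against CUDA 11.3 and cuDNN 8.2.0, while cmake picked up CUDA 11.6 and cuDNN 8.5.0. Whether this is related to the c10::Error is not established. The sketch below only makes the comparison explicit. The helper names are hypothetical, and the version strings are copied by hand from the conda list and cmake output above.

```python
import re


def version_from_build(build, key):
    """Return the dotted version that follows `key` in a conda build string, or None."""
    match = re.search(key + r"(\d+(?:\.\d+)*)", build)
    return match.group(1) if match else None


def major_minor(version):
    """Reduce a dotted version such as '8.2.0' to its 'major.minor' prefix."""
    return ".".join(version.split(".")[:2])


# Copied from the `conda list` output (pytorch build column).
pytorch_build = "py3.8_cuda11.3_cudnn8.2.0_0"
# Copied from the cmake output ("Found CUDA" and "Found cuDNN" lines).
cmake_cuda = "11.6"
cmake_cudnn = "8.5.0"

torch_cuda = version_from_build(pytorch_build, "cuda")
torch_cudnn = version_from_build(pytorch_build, "cudnn")

cuda_mismatch = major_minor(torch_cuda) != major_minor(cmake_cuda)
cudnn_mismatch = major_minor(torch_cudnn) != major_minor(cmake_cudnn)

print("PyTorch built with CUDA", torch_cuda, "| cmake found CUDA", cmake_cuda,
      "| mismatch:", cuda_mismatch)
print("PyTorch built with cuDNN", torch_cudnn, "| cmake found cuDNN", cmake_cudnn,
      "| mismatch:", cudnn_mismatch)
```

Both comparisons report a mismatch for the values listed here, so rebuilding against a CUDA toolkit that matches the PyTorch package might be one experiment to try.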

Best regards,

hhlim12 commented Aug 31, 2022

I have attached the deployed model, the LAMMPS input file, and the aspirin structure here in case they are needed.
