Error with the new pair_allegro-stress branch #12
Hi @nukenadal. Thanks for your interest in our code and for trying this new feature! Hopefully this resolves the issue, and please let me know if you observe any issues or suspicious results when using it.
Thank you so much for your rapid response! The above problem is resolved with the pair_style allegro3232. I tested an NpT MD run, but I found that the stresses are not predicted as I expected. I used the normal
where the stresses are saved in units of eV per cubic angstrom. Then, in the MD production run, I had a printout like this:
The pressure was initially quite high and then lowered towards 1 bar. I was wondering whether any additional setting is required to reconcile the stresses from the FF with how LAMMPS handles the pressure of the cell? I used the main branches of NequIP 0.5.6 and Allegro 0.2.0, if that is relevant. Thank you!
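(A general note on units for anyone reading later: in LAMMPS metal units the thermo pressure is reported in bar, while the model stresses above are in eV per cubic angstrom; 1 eV/Å³ = 160.2177 GPa ≈ 1.602 × 10⁶ bar, so even a small residual stress in eV/Å³ corresponds to a very large pressure in bar.)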
Hi @nukenadal, you can try the para_stress branch of NequIP and the stress branch of pair_allegro, and train with a config.yaml that includes
@nukenadal: have you verified the model on a test set / validation set, such as with
It's hard for us to say without more details on the system, the training data, and the LAMMPS input file...
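For anyone following this thread, one way to run this kind of test-set check is with the nequip-evaluate command; the sketch below is an assumed typical invocation (the training-directory path and batch size are placeholders), not necessarily the exact command referenced above:

```bash
# Evaluate the trained model on the test/validation split recorded in the training directory.
# The path and batch size are placeholders; adjust them to your own run.
nequip-evaluate --train-dir results/my-run --batch-size 16
```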
Sorry, I found that the above MD calculation had a mismatched FF and input structure. That is probably why the structure disintegrated rapidly. I retrained the FF with more data and epochs. It gives the following test results from
Moreover, I tried re-compiling LAMMPS because I accidentally wiped out the
I tested multiple combinations of packages, but this issue remains. Do you have any suggestions for this type of error? Thanks so much!
This is an error of less than 1%, which sounds excellent; in an absolute sense it also looks good to me... @simonbatzner? Regarding the segfault, in general we've seen this either when a run goes out of memory or when there is an empty simulation domain in a parallel LAMMPS simulation... does either of these sound applicable?
With a few tests, I found that this is actually an out-of-memory error. I reduced the FF training settings (4 layers to 2 layers), and the production run now completes without hitting the memory limit and gives meaningful trajectories. Thank you so much for the valuable advice! Some extra points, though not quite relevant:
I wonder if this is telling me that I don't have enough storage space for some temporary data, or whether it is a memory error like the one above?
Hi @nukenadal,
Great! This should also be possible to resolve, then, by running on more GPUs with the same model.
You still have ~2500 atoms, like in your original post? And you mean two GPUs, not two GPU nodes? Are you measuring this through the LAMMPS performance summary printed at the end, and if so, how does the percentage of time spent in "Comm" change for you from one to two GPUs?
This happens after it prints "Processing..." but before "Loaded Dataset"? If so, then yes, it is failing to write out the preprocessed data file, which is an efficient binary format but can be quite large, since it includes full neighborlist information; that can add up quickly for a big cutoff in a dense system.
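(As a rough, purely illustrative estimate with assumed numbers, not figures from this system: a frame with a few thousand atoms and a few hundred neighbors per atom is on the order of 10^5 to 10^6 edges, and storing the edge indices alongside positions, forces, and stresses for a few thousand frames can reach many gigabytes, which is why the preprocessed file can end up far larger than the raw dataset.)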
Hi,
I used about 2600 atoms again. I was comparing 1 GPU on 1 node with 2 GPUs on 1 node, measuring the speed via the time taken for the same 2000 MD steps. I got the following results. For 1 GPU:
And for 2 GPUs:
I wonder if this allows us to find out why the scaling is abnormal, as the percentage of time spent on Comm was similar.
This happened before the line "Loaded Dataset", as the first few lines in the log are:
^ From the second log, @nukenadal, are you sure it is using two GPUs? Have you checked
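(A generic way to confirm this, not necessarily what was being referred to above, is to watch GPU utilization and per-process memory while the job is running:)

```bash
# Refresh nvidia-smi every 2 seconds while the LAMMPS job runs;
# both GPUs should show a LAMMPS process and nonzero utilization.
watch -n 2 nvidia-smi
```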
Regarding OOM, that looks like preprocessing, yes. You can preprocess your dataset on a normal CPU node (where hopefully you can allocate more RAM) using
I found that I may not have properly set up the multi-GPU run. What command should I use to call, say, 2 GPUs on the same node? Is it just the same with
Yes (or possibly a launcher like
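For reference, a minimal sketch of one common way to launch LAMMPS over two GPUs on a single node; the executable name (lmp), the input file name, and the assumption that pair_allegro picks up one GPU per MPI rank are assumptions here, not details confirmed in this thread:

```bash
# Two MPI ranks on one node; each rank is expected to use its own GPU.
# Executable and input file names are placeholders.
mpirun -np 2 lmp -in in.npt
```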
Thank you so much for your help! The multi-GPU and OOM issues are both resolved.
Hi Allegro developers,
Thank you for the update to the stress branch of pair_allegro. I am trying to install this new branch to enable stress prediction in my MD production runs. Previously I used the main branch and made several successful NVT MD runs, but with this stress branch some errors occurred when producing NpT MD trajectories.
I used the following versions of the packages:
torch 1.11.0
libtorch 1.13.0+cu116 (since the documentation says to avoid 1.12)
CUDA 11.4.2
cuDNN 8.5.0.96
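(For anyone reproducing this setup: a typical LAMMPS build with pair_allegro is configured against libtorch roughly as below. The paths are placeholders, the pair_allegro sources are assumed to have already been added to the LAMMPS tree as described in its README, and this is a sketch rather than the exact commands used here.)

```bash
# Configure and build LAMMPS with CMake, pointing it at an unpacked libtorch.
# All paths are placeholders for this machine.
cd lammps && mkdir -p build && cd build
cmake ../cmake -DCMAKE_PREFIX_PATH=/path/to/libtorch
make -j 8
```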
LAMMPS was successfully compiled according to the documentation, and the FF was trained with stresses included. In the following MD run, I got this LAMMPS output:
It stopped with this error:
I wonder if this is because I used a combination of packages that the new stress branch is not compatible with?
Thank you so much.