
Problem Restarting Jobs #107

Open
lyncdw19 opened this issue Oct 11, 2024 · 6 comments

lyncdw19 commented Oct 11, 2024

I have a few Allegro training jobs that hit the wall-time limit and stopped early before converging. I have tried to restart them to continue training until convergence. As far as I understand, this is done by simply running the training command again within the same directory, and the previously saved model will be automatically loaded so that training continues. Unfortunately, I am getting the following error when trying to restart.

Traceback (most recent call last):
  File "..//env-allegro/.pixi/envs/default/bin/nequip-train", line 10, in <module>
    sys.exit(main())
  File "..//env-allegro/.pixi/envs/default/lib/python3.10/site-packages/nequip/scripts/train.py", line 96, in main
    trainer = restart(config)
  File "..//env-allegro/.pixi/envs/default/lib/python3.10/site-packages/nequip/scripts/train.py", line 289, in restart
    raise ValueError(
ValueError: Key "optimizer_kwargs" is different in config and the result trainer.pth file. Please double check

I have not changed the yaml config file -- I am using the same one that I originally began training with.

Does anyone know what is causing this error and how to fix it? Thanks.

@cw-tan
Collaborator

cw-tan commented Nov 22, 2024

Hi @lyncdw19

Thank you for your interest in our code. Apologies for the delayed response, but the nequip framework and the allegro model (which runs in the nequip infrastructure) are undergoing a major overhaul. We are close to the end of the revamps, and things look very different from what you see on main. The code related to your problem has been deleted in the revamps. If you're just starting on a project, it may be better to try the new nequip infrastructure and the corresponding allegro code, both on the develop branches of the respective git repositories. For getting started, the configs/tutorial.yaml on both repos should be helpful, as well as the new docs at https://nequip.readthedocs.io/en/develop/guide/workflow.html (note the develop in the URL).

If you really need to get things working with the current public code and want to try to debug this issue, a few comments. The error is raised at
https://github.com/mir-group/nequip/blob/1e150cdc8614e640116d11e085d8e5e45b21e94d/nequip/scripts/train.py#L290,
which just checks the original config file used for training against the config saved in the checkpoint. A reasonable approach would be to inspect the saved checkpoint (trainer.pth, the file named in the error, or best_model.pth) and compare the dicts being checked at that point in the code to figure out why they differ.
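
As a starting point, here is a minimal sketch of such an inspection. It assumes trainer.pth is a torch-pickled dictionary holding the saved training config under a key like "config", and that the original YAML is still in the working directory; the key names and file names are illustrative and may need adjusting to your nequip version.

# Hypothetical inspection script: the "config" key, file names, and the
# "optimizer_kwargs" entry are assumptions to adapt after printing ckpt.keys().
import torch
import yaml

ckpt = torch.load("trainer.pth", map_location="cpu")  # checkpoint saved by the trainer
print(ckpt.keys())  # find the entry that holds the saved training config

saved_config = ckpt.get("config", {})  # assumed key; adjust based on the printout above

with open("config.yaml") as f:  # the YAML used to launch training
    yaml_config = yaml.safe_load(f)

# Compare the entry named in the error message.
print("saved trainer.pth :", saved_config.get("optimizer_kwargs"))
print("original yaml     :", yaml_config.get("optimizer_kwargs"))

If the two printouts differ (for example, one side is None or an empty dict), that tells you which side of the comparison changed and what needs reconciling before restarting.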

I would advise trying to migrate to the new infrastructure if possible, but it's understandable if it's more favorable to continue using the current public infrastructure if you're in the middle of a project and have various models trained in that framework. Happy to give further advice.

Chuin Wei

@lyncdw19
Author

Hi Chuin Wei,

Thanks so much for the reply. I am currently in the testing phase and it may be reasonable to upgrade to the newer version of the software. In your opinion, would it be better to wait until the "major overhaul" is finished -- is there an ETA for its release?

Alternatively, if I switch to the develop branch, I am assuming I would need to install the develop branch of both Nequip and Allegro, but what about LAMMPS/pair-allegro versions?

Best,

Cory

@cw-tan
Collaborator

cw-tan commented Nov 25, 2024

Hi Cory,

We're expecting to be ready for a release in a couple of months. The catch is that the overhaul is mostly infrastructure changes (how we manage data, training, testing, etc.), but there are also some model architecture changes (minimal, but it still means a different model than the one you're using). So if you want to continue using the exact same models you've trained, migration is probably not a good idea (and there will be a few more architecture-side changes to come before the release).

As for pair-allegro, whatever's on current develop now would work with the existing pair-allegro. The caveat is that the overhaul raised the minimum torch version, and ideally one would use the latest stable torch. This means having to recompile pair-allegro with LAMMPS and maybe KOKKOS with a newer libtorch compatible with the torch version (which could be a pain depending on the HPC, e.g. GLIBC or GLIBCXX versioning issues, etc).

I would advise trying out the new infrastructure at this point if you want to learn it and have an easier transition when the release happens, but probably not if you have an ongoing project and don't want to retrain your models and rerun the same simulations in a few months (models trained with the old infrastructure will not be usable with the new infrastructure).

Also, let us know if you're still having trouble with the original issue posted. We can also be reached at [email protected].

@lyncdw19
Author

Dear Chuin Wei,

Thanks for the information. I have reached out to the HPC team at my university and am looking into the possibility of installing the development branches. If I have any other questions I will certainly email.

Best regards,

Cory

@cw-tan
Collaborator

cw-tan commented Nov 26, 2024

Hi Cory, thanks for your interest! I think I was over-optimistic in my suggestions and wish to backtrack (sorry!). I think it's fine to use the new developments if you want to play around with them and get used to the overall workflow of the new infrastructure, but I would advise against investing time and effort into installing everything correctly for production use until everything is stable. To put it bluntly, it's not stable enough for me to recommend migrating over for production use at this point in time (but it's definitely fine if you want to test it, with the expectation that things will change in breaking ways in the coming months, such that you might have to reinstall everything, retrain all your models, etc.).

@lyncdw19
Author

lyncdw19 commented Dec 3, 2024

Dear Chuin Wei,

Thanks for the updated suggestions. After consulting with my HPC team, we have decided to hold off on the upgrade until the more stable release in the next few months. The current version should be sufficient for our testing purposes and I'll be looking forward to the new release when it's available.

Best regards,

Cory
