
Problem Restarting Jobs #107

Open
lyncdw19 opened this issue Oct 11, 2024 · 6 comments

lyncdw19 commented Oct 11, 2024

I have a few Allegro training jobs that hit the wall-time limit and stopped early before converging. I have tried to restart them to continue training until convergence. As far as I understand, this is done by simply running the training command again within the same directory, and the previously saved model will be automatically loaded so that training continues. Unfortunately, I am getting the following error when trying to restart.

Traceback (most recent call last):
  File "..//env-allegro/.pixi/envs/default/bin/nequip-train", line 10, in <module>
    sys.exit(main())
  File "..//env-allegro/.pixi/envs/default/lib/python3.10/site-packages/nequip/scripts/train.py", line 96, in main
    trainer = restart(config)
  File "..//env-allegro/.pixi/envs/default/lib/python3.10/site-packages/nequip/scripts/train.py", line 289, in restart
    raise ValueError(
ValueError: Key "optimizer_kwargs" is different in config and the result trainer.pth file. Please double check

I have not changed the yaml config file -- I am using the same one that I originally began training with.

Does anyone know what is causing this error and how to fix it? Thanks.

@cw-tan
Collaborator

cw-tan commented Nov 22, 2024

Hi @lyncdw19

Thank you for your interest in our code. Apologies for the delayed response, but the nequip framework and the allegro model (which runs in the nequip infrastructure) are undergoing a major overhaul. We are close to the end of the revamps, and things look very different from what you see on main. The code related to your problem has been deleted in the revamps. If you're just starting on a project, it may be better to try the new nequip infrastructure and the corresponding allegro code, both on the develop branches of the respective git repositories. For getting started, the configs/tutorial.yaml on both repos should be helpful, as well as the new docs at https://nequip.readthedocs.io/en/develop/guide/workflow.html (note the develop in the URL).

If you really need to get things working with the current public code and want to try to debug this issue, a few comments. The error is raised at
https://github.com/mir-group/nequip/blob/1e150cdc8614e640116d11e085d8e5e45b21e94d/nequip/scripts/train.py#L290,
which just checks the original config file used for training against the config saved in the checkpoint. A reasonable approach would be to inspect the saved checkpoint (trainer.pth, the file named in the error, or best_model.pth) and compare the dicts being checked at that point in the code to figure out why they differ.
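
As a starting point, here is a minimal sketch of such an inspection. It assumes trainer.pth is a torch-pickled dictionary holding the saved training config under a key like "config", and that the original YAML is still in the working directory; the key names and file names are illustrative and may need adjusting to your nequip version.

# Hypothetical inspection script: the "config" key, file names, and the
# "optimizer_kwargs" entry are assumptions to adapt after printing ckpt.keys().
import torch
import yaml

ckpt = torch.load("trainer.pth", map_location="cpu")  # checkpoint saved by the trainer
print(ckpt.keys())  # find the entry that holds the saved training config

saved_config = ckpt.get("config", {})  # assumed key; adjust based on the printout above

with open("config.yaml") as f:  # the YAML used to launch training
    yaml_config = yaml.safe_load(f)

# Compare the entry named in the error message.
print("saved trainer.pth :", saved_config.get("optimizer_kwargs"))
print("original yaml     :", yaml_config.get("optimizer_kwargs"))

If the two printouts differ (for example, one side is None or an empty dict), that tells you which side of the comparison changed and what needs reconciling before restarting.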

I would advise trying to migrate to the new infrastructure if possible, but it's understandable if it's more favorable to continue using the current public infrastructure if you're in the middle of a project and have various models trained in that framework. Happy to give further advice.

Chuin Wei

@lyncdw19
Author

Hi Chuin Wei,

Thanks so much for the reply. I am currently in the testing phase and it may be reasonable to upgrade to the newer version of the software. In your opinion, would it be better to wait until the "major overhaul" is finished -- is there an ETA for its release?

Alternatively, if I switch to the develop branch, I am assuming I would need to install the develop branch of both Nequip and Allegro, but what about LAMMPS/pair-allegro versions?

Best,

Cory

@cw-tan
Collaborator

cw-tan commented Nov 25, 2024

Hi Cory,

We're expecting to be ready for a release in a couple of months. The catch is that the overhaul is mostly infrastructure changes (how we manage data, training, testing, etc.), but there are also some model architecture changes (minimal, but it still means a different model than the one you're using). So if you want to continue using the exact same models you've trained, migration is probably not a good idea (and there will be a few more architecture-side changes to come before the release).

As for pair-allegro, whatever's on current develop now would work with the existing pair-allegro. The caveat is that the overhaul raised the minimum torch version, and ideally one would use the latest stable torch. This means having to recompile pair-allegro with LAMMPS and maybe KOKKOS with a newer libtorch compatible with the torch version (which could be a pain depending on the HPC, e.g. GLIBC or GLIBCXX versioning issues, etc).

I would advise trying out the new infrastructure at this point if you want to learn it and have an easier transition when the release happens, but probably not if you have an ongoing project and don't want to retrain your models and rerun the same simulations in a few months (models trained with the old infrastructure will not be usable with the new infrastructure).

Also, let us know if you're still having trouble with the original issue posted. We can also be reached at [email protected].

@lyncdw19
Author

Dear Chuin Wei,

Thanks for the information. I have reached out to the HPC team at my university and am looking into the possibility of installing the development branches. If I have any other questions I will certainly email.

Best regards,

Cory

@cw-tan
Collaborator

cw-tan commented Nov 26, 2024

Hi Cory, thanks for your interest! I think I was over-optimistic in my suggestions and wish to backtrack (sorry!). I think it's fine to use the new developments if you want to play around with them and get used to the overall workflow of the new infrastructure, but I would advise against investing time and effort into installing everything correctly for production use until everything is stable. To put it bluntly, it's not stable enough for me to recommend migrating over for production use at this point in time (but it's definitely fine if you want to test it, with the expectation that things will change in breaking ways in the coming months, such that you might have to reinstall everything, retrain all your models, etc.).

@lyncdw19
Author

lyncdw19 commented Dec 3, 2024

Dear Chuin Wei,

Thanks for the updated suggestions. After consulting with my HPC team, we have decided to hold off on the upgrade until the more stable release in the next few months. The current version should be sufficient for our testing purposes and I'll be looking forward to the new release when it's available.

Best regards,

Cory
