Problem Restarting Jobs #107
Comments
Hi @lyncdw19 Thank you for your interest in our code. Apologies for the delayed response, but the If you really need to get things to work with the current public code and want to try to debug this issue, some comments. The error thrown comes from I would advise trying to migrate to the new infrastructure if possible, but it's understandable if it's more favorable to continue using the current public infrastructure if you're in the middle of a project and have various models trained in that framework. Happy to give further advice. Chuin Wei
Hi Chuin Wei, Thanks so much for the reply. I am currently in the testing phase and it may be reasonable to upgrade to the newer version of the software. In your opinion, would it be better to wait until the "major overhaul" is finished -- is there an ETA for its release? Alternatively, if I switch to the develop branch, I am assuming I would need to install the develop branch of both Nequip and Allegro, but what about LAMMPS/pair-allegro versions? Best, Cory
Hi Cory, We're expecting to be ready for a release in a couple of months. The problem I guess is that the overhaul is mostly infrastructure changes (how we manage data, training, testing, etc), but also some model architecture changes (minimal but still means it's a different model than the one you're using). So if you want to continue using the exact same models you've trained, migration is probably not a good idea (and there will be a few more architecture side changes to come before the release). As for I would advise trying out the new infrastructure at this point if you want to learn and have an easier transition when the release happens, but probably not if you have an ongoing project and don't want to have to retrain your models, and run the same simulations again in a few months (models trained with the old infrastructure would not be usable with the new infrastructure). Also, let us know if you're still having trouble with the original issue posted. We can also be reached at [email protected].
Dear Chuin Wei, Thanks for the information. I have reached out to the HPC team at my university and am looking into the possibility of installing the development branches. If I have any other questions I will certainly email. Best regards, Cory
Hi Cory, thanks for your interest! I think I was over-optimistic in my suggestions and wish to now backtrack (sorry!) -- I think it's fine to use the new developments if you wanna play around with it and get used to the overall workflow of the new infrastructure, but I would advise against investing time and effort installing everything correctly for production use until everything is stable. To put it bluntly, it's not stable enough for me to recommend migrating over for production use at this point in time (but definitely fine if you wanna test it, with the expectation that things will change in breaking ways in the coming months, such that you might have to reinstall everything/retrain all your models, etc).
Dear Chuin Wei, Thanks for the updated suggestions. After consulting with my HPC team, we have decided to hold off on the upgrade until the more stable release in the next few months. The current version should be sufficient for our testing purposes and I'll be looking forward to the new release when it's available. Best regards, Cory
I have a few Allegro training jobs that stopped early after hitting the wall time limit, before they had converged. I have tried to restart them to continue training until convergence. As far as I understand, this is done by simply running the training command again within the same directory, and the previously saved model will be loaded automatically so that training continues. Unfortunately, I am getting the following error when trying to restart.
Traceback (most recent call last):
  File "..//env-allegro/.pixi/envs/default/bin/nequip-train", line 10, in <module>
    sys.exit(main())
  File "..//env-allegro/.pixi/envs/default/lib/python3.10/site-packages/nequip/scripts/train.py", line 96, in main
    trainer = restart(config)
  File "..//env-allegro/.pixi/envs/default/lib/python3.10/site-packages/nequip/scripts/train.py", line 289, in restart
    raise ValueError(
ValueError: Key "optimizer_kwargs" is different in config and the result trainer.pth file. Please double check
I have not changed the yaml config file -- I am using the same one that I originally began training with.
Does anyone know what is causing this error and how to fix it? Thanks.
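For anyone hitting the same message: the error only says that the two copies of "optimizer_kwargs" compare as unequal, not where they differ. A minimal sketch of a helper that lists the differing entries of two nested dictionaries is given below. The function name `diff_nested` and the example values are hypothetical, chosen only for illustration, and are not taken from NequIP or from the files in this issue. How the two dictionaries are obtained (one from the YAML config, one from trainer.pth) is left out, since the layout of trainer.pth is not described in this thread.

```python
from collections.abc import Mapping


class _Missing:
    """Sentinel for a key that exists on one side only."""

    def __repr__(self):
        return "<missing>"


MISSING = _Missing()


def diff_nested(a, b, path=""):
    """Yield (path, value_a, value_b) for every entry where a and b differ.

    Mappings are compared key by key; anything else is compared with `!=`.
    """
    if isinstance(a, Mapping) and isinstance(b, Mapping):
        for key in sorted(set(a) | set(b), key=str):
            sub_path = f"{path}.{key}" if path else str(key)
            yield from diff_nested(a.get(key, MISSING), b.get(key, MISSING), sub_path)
    elif a != b:
        yield path, a, b


# Made-up values, NOT the contents of the files in this issue.
# A list and a tuple with the same numbers print alike but compare unequal.
from_yaml = {"optimizer_kwargs": {"amsgrad": False, "betas": [0.9, 0.999]}}
from_checkpoint = {"optimizer_kwargs": {"amsgrad": False, "betas": (0.9, 0.999)}}

for where, left, right in diff_nested(from_yaml, from_checkpoint):
    print(f"{where}: {left!r} ({type(left).__name__}) vs {right!r} ({type(right).__name__})")
```

Printing the type next to each value helps spot cases where an entry looks the same in both places but was stored as a different type, which is one possible (unconfirmed) reason for a mismatch when the YAML file itself has not been edited.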