Starting from a previous model #42
-
Is pretraining expected to work for Allegro? I have some data on which I train an Allegro model; I then add some more data. Adapting what worked for straight NequIP, I try setting `initialize_from_state` in the yaml, along with the accompanying options. When I try to start this, the first epoch gives me output that doesn't look like a continuation of the previous training. Am I doing something obviously wrong?
-
Hi @terryfrankcombe,

Are the initial validation epochs consistent with the last epochs of the previous training you are starting from? (One way to check this concretely is sketched at the end of this reply.) This looks technically correct from your post.

Can you elaborate on "what worked for NequIP" and what you mean by "worked" there? From a machine learning perspective, it is not always possible to improve a model with finetuning like this, especially under a more significant data shift. To the best of my knowledge this is largely unexplored in the MLIP literature, and most efforts I am aware of retrain from scratch when they add new data. For example, I believe @svandenhaute's Psiflow (https://www.nature.com/articles/s41524-023-00969-x, https://svandenhaute.github.io/psiflow/) retrains from scratch.

Thanks!
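A minimal sketch of what I mean by checking consistency, assuming both rundirs contain a `metrics_epoch.csv`; the paths and the filename here are placeholders, so adjust them to whatever your nequip/allegro version actually writes:

```python
import pandas as pd

# Paths are examples only -- point these at your actual rundirs.
prev = pd.read_csv("results/mysystem/run-1/metrics_epoch.csv")
fine = pd.read_csv("results/mysystem/run-2-finetune/metrics_epoch.csv")

print("Last epochs of the previous run:")
print(prev.tail(3))
print("First epochs of the fine-tuning run:")
print(fine.head(3))

# If the previous model state was actually loaded, the first validation
# metrics of the new run should be close to the last ones of the previous
# run (up to the shift introduced by the newly added data).
```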
-
In psiflow, we do in fact support both training from scratch as well as starting from a pretrained model. I personally use this extensively with NequIP, and I did notice that total training time is significantly reduced when starting from a pretrained model. This should also work with Allegro, though I have less experience with it. You can take a look at the modified train script to see how we do it; besides initializing with the pretrained weights, you also need to reset the validation metrics in the trainer (a rough sketch follows below).

EDIT: sorry, I didn't notice this was a discussion; this should be part of @Linux-cpp-lisp's reply thread.
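A rough sketch of that idea, as an illustration only and not psiflow's actual train script; the `Trainer` attribute names and the checkpoint format are assumptions that differ between nequip versions:

```python
import torch

def start_from_pretrained(trainer, state_path):
    # Load the previous run's weights into the fresh trainer's model.
    # Assumes state_path holds a plain state_dict; some versions save the
    # whole model object instead, in which case adapt accordingly.
    state_dict = torch.load(state_path, map_location="cpu")
    trainer.model.load_state_dict(state_dict)

    # Reset the "best validation metric so far" bookkeeping, so the new run
    # does not compare its checkpoints against the old run's stale metrics.
    # `best_metrics` is a hypothetical attribute name -- check your Trainer.
    trainer.best_metrics = float("inf")
    return trainer
```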
Hi Terry,
Interesting; I'm glad (and interested) to hear that worked well for you in NequIP. Are you doing the same system and data shift with Allegro?
It is certainly possible that there is a bug here; one thing I can think of: did you remember to clear the rundirs for your finetuning? If you accidentally had `append: True` and still had a rundir, `initialize_from_state` would be a no-op, since the run would be a restart instead of a new model...
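For reference, a config sketch of the fresh-rundir setup I mean; this is a sketch only, where `root` and `run_name` are standard nequip-style options, but check the exact keys and values against your version's example configs:

```yaml
root: results/mysystem
run_name: run-2-finetune     # fresh run name => fresh rundir, so this is a new model
append: false                # if this is true and the rundir already exists, the run
                             # becomes a restart and initialize_from_state is a no-op
# initialize_from_state: ... # keep whatever you already set for fine-tuning here
```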