Foundation Models for Science: Progress, Opportunities, and Challenges at NeurIPS 2024 [Paper]
Changwen Xu, Shang Zhu, Venkatasubramanian Viswanathan
University of Michigan
This is the official implementation of "CLOUD: A Scalable Scientific Foundation Model for Crystal Representation Learning". In this work, we introduce the CrystaL fOUnDation model (CLOUD), a Transformer-based foundation model that learns crystal representations from a novel symmetry-aware string representation and delivers accurate, generalizable, and scalable property prediction. If you find our work useful in your research, please cite:
@inproceedings{xu2024cloud,
title={CLOUD: A Scalable Scientific Foundation Model for Crystal Representation Learning},
author={Xu, Changwen and Zhu, Shang and Viswanathan, Venkatasubramanian},
booktitle={NeurIPS 2024 Workshop on Foundation Models for Science: Progress, Opportunities, and Challenges},
year={2024}
}
This work is still under development; new progress will be made publicly available when ready.
Set up the conda environment and clone the GitHub repo
# clone the source code of CLOUD
$ git clone https://github.com/ChangwenXu98/CLOUD.git
$ cd CLOUD
# create the environment from environment.yml
$ conda env create -f environment.yml
$ conda activate cloud
To obtain the string representation from CIF files:
$ python structure_to_str.py --dir <path_to_cif> --out <output_path> --numproc <num_of_processes> --batchsize <batch_size>
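For intuition, below is a minimal sketch of the kind of symmetry information (space group plus Wyckoff positions) that a symmetry-aware string can encode, using pymatgen; the actual string format CLOUD uses is defined in structure_to_str.py, and pymatgen is an assumption here.

```python
# Illustrative sketch only: shows how space-group and Wyckoff-site
# information can be extracted from a CIF with pymatgen. The exact
# string format CLOUD uses is defined in structure_to_str.py.
from pymatgen.core import Structure
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

def cif_to_symmetry_string(cif_path: str) -> str:
    structure = Structure.from_file(cif_path)      # parse the CIF
    sga = SpacegroupAnalyzer(structure)
    sym = sga.get_symmetrized_structure()          # group symmetry-equivalent sites
    # one token per distinct site: element + Wyckoff symbol (e.g. Na_4a)
    tokens = [
        f"{sites[0].specie}_{wyckoff}"
        for sites, wyckoff in zip(sym.equivalent_sites, sym.wyckoff_symbols)
    ]
    return f"sg{sga.get_space_group_number()} " + " ".join(tokens)

# e.g. rock-salt NaCl -> "sg225 Na_4a Cl_4b" (species may carry oxidation states)
```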
To pretrain CLOUD (the configuration and a detailed explanation of each variable can be found in config_pretrain.yaml):
$ python -m torch.distributed.launch --nproc_per_node=2 pretrain.py
DistributedDataParallel is used for faster pretraining.
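For reference, here is a generic sketch of the DistributedDataParallel pattern such a launch implies; the model, data, and hyperparameters below are dummies, not the repo's actual pretraining code.

```python
# Generic DistributedDataParallel skeleton; everything here (model, data,
# hyperparameters) is a placeholder, not CLOUD's actual pretraining code.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

dist.init_process_group(backend="nccl")            # launcher supplies rank/world size
local_rank = int(os.environ["LOCAL_RANK"])         # set by the launcher
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(16, 1).cuda(local_rank)    # stand-in for the Transformer
model = DDP(model, device_ids=[local_rank])        # sync gradients across GPUs
dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))  # dummy data
sampler = DistributedSampler(dataset)              # shard the data per process
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(3):
    sampler.set_epoch(epoch)                       # reshuffle shards each epoch
    for x, y in loader:
        x, y = x.cuda(local_rank), y.cuda(local_rank)
        loss = torch.nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()                            # DDP all-reduces gradients here
        opt.step()
```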
To finetune the pretrained CLOUD on MatBench or UnconvBench for crystal property prediction (the configuration and a detailed explanation of each variable can be found in config.yaml):
$ python train.py
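For orientation, below is a hedged sketch of the standard MatBench fold protocol that a finetuning script typically follows. The matbench package API shown is real; the zero-valued predictions are placeholders for the finetuned model's outputs, and train.py's actual wiring is driven by config.yaml.

```python
# Standard MatBench cross-validation loop (matbench package). The dummy
# predictions stand in for inference with the finetuned CLOUD model.
from matbench.bench import MatbenchBenchmark

mb = MatbenchBenchmark(autoload=False, subset=["matbench_mp_gap"])
for task in mb.tasks:
    task.load()
    for fold in task.folds:
        train_inputs, train_outputs = task.get_train_and_val_data(fold)
        test_inputs = task.get_test_data(fold, include_target=False)
        # placeholder: finetune on (train_inputs, train_outputs), then predict
        predictions = [0.0] * len(test_inputs)
        task.record(fold, predictions)
mb.to_file("my_results.json.gz")                   # scores + predictions per fold
```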
To finetune the pretrained CLOUD on MatBench Discovery and make predictions for the WBM test set:
$ python train_mp.py
$ python wbm_predict.py
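If it helps, here is a minimal sketch of the kind of per-material prediction file a WBM inference script might emit; the column names and values below are assumptions, so check wbm_predict.py for the actual output format.

```python
# Illustrative output format only: matbench-discovery keys predictions by
# WBM material_id. Column names and values here are assumptions;
# wbm_predict.py defines the real format.
import pandas as pd

preds = {"wbm-1-1": -1.23, "wbm-1-2": 0.45}        # dummy energies (eV/atom)
df = pd.DataFrame(
    {"material_id": list(preds), "e_form_per_atom_pred": list(preds.values())}
)
df.to_csv("wbm_predictions.csv", index=False)
```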