How to use multi-GPU training without a SLURM system? #458
Comments
If it is just on a single node, you can use the interface described in the single-node section of the documentation: https://mace-docs.readthedocs.io/en/latest/guide/multigpu.html
Thanks for your reply. I followed the tutorial to change my code, but I ran into a new problem.
Here is what I'm running:
And here is the config.yaml:
You should comment out the …
I modified slurm_distributed.py in the mace package (under /opt/miniconda/envs/mace/lib/python3.9/site-packages/mace/tools), and now it runs.
But I suspect multi-GPU training is still not actually enabled?
I found #143 and wanted to solve the problem, so I reinstalled the version that includes the Hugging Face support, but it still doesn't run.
It also runs out of memory.
I also tested training on 4 GPUs, but a similar problem happened.
I created a modified train script for this which doesn't use the whole …; see here. Essentially it comes down to setting the required environment variables manually:

```python
# Minimal imports needed for this excerpt.
import argparse
import os

import torch

from mace import tools


def main() -> None:
    """
    This script runs the training/fine tuning for mace
    """
    args = tools.build_default_arg_parser().parse_args()
    if args.distributed:
        world_size = torch.cuda.device_count()
        import torch.multiprocessing as mp

        # Spawn one training process per visible GPU.
        mp.spawn(run, args=(args, world_size), nprocs=world_size)
    else:
        run(0, args, 1)


def run(rank: int, args: argparse.Namespace, world_size: int) -> None:
    """
    This script runs the training/fine tuning for mace
    """
    tag = tools.get_tag(name=args.name, seed=args.seed)
    if args.distributed:
        # The SLURM-based setup is commented out and replaced below by a
        # manual rendezvous over localhost.
        # try:
        #     distr_env = DistributedEnvironment()
        # except Exception as e:  # pylint: disable=W0703
        #     logging.error(f"Failed to initialize distributed environment: {e}")
        #     return
        # world_size = distr_env.world_size
        # local_rank = distr_env.local_rank
        # rank = distr_env.rank
        # if rank == 0:
        #     print(distr_env)
        # torch.distributed.init_process_group(backend="nccl")
        local_rank = rank
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "12355"
        torch.cuda.set_device(rank)
        torch.distributed.init_process_group(
            backend="nccl",
            rank=rank,
            world_size=world_size,
        )
    else:
        pass
```
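Before debugging MACE itself, it can help to confirm that plain PyTorch DDP works across both GPUs with the same manual rendezvous. The sketch below is not MACE code; it only assumes NCCL, at least two visible GPUs, and the same `localhost:12355` address used above.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # Same manual rendezvous as in the modified script above.
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    # Each rank contributes its own index; after all_reduce every rank
    # should hold the sum 0 + 1 + ... + (world_size - 1).
    t = torch.tensor([float(rank)], device=f"cuda:{rank}")
    dist.all_reduce(t)
    print(f"rank {rank}: all_reduce result = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

If this prints the expected sum from every rank, the two GPUs and NCCL are fine and any remaining failure is in the training setup itself.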
Hello, I used your method to change the code, but some errors happened and I can't understand them.
torch also reports an error.
I don't think this is a port problem, because even if I switch to a port that nothing has used before, I still get the same error.
Sounds like you're starting it twice. Make sure to use …
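If you also want to rule out a stale or clashing port entirely (while making sure only one launcher starts the processes), one option is to let the OS pick a free port once in the parent process before `mp.spawn`, so every spawned rank inherits the same value. This is a sketch, not MACE code, and `pick_free_port` is a hypothetical helper.

```python
import os
import socket


def pick_free_port() -> int:
    # Binding to port 0 asks the kernel for a TCP port that is currently
    # free; we read it back and release the socket immediately.
    # (There is a small race window before init_process_group binds it,
    # which is usually acceptable for local debugging.)
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("localhost", 0))
        return s.getsockname()[1]


# In main(), before mp.spawn: choose the port once in the parent process so
# all ranks rendezvous on the same address.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", str(pick_free_port()))
```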
Thanks for your reply, dear.
CUDA also runs out of memory.
I am using two 4090 GPUs to run.
Can you share your new log file? It does not seem to be using the two GPUs.
Thanks, dear Ilyes, here is my log file.
Hello, dear Ilyes, I would like to run MACE on a single machine with multiple GPUs (several 4090s), but the other methods I have tried recently have all failed. Can you explain in detail the specific reason why the dual-GPU run did not succeed?
Can you tell me what branch you are using? Note that we only support the official repo and not any modified fork. Also, please share your full log file, not screenshots.
I am using the official branch. Here are my SLURM submission script, run_train.txt, and the corresponding error log file, slurm-2522.log.
Does single-GPU training work? Have you edited the SLURM config file to match your environment variables?
Dear ShiQiaoL, distributed in your config is not set to True. You should look at the multi-GPU document. I also think you should write your config as YAML; it is much more straightforward.
Single-GPU training works. I revised the file following your suggestion, but my machine doesn't have SLURM; it's standalone.
Hello dear developers, I ran this script:

```bash
python /root/mace/scripts/run_train.py --name="MACE_model" \
    --train_file="train.xyz" \
    --valid_fraction=0.05 \
    --test_file="test.xyz" \
    --config_type_weights='{"Default":1.0}' \
    --model="MACE" \
    --hidden_irreps='128x0e + 128x1o' \
    --r_max=5.0 \
    --batch_size=10 \
    --energy_key="energy" \
    --forces_key="forces" \
    --max_num_epochs=100 \
    --swa \
    --start_swa=80 \
    --ema \
    --ema_decay=0.99 \
    --amsgrad \
    --restart_latest \
    --device=cuda \
```

But my computer has two 4090 GPUs and I have not installed SLURM, so this problem occurred:

```
ERROR:root:Failed to initialize distributed environment: 'SLURM_JOB_NODELIST'
```

How can I solve this problem?
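For context on that error: the quoted key suggests the default distributed setup reads SLURM environment variables such as SLURM_JOB_NODELIST, which do not exist on a standalone machine, so the lookup fails. A minimal sketch of the failure mode and of the manual workaround used earlier in this thread (the localhost address and port 12355 are just the values from that example, not required settings):

```python
import os

# On a machine without SLURM this raises KeyError, which surfaces as
# "Failed to initialize distributed environment: 'SLURM_JOB_NODELIST'".
try:
    nodelist = os.environ["SLURM_JOB_NODELIST"]
except KeyError as exc:
    print(f"Failed to initialize distributed environment: {exc}")

# Workaround from this thread: skip the SLURM lookup and provide the
# rendezvous information by hand before calling init_process_group.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12355"  # any free port works
```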