optimum-neuron==0.0.22
transformers==4.36.2
python==3.10
torch==2.1.2
optimum==1.18.*
Training image in SageMaker: https://github.com/aws-neuron/deep-learning-containers/blob/2.19.1/docker/pytorch/training/2.1.2/Dockerfile.neuronx
@michaelbenayoun @JingyaHuang @philschmid
examples
https://github.com/aws-samples/ml-specialized-hardware/blob/main/tutorials/06_FinetuneLLMs/01_Finetune_LLMs.ipynb
The training is based on the notebook above. I used tensor parallelism tp=8 and pipeline parallelism pp=2 on two trn1.32xlarge instances, with the official Llama 3 8B model.
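For reference, the parallel layout implied by these settings can be worked out with a little arithmetic. The sketch below is only an illustration: it assumes one worker per NeuronCore and 32 NeuronCores per trn1.32xlarge, and the function name `data_parallel_degree` is hypothetical, not part of optimum-neuron.

```python
# Assumption: each trn1.32xlarge exposes 32 NeuronCores
# (16 Trainium chips with 2 cores each) and one worker runs per core.
CORES_PER_TRN1_32XLARGE = 32


def data_parallel_degree(num_instances: int, tp: int, pp: int,
                         cores_per_instance: int = CORES_PER_TRN1_32XLARGE) -> int:
    """Return how many model replicas fit, given tensor (tp) and pipeline (pp) sizes."""
    world_size = num_instances * cores_per_instance
    model_parallel = tp * pp  # cores consumed by a single model replica
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world size {world_size} is not divisible by tp*pp={model_parallel}"
        )
    return world_size // model_parallel


# Settings from this report: 2 instances, tp=8, pp=2.
print(data_parallel_degree(num_instances=2, tp=8, pp=2))  # prints 4
```

Under those assumptions the job runs 64 workers, with 16 cores per model replica and 4 data-parallel replicas.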
The following errors showed up during the finetuning of the model:
2024-Jul-31 14:47:33.095528
2024-Jul-31 14:47:33.095528
2024-Jul-31 14:47:33.095536
2024-Jul-31 14:47:33.095537
2024-Jul-31 14:47:33.095535
2024-Jul-31 14:47:33.095538
2024-Jul-31 14:47:33.095535
2024-Jul-31 14:47:33.095540
2024-Jul-31 14:47:33.095544
98:250 ERROR TDRV:v2_cc_execute
104:258 ERROR TDRV:v2_cc_execute
2024-Jul-31 14:47:33.095545
114:253 ERROR TDRV:v2_cc_execute
120:261 ERROR TDRV:v2_cc_execute
125:286 ERROR TDRV:v2_cc_execute
112:277 ERROR TDRV:v2_cc_execute
[nec_dev 4, gid 4] MPMD detected but reload is not supported yet
96:263 ERROR TDRV:v2_cc_execute
117:274 ERROR TDRV:v2_cc_execute
115:273 ERROR TDRV:v2_cc_execute
2024-Jul-31 14:47:33.095550
[nec_dev 31, gid 31] MPMD detected but reload is not supported yet
[nec_dev 18, gid 18] MPMD detected but reload is not supported yet
[nec_dev 23, gid 23] MPMD detected but reload is not supported yet
2024-Jul-31 14:47:33.095558
2024-Jul-31 14:47:33.095544
2024-Jul-31 14:47:33.095567
[nec_dev 21, gid 21] MPMD detected but reload is not supported yet
2024-Jul-31 14:47:33.095543
116:256 ERROR TDRV:v2_cc_execute
2024-Jul-31 14:47:33.095573
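Because many workers write to the same stream at once, the output above is interleaved and hard to read. A minimal sketch for summarizing such a log is shown here; the helper name `summarize` and the regular expressions are assumptions based only on the message shapes visible in this report, not on any documented Neuron runtime log format.

```python
import re

# Patterns inferred from the log lines in this report (assumption, not a spec).
MPMD_MSG = re.compile(
    r"\[nec_dev (\d+), gid (\d+)\] MPMD detected but reload is not supported yet"
)
TDRV_ERROR = re.compile(r"\d+:\d+ ERROR TDRV:(\w+)")


def summarize(log: str) -> dict:
    """Collect the devices reporting the MPMD message and count TDRV errors."""
    devices = sorted({int(m.group(1)) for m in MPMD_MSG.finditer(log)})
    tdrv_errors = len(TDRV_ERROR.findall(log))
    return {"devices": devices, "tdrv_errors": tdrv_errors}


# Excerpt copied from the interleaved output in this report.
sample = (
    "112:277 ERROR TDRV:v2_cc_execute [nec_dev 4, gid 4] MPMD detected but "
    "reload is not supported yet 96:263 ERROR TDRV:v2_cc_execute "
    "2024-Jul-31 14:47:33.095550[nec_dev 31, gid 31] MPMD detected but reload "
    "is not supported yet[nec_dev 18, gid 18] MPMD detected but reload is not "
    "supported yet"
)
print(summarize(sample))
```

Matching on the message body rather than on line boundaries is deliberate, since the interleaving means line breaks cannot be trusted.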
Try installing optimum-neuron from source. Recent changes have fixed several issues, including some MPMD errors.
pip install git+https://github.com/huggingface/optimum-neuron.git
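After reinstalling, it can help to confirm which versions the training environment actually picks up. The following is a minimal sketch using only the Python standard library; the package list mirrors the ones named in this report, and the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Packages mentioned in this report; adjust as needed.
PACKAGES = ["optimum-neuron", "optimum", "transformers", "torch"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

A source install of optimum-neuron would typically report a development version string rather than 0.0.22, though the exact string depends on the commit that was installed.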
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.