v0.2.0
What's Changed
- Support Llama 2 based on the official Megatron-LM repo
- Refactor tutorial docs and add tools to convert Megatron checkpoints to Hugging Face format
- Fix parameter sync when src_pipe != tgt_pipe and tgt_pipe != 1
- Reduce the number of ports required
- Refine resume training and docs
- Add node address to error message and exit with error code
- Show the log in each worker node and refine docs
- Add continue-training docs and check applied device
- Join log thread with timeout and trigger when process exits
- Support custom model flow
- Feat: support optimizer offload
- Doc: add FAQ
Full Changelog: v0.1.0...v0.2.0