
Has anyone successfully run continual pre-training with GLM? #185

Open
wjn1996 opened this issue Jul 8, 2023 · 0 comments
Comments

wjn1996 commented Jul 8, 2023

Has anyone successfully run continual pre-training with GLM?

  • Does the pre-training corpus need to be preprocessed by hand, or can GLM automatically generate the masked text for us before training? Which directory should the pre-training data be placed in, and what format should it use (is it simply plain text)?
  • Is adding new special tokens supported? I saw markers such as [gMASK], [sMASK], and [dBLOCK] in tokenization.py; do I need to modify this file?
  • If I want to change or add my own masking tasks for the pre-training data, where should I write that code?
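For context on the last question, the masking GLM applies during pre-training follows the "autoregressive blank infilling" idea: spans of the input are replaced by a mask placeholder, and the removed spans become the generation targets. The toy sketch below illustrates that concept only; it is not GLM's actual data pipeline, and the function name, the `[MASK]` placeholder string, and the single-span simplification are all my own assumptions for illustration.

```python
def blank_infill(tokens, span_start, span_length):
    """Toy sketch of span-corruption for blank infilling.

    Replaces one contiguous span of `tokens` with a single "[MASK]"
    placeholder and returns (corrupted_input, target_span). Real
    pre-training code samples many spans, handles sentinel tokens such
    as [gMASK]/[sMASK], and works on token IDs rather than strings;
    this version keeps only the core idea.
    """
    corrupted = (
        tokens[:span_start]
        + ["[MASK]"]
        + tokens[span_start + span_length:]
    )
    target = tokens[span_start:span_start + span_length]
    return corrupted, target


# Example: mask the span ["quick", "brown"] out of a four-token input.
inp, tgt = blank_infill(["the", "quick", "brown", "fox"], 1, 2)
# inp == ["the", "[MASK]", "fox"], tgt == ["quick", "brown"]
```

A custom masking task would amount to changing how spans are chosen (length distribution, document-level vs. sentence-level masking) before producing the corrupted input/target pair.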