Preprocess the data

usage: preprocess.py [-h] --corpus_path CORPUS_PATH
                     [--dataset_path DATASET_PATH]
                     [--tokenizer {bert,bpe,char,space,xlmroberta,image,text_image}]
                     [--vocab_path VOCAB_PATH] [--merges_path MERGES_PATH]
                     [--spm_model_path SPM_MODEL_PATH]
                     [--do_lower_case {true,false}]
                     [--vqgan_model_path VQGAN_MODEL_PATH]
                     [--vqgan_config_path VQGAN_CONFIG_PATH]
                     [--tgt_tokenizer {bert,bpe,char,space,xlmroberta}]
                     [--tgt_vocab_path TGT_VOCAB_PATH]
                     [--tgt_merges_path TGT_MERGES_PATH]
                     [--tgt_spm_model_path TGT_SPM_MODEL_PATH]
                     [--tgt_do_lower_case {true,false}]
                     [--processes_num PROCESSES_NUM]
                     [--data_processor {bert,lm,mlm,bilm,albert,mt,t5,cls,prefixlm,gsg,bart,cls_mlm,vit,vilt,clip,s2t,beit,dalle}]
                     [--docs_buffer_size DOCS_BUFFER_SIZE]
                     [--seq_length SEQ_LENGTH]
                     [--tgt_seq_length TGT_SEQ_LENGTH]
                     [--dup_factor DUP_FACTOR]
                     [--short_seq_prob SHORT_SEQ_PROB] [--full_sentences]
                     [--seed SEED] [--dynamic_masking] [--whole_word_masking]
                     [--span_masking] [--span_geo_prob SPAN_GEO_PROB]
                     [--span_max_length SPAN_MAX_LENGTH]
                     [--sentence_selection_strategy {lead,random}]

Users have to preprocess the corpus before pre-training. An example of pre-processing on a single machine:

python3 preprocess.py --corpus_path corpora/book_review_bert.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --dynamic_masking --data_processor bert

The output of the pre-processing stage is dataset.pt (--dataset_path), which is the input of pretrain.py. If multiple machines are available, users can run preprocess.py on one machine and copy dataset.pt to the other machines.
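
For example, the dataset could be copied to another machine with scp; the hostname and destination path below are placeholders:

scp dataset.pt user@machine-2:/path/to/TencentPretrain/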

We need to specify the format of the dataset.pt generated by the pre-processing stage (--data_processor), since different pre-training models require different data formats in the pre-training stage. Currently, TencentPretrain supports data formats for a wide range of pre-training models, for example (a sample command for the lm format is shown after the list):

  • lm: language model
  • mlm: masked language model
  • cls: classification
  • bilm: bi-directional language model
  • bert: masked language model + next sentence prediction
  • albert: masked language model + sentence order prediction
  • prefixlm: prefix language model
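
For instance, to build an lm-format dataset for a GPT-style language model, a command along the following lines could be used; the corpus and vocabulary paths are only illustrative and mirror the single-machine example above:

python3 preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --seq_length 128 --data_processor lm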

Notice that we should use a corpus (--corpus_path) whose format is in accordance with the specified --data_processor. More use cases can be found in Pretraining model examples.

--processes_num n denotes that n processes are used for pre-processing. More processes can speed up the pre-processing stage but lead to more memory consumption.
--dup_factor denotes the number of times each instance is duplicated (when using static masking). Static masking is used in BERT; the masked words are determined in the pre-processing stage.
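
A static-masking pre-processing command might look like the following; the duplication factor of 5 is only an illustrative value:

python3 preprocess.py --corpus_path corpora/book_review_bert.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --dup_factor 5 --data_processor bert
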
--dynamic_masking denotes that the words are masked during the pre-training stage, which is used in RoBERTa. Dynamic masking performs better, and the output file (--dataset_path) is smaller since instances do not have to be duplicated.
--full_sentences allows a sample to include content from multiple documents, which is used in RoBERTa.
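
For example, a RoBERTa-style dataset could be built roughly as follows (corpus and vocabulary paths are illustrative):

python3 preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --dynamic_masking --full_sentences \
                      --data_processor mlm
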
--span_masking denotes that consecutive words are masked, which is used in SpanBERT. If dynamic masking is used, we should specify --span_masking in the pre-training stage; otherwise we should specify --span_masking in the pre-processing stage.
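
With static masking, span masking is therefore specified at pre-processing time, for example (the span parameters below are illustrative values):

python3 preprocess.py --corpus_path corpora/book_review.txt --vocab_path models/google_zh_vocab.txt \
                      --dataset_path dataset.pt --processes_num 8 --data_processor mlm \
                      --span_masking --span_geo_prob 0.2 --span_max_length 10
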
--docs_buffer_size specifies the buffer size in memory in the pre-processing stage.
Sequence length is specified in the pre-processing stage by --seq_length. The default value is 128. When doing incremental pre-training upon an existing pre-trained model, --seq_length should be smaller than the maximum sequence length the pre-trained model supports (--max_seq_length).

The vocabulary and tokenizer are also specified in the pre-processing stage. More details are discussed in the Tokenization and vocabulary section.
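
For example, a SentencePiece model can be supplied instead of a vocabulary file; the model path below is a placeholder for a trained SentencePiece model:

python3 preprocess.py --corpus_path corpora/book_review.txt --spm_model_path models/spm.model \
                      --dataset_path dataset.pt --processes_num 8 --dynamic_masking --data_processor mlm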
