
odnodn/BERT_medical

 
 


Fine-tuning BERT on a language modeling task

There are two ways to train or fine-tune BERT-like or GPT-like models:

  • on a supervised downstream task
  • in an unsupervised way on a corpus

Supervised training or fine-tuning requires a labeled (ground truth) dataset, while the unsupervised approach only needs raw text.

=> We are going to work on the unsupervised approach.

Model fine tuning

We can leverage the scripts provided by Hugging Face:

Scripts

Most of the code in the scripts cited above is devoted to

  • handling both the PyTorch and TensorFlow versions

  • and passing arguments via 3 classes:

    • ModelArguments: class definition in the script.

    Arguments pertaining to which model, config and tokenizer we are going to fine-tune

    • DataTrainingArguments: class definition in the script.

    Arguments pertaining to what data we are going to input our model for training and eval

    • TrainingArguments: class imported from the transformers library.

    Arguments pertaining to the actual training / fine-tuning of the model
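
To make the argument-passing pattern concrete, here is a minimal sketch that uses only the standard library. It is a simplified stand-in for what the scripts do with HfArgumentParser, not the actual implementation: the helper parse_into_dataclasses is hypothetical, and the two dataclasses only carry a few of the fields used later in this README.

```python
import argparse
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ModelArguments:
    # Which model, config and tokenizer to fine-tune
    model_name_or_path: str = "bert-base-uncased"


@dataclass
class DataTrainingArguments:
    # What data to feed the model for training and eval
    train_file: Optional[str] = None
    validation_file: Optional[str] = None
    max_seq_length: int = 128


def parse_into_dataclasses(dataclass_types, argv):
    """Hypothetical helper: expose every dataclass field as a --flag,
    then rebuild one instance per dataclass from the parsed values."""
    parser = argparse.ArgumentParser()
    for dc in dataclass_types:
        for f in fields(dc):
            # Simplification: infer the flag type from the default value
            arg_type = int if isinstance(f.default, int) else str
            parser.add_argument(f"--{f.name}", type=arg_type, default=f.default)
    parsed = vars(parser.parse_args(argv))
    return tuple(
        dc(**{f.name: parsed[f.name] for f in fields(dc)}) for dc in dataclass_types
    )


model_args, data_args = parse_into_dataclasses(
    (ModelArguments, DataTrainingArguments),
    ["--train_file", "train.txt", "--max_seq_length", "64"],
)
print(model_args)
print(data_args)
```

Grouping the flags into dataclasses keeps model, data and training options separate, so each part of the script only receives the arguments it needs.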

The core of the script is organized as follows:

  1. Loading the data through the datasets library with

    load_dataset(data_args.dataset_name, data_args.dataset_config_name)

  2. Loading the appropriate config, tokenizer and model

    • config = AutoConfig.from_pretrained
    • tokenizer = AutoTokenizer.from_pretrained
    • model = AutoModelForMaskedLM.from_pretrained (AutoModelForMaskedLM.from_config(config) is only used when training from scratch)
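  3. Tokenizing the data

    For BERT, the tokenizer returns 3 elements by default for each sample:

    • input_ids: the index of each token within the vocab list
    • token_type_ids: the segment each token belongs to (all 0 for single-sentence inputs)
    • attention_mask: 1 for real tokens and 0 for padding, e.g. [1,1,1,1,1,0,0,0,0]
  4. The data collator handles the random masking of tokens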

    data_collator = DataCollatorForLanguageModeling

  5. Finally, the training / fine-tuning takes place

    • the trainer is instantiated: trainer = Trainer()
    • the training is run: trainer.train()
  6. The model is saved
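
Steps 3 and 4 can be illustrated with a toy, pure-Python sketch. It assumes a tiny hand-written vocabulary and whitespace tokenization (the real tokenizer uses WordPiece), and it mimics the masking rule that DataCollatorForLanguageModeling applies by default: 15% of the non-special tokens are selected, of which 80% become [MASK], 10% are replaced by a random token, and 10% are left unchanged. The functions encode and mask_tokens are hypothetical names, not library calls.

```python
import random

# Toy vocabulary (an assumption for illustration; BERT's real vocab has ~30k entries)
VOCAB = ["[PAD]", "[CLS]", "[SEP]", "[MASK]", "the", "patient", "has", "a", "fever"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}
PAD_ID = TOKEN_TO_ID["[PAD]"]
MASK_ID = TOKEN_TO_ID["[MASK]"]
SPECIAL_IDS = {TOKEN_TO_ID[t] for t in ("[PAD]", "[CLS]", "[SEP]")}
ORDINARY_IDS = [i for i in range(len(VOCAB)) if i not in SPECIAL_IDS and i != MASK_ID]
IGNORE_INDEX = -100  # label value that the loss ignores


def encode(text, max_seq_length):
    """Whitespace-tokenize, add [CLS]/[SEP], pad, and build the attention mask.
    Only words present in the toy vocabulary are supported."""
    words = text.lower().split()[: max_seq_length - 2]
    input_ids = [TOKEN_TO_ID[t] for t in ["[CLS]", *words, "[SEP]"]]
    n_pad = max_seq_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * n_pad
    return {"input_ids": input_ids + [PAD_ID] * n_pad, "attention_mask": attention_mask}


def mask_tokens(input_ids, rng, mlm_probability=0.15):
    """Return (masked_input_ids, labels) following the 80/10/10 rule."""
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, token_id in enumerate(input_ids):
        if token_id in SPECIAL_IDS or rng.random() >= mlm_probability:
            continue  # special tokens are never masked; most tokens are not selected
        labels[i] = token_id  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID
        elif roll < 0.9:
            # Simplification: sample only from the ordinary toy tokens
            masked[i] = rng.choice(ORDINARY_IDS)
        # otherwise the token is kept unchanged
    return masked, labels


encoded = encode("the patient has a fever", max_seq_length=9)
print(encoded["input_ids"])       # [1, 4, 5, 6, 7, 8, 2, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 0, 0]
masked_ids, labels = mask_tokens(encoded["input_ids"], random.Random(0), mlm_probability=0.5)
print(masked_ids, labels)
```

Positions whose label is -100 do not contribute to the loss, so the model is only trained to recover the selected tokens.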

Fine tuning

To fine-tune on our own data, specify the path to the training file and to the validation file:

# Increase --save_steps to write fewer checkpoints and save disk space
python run_mlm.py \
    --model_name_or_path bert-base-uncased \
    --max_seq_length 128 \
    --line_by_line \
    --train_file "path_to_train_file" \
    --validation_file "path_to_validation_file" \
    --do_train \
    --do_eval \
    --max_steps 5000 \
    --save_steps 1000 \
    --output_dir "results/"
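
With --line_by_line, each non-empty line of the text file is treated as one training sample. The sketch below shows one possible way to produce the two files from a list of sentences. The function name write_train_validation_files, the file names train.txt and validation.txt, and the 10% validation split are assumptions made for illustration; they do not come from the script.

```python
import random
import tempfile
from pathlib import Path


def write_train_validation_files(lines, output_dir, validation_fraction=0.1, seed=0):
    """Hypothetical helper: shuffle the samples and write one sample per line
    into train.txt and validation.txt inside output_dir."""
    samples = [line.strip() for line in lines if line.strip()]  # drop empty lines
    random.Random(seed).shuffle(samples)
    n_validation = max(1, int(len(samples) * validation_fraction))

    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    train_path = output_dir / "train.txt"
    validation_path = output_dir / "validation.txt"
    validation_path.write_text("\n".join(samples[:n_validation]) + "\n", encoding="utf-8")
    train_path.write_text("\n".join(samples[n_validation:]) + "\n", encoding="utf-8")
    return train_path, validation_path


# Usage with made-up sentences, written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    corpus = [f"sample sentence number {i}" for i in range(20)]
    train_path, validation_path = write_train_validation_files(corpus, tmp_dir)
    print(len(train_path.read_text(encoding="utf-8").splitlines()), "training lines")
    print(len(validation_path.read_text(encoding="utf-8").splitlines()), "validation lines")
```

The resulting paths can then be passed to --train_file and --validation_file.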

Notes

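  • DistilBERT is smaller than BERT (about 40% fewer parameters), so it is faster to fine-tune; to use it, pass distilbert-base-uncased as --model_name_or_path.
  • Every save_steps steps, a checkpoint of the model is saved in its own directory under output_dir. This can quickly eat up all the space on the disk; increase --save_steps or set --save_total_limit to keep only the most recent checkpoints.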
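
To get a feel for the disk usage, the sketch below lists the steps at which a checkpoint is written for the command above and gives a rough size estimate per checkpoint. The estimate is an assumption, not a measurement: it supposes about 110 million parameters for bert-base-uncased, 32-bit values, and three copies per parameter (the weights plus two Adam optimizer states). Both function names are hypothetical.

```python
def checkpoint_steps(max_steps, save_steps):
    """Steps at which a checkpoint-<step> directory is written."""
    return list(range(save_steps, max_steps + 1, save_steps))


def estimate_checkpoint_gb(n_parameters, bytes_per_value=4, copies=3):
    """Rough size of one checkpoint in GB (weights + two Adam states, an assumption)."""
    return n_parameters * bytes_per_value * copies / 1e9


steps = checkpoint_steps(max_steps=5000, save_steps=1000)
print(steps)  # [1000, 2000, 3000, 4000, 5000]

per_checkpoint_gb = estimate_checkpoint_gb(110_000_000)
print(f"~{per_checkpoint_gb:.2f} GB per checkpoint, "
      f"~{per_checkpoint_gb * len(steps):.1f} GB in total")
```

Under these assumptions, doubling save_steps roughly halves the space used by intermediate checkpoints.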

About

Fine tuning BERT on MIMIC dataset
