A simple implementation of BERT to improve our understanding and for fun.
Co-authors: Mounes Zaval, Zeynep Akkoc
We implemented a minimalist BERT model with ALiBi (Attention with Linear Biases) in place of learned position embeddings. We wrote a class that adds an MLM head on top of the encoder and trained the model on an Arabic corpus, treating each letter as a token.
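ALiBi adds a fixed, head-specific penalty to the attention logits that grows with token distance, so no position embedding table is needed. Below is a minimal sketch of that bias computation; our actual code may differ in details such as the slope schedule, and the symmetric two-sided distance is an adaptation for a bidirectional encoder (the original ALiBi paper uses a causal, one-sided version):

```python
import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    """Return a (num_heads, seq_len, seq_len) additive attention bias."""
    # Head-specific slopes form a geometric sequence, as in the ALiBi paper:
    # 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads.
    start = 2.0 ** (-8.0 / num_heads)
    slopes = torch.tensor([start ** (h + 1) for h in range(num_heads)])
    # Symmetric |i - j| distances, since a BERT-style encoder attends both ways.
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()
    return -slopes[:, None, None] * dist[None, :, :]

# The bias is added to the raw attention logits before softmax, e.g.:
# scores = q @ k.transpose(-2, -1) / d**0.5 + alibi_bias(num_heads, seq_len)
```

Because the bias depends only on token distance, the model can in principle attend over sequences longer than those seen during training.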
The graph above shows the train/test loss and scores of our NanoBERT model over 50 epochs. The decreasing loss reflects the model's improving ability to predict masked tokens in the Quran corpus.
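For reference, the MLM objective corrupts a fraction of input positions and trains the model to recover the original tokens. The sketch below uses the standard BERT recipe (15% of positions selected; of those, 80% become [MASK], 10% a random token, 10% unchanged); the exact ratios and special-token ids here are assumptions, not necessarily what our training script uses:

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_id: int, vocab_size: int,
                mlm_prob: float = 0.15):
    """Return (corrupted_ids, labels) for masked language modeling."""
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100  # ignored by cross-entropy loss

    corrupted = input_ids.clone()
    # 80% of selected positions become the [MASK] token...
    masked = selected & (torch.rand(input_ids.shape) < 0.8)
    corrupted[masked] = mask_id
    # ...10% become a random token (half of the remaining 20%)...
    randomized = selected & ~masked & (torch.rand(input_ids.shape) < 0.5)
    corrupted[randomized] = torch.randint(vocab_size, input_ids.shape)[randomized]
    # ...and the last 10% keep their original token.
    return corrupted, labels
```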
To train the model with the provided configurations, use the following command:
python train.py \
--model_config_path configs/model_config.json \
--tokenizer_config_path configs/tokenizer_config.json \
--train_config_path configs/train_config.json \
--data_path data/quran.jsonl
This command points train.py at the model, tokenizer, and training configuration files, along with the data to train on; the model is then trained according to those configurations.
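The three config files are plain JSON. As an illustration only (the field names below are hypothetical; see the files in configs/ for the actual schema), train.py might read them like this:

```python
import json

def load_config(path: str) -> dict:
    """Load one of the JSON config files passed on the command line."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

model_cfg = load_config("configs/model_config.json")
# Hypothetical keys for illustration; the real schema lives in configs/:
# model_cfg.get("hidden_size"), model_cfg.get("num_heads"), ...
```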