Document classification with BERT
The code is based on https://github.com/AndriyMulyar/bert_document_classification, with the following modifications:
- switch from the pytorch-transformers library to the transformers library (https://github.com/huggingface/transformers);
- unfreezing of the last encoder layers of BERT instead of freezing BERT entirely (the latter results in subpar performance); see the sketch below;
- support for document classification with DistilBERT (a BERT variant with 40% fewer parameters).
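A minimal sketch of the unfreezing step, assuming the standard Hugging Face transformers API (the exact model class and layer indices used in this repo may differ):

```python
# Hedged sketch: freeze all of BERT, then unfreeze only its last encoder layer.
# Adjust the slice (e.g. encoder.layer[-2:]) to unfreeze more layers.
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

# Freeze every parameter first ...
for param in model.parameters():
    param.requires_grad = False

# ... then make the last encoder layer trainable again.
for param in model.encoder.layer[-1].parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable}")
```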
For information on how to obtain the data needed to replicate the results reported in https://arxiv.org/abs/1910.13664, see https://github.com/ArneDefauw/bert_document_classification/blob/master/examples/ml4health_2019_replication/data/README.md.
In short, you must sign the appropriate data use agreements at https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/.
Afterwards, you can download the data. The folder structure should look like this:
examples/ml4health_2019_replication/data/
├── obesity_patient_records_test.xml
├── obesity_patient_records_training2.xml
├── obesity_patient_records_training.xml
├── obesity_standoff_annotations_test.xml
├── obesity_standoff_annotations_training_addendum2.xml
├── obesity_standoff_annotations_training_addendum3.xml
├── obesity_standoff_annotations_training_addendum.xml
├── obesity_standoff_annotations_training.xml
├── README.md
├── smokers_surrogate_test_all_version2.xml
└── smokers_surrogate_train_all_version2.xml
Clinical BERT: https://github.com/EmilyAlsentzer/clinicalBERT
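As an alternative to downloading the weights from that repository, Clinical BERT can also be loaded directly from the Hugging Face model hub (the emilyalsentzer/Bio_ClinicalBERT checkpoint); a minimal sketch:

```python
# Hedged sketch: load Clinical BERT from the Hugging Face hub instead of
# from locally downloaded weights.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
```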
Use the config files in https://github.com/ArneDefauw/BERT_doc_classification/blob/master/bert_document_classification/examples/ml4health_2019_replication together with the Python training scripts (train_n2c2_2006.py, train_n2c2_2008.py, train_newstest.py). The latter trains a document classifier on the 20 Newsgroups dataset (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html), which is downloaded automatically when the script is run.
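For reference, the 20 Newsgroups data can be fetched with scikit-learn as sketched below (train_newstest.py may use different fetch options):

```python
# Hedged sketch: fetch the 20 Newsgroups corpus with scikit-learn.
# The dataset is downloaded and cached automatically on first use.
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

print(len(train.data), "training documents,", len(train.target_names), "classes")
```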
For inference, use the prediction scripts: predict_n2c2_2006.py, predict_n2c2_2008.py, predict_newstest_bert.py and predict_newstest_distilbert.py.
This repository supports five architectures: DocumentBertLSTM, DocumentDistilBertLSTM, DocumentBertTransformer, DocumentBertLinear and DocumentBertMaxPool.
Note: none of these architectures was able to replicate the results reported in https://arxiv.org/abs/1910.13664.
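To make the architecture names concrete, here is a hedged sketch of the idea behind DocumentBertLSTM (not the repository's exact implementation; the class and argument names below are hypothetical): a long document is split into 512-token chunks, each chunk is encoded with BERT, and the per-chunk [CLS] vectors are pooled with an LSTM before a linear classification layer.

```python
import torch.nn as nn
from transformers import BertModel


class DocumentBertLSTMSketch(nn.Module):
    """Illustrative sketch only; names and defaults are assumptions."""

    def __init__(self, num_labels, bert_name="bert-base-uncased", lstm_hidden=256):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, lstm_hidden, batch_first=True)
        self.classifier = nn.Linear(lstm_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # input_ids, attention_mask: (num_chunks, seq_len) for one document
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vectors = outputs[0][:, 0, :]                   # (num_chunks, hidden_size)
        _, (h_n, _) = self.lstm(cls_vectors.unsqueeze(0))   # treat chunks as a sequence
        return self.classifier(h_n[-1])                     # (1, num_labels)
```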
Freezing Clinical BERT + LSTM:
Unfreezing last encoding layer of Clinical BERT + LSTM:
Unfreezing last encoding layer of Clinical BERT + linear layer:
Freezing Clinical BERT + LSTM:
Unfreezing last encoding layer of Clinical BERT + LSTM:
Unfreezing last encoding layer of Clinical BERT + linear layer:
Freezing bert-base-uncased + LSTM:
Unfreezing last encoding layer of bert-base-uncased + LSTM:
Unfreezing last encoding layer of distilbert-base-uncased + LSTM: