Kaggle Competition: Predict which Tweets are about real disasters and which ones are not
Model | Best Accuracy | Rank |
---|---|---|
BERT | 84.13% | 39/860 Top 4% |
RoBERTa | 83.97% | 53/860 Top 6% |
username@localhost:~$ conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
username@localhost:~$ pip install transformers
username@localhost:~$ pip install datasets
username@localhost:~$ pip install -U scikit-learn
username@localhost:~$ pip install numpy
username@localhost:~$ pip install pandas
username@localhost:~$ pip install tqdm
username@localhost:~$ pip install colorama
username@localhost:~$ pip install seaborn
username@localhost:~$ pip install nltk
The first option is to tune the hyperparameters by running the run.sh file directly. This takes a long time, about 100 hours. The best parameters found for each model are provided below.
username@localhost:~$ bash src/run.sh
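The contents of run.sh are not reproduced here. As a rough illustration of why a full sweep takes so long, the sketch below enumerates every combination of a small grid and builds one train.py command per combination. The candidate values in `GRID` and the helper `build_commands` are hypothetical and are not taken from run.sh; only the flag names come from this README.

```python
from itertools import product

# Hypothetical candidate values; the grid actually used by run.sh is not shown in this README.
GRID = {
    "model_name": ["bert_base", "roberta_base"],
    "threshold": [0.4, 0.5, 0.6],
    "batchsize": [8, 16],
    "dropout": [0.1, 0.3],
    "layer": [1, 3],
}


def build_commands(grid, script="src/train.py"):
    """Return one argument list per combination of values in the grid."""
    names = list(grid)
    commands = []
    for values in product(*(grid[name] for name in names)):
        cmd = ["python3", script]
        for name, value in zip(names, values):
            cmd += [f"--{name}", str(value)]
        commands.append(cmd)
    return commands


commands = build_commands(GRID)
print(len(commands), "training runs in this grid")
print(" ".join(commands[0]))
```

Even this small grid yields 48 separate training runs, so the total time grows quickly with each extra value. Each argument list could be passed to `subprocess.run` to launch the run.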
Alternatively, you can run the Python file directly. It is executed once and the result is printed. You can pass different parameter values each time you run it.
username@localhost:~$ python3 src/train.py --model_name [$model_name] --threshold [$threshold] --batchsize [$batchsize] --dropout [$dropout] --layer [$layer]
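For readers who want to see how flags like these are typically read, here is a minimal argparse sketch. It is not the actual code in train.py: the types, the defaults (borrowed from the best BERT settings in this README), and the help texts are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the five flags named in this README (types and defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Fine-tune a model on the disaster tweets data")
    parser.add_argument("--model_name", type=str, default="bert_base", help="e.g. bert_base or roberta_base")
    parser.add_argument("--threshold", type=float, default=0.6, help="decision threshold")
    parser.add_argument("--batchsize", type=int, default=8, help="training batch size")
    parser.add_argument("--dropout", type=float, default=0.3, help="dropout rate")
    parser.add_argument("--layer", type=int, default=3, help="value of the --layer flag")
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to try without a shell.
args = parse_args(["--model_name", "roberta_base", "--batchsize", "16", "--layer", "1"])
print(args.model_name, args.threshold, args.batchsize, args.dropout, args.layer)
```

Flags that are omitted fall back to their defaults, so only the values that differ need to be given on the command line.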
Run the following command to obtain the best result for the BERT model:
python src/train.py \
--model_name bert_base \
--threshold 0.6 \
--batchsize 8 \
--dropout 0.3 \
--layer 3
Run the following command to obtain the best result for the RoBERTa model:
python src/train.py \
--model_name roberta_base \
--threshold 0.6 \
--batchsize 16 \
--dropout 0.3 \
--layer 1
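Both commands use `--threshold 0.6` rather than the usual 0.5. The sketch below shows one common way such a threshold is applied, assuming the model emits a single logit per tweet and that a tweet is labeled as a real disaster (1) when its sigmoid probability reaches the threshold. Whether train.py uses the threshold exactly this way is an assumption, and `predict_labels` is a hypothetical helper.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(logits, threshold=0.6):
    """Label a tweet 1 (disaster) when its probability is at least the threshold."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


logits = [0.0, 0.3, 1.0, -2.0]  # made-up model outputs for four tweets
print(predict_labels(logits, threshold=0.5))
print(predict_labels(logits, threshold=0.6))
```

Under this assumption, raising the threshold makes the classifier more conservative: borderline tweets that would pass at 0.5 are labeled as non-disasters at 0.6.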
Note: Results may deviate slightly from run to run.
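Run-to-run variation in fine-tuning usually comes from random sources such as data shuffling and dropout. If you want runs to be more comparable, one common approach is to fix the random seeds before training. The helper below is a sketch: `set_seed` is a hypothetical name, this README does not say whether train.py exposes a seed, and fixing seeds does not guarantee identical results on a GPU.

```python
import random


def set_seed(seed=42):
    """Seed the random number generators that are available in this environment."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)           # seeds the CPU generator
        torch.cuda.manual_seed_all(seed)  # safe to call even when no GPU is present
    except ImportError:
        pass  # PyTorch is not installed; nothing to seed


set_seed(0)
first = random.random()
set_seed(0)
second = random.random()
print(first == second)
```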