Add How to Reproduce the Result in README #2

Open · wants to merge 6 commits into base: master
63 changes: 63 additions & 0 deletions Makefile
@@ -0,0 +1,63 @@
fork-setup:
	git remote add upstream https://github.com/indobenchmark/indonlu.git
	git remote -v
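
Once the upstream remote is set, syncing a fork with the base branch typically looks like this (standard git workflow, not part of this Makefile):

```
git fetch upstream
git merge upstream/master
```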

HYPERPARAMETER ?= default
EARLY_STOP ?= 15
BATCH_SIZE ?= 16

.PHONY: reproduce

reproduce:
	python3 scripts/reproducer.py $(DATASET) $(EARLY_STOP) $(BATCH_SIZE) $(HYPERPARAMETER)

reproduce_all:
	python3 scripts/reproducer.py absa-airy 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py absa-prosa 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py doc-sentiment-prosa 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py emotion-twitter 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py entailment-ui 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py keyword-extraction-prosa 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py qa-factoid-itb 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py ner-grit 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py ner-prosa 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py pos-idn 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py term-extraction-airy 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py pos-prosa 15 $(BATCH_SIZE) $(HYPERPARAMETER)

reproduce_all_1:
[Member comment]: We can remove the reproduce_all_* targets; they are already covered by reproduce and reproduce_all.

	python3 scripts/reproducer.py absa-airy 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py absa-prosa 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py doc-sentiment-prosa 15 $(BATCH_SIZE) $(HYPERPARAMETER)

reproduce_all_2:
	python3 scripts/reproducer.py emotion-twitter 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py entailment-ui 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py keyword-extraction-prosa 15 $(BATCH_SIZE) $(HYPERPARAMETER)

reproduce_all_3:
	python3 scripts/reproducer.py qa-factoid-itb 15 $(BATCH_SIZE) $(HYPERPARAMETER)

reproduce_all_4:
	python3 scripts/reproducer.py ner-grit 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py ner-prosa 15 $(BATCH_SIZE) $(HYPERPARAMETER)

reproduce_all_5:
	python3 scripts/reproducer.py pos-idn 15 $(BATCH_SIZE) $(HYPERPARAMETER)

reproduce_all_6:
	python3 scripts/reproducer.py term-extraction-airy 15 $(BATCH_SIZE) $(HYPERPARAMETER)
	python3 scripts/reproducer.py pos-prosa 15 $(BATCH_SIZE) $(HYPERPARAMETER)

run_non_pretrained_no_special_token:
	python3 scripts/reproducer_non_pretrained.py $(DATASET) $(EARLY_STOP) $(BATCH_SIZE)

run_non_pretrained_no_special_token_all:
[Member comment]: There are 8 tasks in here; can you please help add the other 4, similar to the list in reproduce_all?

	python3 scripts/reproducer_non_pretrained.py emotion-twitter 10 16
	python3 scripts/reproducer_non_pretrained.py pos-idn 10 16
	python3 scripts/reproducer_non_pretrained.py ner-grit 10 16
	python3 scripts/reproducer_non_pretrained.py absa-airy 10 16
	python3 scripts/reproducer_non_pretrained.py term-extraction-airy 10 16
	python3 scripts/reproducer_non_pretrained.py entailment-ui 10 16
	python3 scripts/reproducer_non_pretrained.py doc-sentiment-prosa 10 16
	python3 scripts/reproducer_non_pretrained.py keyword-extraction-prosa 10 16
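
Following the review comment above: comparing this list against `reproduce_all`, the four missing tasks appear to be `absa-prosa`, `qa-factoid-itb`, `ner-prosa`, and `pos-prosa`. A sketch of the additional recipe lines, assuming they reuse the same early-stop and batch-size values as the existing eight:

```
	python3 scripts/reproducer_non_pretrained.py absa-prosa 10 16
	python3 scripts/reproducer_non_pretrained.py qa-factoid-itb 10 16
	python3 scripts/reproducer_non_pretrained.py ner-prosa 10 16
	python3 scripts/reproducer_non_pretrained.py pos-prosa 10 16
```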
42 changes: 42 additions & 0 deletions README.md
@@ -53,3 +53,45 @@ We provide the access to our large pretraining dataset. In this version, we excl
## Leaderboard
- Community Portal and Public Leaderboard [[Link]](https://www.indobenchmark.com/leaderboard.html)
- Submission Portal https://competitions.codalab.org/competitions/26537

## Quick Start

### Predict
_TBD_

### Train
_TBD_

### Reproduce the Results

1. Set the `CUDA_VISIBLE_DEVICES` environment variable first
```
export CUDA_VISIBLE_DEVICES=0
```
2. Then, execute the following command to start training
```
make reproduce DATASET=<dataset>
```
It will train all of the models for the specified _\<dataset\>_ with the default parameters
3. Check the available datasets in the `datasets/` directory
4. All of the models used are listed in `scripts/config/model/train.yaml` \
Feel free to add entries or comment them out as you see fit
5. To use a different set of hyperparameters, create a new file in `scripts/config/hyperparameter/` \
Then specify it in the command like this
```
make reproduce DATASET=<dataset> HYPERPARAMETER=<hyperparameter_filename_without_the_extension>
```
6. There are 2 more parameters that can be specified in the command:
- EARLY_STOP
- BATCH_SIZE

Use the following command to set them (a worked example follows this list)
```
make reproduce DATASET=<dataset> EARLY_STOP=<early_stop> BATCH_SIZE=<batch_size>
```
7. There are also grouped commands that run specific subsets of the tasks for easy access, e.g.
```
make reproduce_all_1
make reproduce_all_2
etc
```
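
For example, to reproduce the results on the `emotion-twitter` dataset with a batch size of 32 and early stopping after 10 epochs (the values here are illustrative; any dataset from `datasets/` works):

```
export CUDA_VISIBLE_DEVICES=0
make reproduce DATASET=emotion-twitter BATCH_SIZE=32 EARLY_STOP=10
```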
File renamed without changes.
File renamed without changes.
11 changes: 11 additions & 0 deletions requirements.txt
@@ -0,0 +1,11 @@
PyYAML==5.3.1
numpy
pandas
torch
tqdm
transformers
nltk
scikit-learn
matplotlib
seaborn
ipywidgets
6 changes: 6 additions & 0 deletions scripts/config/hyperparameter/default.yaml
@@ -0,0 +1,6 @@
n_epochs: 100
step_size: 1
gamma: 0.9
lr: 1e-5
options:
- --force
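
A custom hyperparameter file, as described in step 5 of the README, follows this same schema. For instance, a hypothetical `scripts/config/hyperparameter/quick_test.yaml` (the filename and values are illustrative, not part of this PR) might look like:

```
n_epochs: 10
step_size: 1
gamma: 0.9
lr: 2e-5
options:
- --force
```

It would then be selected with `make reproduce DATASET=<dataset> HYPERPARAMETER=quick_test`.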
6 changes: 6 additions & 0 deletions scripts/config/hyperparameter/no_special_token_1.yaml
@@ -0,0 +1,6 @@
n_epochs: 100
step_size: 1
gamma: 0.5
lr: 6.25e-5
options:
- --no_special_token
6 changes: 6 additions & 0 deletions scripts/config/hyperparameter/no_special_token_2.yaml
@@ -0,0 +1,6 @@
n_epochs: 100
step_size: 1
gamma: 0.8
lr: 6.25e-5
options:
- --no_special_token
10 changes: 10 additions & 0 deletions scripts/config/model/non_pretrained.yaml
@@ -0,0 +1,10 @@
- model_checkpoint: scratch
hyperparameter_config: no_special_token_1.yaml
- model_checkpoint: word2vec
hyperparameter_config: no_special_token_1.yaml
- model_checkpoint: fasttext-twitter
hyperparameter_config: no_special_token_2.yaml
- model_checkpoint: fasttext-cc-id
hyperparameter_config: no_special_token_2.yaml
- model_checkpoint: fasttext-cc-id-no-oov
hyperparameter_config: no_special_token_2.yaml
146 changes: 146 additions & 0 deletions scripts/config/model/train.yaml
@@ -0,0 +1,146 @@
# list of used configuration
# model_checkpoint:
# lower:
# num_layers:

# # albert-base-uncased-96000
[Member comment]: This model can be removed.

# - model_checkpoint: albert-base-uncased-96000
# lower: True
# num_layers:
# - 12

# # albert-base-uncased-96000-spm
[Member comment]: This model can be removed.

# - model_checkpoint: albert-base-uncased-96000-spm
# lower: True
# num_layers:
# - 12

# # albert-base-uncased-112500-spm
[Member comment]: This model can be removed.

# - model_checkpoint: albert-base-uncased-112500-spm
# lower: True
# num_layers:
# - 12

# scratch
- model_checkpoint: scratch
[Member comment]: This model can be removed.
  lower: True
  num_layers:
  - 2
  - 4
  - 6

# fasttext-cc-id-300-no-oov-uncased
- model_checkpoint: fasttext-cc-id-300-no-oov-uncased
  lower: True
  num_layers:
  - 2
  - 4
  - 6

# fasttext-4B-id-300-no-oov-uncased
- model_checkpoint: fasttext-4B-id-300-no-oov-uncased
  lower: True
  num_layers:
  - 2
  - 4
  - 6

# babert-base-512
- model_checkpoint: babert-base-512
[Member comment]: This model can be removed.
  lower: True
  num_layers:
  - 12

# babert-bpe-mlm-large-512
- model_checkpoint: babert-bpe-mlm-large-512
[Member comment]: This model can be removed.
  lower: True
  num_layers:
  - 24

# mbert
- model_checkpoint: bert-base-multilingual-uncased
  lower: False
  num_layers:
  - 12

# xlm-roberta
- model_checkpoint: xlm-roberta-base
  lower: False
  num_layers:
  - 12

# babert-opensubtitle
- model_checkpoint: babert-opensubtitle
[Member comment]: This model can be removed.
  lower: False
  num_layers:
  - 12

# xlm
- model_checkpoint: xlm-mlm-100-1280
  lower: False
  num_layers:
  - 16

# albert-large-wwmlm-128
- model_checkpoint: albert-large-wwmlm-128
[Member comment]: This model can be removed.
  lower: True
  num_layers:
  - 24

# albert-base-wwmlm-512
- model_checkpoint: albert-base-wwmlm-512
[Member comment]: This model can be removed.
  lower: True
  num_layers:
  - 12

# albert-large-wwmlm-512
- model_checkpoint: albert-large-wwmlm-512
[Member comment]: This model can be removed.
  lower: True
  num_layers:
  - 24

# albert-base-uncased-112500
- model_checkpoint: albert-base-uncased-112500
[Member comment]: This model can be removed.
  lower: True
  num_layers:
  - 12

# albert-base-uncased-191k
- model_checkpoint: albert-base-uncased-191k
[Member comment]: This model can be removed.
  lower: True
  num_layers:
  - 12

# cartobert
- model_checkpoint: cartobert
[Member comment]: This model can be removed.
  lower: True
  num_layers:
  - 12

# babert-bpe-mlm-large-uncased
- model_checkpoint: babert-bpe-mlm-large-uncased
[Member comment]: This model can be removed.
  lower: True
  num_layers:
  - 24

# babert-bpe-mlm-large-uncased-1m
- model_checkpoint: babert-bpe-mlm-large-uncased-1m
[Member comment]: This model can be removed.
  lower: True
  num_layers:
  - 24

# babert-bpe-mlm-large-uncased-1100k
- model_checkpoint: babert-bpe-mlm-large-uncased-1100k
[Member comment]: This model can be removed.
  lower: True
  num_layers:
  - 24

# babert-bpe-mlm-uncased-128-dup10-5
- model_checkpoint: babert-bpe-mlm-uncased-128-dup10-5
[Member comment]: This model can be removed.
  lower: True
  num_layers:
  - 12
[Member comment]: Can you help add the 8 IndoBERT models to this file? The model checkpoints and num_layers would be as follows:

  • indobenchmark/indobert-base-p1 | 12 layers
  • indobenchmark/indobert-base-p2 | 12 layers
  • indobenchmark/indobert-large-p1 | 24 layers
  • indobenchmark/indobert-large-p2 | 24 layers
  • indobenchmark/indobert-lite-base-p1 | 12 layers
  • indobenchmark/indobert-lite-base-p2 | 12 layers
  • indobenchmark/indobert-lite-large-p1 | 24 layers
  • indobenchmark/indobert-lite-large-p2 | 24 layers
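
A sketch of the requested entries in the existing format; the checkpoints and layer counts come from the comment above, while the `lower` values are an assumption (the comment does not specify them):

```
# indobert
- model_checkpoint: indobenchmark/indobert-base-p1
  lower: False
  num_layers:
  - 12
- model_checkpoint: indobenchmark/indobert-base-p2
  lower: False
  num_layers:
  - 12
- model_checkpoint: indobenchmark/indobert-large-p1
  lower: False
  num_layers:
  - 24
- model_checkpoint: indobenchmark/indobert-large-p2
  lower: False
  num_layers:
  - 24

# indobert-lite
- model_checkpoint: indobenchmark/indobert-lite-base-p1
  lower: False
  num_layers:
  - 12
- model_checkpoint: indobenchmark/indobert-lite-base-p2
  lower: False
  num_layers:
  - 12
- model_checkpoint: indobenchmark/indobert-lite-large-p1
  lower: False
  num_layers:
  - 24
- model_checkpoint: indobenchmark/indobert-lite-large-p2
  lower: False
  num_layers:
  - 24
```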

61 changes: 61 additions & 0 deletions scripts/reproducer.py
@@ -0,0 +1,61 @@
import os
import sys
# import subprocess
import yaml

# GPU selection is forwarded to each training run
CUDA = os.getenv("CUDA_VISIBLE_DEVICES", "0")

# load the list of model configurations to reproduce
path = "scripts/config/model/train.yaml"
with open(path, "r") as f:
    model_configs = yaml.safe_load(f)

# load the hyperparameter config named on the command line:
# argv = [script, dataset, early_stop, batch_size, hyperparameter]
hyperparams_config = sys.argv[4]
path = f"scripts/config/hyperparameter/{hyperparams_config}.yaml"
with open(path, "r") as f:
    hyperparams = yaml.safe_load(f)
hyperparams["dataset"] = sys.argv[1]
hyperparams["early_stop"] = sys.argv[2]
hyperparams["train_batch_size"] = sys.argv[3]

# hyperparameters forwarded to main.py as command-line flags
hyp_list = [
    "n_epochs",
    "train_batch_size",
    "model_checkpoint",
    "step_size",
    "gamma",
    "experiment_name",
    "lr",
    "early_stop",
    "dataset",
]

# run every model at every configured depth
for m in model_configs:
    hyperparams["model_checkpoint"] = m["model_checkpoint"]
    for layer in m["num_layers"]:
        exp = [
            hyperparams["model_checkpoint"],
            f"b{hyperparams['train_batch_size']}",
            f"step{hyperparams['step_size']}",
            f"gamma{hyperparams['gamma']}",
            f"lr{hyperparams['lr']}",
            f"early{hyperparams['early_stop']}",
            f"layer{layer}",
            f"lower{m['lower']}",
        ]
        hyperparams["experiment_name"] = "_".join(exp)

        cmd = f"CUDA_VISIBLE_DEVICES={CUDA} python3 main.py"
        for hl in hyp_list:
            cmd += f" --{hl} {hyperparams[hl]}"
        if m["lower"]:
            cmd += " --lower"
        cmd += f" --num_layers {layer}"
        for o in hyperparams["options"]:
            cmd += f" {o}"

        print(f"Running: {cmd}")

        os.system(cmd)

        # # to run in parallel instead, comment out the os.system call above
        # # and uncomment the subprocess import at the top
        # results = subprocess.run(
        #     cmd, shell=True, universal_newlines=True, check=True, text=True)
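
For reference, the Makefile's `reproduce` target invokes this script with positional arguments in the order dataset, early stop, batch size, hyperparameter config name; with the defaults substituted, a direct call looks like:

```
python3 scripts/reproducer.py emotion-twitter 15 16 default
```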