DA

Note: to use multidomain kmeans models you need scikit-learn==0.22.2.post1
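
A quick way to confirm the pin before loading the k-means models is to compare the installed version against the one named in the note. The sketch below uses only the standard library; the helper names `installed_version` and `matches_pin` are illustrative and not part of this repo.

```python
from importlib import metadata

# Version pinned in the note above.
REQUIRED_SKLEARN = "0.22.2.post1"


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, required=REQUIRED_SKLEARN):
    """True only when the installed version string equals the pinned one exactly."""
    return installed == required


if __name__ == "__main__":
    found = installed_version("scikit-learn")
    if matches_pin(found):
        print("scikit-learn version OK:", found)
    else:
        print("expected scikit-learn==%s, found %s" % (REQUIRED_SKLEARN, found))
```

If the check fails, `pip install scikit-learn==0.22.2.post1` inside the `da` environment should bring the version in line.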

Directory structure:

* fairseq (cloned fairseq repo)
* data-prep (copy from `/gpfs/hpc/projects/nlpgroup/bergamot/data-prep`)
* experiments (for model checkpoints and TensorBoard logs; subfolders include "concat", "finetuned_europarl", "domain_control", etc.)
* scripts (commands for running the different DA scenarios and data prep)

Setup

conda create -n da python=3.8
conda activate da
conda install pytorch torchvision torchaudio cudatoolkit=10.1 -c pytorch
pip install scipy numpy pandas transformers==4.0 sentencepiece tensorboardX

git clone https://github.com/maksym-del/da
cd da

git clone https://github.com/pytorch/fairseq
cd fairseq
# fairseq commit: 1a709b2a401ac8bd6d805c8a6a5f4d7f03b923ff
# git reset --hard 1a709b2a401ac8bd6d805c8a6a5f4d7f03b923ff
pip install --editable ./

pip install tensorboard

cd ..
cp -r /gpfs/hpc/projects/nlpgroup/bergamot/data-prep .

Train

bash scripts/SCRIPTNAME.sh
or, to submit it as a Slurm job:
sbatch scripts/SCRIPTNAME.slurm

Using clustering scripts:

0) Before running the scripts below, change the data and model paths in them (look for the "CHANGE THIS LINE" comments).

1) Convert the trained fairseq Transformer to Hugging Face format:
python scripts-clustering/convert_chkp_fseq_to_hf.py de-en

2) Extract NMT sentence and document representations:
# sent
python scripts-clustering/extract_reps.py nmt sent test
python scripts-clustering/extract_reps.py nmt sent dev
python scripts-clustering/extract_reps.py nmt sent train

# doc
python scripts-clustering/extract_reps.py nmt doc test
python scripts-clustering/extract_reps.py nmt doc dev
python scripts-clustering/extract_reps.py nmt doc train

3) Get clusters:
# sent
python scripts-clustering/kmeans_train.py nmt sent 8

python scripts-clustering/kmeans_predict.py nmt sent 8 test
python scripts-clustering/kmeans_predict.py nmt sent 8 dev
python scripts-clustering/kmeans_predict.py nmt sent 8 train

# doc
python scripts-clustering/kmeans_train.py nmt doc 8

python scripts-clustering/kmeans_predict.py nmt doc 8 test
python scripts-clustering/kmeans_predict.py nmt doc 8 dev
python scripts-clustering/kmeans_predict.py nmt doc 8 train

Extract XLM-R sentence and document representations:

Use the same commands as above, replacing "nmt" with "bert".
For example:
python scripts-clustering/extract_reps.py bert sent test

Fine-tune

Now that the data is separated by cluster (domain), fine-tune the NMT baseline trained earlier on each cluster and evaluate the results.
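
To illustrate what cluster-separated data means here, the sketch below groups a parallel corpus by predicted cluster id so that each group can serve as one fine-tuning set. It assumes one cluster label per sentence pair, aligned line by line with the source and target files; the actual output format of `kmeans_predict.py` is not documented here, so treat that alignment and the helper name `split_by_cluster` as assumptions.

```python
from collections import defaultdict


def split_by_cluster(src_lines, tgt_lines, labels):
    """Group aligned (source, target) pairs by their cluster label.

    Assumes labels[i] is the cluster id predicted for sentence pair i.
    """
    if not (len(src_lines) == len(tgt_lines) == len(labels)):
        raise ValueError("source, target and label lists must have the same length")
    groups = defaultdict(list)
    for src, tgt, label in zip(src_lines, tgt_lines, labels):
        groups[int(label)].append((src, tgt))
    return dict(groups)


if __name__ == "__main__":
    # Toy data, only to show the shape of the result.
    src = ["Guten Morgen", "Das Parlament tagt", "Hallo"]
    tgt = ["Good morning", "The parliament is in session", "Hello"]
    labels = [0, 1, 0]
    for cluster_id, pairs in sorted(split_by_cluster(src, tgt, labels).items()):
        print(cluster_id, len(pairs))
```

Each group can then be written to its own pair of files and preprocessed for fairseq in the same way as the original training data.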
