Mixture of Attention Heads

This repository contains the code used for the WMT14 translation experiments in the paper Mixture of Attention Heads: Selecting Attention Heads Per Token.

Software Requirements

Python 3, fairseq and PyTorch are required for the current codebase.
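As a quick sanity check before running the scripts, the requirements above can be verified with a small script. This is a minimal sketch and not part of the repository: the helper names `missing_requirements` and `python_ok` are hypothetical, and it assumes the packages are importable under their usual names `torch` and `fairseq`. It does not check versions.

```python
import importlib.util
import sys

# Import names assumed for the requirements; "torch" is the PyTorch package.
REQUIRED = ("torch", "fairseq")


def missing_requirements(modules=REQUIRED):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def python_ok(version_info=sys.version_info):
    """Return True when the interpreter's major version is at least 3."""
    return version_info[0] >= 3


if __name__ == "__main__":
    if not python_ok():
        print("Python 3 is required.")
    missing = missing_requirements()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All required packages were found.")
```

An empty list from `missing_requirements()` only means the packages can be located; it says nothing about whether the installed versions are compatible with this codebase.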

- Install PyTorch and fairseq.
- Generate the WMT14 translation dataset with Transformer Clinic.

Scripts and commands

- Train the translation model: sh run.sh /path/to/your/data
- Test a trained checkpoint: sh test.sh /path/to/checkpoint

With the default settings, MoA achieves a BLEU score of approximately 28.4 on the WMT14 EN-DE test set.