- Report: CUED_SPEECH AT TREC 2020 PODCAST SUMMARISATION TRACK
- This is a two-stage approach because the BART model can only take up to 1,024 input words:
- Stage 1: Truncation or Filtering (by the Hierarchical Model)
- Stage 2: Running BART
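The two stages above can be sketched as follows. This is only an illustration of the truncation variant of stage 1; the function name and the 1,024-word budget interpretation are assumptions, and stage 2 is shown as comments rather than runnable model code since it requires the trained checkpoints.

```python
# Minimal sketch of the two-stage pipeline (stage 1 = truncation baseline).
# `truncate_transcript` is a hypothetical helper, not a function from this repo.

def truncate_transcript(transcript: str, budget: int = 1024) -> str:
    """Stage 1 (truncation variant): keep only the first `budget` words."""
    words = transcript.split()
    return " ".join(words[:budget])

# Stage 2 would then run BART on the truncated input, e.g. with
# HuggingFace transformers (as used by the training/decoding scripts):
#   inputs = tokenizer([short_input], return_tensors="pt")
#   summary_ids = model.generate(inputs["input_ids"])
```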
- Scripts contain absolute paths (to models/data), typically as variables in CAPITALS => change them to your own paths.
- Feel free to email [email protected] if you think this repository might be useful for your work but it isn't working for you.
- python 3.7
- torch 1.2.0
- transformers (HuggingFace) 2.11.0
(I expect it to work with newer versions too, but this is not guaranteed and has not been tested.)
Spotify Podcast: Download & Pre-processing
- Download link: https://podcastsdataset.byspotify.com/
- data/processor.py: splits the data into chunks such that each chunk contains 10k instances, e.g. id0-id9999 in podcast_set0 (you could use your own data-processing pipeline!). Note that to use our trained weights, the data must be processed with the BART tokenizer.
- data/processor_testset.py: pre-processes the test data.
- data/loader.py: contains functions (mostly hard-coded) for data loading, batching, etc.
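The 10k-instance chunking scheme described above can be sketched as below. The file-naming helper and the flat list of instances are assumptions for illustration, not the repo's actual data format.

```python
# Hedged sketch of the chunking scheme: each chunk holds 10k instances,
# e.g. ids 0-9999 map to podcast_set0. Names are illustrative only.

def chunk_name(instance_id: int, chunk_size: int = 10_000) -> str:
    """Return the chunk an instance id belongs to, e.g. 9999 -> podcast_set0."""
    return f"podcast_set{instance_id // chunk_size}"

def split_into_chunks(instances, chunk_size: int = 10_000):
    """Group a list of instances into consecutive chunks of `chunk_size`."""
    return [instances[i:i + chunk_size]
            for i in range(0, len(instances), chunk_size)]
```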
*Currently, all the configurations must be set in the training script!!
Standard Fine-tuning:
python train_bartvanilla.py
L_rl training:
python train_bartvanilla_rl.py
*Again, all configurations must be set in the script (note that the current settings are the ones used at TREC 2020).
decoding:
python decode_testset.py start_id end_id
- The test set consists of 1,027 samples. See data/processor_testset.py for how to prepare the test data. If decoding on a single machine, you can disable ID randomization.
- To use sentence filtering at test time, you first need to decode the test data with the hierarchical model, as this is a two-stage process. Refer to the Decode Hierarchical Model section.
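Since decoding takes a start_id and end_id, the 1,027-sample test set can be split across machines. The helper below is only an illustration; it assumes a half-open [start_id, end_id) convention, so check how decode_testset.py actually interprets its arguments before relying on it.

```python
# Illustrative sharding of the test set for parallel decoding.
# Assumes decode_testset.py treats its arguments as [start_id, end_id).

def shard_ranges(num_samples: int, num_workers: int):
    """Return (start_id, end_id) pairs covering ids 0..num_samples."""
    per = -(-num_samples // num_workers)  # ceiling division
    return [(i, min(i + per, num_samples))
            for i in range(0, num_samples, per)]

# e.g. shard_ranges(1027, 4) gives four ranges, one per machine:
#   python decode_testset.py 0 257
#   python decode_testset.py 257 514
#   ... and so on.
```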
ensemble decoding ("token-level combination / product-of-expectations"):
python ensemble_decode_testset.py start_id end_id
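The token-level combination idea can be sketched as below: at each decoding step, the ensemble score of a token is taken as the (geometric-mean) product of the individual models' probabilities, renormalised over the vocabulary. This is only a sketch of the general technique; ensemble_decode_testset.py may differ in its exact combination rule.

```python
# Sketch of token-level ensemble combination over per-model distributions.
# Each element of `dists` is a token -> probability dict over the same vocab.

import math

def combine_token_distributions(dists):
    """Geometric-mean product of per-model token probabilities, renormalised."""
    vocab = dists[0].keys()
    logp = {t: sum(math.log(d[t]) for d in dists) / len(dists) for t in vocab}
    z = sum(math.exp(v) for v in logp.values())  # normalisation constant
    return {t: math.exp(v) / z for t, v in logp.items()}
```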
For performing sentence filtering, i.e. content selection:
python train_hiermodel.py
Configurations are set inside this script.
This is the first stage before BART decoding, unless you use truncation at test time:
python hier_filtering_testset.py start_id end_id
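Conceptually, the filtering stage selects sentences to fit BART's input budget. The sketch below is a generic illustration of that idea, with the per-sentence scores standing in for the hierarchical model's outputs; the actual selection logic in hier_filtering_testset.py may differ.

```python
# Hedged sketch of stage-1 sentence filtering (content selection): keep the
# highest-scoring sentences, in original document order, within a word budget.
# `scores` here is a placeholder for the hierarchical model's sentence scores.

def filter_sentences(sentences, scores, budget: int = 1024):
    """sentences: list of str; scores: parallel list of floats."""
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    kept, used = set(), 0
    for i in ranked:
        n = len(sentences[i].split())
        if used + n > budget:
            continue  # sentence would exceed the word budget
        kept.add(i)
        used += n
    return [sentences[i] for i in sorted(kept)]  # restore document order
```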
If you want to train BART with filtered data:
python hier_filtering_trainset.py start_id end_id
Download links (~8GB) - the same file in each; pick whichever option works for you (if none of the links work, please email me at [email protected]):
[Google Drive] [Cambridge OneDrive]
BART
- cued_speech_bart_baseline.pt (run_id: cued_speechUniv3)
- cued_speech_bart_filtered.pt (run_id: cued_speechUniv4)
- cued_speech_ensemble3_x.pt (run_id: cued_speechUniv2)
- cued_speech_ensemble3_1.pt
- cued_speech_ensemble3_2.pt
- cued_speech_ensemble3_3.pt
Hierarchical Model
- HIERMODEL_640_50_step30000.pt (trained with max num sentences = 640 and max num words per sentence = 50, for 30k steps)