
CUED_SPEECH @ TREC2020 Podcast Summarisation Track

Overview

  • Report: CUED_SPEECH AT TREC 2020 PODCAST SUMMARISATION TRACK
  • The system is a two-stage pipeline, since the BART model can only take up to 1,024 input tokens:
    • Stage 1: Truncation or Filtering (by the Hierarchical Model)
    • Stage 2: Running BART
  • Scripts contain absolute paths (to models/data), typically as variables in CAPITALS; change these to your own paths.
  • Feel free to email [email protected] if you think this repository might be useful for your work but cannot get it working.

Requirements

  • python 3.7
  • torch 1.2.0
  • transformers (HuggingFace) 2.11.0

(I expect it to work with newer versions too, but this is neither guaranteed nor tested.)
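For example, a matching environment can be set up with (assuming pip; the exact CUDA build of torch may vary for your machine):

```
pip install torch==1.2.0 transformers==2.11.0
```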

Data Preparation

Spotify Podcast: Download & Pre-processing

  • Download link: https://podcastsdataset.byspotify.com/
  • data/processor.py: splits the data into chunks of 10k instances each, e.g. id0-id9999 in podcast_set0 (you could use your own data-processing pipeline instead). Note that we process the data with the BART tokenizer, so to use our trained weights you should do the same. A minimal sketch of this chunking appears after this list.
  • data/processor_testset.py: for pre-processing the test data
  • data/loader.py: contains functions (mostly hard-coded) for data loading, batching, etc.
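The sketch below illustrates the chunking idea, assuming instances are transcript strings; function names and the bart-large base tokenizer are assumptions, not the repository's exact code:

```python
# Sketch only (not the repository's exact code): split the corpus into
# 10k-instance chunks, e.g. id0-id9999 -> podcast_set0, tokenized with BART.
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

def make_chunks(instances, chunk_size=10_000):
    """Yield (set_index, chunk) pairs over the list of transcript strings."""
    for k in range(0, len(instances), chunk_size):
        yield k // chunk_size, instances[k:k + chunk_size]

def process(instances):
    for set_id, chunk in make_chunks(instances):
        token_ids = [tokenizer.encode(text) for text in chunk]
        # write token_ids to disk as podcast_set{set_id} (storage format is up to you)
```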

Train & Fine-tune BART

*Currently, all configurations must be set inside the training script (a sketch of typical settings follows the commands below).

Standard Fine-tuning:

python train_bartvanilla.py

Training with the L_rl (reinforcement learning) objective:

python train_bartvanilla_rl.py
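The settings to edit are the CAPITALISED constants near the top of each script. The example below is purely illustrative; the actual variable names in train_bartvanilla.py may differ:

```python
# Hypothetical example of the constants to adapt to your own setup.
MODEL_SAVE_DIR = "/home/you/models/bart_podcast"  # where checkpoints are written
DATA_DIR       = "/home/you/data/podcast"         # chunks from data/processor.py
BATCH_SIZE     = 4
LEARNING_RATE  = 2e-5
MAX_INPUT_LEN  = 1024                             # BART's input limit
```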

Decode (Inference) BART

*Again, all configurations must be set in the script (note that the current settings are the ones used at TREC 2020).

decoding:

python decode_testset.py start_id end_id
  • The test set consists of 1,027 samples. See data/processor_testset.py for how to prepare the test data. If decoding on a single machine, you can disable ID randomization. A sketch of how start_id/end_id shard the test set follows this list.
  • To use sentence filtering at test time, you need to decode the test data with the hierarchical model first, since it is a two-stage process; refer to the Decode Hierarchical Model section.
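The start_id/end_id arguments let you split the 1,027 test samples across machines. Roughly (a sketch, not the script's actual code):

```python
# Sketch of the start_id/end_id sharding (hypothetical; see decode_testset.py).
import sys

start_id, end_id = int(sys.argv[1]), int(sys.argv[2])
for idx in range(start_id, end_id):
    # load test instance idx, run BART decoding, write out the summary
    ...
```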

ensemble decoding ("token-level combination / product-of-expectations"):

python ensemble_decode_testset.py start_id end_id
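In token-level combination (product-of-expectations), the members' next-token distributions are averaged at every decoding step, so the sequence score is the product over steps of these per-step expectations. A minimal sketch, assuming transformers-style BART models (the function name is mine, not the repository's):

```python
import torch

def ensemble_next_token_probs(models, input_ids, decoder_input_ids):
    """Average the next-token distributions of the ensemble members for one
    decoding step; beam search then multiplies these across steps."""
    probs = []
    for model in models:
        outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
        logits = outputs[0][:, -1, :]  # LM logits are the first output in transformers 2.x
        probs.append(torch.softmax(logits, dim=-1))
    return torch.stack(probs, dim=0).mean(dim=0)  # expectation over models
```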

Train Hierarchical Model

This trains the model used for sentence filtering, i.e. content selection:

python train_hiermodel.py

Configurations are set inside this script.

Decode Hierarchical Model

This is the first stage before BART decoding (unless you use truncation at test time):

python hier_filtering_testset.py start_id end_id

If you want to train BART with filtered data:

python hier_filtering_trainset.py start_id end_id
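Conceptually, filtering scores every transcript sentence with the hierarchical model and keeps the top-scoring ones (restored to transcript order) until BART's input budget is filled. A rough sketch with hypothetical names:

```python
def filter_transcript(sentences, scores, max_tokens=1024):
    """Keep the highest-scoring sentences, in original order, within the budget.
    Sketch only: the real scripts use the hierarchical model's scores and the
    BART tokenizer's token counts."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    kept, budget = [], max_tokens
    for i in ranked:
        cost = len(sentences[i].split())  # crude word count in place of BART tokens
        if cost <= budget:
            kept.append(i)
            budget -= cost
    return [sentences[i] for i in sorted(kept)]
```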

Trained Weights

Download links (~8 GB); both point to the same file, so use whichever works for you (if neither link works, please email me [email protected]):

[Google Drive] [Cambridge OneDrive]

BART

  • cued_speech_bart_baseline.pt (run_id: cued_speechUniv3)
  • cued_speech_bart_filtered.pt (run_id: cued_speechUniv4)
  • cued_speech_ensemble3_x.pt (run_id: cued_speechUniv2)
    • cued_speech_ensemble3_1.pt
    • cued_speech_ensemble3_2.pt
    • cued_speech_ensemble3_3.pt

Hierarchical Model

  • HIERMODEL_640_50_step30000.pt (trained with max number of sentences = 640 and max words per sentence = 50, for 30k steps)
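A minimal sketch of loading one of the BART checkpoints. The base model and the state-dict layout inside the .pt files are assumptions; inspect the file if loading fails:

```python
import torch
from transformers import BartForConditionalGeneration

# Assumption: the checkpoints are fine-tuned from a bart-large variant and
# store a plain state dict (possibly wrapped, e.g. under a "model" key).
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
state = torch.load("cued_speech_bart_baseline.pt", map_location="cpu")
model.load_state_dict(state.get("model", state))
```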