Repository for the Paper Summarizer project for the course 'Machine Learning' @ Université Paris-Saclay, CentraleSupélec. BDMA, Fall 2023.
Authors:
The datasets can be found at:
- Arxiv-Summarization dataset: download from Arxiv-Summarization at HuggingFace
- DialogSum dataset: download from DialogSum at HuggingFace
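If you prefer to fetch them programmatically, the Hugging Face `datasets` library can download both. A minimal sketch, assuming the Hub identifiers `ccdv/arxiv-summarization` and `knkarthick/dialogsum` (check the dataset pages for the exact IDs):

```python
# Minimal sketch: fetch both datasets from the Hugging Face Hub.
# The Hub identifiers below are assumptions; verify them on the dataset pages.
from datasets import load_dataset

arxiv = load_dataset("ccdv/arxiv-summarization")
dialogsum = load_dataset("knkarthick/dialogsum")

print(arxiv)      # inspect the available splits (train/validation/test)
print(dialogsum)
```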
Install the dependencies from the `requirements.txt` file. We recommend using a virtual environment and Python 3.9.18.
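For example, to set up a virtual environment and install the dependencies on Linux or macOS:

```bash
python3.9 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```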
To preprocess the datasets, use the `source/preprocessing/preprocess.py` script. You can run it with the following command (an example invocation is shown after the parameter list below):
python source/preprocessing/preprocess.py --zip_dir <zip_location> --zip_to_stories [True/False] --toChunk [True/False] --ch_sum_sent <n_sents> --stories_dir <stories_location> --json_dir <json_location> --max_sentences <max_sents> --min_sentence_length <min_length> --model_name <model_name> --output_dir <output_dir> --preprocess [True/False]
Where:
- `zip_dir` is the location of the zip file containing the dataset
- `zip_to_stories` is a boolean indicating whether to convert the zip files to stories
- `toChunk` is a boolean indicating whether to chunk stories longer than 512 tokens
- `ch_sum_sent` is the number of sentences used to summarize each chunk
- `stories_dir` is the location of the stories folder
- `json_dir` is the location of the JSON file in which to save the preprocessed data
- `max_sentences` is the maximum number of sentences to keep in the whole summary
- `min_sentence_length` is the minimum number of tokens a sentence must contain to be kept in the summary
- `model_name` is the name of the model used to tokenize the sentences
- `output_dir` is the location of the output folder in which the summaries are saved
- `preprocess` is a boolean indicating whether to preprocess the data
If you use DialogSum, you might need to use the alternative script `source/preprocessing/preprocess_dialogue.py` instead, which works similarly.
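For example, a call to `preprocess.py` might look like the following; all paths and values here are illustrative placeholders, not defaults shipped with the repository:

```bash
python source/preprocessing/preprocess.py \
    --zip_dir data/arxiv.zip \
    --zip_to_stories True \
    --toChunk True \
    --ch_sum_sent 3 \
    --stories_dir data/stories \
    --json_dir data/preprocessed.json \
    --max_sentences 10 \
    --min_sentence_length 5 \
    --model_name bert-base-uncased \
    --output_dir data/summaries \
    --preprocess True
```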
To train the model, use the `source/train.py` script. You can run it with the following command (see the example invocation after the parameter list):
python source/train.py --train_loc <train_dataset_loc> --valid_loc <validation_dataset_loc> --model_loc <path_to_model> --output_dir <output_dir> --model_type <model_type> --verbose [True/False] --batch_size <bsize> --train_size <tsize> --valid_size <vsize> --epochs <n_epochs>
Where:
- `train_loc` is the location of the training dataset
- `valid_loc` is the location of the validation dataset
- `model_loc` is the path to the model to use
- `output_dir` is the location of the output folder in which the model and the logs are saved
- `model_type` is the type of model to use, either 'linear' or 'transformer'
- `verbose` is a boolean indicating whether to print the logs
- `batch_size` is the batch size to use
- `train_size` is the number of training samples to use
- `valid_size` is the number of validation samples to use
- `epochs` is the number of epochs to train the model
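For instance, a training run could look like this; again, the paths, sizes, and model name are illustrative placeholders:

```bash
python source/train.py \
    --train_loc data/train.json \
    --valid_loc data/valid.json \
    --model_loc bert-base-uncased \
    --output_dir output/ \
    --model_type linear \
    --verbose True \
    --batch_size 16 \
    --train_size 10000 \
    --valid_size 1000 \
    --epochs 3
```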
You can use the script `experiment.sh` to run an experiment. To run several experiments, use the script `run_experiments.sh`, where you can add all the parameter combinations you want to test.
You can also produce summaries with the baseline model, both with the sum_of_sums strategy and by summarizing the full text at once, by running: `python bert_summarize.py`