Summarization is the task of producing a shorter version of one or several documents that preserves most of the
input's meaning.
Warning: Evaluation Metrics
For summarization, automatic metrics such as ROUGE and METEOR have serious limitations:
They only assess content selection and do not account for other quality aspects, such as fluency, grammaticality, coherence, etc.
To assess content selection, they rely mostly on lexical overlap, although an abstractive summary could express the same content as a reference without any lexical overlap.
Given the subjectiveness of summarization and the correspondingly low agreement between annotators, the metrics were designed to be used with multiple reference summaries per input. However, recent datasets such as CNN/DailyMail and Gigaword provide only a single reference.
Therefore, tracking progress and claiming state-of-the-art based only on these metrics is questionable. Most papers carry out additional manual comparisons of alternative summaries. Unfortunately, such experiments are difficult to compare across papers. If you have an idea on how to do that, feel free to contribute.
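The lexical-overlap limitation described above can be illustrated with a small sketch. This is a simplified ROUGE-N F1 written for illustration only: it assumes lowercased whitespace tokenization and a single reference, with no stemming or stopword handling, so its scores will not match those of the official ROUGE toolkit. The function names and the example sentences are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=1):
    """Simplified ROUGE-N F1 (assumption: whitespace tokens, one reference)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat"
paraphrase = "a feline rested upon a rug"  # same meaning, no shared words

print(rouge_n_f1(reference, reference))   # identical text scores 1.0
print(rouge_n_f1(paraphrase, reference))  # the paraphrase scores 0.0
```

A summary that copies the reference gets a perfect score, while a faithful paraphrase with no shared words gets zero, which is why overlap-based scores alone say little about abstractive systems.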
CNN / Daily Mail
The CNN / Daily Mail dataset as processed by
Nallapati et al. (2016) has been used
for evaluating summarization. The dataset contains online news articles (781 tokens
on average) paired with multi-sentence summaries (3.75 sentences or 56 tokens on average).
The processed version contains 287,226 training pairs, 13,368 validation pairs and 11,490 test pairs.
Models are evaluated with full-length F1-scores of ROUGE-1, ROUGE-2, ROUGE-L, and METEOR (optional).
Anonymized version
The following models have been evaluated on the entity-anonymized version of the dataset introduced by Nallapati et al. (2016).
The Gigaword summarization dataset was first used by Rush et al. (2015) and represents a sentence summarization / headline generation task with very short input documents (31.4 tokens on average) and summaries (8.3 tokens on average). It contains 3.8M training, 189k development and 1,951 test instances. Models are evaluated with ROUGE-1, ROUGE-2 and ROUGE-L using full-length F1-scores.
Similar to Gigaword, task 1 of DUC 2004 is a sentence summarization task. The dataset contains 500 documents averaging 35.6 tokens, with summaries averaging 10.4 tokens. Because the dataset is so small, neural models are typically trained on other datasets and only tested on DUC 2004. Evaluation metrics are ROUGE-1, ROUGE-2 and ROUGE-L recall @ 75 bytes.
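As a rough sketch of what "recall @ 75 bytes" means, the generated summary is cut to its first 75 bytes before recall against the reference is computed, so longer outputs gain nothing from the extra text. The code below is an illustrative approximation under stated assumptions (UTF-8 byte truncation, lowercased whitespace tokens, unigram recall against a single reference); it is not the official ROUGE implementation, and the helper names are hypothetical.

```python
from collections import Counter


def truncate_bytes(text, limit=75):
    """Keep the first `limit` UTF-8 bytes, dropping any cut-off character."""
    return text.encode("utf-8")[:limit].decode("utf-8", errors="ignore")


def unigram_recall(candidate, reference):
    """Fraction of reference tokens that also appear in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    return sum((cand & ref).values()) / sum(ref.values())


def recall_at_75_bytes(candidate, reference):
    """Unigram recall computed after truncating the candidate to 75 bytes."""
    return unigram_recall(truncate_bytes(candidate, 75), reference)


# Hypothetical example: the only matching word sits beyond the 75-byte limit.
candidate = "a " * 40 + "target"
print(unigram_recall(candidate, "target"))
print(recall_at_75_bytes(candidate, "target"))
```

Note that byte truncation can cut a word in half; this sketch simply treats the leftover fragment as its own token.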
Sentence compression produces a shorter sentence by removing redundant information,
while preserving the grammaticality and the important content of the original sentence.
Sentence: Floyd Mayweather is open to fighting Amir Khan in the future, despite snubbing the Bolton-born boxer in favour of a May bout with Argentine Marcos Maidana, according to promoters Golden Boy
Compression: Floyd Mayweather is open to fighting Amir Khan in the future.
In short, this is a deletion-based task where the compression is a subsequence of the original sentence. From the 10,000 pairs of the eval portion (repository), the first 1,000 sentences are used for automatic evaluation, and the 200,000 pairs are used for training.
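The subsequence property can be checked directly. The sketch below assumes a deliberately crude tokenizer that keeps only word characters (so punctuation is ignored and "Bolton-born" becomes two tokens); the helper names are hypothetical and not part of the dataset's tooling.

```python
import re


def words(text):
    """Crude tokenizer (assumption): runs of word characters, punctuation dropped."""
    return re.findall(r"\w+", text)


def is_subsequence(short, long):
    """True if `short` can be obtained from `long` by deleting tokens only."""
    remaining = iter(long)
    # `tok in remaining` consumes the iterator, so token order is enforced.
    return all(tok in remaining for tok in short)


sentence = (
    "Floyd Mayweather is open to fighting Amir Khan in the future, despite "
    "snubbing the Bolton-born boxer in favour of a May bout with Argentine "
    "Marcos Maidana, according to promoters Golden Boy"
)
compression = "Floyd Mayweather is open to fighting Amir Khan in the future."

print(is_subsequence(words(compression), words(sentence)))  # True
```

A reordered or reworded output fails the check, which is what distinguishes deletion-based compression from abstractive rewriting.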
Models are evaluated using the following metrics:
F1 - computed from the recall and precision of the tokens kept in the gold and the generated compressions.
Compression rate (CR) - the length of the compression in characters divided by the length of the original sentence.
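Both metrics can be sketched in a few lines. The version below assumes each compression is represented by the positions of the tokens it keeps from the original sentence, which is one reasonable reading of "tokens kept" for a deletion-based task; actual evaluation scripts may differ in tokenization and averaging. The function names and the index example are hypothetical.

```python
def token_f1(gold_kept, pred_kept):
    """F1 over the token positions kept by the gold and generated compressions."""
    gold, pred = set(gold_kept), set(pred_kept)
    true_pos = len(gold & pred)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def compression_rate(sentence, compression):
    """Compression length in characters divided by the sentence length."""
    return len(compression) / len(sentence)


# Hypothetical example: gold keeps tokens 0-3, the model keeps only tokens 0-1.
print(token_f1([0, 1, 2, 3], [0, 1]))            # precision 1.0, recall 0.5
print(compression_rate("abcdefghij", "abcde"))   # 0.5
```

A lower compression rate means a more aggressive compression, so CR is typically read alongside F1 rather than on its own.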