This GitHub repository contains the source code of the AutoACU package for automatic summarization evaluation.
AutoACU contains two types of automatic evaluation metrics:
- A2CU: a two-step automatic evaluation metric that first extracts atomic content units (ACUs) from one text sequence and then evaluates the extracted ACUs against another text sequence.
- A3CU: an accelerated version of A2CU that directly computes the similarity between two text sequences without extracting ACUs, while targeting the same evaluation objective.
You can install AutoACU using pip:
```
pip install autoacu
```
or clone the repository and install it manually:
```
git clone https://github.com/Yale-LILY/AutoACU
cd AutoACU
pip install .
```
The necessary dependencies include PyTorch and Hugging Face's Transformers. Because the T5 tokenizer from Transformers is used, you will also need the SentencePiece package. AutoACU should be compatible with any recent version of PyTorch and Transformers. However, to make sure that the dependencies are compatible, you may run the following command:
```
pip install autoacu[stable]
```
You may also use the metrics directly without installing the package by importing the metric classes in `autoacu/a2cu.py` and `autoacu/a3cu.py`.
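If you take this route, the root directory of the clone needs to be on `sys.path` before importing. A minimal sketch; the `"AutoACU"` path is an assumption about where you cloned the repository:

```python
import sys

# Hypothetical local path to the cloned repository; adjust as needed.
sys.path.insert(0, "AutoACU")

try:
    from autoacu.a2cu import A2CU
    from autoacu.a3cu import A3CU
except ImportError as err:
    # Reached if the clone is not at the path above (or dependencies are missing).
    print(f"Could not import the metric classes: {err}")
```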
The model checkpoints for A2CU and A3CU are available on the HuggingFace model hub.
A2CU needs to be initialized with two models, one for ACU generation and one for ACU matching. The default models are the following:
- ACU generation: Yale-LILY/a2cu-generator, which is a T0-3B model finetuned on the RoSE dataset.
- ACU matching: Yale-LILY/a2cu-classifier, which is a DeBERTa-XLarge model finetuned on the RoSE dataset.
Please note that to use A2CU, you may need a GPU with at least 16 GB of memory.
Below is an example of using A2CU to evaluate the similarity between two text sequences:
```python
from autoacu import A2CU

candidates, references = ["This is a test"], ["This is a test"]

a2cu = A2CU()  # uses the default generation and matching models
recall_scores, prec_scores, f1_scores = a2cu.score(
    references=references,
    candidates=candidates,
    generation_batch_size=2,  # the batch size for ACU generation
    matching_batch_size=16,  # the batch size for ACU matching
    output_path=None,  # the path to save the evaluation results
    recall_only=False,  # whether to only compute the recall score
    acu_path=None,  # the path to save the generated ACUs
)
print(f"Recall: {recall_scores[0]:.4f}, Precision: {prec_scores[0]:.4f}, F1: {f1_scores[0]:.4f}")
# Sample output:
# Recall: 0.1250, Precision: 0.1250, F1: 0.1250
```
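For reference, the reported F1 score is simply the harmonic mean of the ACU-level recall and precision, which is why it equals both when they are identical, as in the sample output above. A standalone sketch (not AutoACU's internal code):

```python
def f1(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision; 0 when both are 0."""
    if recall + precision == 0.0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Symmetric scores yield the same F1 value.
print(f"{f1(0.1250, 0.1250):.4f}")  # → 0.1250
```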
The default model checkpoint for A3CU is Yale-LILY/a3cu, which is based on the BERT-Large model. Below is an example of using A3CU to evaluate the similarity between two text sequences:
```python
from autoacu import A3CU

candidates, references = ["This is a test"], ["This is a test"]

a3cu = A3CU()
recall_scores, prec_scores, f1_scores = a3cu.score(
    references=references,
    candidates=candidates,
    batch_size=16,  # the batch size for scoring
    output_path=None,  # the path to save the evaluation results
)
print(f"Recall: {recall_scores[0]:.4f}, Precision: {prec_scores[0]:.4f}, F1: {f1_scores[0]:.4f}")
# Sample output:
# Recall: 0.8007, Precision: 0.8007, F1: 0.8007
```
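Both `score` methods return per-example score lists, one entry per candidate–reference pair. To report a single corpus-level number, you can average them. A minimal, standalone sketch; the score values below are illustrative, not real metric outputs:

```python
def corpus_average(scores):
    """Average a list of per-example scores into one corpus-level score."""
    if not scores:
        raise ValueError("no scores to average")
    return sum(scores) / len(scores)

f1_scores = [0.8007, 0.5993]  # illustrative per-example F1 values
print(f"Corpus-level F1: {corpus_average(f1_scores):.4f}")  # → 0.7000
```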