This repository contains the evaluation code and dataset to reproduce the results from the paper "FUNDUS: A Simple-to-Use News Scraper Optimized for High Quality Extractions".
Fundus is a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code.
In the following sections, we provide instructions to reproduce the comparative evaluation of Fundus against prominent scraping libraries. Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than comparable news scrapers. For a more in-depth overview of Fundus, the evaluation practices, and its results, consult the result summary and our paper.
Fundus and this evaluation repository require Python 3.8 or later. The Boilerpipe scraper additionally requires Java. (Note: The evaluation was tested and performed using Python 3.8 and Java JDK 17.0.10.)
To install the fundus-evaluation Python package, including the reference scraper dependencies, clone this GitHub repository and install the package using pip:
git clone https://github.com/dobbersc/fundus-evaluation.git
pip install ./fundus-evaluation
This installation also contains the dataset and evaluation results.
If you are only interested in the Python package (without the dataset and evaluation results), install the fundus-evaluation package directly from GitHub using pip:
pip install git+https://github.com/dobbersc/fundus-evaluation.git@master
Verify the installation by running evaluate --version, with the expected output of evaluate <version>, where <version> specifies the current version of the evaluation package.
For development, install the package, including the development dependencies:
git clone https://github.com/dobbersc/fundus-evaluation.git
pip install -e ./fundus-evaluation[dev]
In the following steps, we assume that the current working directory is the root of the repository.
To fully reproduce the evaluation results, only the dataset is required.
Each step in the evaluation pipeline requires the outputs from the previous step (dataset -> scrape -> score -> analysis).
To ease reproducibility, we also provide the artifacts of intermediate steps in the dataset folder.
Therefore, the pipeline may be started from any step.
The evaluation results may be reproduced using the package's command line interface (CLI), representing the evaluation pipeline steps:
$ evaluate --help
usage: evaluate [-h] [--version] {complexity,scrape,score,analysis} ...
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
Fundus News Scraper Evaluation:
select evaluation pipeline step
{complexity,scrape,score,analysis}
complexity calculate the page complexity scores
scrape scrape extractions on the evaluation dataset
score calculate evaluation scores
analysis generate tables and plots
Each entry point also provides its own help page, e.g. with evaluate scrape --help.
As an alternative to the CLI, we provide direct Python entry points in fundus_evaluation.entry_points.
In the following steps, we will use the CLI.
We selected the 16 English-language publishers Fundus currently supports as the data source, and retrieved five articles for each publisher from the respective RSS feeds/sitemaps. The selection process yielded an evaluation corpus of 80 news articles. From it, we manually extracted the plain text from each article and stored it together with information on the original paragraph structure.
The resulting evaluation dataset is included in this repository and consists of the (compressed) HTML article files and their ground truth extractions as JSON.
Execute the following command to let all supported scrapers extract the plain text of the evaluation dataset's articles:
evaluate scrape \
--ground-truth-path dataset/ground_truth.json \
--html-directory dataset/html/ \
--output-directory dataset/extractions/
To restrict the scrapers that are part of the evaluation,
- use the --scrapers flag to explicitly specify a list of evaluation scrapers,
- or use the --exclude-scrapers flag to exclude scrapers from the evaluation.
For example, to exclude BoilerNet, which is very resource intensive, add the --exclude-scrapers boilernet argument to the command above.
To evaluate the extraction results with the three supported metrics (paragraph match, ROUGE-LSum and WER), run the following command:
evaluate score \
--ground-truth-path dataset/ground_truth.json \
--extractions-directory dataset/extractions/ \
--output-directory dataset/scores/
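To give an intuition for one of the three metrics, the following is a minimal sketch of the word error rate (WER) as it is commonly defined: the word-level edit distance between a reference text and an extraction, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the implementation used by `evaluate score`, which may tokenize and normalize differently.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: tokenization by whitespace is an assumption.
    """
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words yields an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Under this definition, a perfect extraction scores 0, and the score can exceed 1 when the extraction contains many additional words, for instance leftover navigation text.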
The page complexity step is not part of the evaluation in our paper and is thus optional.
Execute the following command to calculate the page complexity scores established in "An Empirical Comparison of Web Content Extraction Algorithms" (Bevendorff et al., 2023):
evaluate complexity \
--ground-truth-path dataset/ground_truth.json \
--html-directory dataset/html/ \
--output-path dataset/complexity.tsv
Run the following command to produce the paper's tables and plots for the ROUGE-LSum score:
evaluate analysis --rouge-lsum-path dataset/scores/rouge_lsum.tsv --output-directory dataset/analysis/
To also produce a boxplot of the page complexity, execute:
evaluate analysis --complexity-path dataset/complexity.tsv --output-directory dataset/analysis/
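The scrape, score, and analysis commands above can also be chained from a script. The following is a minimal sketch that only assembles and runs the CLI calls shown in this README through `subprocess`; the helper names `build_pipeline_commands` and `run_pipeline` are hypothetical and not part of the package. It assumes that the `evaluate` executable is on the `PATH` and that the working directory is the repository root.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_pipeline_commands(
    dataset_dir: str = "dataset",
    exclude_scraper: Optional[str] = None,
) -> List[List[str]]:
    """Assemble the scrape -> score -> analysis CLI calls (hypothetical helper)."""
    root = Path(dataset_dir)
    ground_truth = (root / "ground_truth.json").as_posix()

    scrape = [
        "evaluate", "scrape",
        "--ground-truth-path", ground_truth,
        "--html-directory", (root / "html").as_posix(),
        "--output-directory", (root / "extractions").as_posix(),
    ]
    if exclude_scraper is not None:
        # Only a single scraper name is passed here, e.g. "boilernet".
        scrape += ["--exclude-scrapers", exclude_scraper]

    score = [
        "evaluate", "score",
        "--ground-truth-path", ground_truth,
        "--extractions-directory", (root / "extractions").as_posix(),
        "--output-directory", (root / "scores").as_posix(),
    ]
    analysis = [
        "evaluate", "analysis",
        "--rouge-lsum-path", (root / "scores" / "rouge_lsum.tsv").as_posix(),
        "--output-directory", (root / "analysis").as_posix(),
    ]
    return [scrape, score, analysis]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each step in order and stop at the first failing step."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling `run_pipeline(build_pipeline_commands(exclude_scraper="boilernet"))` would run all three steps while skipping BoilerNet. Because each step only depends on the files written by the previous one, a shorter list of commands can be passed to start from the provided intermediate artifacts.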
The following table summarizes the overall performance of Fundus and the evaluated scrapers in terms of averaged ROUGE-LSum precision, recall and F1-score, each with its standard deviation. The table is sorted in descending order by F1-score:
Scraper | Precision | Recall | F1-Score |
---|---|---|---|
Fundus | 99.89±0.57 | 96.75±12.75 | 97.69±9.75 |
Trafilatura | 90.54±18.86 | 93.23±23.81 | 89.81±23.69 |
BTE | 81.09±19.41 | 98.23±8.61 | 87.14±15.48 |
jusText | 86.51±18.92 | 90.23±20.61 | 86.96±19.76 |
news-please | 92.26±12.40 | 86.38±27.59 | 85.81±23.29 |
BoilerNet | 84.73±20.82 | 90.66±21.05 | 85.77±20.28 |
Boilerpipe | 82.89±20.65 | 82.11±29.99 | 79.90±25.86 |
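To illustrate how entries of the form mean±std in the table can be read, the following sketch computes a simplified ROUGE-L precision, recall, and F1-score from the longest common subsequence (LCS) of two token sequences and then aggregates per-article scores. It rests on several assumptions: it uses plain ROUGE-L over whitespace tokens rather than the sentence-aware ROUGE-LSum variant, it reports the population standard deviation (whether the paper uses the population or sample variant is not stated here), and the names `lcs_length`, `rouge_l`, and `summarize` are hypothetical. It is not the scoring code of this repository.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l(reference: str, extraction: str) -> Tuple[float, float, float]:
    """Simplified ROUGE-L (precision, recall, F1) over whitespace tokens."""
    ref, ext = reference.split(), extraction.split()
    if not ref or not ext:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(ref, ext)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision = lcs / len(ext)  # share of extracted tokens that match
    recall = lcs / len(ref)     # share of reference tokens that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def summarize(scores: Sequence[float]) -> str:
    """Format scores in [0, 1] as 'mean±std' in percent (population std assumed)."""
    percent: List[float] = [100 * score for score in scores]
    return f"{mean(percent):.2f}±{pstdev(percent):.2f}"
```

Read this way, high precision with lower recall suggests extractions that are clean but occasionally incomplete, while high recall with lower precision suggests extractions that contain the article together with extra page content.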
Please cite the following paper when using Fundus or building upon our work:
@misc{dallabetta2024fundus,
title={Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions},
author={Max Dallabetta and Conrad Dobberstein and Adrian Breiding and Alan Akbik},
year={2024},
eprint={2403.15279},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
- This repository's architecture has been inspired by the web content extraction benchmark (Bevendorff et al., 2023).
- Since BoilerNet has no Python package on PyPI, we adopted a stripped-down version of the upstream BoilerNet provided by Bevendorff et al. from their web content extraction benchmark.
- Similarly, BTE has no Python package on PyPI. Here, we used the implementation by Jan Pomikalek found from this and this source.