
Fundus News Scraper Evaluation

This repository contains the evaluation code and dataset to reproduce the results from the paper "FUNDUS: A Simple-to-Use News Scraper Optimized for High Quality Extractions".

Fundus is a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code.

In the following sections, we provide instructions to reproduce the comparative evaluation of Fundus against prominent scraping libraries. Our evaluation shows that Fundus yields significantly higher-quality extractions (complete and artifact-free news articles) than comparable news scrapers. For a more in-depth overview of Fundus, the evaluation methodology, and the results, consult the result summary and our paper.
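To illustrate the "few lines of code" claim, a minimal Fundus crawl looks roughly like the following sketch. It is adapted from the Fundus documentation; it requires the fundus package and internet access, and the exact API may differ between Fundus versions:

```python
from fundus import Crawler, PublisherCollection

# Crawl two articles from the US-based publishers supported by Fundus.
crawler = Crawler(PublisherCollection.us)
for article in crawler.crawl(max_articles=2):
    print(article)
```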

Prerequisites

Fundus and this evaluation repository require Python 3.8 or later, as well as Java for the Boilerpipe scraper. (Note: The evaluation was performed and tested with Python 3.8 and Java JDK 17.0.10.)

To install the fundus-evaluation Python package, including the reference scraper dependencies, clone this GitHub repository and install the package using pip:

git clone https://github.com/dobbersc/fundus-evaluation.git
pip install ./fundus-evaluation

This installation also contains the dataset and evaluation results. If you are only interested in the Python package (without the dataset and evaluation results), install the fundus-evaluation package directly from GitHub using pip:

pip install git+https://github.com/dobbersc/fundus-evaluation.git@master

Verify the installation by running evaluate --version; the expected output is evaluate <version>, where <version> is the current version of the evaluation package.

Development

For development, install the package, including the development dependencies:

git clone https://github.com/dobbersc/fundus-evaluation.git
pip install -e "./fundus-evaluation[dev]"

Reproducing the Evaluation Results

In the following steps, we assume that the current working directory is the root of the repository.

To fully reproduce the evaluation results, only the dataset is required. Each step in the evaluation pipeline consumes the outputs of the previous step (dataset -> scrape -> score -> analysis). To ease reproducibility, we also provide the artifacts of the intermediate steps in the dataset folder; the pipeline may therefore be started from any step.

Usage

The evaluation results may be reproduced using the package's command-line interface (CLI), whose subcommands correspond to the evaluation pipeline steps:

$ evaluate --help
usage: evaluate [-h] [--version] {complexity,scrape,score,analysis} ...

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Fundus News Scraper Evaluation:
  select evaluation pipeline step

  {complexity,scrape,score,analysis}
    complexity          calculate the page complexity scores
    scrape              scrape extractions on the evaluation dataset
    score               calculate evaluation scores
    analysis            generate tables and plots

Each subcommand also provides its own help page, e.g. evaluate scrape --help.

As an alternative to the CLI, we provide direct Python entry points in fundus_evaluation.entry_points. The following steps use the CLI.

(1) Obtaining the Evaluation Dataset

We selected the 16 English-language publishers Fundus currently supports as the data source and retrieved five articles per publisher from the respective RSS feeds/sitemaps. This selection process yielded an evaluation corpus of 80 news articles. We then manually extracted the plain text from each article and stored it together with information on the original paragraph structure.

The resulting evaluation dataset is included in this repository and consists of the (compressed) HTML article files and their ground truth extractions as JSON.

(2) Generating the Scraper Extractions

Execute the following command to let all supported scrapers extract the plain text of the evaluation dataset's articles:

evaluate scrape \
  --ground-truth-path dataset/ground_truth.json \
  --html-directory dataset/html/ \
  --output-directory dataset/extractions/

To restrict the scrapers that are part of the evaluation,

  • use the --scrapers flag to explicitly specify a list of evaluation scrapers,
  • or use the --exclude-scrapers flag to exclude scrapers from the evaluation.

For example, to exclude BoilerNet, which is very resource-intensive, add the --exclude-scrapers boilernet argument to the command above.
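Putting this together, the scrape step from above with BoilerNet excluded would be invoked as:

```shell
evaluate scrape \
  --ground-truth-path dataset/ground_truth.json \
  --html-directory dataset/html/ \
  --output-directory dataset/extractions/ \
  --exclude-scrapers boilernet
```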

(3) Calculating the Evaluation Scores

To evaluate the extraction results with the three supported metrics (paragraph match, ROUGE-LSum, and word error rate (WER)), run the following command:

evaluate score \
  --ground-truth-path dataset/ground_truth.json \
  --extractions-directory dataset/extractions/ \
  --output-directory dataset/scores/
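To build intuition for one of these metrics: WER is commonly defined as the word-level edit distance between the extraction and the ground truth, divided by the length of the ground truth. The following self-contained sketch illustrates that definition; it is not the package's implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] holds the edit distance between ref[:i-1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1] / len(ref)

# One missing word out of four reference words yields a WER of 0.25.
print(word_error_rate("the quick brown fox", "the quick fox"))  # 0.25
```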

Calculating the Page Complexity (Optional)

This step is not part of the evaluation in our paper and is thus optional.

Execute the following command to calculate the page complexity scores established in "An Empirical Comparison of Web Content Extraction Algorithms" (Bevendorff et al., 2023):

evaluate complexity \
  --ground-truth-path dataset/ground_truth.json \
  --html-directory dataset/html/ \
  --output-path dataset/complexity.tsv

(4) Analyzing the Data

Run the following command to produce the paper's tables and plots for the ROUGE-LSum score:

evaluate analysis --rouge-lsum-path dataset/scores/rouge_lsum.tsv --output-directory dataset/analysis/

To also produce a boxplot of the page complexity, execute:

evaluate analysis --complexity-path dataset/complexity.tsv --output-directory dataset/analysis/

Results

The following table summarizes the overall performance of Fundus and the evaluated scrapers in terms of averaged ROUGE-LSum precision, recall, and F1-score, each with its standard deviation. The table is sorted in descending order of F1-score:

| Scraper | Precision | Recall | F1-Score |
|---|---|---|---|
| Fundus | 99.89±0.57 | 96.75±12.75 | 97.69±9.75 |
| Trafilatura | 90.54±18.86 | 93.23±23.81 | 89.81±23.69 |
| BTE | 81.09±19.41 | 98.23±8.61 | 87.14±15.48 |
| jusText | 86.51±18.92 | 90.23±20.61 | 86.96±19.76 |
| news-please | 92.26±12.40 | 86.38±27.59 | 85.81±23.29 |
| BoilerNet | 84.73±20.82 | 90.66±21.05 | 85.77±20.28 |
| Boilerpipe | 82.89±20.65 | 82.11±29.99 | 79.90±25.86 |

Cite

Please cite the following paper when using Fundus or building upon our work:

@misc{dallabetta2024fundus,
      title={Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions}, 
      author={Max Dallabetta and Conrad Dobberstein and Adrian Breiding and Alan Akbik},
      year={2024},
      eprint={2403.15279},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Acknowledgements

  • This repository's architecture has been inspired by the web content extraction benchmark (Bevendorff et al., 2023).
  • Since BoilerNet has no Python package on PyPI, we adopted a stripped-down version of the upstream BoilerNet provided by Bevendorff et al. from their web content extraction benchmark.
  • Similarly, BTE has no Python package on PyPI. Here, we used the implementation by Jan Pomikalek from this and this source.