Script that scrapes news articles in the 2022 Semeval Task 8 format from the Internet Archive.
Details about the data and the task are available on the project homepage.
A pair of articles with id 0123456789_9876543210
will be stored in
output_dir/89/0123456789.{html|json}
and
output_dir/10/9876543210.{html|json}
respectively. In this example, each subdirectory is named after the last two digits of the article id.
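The naming scheme in the example above can be sketched in a few lines of Python. This is only an illustration, not the package's own code: the helper name article_paths is hypothetical, and the rule "subdirectory = last two characters of the article id" is inferred from the example paths rather than confirmed by the package.

```python
from pathlib import Path


def article_paths(pair_id: str, output_dir: str = "output_dir") -> list[dict[str, Path]]:
    """Return the expected .html and .json paths for both articles in a pair.

    Hypothetical helper: the layout is inferred from the example paths above.
    """
    paths = []
    # A pair id is two article ids joined by an underscore.
    for article_id in pair_id.split("_"):
        # Assumption: the subdirectory is the last two characters of the article id.
        subdir = Path(output_dir) / article_id[-2:]
        paths.append({
            "html": subdir / f"{article_id}.html",
            "json": subdir / f"{article_id}.json",
        })
    return paths


first, second = article_paths("0123456789_9876543210")
print(first["html"].as_posix())   # output_dir/89/0123456789.html
print(second["json"].as_posix())  # output_dir/10/9876543210.json
```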
The HTML file contains the web page of the article as retrieved from the Internet Archive. The JSON file contains additional information extracted from the page with the newspaper3k package.
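Once the download has finished, the two files for an article can be read back with the standard library alone. The sketch below makes several assumptions: the function name load_article is hypothetical, the directory layout is the one inferred from the example paths, and both files are assumed to be UTF-8 encoded. It does not assume any particular field names in the JSON file.

```python
import json
from pathlib import Path


def load_article(article_id: str, output_dir: str = "output_dir"):
    """Return (html_text, metadata) for one downloaded article.

    Hypothetical helper: assumes output_dir/<last two id characters>/<id>.{html,json}
    and UTF-8 encoded files.
    """
    subdir = Path(output_dir) / article_id[-2:]
    html_text = (subdir / f"{article_id}.html").read_text(encoding="utf-8")
    # The JSON content is returned as parsed; no schema is assumed here.
    metadata = json.loads((subdir / f"{article_id}.json").read_text(encoding="utf-8"))
    return html_text, metadata
```

Calling load_article("0123456789") would then read output_dir/89/0123456789.html and output_dir/89/0123456789.json, and inspecting the returned metadata shows which fields were actually extracted.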
The code is available on GitHub, together with sample input data (sample_data.csv).
python3 -m venv venv
source venv/bin/activate
pip install semeval_8_2022_ia_downloader
python -m semeval_8_2022_ia_downloader.cli --links_file=input.csv --dump_dir=output_dir
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.