wikipedia_tools
is a Python package for downloading Wikipedia revisions of pages belonging to certain categories within a given period of time. The package also provides overview statistics for the downloaded data.
If you use this package, please cite it as follows:

@software{elbaff:2022-software,
  author  = {El Baff, Roxanne and Hecking, Tobias},
  title   = {{Wikipedia Revisions Downloader and Analyzer}},
  url     = {https://github.com/DLR-SC/wikipedia-periodic-revisions},
  version = {2.4.1},
  license = {MIT},
  month   = dec,
  year    = {2022}
}
This package is built on top of the Wikipedia API; the corresponding code was forked into the base subpackage.
We also forked the code from ajoer/WikiRevParser and modified it to support from and to datetimes, so that revisions can be fetched within a given period; the modified code is wikipedia_tools.scraper.wikirevparser_with_time.py.
Note: There is no need to download these two projects; they are already integrated into this project.
Via PIP
pip install wikipedia_tools
Or install it manually by cloning the repository and then running
pip install -e wikipedia_tools
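After installation, a quick sanity check is to import the package in a Python shell; this minimal sketch only assumes the package name used in the install commands above:

import wikipedia_tools  # no ImportError means the installation succeeded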
This package is responsible for:
- fetching the wiki page revisions based on a period of time,
- loading them into parquet files, and
- providing basic analysis.
It contains three main subpackages, plus a utils package with a few helper functions:
Download Wiki Article Revisions [wikipedia_tools.scraper]
This subpackage is responsible for downloading Wikipedia revisions from the web.
The code below shows how to download all the revisions of pages:
- belonging to the Climate_change category;
- with revisions between the start of 8 months ago (1.1.2022) and now (29.9.2022). The get_x_months_ago_date function returns the datetime of the beginning of 8 months ago:

  from wikipedia_tools.utils import utils
  utils.get_x_months_ago_date(8)

- If save_each_page=True, each page is fetched and saved on the spot under the folder data/periodic_wiki_batches/{categories_names}/from{month-year}_to{month-year}; otherwise, all the page revisions are fetched first and then saved into one jsonl file.
from wikipedia_tools.scraper import downloader
from wikipedia_tools.utils import utils
from datetime import datetime

wikirevs = downloader.WikiPagesRevision(
    categories=["Climate_change"],
    revisions_from=utils.get_x_months_ago_date(8),
    revisions_to=datetime.now(),
    save_each_page=True
)

count, destination_folder = wikirevs.download()
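As the variable names above suggest, download() returns the number of downloaded pages together with the folder they were written to, so a quick print of the two returned values is enough to confirm the run:

# count and destination_folder are returned by wikirevs.download() above.
print(f"Downloaded {count} page(s) into {destination_folder}")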
For German wiki revisions, you can set the lang attribute to de. For example, you can download the German Wikipedia page revisions for the climate change category (Klimaveränderung) as follows:
from wikipedia_tools.scraper import downloader
from wikipedia_tools.utils import utils
from datetime import datetime

wikirevs = downloader.WikiPagesRevision(
    categories=["Klimaveränderung"],
    # Beginning of last month; alternatively, use datetime.now() with
    # dateutil.relativedelta.relativedelta(...) to build a custom past datetime.
    revisions_from=utils.get_x_months_ago_date(1),
    revisions_to=datetime.now(),
    save_each_page=True,
    lang="de"
)

count, destination_folder = wikirevs.download()
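As noted in the comment above, instead of get_x_months_ago_date you can build a custom relative start date with python-dateutil. This is a minimal sketch that only uses the standard datetime module and the dateutil package:

from datetime import datetime
from dateutil.relativedelta import relativedelta

# Example: start fetching revisions three months before today.
revisions_from = datetime.now() - relativedelta(months=3)
revisions_to = datetime.now()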
You can then process each file by, for example, reading the parquet file using pandas:
import pandas as pd
from glob import glob

files = f"{destination_folder}/*.parquet"

# Loop over all wiki page revision files for this period and read each one as a pandas DataFrame
for page_path in glob(files):
    page_revs_df = pd.read_parquet(page_path)
    # DataFrame with columns ['page', 'lang', 'timestamp', 'categories', 'content', 'images', 'links', 'sections', 'urls', 'user']
    # process/use the DataFrame ...
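Because every per-page file shares the columns listed above, you can also concatenate all files into a single DataFrame and compute quick statistics. The sketch below only relies on pandas and the destination_folder returned by the download step:

import pandas as pd
from glob import glob

# Combine all per-page revision files into one DataFrame.
all_revs_df = pd.concat(
    (pd.read_parquet(path) for path in glob(f"{destination_folder}/*.parquet")),
    ignore_index=True,
)

# Example statistics: revisions per page and the most active users.
print(all_revs_df.groupby("page").size().sort_values(ascending=False).head())
print(all_revs_df["user"].value_counts().head())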
## Initialize the analyzer object
from wikipedia_tools.analyzer.revisions import WikipediaRevisionAnalyzer

# category, CORPUS, and ROOT_PATH are placeholders for your own values;
# properties is the module that provides the PERIODS options.
analyzer = WikipediaRevisionAnalyzer(
    category=category,
    period=properties.PERIODS._YEARLY_,
    corpus=CORPUS,
    root=ROOT_PATH
)
# Get the yearly number of articles that were created/edited at least once
unique_created_updated_articles = analyzer.get_edited_page_count(plot=True, save=True)

# Returns the number of created articles over time
unique_created_articles = analyzer.get_created_page_count(plot=True, save=True)

# Returns the number of revisions over time
rev_overtime_df = analyzer.get_revisions_over_time(save=True)

# Returns the number of words over time
words_overtime_df = analyzer.get_words_over_time(save=True)

# Returns the number of users over time, grouped by user type
users_overtime_df = analyzer.get_users_over_time(save=True)

# Returns the top n most edited Wikipedia articles over time
top_edited = analyzer.get_most_edited_articles(top=4)

# Returns the articles sorted from most to least edited over time
most_to_least_revised = analyzer.get_periodic_most_to_least_revised(save=True)
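Assuming, as the *_df variable names above suggest, that the analyzer methods return pandas DataFrames, the results can be inspected and exported with standard pandas calls, for example:

# Inspect and persist one of the results returned above (assumes a pandas DataFrame).
print(rev_overtime_df.head())
rev_overtime_df.to_csv("revisions_over_time.csv", index=False)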
You can find the full example under the examples folder.