Examples
Here are some examples of how wptools is being used.
- Get an article extract
- Get a representative image
- Get page HTML
- Get Infobox data
- Get cover images
- Get Wikidata
- Extend Wikidata claims
- Get all the page info
- Get category members
- Get site info
- List most popular articles
The get_query() method gets both a light HTML extract (extract) and a Markdown text extract (extext).
>>> page = wptools.page('Ella Fitzgerald')
>>> page.get_query()
en.wikipedia.org (query) Ella Fitzgerald
en.wikipedia.org (imageinfo) File:Ella Fitzgerald (Gottlieb 02871...
Ella Fitzgerald (en) data
{
extext: <str(2002)> **Ella Jane Fitzgerald** (April 25, 1917 – J...
extract: <str(2067)> <p><b>Ella Jane Fitzgerald</b> (April 25, 1...
...
}
Compare to RESTBase extracts:
>>> page.get_restbase('summary')
en.wikipedia.org (restbase) /page/summary/Ella Fitzgerald
Ella Fitzgerald (en) data
{
exhtml: <str(1455)> <p><b>Ella Jane Fitzgerald</b> (April 25, 19...
exrest: <str(1424)> Ella Jane Fitzgerald (April 25, 1917 – June ...
...
}
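Once either call has run, the extracts sit side by side in the page data. A minimal sketch of choosing one follows, assuming page.data behaves like a plain dict keyed by the names shown above (extext, extract, exhtml, exrest). The helper best_extract and its preference order are illustrative and not part of wptools.

```python
def best_extract(data, prefer=("exrest", "extext", "exhtml", "extract")):
    """Return (key, text) for the first non-empty extract found in `data`.

    `data` is assumed to be a dict shaped like page.data. The default
    order (plain text before HTML) is an arbitrary choice for this sketch.
    """
    for key in prefer:
        text = data.get(key)
        if text:
            return key, text
    return None, None


# Hand-shortened stand-in for page.data after get_query(); no network needed.
sample = {
    "extract": "<p><b>Ella Jane Fitzgerald</b> ...</p>",
    "extext": "**Ella Jane Fitzgerald** ...",
}

key, text = best_extract(sample)
print(key, text)
```

With a real page, the same helper could be called as best_extract(page.data) after get_query() or get_restbase('summary').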
A representative image for a page can come from the MediaWiki API, from an Infobox, from Wikidata (property P18), or from RESTBase. See the Images wiki page for details.
>>> page = wptools.page('Frida Kahlo')
>>> page.get_query()
en.wikipedia.org (query) Frida Kahlo
en.wikipedia.org (imageinfo) File:Frida Kahlo, by Guillermo Kahlo...
Frida Kahlo (en) data
{
image: <list(2)> {'kind': 'query-pageimage', u'descriptionshortu...
...
}
>>> page.pageimage()
['query-pageimage', 'query-thumbnail']
>>> page.pageimage('page')['url']
u'https://upload.wikimedia.org/wikipedia/commons/0/06/Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg'
>>> page.pageimage('thumb')['url']
u'https://upload.wikimedia.org/wikipedia/commons/thumb/0/06/Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg/160px-Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg'
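The session above suggests that each entry in the image list is a dict carrying at least a kind and a url. The sketch below shows one way to select an entry by kind from such a list. It is a hypothetical helper (pick_image) working on hand-built data, not the code behind pageimage(), and the substring match is an assumption.

```python
def pick_image(images, token):
    """Return the first image dict whose 'kind' contains `token`, else None."""
    for image in images:
        if token in image.get("kind", ""):
            return image
    return None


# Hand-built stand-in for page.data['image']; real entries carry more fields.
images = [
    {
        "kind": "query-pageimage",
        "url": "https://upload.wikimedia.org/wikipedia/commons/0/06/"
               "Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg",
    },
    {
        "kind": "query-thumbnail",
        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/0/06/"
               "Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg/"
               "160px-Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg",
    },
]

thumb = pick_image(images, "thumb")
print(thumb["kind"])
print(pick_image(images, "cover"))  # no cover image in this list
```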
The most performant way to get article HTML is via RESTBase.
>>> page = wptools.page('Buddha')
>>> page.get_restbase('html')
en.wikipedia.org (restbase) /page/html/Buddha
Buddha (en) data
{
html: <str(628054)> <!DOCTYPE html><html prefix="dc: http://purl...
}
Getting data from Infoboxes may be unavoidable, but getting Wikidata (via get_wikidata()) is preferred. Wikidata is structured but sometimes data-poor, while Infoboxes are unstructured and frequently data-rich. Please consider updating Wikidata if the information you want is only available in a MediaWiki instance so that others may benefit from open linked data.
>>> page = wptools.page('Fela Kuti')
>>> page.get_parse()
en.wikipedia.org (parse) Fela Kuti
en.wikipedia.org (imageinfo) File:Fela Kuti.jpg
Fela Kuti (en) data
{
infobox: <dict(17)> website, associated_acts, death_place, image...
...
}
>>> page.data['infobox']['instrument']
'Saxophone, vocals, keyboards, trumpet, guitar, drums'
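The "prefer Wikidata, fall back to the Infobox" advice can be sketched as a small lookup over the page data. This is an illustration under assumptions: the helper lookup is hypothetical, page.data is treated as a plain dict, and the "instrument" key on the Wikidata side is invented for the example. Because Infobox values are unstructured strings, the sketch also splits the result by hand.

```python
def lookup(data, wikidata_key, infobox_key):
    """Prefer the resolved Wikidata value; fall back to the infobox value.

    Returns (value, source), where source is 'wikidata', 'infobox', or None.
    """
    value = (data.get("wikidata") or {}).get(wikidata_key)
    if value is not None:
        return value, "wikidata"
    value = (data.get("infobox") or {}).get(infobox_key)
    if value is not None:
        return value, "infobox"
    return None, None


# Stand-in for page.data: nothing relevant resolved from Wikidata,
# but the infobox has the field as free text.
sample = {
    "wikidata": {},
    "infobox": {
        "instrument": "Saxophone, vocals, keyboards, trumpet, guitar, drums",
    },
}

value, source = lookup(sample, "instrument", "instrument")
instruments = [part.strip() for part in value.split(",")]
print(source, instruments)
```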
Most cover images for media (albums, books, films, etc.) on Wikipedia appear in an Infobox. For convenience, we put "cover" files from infoboxes in the image attribute.
>>> page = wptools.page('Blue Train (album)')
>>> page.get_parse()
en.wikipedia.org (parse) Blue Train (album)
en.wikipedia.org (imageinfo) File:John Coltrane - Blue Train.jpg
Blue Train (album) (en) data
{
image: <list(1)> {'kind': 'parse-cover', u'descriptionshorturl':...
infobox: <dict(16)> Name, Language, Artist, Cover, Recorded, Lab...
...
}
>>> page.pageimage()
['parse-cover']
>>> page.pageimage('cover')['url']
u'https://upload.wikimedia.org/wikipedia/en/6/68/John_Coltrane_-_Blue_Train.jpg'
Resolved properties and claims are stored in the wikidata attribute. Wikidata properties are selected by wptools.wikidata.LABELS. Properties (e.g. P17 "country") are stored in properties, and those properties that have Wikidata items for values (e.g. Q142 "France") are stored in claims and resolved by another Wikidata API call. See the Wikidata page in our wiki for more details.
>>> page = wptools.page('Stephen Fry')
>>> page.get_wikidata()
www.wikidata.org (wikidata) Stephen Fry
www.wikidata.org (claims) Q8817795|Q5|Q7066|Q145
en.wikipedia.org (imageinfo) File:Stephen Fry cropped.jpg
Stephen Fry (en) data
{
aliases: <list(1)> Stephen John Fry
claims: <dict(4)> Q8817795, Q5, Q7066, Q145
description: English comedian, actor, writer, presenter, and activist
image: <list(1)> {'kind': 'wikidata-image', u'descriptionshortur...
label: Stephen Fry
modified: <dict(1)> wikidata
pageid: 191035
properties: <dict(8)> P135, P345, P910, P27, P856, P569, P18, P31
title: Stephen_Fry
what: human
wikibase: Q192912
wikidata: <dict(8)> website, category, citizenship, image, insta...
wikidata_url: https://www.wikidata.org/wiki/Q192912
}
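The relationship between properties, claims, and the resolved wikidata attribute can be illustrated with plain dicts. This is a simplified sketch, not the wptools implementation: the function resolve, the one-value-per-property shape, and the small label table are assumptions made for illustration, and the real structures may hold lists or richer values.

```python
def resolve(properties, labels, claims):
    """Map labeled properties to readable values.

    properties: {property id: raw value}, e.g. {"P17": "Q142"}
    labels:     {property id: label}; properties without a label are skipped
    claims:     {item id: resolved item label}, e.g. {"Q142": "France"}
    """
    resolved = {}
    for pid, value in properties.items():
        label = labels.get(pid)
        if label is None:
            continue  # not selected by the label table
        # Item-valued properties are swapped for the item's label;
        # other values (file names, dates, strings) pass through unchanged.
        resolved[label] = claims.get(value, value)
    return resolved


# Hypothetical label table standing in for wptools.wikidata.LABELS.
labels = {"P17": "country", "P31": "instance", "P18": "image"}

# Hypothetical raw values; P345 has no label above, so it is dropped.
properties = {"P17": "Q142", "P31": "Q5", "P18": "Example.jpg", "P345": "x"}

# Item labels as they might come back from the second (claims) API call.
claims = {"Q142": "France", "Q5": "human"}

print(resolve(properties, labels, claims))
```

Adding an entry to the label table (for instance a label for P345) is all it takes for that property to appear in the result, which is the role update_labels() plays for a real page.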
If the predefined wptools.wikidata.LABELS do not include something you want resolved from a claim, you can simply add your property labels via update_labels():
>>> page = wptools.page('Simone de Beauvoir')
>>> page.update_labels({'P21': 'gender'})
>>> page.get_wikidata()
www.wikidata.org (wikidata) Simone de Beauvoir
www.wikidata.org (claims) Q142|Q5|Q3411417|Q859773|Q38066|Q151578...
en.wikipedia.org (imageinfo) File:Simone de Beauvoir.jpg
Simone de Beauvoir (en) data
{
wikidata: <dict(10)> category, death, citizenship, gender, image...
...
}
>>> page.data['wikidata']['gender']
u'female'
Simply calling get() on a page will automagically fetch extracts, images, infobox data, wikidata, and other metadata via the MediaWiki, Wikidata, and RESTBase APIs.
>>> page = wptools.page('Gandhi').get()
en.wikipedia.org (query) Gandhi
en.wikipedia.org (parse) 19379
www.wikidata.org (wikidata) Q1001
www.wikidata.org (claims) Q6581097|Q5|Q129286|Q6512732|Q668
en.wikipedia.org (restbase) /page/summary/Mahatma_Gandhi
en.wikipedia.org (imageinfo) File:Portrait Gandhi.jpg|File:MKGandhi.jpg
Mahatma Gandhi (en) data
{
aliases: <list(10)> M K Gandhi, Mohandas Gandhi, Bapu, Gandhi, M...
claims: <dict(5)> Q6581097, Q5, Q129286, Q6512732, Q668
description: <str(67)> pre-eminent leader of Indian nationalism ...
exhtml: <str(1064)> <p>Mahātmā <b>Mohandas Karamchand Gandhi</b>...
exrest: <str(896)> Mahātmā Mohandas Karamchand Gandhi (; Hindust...
extext: <str(2985)> Mahātmā **Mohandas Karamchand Gandhi** (; Hi...
extract: <str(3212)> <p>Mahātmā <b>Mohandas Karamchand Gandhi</b...
image: <list(6)> {'kind': 'query-pageimage', u'descriptionshortu...
infobox: <dict(25)> known_for, other_names, image, signature, bi...
label: Mahatma Gandhi
length: 264,127
links: <list(10)> https://biblio.wiki/wiki/Mohandas_K._Gandhi, h...
modified: <dict(2)> wikidata, page
pageid: 19379
parsetree: <str(333405)> <root><template><title>Redirect</title>...
properties: <dict(8)> P345, P910, P27, P21, P569, P18, P31, P570
random: Pukara (Moquegua)
title: Mahatma_Gandhi
url: https://en.wikipedia.org/wiki/Mahatma_Gandhi
url_raw: https://en.wikipedia.org/wiki/Mahatma_Gandhi?action=raw
watchers: 1,733
what: human
wikibase: Q1001
wikidata: <dict(8)> category, death, citizenship, gender, image,...
wikidata_url: https://www.wikidata.org/wiki/Q1001
wikitext: <str(262663)> {{Redirect|Gandhi}}{{pp-move-indef}}{{pp...
}
You can also call get_more() to get further page data that requires a more expensive (slower) query, such as page files, categories, languages, contributors, and average daily views:
>>> page.get_more()
en.wikipedia.org (querymore) Gandhi
Mahatma Gandhi (en) data
{
categories: <list(67)> Category:1869 births, Category:1948 death...
contributors: 2,608
files: <list(52)> File:Aum Om red.svg, File:Commons-logo.svg, Fi...
languages: <list(167)> {u'lang': u'af', u'title': u'Mahatma Gand...
modified: <dict(1)> page
pageid: 19379
title: Mahatma Gandhi
views: 21,603
}
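The categories and files lists shown above hold full page titles, namespace prefix included. A small sketch of trimming those prefixes for display follows. The helper strip_namespace is hypothetical, and it only handles the two prefixes that appear in this output.

```python
def strip_namespace(title, namespaces=("Category:", "File:")):
    """Drop a known namespace prefix from a page title, if present."""
    for prefix in namespaces:
        if title.startswith(prefix):
            return title[len(prefix):]
    return title  # no known prefix: leave the title untouched


# Entries copied from the output above; the real lists are much longer.
categories = ["Category:1869 births"]
files = ["File:Aum Om red.svg", "File:Commons-logo.svg"]

print([strip_namespace(t) for t in categories])
print([strip_namespace(t) for t in files])
```

Matching on known prefixes, rather than splitting on the first colon, avoids mangling titles that contain a colon of their own.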