Examples
- Get an article extract
- Get a representative image
- Get page HTML
- Get Infobox data
- Get cover images
- Get Wikidata
- Extend Wikidata claims
- Get all the page info
- Get category members
- Get site info
- List most popular articles
The get_query() method gets (light) HTML and (Markdown) text extracts.
>>> page = wptools.page('Ella Fitzgerald')
>>> page.get_query()
en.wikipedia.org (query) Ella Fitzgerald
en.wikipedia.org (imageinfo) File:Ella Fitzgerald (Gottlieb 02871...
Ella Fitzgerald (en) data
{
extext: <str(2002)> **Ella Jane Fitzgerald** (April 25, 1917 – J...
extract: <str(2067)> <p><b>Ella Jane Fitzgerald</b> (April 25, 1...
...
}
Compare to RESTBase extracts:
>>> page.get_restbase('summary')
en.wikipedia.org (restbase) /page/summary/Ella Fitzgerald
Ella Fitzgerald (en) data
{
exhtml: <str(1455)> <p><b>Ella Jane Fitzgerald</b> (April 25, 19...
exrest: <str(1424)> Ella Jane Fitzgerald (April 25, 1917 – June ...
...
}
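Since the two calls fill different keys (extext and extract from get_query(), exhtml and exrest from RESTBase), it can help to pick whichever extract is available. The sketch below is only an illustration: first_extract is a hypothetical helper, not part of wptools, and it works on a plain dict shaped like the data shown above. The preference order is an assumption.

```python
def first_extract(data, keys=("exrest", "extext", "extract", "exhtml")):
    """Return (key, text) for the first non-empty extract, else (None, '')."""
    for key in keys:
        text = data.get(key)
        if text:
            return key, text
    return None, ""


# Illustrative stand-in for page.data; the values are made up.
sample = {
    "extract": "<p><b>Example</b> text</p>",
    "extext": "**Example** text",
}

key, text = first_extract(sample)
print(key, text)
```

Reordering the keys tuple changes which format wins when several are present.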
A representative image for a page can come from the MediaWiki API, from an Infobox, from Wikidata (property P18), or from RESTBase. See the Images wiki page for details.
>>> page = wptools.page('Frida Kahlo')
>>> page.get_query()
en.wikipedia.org (query) Frida Kahlo
en.wikipedia.org (imageinfo) File:Frida Kahlo, by Guillermo Kahlo...
Frida Kahlo (en) data
{
image: <list(2)> {'kind': 'query-pageimage', u'descriptionshortu...
...
}
>>> page.pageimage()
['query-pageimage', 'query-thumbnail']
>>> page.pageimage('page')['url']
u'https://upload.wikimedia.org/wikipedia/commons/0/06/Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg'
>>> page.pageimage('thumb')['url']
u'https://upload.wikimedia.org/wikipedia/commons/thumb/0/06/Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg/160px-Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg'
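The session above suggests that images are dicts with a kind and a url, and that a short token such as 'page' or 'thumb' selects one of them. A rough equivalent over a plain list might look like this. find_image is a hypothetical helper, and substring matching on kind is an assumption about the behavior, not a description of how wptools implements pageimage().

```python
def find_image(images, token):
    """Return the first image dict whose 'kind' contains token, else None."""
    for image in images:
        if token in image.get("kind", ""):
            return image
    return None


# Illustrative stand-in for page.data['image']; the URLs are placeholders.
images = [
    {"kind": "query-pageimage", "url": "https://example.org/full.jpg"},
    {"kind": "query-thumbnail", "url": "https://example.org/thumb.jpg"},
]

thumb = find_image(images, "thumb")
print(thumb["url"] if thumb else "no thumbnail")
```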
The most performant way to get article HTML is via RESTBase.
>>> page = wptools.page('Buddha')
>>> page.get_restbase('html')
en.wikipedia.org (restbase) /page/html/Buddha
Buddha (en) data
{
html: <str(628054)> <!DOCTYPE html><html prefix="dc: http://purl...
}
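The html value is a full document, so a common next step is to reduce it to text. The sketch below uses only the standard library's html.parser. TextCollector and html_to_text are hypothetical names, and the sample string is a tiny stand-in for the much larger RESTBase output.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-blank text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.parts.append(stripped)


def html_to_text(html):
    """Return the text content of html, joined with single spaces."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.parts)


sample_html = "<!DOCTYPE html><html><body><p><b>Buddha</b> example</p></body></html>"
print(html_to_text(sample_html))
```

This naive collector also keeps the contents of script and style elements, so real pages may need extra filtering.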
Getting data from Infoboxes may be unavoidable, but getting Wikidata (via get_wikidata()) is preferred. Wikidata is structured but sometimes data-poor, while Infoboxes are unstructured and frequently data-rich. Please consider updating Wikidata if the information you want is only available in a MediaWiki instance so that others may benefit from open linked data.
>>> page = wptools.page('Fela Kuti')
>>> page.get_parse()
en.wikipedia.org (parse) Fela Kuti
en.wikipedia.org (imageinfo) File:Fela Kuti.jpg
Fela Kuti (en) data
{
infobox: <dict(17)> website, associated_acts, death_place, image...
...
}
>>> page.data['infobox']['instrument']
'Saxophone, vocals, keyboards, trumpet, guitar, drums'
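Because Infobox values are unstructured strings, turning them into lists is left to the caller. A minimal sketch, assuming a simple comma-separated value like the one above (split_values is a hypothetical helper, and values containing wikitext markup or nested commas would need more care):

```python
def split_values(raw):
    """Split a comma-separated Infobox value into trimmed, non-empty items."""
    return [part.strip() for part in raw.split(",") if part.strip()]


instruments = split_values("Saxophone, vocals, keyboards, trumpet, guitar, drums")
print(instruments)
```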
Most media (album, book, film, etc.) cover images on Wikipedia appear in an Infobox. For convenience, we put "cover" files from infoboxes in the image attribute.
>>> page = wptools.page('Blue Train (album)')
>>> page.get_parse()
en.wikipedia.org (parse) Blue Train (album)
en.wikipedia.org (imageinfo) File:John Coltrane - Blue Train.jpg
Blue Train (album) (en) data
{
image: <list(1)> {'kind': 'parse-cover', u'descriptionshorturl':...
infobox: <dict(16)> Name, Language, Artist, Cover, Recorded, Lab...
...
}
>>> page.pageimage()
['parse-cover']
>>> page.pageimage('cover')['url']
u'https://upload.wikimedia.org/wikipedia/en/6/68/John_Coltrane_-_Blue_Train.jpg'
Resolved properties and claims are stored in the wikidata attribute. Wikidata properties are selected by wptools.wikidata.LABELS. Properties (e.g. P17 "country") are stored in data['properties'], and those properties that have Wikidata items for values (e.g. Q142 "France") are stored in data['claims'] and resolved by another Wikidata API call. See the Wikidata page in our wiki for more details.
>>> page = wptools.page('Stephen Fry')
>>> page.get_wikidata()
www.wikidata.org (wikidata) Stephen Fry
www.wikidata.org (claims) Q8817795|Q5|Q7066|Q145
en.wikipedia.org (imageinfo) File:Stephen Fry cropped.jpg
Stephen Fry (en) data
{
aliases: <list(1)> Stephen John Fry
claims: <dict(4)> Q8817795, Q5, Q7066, Q145
description: English comedian, actor, writer, presenter, and activist
image: <list(1)> {'kind': 'wikidata-image', u'descriptionshortur...
label: Stephen Fry
modified: <dict(1)> wikidata
pageid: 191035
properties: <dict(8)> P135, P345, P910, P27, P856, P569, P18, P31
title: Stephen_Fry
what: human
wikibase: Q192912
wikidata: <dict(8)> website, category, citizenship, image, insta...
wikidata_url: https://www.wikidata.org/wiki/Q192912
}
If the predefined wptools.wikidata.LABELS do not include something you want resolved from a claim, you can simply add your property labels via update_labels():
>>> page = wptools.page('Simone de Beauvoir')
>>> page.update_labels({'P21': 'gender'})
>>> page.get_wikidata()
www.wikidata.org (wikidata) Simone de Beauvoir
www.wikidata.org (claims) Q142|Q5|Q3411417|Q859773|Q38066|Q151578...
en.wikipedia.org (imageinfo) File:Simone de Beauvoir.jpg
Simone de Beauvoir (en) data
{
wikidata: <dict(10)> category, death, citizenship, gender, image...
...
}
>>> page.data['wikidata']['gender']
u'female'
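One way to picture the label mechanism is as two dictionary lookups: a label table selects which property IDs to keep, and item IDs are swapped for names where a name is known. The sketch below is a simplified model only. resolve_labels and the shapes of all three dicts are assumptions made for illustration and do not reflect wptools internals.

```python
def resolve_labels(properties, labels, item_names):
    """Map selected property IDs to readable labels and values."""
    resolved = {}
    for prop_id, value in properties.items():
        label = labels.get(prop_id)
        if label is None:
            continue  # property is not selected by the label table
        # Replace an item ID (e.g. "Q142") with its name when one is known.
        resolved[label] = item_names.get(value, value)
    return resolved


labels = {"P27": "citizenship"}  # stand-in for a label table
properties = {"P27": "Q142", "P21": "Q6581072"}
item_names = {"Q142": "France", "Q6581072": "female"}

before = resolve_labels(properties, labels, item_names)
labels.update({"P21": "gender"})  # same idea as update_labels()
after = resolve_labels(properties, labels, item_names)
print(before)
print(after)
```

In this model, a property without a label is simply skipped, which is why adding a label makes the new key appear.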
Simply calling get() on a page will automagically fetch extracts, images, infobox data, wikidata, and other metadata via the MediaWiki, Wikidata, and RESTBase APIs.
>>> page = wptools.page('Gandhi').get()
en.wikipedia.org (query) Gandhi
en.wikipedia.org (parse) 19379
www.wikidata.org (wikidata) Q1001
www.wikidata.org (claims) Q6581097|Q5|Q129286|Q6512732|Q668
en.wikipedia.org (restbase) /page/summary/Mahatma_Gandhi
en.wikipedia.org (imageinfo) File:Portrait Gandhi.jpg|File:MKGandhi.jpg
Mahatma Gandhi (en) data
{
aliases: <list(10)> M K Gandhi, Mohandas Gandhi, Bapu, Gandhi, M...
claims: <dict(5)> Q6581097, Q5, Q129286, Q6512732, Q668
description: <str(67)> pre-eminent leader of Indian nationalism ...
exhtml: <str(1064)> <p>Mahātmā <b>Mohandas Karamchand Gandhi</b>...
exrest: <str(896)> Mahātmā Mohandas Karamchand Gandhi (; Hindust...
extext: <str(2985)> Mahātmā **Mohandas Karamchand Gandhi** (; Hi...
extract: <str(3212)> <p>Mahātmā <b>Mohandas Karamchand Gandhi</b...
image: <list(6)> {'kind': 'query-pageimage', u'descriptionshortu...
infobox: <dict(25)> known_for, other_names, image, signature, bi...
label: Mahatma Gandhi
length: 264,127
links: <list(10)> https://biblio.wiki/wiki/Mohandas_K._Gandhi, h...
modified: <dict(2)> wikidata, page
pageid: 19379
parsetree: <str(333405)> <root><template><title>Redirect</title>...
properties: <dict(8)> P345, P910, P27, P21, P569, P18, P31, P570
random: Pukara (Moquegua)
title: Mahatma_Gandhi
url: https://en.wikipedia.org/wiki/Mahatma_Gandhi
url_raw: https://en.wikipedia.org/wiki/Mahatma_Gandhi?action=raw
watchers: 1,733
what: human
wikibase: Q1001
wikidata: <dict(8)> category, death, citizenship, gender, image,...
wikidata_url: https://www.wikidata.org/wiki/Q1001
wikitext: <str(262663)> {{Redirect|Gandhi}}{{pp-move-indef}}{{pp...
}
You can also call get_more() to get further page data through a more expensive (slower) query, such as page files, categories, languages, contributors, and average daily views:
>>> page.get_more()
en.wikipedia.org (querymore) Gandhi
Mahatma Gandhi (en) data
{
categories: <list(67)> Category:1869 births, Category:1948 death...
contributors: 2,608
files: <list(52)> File:Aum Om red.svg, File:Commons-logo.svg, Fi...
languages: <list(167)> {u'lang': u'af', u'title': u'Mahatma Gand...
modified: <dict(1)> page
pageid: 19379
title: Mahatma Gandhi
views: 21,603
}
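The languages list above appears to hold dicts with lang and title keys. If that shape holds, a lookup table by language code is a one-liner. titles_by_language is a hypothetical helper and the sample entries are placeholders, not real query results.

```python
def titles_by_language(languages):
    """Build a {lang: title} dict, skipping entries missing either key."""
    return {
        entry["lang"]: entry["title"]
        for entry in languages
        if "lang" in entry and "title" in entry
    }


# Illustrative stand-in for page.data['languages'].
languages = [
    {"lang": "af", "title": "Example title (af)"},
    {"lang": "de", "title": "Example title (de)"},
    {"lang": "xx"},  # incomplete entry, ignored
]

by_lang = titles_by_language(languages)
print(sorted(by_lang))
```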