Steve edited this page Sep 22, 2017 · 38 revisions

Get an article extract

The get_query() method fetches article extracts in two forms: lightweight HTML (extract) and Markdown-style text (extext).

>>> page = wptools.page('Ella Fitzgerald')
>>> page.get_query()
en.wikipedia.org (query) Ella Fitzgerald
en.wikipedia.org (imageinfo) File:Ella Fitzgerald (Gottlieb 02871...
Ella Fitzgerald (en) data
{
  extext: <str(2002)> **Ella Jane Fitzgerald** (April 25, 1917J...
  extract: <str(2067)> <p><b>Ella Jane Fitzgerald</b> (April 25, 1...
  ...
}

Compare with the RESTBase extracts, which come as HTML (exhtml) and plain text (exrest):

>>> page.get_restbase('summary')
en.wikipedia.org (restbase) /page/summary/Ella Fitzgerald
Ella Fitzgerald (en) data
{
  exhtml: <str(1455)> <p><b>Ella Jane Fitzgerald</b> (April 25, 19...
  exrest: <str(1424)> Ella Jane Fitzgerald (April 25, 1917June ...
  ...
}
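
With four extract keys available (extract, extext, exhtml, exrest), it can help to settle on one plain-text value. The following is a minimal sketch, not part of wptools: plain_extract() and strip_tags() are hypothetical helpers, and the sample dict is invented to resemble the truncated output above.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect text nodes and drop all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of an HTML fragment."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts).strip()


def plain_extract(data):
    """Pick a plain-text extract from a page.data-like dict.

    Prefers exrest (already plain text), then falls back to
    stripping tags from exhtml or extract. Returns None if no
    extract key is present.
    """
    if data.get("exrest"):
        return data["exrest"]
    for key in ("exhtml", "extract"):
        if data.get(key):
            return strip_tags(data[key])
    return None


# Hypothetical sample, not real API output.
sample = {"extract": "<p><b>Ella Jane Fitzgerald</b> was an American jazz singer.</p>"}
print(plain_extract(sample))
```

In a live session the same helper would presumably be called as plain_extract(page.data) after get_query() or get_restbase('summary').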

Get a representative image

A representative image for a page can come from the MediaWiki API, from an Infobox, from Wikidata Property:P18, or from RESTBase. Each image found is stored in data['image'] with a 'kind' label that names its source. See the Images wiki page for details.

>>> page = wptools.page('Frida Kahlo')
>>> page.get_query()
en.wikipedia.org (query) Frida Kahlo
en.wikipedia.org (imageinfo) File:Frida Kahlo, by Guillermo Kahlo...
Frida Kahlo (en) data
{
  image: <list(2)> {'kind': 'query-pageimage', u'descriptionshortu...
  ...
}
>>> page.pageimage()
['query-pageimage', 'query-thumbnail']
>>> page.pageimage('page')['url']
u'https://upload.wikimedia.org/wikipedia/commons/0/06/Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg'
>>> page.pageimage('thumb')['url']
u'https://upload.wikimedia.org/wikipedia/commons/thumb/0/06/Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg/160px-Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg'
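
The session above suggests that pageimage() with no argument lists the available kinds, and that a token such as 'page' or 'thumb' selects the entry whose kind contains it. Here is a minimal sketch of that selection over a plain list of dicts. image_kinds() and pick_image() are hypothetical helpers, and substring matching is an assumption about the behavior, not wptools' actual code.

```python
def image_kinds(images):
    """List the 'kind' label of every image entry."""
    return [img["kind"] for img in images]


def pick_image(images, token):
    """Return the first image whose kind contains token, else None."""
    for img in images:
        if token in img.get("kind", ""):
            return img
    return None


# Entries reduced to the two fields used here; URLs copied from the session above.
images = [
    {"kind": "query-pageimage",
     "url": "https://upload.wikimedia.org/wikipedia/commons/0/06/"
            "Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg"},
    {"kind": "query-thumbnail",
     "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/0/06/"
            "Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg/"
            "160px-Frida_Kahlo%2C_by_Guillermo_Kahlo.jpg"},
]

print(image_kinds(images))
thumb = pick_image(images, "thumb")
print(thumb["url"] if thumb else "no thumbnail")
```

Returning None for a missing kind lets the caller fall back to another source (for example 'cover' or a Wikidata image) without catching exceptions.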

[Image: Frida Kahlo]

Get page HTML

The most performant way to get article HTML is via RESTBase.

>>> page = wptools.page('Buddha')
>>> page.get_restbase('html')
en.wikipedia.org (restbase) /page/html/Buddha
Buddha (en) data
{
  html: <str(628054)> <!DOCTYPE html><html prefix="dc: http://purl...
}
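
The log line "/page/html/Buddha" corresponds to a path under the Wikimedia REST API. As a rough illustration, the full URL can be assembled as below. restbase_url() is a hypothetical helper, and whether wptools encodes titles in exactly this way is an assumption.

```python
from urllib.parse import quote


def restbase_url(title, endpoint="html", lang="en"):
    """Build a Wikimedia REST API URL such as /page/html/<title>.

    Spaces become underscores, and the rest of the title is
    percent-encoded so that characters like '/' stay inside one
    path segment.
    """
    segment = quote(title.replace(" ", "_"), safe="")
    return "https://{}.wikipedia.org/api/rest_v1/page/{}/{}".format(
        lang, endpoint, segment)


print(restbase_url("Buddha"))
# → https://en.wikipedia.org/api/rest_v1/page/html/Buddha
print(restbase_url("Ella Fitzgerald", "summary"))
# → https://en.wikipedia.org/api/rest_v1/page/summary/Ella_Fitzgerald
```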

Get Infobox data

Getting data from Infoboxes may be unavoidable, but getting Wikidata (via get_wikidata()) is preferred. Wikidata is structured but sometimes data poor, while Infoboxes are unstructured and frequently data rich. If the information you want is only available in a MediaWiki instance, please consider adding it to Wikidata so that others may benefit from open linked data.

>>> page = wptools.page('Fela Kuti')
>>> page.get_parse()
en.wikipedia.org (parse) Fela Kuti
en.wikipedia.org (imageinfo) File:Fela Kuti.jpg
Fela Kuti (en) data
{
  infobox: <dict(17)> website, associated_acts, death_place, image...
  ...
}
>>> page.data['infobox']['instrument']
'Saxophone, vocals, keyboards, trumpet, guitar, drums'
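
Because Infobox values are unstructured strings, a list like the one above has to be split by the caller. Below is a minimal sketch with a hypothetical helper, split_infobox_list(). It only handles plain comma-separated values and makes no attempt to parse wiki markup or templates.

```python
def split_infobox_list(value):
    """Split a comma-separated infobox string into clean items."""
    return [item.strip() for item in value.split(",") if item.strip()]


instrument = "Saxophone, vocals, keyboards, trumpet, guitar, drums"
print(split_infobox_list(instrument))
# → ['Saxophone', 'vocals', 'keyboards', 'trumpet', 'guitar', 'drums']
```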

Get cover images

Most cover images for media (albums, books, films, etc.) on Wikipedia appear in an Infobox. For convenience, get_parse() puts "cover" files from Infoboxes in data['image'] with the kind 'parse-cover'.

>>> page = wptools.page('Blue Train (album)')
>>> page.get_parse()
en.wikipedia.org (parse) Blue Train (album)
en.wikipedia.org (imageinfo) File:John Coltrane - Blue Train.jpg
Blue Train (album) (en) data
{
  image: <list(1)> {'kind': 'parse-cover', u'descriptionshorturl':...
  infobox: <dict(16)> Name, Language, Artist, Cover, Recorded, Lab...
  ...
}
>>> page.pageimage()
['parse-cover']
>>> page.pageimage('cover')['url']
u'https://upload.wikimedia.org/wikipedia/en/6/68/John_Coltrane_-_Blue_Train.jpg'

[Image: Blue Train]

Get Wikidata

Resolved properties and claims are stored in data['wikidata']. The properties to keep are selected by wptools.wikidata.LABELS. Raw properties (e.g. P17 "country") are stored in data['properties']; properties whose values are Wikidata items (e.g. Q142 "France") are also stored in data['claims'] and resolved to labels by a second Wikidata API call. See the Wikidata page in our wiki for more details.

>>> page = wptools.page('Stephen Fry')
>>> page.get_wikidata()
www.wikidata.org (wikidata) Stephen Fry
www.wikidata.org (claims) Q8817795|Q5|Q7066|Q145
en.wikipedia.org (imageinfo) File:Stephen Fry cropped.jpg
Stephen Fry (en) data
{
  aliases: <list(1)> Stephen John Fry
  claims: <dict(4)> Q8817795, Q5, Q7066, Q145
  description: English comedian, actor, writer, presenter, and activist
  image: <list(1)> {'kind': 'wikidata-image', u'descriptionshortur...
  label: Stephen Fry
  modified: <dict(1)> wikidata
  pageid: 191035
  properties: <dict(8)> P135, P345, P910, P27, P856, P569, P18, P31
  title: Stephen_Fry
  what: human
  wikibase: Q192912
  wikidata: <dict(8)> website, category, citizenship, image, insta...
  wikidata_url: https://www.wikidata.org/wiki/Q192912
}
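
The two-step resolution described above (select properties by label, then swap item IDs for their labels) can be sketched with plain dicts. Everything here is illustrative. The dict shapes are assumptions, the label names in LABELS are stand-ins for the real wptools.wikidata.LABELS, "P0" is a made-up ID, and build_wikidata() is a hypothetical helper, not wptools' implementation.

```python
# Hypothetical label map in the spirit of wptools.wikidata.LABELS.
LABELS = {"P27": "citizenship", "P31": "instance", "P569": "birth"}

# Assumed shape: property ID -> raw value (a Q-ID when the value is an item).
properties = {
    "P27": "Q145",
    "P31": "Q5",
    "P569": "+1957-08-24T00:00:00Z",
    "P0": "ignored",  # made-up ID with no entry in LABELS, so it is skipped
}

# Assumed shape: Q-ID -> label, as fetched by the second (claims) request.
resolved = {"Q5": "human", "Q145": "United Kingdom"}


def build_wikidata(properties, labels, resolved):
    """Map wanted properties to readable names and resolved values."""
    out = {}
    for pid, value in properties.items():
        name = labels.get(pid)
        if name is None:
            continue  # property not selected by the label map
        # Item values are swapped for their label; literals pass through.
        out[name] = resolved.get(value, value)
    return out


print(build_wikidata(properties, LABELS, resolved))
```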

Extend Wikidata claims

If the predefined wptools.wikidata.LABELS do not include a property you want resolved from a claim, add your own property labels via update_labels() before calling get_wikidata():

>>> page = wptools.page('Simone de Beauvoir')
>>> page.update_labels({'P21': 'gender'})
>>> page.get_wikidata()
www.wikidata.org (wikidata) Simone de Beauvoir
www.wikidata.org (claims) Q142|Q5|Q3411417|Q859773|Q38066|Q151578...
en.wikipedia.org (imageinfo) File:Simone de Beauvoir.jpg
Simone de Beauvoir (en) data
{
  wikidata: <dict(10)> category, death, citizenship, gender, image...
  ...
}
>>> page.data['wikidata']['gender']
u'female'
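
Conceptually, extending the labels is a dictionary merge. The sketch below shows one way to do it without mutating a shared default map. DEFAULT_LABELS is a small stand-in for the real wptools.wikidata.LABELS, and with_extra_labels() is a hypothetical helper. Whether update_labels() itself copies or mutates is not stated here.

```python
# Stand-in defaults; the real wptools.wikidata.LABELS is larger.
DEFAULT_LABELS = {"P27": "citizenship", "P18": "image"}


def with_extra_labels(defaults, extra):
    """Return a new label map with user-supplied properties added.

    The defaults are copied, so repeated calls do not leak one
    page's additions into another.
    """
    merged = dict(defaults)
    merged.update(extra)
    return merged


labels = with_extra_labels(DEFAULT_LABELS, {"P21": "gender"})
print(sorted(labels))
# → ['P18', 'P21', 'P27']
```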

Get all the page info

Simply calling get() on a page will automagically fetch extracts, images, infobox data, wikidata, and other metadata via the MediaWiki, Wikidata, and RESTBase APIs.

>>> page = wptools.page('Gandhi').get()
en.wikipedia.org (query) Gandhi
en.wikipedia.org (parse) 19379
www.wikidata.org (wikidata) Q1001
www.wikidata.org (claims) Q6581097|Q5|Q129286|Q6512732|Q668
en.wikipedia.org (restbase) /page/summary/Mahatma_Gandhi
en.wikipedia.org (imageinfo) File:Portrait Gandhi.jpg|File:MKGandhi.jpg
Mahatma Gandhi (en) data
{
  aliases: <list(10)> M K Gandhi, Mohandas Gandhi, Bapu, Gandhi, M...
  claims: <dict(5)> Q6581097, Q5, Q129286, Q6512732, Q668
  description: <str(67)> pre-eminent leader of Indian nationalism ...
  exhtml: <str(1064)> <p>Mahātmā <b>Mohandas Karamchand Gandhi</b>...
  exrest: <str(896)> Mahātmā Mohandas Karamchand Gandhi (; Hindust...
  extext: <str(2985)> Mahātmā **Mohandas Karamchand Gandhi** (; Hi...
  extract: <str(3212)> <p>Mahātmā <b>Mohandas Karamchand Gandhi</b...
  image: <list(6)> {'kind': 'query-pageimage', u'descriptionshortu...
  infobox: <dict(25)> known_for, other_names, image, signature, bi...
  label: Mahatma Gandhi
  length: 264,127
  links: <list(10)> https://biblio.wiki/wiki/Mohandas_K._Gandhi, h...
  modified: <dict(2)> wikidata, page
  pageid: 19379
  parsetree: <str(333405)> <root><template><title>Redirect</title>...
  properties: <dict(8)> P345, P910, P27, P21, P569, P18, P31, P570
  random: Pukara (Moquegua)
  title: Mahatma_Gandhi
  url: https://en.wikipedia.org/wiki/Mahatma_Gandhi
  url_raw: https://en.wikipedia.org/wiki/Mahatma_Gandhi?action=raw
  watchers: 1,733
  what: human
  wikibase: Q1001
  wikidata: <dict(8)> category, death, citizenship, gender, image,...
  wikidata_url: https://www.wikidata.org/wiki/Q1001
  wikitext: <str(262663)> {{Redirect|Gandhi}}{{pp-move-indef}}{{pp...
}
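
The listings on this page abbreviate long values as <str(N)>, <list(N)>, or <dict(N)> followed by a truncated preview. To print your own dicts in a similar style, something like the following works. summarize() is a hypothetical helper written for this page, and its width and truncation rules are assumptions that only approximate the real display.

```python
def summarize(value, width=60):
    """Render one value roughly in the style of the listings above."""
    if isinstance(value, str):
        # Short strings are shown as-is; long ones get a length tag.
        text = value if len(value) <= width else "<str({})> {}".format(
            len(value), value)
    elif isinstance(value, (list, dict)):
        # Iterating a dict yields its keys, matching the previews above.
        text = "<{}({})> {}".format(
            type(value).__name__, len(value),
            ", ".join(str(item) for item in value))
    else:
        text = str(value)
    return text if len(text) <= width else text[:width - 3] + "..."


# Keys and IDs copied from the listing above; the claim values are unknown
# here, so they are left as None.
data = {
    "label": "Mahatma Gandhi",
    "pageid": 19379,
    "claims": dict.fromkeys(["Q6581097", "Q5", "Q129286", "Q6512732", "Q668"]),
}
for key in sorted(data):
    print("  {}: {}".format(key, summarize(data[key])))
```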

You can also call get_more() to fetch additional page data through a more expensive (slower) query: page files, categories, languages, contributors, and average daily views.

>>> page.get_more()
en.wikipedia.org (querymore) Gandhi
Mahatma Gandhi (en) data
{
  categories: <list(67)> Category:1869 births, Category:1948 death...
  contributors: 2,608
  files: <list(52)> File:Aum Om red.svg, File:Commons-logo.svg, Fi...
  languages: <list(167)> {u'lang': u'af', u'title': u'Mahatma Gand...
  modified: <dict(1)> page
  pageid: 19379
  title: Mahatma Gandhi
  views: 21,603
}
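
Category and file titles come back with their namespace prefix. If you only want the bare names, a small helper is enough. strip_namespace() below is a hypothetical sketch, and the sample titles are the untruncated ones from the output above.

```python
def strip_namespace(title):
    """Drop a leading 'Category:' or 'File:' prefix, if present."""
    prefix, sep, rest = title.partition(":")
    return rest if sep and prefix in ("Category", "File") else title


categories = ["Category:1869 births"]
files = ["File:Aum Om red.svg", "File:Commons-logo.svg"]

print([strip_namespace(t) for t in categories])
# → ['1869 births']
print([strip_namespace(t) for t in files])
# → ['Aum Om red.svg', 'Commons-logo.svg']
```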