alexz-enwp edited this page Jan 2, 2015 · 1 revision

pagelist.py is a module with some convenience functions for generating Page objects from API query results or other lists of pages. In most cases, the functions handle existence checking and return Page, Category, or File objects depending on the namespace of each page in the list.

##Static functions

###listFromQuery

listFromQuery(site, queryresult)

Generates a list of Page, Category, and/or File objects from an API query result. site is a Wiki object and queryresult is the list or dict that contains the page information. Where that list or dict sits within the result depends on the type of query:

  • For "list=categorymembers", use result['query']['categorymembers']
  • For "prop=linkshere", use result['query']['pages'][pageid]['linkshere']
  • For "generator=linkshere", use result['query']['pages']
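The three cases above can be sketched with plain dicts. The sample results below are made up for illustration and are only shaped like MediaWiki JSON responses, where the keys under 'pages' are page IDs as strings. The commented call at the end assumes wikitools is installed and that a Wiki object named site exists.

```python
# Sample results shaped like MediaWiki API JSON responses (all values are made up).
list_result = {'query': {'categorymembers': [
    {'pageid': 11, 'ns': 0, 'title': 'Example'},
    {'pageid': 12, 'ns': 14, 'title': 'Category:Examples'},
]}}
prop_result = {'query': {'pages': {'11': {
    'pageid': 11, 'ns': 0, 'title': 'Example',
    'linkshere': [{'pageid': 13, 'ns': 6, 'title': 'File:Example.png'}],
}}}}
generator_result = {'query': {'pages': {
    '13': {'pageid': 13, 'ns': 6, 'title': 'File:Example.png'},
}}}

# list=categorymembers: a list of page dicts directly under 'query'
from_list = list_result['query']['categorymembers']
# prop=linkshere: the list hangs off one specific page entry
from_prop = prop_result['query']['pages']['11']['linkshere']
# generator=linkshere: the generated pages are the 'pages' dict itself
from_generator = generator_result['query']['pages']

# With wikitools available, each value would then be passed along, e.g.:
#   pages = pagelist.listFromQuery(site, from_list)
```

Note that the first two cases produce a list while the generator case produces a dict, which is why queryresult is described as "the list or dict".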

###listFromTextList

listFromTextList(site, sequence, datatype, check=True, followRedir=False)

Generates a list of Page, Category, and/or File objects from a list (or similar iterable object) of one of the following datatypes:

  • "titles" - Titles, with namespace prefixes
  • "pageids" - Pageid values
  • "dbkeys" - (ns, title) pairs (an int and a string), such as what might be retrieved from a database query, where the namespace number and title are generally stored separately. Individual entries can contain more than the ns and title, as long as those are the first two items.
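To make the three datatypes concrete, here is a small sketch of what each input sequence might look like. The rows, page IDs, and the row_to_title helper are hypothetical and exist only for this illustration. The namespace numbers are those of a default MediaWiki install. The commented calls assume wikitools is installed and a Wiki object named site exists.

```python
# Hypothetical rows, e.g. from a query selecting page_namespace, page_title, page_len.
rows = [
    (0, 'Main_Page', 5120),
    (14, 'Living_people', 310),
    (6, 'Example.png', 0),
]

# Subset of the namespace numbers used by a default MediaWiki install.
NAMESPACE_PREFIXES = {0: '', 6: 'File:', 14: 'Category:'}

def row_to_title(row):
    """Build the prefixed title that corresponds to a (ns, title, ...) row."""
    ns, dbtitle = row[0], row[1]  # anything past the first two items is ignored
    return NAMESPACE_PREFIXES[ns] + dbtitle.replace('_', ' ')

titles = [row_to_title(r) for r in rows]  # the same pages as "titles" input
pageids = [101, 102, 103]                 # made-up page IDs

# With wikitools available, the three equivalent styles of call would be:
#   pagelist.listFromTextList(site, rows, 'dbkeys')
#   pagelist.listFromTextList(site, titles, 'titles')
#   pagelist.listFromTextList(site, pageids, 'pageids')
```

The rows are passed to the "dbkeys" form unchanged. The conversion to titles is shown only to make the relationship between the two datatypes visible.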

If check is True, it will use the API for existence checking. followRedir has the same meaning as in Page. The existence checks are done in batches, which is far faster than checking each item individually. In one test, listFromTextList on a list of 9,000 titles took 27 seconds (5 seconds when logged in to an account with the apihighlimits right), while a list comprehension that created and checked Page objects one at a time took over 20 minutes (see Examples for the code).

Note that if check is False, the objects returned will all be Page objects, regardless of namespace.
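The effect of batching on the number of API requests can be sketched with a simple chunking helper. This is an illustration only, not the code wikitools uses. The batch sizes of 50 and 500 are the MediaWiki API's usual per-request title limits for ordinary accounts and for accounts with apihighlimits. Whether pagelist uses exactly these sizes is an assumption.

```python
def batches(items, size):
    """Yield consecutive slices of items, each with at most size entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Placeholder titles standing in for the 9,000 titles mentioned above.
titles = ['Title %d' % n for n in range(9000)]

# One request per title versus one request per batch.
individual_requests = len(titles)
normal_requests = sum(1 for _ in batches(titles, 50))
highlimit_requests = sum(1 for _ in batches(titles, 500))

print(individual_requests, normal_requests, highlimit_requests)
```

Under these assumptions, 9,000 individual checks shrink to 180 requests, or 18 with the higher limit, which accounts for most of the difference in running time.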

###listFromTitles

listFromTitles(site, titles, check=True, followRedir=False)

A wrapper function around listFromTextList for datatype='titles'. Kept mostly for backwards compatibility.

###listFromPageids

listFromPageids(site, pageids, check=True, followRedir=False)

A wrapper function around listFromTextList for datatype='pageids'. Kept mostly for backwards compatibility.

###listFromDbKeys

listFromDbKeys(site, keys, check=True, followRedir=False)

A wrapper function around listFromTextList for datatype='dbkeys'. Kept mostly for completeness.

###makePage

makePage(result, site, followRedir)

Used internally by the other functions in pagelist to make a Page, Category, or File object from an API query result, but it can also be useful on its own. result is a dict with at least 'title' and 'ns' keys. site is the Wiki object, and followRedir is passed to the Page constructor.
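The namespace-based choice of object type can be sketched as follows. This is a minimal illustration and not the actual makePage code. The kind_for helper is hypothetical and returns the class name as a string, so that it runs without wikitools. The namespace numbers 6 (File) and 14 (Category) are those of a default MediaWiki install.

```python
NS_FILE = 6
NS_CATEGORY = 14

def kind_for(result):
    """Return the name of the class that suits a result dict with an 'ns' key."""
    ns = result['ns']
    if ns == NS_CATEGORY:
        return 'Category'
    if ns == NS_FILE:
        return 'File'
    return 'Page'  # every other namespace gets a plain Page

results = [
    {'title': 'Example', 'ns': 0},
    {'title': 'Category:Examples', 'ns': 14},
    {'title': 'File:Example.png', 'ns': 6},
]
kinds = [kind_for(r) for r in results]
```

With wikitools installed, the equivalent real call for one entry would be makePage(results[0], site, False).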

##Examples

Code comparing pagelist.listFromTextList to a list comprehension calling the Page constructor:

from wikitools import wiki, page, pagelist
import time

site = wiki.Wiki('https://en.wikipedia.org/w/api.php')
site.setMaxlag(60)
# Read the system word list and close the file when done
with open('/usr/share/dict/words', 'r') as f:
    titles = [t.strip() for t in f]
# Keep every 10th word from the first 90,000 lines
titles = titles[0:90000:10]

# Create and check each Page object individually
print("Starting individual")
start = time.time()
res = [page.Page(site, title, check=True, followRedir=False) for title in titles]
total = time.time() - start
print(total)
del res

# Create and check all objects in batches
print("Starting pagelist")
start = time.time()
res = pagelist.listFromTextList(site, sequence=titles, datatype='titles')
total = time.time() - start
print(total)

Result:

Starting individual
1491.6872305870056
Starting pagelist
27.80055832862854
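As a rough check on these figures, the two timings can be compared directly. This is plain arithmetic on the printed results. The count of 9,000 titles is taken from the description of listFromTextList and assumes the word list had at least 90,000 lines.

```python
individual_seconds = 1491.6872305870056
batched_seconds = 27.80055832862854
title_count = 9000  # assumed, per the listFromTextList description

speedup = individual_seconds / batched_seconds
per_title_individual = individual_seconds / title_count
per_title_batched = batched_seconds / title_count

print("speedup: %.1fx" % speedup)
print("per title: %.4f s individually, %.4f s batched"
      % (per_title_individual, per_title_batched))
```

The batched run comes out a little over 50 times faster than the individual checks in this test.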