pagelist
pagelist.py is a module with some convenience functions that generate Page objects from API query results or other lists of pages. In most cases, the functions handle existence checking and return Page, Category, or File objects depending on the namespace of each page in the list.
###listFromQuery
listFromQuery(site, queryresult)
Generates a list of Page, Category, and/or File objects from an API query result. `site` is a Wiki object and `queryresult` is the list or dict that contains the list of page information. This will be slightly different depending on the type of query.
- For "list=categorymembers", use `result['query']['categorymembers']`
- For "prop=linkshere", use `result['query']['pages'][pageid]['linkshere']`
- For "generator=linkshere", use `result['query']['pages']`
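The three cases above can be sketched with hand-written dicts that mimic the shape of the API responses. The sample page IDs and titles are made up for illustration, and real responses carry more keys. The call to `pagelist.listFromQuery` appears only as a comment so that the sketch runs without wikitools or network access.

```python
# Hand-written stand-ins for API responses (illustrative data only).
categorymembers_result = {
    "query": {
        "categorymembers": [
            {"pageid": 101, "ns": 0, "title": "Example article"},
            {"pageid": 102, "ns": 14, "title": "Category:Example subcategory"},
        ]
    }
}

linkshere_prop_result = {
    "query": {
        "pages": {
            "201": {
                "pageid": 201,
                "ns": 0,
                "title": "Target page",
                "linkshere": [{"pageid": 301, "ns": 0, "title": "Linking page"}],
            }
        }
    }
}

linkshere_generator_result = {
    "query": {
        "pages": {
            "301": {"pageid": 301, "ns": 0, "title": "Linking page"},
            "302": {"pageid": 302, "ns": 6, "title": "File:Linking file.png"},
        }
    }
}

# list=...: the page entries are a list stored under the list module's name.
from_list = categorymembers_result["query"]["categorymembers"]

# prop=...: the entries hang off the page they were requested for.
pageid = "201"
from_prop = linkshere_prop_result["query"]["pages"][pageid]["linkshere"]

# generator=...: the entries are the pages dict itself, keyed by pageid.
from_generator = linkshere_generator_result["query"]["pages"]

for queryresult in (from_list, from_prop, from_generator):
    # With wikitools available: pagelist.listFromQuery(site, queryresult)
    print(type(queryresult).__name__, len(queryresult))
```

Note that the first two cases yield a list while the generator case yields a dict, which is why `queryresult` is described as "the list or dict".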
###listFromTextList
listFromTextList(site, sequence, datatype, check=True, followRedir=False)
Generates a list of Page, Category, and/or File objects from a list (or similar iterable object) of one of the following `datatype` values:
- "titles" - Titles, with namespace prefixes
- "pageids" - Pageid values
- "dbkeys" - (ns, title) pairs (an int and a string), such as what might be retrieved from a database query where the namespace number and title are generally stored separately. Note that individual entries can contain more than the ns and title, as long as those are the first 2 items.
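As a rough illustration of the "dbkeys" shape, the sketch below builds rows from an in-memory SQLite table. The table is a hypothetical three-column stand-in for a wiki database (the column names follow MediaWiki's `page` table, the rows are invented). Each row carries an extra `page_len` value after the namespace and title, which is the situation the note on extra items describes.

```python
import sqlite3

# In-memory stand-in for a wiki database (illustrative rows only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page (page_namespace INTEGER, page_title TEXT, page_len INTEGER)"
)
conn.executemany(
    "INSERT INTO page VALUES (?, ?, ?)",
    [(0, "Main_Page", 5120), (14, "Physics", 880), (6, "Example.png", 312)],
)

# Each row is (ns, title, page_len): the namespace number and the title
# come first, followed by one extra column.
rows = conn.execute(
    "SELECT page_namespace, page_title, page_len FROM page ORDER BY page_len DESC"
).fetchall()
conn.close()

# Only the first two items of each entry form the (ns, title) pair.
pairs = [(ns, title) for ns, title, *extra in rows]
print(pairs)

# For comparison, the same pages written in the "titles" shape,
# with namespace prefixes.
titles = ["Main Page", "Category:Physics", "File:Example.png"]
```

With wikitools available, `rows` could be passed as `sequence` together with `datatype='dbkeys'`.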
If `check` is True, it will use the API for existence checking. `followRedir` has the same meaning as in Page. The existence checks are done in batches, so they are far faster than individual checks on each item. Using listFromTextList on a list of 9,000 titles took 27 seconds (5 seconds when logged into an account with the apihighlimits right), whereas using a list comprehension to create and check Page objects individually took almost 25 minutes (see Examples for code).
Note that if `check` is False, the objects returned will all be Page objects, regardless of namespace.
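To see why batching matters, the sketch below counts how many API requests 9,000 titles would need at different batch sizes. The `batches` helper is hypothetical and is not wikitools code. The batch sizes of 50 and 500 are the MediaWiki API's usual per-request limits for multi-value parameters (standard accounts and accounts with apihighlimits, respectively); that wikitools uses exactly these sizes is an assumption here.

```python
def batches(sequence, size):
    """Yield consecutive slices of at most `size` items (hypothetical helper)."""
    items = list(sequence)
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder titles, the same count as in the timing described above.
titles = ["Title %d" % i for i in range(9000)]

# One request per title when every object is checked individually.
requests_individual = len(titles)

# One request per batch when checks are grouped (assumed batch sizes).
requests_standard = sum(1 for _ in batches(titles, 50))
requests_highlimits = sum(1 for _ in batches(titles, 500))

print(requests_individual, requests_standard, requests_highlimits)
```

Under these assumptions, the number of round trips drops from thousands to a few hundred or fewer, which is consistent with the gap between the individual and batched timings.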
###listFromTitles
listFromTitles(site, titles, check=True, followRedir=False)
A wrapper function around listFromTextList for `datatype='titles'`. Kept mostly for backwards compatibility.
###listFromPageids
listFromPageids(site, pageids, check=True, followRedir=False)
A wrapper function around listFromTextList for `datatype='pageids'`. Kept mostly for backwards compatibility.
###listFromDbKeys
listFromDbKeys(site, keys, check=True, followRedir=False)
A wrapper function around listFromTextList for `datatype='dbkeys'`. Kept mostly for completeness.
###makePage
makePage(result, site, followRedir)
Used internally by the other functions in pagelist to make a Page, Category, or File object from an API query result, but is potentially useful enough to be used on its own. `result` is a dict with at minimum 'title' and 'ns' keys. `site` is the Wiki object and `followRedir` will be passed to the Page constructor.
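The namespace-based choice of class can be sketched as follows. This is not the wikitools implementation: the three classes are minimal stand-ins and `make_page_sketch` is a hypothetical name. It relies only on the standard MediaWiki namespace numbers for files (6) and categories (14).

```python
class Page:
    """Minimal stand-in for a page object (not the wikitools class)."""

    def __init__(self, site, title, followRedir=False):
        self.site = site
        self.title = title
        self.followRedir = followRedir


class Category(Page):
    """Stand-in for a category object."""


class File(Page):
    """Stand-in for a file object."""


NS_FILE = 6       # standard MediaWiki "File" namespace
NS_CATEGORY = 14  # standard MediaWiki "Category" namespace


def make_page_sketch(result, site, followRedir):
    """Pick a class from result['ns'] and build it from result['title']."""
    ns = result["ns"]
    if ns == NS_CATEGORY:
        cls = Category
    elif ns == NS_FILE:
        cls = File
    else:
        cls = Page
    return cls(site, result["title"], followRedir=followRedir)


entries = [
    {"title": "Example article", "ns": 0},
    {"title": "Category:Physics", "ns": 14},
    {"title": "File:Example.png", "ns": 6},
]
objects = [make_page_sketch(entry, site=None, followRedir=False) for entry in entries]
print([type(obj).__name__ for obj in objects])
```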
##Examples
Code comparing pagelist.listFromTextList to a list comprehension calling the Page constructor:
from wikitools import wiki, page, pagelist
import time
site = wiki.Wiki('https://en.wikipedia.org/w/api.php')
site.setMaxlag(60)
with open('/usr/share/dict/words', 'r') as f:
    titles = [t.strip() for t in f]
titles = titles[0:90000:10]  # every 10th word of the first 90,000, i.e. up to 9,000 titles
print("Starting individual")
start = time.time()
res = [page.Page(site, title, check=True, followRedir=False) for title in titles]
total = time.time() - start
print(total)
del res
print("Starting pagelist")
start = time.time()
res = pagelist.listFromTextList(site, sequence=titles, datatype='titles')
total = time.time() - start
print(total)
Result:
Starting individual
1491.6872305870056
Starting pagelist
27.80055832862854