---
layout: default
---
A script to count votes for the Wikisource anniversary contest.
```
usage: score.py [-h] [--booklist-cache BOOKLIST_CACHE] [--cache CACHE_FILE]
                [--config CONFIG_FILE] [-d] [--enable-cache] [-f BOOKS_FILE]
                [-o OUTPUT_TSV] [-v]

Count proofread and validated pages for the Wikisource contest.

optional arguments:
  -h, --help            show this help message and exit
  --booklist-cache BOOKLIST_CACHE
                        JSON file to read and store the booklist cache
                        (default: {BOOKS_FILE}.booklist_cache.json)
  --cache CACHE_FILE    JSON file to read and store the cache
                        (default: {BOOKS_FILE}.cache.json)
  --config CONFIG_FILE  INI file to read configs (default: contest.conf.ini)
  -d                    Enable debug output (implies -v)
  --enable-cache        Enable caching
  -f BOOKS_FILE         TSV file with the books to be processed
                        (default: books.tsv)
  -o OUTPUT_TSV         Output file (default: {BOOKS_FILE}.results.tsv)
  -v                    Enable verbose output
```
The script expects to read two files:

- `books.tsv`
- `contest.conf.ini`

### books.tsv

A list of books, one per line. The number of pages for each book is requested from the API and the response is cached. Here's a sample:
```
# List of books participating in the Wikisource anniversary contest of 2015.
#
# FORMAT
# book_name
#
# Empty lines or lines starting with "#" are ignored.
"Bandello - Novelle, Laterza 1910, I.djvu"
```
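A list in this format can be parsed with a few lines of Python (a sketch; `parse_booklist` is a hypothetical helper, not part of score.py):

```python
def parse_booklist(lines):
    """Return the book names from a books.tsv-style listing,
    skipping empty lines and '#' comments and stripping quotes."""
    books = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        books.append(line.strip('"'))
    return books
```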
### contest.conf.ini

An INI-like configuration file. It contains the language of the Wikisource, and the start and end dates of the contest.
```ini
# Configuration file for the wscontest-votecounter script
# Wikisource anniversary contest of 2015.

[contest]
language = it

# Dates are in the format yyyy-mm-dd
start_date = 2015-11-24
end_date = 2015-12-08
```
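The `[contest]` section can be read with the standard `configparser` module. A minimal sketch (it uses `read_string` to stay self-contained, whereas the script reads the file from disk):

```python
import configparser
from datetime import datetime

CONF = """
[contest]
language = it
start_date = 2015-11-24
end_date = 2015-12-08
"""

config = configparser.ConfigParser()
config.read_string(CONF)

language = config["contest"]["language"]
start = datetime.strptime(config["contest"]["start_date"], "%Y-%m-%d")
end = datetime.strptime(config["contest"]["end_date"], "%Y-%m-%d")
print(language, (end - start).days)  # → it 14
```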
The script queries the Wikisource API and counts the number of pages that have been proofread by a user. Results for every single book are cached in a JSON file, by default `{BOOKS_FILE}.cache.json`.

Keep in mind that caching slows down the script, because the cache is continuously read and written; for this reason caching is optional and can be enabled with the `--enable-cache` option.
To empty the cache, delete the cache file. You can also remove individual books: use the `jq` utility to pretty-print the file and then modify it with any text editor, or use an on-line tool such as JSONlint.
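Deleting one book's entry can also be done programmatically (a minimal sketch; `remove_book` is a hypothetical helper, and it assumes the cache layout shown below, with a top-level `CACHE_BOOKS_LIST` mapping plus one key per book):

```python
import json

def remove_book(cache_file, book):
    """Drop a single book's entry from the JSON cache, in place."""
    with open(cache_file) as fh:
        cache = json.load(fh)
    cache.pop(book, None)                              # the per-book results
    cache.get("CACHE_BOOKS_LIST", {}).pop(book, None)  # its cached page count
    with open(cache_file, "w") as fh:
        json.dump(cache, fh, indent=4)
```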
Here's an example of cached results for a book:
```json
{
    "CACHE_BOOKS_LIST": {
        "Fineo - Il rimedio infallibile.djvu": 90,
        "Racconti sardi.djvu": 168,
        "Slataper - Il mio carso, 1912.djvu": 124
    },
    "Slataper - Il mio carso, 1912.djvu": {
        "72": {
            "query": {
                "normalized": [
                    {
                        "from": "Page:Slataper - Il mio carso, 1912.djvu/72",
                        "to": "Pagina:Slataper - Il mio carso, 1912.djvu/72"
                    }
                ],
                "pages": {
                    "412498": {
                        "title": "Pagina:Slataper - Il mio carso, 1912.djvu/72",
                        "ns": 108,
                        "revisions": [
                            {
                                "contentformat": "text/x-wiki",
                                "contentmodel": "proofread-page",
                                "timestamp": "2015-12-04T13:32:34Z",
                                "user": "Robybulga",
                                "*": "..."
                            },
                            {
                                "contentformat": "text/x-wiki",
                                "contentmodel": "proofread-page",
                                "timestamp": "2015-11-25T07:50:05Z",
                                "user": "Stefano mariucci",
                                "*": "..."
                            },
                            {
                                "contentformat": "text/x-wiki",
                                "contentmodel": "proofread-page",
                                "timestamp": "2015-09-02T18:33:12Z",
                                "user": "Phe-bot",
                                "*": "..."
                            }
                        ],
                        "pageid": 412498
                    }
                }
            },
            "batchcomplete": ""
        }
    }
}
```
Results are written in TSV format to `results.tsv`. Activating the `--html` flag you can also produce an HTML version of the output, `index.html`. The HTML output uses a template, `index.template.html`, that expects to find a `{{{rows}}}` token to indicate where results will be written.
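The substitution is a plain token replacement; a minimal sketch of how such a template could be filled (the `render_html` helper and the row layout are hypothetical, not the script's actual code):

```python
def render_html(template, rows):
    """Replace the {{{rows}}} token with rows rendered as HTML table rows."""
    html_rows = "\n".join(
        "<tr>" + "".join("<td>{}</td>".format(cell) for cell in row) + "</tr>"
        for row in rows
    )
    return template.replace("{{{rows}}}", html_rows)

template = "<table>{{{rows}}}</table>"
print(render_html(template, [("user", 10, 5)]))
# → <table><tr><td>user</td><td>10</td><td>5</td></tr></table>
```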
If you need to produce a Wikitable from the TSV output, you can use one of these tools:

- CSV to Wikitable
- Excel 2 Wiki (if you open the TSV with a spreadsheet application)
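The conversion can also be scripted; a minimal sketch (the column names in the example are made up, the actual output columns may differ):

```python
import csv
import io

def tsv_to_wikitable(tsv_text):
    """Convert TSV text to MediaWiki table markup.
    The first row is treated as the header."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    lines = ['{| class="wikitable"']
    lines.append("! " + " !! ".join(rows[0]))
    for row in rows[1:]:
        lines.append("|-")
        lines.append("| " + " || ".join(row))
    lines.append("|}")
    return "\n".join(lines)
```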
This script uses Python 3; it has been tested with Python 3.4.3, and it requires only modules that are part of the Python 3 standard library.
You can install the additional Python module `yajl` (GitHub repo) for faster reading and writing of JSON. You can install it using `pip` with the following command:

```
pip install -r requirements.txt
```

This has also been tested to work in a virtualenv.
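A common pattern to benefit from yajl when it is installed, while keeping the standard library as a fallback (a sketch, not necessarily how the script itself does it):

```python
try:
    import yajl as json  # C-backed, faster JSON parsing, if available
except ImportError:
    import json          # standard library fallback

data = json.loads('{"Racconti sardi.djvu": 168}')
print(data["Racconti sardi.djvu"])  # → 168
```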
You can use the script `count_votes.sh` to process books in parallel. The script assumes you have GNU parallel installed on your system. Furthermore, the script assumes that the list of books exists in a file named `books.tsv`.
First, we split the list of books into several files (say `books01_sublist.tsv`, `books02_sublist.tsv`, `books03_sublist.tsv`, `books04_sublist.tsv`) to process them in parallel. The splitting of the original list of books is obtained with the following commands:
```
$ cat books.tsv | grep -v -e "#" | sort | sed '/^$/d' > united
$ split -l 11 --numeric-suffixes=01 --additional-suffix="_sublist.tsv" united books
```
The first line creates a list of books, removing empty lines and lines starting with `#`, and saves it to a file named `united`. The second line splits the content of `united` into smaller files named `books01_sublist.tsv`, `books02_sublist.tsv`, etc., with up to 11 lines per file (the `-l 11` option).
If you split the files by hand, a way to check that the original list (`books.tsv`) and the new lists contain the same books is the following:
```
$ cat books.tsv | grep -v -e "#" | sort | sed '/^$/d' > united
$ cat books*_sublist.tsv | grep -v -e "#" | sort | sed '/^$/d' > separated
$ diff united separated
```
The first line creates a list of books, removing empty lines and lines starting with `#`, and saves it to a file named `united`. The second line does the same for `books01_sublist.tsv`, ..., `books04_sublist.tsv` and saves the result in a file named `separated`. The third line compares the two results. If you split the books into the new lists correctly, you should see no difference.
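The same check can be done in Python (a sketch; `load_books` is a hypothetical helper mirroring the grep/sed pipeline above):

```python
import glob

def load_books(path):
    """Read a book list, dropping empty lines and '#' comments; sorted."""
    with open(path) as fh:
        return sorted(
            line.strip() for line in fh
            if line.strip() and not line.startswith("#")
        )

# united = load_books("books.tsv")
# separated = sorted(b for p in glob.glob("books*_sublist.tsv")
#                    for b in load_books(p))
# assert united == separated, "the sublists do not match books.tsv"
```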
You can launch the script on the different input files with the following command (analogously for `books02_sublist.tsv`, `books03_sublist.tsv`, `books04_sublist.tsv`):

```
python score.py -f books01_sublist.tsv
```
For best performance you should split the list in a balanced way with respect to the number of pages to process.
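One way to balance the sublists is a greedy partition over the cached page counts (a sketch; it assumes a mapping from book name to number of pages, like the `CACHE_BOOKS_LIST` object shown earlier, and `balanced_split` is a hypothetical helper):

```python
def balanced_split(page_counts, n_lists):
    """Greedily assign books to n_lists sublists, always adding the
    next-largest book to the currently lightest sublist."""
    bins = [{"books": [], "pages": 0} for _ in range(n_lists)]
    for book, pages in sorted(page_counts.items(), key=lambda kv: -kv[1]):
        target = min(bins, key=lambda b: b["pages"])
        target["books"].append(book)
        target["pages"] += pages
    return [b["books"] for b in bins]

counts = {"A.djvu": 90, "B.djvu": 168, "C.djvu": 124, "D.djvu": 60}
print(balanced_split(counts, 2))
# → [['B.djvu', 'D.djvu'], ['C.djvu', 'A.djvu']]
```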
Using GNU parallel we can launch several processes in parallel. Following our example, to process `books01_sublist.tsv`, ..., `books04_sublist.tsv` in parallel:

```
$ seq -w 01 04 | parallel -t --files --results output_dir $(which python3) score.py -v -f books{}_sublist.tsv -o results{}_sublist.tsv
```
The results will be saved in files `results01_sublist.tsv`, ..., `results04_sublist.tsv`.

You can check the progress of each process with the command:

```
$ tail -n 3 output_dir/1/*/stderr
```

or, to have a dynamic picture of the situation:

```
$ watch -n 1 tail -n 3 output_dir/1/*/stderr
```
To merge the results you can use the `merge.py` script.
```
usage: merge.py [-h] [--booklist [BOOKLIST_FILE [BOOKLIST_FILE ...]]]
                [--booklist-output BOOKLIST_OUTPUT]
                [--cache [CACHE_FILE [CACHE_FILE ...]]]
                [--cache-output CACHE_OUTPUT] [--config CONFIG_FILE] [-d]
                [-o OUTPUT_TSV] [--html] [--html-output OUTPUT_HTML]
                [--html-template TEMPLATE_FILE] [-v]
                FILE1 ...

Merge results from score.py.

positional arguments:
  FILE1                 Result file no. 1
  ...                   Additional result files

optional arguments:
  -h, --help            show this help message and exit
  --booklist [BOOKLIST_FILE [BOOKLIST_FILE ...]]
                        Merge booklist cache files
  --booklist-output BOOKLIST_OUTPUT
                        JSON file to store the merged cache (requires
                        --booklist) (default: booklist_cache_tot.tsv)
  --cache [CACHE_FILE [CACHE_FILE ...]]
                        Merge cache files
  --cache-output CACHE_OUTPUT
                        JSON file to store the merged cache (requires --cache)
                        (default: books_cache_tot.tsv)
  --config CONFIG_FILE  INI file to read configs (default: contest.conf.ini)
  -d                    Enable debug output (implies -v)
  -o OUTPUT_TSV         Output file (default: results_tot.tsv)
  --html                Produce HTML output
  --html-output OUTPUT_HTML
                        Output file for the HTML output
                        (default: {OUTPUT_TSV}.index.html)
  --html-template TEMPLATE_FILE
                        Template file for the HTML output
                        (default: index.template.html)
  -v                    Enable verbose output
```
Assuming that the result files are named `results01_sublist.tsv`, `results02_sublist.tsv`, `results03_sublist.tsv`, and `results04_sublist.tsv` as in the previous section, you can merge them with the following command:

```
$ python merge.py results*_sublist.tsv
```

The results are written to `results_tot.tsv`.