Google Summer Of Code '17 (PSF)

Goals

To identify the set of regressed benchmarks in the performance suite.
To find the reasons for these regressions.
To fix the benchmark suite itself wherever possible.
To come up with new benchmarks for modules that do not have benchmarks yet.

NOTE: For parallel reports on advances see here

This repository contains the source files of the project, but the blog is the suggested way to navigate them.

Tools Used

cpython
performance
perf
KCachegrind

Results

NOTE: For system-specs see here

Regressed benchmarks:

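+------------------------+---------+---------+----------------------+
| Benchmark              | py2     | py3     | Change               |
+------------------------+---------+---------+----------------------+
| python_startup_no_site | 9.42 ms | 26.0 ms | 2.76x slower (+176%) |
| python_startup         | 19.2 ms | 42.3 ms | 2.21x slower (+121%) |
| spectral_norm          | 194 ms  | 259 ms  | 2.20x slower (+120%) |
| sqlite_synth           | 6.70 us | 8.49 us | 1.27x slower (+27%)  |
| crypto_pyaes           | 158 ms  | 199 ms  | 1.26x slower (+26%)  |
| xml_etree_parse        | 193 ms  | 242 ms  | 1.25x slower (+25%)  |
| xml_etree_iterparse    | 154 ms  | 179 ms  | 1.16x slower (+16%)  |
| go                     | 439 ms  | 493 ms  | 1.12x slower (+12%)  |
+------------------------+---------+---------+----------------------+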

Startup-time

Startup-time is the most regressed benchmark.

+-------------------------+----------+-------------------------------+
| python_startup          | 19.2 ms  | 42.3 ms: 2.21x slower (+121%) |
+-------------------------+----------+-------------------------------+
| python_startup_no_site  | 9.42 ms  | 26.0 ms: 2.76x slower (+176%) |
+-------------------------+----------+-------------------------------+
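As a rough way to reproduce the two rows above on one interpreter, the sketch below times an empty `-c "pass"` run with and without the `site` module (`-S`). It is only an approximation: the numbers in the table come from the performance suite, and the helper name `startup_time_ms` and the repeat count are assumptions made for illustration.

```python
import subprocess
import sys
import time


def startup_time_ms(extra_args=(), repeat=5):
    """Return the best wall-clock time (ms) of launching the interpreter."""
    cmd = [sys.executable, *extra_args, "-c", "pass"]
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        best = min(best, (time.perf_counter() - start) * 1000)
    return best


with_site = startup_time_ms()        # comparable to python_startup
no_site = startup_time_ms(["-S"])    # comparable to python_startup_no_site
print(f"startup: {with_site:.1f} ms, startup without site: {no_site:.1f} ms")
```

Running the same script under py2 and py3 gives a quick, noisy comparison; the suite itself should be preferred for reportable numbers.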

Import time of specific modules:

* encodings + encodings.utf_8 + encodings.latin_1 took 2.5ms
* io + abc + _weakrefset took 1.2ms
* _collections_abc took 2.1ms
* sysconfig + _sysconfigdata took 0.9ms
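Per-module figures like the ones above can be collected in several ways. The sketch below assumes an interpreter that supports the `-X importtime` option (CPython 3.7 and later) and parses its report from stderr; the helper name `import_times_us` is hypothetical and this is not necessarily how the numbers in this list were obtained.

```python
import subprocess
import sys


def import_times_us(statement="pass"):
    """Map module name -> (self_us, cumulative_us) for one interpreter run."""
    proc = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", statement],
        capture_output=True, text=True, check=True,
    )
    prefix = "import time:"
    times = {}
    for line in proc.stderr.splitlines():
        if not line.startswith(prefix):
            continue
        fields = line[len(prefix):].split("|")
        # The header line has text instead of numbers, so it is skipped here.
        if len(fields) != 3 or not fields[0].strip().isdigit():
            continue
        times[fields[2].strip()] = (int(fields[0]), int(fields[1]))
    return times


times = import_times_us()
slowest = sorted(times.items(), key=lambda item: item[1][1], reverse=True)[:5]
for name, (self_us, cumulative_us) in slowest:
    print(f"{name}: self {self_us} us, cumulative {cumulative_us} us")
```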

py2 uses the "read method of the file object", which py3 has done away with. py3 instead imports the io module (the prime reason for the slowdown), and TextIOWrapper then uses the encoding passed to its constructor. The following solutions were suggested:

i) Improving the time spent in the abc module (in areas like these), because getattr(value, "__isabstractmethod__", False) is called for every class attribute of an ABC (including subclasses of ABCs). It is slow because:
- When the value is not an abstract method, an AttributeError is raised and cleared internally.
- getattr uses the method cache (via _PyType_Lookup), but __isabstractmethod__ mostly lives in the instance dict, so checking the method cache is mostly wasted effort.

ii) Avoiding the import of the whole sysconfig module and importing only the variables required. This is fixed in this "bpo" thread.

iii) Avoiding the import of uncommon modules.

iv) If a module excluded from startup is very common, it will be imported during Python "application" startup anyway. For such common modules (functools, pathlib, os, enum, collections, re), a faster import time is better than avoiding the import.
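To make point i) concrete, here is a minimal sketch of the kind of scan that is run over a class namespace when an ABC is created. The helper `collect_abstract_names` is a hypothetical stand-in written for illustration, not the actual code in the abc module.

```python
import abc


def collect_abstract_names(namespace):
    """Return the names whose values are flagged as abstract.

    Every attribute is probed with getattr(..., False); for ordinary
    values the lookup fails internally and the default is returned.
    """
    return {
        name
        for name, value in namespace.items()
        if getattr(value, "__isabstractmethod__", False)
    }


class Base(abc.ABC):
    @abc.abstractmethod
    def run(self):
        """Subclasses must implement this."""

    def helper(self):
        return 1


abstract_names = collect_abstract_names(vars(Base))
print(abstract_names)
```

Only `run` carries the flag, yet every entry in the namespace pays for the failed lookup, which is the cost described above.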

Idea of parallelizing marshalling:

If we could somehow parallelize marshalling, and thus "loading" (not "executing"), it could speed up imports and hence startup time. But it would not improve things drastically, because loading is a small fraction of execution time. For example, for a complex module such as "typing", execution took about 29x as long as loading, and for a smaller one such as "abc" about 4x.
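The split between "loading" and "executing" can be measured roughly as below. This is a sketch under assumptions: it uses a synthetic module source (`SOURCE`) rather than a real library module, so the ratio it prints is not comparable to the figures quoted above.

```python
import marshal
import timeit

# Synthetic module body: 200 small function definitions (an assumption
# made for illustration, not a real library module).
SOURCE = "\n".join(
    f"def func_{i}(x):\n    return x + {i}\n" for i in range(200)
)
code = compile(SOURCE, "<demo_module>", "exec")
payload = marshal.dumps(code)  # roughly what a .pyc stores after its header


def load():
    """Unmarshal the code object only ("loading")."""
    return marshal.loads(payload)


def load_and_exec():
    """Unmarshal and then run the module body ("loading" + "executing")."""
    namespace = {}
    exec(marshal.loads(payload), namespace)
    return namespace


runs = 200
load_time = timeit.timeit(load, number=runs) / runs
total_time = timeit.timeit(load_and_exec, number=runs) / runs
print(f"load: {load_time * 1e6:.1f} us, load + exec: {total_time * 1e6:.1f} us")
```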

NOTE: For comparison relating to C portion of code see here
Next I tried writing a C version of WeakSet (here), but Raymond Hettinger did not approve it because it would have been difficult to maintain.
On Nick Coghlan's suggestion, I checked whether we could:
- Push "commonly-imported" modules to a separate zip archive
- Seed sys.modules with the contents of that archive
- Freeze the import of those modules

I wrote a Python script to create a zip archive from common modules and ran the different versions of Python inside Docker containers. See this blog entry for more details. It turned out, however, that this might not bring large benefits: writing a custom importer already requires importing some common modules, and Python by itself already adds a .zip of the library to sys.path.
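The zip-archive idea can be sketched as follows. This is not the script from the blog entry: the archive name `common_modules.zip`, the helper `build_module_archive`, and the module `gsoc_demo_mod` are all hypothetical, and a toy module is packed instead of real library modules so the example stays self-contained.

```python
import importlib
import sys
import tempfile
import zipfile
from pathlib import Path


def build_module_archive(archive_path, modules):
    """Write {module_name: source_text} into a zip importable via sys.path."""
    with zipfile.ZipFile(archive_path, "w") as archive:
        for name, source in modules.items():
            archive.writestr(f"{name}.py", source)


tmpdir = tempfile.mkdtemp()
archive_path = Path(tmpdir) / "common_modules.zip"
build_module_archive(archive_path, {"gsoc_demo_mod": "VALUE = 42\n"})

# Putting the archive first on sys.path lets the built-in zip importer find it.
sys.path.insert(0, str(archive_path))
importlib.invalidate_caches()
demo = importlib.import_module("gsoc_demo_mod")
print(demo.VALUE, demo.__file__)
```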

Lazy-loading

I used a custom lazy-loader/importer for modules imported during "startup", to prevent the import of modules that are not necessary and to explore a possible decrease in startup time. Here's the blog entry explaining the implementation and the code for the custom lazy-loader/importer. But lazy-loading did not decrease the startup time; it increased it slightly, mostly because the lazy-loader itself already imports common modules.
There was also a suggestion to write Cython modules for common modules, but it was not pursued due to lack of time.
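For readers unfamiliar with lazy loading, the sketch below uses the standard library's `importlib.util.LazyLoader` rather than the custom importer described above, so it shows the general technique and not this project's implementation. The helper name `lazy_import` and the choice of `colorsys` as the example module are assumptions.

```python
import importlib.util
import sys


def lazy_import(name):
    """Return a module whose body only runs on first attribute access."""
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # defers the real execution
    return module


lazy_colorsys = lazy_import("colorsys")
# The attribute access below is what triggers the actual module execution.
print(lazy_colorsys.rgb_to_hsv(1, 0, 0))
```

Note that this helper itself needs `importlib.util`, which illustrates why a lazy importer can end up adding to startup cost instead of reducing it.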

Optimizing the "logging" benchmark

py3 showed a regression on the logging benchmarks:

+-------------------------+----------+-------------------------------+
| logging_format          | 57.7 us  | 75.1 us: 1.30x slower (+30%)  |
+-------------------------+----------+-------------------------------+
| logging_silent          | 818 ns   | 1.00 us: 1.22x slower (+22%)  |
+-------------------------+----------+-------------------------------+
| logging_simple          | 46.2 us  | 70.0 us: 1.51x slower (+51%)  |
+-------------------------+----------+-------------------------------+
This was fixed by this PR. See this blog entry for details.
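For context, the three rows can be approximated with a small timeit script. The mapping assumed here (silent = a suppressed debug call, simple = a constant warning message, format = a warning with arguments) is an interpretation of the benchmark names, and the logger name `bench_logging_demo` is hypothetical; the suite's own benchmark should be used for real measurements.

```python
import io
import logging
import timeit


def make_logger():
    """Logger writing to an in-memory stream, with DEBUG suppressed."""
    stream = io.StringIO()
    logger = logging.getLogger("bench_logging_demo")
    logger.handlers.clear()
    logger.propagate = False
    logger.addHandler(logging.StreamHandler(stream))
    logger.setLevel(logging.WARNING)
    return logger, stream


logger, stream = make_logger()
calls = 1000

silent = timeit.timeit(lambda: logger.debug("quiet message"), number=calls) / calls
simple = timeit.timeit(lambda: logger.warning("simple message"), number=calls) / calls
formatted = timeit.timeit(
    lambda: logger.warning("value %s and %s", 1, "two"), number=calls
) / calls

print(f"silent: {silent * 1e6:.2f} us")
print(f"simple: {simple * 1e6:.2f} us")
print(f"format: {formatted * 1e6:.2f} us")
```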

Other benchmarks: pickle, sqlite_synth, crypto_pyaes

Pickle/unpickle was judged to be of lower practical importance (see here).
For sqlite_synth see here
For crypto_pyaes see here

There was some work on FAT Python, but it was not pursued because it did not pass the test suite and generated incorrect bytecode. PRs #13 and #12 were merged.

Newer benchmarks

The present performance-suite lacks benchmarks for many of the common library modules.

zlib

This benchmark measures basic compression and decompression using zlib. It showed a significant regression as the length of the binary string increased. The code and the details needed to reproduce it are in this blog entry.
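A minimal sketch of such a measurement is shown below. The actual benchmark is in the blog entry; the payload, the sizes, and the helper name `bench_zlib` are assumptions chosen for illustration.

```python
import timeit
import zlib


def bench_zlib(size, number=20):
    """Return (compress_seconds, decompress_seconds) per call for `size` bytes."""
    chunk = b"benchmark payload "
    data = (chunk * (size // len(chunk) + 1))[:size]
    compressed = zlib.compress(data)
    # Sanity check: the round trip must give back the original bytes.
    assert zlib.decompress(compressed) == data
    compress_time = timeit.timeit(lambda: zlib.compress(data), number=number) / number
    decompress_time = timeit.timeit(
        lambda: zlib.decompress(compressed), number=number
    ) / number
    return compress_time, decompress_time


results = {size: bench_zlib(size) for size in (1_000, 100_000, 1_000_000)}
for size, (compress_time, decompress_time) in results.items():
    print(f"{size:>9} bytes: compress {compress_time * 1e6:.1f} us, "
          f"decompress {decompress_time * 1e6:.1f} us")
```

Growing the input size, as in the loop above, is what exposes how the cost scales with the length of the binary string.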

math

This benchmark measures basic math operations. Here too, py2 comes out faster. Here's the blog entry for code and statistics.
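The kind of micro-benchmark meant here can be sketched with timeit as below. The particular operations and the helper name `bench_math` are assumptions; the blog entry has the code that was actually measured.

```python
import math
import timeit


def bench_math(number=10_000):
    """Return the average seconds per call for a few math operations."""
    operations = {
        "sqrt": lambda: math.sqrt(12345.678),
        "sin": lambda: math.sin(1.2345),
        "log": lambda: math.log(12345.678),
        "factorial": lambda: math.factorial(50),
    }
    return {
        name: timeit.timeit(func, number=number) / number
        for name, func in operations.items()
    }


timings = bench_math()
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e9:.0f} ns per call")
```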

smtplib

The smtplib module defines an SMTP client session object that can be used to send mail to any Internet machine with an SMTP or ESMTP listener daemon. This benchmark measures its performance. Here's the entry for the code. Again, py3 regresses.

Concurrency benchmark and concurrency primitives

Python primarily offers two concurrency primitives: threading and multiprocessing. This benchmark measures the same number-crunching task when it is run concurrently with "threading" and with "multiprocessing". Here's the benchmark that measures concurrency and also the cost of creating threading objects (which is not of much use as such).
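The comparison can be sketched as follows. The task `crunch` and the helper names are hypothetical, and the linked benchmark remains the reference for how the measurement was actually done.

```python
import multiprocessing
import threading
import time


def crunch(n):
    """CPU-bound toy task: sum of the squares below n."""
    return sum(i * i for i in range(n))


def time_threads(n, workers):
    """Run `crunch(n)` in `workers` threads and return elapsed seconds."""
    threads = [threading.Thread(target=crunch, args=(n,)) for _ in range(workers)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


def time_processes(n, workers):
    """Run `crunch(n)` in a pool of `workers` processes; return elapsed seconds."""
    start = time.perf_counter()
    with multiprocessing.Pool(workers) as pool:
        pool.map(crunch, [n] * workers)
    return time.perf_counter() - start
```

When running this as a script, call `time_threads` and `time_processes` from under an `if __name__ == "__main__":` guard, since multiprocessing may re-import the main module in its worker processes. The process timing includes the cost of starting the pool.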

Future-Work:

Implementation of Cython modules.
Improving the portion of the abc module's code pointed out above.
Accumulating use cases for newer benchmarks.

Acknowledgements:

@Botanic(Matthew Lagoe)
Victor Stinner
Inada Naoki
James Lopeman
Ezio Melotti
And everyone in the Python community :)
