Update August 9: urikanegun has kindly contributed a Japanese translation of this article.
Developers love speed, so developers love benchmarks. Benchmarks of programming language performance, app server performance, JavaScript engine performance, etc. have always attracted a lot of attention. However, there are many caveats involved in running a good benchmark. One of those caveats is benchmark stability: if you run a benchmark multiple times, the timings usually differ a bit. A lot of people tend to hand-wave this caveat away by shutting down all apps, rerunning the benchmark a few times, and averaging the results. Is that truly good enough?
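To see what this run-to-run variation looks like in practice, here is a minimal sketch using Python's `timeit` and `statistics` modules. The workload itself is just a hypothetical placeholder; the point is that repeating the same measurement yields a spread of timings, not a single number:

```python
import statistics
import timeit

# Hypothetical workload: any deterministic piece of code will do.
def workload():
    return sum(i * i for i in range(100_000))

# Run the workload 100 times per measurement, and take 10 measurements.
timings = timeit.repeat(workload, number=100, repeat=10)

print("min:   %.4f s" % min(timings))
print("max:   %.4f s" % max(timings))
print("mean:  %.4f s" % statistics.mean(timings))
print("stdev: %.4f s" % statistics.stdev(timings))
```

Even on an otherwise idle machine, the minimum and maximum typically differ by a noticeable margin, which is exactly the instability this article is about.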
Lately, I have been researching the topic of benchmark stability because I am interested in creating reliable benchmarks that are reproducible by third parties, so that they can verify the results themselves, e.g. allowing users of my software to check for themselves that my benchmarks are trustworthy. This research led me to Victor Stinner, a Python core developer who has spent several years working on improving Python 3 performance.