[ocaml-platform] Benchmarking in OPAM

Anil Madhavapeddy anil at recoil.org
Thu Mar 14 10:48:57 GMT 2013


[+CC platform@]
On 10 Mar 2013, at 21:45, Gabriel Scherer <gabriel.scherer at gmail.com> wrote:

> More precisely, what I've been trying to do this weekend is to find
> the right structure for using the benchmarks that I received from
> users after
>  http://gallium.inria.fr/~scherer/gagallium/we-need-a-representative-benchmark-suite/
> 
> I worked on this a bit yesterday with Simon Castellan and Frédéric
> Bour, and for now we decided to reuse code from Edwin Török, which
> itself relies on Edgar Friendly's Bench library, which has some
> similarities with Criterion -- I'm not really familiar with the
> internals of any of those three libraries, but they try to impress
> users with statistics, confidence intervals, etc.
> 
> The focus is on getting quick yet reliable feedback on the benefit
> of a given compiler optimization across representative OCaml
> programs. That is fairly different from the classic use cases of
> continuous integration (CI): improving the libraries themselves, and
> monitoring correctness (through test suites) and portability across
> architectures.
> 
> To get lots of short-lived opam compiler switches corresponding to SVN
> or git development branches of the compiler (to be reinstalled each
> time you make a change to the branch, which is quite often), I've
> tested the use of "preinstalled", but got some feedback from Thomas on
> better ways to do it (
> https://github.com/OCamlPro/opam/pull/519#issuecomment-14682101 ),
> using secret OPAM features. To get quick feedback it is rather
> important to minimize a package's dependencies.
> 
> Finally, we realized that we really need two distinct kinds of
> benchmarking software:
> - one "benchmark library" that is solely meant to run performance
> tests and return the results (it will be used by and linked with the
> benchmark programs, so it gets recompiled at each compiler change and
> should be rather light if possible)
> - one "benchmark manager" that compares results between different
> runs, plots nice graphs, stores results over time or sends them to a
> CI server, formats them in XML, or whatnot. This one is run with the
> system compiler and can have arbitrarily large feature sets and
> dependencies.
> 
> I believe a similar split would be meaningful for unit testing as
> well. Of course, if you're considering daily automated large-scale
> package building and checking, instead of tight feedback loops, it is
> much less compelling to force a split: you can just bundle the two
> kinds of features in the same package.

The split you describe is generally good discipline, as it encourages
library authors to encode more small benchmarks that can be called from
larger tools.
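
For what it's worth, the "benchmark library" half really can stay
small: under the hood it is essentially repeated timing plus a little
statistics.  A minimal sketch of the idea (purely illustrative; none
of these names come from the actual Bench or Criterion APIs):

  (* Illustrative benchmark-library sketch, stdlib only. *)

  (* Time [iters] calls of [f] with Sys.time, returning ns per call. *)
  let sample ~iters f =
    let t0 = Sys.time () in
    for _i = 1 to iters do f () done;
    (Sys.time () -. t0) *. 1e9 /. float_of_int iters

  (* Take [n] samples, reduce them to a mean and a 95% confidence
     interval, and print one easily parseable line per benchmark so a
     separate manager process can pick the results up later. *)
  let measure ?(n = 30) ?(iters = 1_000) name f =
    let xs = Array.init n (fun _ -> sample ~iters f) in
    let fn = float_of_int n in
    let mean = Array.fold_left ( +. ) 0. xs /. fn in
    let var =
      Array.fold_left (fun acc x -> acc +. ((x -. mean) ** 2.)) 0. xs
      /. (fn -. 1.)
    in
    let ci95 = 1.96 *. sqrt var /. sqrt fn in
    Printf.printf "%s %.1f %.1f\n" name mean ci95

  let () =
    let l = Array.to_list (Array.init 1_000 (fun i -> i)) in
    measure "list_rev" (fun () -> ignore (List.rev l))

Real measurement libraries of course do rather more than this
(warm-up, outlier analysis and so on), but nothing that should force
heavy dependencies onto the benchmark programs themselves.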

The benchmark manager is definitely something we want to have in the
OPAM hosted test system.  It's very difficult to get representative
benchmark results without a good mix of architectures and operating
systems, and we're going to pepper lots of odd setups into the OPAM
build infrastructure (and eventually add the facility to crowdsource
other people's machines into the build pool, making it easier to
contribute resources).
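
On the manager side, since it never has to be rebuilt against the
compiler under test, it can be as heavyweight as we like.  Comparing
two runs is then little more than parsing result lines and printing a
ratio per benchmark; again only an illustration, assuming the
one-line-per-benchmark format from the sketch above:

  (* Illustrative "benchmark manager": compare two result files whose
     lines look like "name mean_ns ci95_ns". *)

  let read_results file =
    let ic = open_in file in
    let tbl = Hashtbl.create 16 in
    (try
       while true do
         Scanf.sscanf (input_line ic) " %s %f %f"
           (fun name mean _ci -> Hashtbl.replace tbl name mean)
       done
     with End_of_file -> close_in ic);
    tbl

  let () =
    match Sys.argv with
    | [| _; before; after |] ->
        let old_r = read_results before and new_r = read_results after in
        Hashtbl.iter
          (fun name old_mean ->
            match Hashtbl.find_opt new_r name with
            | Some new_mean ->
                Printf.printf "%-20s %+.1f%%\n" name
                  (100. *. (new_mean -. old_mean) /. old_mean)
            | None -> ())
          old_r
    | _ -> prerr_endline "usage: compare <before.txt> <after.txt>"

Plotting, storing history and pushing results to a CI server are just
more layers on top of the same stored result files.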

So for the moment, focussing on the benchmark library would seem to
be the best thing to do: I've not really used any of the existing
libraries, and would
be interested in knowing what I ought to adopt for (e.g.) the Mirage
protocol libraries.  Once we have that in place, the OPAM test integration
should be much more straightforward.

-anil

PS: We've been building up interesting IPC test data for about a year
now: http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/results.html
and obtaining results from different architectures has been consistently
puzzling and illuminating at the same time.  See my FOSDEM talk on how
complex IO can be: http://anil.recoil.org/talks/fosdem-io-2012.pdf

