[ocaml-platform] Benchmarking in OPAM

Török Edwin edwin at etorok.net
Sun Mar 17 11:46:48 GMT 2013


On 03/14/2013 12:48 PM, Anil Madhavapeddy wrote:
> 
> So for the moment, focussing on the benchmark library would seem to
> be the best thing to do: I've not really used any of them, and would
> be interested in knowing what I ought to adopt for (e.g.) the Mirage
> protocol libraries.  Once we have that in place, the OPAM test integration
> should be much more straightforward.

I don't know if Gabriel replied to you about this, but here's my take on it:

I tried both bench and benchmark (as found in opam), and found that bench gives more accurate results.
Both calculate the number of iterations for each benchmark needed to get significant measurements with respect to accuracy, but:

Benchmark:
 + runs Gc by default between tests
 + gives you user, sys, and wall times (probably more useful for benchmarks that perform I/O)
 - if you want to take a high number of samples, your total benchmark time will be quite high
 - you have to calculate mean/stdev on your own

Bench:
 + uses gettimeofday(), which seems to be more accurate than utimes() (probably better for CPU-bound benchmarks)
 - no user/sys times, only wall time
 * Caveat: doesn't run the Gc by default between tests, but this can be enabled via Bench.config.gc_between_tests
 + takes 1000 samples per benchmark by default; the number of iterations is calculated so that each sample takes
at least 1000 * clock_resolution, i.e. ~1ms on Linux
 + computes the mean and stdev using the bootstrap method, and calculates 95% confidence intervals for them
 + the individual measurements are noisier than with benchmark, but because you have a lot more of them
it can calculate a more accurate mean and stdev
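
For reference, the bootstrap estimate works roughly like this (just an illustration of the method, not bench's
actual code): resample the measurements with replacement, recompute the statistic on each resample, and take
percentiles of the resulting distribution:

    (* illustration of bootstrap confidence intervals, not bench's code *)
    let mean a =
      Array.fold_left (+.) 0. a /. float (Array.length a)

    (* [stat] is the statistic of interest, e.g. [mean] or a median function *)
    let bootstrap_ci ?(resamples = 1000) ~stat samples =
      let n = Array.length samples in
      let estimates =
        Array.init resamples (fun _ ->
          (* draw n measurements with replacement and recompute the statistic *)
          stat (Array.init n (fun _ -> samples.(Random.int n))))
      in
      Array.sort compare estimates;
      (* the 2.5th and 97.5th percentiles bound the 95% confidence interval *)
      (estimates.(resamples * 25 / 1000), estimates.(resamples * 975 / 1000))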

So what we need is a thin layer on top of bench, which could be a stripped-down version of my edobench that Gabriel mentioned last time,
i.e. something like:
    run : config -> (name * (unit -> 'a)) list -> result
    log : log_file -> result -> unit
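
Spelled out a bit more, the wrapper's interface could look like the sketch below (the type and field names are
placeholders I made up, not bench's actual API):

    (* sketch of the proposed thin wrapper; all names are illustrative *)
    module type Bench_runner = sig
      type config = {
        samples : int;            (* e.g. 1000 samples per benchmark *)
        gc_between_tests : bool;  (* force a Gc run between samples *)
      }

      type stat = {
        mean : float;             (* estimated mean running time, in seconds *)
        stdev : float;            (* estimated standard deviation *)
        ci95 : float * float;     (* 95% confidence interval for the mean *)
      }

      type result = (string * stat) list

      (* run each named thunk under [config] and collect the statistics *)
      val run : config -> (string * (unit -> 'a)) list -> result

      (* append the results to a log file, one line per benchmark *)
      val log : string -> result -> unit
    end

Keeping result a plain list of records would make the logging side trivial: the manager can append one line per
benchmark per run and compare means/confidence intervals between runs.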

In my lib I also calculated the median (and its confidence interval), because the benchmark measurements are not always symmetric around the mean
(i.e. they don't come from a normal distribution, so calculating mean - 1.96*stdev can give negative results),
but for simplicity's sake I think mean and stdev might be enough as a statistic.
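
To make the asymmetry point concrete, here is a toy example (with made-up numbers): one slow outlier is enough to
push the normal-approximation lower bound below zero, while the median stays sensible:

    (* toy illustration of why mean - 1.96*stdev can be misleading
       for skewed timing data; the numbers are made up *)
    let timings = [| 1.0; 1.1; 1.0; 1.2; 1.1; 9.0 |]  (* one slow outlier *)

    let mean a = Array.fold_left (+.) 0. a /. float (Array.length a)

    let stdev a =
      let m = mean a in
      sqrt (mean (Array.map (fun x -> (x -. m) ** 2.) a))

    let median a =
      let a = Array.copy a in
      Array.sort compare a;
      a.(Array.length a / 2)

    let () =
      (* prints roughly -3.4 for the lower bound and 1.1 for the median *)
      Printf.printf "mean - 1.96*stdev = %g\n" (mean timings -. 1.96 *. stdev timings);
      Printf.printf "median = %g\n" (median timings)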

I've also done some "sanity" checks in my lib, but those should probably be part of the benchmark manager, i.e.:
 - check that CPU frequency scaling is off (i.e. the governor, if any, is set to performance)
 - check that CPU core performance boosting is not enabled
Both of these can make benchmark results hard to compare: even running the same binary on the same machine can give wildly different timings.
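
A rough sketch of what those checks could look like on Linux (the sysfs paths below are the common ones for the
acpi-cpufreq and intel_pstate drivers, but they vary between kernels and drivers, so take them as assumptions):

    (* sketch of the sanity checks; sysfs paths vary by kernel/driver *)
    let read_first_line path =
      try
        let ic = open_in path in
        let line = input_line ic in
        close_in ic;
        Some (String.trim line)
      with Sys_error _ | End_of_file -> None

    let check_governor () =
      match read_first_line
              "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor" with
      | Some "performance" | None -> ()  (* None: no cpufreq support *)
      | Some g -> Printf.eprintf "warning: cpufreq governor is %s\n" g

    let check_boost () =
      (* acpi-cpufreq exposes "boost", intel_pstate exposes "no_turbo" *)
      (match read_first_line "/sys/devices/system/cpu/cpufreq/boost" with
       | Some "1" -> prerr_endline "warning: CPU boost is enabled"
       | _ -> ());
      match read_first_line "/sys/devices/system/cpu/intel_pstate/no_turbo" with
      | Some "0" -> prerr_endline "warning: turbo boost is enabled"
      | _ -> ()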

Best regards,
--Edwin

