[opam-devel] Travis is broken

Sun Oct 26 11:55:21 GMT 2014

On 24 Oct 2014, at 11:59, Peter Zotov <whitequark at whitequark.org> wrote:
> 
> On 2014-10-24 14:20, Anil Madhavapeddy wrote:
>> One reason I haven't spent too much time on buildbot and bors is that they
>> all need some level of customisation to the specific deployment.
> 
> I'm actually almost done. (Bored, insomnia, etc.) The Buildbot configuration
> is really simple in this case, it just runs a single Docker command, which
> pulls from the repo and then runs a script derived from .travis-ci.sh:
> 
> https://github.com/whitequark/opam-repository/blob/master/.docker-ci.sh
> https://gist.github.com/whitequark/516973336a55971e2507
> 
> A bigger problem is OS X workers, which don't have anything like Docker
> for build isolation. But I think they have a sandboxing mechanism.

Thanks for setting this up!  It's good to see that the configuration is
relatively simple.  I'm tempted to have something like this run on a staging
version of opam-repository, since we could eliminate the 50-minute limit for
build jobs (and hence rigorously test Core).

>> The OCamlot work that David Sheets did last year is ripe for a refresh with
>> all the new infrastructure that's been built in the last year.  For example:
>> - opamLib is now much easier to use as a library than it was in opam 1.0
>> - the ocaml-git bindings work, so all the shelling out to the cmdline disappears
>> - David has almost finished GitHub webhooks integration to ease that
>> callback process
>> - Irmin or Arakoon could be used as the k/v store for the logs now
>> All in all, I'd be inclined to put time into putting together a self-hosted
>> one using this infrastructure.  The only real missing major piece is the web
>> UI.  I wonder if there is some js_of_ocaml-friendly UI layer that we could drop
>> in for log viewing purposes...
> 
> This sounds like it could take months.

More on the order of weeks if you discount the web UI (which could be CLI
driven).  It's worth doing because the customisation that you pointed out is
hard to do on external platforms.  Some lessons learnt from the previous
deployment of OCamlot last year:

- having a single-OCaml-binary deployment makes multi-OS workers really
  straightforward compared to using (e.g.) Jenkins.  Getting the JVM working
  on a Raspberry Pi was no fun.

- OPAM-specific logic is required to stop overwhelming slower (ARM, PowerPC,
  MIPS) workers with unnecessary jobs.  We had a 'stage 1' gateway that would
  run only on x86_64 to quickly test for errors, and then spawn off tasks
  on increasingly obscure architectures, as well as on non-Linux operating
  systems.

- There are a number of custom regexps in the ocamlot repo that do autotriage
  on the build logs for common OCaml-specific errors, such as ocamlfind packages
  not being found, or warnings-as-errors.  I do miss these in Travis land...

- Supporting multiple operating systems requires treating workers as heavyweight
  VMs, with Docker and similar OS-specific mechanisms being a useful optimisation
  for build times.  We can run *some* workers on Rackspace Cloud where they have
  been ported, but others (such as OpenBSD) need to run on hosted infrastructure
  somewhere (such as the Cambridge Computer Lab, which is fine by me).

  Specific operating systems:
  - Windows: we could use Azure, which is also what AppVeyor uses.
  - FreeBSD is supported on Rackspace Cloud.
  - OpenBSD requires custom hosting, and has some stability issues under Xen
    that are on my debugging list (page table crashes on x86_32).
  - Mac OS X could use Vagrant with the VMware Fusion provider.  Sandboxing is
    more of an app model there, and not suitable for whole-system snapshots.
  - Most common Linux variants can be handled via Docker.
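
To make the autotriage point above concrete, here is a rough sketch of the
idea in Python.  These are not the regexps from the ocamlot repo: the rule
labels, the triage function and the two patterns are illustrative assumptions,
based on the messages that ocamlfind and the compiler typically print.

```python
import re

# Hypothetical triage rules: (label, pattern).  The patterns are assumptions
# about typical ocamlfind / compiler output, not copies of the ocamlot ones.
TRIAGE_RULES = [
    ("missing-ocamlfind-package",
     re.compile(r"ocamlfind: Package `([^']+)' not found")),
    ("warnings-as-errors",
     re.compile(r"Error: Some fatal warnings were triggered")),
]


def triage(log):
    """Return a sorted list of (label, detail) pairs for every rule that matches.

    `detail` is the first capture group of the pattern (for example the name
    of the missing package), or an empty string if the pattern has no group.
    """
    findings = set()
    for line in log.splitlines():
        for label, pattern in TRIAGE_RULES:
            match = pattern.search(line)
            if match:
                detail = match.group(1) if match.groups() else ""
                findings.add((label, detail))
    return sorted(findings)
```

Feeding a build log through triage() yields a short list of labelled causes
that can be shown next to the failing job instead of the raw log; adding a new
class of error is then just one more entry in the rules table.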

It's interesting how there doesn't seem to be any out-of-the-box open source
solution for continuous integration on multiple operating systems and weird
architectures (where the JVM won't work too well).

-anil