[opam-devel] How to know whether a package archive is already in the cache?

Sat Nov 14 02:17:58 GMT 2015

I finally went the "independent table that I manually (de)serialize
before and after download" route. The current state of the script is
available at

  https://github.com/gasche/opam/blob/ocamlbuild-migration-script/admin-scripts/add_ocamlbuild_dependency.ml

and I'm looking for feedback on the preliminary results at

  https://github.com/ocaml/opam-repository/pull/5140
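
For readers who want the idea without opening the script, here is a
minimal sketch of such an independent table. It is not the code from the
script above: the one-name-per-line file format and the names
load_failures, save_failures and download_unless_known_failure are
hypothetical, and the download function is passed in as a parameter
rather than calling any opam library function.

```ocaml
(* Hypothetical helper, not part of the opam libraries: a persistent set
   of package names whose download failed on a previous run. *)
module StringSet = Set.Make (String)

(* Load the table from [file]; a missing file means nothing was recorded. *)
let load_failures file =
  if not (Sys.file_exists file) then StringSet.empty
  else begin
    let ic = open_in file in
    let rec loop acc =
      match input_line ic with
      | line -> loop (if line = "" then acc else StringSet.add line acc)
      | exception End_of_file -> acc
    in
    let set = loop StringSet.empty in
    close_in ic;
    set
  end

(* Write the table back, one package name per line. *)
let save_failures file set =
  let oc = open_out file in
  StringSet.iter (fun name -> output_string oc (name ^ "\n")) set;
  close_out oc

(* Skip packages already known to fail; otherwise call [download], which
   is assumed to return [Some archive] on success and [None] on failure,
   and record any new failure in the returned set. *)
let download_unless_known_failure ~download failures pkg =
  if StringSet.mem pkg failures then (failures, None)
  else
    match download pkg with
    | Some archive -> (failures, Some archive)
    | None -> (StringSet.add pkg failures, None)
```

The table is loaded once before iterating over the packages and saved
once afterwards, so a package that timed out in one run is skipped
immediately in the next.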

On Thu, Nov 12, 2015 at 6:47 PM, Gabriel Scherer
<gabriel.scherer at gmail.com> wrote:
> Hi opam-devel,
>
> I'm currently hacking on a script to do a bulk update of OPAM
> metadata, adding "ocamlbuild" as an explicit dependency of all
> packages my killer heuristic decides certainly use ocamlbuild (right
> now: there is a _tags or myocamlbuild.ml somewhere, but I'm soon going
> to integrate the fact that an _oasis file explicitly lists ocamlbuild
> as the relied-upon build system).
>
> This is rather simple, with most of the time spent browsing through
> the rich opam-library API.
> - iterate over all packages in the repository (using the nice
> Opam_admin_top.iter_packages function)
> - for each package download the archive (I used
> OpamAction.download_package for this, although it requires an
> OpamState.t argument that I wasn't sure how to build¹)
> - extract each archive (OpamFilename.extract_generic_file, under some
> OpamFilename.with_tmp_dir call to get automatic cleanup)
> - walk the archive to test ocamlbuild usage
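
The last step of the list above (walking an extracted archive) could be
sketched roughly as follows, using only the standard library. The
function name uses_ocamlbuild is hypothetical, the sketch covers only
the "_tags or myocamlbuild.ml somewhere" part of the heuristic, and it
assumes the extracted tree contains no symbolic-link cycles.

```ocaml
(* Return true if a file named _tags or myocamlbuild.ml exists anywhere
   beneath [dir]. Entries that cannot be inspected (for example broken
   symbolic links) are treated as plain files. *)
let rec uses_ocamlbuild dir =
  Array.exists
    (fun entry ->
      let path = Filename.concat dir entry in
      let is_dir = try Sys.is_directory path with Sys_error _ -> false in
      if is_dir then uses_ocamlbuild path
      else entry = "_tags" || entry = "myocamlbuild.ml")
    (Sys.readdir dir)
```
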
>
> Caching downloaded archives works very well, so re-running the script
> (during my test-refine feedback loops) does not re-download them.
> Unfortunately, for a handful of packages, the download fails, and it
> only fails after a rather long timeout has expired, so just
> re-iterating on those failed packages makes a process that should be
> instantaneous take several minutes.
>
> So here is my question: how can I test whether a package archive is
> already in the cache? Because I know now that all packages that won't
> time out have been cached by previous runs of my script, I could
> iterate only on those. But I didn't find a clear way to do that (this
> seems to be available internally in some OpamHTTP backend, but I
> haven't seen this exported).
>
> A way to cache not only the successfully downloaded archives, but also
> the "did not work" last time decision would also fit the bill. In the
> worst case I could store that information in an independent table that
> I would (de)serialize across invocations of my script.
>
> (Opam seems to have fancy download functions designed to download a
> lot of stuff in parallel, but that seems incompatible with the
> sequential workflow imposed by `iter_packages`. I could first iterate
> to build a list of URLs, then download everything in parallel, then
> re-iterate, but then again I would need to access only the archives
> whose download actually succeeded.)
>
> While we're at it: is there a simple way to get a pretty string from a
> Package.t value? I use
>           Printf.sprintf "%s.%s"
>             (OpamPackage.name_to_string package)
>             (OpamPackage.version_to_string package)
> but would expect this to be available already.
>
> The complete code of the current prototype script (it is not editing
> any metadata so far, just printing out the results that seem reasonable,
> except that the _oasis part of the heuristic needs to be implemented
> to get realistic results) is available at
>
>   https://github.com/gasche/opam/blob/2badfa0810e25ded1495b28b2ec8ff53f03a90cc/admin-scripts/add_ocamlbuild_dependency.ml
>
> Any comment or advice is warmly welcome. In particular, there is a
> question in a comment: what is the right way to build an
> OpamState.t value?