[wg-camlp4] Structured comments, shallow embeddings and deep quasiquotations

Sat Feb 2 16:17:14 GMT 2013

On 1 February 2013 15:38, Gabriel Scherer <gabriel.scherer at gmail.com> wrote:
> Finally, I think -ppx + arbitrary annotations, without any further
> restriction, is too free-form for robust syntax extensions: one
> important problem with Camlp4 is that it allowed syntax extension
> writers to modify the syntax in very bad way that hurt robustness. As
> already discussed, I have a strong distaste for extensions that only
> piggyback on existing syntax (without adding any explicit marker); I
> would feel safer if the extension *mechanism* disallowed such
> unstructured extensions, or at least made them less rewarding to write
> than the composable ones. (For example by only passing to the
> extension writer the part(s) of the AST that have been annotated).
> Unfortunately, I don't see how Bisect would fit any such restriction.
> Maybe that's a problem best solved by socialization (writing a
> documentation on good practices, and yelling on people), but I sort of
> doubt it -- I don't know how many time I've had to argue for *not*
> globally changing the associativity of infix operators through Camlp4
> in Batteries.

I agree with Gabriel.  Actually, I think that a small tweak to the
design of -ppx could address both this and a number of problems that
others have raised during the discussion here.

The -ppx approach applies one or more global transformations to the
ASTs of OCaml source files; these transformations can be parameterized
by attributes attached at particular points in the syntax tree.  This
is a significant improvement over the Camlp4 approach, largely because
it exchanges the (unnecessary) ability to change the concrete syntax
for a number of valuable guarantees, which make extensions easier to
write and code that uses extensions easier to understand.

We can go further in this direction, and give up more (unnecessary)
power in return for further guarantees.  As Gabriel says, since -ppx
extensions can arbitrarily transform the AST, it's not possible to
understand any part of a program that uses extensions without
understanding every aspect of the behaviour of every extension.  We
could, of course, seek to solve this by convention and social
pressure, but there seems to be an emerging consensus that this isn't
really satisfactory.  One of the nice things about functional
programming is that you have strong guarantees (via parametricity,
immutability, and so on) about the effects of calling a particular
function. We should strive to make it possible to reason in the same
manner about the effects of syntax extensions.

There are other legitimate concerns with the current proposal.  As
Xavier Clerc and others point out, attributes are apparently
undeclared (i.e. global) and untyped.  Alain rightly notes that it
seems to be difficult to introduce declarations and types for
attributes without significant complexity.  Still, as OCaml
programmers we're used to the benefits of precisely-scoped names and
strongly-typed data, and it seems a shame to give these benefits up if
we can find a way to keep them.

Hongbo Zhang raises a further concern: when syntax extensions are
global transformations on the whole file, the order in which
extensions are applied becomes significant.  This is a fairly serious
matter, I think: the semantics of code that uses syntax extensions is
now dependent on external factors, since we need to look for the flags
passed to OCaml in the build configuration in order to understand the
source.
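
To make the ordering problem concrete, here is a minimal sketch on a toy
stand-in for a parsed file.  The item type and both passes are hypothetical
and are not taken from any existing extension; they only show that two
whole-file transformations need not commute, so the result depends on the
order of the flags.

```ocaml
(* Toy stand-in for a parsed file: a list of top-level items.
   Everything here is hypothetical and only illustrates the ordering issue. *)
type item =
  | Type_decl of string   (* stands for: type t = ... *)
  | Value of string       (* stands for: let f = ... *)

(* Pass 1: after every type declaration, add a generated printer. *)
let add_printers (items : item list) : item list =
  List.concat
    (List.map
       (function
         | Type_decl t -> [ Type_decl t; Value ("show_" ^ t) ]
         | other -> [ other ])
       items)

(* Pass 2: replace every value by an instrumented ("logged") version. *)
let instrument (items : item list) : item list =
  List.map
    (function
      | Value v -> Value ("logged_" ^ v)
      | other -> other)
    items

let source = [ Type_decl "t"; Value "f" ]

(* The same two passes, applied in the two possible orders. *)
let printers_then_instrument = instrument (add_printers source)
let instrument_then_printers = add_printers (instrument source)
```

In the first order the generated printer is instrumented as well; in the
second it is not.  Nothing in the source file says which of the two was
intended.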

I think we can address all these concerns with a small adjustment in
perspective.  Instead of globally-scoped, untyped attributes processed
by file-level externally-specified transformations, we might add a
single node to the OCaml grammar for statically-executed AST
rewriters.  Using the same syntax already proposed for attributes, we
might write, for example:

    (@deriving ["sexp"; "json"])
    type t = F of int | G of s
     and s = H of (t * t)

or

    (@perform)
       (x <-- m;
        y <-- n;
        return (x y))

In order for this to be valid code, 'deriving' and 'perform' should
resolve to functions of appropriate types:

    val deriving :
      string list -> Parsetree.structure_item -> Parsetree.structure_item

    val perform : Parsetree.expression -> Parsetree.expression
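
As a rough illustration of what a function of the second type might do, here
is a sketch over a toy expression type standing in for Parsetree.expression.
The constructors, and the choice to desugar 'x <-- m; rest' into
'bind m (fun x -> rest)', are assumptions made for the example; nothing above
fixes what 'perform' actually does.

```ocaml
(* Toy expression type standing in for Parsetree.expression (hypothetical). *)
type expr =
  | Var of string
  | App of expr * expr
  | Fun of string * expr
  | Seq of expr * expr            (* e1; e2 *)
  | Bind_arrow of string * expr   (* x <-- e *)

(* perform : expr -> expr
   Assumed behaviour: rewrite  x <-- m; rest  into  bind m (fun x -> rest).
   A plain sequence is kept, with its tail rewritten; any other expression
   is returned unchanged. *)
let rec perform (e : expr) : expr =
  match e with
  | Seq (Bind_arrow (x, m), rest) ->
      App (App (Var "bind", m), Fun (x, perform rest))
  | Seq (e1, e2) -> Seq (e1, perform e2)
  | other -> other

(* x <-- m; y <-- n; return (x y) *)
let input =
  Seq (Bind_arrow ("x", Var "m"),
       Seq (Bind_arrow ("y", Var "n"),
            App (Var "return", App (Var "x", Var "y"))))

let output = perform input
```

The function sees only the expression it is handed and returns only an
expression, which is all the type promises.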

Either during parsing or in a post-parsing phase, the ASTs following
'@deriving ["sexp"; "json"]' and '@perform' are passed to those
functions and the results are inserted in place into the AST.
Gabriel's concern is addressed, because there's no way for @perform
(say) to access other parts of the AST: its effects are purely local.
Xavier's concern is addressed, since AST rewriters, unlike attributes,
are declared and typed (and hence scoped). Hongbo's concern is
addressed, since composition is explicit:

    (@deriving ["sexp"])
    (@nonrec)
    type t = C of t

(Here '@deriving ["sexp"]' is applied to the result of applying '@nonrec'.)
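
The locality and the explicit composition can both be sketched on a toy AST
with a dedicated node for rewriter applications.  All the names below are
hypothetical, and the sketch assumes that the rewriter name has already been
resolved to an ordinary function by the time the node is expanded.

```ocaml
(* Toy AST with one extra node for rewriter applications (hypothetical). *)
type expr =
  | Int of int
  | Add of expr * expr
  | Rewrite of (expr -> expr) * expr   (* (@f) e, with f already resolved *)

(* Helper for writing the example rewriters: map over integer literals. *)
let rec map_ints (f : int -> int) (e : expr) : expr =
  match e with
  | Int n -> Int (f n)
  | Add (a, b) -> Add (map_ints f a, map_ints f b)
  | Rewrite (g, body) -> Rewrite (g, map_ints f body)

(* Two example rewriters of type expr -> expr. *)
let double_all : expr -> expr = map_ints (fun n -> 2 * n)
let succ_all : expr -> expr = map_ints (fun n -> n + 1)

(* Expansion: each rewriter receives only the subtree that follows it, and
   its result is spliced in place.  Nested applications expand inside out. *)
let rec expand (e : expr) : expr =
  match e with
  | Int _ -> e
  | Add (a, b) -> Add (expand a, expand b)
  | Rewrite (f, body) -> f (expand body)

(* Locality: the left operand is outside the annotated subtree. *)
let local = expand (Add (Int 1, Rewrite (double_all, Int 1)))

(* Composition is read off the source: the outer rewriter is applied to the
   result of the inner one. *)
let double_of_succ = expand (Rewrite (double_all, Rewrite (succ_all, Int 1)))
let succ_of_double = expand (Rewrite (succ_all, Rewrite (double_all, Int 1)))
```

Swapping the two annotations changes the result, but the order is now
written in the program text rather than in the build configuration.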

It should be possible to write almost all extensions in this manner.
A variant of the stream parser syntax fits easily:

    (@parser)
       ([ `If; x = expr; `Then; y = expr; `Else; z = expr ] => "if";
        [ `Let; `Ident x; `Equal; x = expr; `In; y = expr ] => "let")

as does Anil's cstruct extension:

    (@cstruct ~endianness:little)
    type pcap_header = {
       uint32_t magic_number;   (* magic number *)
       uint16_t version_major;  (* major version number *)
       ...
    }

Other extensions such as ifdef, js_of_ocaml, and pgsql could be
handled in the same sort of way.

Jeremy.

[I'm deliberately avoiding the interesting but orthogonal questions of
custom lexical syntax, and benign annotations for tools here.]