Staged Generic Programming

As with all efforts to simplify programming so that systems of greater complexity can be more easily created, there are issues to consider.

4. Staging SYB, Carefully

The naively staged SYB shows dramatic improvements over the unstaged library on standard benchmarks (Section 3.3). However, there are two reasons to pause before celebrating. First, the code generated by the staged version is still significantly slower than hand-written code – over twice as slow on the SelectInt benchmark. Second, none of the benchmarks represent situations where the hand-written code ignores large parts of the structure, and in such situations the naively staged code exhibits very poor performance. For example, when listify evenp is applied to a string list the resulting list is certain to be empty, and the optimal code is the constant function:

fun _ []

However, the naively staged SYB generates code that traverses the entire list, recursing via references and appending empty lists:

let () = f := (fun _  [])

let () = g := (fun l match l with | [] [] | h :: t !f h !g t)

There is evidently substantial room for improvement here. 

The effectiveness of staging is often improved by applying binding-time improvements – program transformations such as CPS conversion or eta expansion that allow more expressions to be classified as static. This section explores a succession of binding-time improvements to the staged SYB, resulting in a second round of drastic performance improvements and bringing the generated output much closer to hand-written code.


4.1 Recursion and Inlining 

The implementation in Section 3 inherits SYB’s agnosticism about recursion in type definitions, treating all types involved in a traversal as though they were defined in a mutually recursive group, and so generating a mutually-recursive group of bindings. This is a sound over-approximation, but has two drawbacks for performance. First, there is a certain overhead in each function call, particularly when every function call goes through a mutable cell. Second, the scheme is unfavourable to further optimizations, since it makes inlining difficult; each function definition consequently serves as a boundary between static and dynamic code.

1 let gfixQ f =
2  let m = new_map () in
3  let rec h {DȷDATA} x =
4   match lookup {D.Typeable} !m˛ peers {D} with
5   | Some g˛ _  R.dyn .< g x >.
6   | None˛ lazy [] 
7     f h {D} x
8   | None˛ lazy [_] 
9     let g = genrec1 ˘fun j  push m j;
10                             fun y  (f h y))
11     in .< g x >.
12  | None˛ _ 
13    let g = genrec ˘fun j 
14        push m j; .< fun y  .˜(f h .<y>.) >.)
15     in .< g x >.
16 in h
Fig. 11. A gfixQ for more precise recursion

  module type PS =
  sig
    type t and sta
    val dyn : sta code  t
    val cd : t  sta code
    val now : t  sta option
  end
Fig. 12. An interface to partially-static data

Furthermore, it is the over-approximation that makes the encoding with reference cells necessary. In practice, mutual recursion between large numbers of types is rare: the common case is for cycles in type definitions to involve only a single type. There is no difficulty in generating a single recursive binding using let rec; it is only when the number of bindings is unknown that difficulties arise. Figure 10 outlines a superior approach: starting from the types involved in a traversal (Figure 10(a)), the staged code should make use of the dependency graph between types to determine where the cycles occur (Figure 10(b)); then, rather than a single large group tied together with references (Figure 10(c)), the generated code should consist of a sequence of mutually-recursive groups (Figure 10(d)). This approach may be further refined by inlining all the functions generated from non-recursive type definitions (Figure 10(e)). The improved scheme avoids the performance cost arising from over-approximating recursion between types. More importantly, inlining enables many more opportunities for optimization. (This is no surprise: exposing optimization opportunities is generally the main benefit of inlining. For example, generating code for listify evenp at the type bool ˙ string involves generating functions at the types bool and string. For both bool and string the generated code ignores its argument, returning [], the default value passed to mkQ. The generated code for bool ˙ string calls the generated functions for bool and string, and appends the results:

fun (x˛ y) !f x !f y

This is clearly sub-optimal: both lists are statically known to be empty and so there is no need for the generated code to perform an append. Inlining alone is not sufficient to eliminate the dynamic append, but it is an essential prerequisite both for that and for many more powerful optimizations described in the following sections (Section 4.2-Section 4.6). 

Figure 11 gives an implementation of an improved gfixQ that takes proper account of the recursive structure of types. The key new detail is the use of the function peers that returns a list of those types that partake in a cycle with D.t. (It is perhaps worth noting that peers may itself be defined as a function using the SYB framework, since gmapQ may be used together with delimited control to retrieve a list of the types immediately referenced by a type with a DATA instance, which is all that Proc. ACM Program. Lang., Vol. 1, No. ICFP, Article 29. Publication date: September 2017. 29:14 Jeremy Yallop is needed to find the cycles.) There are then four cases to consider: either the function is already in the table (line 5); or there are no peers (line 6), in which case the function is non-recursive and is applied directly (i.e. inlined), not inserted in the table; or there is a single peer (line 8), in which case the function genrec1 (not shown) inserts a single let rec binding; or there are multiple peers (line 11), in which case gfixQ falls back to the old scheme. 

It would, of course, be easy to add a few more cases to handle other small fixed-size recursive groups.

With this improved gfixQ, the generated code for listify evenp at type (int ˙ string) list is as follows:

let rec evenslist x = match x with
| [] [] | h::t let (i˛ s) = h in
(if evenp i then [i] else []) [] evenslist t

The generated code for listify evenp at the types int and string has been inlined, and there is an obvious opportunity for optimization (i.e. eliminating the append of []), which the next section considers.


4.2 Partially-Static Data

Several of the code excerpts seen so far have involved dynamic values with some statically-known structure. For example, Figure 2b includes quotations involving list literals of known length such as .<[]>. and .<[x]>.. Quoted literals often suggest missed opportunities for static computation; for example, in the closing example of Section 4.1, the subexpression (if evenp i then [i] else []) [] can be simplified to eliminate the append.

As with the over-approximation of recursive structure, the root cause is that the binding-time analysis is too crude, marking an entire expression as dynamic whenever any sub-expression is dynamic. Every implementation of the gmapQ function must have the same return type ('u code); the result is that even fully-static expressions such as [] are marked as dynamic and quoted (Figure 2b, line 11).

One solution to this problem is partially-static data, well known in the partial evaluation literature as a binding-time, and often used in writing staged programs. As the name suggests, partially-static data are built partly from staticallyknown values, and partly from unknown dynamic values, and typically support some form of computation on the static portions.

Figure 12 defines a basic interface to partially-static data. The PS interface exposes two types: t, the partially-static type, and sta, the fully-static type. The functions dyn and cd convert back and forth between partially-static and fully-dynamic values; the function now attempts to extract a fully-static value from a partially-static one. Instances of PS typically expose additional constructors for injecting static data into t, along with operations for computing with partially-static values.

There are many useful instances of PS. The binding-times of listify may be improved by defining a type of partially-static lists, whose elements are dynamic, and whose spines may be partly static and partly dynamic (Figure 13). The result of appending two such partially-static lists, l <¸>r, results in static reduction in the case where l has a static suffix and r a static prefix (Figure 14). Furthermore, the static value [] acts as a left and right identity for append, which is needed to simplify the code from the end of Section 4.1.

Partially-static data and algebraic structure. The discussion above suggests a connection between partially-static data and algebraic structure that often proves useful. Several other generic schemes


Fig. 13. Partially-static lists: adjacent static parts merge

type 'a ps_list =

Empty

| SCons of 'a code list ˙ 'a ps_list

| DCons of 'a list code ˙ 'a ps_list

let rec (<¸>) l r = match l with

| Empty → r

| SCons (s˛ tl) → (match tl <¸> r with

| SCons (u˛ tl) → SCons (s ␣ u˛ tl)

| Empty | DCons _ as r → SCons (s˛ r))

| DCons (s˛ tl) → (match tl <¸> r with

| DCons (u˛tl) → DCons (.<.˜s ␣ .˜u>.˛tl)

| Empty | SCons _ as r → DCons (s˛ r))

Fig. 14. Partially-static lists with append

benefit from partially-static data designed to take advantage of the algebraic laws of the underlying structure: as with partially-static lists, the string monoid proves useful when defining the generic pretty-printer gshow; commutative monoids appear when defining gsize; tropical semirings (formed from addition and max) arise in the definition of gdepth (Section 5). Incorporating partially-static data into the staged SYB implementation involves a few changes. First, the original flexibility in the return type of genericQ is restored; a generic query no longer unconditionally returns dynamic values:

type 'u genericQ = {DȷDATA} → D.t code → 'u

Second, gfixQ acquires a PS constraint, requiring that the return type of a query satisfies the partially-static interface:

val gfixQ ȷ {PȷPS} (P.t genericQ P.t genericQ) P.t genericQ

In practice the constraint is no hardship, since partially-static data subsume the old fully-dynamic data. The body of gfixQ must be updated with dyn and cd accordingly. Finally, schemes such as listify must be written in terms of partially-static data; in practice this means replacing list operations such as concat with corresponding operations for ps_list (Figure 14). With these changes the superfluous appends disappear from generated code. Here is the updated output of listify evenp at the type (int ˙ string) list:

let rec evenslist x = match x with | [] [] | h::t let (i˛ s) = h in

(if evenp i then [i] else []) ␣ evenslist t

1  module type rec DATA =
2  sig
3    type t and t_
4   module Typeable : TYPEABLE with
 type t = t
5    val gmapQ : 'u genericQ → t_ → 'u list
6    val reify : t code  ˘t_ 
 'a code¯ → 'a code
7  end
8
9  type ('a˛ 'r) list_ =
10      Nil
11    | Cons of 'a code ⋆ '
r
            12 implicit module rec Data_list 
{A: DATA} =
13 struct
14 type t = A.t list and t_ =
 (A.t˛ t code) list_
15 let gmapQ q l = match l with
16                 | Nil  []

                        17                 | Cons (h
˛ t)  [q h; q t
]
18
19 let reify c k = .< match 
c with
20                  | []  .˜(
k Nil)
21                  | h :: t  .˜(
k (Cons (.<h>.˛.<t
>

    )))>.
22 end

Fig. 15. Extended DATA with reify


4.3 Reification (and Reflection)

The staged gmapQ (Figure 2b, lines 17-19) combines two operations: it builds a code value that examines a datum, and it applies a generic query q to the sub-values of that datum. These operations need not always occur together; in fact, it is useful to decompose gmapQ into its constituent parts.

Figure 15 extends the DATA interface with a function reify that serves the first function, and a type t_ that describes the input to the second argument of reify. More precisely, t_ represents the one-step unrolling of t, with a constructor corresponding to each constructor of t, but with dynamic rather than static arguments. For example, the type A.t list (Figure 15, line 10) has a constructor ȷȷ (pronounced łconsž) with arguments of type A.t and A.t list. Accordingly, the type (A.t˛ t code) list_ (also Figure 15, line 10) has a constructor Cons with arguments of type A.t code and A.t list code.

With the addition of reify, the body of gmapQ is identical to the unstaged version, except for the names of the constructors. (However, the types differ.)

The łcontinuationž argument to reify offers considerable flexibility. For example, reify c (gmapQ q) behaves like the staged gmapQ from Figure 2b, while reify c id, where id is the identity function, is a function that can be used as the dyn injection to treat a DATA instance as partially static. The reflect function (not shown) is an inverse to reify, satisfying reify c reflect ≡ c up to observational equivalence. While reify is useful for analysing values described by DATA, reflect is useful for constructing values. A staged version of the generic traversal gmapT, which rebuilds a term after transforming its sub-values, may be conveniently expressed by pre-composing reify and postcomposing reflect.

The sections that follow give further uses for reify, which provides a basis for code motion (Section 4.4) and for statically exploring and eliminating dynamic branches (Section 4.5, Section 4.6).

κ[dyn                                   dyn
.< match x with                           .< match x with

    | []  .˜(cd e1)
             | []  
.˜(
cd κ[e1])
| h::t  .˜(cd e2)>.]
 | h::t  .˜(cd 
κ[
e2])>.

Fig. 16. Lifting match across continuation frames

val case_locus: {P:PS}  (unit  P.t)  P.t
val reify: {P:PS}  {D:DATA}  D.t code  (D.t_  P.t)  P.t

Fig. 17. case insertion, with partially-static data

reify, eta, and The Trick. The one-step dynamic unrolling that provides access to the top-level static structure of a dynamic value corresponds directly to a well-known binding-time improvement known in the partial evaluation community as The Trick. From another perspective, the reify function for a DATA instance simply performs the appropriate eta expansion for the type t, enabling a more favourable binding-time analysis.


4.4 Case Lifting

The reify function (Section 4.3) exposes the top-level structure of data to allow static computation with dynamic values. However, the inserted match expressions form a boundary between the static and dynamic parts of the program. For example, consider the code generated for listify evenp at the type int ˙ int:

< let (x˛ y) = p in if evenp x then .˜(cd [x]) else .˜(cd []) @ if evenp y then .˜(cd [y]) else .˜(cd [])>.

There are four possible outcomes for the two evenp tests, and for each outcome the length of the result list is statically known. However, there is no static context that encloses enough data to build any of the lists statically. Transforming the program to lift one of the if expressions to the highest-possible point, just below where x and y are bound, exposes more opportunities for optimization:

< let (x˛ y) = p in if evenp x then .˜(.< if evenp y then .˜(cd ([x] [y])) else .˜(cd [x])>.) .˜(.< if evenp y then .˜(cd [y]) else .˜(cd [])>.) >.

This is not an anomalous example; the case lifting transformation increases the scope of static variables, which is generally a binding-time improvement. Figure 16 shows the interaction with contexts in the general case: the context κ will not further reduce the fully-dynamic match expression, but lifting the match across the context plugs in the possibly-static values e1 and e2, possibly exposing new redexes.

Interface and implementation. Figure 17 gives an interface for case lifting which resembles the let-insertion interface of Figure 4a. The case_locus function, analogous to let_locus, marks a place where bindings may be inserted. The reify function is a generic version of the reify member of DATA (Figure 15); it interacts with the reify member through the continuation argument, hoisting the binding generated by reify to a case_locus point and interposing the context κ to shuffle the

let f x = match x with
  Left x  []                     let f x = []
 | Right y  []
       a. before pruning            b. after pruning
Fig. 18. Pruning listify evenp at (bool˛string) either
let rec r l = match l with
   []  []                               let r l = []
 | h::t  r t
 a. before recursive pruning      b. after recursive pruning

Fig. 19. Pruning listify evenp at bool list

continuation frames as depicted in Figure 16. Since κ is duplicated in each branch the implementation makes use of multi-shot delimited control.

Case lifting elsewhere. While let-insertion is a common feature of staged programs, full case lifting appears much rarer in practice, although less powerful forms of if insertion are occasionally seen. One possible reason is that duplicating continuations is often unwise; while in our setting it is commonly the case that static computation will substantially reduce the resulting code, that may not be true elsewhere. Second, while let insertion is polymorphic in the type of the expression, case lifting is data-specific. The DATA constraint provides a convenient basis for lifting arbitrary match expressions; without DATA, case lifting would have to be defined separately for every type.

Case lifting for binary sums is, however, found in the broader literature relating to normalization by evaluation and type-directed partial evaluation.


4.5 Branch Pruning Here is a definition of binary sums in OCaml:

type ('a˛ 'b) either = Left of 'a | Right of 'b

The generic function listify evenp, applied to a value v of type (a˛ b) either, must first determine whether v is Left x or Right y and then return a list of even integers found in x or y. If both x and y are types such as bool or string that cannot contain integers, then the list will be empty in both cases (Figure 18a).

In such cases, where it is statically apparent that all branches of a match are equal and independent of the variables bound by the match, the match may be eliminated altogether, and replaced with the value of the branch (Figure 18b) (provided, of course, that the scrutinee is free of effects!).

This is an instance of eta contraction, which applies to all algebraic data types, not just to simple binary sums.

inary sums. This branch pruning optimization enjoys two favourable interactions with inlining (Section 4.1) and partially-static data (Section 4.2). First, reducing partially static data makes it more likely that the branches of the match will produce evidently-equal values. Second, branch pruning eliminates the dynamic matches introduced by reification (Section 4.3), making their static results available for further computation.

The eta rule therefore appears twice in converting a fully dynamic value to a fully static value: first, reification statically exposes the top-level structure of a dynamic value. Here is an example, with a dynamic value . . of type (string˛ bool) either

listify evenp . .

The effect of reification is to eta-expand .<v>.:

dyn .< match v with Left x .˜(listify evenp .<x>.) | Right y .˜(listify evenp .<y>.) >.

Next, computation with partially-static data simplifies the branches:

dyn .< match v with Left x .˜(sta []) | Right y .˜(sta []) >.

Finally, the branches are determined to be equal, and the result is eta reduced again, leaving only a static value.

sta []

Implementation. A natural implementation of pruning is a wrapper around reify that passes into reify a continuation that accumulates each value k v. When all the values have been accumulated they are compared for equality; if they can be determined to be static and equal then the whole expression built by reify is replaced with the static value.


4.6 Recursive Branch Elimination

The code in Figure 19(a), generated by the instantiation of listify evenp at the type bool list, is an example of an unnecessary branch that is not eliminated by the branch pruning optimization of Section 4.5.

The simplicity of the code is a consequence of the optimizations introduced so far: inlining brought the instantiation of listify for bool into the body of r; reification expanded the value to examine the true and false cases; branch pruning eliminated the match, having determined that both branches were static empty lists, and partially-static data eliminated the resulting append of an empty list.

However, the recursive data type (list), which led to the generation of the recursive function, prevented further optimizations from taking place. The difficulty arises because the recursive call r t is not static, and so the approach in Section 4.5 is not sufficiently powerful to detect that the code can be simplified. Nevertheless, it is both evident that the function always returns the empty list, and important that the code should be simplified, since the list it traverses may be arbitrarily long.
How might Section 4.5 be generalized to the recursive case? The solution is to determine a static fixed point. Starting from the assumption that r always returns the empty list, every recursive call to r is replaced with the static empty list value. If the branches are then all statically equal to the empty list then it is legitimate to eliminate the whole recursion (Figure 19(b)). This approach naturally generalises both to mutual recursion, where the starting assumption is that all the functions in a recursive binding group return zero, and to monoids other than lists. (Strictly speaking, it is also necessary to check that the recursion is well-founded, or we will make the embarrassing optimization of replacing a non-terminating loop with a constant. However, it is likely that all recursion in the SYB setting is well-founded.)

  let rec gshow {D:DATA} v =
    show_constructor (constructor v) (gmapQ gshow v)
Fig. 20a. gshow, unstaged
let gshow = gfixQ (fun f {D:DATA} v 
                     (show_constructor (constructor v) (gmapQ f v)))
Fig. 20b. gshow, staged
let gshow_list_bool =
  let sl = ref dummy in let sb = ref dummy in
  let () = sb := fun b  apply_constructor (string_of_bool b) [] in
  let () = sl := fun l  apply_constructor
                       (match l with []  "[]" | _::_  "::")
                       (match l with []  []
                                   | h::t  [!sb h; !sl t])
  in fun x  !sl x
Fig. 20c. gshow, naively staged: generated code (instantiated at bool list, slightly simplified)
let rec r l = match l with
  | []  "[]"
  | hȷȷt  if h then "(true :: "^ r t ^")"
           else "(false :: "^ r t ^")"
Fig. 20d. gshow, carefully staged: generated code (instantiated at bool list)

Implementation. The implementation is broadly similar to the implementation of branch pruning (Section 4.5): the gfixQMon combinator, an extension of gfixQ with an additional MONOID constraint (ensuring that the result type has a zero element) interacts with reify to extract the results of each branch in an environment where all functions in the local recursive group return zero. If the results of every branch are zero for every function in the group then (under the well-foundedness assumption mentioned above) the functions in the group may be replaced with constant zero functions and inlined in any bindings subsequently generated.

Here is a simple example: when listify evenp is instantiated with argument type int ˙ (bool list), the fixed-point calculation determines that traversing the second component of the pair always returns an empty list and generates the following efficient code:

let evenspair p = let (i˛ l) = p in if evenp i then [i] else []