

6. Evaluation: Performance

The code generated by the improved staging library is evidently clearer, shorter, and simpler than the code generated by the naive staging. We now confirm that it also performs better.

Figure 24 compares the performance on five representative benchmarks of the unstaged SYB (Section 2), the naively-staged version (Section 3), the more carefully staged version (Section 4), and a hand-written version. The first three benchmarks, Map, SelectInt, and RMWeights, are the same as those used to evaluate the naive staging in Section 3.3. The two additional benchmarks, Size and Show, are introduced in this work; they are drawn from Section 5, and illustrate how the more sophisticated optimizations of Section 4 improve the performance of traversals that do not simply visit each node in a data structure.

All measurements, both for these benchmarks and those in Section 3.3, were made using the 4.02.1+modular-implicits-ber fork of MetaOCaml, which is the most recent available version with support for implicits. The benchmarks were run on an AMD FX 8320 machine with 16GB of memory running Debian Linux. With the exception of the Map benchmark, whose times have 95% confidence intervals within ±5%, all timings have 95% confidence intervals within ±2%. The measurements were taken using core-bench, a sophisticated micro-benchmarking library that accounts for garbage collection overheads and automatically computes the number of runs needed to eliminate the cost of the timing function from the measurements. Each measurement reported in Figures 9 and 24 is thus computed by core-bench from thousands of runs of the specified function.
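The benchmark harness itself is not reproduced here; for concreteness, a core-bench driver has roughly the following shape. This is a sketch only: test_tree, map_unstaged, and map_staged are placeholder names rather than identifiers from the library, mapTree is the hand-written traversal shown below, and module paths differ slightly across core-bench releases.

open Core
open Core_bench

(* core-bench runs each thunk repeatedly, choosing how many runs it needs
   and subtracting the cost of the timing machinery from the results. *)
let () =
  Command.run (Bench.make_command [
    Bench.Test.create ~name:"Map/unstaged"
      (fun () -> ignore (map_unstaged test_tree));
    Bench.Test.create ~name:"Map/staged"
      (fun () -> ignore (map_staged test_tree));
    Bench.Test.create ~name:"Map/hand-written"
      (fun () -> ignore (mapTree (fun x -> x + 1) test_tree));
  ])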

The hand-written code for each benchmark is written in idiomatic functional style, prioritizing modularity over low-level performance tricks. For example, code for the Map benchmark first defines a function mapTree, which is then applied to the successor function (rather than, say, inlining succ within mapTree):

let rec mapTree f t = match t with
| Leaf -> Leaf
| Bin (v, l, r) -> Bin (f v, mapTree f l, mapTree f r)
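For instance, assuming a tree type along the following lines (the declaration that mapTree presupposes; the exact one used in the benchmarks is not shown here), the benchmark body simply passes the successor function to mapTree. The name bench_map is a placeholder.

type 'a tree = Leaf | Bin of 'a * 'a tree * 'a tree

(* The traversal stays generic in f; the successor function is supplied
   at the call site rather than inlined into mapTree. *)
let bench_map (t : int tree) : int tree = mapTree succ t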

Similarly, the hand-written code for the printing benchmark Show uses a combinator per type constructor rather than fusing the code together as in Figure 20d. Since the output of a multi-stage program is OCaml code, it is always possible in principle, though rarely advisable in practice, to write by hand code identical to the output of a staged library, and so achieve identical performance.
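In that style each type constructor's printer is an ordinary function that receives the printers for its type parameters as arguments, and composition happens at the call site. A minimal sketch follows (show_bool and show_pair are illustrative, not the benchmark's actual definitions; show_list is defined later in this section):

(* One combinator per type constructor; nothing is fused. *)
let show_bool b = if b then "true" else "false"
let show_pair f g (x, y) = "(" ^ f x ^ ", " ^ g y ^ ")"

(* A printer for (bool * bool) list is assembled by composition:
     show_list (show_pair show_bool show_bool)                     *)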

With the carefully staged library, Map runs almost 20× faster than the unstaged generic version, slightly faster than the naively staged version, and a little faster than the hand-written code, apparently because the library inlines the successor function.

The improvement in SelectInt is more dramatic: the carefully staged version is over 22× as fast as the unstaged generic version and more than twice as fast as the naively staged version, with no remaining overhead compared to the hand-written code.


Fig. 24. Enhanced staged SYB Performance

The RMWeights benchmark also improves on the naively staged version, although a little residual overhead remains compared to the hand-written version.

The results of the two new benchmarks are more notable. For Size, the carefully staged version is almost 30× as fast as the unstaged version; more remarkably, it is over 4× as fast as the hand-written code, thanks to the fusing together of generated functions and the shifting of arithmetic work to compile time (Figure 21d).
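To illustrate the kind of arithmetic that moves to compile time, the following sketch (not the actual generated code of Figure 21d) shows a fused size function for lists of ints in which the per-node constant has already been summed during code generation:

(* Illustrative only.  The constant 2 = 1 (for the cons cell) + 1 (for the
   int it carries) is computed when the code is generated, so the residual
   program performs a single addition per list element. *)
let rec size_int_list = function
  | [] -> 1
  | _ :: rest -> 2 + size_int_list rest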

The carefully staged version of Show is also faster than the hand-written code, but there is an additional surprise: the unstaged generic version also beats the hand-written entry! Examining the hand-written code for printing lists uncovers the reason: the string concatenation operator ^ is right-associative in OCaml, and so the naive implementation copies the long strings on the right, produced by the recursive calls to show_list, many times:

let rec show_list f l = match l with
| [] -> "[]"
| h :: t -> "(" ^ f h ^ " :: " ^ show_list f t ^ ")"

Parenthesizing to avoid right nesting brings the running time down from 1979µs to 1022µs, almost as fast as the generated code that fuses together the printers for lists and bools (Figure 20d).
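One such rewriting (a sketch; the exact code measured is not shown) groups the short constant pieces together so that the long result of the recursive call takes part in only two concatenations per element rather than four:

(* The bracketing keeps the long string produced by the recursive call out
   of the deeply right-nested chain: it is now copied twice per element
   instead of four times. *)
let rec show_list' f l = match l with
| [] -> "[]"
| h :: t -> ("(" ^ f h ^ " :: ") ^ (show_list' f t ^ ")")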