Tail calls and locals

David Vulakh edited this page Jul 15, 2024 · 1 revision

Background & Goals

Tail call optimisation

Normally, calling a function pushes another frame onto the stack, so that the amount of stack space used grows with each nested function call. However, OCaml supports "tail call optimisation" (TCO), where functions pop their own stack frame before performing a "tail call", allowing e.g. the following recursive function to run in constant space:

let rec iter n =
  Printf.printf "%d\n" n;
  if n > 0 then iter (n - 1)
;;

TCO is not restricted to let rec definitions: it applies to any call that appears "in tail position" (that is, as the last thing a function does before returning). For instance, the following loop also runs in constant space:

let maybe_call n f = if n > 0 then f n

let rec iter2 n =
  Printf.printf "%d\n" n;
  maybe_call n (fun n -> iter2 (n - 1))
;;

To make this run in constant space, OCaml needs to apply TCO to three different calls:

  • in maybe_call, the call f n
  • in iter2, the call to maybe_call
  • in the closure inside iter2, the call to iter2 (n - 1)

Importantly, getting this loop to run in constant space requires TCO even on nonrecursive functions like maybe_call.

Locals and stack allocation

Locals allow many values to be allocated on the stack, which is cheaper than standard GC allocation. However, this interacts poorly with tail call optimisation: if we want to guarantee that a tail-recursive loop runs in constant space, then we need to know that each iteration of the loop is not going to accumulate junk on the stack.

So, the locals typechecker forces all arguments to tail calls to be global, meaning that any allocations occur on the heap rather than on the stack.

This fact prevents a vast amount of automatic stack allocation, because TCO applies to any call that happens to be the last thing that a function does. For instance:

let print_all xs = List.iter xs ~f:(fun x -> Printf.printf "%s\n" x)

The call to List.iter is in fact in tail position, so TCO applies (even though this function is not part of any recursive loop). This forces each argument to be global, meaning that despite the local annotation on the signature of List.iter, the closure is unnecessarily allocated on the heap.

Distinguishing tail recursion from normal calls

So, the central problem is to spot the difference between functions like iter, which are tail-recursive and need TCO applied (with the strict locals checking that implies), and functions like print_all, which happen to end in a call but need not have TCO and would prefer more relaxed locals checking.

The problem is complicated by functions like maybe_call / iter2, which need TCO to run in constant space but include definitions like maybe_call that do not look like they require TCO.

In tricky cases, we'd of course prefer if the programmer told us what they want, which they can do with the [@tail] and [@nontail] annotations. However, these are newish and not very widespread, and the problem remains of making reasonable decisions in the absence of a clear annotation.
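As a sketch of how the annotations look (assuming oxcaml's [@tail] / [@nontail] attributes attach to the call expression; the stdlib List.iter argument order is used here rather than the labelled Core version):

```ocaml
(* [@tail]: assert that this call must be compiled as a tail call. *)
let rec countdown n =
  Printf.printf "%d\n" n;
  if n > 0 then (countdown (n - 1) [@tail])

(* [@nontail]: this call merely happens to sit in tail position; TCO is not
   required, so its arguments are free to be stack-allocated. *)
let print_all xs =
  (List.iter (fun x -> Printf.printf "%s\n" x) xs [@nontail])
```

A stock compiler simply ignores unknown attributes, so this sketch illustrates only the intended placement of the annotations, not their effect.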

Detecting tail recursion in compiled code

It's possible to detect tail-recursive loops by looking at the compiler IR towards the end of compilation, when most calls have been resolved to their precise targets. Tail-recursive loops show up as cycles in the directed graph whose nodes are functions and whose edges are tail calls, and can be found by e.g. a variant of Tarjan's SCC algorithm.
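A hedged sketch of that detection (not the compiler's actual code): represent the tail-call graph as a hypothetical assoc list from function name to tail-call targets, and compute strongly connected components with Tarjan's algorithm; a tail call lies on a loop exactly when its caller and callee land in the same component.

```ocaml
(* Tarjan's SCC algorithm over a tail-call graph given as an assoc list. *)
let sccs (graph : (string * string list) list) : string list list =
  let index = Hashtbl.create 16
  and lowlink = Hashtbl.create 16
  and on_stack = Hashtbl.create 16 in
  let stack = ref [] and next = ref 0 and components = ref [] in
  let rec strongconnect v =
    Hashtbl.replace index v !next;
    Hashtbl.replace lowlink v !next;
    incr next;
    stack := v :: !stack;
    Hashtbl.replace on_stack v ();
    let succs = try List.assoc v graph with Not_found -> [] in
    List.iter
      (fun w ->
        if not (Hashtbl.mem index w) then begin
          strongconnect w;
          Hashtbl.replace lowlink v
            (min (Hashtbl.find lowlink v) (Hashtbl.find lowlink w))
        end
        else if Hashtbl.mem on_stack w then
          Hashtbl.replace lowlink v
            (min (Hashtbl.find lowlink v) (Hashtbl.find index w)))
      succs;
    if Hashtbl.find lowlink v = Hashtbl.find index v then begin
      (* v roots an SCC: pop the stack down to and including v *)
      let rec pop acc =
        match !stack with
        | [] -> acc
        | w :: rest ->
          stack := rest;
          Hashtbl.remove on_stack w;
          if w = v then w :: acc else pop (w :: acc)
      in
      components := pop [] :: !components
    end
  in
  List.iter (fun (v, _) -> if not (Hashtbl.mem index v) then strongconnect v) graph;
  !components

(* The iter2 / maybe_call loop from earlier, with the anonymous closure as a
   third node: all three land in one SCC, so each edge is part of a loop. *)
let () =
  let call_graph =
    [ "iter2", [ "maybe_call" ];
      "maybe_call", [ "closure" ];  (* the call [f n], once resolved *)
      "closure", [ "iter2" ] ]
  in
  match sccs call_graph with
  | [ loop ] ->
    assert (List.sort compare loop = [ "closure"; "iter2"; "maybe_call" ])
  | _ -> assert false
```

Note that a singleton component is only a loop if the function has a tail call to itself; a real implementation would check for such self-edges as well.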

However, this is not a good way to determine whether a call should be compiled as a tail call or not, for a number of reasons:

  • It's hard to predict. Whether this analysis decides a call is part of a tail-recursive loop depends on exactly which calls the compiler manages to resolve to specific targets, which depends on optimisation level, build settings, phase of moon, etc.

  • It's non-local. Whether an edge forms part of a cycle depends on the rest of the graph, so a programmer cannot determine whether a call is a tail recursive call per this analysis by looking at the call: they must look at the whole program.

  • It is decided too late in the compiler. The local vs. global decision is made in the typechecker (since locals show up in e.g. module types). This analysis requires information known only much later on, such as specific call targets.

So, this is not a good way to decide whether a given call should be compiled as a tail call.

Deciding whether a call is tail

Instead, the idea is to make the compilation decision as follows:

  • At each call site in tail position which is not annotated [@tail] or [@nontail], decide whether it should be a tail call by a simple, local, syntactic check.

  • Later, run the SCC analysis above to detect possible tail-recursive loops.

  • Issue a warning if the SCC analysis finds a cycle containing a call which was not annotated [@tail] or [@nontail] and which the syntactic check chose not to apply TCO to.

The idea here is that at least the meaning and behaviour of a program never depends on unstable, non-local facts. Whether a warning is issued does depend on compilation settings, etc., but if it ever occurs the fix is always to annotate the marked call to specify whether it is intended to be [@tail] or [@nontail].

"simple, local, syntactic check"

The ultimate goal of this project is to reach a state where we have the tools needed to pick and adopt the syntactic check. The tradeoffs are:

  • It should be reasonably simple: we're going to have to explain it to people

  • It should be reasonably accurate:

    • Incorrectly deciding that something needs to be a tail call unnecessarily prevents locals being used.

    • Incorrectly deciding that something need not be a tail call can cause runtime failures (stack overflow) or high memory use.

We should be particularly concerned about patterns of recursion that neither the syntactic check nor the backend SCC analysis will pick up on. I'm not sure whether these occur in practice, but a stack overflow in code that used to take O(1) space is scary.

Considerations for the backend analysis

We want our initial backend analysis to be conservative, since we'll be using it to insert [@tail] attributes to prevent stack overflows in existing code once we start using the syntactic heuristic. To do so, we'll try to leverage information available to the backend about which syntactic functions have tail-position calls to which others.

Closer to the backend, we know with certainty for many function applications what syntactic function is being called (we'll call such calls --- where the target is known --- "direct"). If, during our SCC analysis, we find a cycle of direct calls in tail position, we should issue a warning if any of those calls has been inferred to be nontail.

But what about those calls ("indirect" calls) where the target is not statically known? Indirect calls pose a fundamental challenge to the backend analysis. Consider the following simple function:

let f g x =
  let y = Int64.sub x 1L in
  g y
;;

Here's the relevant cmm:

(function{/tmp/foo.ml:1,6-45} camlFoo__f_0_2_code (g/336: val x/337: val)
 (app{/tmp/foo.ml:3,2-5} (load_mut int g/336)
   (alloc{/tmp/foo.ml:2,10-24} 2303 G:"caml_int64_ops"
     (+ (load int (+a x/337 8)) -1))
   g/336 val))

And here's the part where we call g in flambda2:

(apply
 (((Some g/23UV)〈k11〉《k10》(y/28UV)) (args_arity 𝕍=boxed_ℕ𝟞𝟜) 
  (return_arity 𝕍) (call_kind (Function (function_call Indirect_unknown_arity) (alloc_mode Heap)))
  (dbg foo.ml:3,2--5) (inline Default_inlined) (inlining_state (depth 0, arguments O3)) 
  (probe ()) (position Normal)))

We can see the call to g is indirect. So we don't actually know where it will take us. We might be tempted to make the analysis permit the call to g to be inferred nontail so that we can locally allocate y. But consider the following client:

let rec h x =
  match x with
  | 0L -> 0L
  | x -> f h x
;;

cmm:

(function{/tmp/foo.ml:6,10-56} camlFoo__h_1_3_code (x/342: val)
 (if (!= (load int (+a x/342 8)) 0)
   (app{/tmp/foo.ml:9,9-14} G:"camlFoo__f_0_2_code" G:"camlFoo__h_3" x/342
     val)
   L:"camlFoo__int6423"))

flambda2:

(apply
 (((Some
 (coerce Scratch_no_ppx__Foo.camlScratch_no_ppx__Foo__h_3
  (depth my_depth/46N -> (succ my_depth/46N))))〈k18〉《k17》(y/70UV)) 
 (args_arity 𝕍=boxed_ℕ𝟞𝟜) (return_arity 𝕍)
 (call_kind (Function (function_call (Direct camlScratch_no_ppx__Foo__h_1_3_code)) (alloc_mode Heap)))
 (dbg foo.ml:9,9--14;foo.ml:3,2--5) (inline Default_inlined) (inlining_state (depth 10, arguments O3)) 
 (probe ()) (position Normal))))

We can see that h is meant to form a tail-recursive loop.

So we want the backend analysis to complain if it sees that our type-based heuristic decided either of these two tail calls is allowed to be nontail.

To be conservative, the easiest "safe" thing to do is assume that an indirect call can go anywhere. Under this modification of the SCC analysis, we consider the graph of functions and their tail calls (excluding those tail calls explicitly annotated [@nontail]) and check that a tail call is TCO'd if either:

  • an indirect tail call is reachable from it, or
  • it is part of a loop of direct tail calls.
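A minimal sketch of the reachability half of this rule (the names and graph shape here are illustrative, not compiler internals; the direct-loop half is exactly the SCC check above):

```ocaml
(* Tail-call targets are either known functions or unknown (indirect). *)
type callee = Direct of string | Indirect

(* Does some indirect tail call lie downstream of [start]'s tail calls?
   If so, the conservative rule refuses to relax TCO on the way there. *)
let can_reach_indirect (graph : (string * callee list) list) (start : string)
    : bool =
  let visited = Hashtbl.create 16 in
  let rec go f =
    if Hashtbl.mem visited f then false
    else begin
      Hashtbl.add visited f ();
      let callees = try List.assoc f graph with Not_found -> [] in
      List.exists
        (function
          | Indirect -> true  (* could jump anywhere, even back into a loop *)
          | Direct g -> go g)
        callees
    end
  in
  go start

(* [h] tail-calls [f] directly; [f]'s tail call through its argument is
   indirect, so every tail call on the path to it must remain TCO'd. *)
let () =
  let graph = [ "h", [ Direct "f" ]; "f", [ Indirect ] ] in
  assert (can_reach_indirect graph "h");
  assert (not (can_reach_indirect graph "leaf"))
```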

While this analysis is unlikely to let us accidentally break tail-recursive loops, it may well be too conservative to be practical. The reason is that, though the indirect calls we've seen so far have all been pretty suspicious, there are plenty of innocuous higher-order functions that happen to call their arguments in tail position.

Fortunately, in the example h above, we can see that flambda2 has inlined f, so now the call to h has become direct. So we're hopeful that if we run the analysis with sufficient optimization, a large fraction of our calls will become direct.