Copies supersede OptimizationBarrier #20440

Open
stephen-huan opened this issue Dec 11, 2024 · 0 comments

Consider the JAX function

from functools import partial

from jax import Array, jit

@partial(jit, donate_argnums=0)
def f(x: Array) -> tuple[Array, Array]:
    y = x[0, 0]
    x = x.at[0, 0].add(1)
    return x, y

Since XLA has control over scheduling, for efficiency it should schedule the slice first and then the in-place update, avoiding an unnecessary copy. However, the CPU backend specifically chooses to copy twice instead, generating

ENTRY %main.13 (Arg_0.1: f32[10000,10000]) -> (f32[10000,10000], f32[]) {
  %Arg_0.1 = f32[10000,10000]{1,0} parameter(0), metadata={op_name="x"}
  %copy.1 = f32[10000,10000]{1,0} copy(f32[10000,10000]{1,0} %Arg_0.1)
  %copy = f32[10000,10000]{1,0} copy(f32[10000,10000]{1,0} %copy.1)
  %add_dynamic-update-slice_fusion = f32[10000,10000]{1,0} fusion(f32[10000,10000]{1,0} %copy), kind=kLoop, calls=%fused_computation.1, metadata={op_name="jit(g)/jit(main)/scatter-add" source_file="..." source_line=30}
  %slice_bitcast_fusion = f32[] fusion(f32[10000,10000]{1,0} %copy.1), kind=kLoop, calls=%fused_computation, metadata={op_name="jit(g)/jit(main)/squeeze" source_file="..." source_line=29}
  ROOT %tuple.4 = (f32[10000,10000]{1,0}, f32[]) tuple(f32[10000,10000]{1,0} %add_dynamic-update-slice_fusion, f32[] %slice_bitcast_fusion)
}

(I'm not sure why it needs to make two copies here instead of just one, but the important part is that it copies at all.)
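For reproduction, the optimized HLO above can be dumped with JAX's ahead-of-time APIs; a sketch, using an arbitrary smaller shape than the original report:

```python
from functools import partial

import jax.numpy as jnp
from jax import Array, jit


@partial(jit, donate_argnums=0)
def f(x: Array) -> tuple[Array, Array]:
    y = x[0, 0]
    x = x.at[0, 0].add(1)
    return x, y


# Lower, compile, and print the backend-optimized HLO module.
x = jnp.zeros((100, 100), dtype=jnp.float32)
hlo = f.lower(x).compile().as_text()
print(hlo)
```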

By the semantics of lax.optimization_barrier, I would expect that introducing an explicit dependency of x on y would force the slice to happen first, after which liveness analysis should kick in and remove the copies.

@partial(jit, donate_argnums=0)
def f(x: Array) -> tuple[Array, Array]:
    y = x[0, 0]
    x, y = lax.optimization_barrier((x, y))
    x = x.at[0, 0].add(1)
    return x, y

However, what ends up happening is XLA still introduces copies and re-orders the calls, so the generated code is the same as the one shown above. This seems to violate the scheduling control one expects from optimization_barrier.

Note that for this particular example, setting the XLA flag --xla_cpu_copy_insertion_use_region_analysis=true removes the copy and generates

ENTRY %main.13 (Arg_0.1: f32[10000,10000]) -> (f32[10000,10000], f32[]) {
  %Arg_0.1 = f32[10000,10000]{1,0} parameter(0), sharding={replicated}, metadata={op_name="x"}
  %slice_bitcast_fusion = f32[] fusion(f32[10000,10000]{1,0} %Arg_0.1), kind=kLoop, calls=%fused_computation, metadata={op_name="jit(g)/jit(main)/squeeze" source_file="..." source_line=28}
  %add_dynamic-update-slice_fusion = f32[10000,10000]{1,0} fusion(f32[10000,10000]{1,0} %Arg_0.1), kind=kLoop, calls=%fused_computation.1, control-predecessors={%slice_bitcast_fusion}, metadata={op_name="jit(g)/jit(main)/scatter-add" source_file="..." source_line=30}
  ROOT %tuple.4 = (f32[10000,10000]{1,0}, f32[]) tuple(f32[10000,10000]{1,0} %add_dynamic-update-slice_fusion, f32[] %slice_bitcast_fusion)
}

as expected, with or without optimization_barrier. Likewise, a GPU device generates the copy-free

ENTRY %main.13 (Arg_0.1.0: f32[10000,10000]) -> (f32[10000,10000], f32[]) {
  %Arg_0.1.0 = f32[10000,10000]{1,0} parameter(0), metadata={op_name="x"}
  %wrapped_slice = f32[1,1]{1,0} fusion(f32[10000,10000]{1,0} %Arg_0.1.0), kind=kLoop, calls=%wrapped_slice_computation
  %bitcast.43.0 = f32[] bitcast(f32[1,1]{1,0} %wrapped_slice)
  %loop_dynamic_update_slice_fusion = f32[10000,10000]{1,0} fusion(f32[10000,10000]{1,0} %Arg_0.1.0), kind=kLoop, calls=%fused_dynamic_update_slice, control-predecessors={%wrapped_slice}, metadata={op_name="jit(g)/jit(main)/scatter-add" source_file="..." source_line=30}
  ROOT %tuple.5 = (f32[10000,10000]{1,0}, f32[]) tuple(f32[10000,10000]{1,0} %loop_dynamic_update_slice_fusion, f32[] %bitcast.43.0)
}

also with or without optimization_barrier. Finally, the reverse explicit schedule

@partial(jit, donate_argnums=0)
def f(x: Array) -> tuple[Array, Array]:
    z = x.at[0, 0].add(1)
    z, x = lax.optimization_barrier((z, x))
    y = x[0, 0]
    return x, y

which should introduce a copy, does not do so with --xla_cpu_copy_insertion_use_region_analysis=true.
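For reference, a sketch of setting the flag from Python: XLA parses XLA_FLAGS once when the backend initializes, so it must be set before the first JAX computation runs (e.g. before importing jax at the top of the script).

```python
import os

# Append the flag to any flags already present in the environment.
flag = "--xla_cpu_copy_insertion_use_region_analysis=true"
os.environ["XLA_FLAGS"] = (os.environ.get("XLA_FLAGS", "") + " " + flag).strip()
```

Equivalently, export XLA_FLAGS in the shell before launching the script.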

I'm a bit confused about why the flag workaround works now, since region analysis was introduced more than 3 years ago in 92292d1. The core logic of RemoveUnnecessaryCopies and TryElideCopy doesn't seem to have changed much in that time either. What has changed recently is that the flag xla_cpu_copy_insertion_use_region_analysis was added to CPU (disabled by default) (#18521) and region analysis was disabled on GPU (#14680). Is there some context I'm missing?

(originally reported in the discussion jax-ml/jax#19165 and JAX issue jax-ml/jax#25399.)
