Page MenuHomePhabricator

[TRE] Reland: allow TRE for non-capturing calls.
Needs ReviewPublic

Authored by avl on Aug 9 2020, 9:57 AM.

Details

Summary

[TRE] Reland: allow TRE for non-capturing calls.

The D82085 "allow TRE for non-capturing calls" caused failure during bootstrap.
This patch does the same as D82085 plus fixes bootstrap error.

The problem with D82085 is that it does not create copies for byval
operands, while replacing function call with a branch.

Consider following example:

int zoo ( S p1 );

int foo ( int count, S p1 ) {
  if ( count > 10 )
    return zoo(p1);

  // temporarily variable created for passing byvalue parameter
  // p1 could be used when zoo(p1) is called(after TRE is done).
  // lifetime.start p1.byvalue.temp
  return foo(count+1, p1);
  // lifetime.end p1.byvalue.temp
}

After recursive call to foo is replaced with a jump into
start of the function, its parameters could be passed to
zoo function. i.e. temporarily variable created for byvalue
parameter "p1" could be passed to zoo. Finally zoo receives
broken operand:

int foo ( int count, S p1 ) {
:tailrecurse
  p1_tr = phi p1, p1.byvalue.temp
  if ( count > 10 )
    return zoo(p1_tr);

  // temporarily variable created for passing byvalue parameter
  // p1 could be used when zoo(p1) is called(after TRE is done).
  lifetime.start p1.byvalue.temp
  memcpy (p1.byvalue.temp, p1_tr)
  count = count + 1
  lifetime.end p1.byvalue.temp
  br tailrecurse
}

To prevent using p1.byvalue.temp after its scope finished by
lifetime.end marker this patch copies value from p1.byvalue.temp
into another temporarily variable and uses this variable on
next iteration.

This patch passes bootstrap build and bootstrap build with AddressSanitizer.

Diff Detail

Event Timeline

avl created this revision.Aug 9 2020, 9:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 9 2020, 9:57 AM
avl requested review of this revision.Aug 9 2020, 9:57 AM
avl added a comment.Aug 31 2020, 7:21 AM

@efriedma would you mind to take a look at this patch once more, please?

Among all the attributes that can be attached to an argument of a tail call, byval is unusual. It significantly changes the semantics of the call: when the call executes, instead of passing the pointer itself, it makes a copy of the data behind the pointer.

Consider, for example, the following testcase on x86-64:

typedef struct A { long long x[100]; } A;
A global;
int printf(const char*, ...);
void dostuff(A arg, int i) { if (i==100) return; arg.x[5]++; printf("%lld\n", arg.x[i]); dostuff(global, i+1); }
__attribute((optnone)) int main() { dostuff(global, 0); }

dostuff has a byval argument. At the point of the call, the compiled code is supposed to copy that argument onto the stack, so in the callee the stack contains the values of all the array elements. If this is working correctly, the fifth line of the output will be "1": each call copies the value from the global, so we only ever increment the array element once. (With optimizations enabled, clang currently gets this wrong.)

If we're doing tail-call elimination, in general, we need to make this implicit copy explicit. One way to implement this is to generate a pair of copies for each byval argument. First, create a temporary alloca corresponding to each byval argument. Then, copy each byval argument to the call to its temporary alloca. Finally, copy each byval argument to the call from the temporaries to the memory originally allocated for the function's arguments.

The copies have to happen in the order I described in case a byval argument to the call refers to a different byval argument in the caller. For example, if you have something along the lines of void dostuff(A arg, Arg b) { dostuff(b, a); }

You might be able to take shortcuts in some cases. For example, if every byval argument to the call is just forwarding on the corresponding byval argument to the caller, you can just reuse the memory without making any copies. Or if you don't want to implement the general case, it would be fine to make TRE just bail if there's a byval argument.

Hopefully that explains why messing with the lifetime markers isn't the correct solution.

avl added a comment.Sep 21 2020, 5:24 AM

Among all the attributes that can be attached to an argument of a tail call, byval is unusual. It significantly changes the semantics of the call: when the call executes, instead of passing the pointer itself, it makes a copy of the data behind the pointer.

Consider, for example, the following testcase on x86-64:

typedef struct A { long long x[100]; } A;
A global;
int printf(const char*, ...);
void dostuff(A arg, int i) { if (i==100) return; arg.x[5]++; printf("%lld\n", arg.x[i]); dostuff(global, i+1); }
__attribute((optnone)) int main() { dostuff(global, 0); }

dostuff has a byval argument. At the point of the call, the compiled code is supposed to copy that argument onto the stack, so in the callee the stack contains the values of all the array elements. If this is working correctly, the fifth line of the output will be "1": each call copies the value from the global, so we only ever increment the array element once. (With optimizations enabled, clang currently gets this wrong.)

It looks like current patch works correctly with this test case.

If we're doing tail-call elimination, in general, we need to make this implicit copy explicit. One way to implement this is to generate a pair of copies for each byval argument. First, create a temporary alloca corresponding to each byval argument. Then, copy each byval argument to the call to its temporary alloca. Finally, copy each byval argument to the call from the temporaries to the memory originally allocated for the function's arguments.

The copies have to happen in the order I described in case a byval argument to the call refers to a different byval argument in the caller. For example, if you have something along the lines of void dostuff(A arg, Arg b) { dostuff(b, a); }

yep. this patch works incorrectly for "dostuff(A arg, Arg b) { dostuff(b, a); }" case. Thank you for pointing that!
It requires additional temp to store byval argument. will update the patch.

avl updated this revision to Diff 296732.Wed, Oct 7, 10:26 AM

addressed comment. implicit copy replaced with explicit one.

avl edited the summary of this revision. (Show Details)Wed, Oct 7, 10:29 AM
avl edited the summary of this revision. (Show Details)
efriedma added inline comments.Wed, Oct 7, 10:40 AM
llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp
602

This is the right direction, but I'm not sure it does the right thing in general.

In particular, it's possible that CI->getArgOperand(OpndIdx) points to one of the allocas created by createTempForByValOperand(); if it does, then this copy might be clobbering data you need for subsequent copies. Again, consider the void dostuff(A arg, Arg b) { dostuff(b, a); } case.

This is why the general procedure I outlined has an extra step: we copy the temporary allocas to the original byval arguments, and use the byval arguments as the operands to the PHI. That way, CI->getArgOperand(OpndIdx) never points to one of the allocas allocated by createTempForByValOperand.

avl added inline comments.Wed, Oct 7, 11:57 AM
llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp
602

This is the right direction, but I'm not sure it does the right thing in general.

It looks like "void dostuff(A arg, Arg b) { dostuff(b, a); }" test case is working correctly:

$ cat test.cpp
#include <stdio.h>
typedef struct A { long long x[10] = {0}; } A;
A global;
void dostuff(A a, A b, int i) { if (i==10) return; a.x[5]++; printf("%lld %lld\n", a.x[5], b.x[5]); dostuff(b, a, i+1); }
__attribute((optnone)) int main() { dostuff(global, global, 0); }
$ clang++ test.cpp -O3
$ ./a.out
1 0
1 1
2 1
2 2
3 2
3 3
4 3
4 4
5 4
5 5

In particular, it's possible that CI->getArgOperand(OpndIdx) points to one of the allocas created by createTempForByValOperand(); if it does, then this copy might be clobbering data you need for subsequent copies. Again, consider the void dostuff(A arg, Arg b) { dostuff(b, a); } case.

This is why the general procedure I outlined has an extra step: we copy the temporary allocas to the original byval arguments, and use the byval arguments as the operands to the PHI. That way, CI->getArgOperand(OpndIdx) never points to one of the allocas allocated by createTempForByValOperand.

If I am not mistaken the CI->getArgOperand(OpndIdx) could not point to one of the allocas created by createTempForByValOperand(). There is not setArgOperand() which set operand to newly created temps by createTempForByValOperand(). Thus CI->getArgOperand(OpndIdx) always point to the alloca created for recursive call to dostuff().

For the ""void dostuff(A arg, Arg b) { dostuff(b, a); }" test case:

before tre:

%2 = bitcast %struct.A* %agg.tmp to i8*
%3 = bitcast %struct.A* %b to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(80) %2, i8* nonnull align 8 dereferenceable(80) %3, i64 80, i1 false), !tbaa.struct !6
^^^^^^^^^^^^^^^^ copy to %agg.tmp which is first byval argument
%4 = bitcast %struct.A* %agg.tmp5 to i8*
%5 = bitcast %struct.A* %a to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(80) %4, i8* nonnull align 8 dereferenceable(80) %5, i64 80, i1 false), !tbaa.struct !6
^^^^^^^^^^^^^^^^ copy to %agg.tmp5 which is second byval argument
%add = add nsw i32 %i, 1
call void @_Z7dostuff1AS_i(%struct.A* nonnull byval(%struct.A) align 8 %agg.tmp, %struct.A* nonnull byval(%struct.A) align 8 %agg.tmp5, i32 %add)

after tre:

tailrecurse:                                      ; preds = %if.end, %entry
  %a.tr = phi %struct.A* [ %a, %entry ], [ %agg.tmp7, %if.end ]
  %b.tr = phi %struct.A* [ %b, %entry ], [ %agg.tmp58, %if.end ]

if.end:
...
  %2 = bitcast %struct.A* %agg.tmp to i8*
  %3 = bitcast %struct.A* %b.tr to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(80) %2, i8* nonnull align 8 dereferenceable(80) %3, i64 80, i1 false), !tbaa.struct !6
  ^^^^^^^^^^^^^^^^ copy to %agg.tmp which is first byval argument
  %4 = bitcast %struct.A* %agg.tmp5 to i8*
  %5 = bitcast %struct.A* %a.tr to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(80) %4, i8* nonnull align 8 dereferenceable(80) %5, i64 80, i1 false), !tbaa.struct !6
  ^^^^^^^^^^^^^^^^ copy to %agg.tmp5 which is second byval argument
  %add = add nsw i32 %i.tr, 1
  %6 = bitcast %struct.A* %agg.tmp7 to i8*
  %7 = bitcast %struct.A* %agg.tmp to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %6, i8* align 8 %7, i64 80, i1 false)
  ^^^^^^^^^^^^^^^^ copy to the agg.tmp7 which is used as input for next iteration
  %8 = bitcast %struct.A* %agg.tmp58 to i8*
  %9 = bitcast %struct.A* %agg.tmp5 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %8, i8* align 8 %9, i64 80, i1 false)
  ^^^^^^^^^^^^^^^^ copy to the agg.tmp58 which is used as input for next iteration
  br label %tailrecurse

Probably the only difference with your description is that this patch creates additional temporarily variables(agg.tmp7/agg.tmp58) while it could use locations of a/b parameters instead of newly created temps.

efriedma added inline comments.Wed, Oct 7, 12:26 PM
llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp
602

I guess if you're looking at code emitted directly by clang, you'll end up with "extra" aggregates due to the way clang does call lowering. That said, the temporary aggregates aren't actually required; optimizations like memcpyopt can remove them. For example, on master, "clang -O2 -emit-llvm" produces a call like tail call void @_Z7dostuff1AS_i(%struct.A* nonnull byval(%struct.A) align 4 %b, %struct.A* nonnull byval(%struct.A) align 4 %a, i32 %add).

avl added inline comments.Wed, Oct 7, 2:10 PM
llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp
602

right. But for the case "void dostuff(A arg, Arg b) { dostuff(b, a); }" these temps and memcpy are not useless and then they would not be removed by memcpyopt.

Anyway, minimizing number of temps is also useful thing. Will update a patch to use function arguments locations instead of created temps.

efriedma added inline comments.Wed, Oct 7, 2:22 PM
llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp
602

Take a look at the output of clang -O2 -emit-llvm for the following: typedef struct { int x[100]; } A; void foo(A, A); void bar(A a, A b) { return foo(b, a); }. No memcpy.

avl added inline comments.Wed, Oct 7, 2:41 PM
llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp
602

For above test case that is probably correct(since foo is a last call). But for that test case :

typedef struct { int x[100]; } A; void foo(A, A); void bar(A a, A b) { foo(b, a); foo(b,a); }

define void @_Z3bar1AS_(%struct.A* byval nocapture readonly align 8, %struct.A* byval nocapture readonly align 8) local_unnamed_addr #0 {

call void @_Z3foo1AS_(%struct.A* byval nonnull align 8 %1, %struct.A* byval nonnull align 8 %0)
call void @_Z3foo1AS_(%struct.A* byval nonnull align 8 %1, %struct.A* byval nonnull align 8 %0)
ret void

}

What if first foo writes at byvalue location ? In this case second call to foo() would receive incorrect arguments.
Is it correct behavior ?

efriedma added inline comments.Wed, Oct 7, 2:49 PM
llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp
602

byval implicitly makes a copy of the underlying memory. If foo modifies the memory, it's modifying its own copy, not the version in bar.

avl added inline comments.Wed, Oct 7, 3:12 PM
llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp
602

looks like I lost a bit. That : "byval implicitly makes a copy of the underlying memory" is understood. So for that test case "typedef struct { int x[100]; } A; void foo(A, A); void bar(A a, A b) { return foo(b, a); }" -mllvm -print-after-all shows :

%agg.tmp = alloca %struct.A, align 8
%agg.tmp1 = alloca %struct.A, align 8
%0 = bitcast %struct.A* %agg.tmp to i8*
%1 = bitcast %struct.A* %b to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %0, i8* align 8 %1, i64 400, i1 false), !tbaa.struct !2
%2 = bitcast %struct.A* %agg.tmp1 to i8*
%3 = bitcast %struct.A* %a to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %2, i8* align 8 %3, i64 400, i1 false), !tbaa.struct !2
call void @_Z3foo1AS_(%struct.A* byval(%struct.A) align 8 %agg.tmp, %struct.A* byval(%struct.A) align 8 %agg.tmp

Why it is important that output of "clang -O2 -emit-llvm" for the same test case does not contain calls to memcpy ?

efriedma added inline comments.Wed, Oct 7, 3:18 PM
llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp
602

I'm not completely sure what you're asking.

In general, we expect that the TRE pass should be able to handle arbitrary LLVM IR, even if clang with the standard pass pipeline can't generate IR with that pattern. So TRE needs to be able to handle the form where the byval argument is passed directly to a tail call.

avl updated this revision to Diff 297049.Thu, Oct 8, 1:59 PM

use incoming function arguments area for storing byvalue operand of the recursive call.

avl added inline comments.Thu, Oct 8, 2:07 PM
llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp
602

In general, we expect that the TRE pass should be able to handle arbitrary LLVM IR, even if clang with the standard pass pipeline can't generate IR with that pattern. So TRE needs to be able to handle the form where the byval argument is passed directly to a tail call.

The IR for the testcase "typedef struct { int x[100]; } A; void foo(A, A); void bar(A a, A b) { return foo(b, a); }" before TRE loos like this currently :

%0 = bitcast %struct.A* %agg.tmp to i8*
%1 = bitcast %struct.A* %b to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(400) %0, i8* nonnull align 8 dereferenceable(400) %1, i64 400, i1 false), !tbaa.struct !2
%2 = bitcast %struct.A* %agg.tmp1 to i8*
%3 = bitcast %struct.A* %a to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* nonnull align 8 dereferenceable(400) %2, i8* nonnull align 8 dereferenceable(400) %3, i64 400, i1 false), !tbaa.struct !2
tail call void @_Z3foo1AS_(%struct.A* nonnull byval(%struct.A) align 8 %agg.tmp, %struct.A* nonnull byval(%struct.A) align 8 %agg.tmp1)

Did I correctly understand that TRE should also correctly work if above memcpy calls are not generated ?

efriedma added inline comments.Thu, Oct 8, 3:07 PM
llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp
602

Right, it should still work if the memcpy calls are not generated, or optimized out. For example, we shouldn't miscompile if someone writes clang foo.c -o - -S -emit-llvm -O2 -mllvm -disable-llvm-optzns | opt -S -sroa -memcpyopt -instcombine -tailcallelim.

Usually, memcpyopt doesn't run before tailcallelim, but allowing users to specify arbitrary optimizations in any order is part of what makes LLVM IR transforms flexible.

avl updated this revision to Diff 297566.Mon, Oct 12, 6:19 AM

Implemented that algorithm: First, create a temporary alloca
corresponding to each byval argument. Then, copy each byval
argument to the call to its temporary alloca. Finally, copy
each byval argument to the call from the temporaries to the
memory originally allocated for the function's arguments.

efriedma added inline comments.Tue, Oct 20, 12:18 PM
llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp
612

I think you need to be more careful with alignment here: it's UB if you specify alignment higher than the actual alignment of the value. For byval arguments with the alignment set, the alignment is exactly that.

For byval arguments where the alignment attribute is missing, probably fine to just refuse to do TRE. It's an edge case which shouldn't really come up in practice.

avl added inline comments.Wed, Oct 21, 7:57 AM
llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp
612

I think you need to be more careful with alignment here: it's UB if you specify alignment higher than the actual alignment of the value. For byval arguments with the alignment set, the alignment is exactly that.

so, instead of this:

Align Alignment(DL.getPrefTypeAlignment(AggTy));
Alignment = max(Alignment, MaybeAlign(CI->getParamAlign(OpndIdx)));

I need to do just this :

Align Alignment(DL.getPrefTypeAlignment(AggTy));

right?

For byval arguments where the alignment attribute is missing, probably fine to just refuse to do TRE. It's an edge case which shouldn't really come up in practice.

If alignment attribute is missing would it be OK to use Align(1) ?

avl added inline comments.Wed, Oct 21, 9:41 AM
llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp
612

i.e. both above conditions would look like this:

Align StackSlotAlignment(max(1, MaybeAlign(CI->getParamAlign(OpndIdx))));
Align FinalAlignment(min(DL.getPrefTypeAlignment(AggTy), StackSlotAlignment));

would it be correct?

efriedma added inline comments.Wed, Oct 21, 2:04 PM
llvm/lib/Transforms/Scalar/TailRecursionElimination.cpp
612

I don't think you need to mix in getPrefTypeAlignment. I mean, it's not wrong to use a lower alignment than StackSlotAlignment, but there isn't any reason you'd want to.

Instead of max(1, MaybeAlign(CI->getParamAlign(OpndIdx)), you'd write CI->getParamAlign(OpndIdx).valueOrOne(), but sure, that's correct. The code quality might be a bit iffy for byval without alignment, but nothing should be generating code like that anyway, so I guess it's fine.

avl updated this revision to Diff 300008.Thu, Oct 22, 9:07 AM

corrected byvalue operand`s alignment calculation.

Sorry I didn't spot this sooner, but the current version of the memcpy-to-temp still isn't quite right. The operations aren't in the right order: you have to do all the copies to temporaries before any of the copies to arguments.

avl updated this revision to Diff 300278.Fri, Oct 23, 7:00 AM

changed patch so that all temporarily variables created first,
and then copy these variables into incoming parameters.