This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Stop sinking instructions across function call.
Needs Review · Public

Authored by renlin on Jun 6 2018, 3:18 AM.

Details

Reviewers
hfinkel
Summary

The instcombine pass will try to sink an instruction to the place where its value
is used when the CFG is very simple. However, this doesn't take function
calls into account.

While sinking reduces the live range of the result, it increases the live ranges of the related operands.
It is not clear whether, overall, it is beneficial or not.
instcombine might not be the best place to make that kind of decision, especially across function calls.
Here, instruction sinking is prohibited across function calls.

TargetTransformInfo::isLoweredToCall is used to decide whether the function will become a real call or be simplified into a different form.
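
As a rough illustration, here is a minimal sketch of the kind of guard under discussion (the helper name and the iterator-based scan are assumptions for this sketch, not the actual patch):

#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// Hypothetical helper: return true if any instruction in [Begin, End)
// may be lowered to a real call, per TTI::isLoweredToCall. Sinking a
// definition across such a call would stretch the live ranges of its
// operands over the call.
static bool crossesRealCall(BasicBlock::iterator Begin,
                            BasicBlock::iterator End,
                            const TargetTransformInfo &TTI) {
  for (BasicBlock::iterator It = Begin; It != End; ++It) {
    if (auto *Call = dyn_cast<CallInst>(&*It)) {
      const Function *Callee = Call->getCalledFunction();
      // An indirect call has no static callee; conservatively treat it
      // as a real call.
      if (!Callee || TTI.isLoweredToCall(Callee))
        return true;
    }
  }
  return false;
}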

Diff Detail

Event Timeline

renlin created this revision. Jun 6 2018, 3:18 AM
hfinkel added a subscriber: hfinkel. Jun 6 2018, 6:29 AM

I don't see how inlining is relevant here. Sinking across function call boundaries is generally also bad (because increasing register pressure across calls tends to increase spilling). I feel like the test you want here is to call TTI::isLoweredToCall.

renlin added a comment. Jun 6 2018, 7:53 AM

I agree that, generally, it is not a good idea to sink across a function call. I am a little bit conservative about this change.

Thank you for your suggestion. I checked the documentation for TTI::isLoweredToCall.
IIUIC, it seems to me this backend information is still not strong enough.

  • If the backend returns false for a function, it will get expanded by some pass somewhere in the optimization pipeline. This might complicate the CFG, similar to the inlined-function case.
  • If the backend returns true for a function, it is a real function call. This will extend the live ranges of the related variables.

Should we just stop sinking instructions across any type of function call?
Normally, InstCombine is run multiple times; if a function gets expanded, we still have opportunities to sink instructions when it is beneficial.

I agree that, generally, it is not a good idea to sink across a function call. I am a little bit conservative about this change.

Thank you for your suggestion. I checked the documentation for TTI::isLoweredToCall.
IIUIC, it seems to me this backend information is still not strong enough.

  • If the backend returns false for a function, it will get expanded by some pass somewhere in the optimization pipeline. This might complicate the CFG, similar to the inlined-function case.

In theory, yes. In practice, it shouldn't (unless the backend generates a custom expansion including branches, but in such cases the costs are generally low). The purpose of the call, however, is to inform the mid-end optimizer whether or not it should view the call as actually having call-like overheads. This applies to information-only intrinsics and to calls that get lowered to single instructions (e.g., sin/cos on some targets).
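
For a sense of those semantics, a small sketch of how a pass might query the hook (illustrative only: the results depend on the target and on the default TTI implementation, so treat the comments as assumptions):

#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Count the declared functions that keep call-like overheads after
// lowering. Intrinsics such as llvm.assume typically report false here,
// and math routines like sin/cos may also report false on targets that
// expand them inline; an ordinary external function reports true.
unsigned countRealCallees(const Module &M, const TargetTransformInfo &TTI) {
  unsigned RealCallees = 0;
  for (const Function &F : M)
    if (TTI.isLoweredToCall(&F))
      ++RealCallees;
  return RealCallees;
}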

  • If the backend returns true for a function, it is a real function call. This will extend the live ranges of the related variables.

Should we just stop sinking instructions across any type of function call?

No. You would at least need to exclude a whole list of intrinsics. isLoweredToCall will do this for you, and also handle a number of useful cases where the backend expands the call into something cheap. Please use isLoweredToCall, and if we find cases where this turns out to be problematic, we'll figure out how to address them.

Normally, InstCombine is run multiple times; if a function gets expanded, we still have opportunities to sink instructions when it is beneficial.

renlin updated this revision to Diff 150174. Jun 6 2018, 12:16 PM
renlin edited the summary of this revision.
  • Update to use TargetTransformInfo::isLoweredToCall
  • Simplify the logic; remove the backward search from the user.
renlin updated this revision to Diff 150178. Jun 6 2018, 12:22 PM

Generally, the sinking will increase the live range of variables

This isn't a good description. It increases the live range of the operands, and reduces the live range of the result. Which one matters more isn't obvious. And in some cases, the cost of the operation itself might be more important than the cost of spilling.

Granted, instcombine might not be the best place to make that kind of decision; it's very hard to estimate register pressure before isel.
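
A C++ analogue of that tradeoff (illustrative only: helper() and the function names are made up for this sketch, and the actual effect depends on the target's register allocator):

extern void helper(); // opaque callee; stands in for a real call

// Before sinking: 'a' and 'b' die at the add, so only 'sum' has to stay
// live (e.g., in a callee-saved register) across the call.
long computeBefore(long a, long b) {
  long sum = a + b;
  helper();
  return sum;
}

// After sinking the add past the call: both 'a' and 'b' must survive the
// call, which tends to increase register pressure and spilling.
long computeAfter(long a, long b) {
  helper();
  return a + b;
}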

Ka-Ka added a subscriber: Ka-Ka. Jun 6 2018, 2:47 PM
renlin retitled this revision from [InstCombine] Don't sink instructions across inlined function call. to [InstCombine] Stop sinking instructions across inlined function call. Jun 7 2018, 4:20 AM
renlin edited the summary of this revision.
renlin retitled this revision from [InstCombine] Stop sinking instructions across inlined function call. to [InstCombine] Stop sinking instructions across function call.
renlin updated this revision to Diff 151883. Jun 19 2018, 2:51 AM

remove a space

I'm seeing what, but not why.
What is the motivation behind this change?
What problem is it trying to solve?

test/Transforms/InstCombine/sink_across_call.ll
2

instcombine tests use utils/update_test_checks.py
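
For reference, a typical invocation might look like the following (the build-tree path is an assumption; adjust it to your checkout):

python llvm/utils/update_test_checks.py --opt-binary=build/bin/opt \
    test/Transforms/InstCombine/sink_across_call.ll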

renlin updated this revision to Diff 151895. Jun 19 2018, 4:34 AM

Use utils/update_test_checks.py for test case checking.

renlin marked an inline comment as done. Jun 19 2018, 4:37 AM
renlin added inline comments.
test/Transforms/InstCombine/sink_across_call.ll
2

thanks, updated

renlin marked an inline comment as done. Jun 19 2018, 4:41 AM

I'm seeing what, but not why.
What is the motivation behind this change?
What problem is it trying to solve?

In one of the test cases I have, sinking a load instruction (together with the operands it uses) across an inlined function call increases the number of instructions generated.
The inliner pass runs after the InstCombine pass. If I reorder the passes so that the inliner runs before InstCombine, the CFG becomes complex and InstCombine gives up on the sinking.

I'm seeing what, but not why.
What is the motivation behind this change?
What problem is it trying to solve?

In one of the test cases I have, sinking a load instruction (together with the operands it uses) across an inlined function call increases the number of instructions generated.

Instructions as in IR instructions? Or the instructions in the final assembly?

The inliner pass runs after the InstCombine pass. If I reorder the passes so that the inliner runs before InstCombine, the CFG becomes complex and InstCombine gives up on the sinking.

I'm seeing what, but not why.
What is the motivation behind this change?
What problem is it trying to solve?

In one of the test cases I have, sinking a load instruction (together with the operands it uses) across an inlined function call increases the number of instructions generated.

Instructions as in IR instructions? Or the instructions in the final assembly?

Final machine assembly.

I'm seeing what, but not why.
What is the motivation behind this change?
What problem is it trying to solve?

In one of the test cases I have, sinking a load instruction (together with the operands it uses) across an inlined function call increases the number of instructions generated.

Instructions as in IR instructions? Or the instructions in the final assembly?

Final machine assembly.

Perhaps you can add an example test?
There seems to be some precedent for that:

llvm/test/CodeGen/X86$ grep -r instcombine | grep RUN
2009-03-23-i80-fp80.ll:; RUN: opt < %s -instcombine -S | grep 302245289961712575840256
2009-03-23-i80-fp80.ll:; RUN: opt < %s -instcombine -S | grep K40018000000000000000
vec_udiv_to_shift.ll:; RUN: opt < %s -instcombine -S | FileCheck %s
vec_ins_extract.ll:; RUN: opt < %s -sroa -instcombine | \
no-plt-libcalls.ll:; RUN: opt < %s -instcombine -S | FileCheck %s

But I must say, right now this sounds like the problem is elsewhere,
and this change only papers over it by pessimizing all other cases.

I'm seeing what, but not why.
What is the motivation behind this change?
What problem is it trying to solve?

In one of the test cases I have, sinking a load instruction (together with the operands it uses) across an inlined function call increases the number of instructions generated.

Instructions as in IR instructions? Or the instructions in the final assembly?

Final machine assembly.

Perhaps you can add an example test?
There seems to be some precedent for that:

llvm/test/CodeGen/X86$ grep -r instcombine | grep RUN
2009-03-23-i80-fp80.ll:; RUN: opt < %s -instcombine -S | grep 302245289961712575840256
2009-03-23-i80-fp80.ll:; RUN: opt < %s -instcombine -S | grep K40018000000000000000
vec_udiv_to_shift.ll:; RUN: opt < %s -instcombine -S | FileCheck %s
vec_ins_extract.ll:; RUN: opt < %s -sroa -instcombine | \
no-plt-libcalls.ll:; RUN: opt < %s -instcombine -S | FileCheck %s

But I must say, right now this sounds like the problem is elsewhere,
and this change only papers over it by pessimizing all other cases.

Given the following test case (similar to the one in my initial patch):

define i64 @inline_func(i64* %in, i32 %radius) readonly noinline {
;;define i64 @inline_func(i64* %in, i32 %radius) readonly alwaysinline {
entry:
  %cmp12 = icmp sgt i32 %radius, 0
  br i1 %cmp12, label %for.body.preheader, label %for.cond.cleanup

for.body.preheader:
  %wide.trip.count = zext i32 %radius to i64
  br label %for.body

for.cond.cleanup:
  %max_val.0.lcssa = phi i64 [ 0, %entry ], [ %max_val.0., %for.body ]
  ret i64 %max_val.0.lcssa

for.body:
  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
  %max_val.013 = phi i64 [ 0, %for.body.preheader ], [ %max_val.0., %for.body ]
  %arrayidx = getelementptr inbounds i64, i64* %in, i64 %indvars.iv
  %0 = load i64, i64* %arrayidx, align 4
  %cmp1 = icmp eq i64 %max_val.013, %0
  %max_val.0. = select i1 %cmp1, i64 %max_val.013, i64 %0
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond, label %for.cond.cleanup, label %for.body
}

define void @test1(i64* nocapture %out, i64* %in, i32 %w, i32 %n)  {
entry:
  %idxprom = sext i32 %w to i64
  %arrayidx = getelementptr inbounds i64, i64* %in, i64 %idxprom
  %0 = load i64, i64* %arrayidx, align 4
  %call = tail call i64 @inline_func(i64* %in, i32 %n)
  %cmp = icmp eq i64 %call, -1
  br i1 %cmp, label %if.then, label %if.end

if.then: 
  %cmp1 = icmp eq i64 %0, %call
  %conv = sext i1 %cmp1 to i64
  store i64 %conv, i64* %out, align 4
  br label %if.end

if.end: 
  ret void
}

Compiled with Clang at -O2 for AArch64, without the change here there are more core-register saves/restores in the stack frame and one more register move.
The base pointer and the offset need to be held in callee-saved registers to make sure their values are preserved across the function call, and register-move instructions are needed to move them from the argument registers to the callee-saved registers.

With the change, only the register holding the loaded value is saved in the prologue.
Even with the function marked alwaysinline, one more register-move instruction is generated to hold the pointer %in, because the code in inline_func uses and changes the pointer.

I didn't add it as a code-generation test because I think checking the exact code-gen would be too fragile.

I agree that the optimization passes are related.
I might have missed something, but I didn't see any other pass that is obviously doing something wrong or not trying hard enough to optimize the code for this particular case.

renlin removed a subscriber: hfinkel.
renlin added a comment. Aug 8 2018, 2:31 AM

@hfinkel, I have updated the patch as you suggested, and in the latest comment I showed an artificial test case that is improved by the change.
Do you think you would be able to review this? Thanks!