This is an archive of the discontinued LLVM Phabricator instance.

[utils] InlineFunction: fix for debug info affecting optimizations
ClosedPublic

Authored by yechunliang on Oct 8 2019, 4:09 AM.

Details

Summary

Debug info affects output from "opt -inline", InlineFunction could not handle the llvm.dbg.value when it exist between alloca instructions, scanning the block of allocas will get wrong result if llvm.dbg.value instr exist. Skip dbg instr while allocas scanning.

Fix the issue: https://bugs.llvm.org/show_bug.cgi?id=43291k

Diff Detail

Event Timeline

yechunliang created this revision.Oct 8 2019, 4:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 8 2019, 4:09 AM
jmorse added a comment.EditedOct 8 2019, 4:55 AM

Looks good, one concern about whether too many debug instructions may get moved written inline

llvm/lib/Transforms/Utils/InlineFunction.cpp
1842–1843

Is there a possibility of an unrelated debug instruction being skipped here, and becoming part of the slice moved by lines 1847-1857? Moving dbg.values of arguments to the start of the caller may create a debug use-before-def situation, there could be other problem scenarios too.

Using a debug-instruction filtering iterator (like here [0]) might just do-the-right-thing, I don't know whether feeding one to splice would behave correctly though.

[0] https://github.com/llvm/llvm-project/blob/fdaa74217420729140f1786ea037ac445a724c8e/llvm/lib/Transforms/Utils/SimplifyCFG.cpp#L2592

llvm/test/Transforms/Inline/inline-with-debuginfo.ll
5–6

The bug report number would be nice too

58–60

We generally delete function attributes if they're not necessary

bjope added a comment.Oct 8 2019, 5:39 AM

I do not understand how this helps. The code is written in a way that it skips any instruction, but moves contigous blocks of allocas in one splice (not sure exactly why, is that really faster?). Maybe the difference is that the check for AI->useEmpty() only is done for the first alloca in a sequence of alloca instructions? Or can't we just remove the loop at line 1847 (only moving one alloca at a time).

aprantl added inline comments.Oct 8 2019, 10:14 AM
llvm/lib/Transforms/Utils/InlineFunction.cpp
1842–1843

Don't we have an iterator that automatically skips debug intrinsics?

jdoerfert added inline comments.Oct 8 2019, 10:34 AM
llvm/lib/Transforms/Utils/InlineFunction.cpp
1862

Don't we need to have similar logic here? What happens if there are two allocas, then the dbg intrinsic, then another one?

bjope added a comment.Oct 9 2019, 12:15 AM

I do not understand how this helps. The code is written in a way that it skips any instruction, but moves contigous blocks of allocas in one splice (not sure exactly why, is that really faster?). Maybe the difference is that the check for AI->useEmpty() only is done for the first alloca in a sequence of alloca instructions? Or can't we just remove the loop at line 1847 (only moving one alloca at a time).

The patch description really needs to be updated. Afaict the old code has been doing "Skip dbg instr while allocas scanning.", but with this patch you in some sense start to scan for both alloca instructions _and_ dbg intrinsics. And then some dbg intrinsics will be part of the splice.
As Johannes mentioned this is incomplete (since not all dbg intrinsics are considered, only those appearing after the first alloca in for each lap in the outer for loop).
Besides, inlining decisions should not be impacted by if we move the dbg intrinsics or not, right? That is why I think this fix isn't solving the problem seen in the bugzilla ticket. It only hides the actual problem (given the reduced reproducer).

Besides, inlining decisions should not be impacted by if we move the dbg intrinsics or not, right?

Correct. clang -g | strip and clang should produce identical output.

[inline] skip dbg info when scan allocas

Debug instruction will break the next alloca scanning, the alloca instruction next to debug instruction will fail to insert to the caller function. The debug instructions should be skipped when next allocas scanning.

Fix the issue: https://bugs.llvm.org/show_bug.cgi?id=43291

aprantl requested changes to this revision.Oct 10 2019, 8:30 AM
aprantl added inline comments.
llvm/lib/Transforms/Utils/InlineFunction.cpp
1842–1843

Please use BasicBlock:: instructionsWithoutDebug () instead of rolling your own.

https://llvm.org/doxygen/classllvm_1_1BasicBlock.html#aa696c5026c1241f4dd097f3558a43a47

1860

llvm style would be to reverse the condition: break first, then unconditionally execute the code.

This revision now requires changes to proceed.Oct 10 2019, 8:30 AM
bjope added a comment.Oct 10 2019, 9:25 AM

I'm still confused. IMO moving or not moving dbg instructions is a different question (not being discussed here). Still no explanation why result differs depending on if we move dbg instructuons or not. And this patch now blindly moves some dbg instructions, regardless if they are related to the allocas or not. Consider an instruction sequence like this:

alloca
bitcast
dbg.declare
bitcast
alloca
bitcast
dbg.value

The scanning for allocas will find the two alloca instructions and call splice twice, only moving the alloca instructions, right?

The input IR could probably have looked like this instead:

alloca
dbg.declare
bitcast
bitcast
alloca
dbg.value
bitcast

and then suddenly (with this patch) we start to move the dbg intrinsics (but we still do not move the bitcasts, nor dbg intrinsics that do not have an alloca as predecessor).

I simply do not see the logic with this change.

yechunliang marked 5 inline comments as done.EditedOct 10 2019, 7:19 PM

The code is written in a way that it skips any instruction, but moves contigous blocks of allocas in one splice (not sure exactly why, is that really faster?).

I also could not understand why continue to scan allocas block after first none use_empty alloca instruction, here is the first commit has some reason: https://github.com/llvm/llvm-project/commit/6f8865bf9

Maybe the difference is that the check for AI->useEmpty() only is done for the first alloca in a sequence of alloca instructions? Or can't we just remove the loop at line 1847 (only moving one alloca at a time).

with this example test case, second alloca is use_empty, and will insert to caller together with first alloca (!use_empty). But if there is dbg instruction between first alloca and second alloca instruction. the continue scan will break,
then with the debug instruction, the program will goto the front for() loop, and handle the second alloca as use_empty (because it has no use list like "xxx.sroa_cast = bitcast %rec1198* %volatileloadslot to i8*") and eraseFromParent.
this is difference as no-dbg inline will not erase second alloca instruction.

llvm/lib/Transforms/Utils/InlineFunction.cpp
1842–1843

Is there a possibility of an unrelated debug instruction being skipped here,

By default, the debug instruction between allocas will insert to caller together with block of allocas, I just keep the default behavier. Not sure if we need to remove the dbg instr when inline to the caller.

Using a debug-instruction filtering iterator (like here [0]) might just do-the-right-thing,

I could not found the way to use debug-instruction filtering like instructionsWithoutDebug(llvm::Instruction) to handle BasicBlock::iterator. the format not match, so I used DbgInfoInstrinsic for simple usage.

1862

thanks for the comments, this is a good case, I have updated code and testcase to handle this.

yechunliang marked an inline comment as done.

Update patch:

  1. Use BasicBlock::iterator skipDebugIntrinsics instead of rolling directly. The recommend API BasicBlock:: instructionsWithoutDebug() looks not very suitable to handle llvm::Instruction in this code.
  2. follow llvm style, condition: break first, then unconditionally execute the code

The code is written in a way that it skips any instruction, but moves contigous blocks of allocas in one splice (not sure exactly why, is that really faster?).

I also could not understand why continue to scan allocas block after first none use_empty alloca instruction, here is the first commit has some reason: https://github.com/llvm/llvm-project/commit/6f8865bf9

Maybe the difference is that the check for AI->useEmpty() only is done for the first alloca in a sequence of alloca instructions? Or can't we just remove the loop at line 1847 (only moving one alloca at a time).

with this example test case, second alloca is use_empty, and will insert to caller together with first alloca (!use_empty). But if there is dbg instruction between first alloca and second alloca instruction. the continue scan will break,
then with the debug instruction, the program will goto the front for() loop, and handle the second alloca as use_empty (because it has no use list like "xxx.sroa_cast = bitcast %rec1198* %volatileloadslot to i8*") and eraseFromParent.
this is difference as no-dbg inline will not erase second alloca instruction.

So the root cause is rather that we treat an alloca being immediately preceeded by another alloca differrently from the case when it is preceeded by another kind of instruction. This happens also when having other instructions in between, and is not specific to dbg intrinsics (could be interesting to add a test case where you replace the dbg intrinsics by something else).

So I think that the solution might be based on one of these ideas:

  1. Remove the check for use_empty in the outer loop.
  2. Add a check for !use_empty in the inner loop.
  3. Remove the inner loop (i.e only splice one alloca at a time).

Alternative 3 would be the simplest.

If there really is a speedup on doing fewer splices, then alternative 2 still moves consequtive !use_empty allocas in batches. The idea with alternative 2 is to split batches on allocas that has no uses (as they are handled in the outer loop).

Alternative 1 might work assuming that allocas with no uses are cleaned up somewhere else. But I think this alternative is the least interesting one.

yechunliang marked an inline comment as done.Oct 11 2019, 3:59 AM

So the root cause is rather that we treat an alloca being immediately preceeded by another alloca differrently from the case when it is preceeded by another kind of instruction. This happens also when having other instructions in between, and is not specific to dbg intrinsics (could be interesting to add a test case where you replace the dbg intrinsics by something else).

Yes I think so, if the other instruction is not dbg instr which exist between two allocas, the InlineFunction with and without "-strip-debug" will make the same behavior, that should both erase second use_empty alloca. This patch is to fix the issue that debug instr impact InlineFunction generate different output.

So I think that the solution might be based on one of these ideas:

  1. Remove the check for use_empty in the outer loop.
  2. Add a check for !use_empty in the inner loop.
  3. Remove the inner loop (i.e only splice one alloca at a time).

These good ideas should be talking about the design change of alloca inline or improvement of splice.
Read from the code, I think about the alloca inline behavior like this: First detect one !use_empty alloca, if next immediate instructions are allocas, even they are use_empty, they will all added one after one and move to caller together with first alloca. if other instruction (whatever dbg or others instrs) exist between allocas, the next alloca will check if is use_empty or not, if is use_empty then erase. Does this behavior correct, or could be improve? I don't know much about alloca inline. but seems the code run many years with this design.

aprantl added inline comments.Oct 11 2019, 1:30 PM
llvm/lib/Transforms/Utils/InlineFunction.cpp
1854

Does this work when I == E here?

lebedev.ri retitled this revision from fix debug info affects output when opt inline to [utils] InlineFunction: fix for debug info affecting optimizations.Oct 11 2019, 1:39 PM
bjope added a reviewer: fhahn.Oct 11 2019, 2:32 PM

So the root cause is rather that we treat an alloca being immediately preceeded by another alloca differrently from the case when it is preceeded by another kind of instruction. This happens also when having other instructions in between, and is not specific to dbg intrinsics (could be interesting to add a test case where you replace the dbg intrinsics by something else).

Yes I think so, if the other instruction is not dbg instr which exist between two allocas, the InlineFunction with and without "-strip-debug" will make the same behavior, that should both erase second use_empty alloca. This patch is to fix the issue that debug instr impact InlineFunction generate different output.

So I think that the solution might be based on one of these ideas:

  1. Remove the check for use_empty in the outer loop.
  2. Add a check for !use_empty in the inner loop.
  3. Remove the inner loop (i.e only splice one alloca at a time).

These good ideas should be talking about the design change of alloca inline or improvement of splice.
Read from the code, I think about the alloca inline behavior like this: First detect one !use_empty alloca, if next immediate instructions are allocas, even they are use_empty, they will all added one after one and move to caller together with first alloca. if other instruction (whatever dbg or others instrs) exist between allocas, the next alloca will check if is use_empty or not, if is use_empty then erase. Does this behavior correct, or could be improve? I don't know much about alloca inline. but seems the code run many years with this design.

I'm no expert on this part of the code either. But there is no reasonable logic in handling allocas differently depending on the existence of other allocas here afaict. It looks like it has been like that for over 10 years, but I doubt that it justifies adding yet another level of illogical handling here. This time handling dbg intrinsics differently depending on the existance of, possibly unrelated, alloca instructions. It would just make this an even bigger mess. Would be much better if we could fix what seems to have been a bug(?) for over a decade (which also would solve the debug invariance problem).

(Maybe you'll need another set of reviewers considering that the fix would impact the alloca inlining slightly, rather than just aiming at dbg intrinsics?)

! In D68633#1699421, @bjope wrote:
So I think that the solution might be based on one of these ideas:

  1. Remove the check for use_empty in the outer loop.
  2. Add a check for !use_empty in the inner loop.
  3. Remove the inner loop (i.e only splice one alloca at a time).

    Alternative 3 would be the simplest. If there really is a speedup on doing fewer splices, then alternative 2 still moves consequtive !use_empty allocas in batches. The idea with alternative 2 is to split batches on allocas that has no uses (as they are handled in the outer loop). Alternative 1 might work assuming that allocas with no uses are cleaned up somewhere else. But I think this alternative is the least interesting one.

I prefer to adopt alternative 2. I think use_empty alloca (the dead alloca) should also be erased in inner loop.

yechunliang updated this revision to Diff 225008.EditedOct 15 2019, 5:38 AM
yechunliang edited the summary of this revision. (Show Details)

Rethink about this issue and the root cause should be caused by use_empty. Just resubmitted the patch with "[InlineFunction] Erase use_empty allocas when inline".

Updated patch description:

[InlineFunction] Erase use_empty allocas when inline  
- When continuely scan block of allocas, in the inner loop, there is lack of use_empty check, this will cause use_empty alloca instruction next to previous alloca insert to the caller. This will make different output when there is debug instruction between two alloca instructions (with/without --with-strip). This patch add use_empty() check in the alloca scanning inner loop. Exit the inner loop when find next alloca instruction is use_empty and handle to earse it in the outer loop.
  a) callee:
    %1 = alloca (used)
    dbg
    %2 = alloca (use empty)
    dbg
    %3 = alloca3 (used)

  b) inline caller expection: (the use_empty %2 be removed)
    %1 = alloca (used)
    dbg
    dbg
    %3 = alloca3 (used)
- Also skip debug instruction in the inner loop for continue alloca scanning. 
Fix the issue: https://bugs.llvm.org/show_bug.cgi?id=43291

fix typo of patch description

vsk added a comment.Oct 15 2019, 10:01 AM

Hey @yechunliang, thanks for working on this.

This part of the inliner looks like it has two issues: 1) it's not fully aware of debug instructions, and 2) it erroneously extends the lifetimes of inlined variables. The current iteration of your patch might address the first issue, but it's not DRY, and it doesn't solve the second.

Here are a few suggestions.

  • Please split out the logic for moving callee allocas into a helper function. This should simplify review.
  • Use the filtering iterator to skip debug instructions. I understand that you had some trouble with this. Here's a patch based on an older version of your diff:

  • Moving the dbg.declares along with the callee's allocas to the entry block erroneously extends the lifetimes of all those allocas to the whole function. This makes frame inspection in the debugger really confusing. To fix this, I think you should just not move the dbg.declares. The easiest way to do that is to stop batch-splicing, as @bjope pointed out.
vsk requested changes to this revision.Oct 15 2019, 10:01 AM
This revision now requires changes to proceed.Oct 15 2019, 10:01 AM
In D68633#1709729, @vsk wrote:

Hey @yechunliang, thanks for working on this.

This part of the inliner looks like it has two issues: 1) it's not fully aware of debug instructions, and 2) it erroneously extends the lifetimes of inlined variables. The current iteration of your patch might address the first issue, but it's not DRY, and it doesn't solve the second.

Well, this part of the code is actually aware of dbg.declares (associated with the allocas), and moves those separately. This is done after the loop. And I guess that is (at least one reason to) why we save information about moves allocas in IFI.StaticAllocas.
Not sure if that is a bad thing to do or not, but that is probably a different problem. I think the initial goal here was to get rid of the debug invariance, i.e. that allocas could either be moved or removed depending on the presence of debug info.

My suggested solution to that is to handle all allocas the same (not treating two allocas in a row differently from two allocas separated by any other instruction). This makes this part of the code less sensitive to the order of instructions in the input. As a side effect (of such an improvement) the debug invariance problem is expected to go away.

Here are a few suggestions.

  • Please split out the logic for moving callee allocas into a helper function. This should simplify review.
  • Use the filtering iterator to skip debug instructions. I understand that you had some trouble with this. Here's a patch based on an older version of your diff:

  • Moving the dbg.declares along with the callee's allocas to the entry block erroneously extends the lifetimes of all those allocas to the whole function. This makes frame inspection in the debugger really confusing. To fix this, I think you should just not move the dbg.declares. The easiest way to do that is to stop batch-splicing, as @bjope pointed out.

I don't really see a reason for splitting this into a helper function. Nor starting to use a filtering iterator. My opinion is that the loop only should consider alloca instruction (just like it is today on trunk). But either the inner loop should be removed (only moving one alloca at a time), or if we still want to move sequences of alloca instructions, then the checks to determine if an alloca should be moved or not should be the same in the inner loop as in the outer loop. So the inner loop could simply look like this:

while (isa<AllocaInst>(I) &&
       !cast<AllocaInst>(I)->use_empty() &&
       allocaWouldBeStaticInEntry(cast<AllocaInst>(I))) {
  IFI.StaticAllocas.push_back(cast<AllocaInst>(I));
  ++I;
}

If we keep the inner loop the conditions for keeping/moving/removing an alloca is duplicated. So it would be much simpler just to stop doing batch moves.

vsk added a comment.Oct 15 2019, 1:17 PM

My suggested solution to that is to handle all allocas the same (not treating two allocas in a row differently from two allocas separated by any other instruction). This makes this part of the code less sensitive to the order of instructions in the input. As a side effect (of such an improvement) the debug invariance problem is expected to go away.

Yes, I agree with this. There's no need to use a filtering iterator with this approach: you need an isa<AllocaInst> check either way. We can consider the dbg.declare movement separately, sure.

yechunliang added a comment.EditedOct 16 2019, 2:36 AM

@bjope, @vsk: Thanks very much for the comments, really help me.

I have verified below two options, both could work, but don't know which is better.

Option1: Just add !use_empty check in the inner loop. if debug exist, alloca will be handled separately and move one alloca to the caller each time,
if no debug, the alloca will be handled in sequence and move all (!use_empty) allocas to the caller at one time. (here is the inline speed up improvement, don't know there is the case is frequently used and how to measure the performance?)
the process is a little different but the optimization results are same. Does this acceptable?

Option2: Remove inner loop, all alloca will be handled once, splice immediately for each alloca, the benefits are no duplicate checking code, no debug impact.
but it may loss inline speed up improvement for blocks of allocas.

just added use_empty check in the inner loop, this will be the smallest code change for issue fixing.

[InlineFunction] skip use_empty allocas when inline

Add use_empty check in alloca inner loop scanning, if the alloca is use_empty, do not move it to the entry block of the caller, stop scanning and goto splice immediately.

the behavior should be like this:
  a) the callee:
    %a = alloca (is !use_empty)
    %b = alloca (is use_empty)
    %c = alloca (is !use_empty)

  b) the caller after inline: (instr %b is skipped)
    %a = alloca (is !use_empty)
    %c = alloca (is !use_empty)

Fix the issue: https://bugs.llvm.org/show_bug.cgi?id=43291
bjope added a comment.Oct 18 2019, 5:05 AM

Thanks! I like this solution.

Moving focus to the test case. I've added a couple of suggestions for cleanup.

llvm/test/Transforms/Inline/inline-skip-use-empty-alloca.ll
31 ↗(On Diff #225583)

This can basically match any line containing the letter g. Is it possible to make a more complete match?
I also suggest to rename the functions here to something like @foo and @bar instead of @f and @g to get something that is easier to distinguish.

43 ↗(On Diff #225583)

This declaration can be removed (same goes for the lifetime.end).

One trick is to do some cleanup of the test case by using opt -S -strip-dead-prototypes -strip-dead-debug-info to both strip away dead prototypes as well as reducing the amount of unused metadata. Just notice that you have to add back the CHECK/RUN lines etc manually afterwards.

yechunliang updated this revision to Diff 225811.EditedOct 20 2019, 6:23 PM

update test case:

  1. using function name @foo and @bar instead of @f and @g for more complete match

2.cleanup test case with "-strip-dead-prototypes -strip-dead-debug-info"

Thank you @bjope for comments.

yechunliang marked 2 inline comments as done.Oct 20 2019, 6:25 PM
bjope accepted this revision.Oct 21 2019, 5:54 AM

Thanks for the patience with all my remarks, turning this around completely compared to the original patch.

LGTM!

bjope added a comment.Oct 28 2019, 5:58 AM

Not sure why this isn't in "Accepted" state. I know @aprantl and @vsk asked for a new revision on an earlier diff. So is it perhaps because not all comments have been marked as "Done"?

@yechunliang: Let me know if you need help committing this (not sure if you got commit permissions). I noticed some earlier patch that has been accepted by not yet committed as well (D66467).

yechunliang marked 3 inline comments as done.Oct 28 2019, 6:49 AM

@bjope thanks for help, to be honest, I don't know the process or execution of commit, I thought I have no commit permission right now.

bjope added a comment.Oct 28 2019, 8:33 AM

@bjope thanks for help, to be honest, I don't know the process or execution of commit, I thought I have no commit permission right now.

The procedure should be described here:

http://llvm.org/docs/Contributing.html#how-to-submit-a-patch

It says

This cycle continues until all requests and comments have been addressed and a reviewer accepts the patch with a Looks good to me or LGTM. Once that is done the change can be committed. If you do not have commit access, please let people know during the review and someone should commit it on your behalf.

So it is perfectly OK to ask the reviewers to submit it for you.

When your contribution increases, you can obtain commit access (to both push your own patches and also maybe to help new contributers). That should be covered by

http://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access

(not sure if the process will change due to the ongoing transfer from SVN->GitHub).

I haven't pushed directly to GitHub myself yet. I'll try to commit this patch for you, also verifying that my commit access still works.

This revision was not accepted when it landed; it landed in state Needs Review.Oct 28 2019, 10:19 AM
This revision was automatically updated to reflect the committed changes.

@bjope,@jmorse, @aprantl, @jdoerfert, @vsk thank you all for kind and patient help on code review and giving me many supports and guidelines, I'm glad this(my first) patch be accepted and merged on llvm-project. Really thank you.