This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] Fix DebugLoc propagation when folding !(x cc y) -> (x !cc y)
ClosedPublic

Authored by twoh on Feb 9 2017, 10:53 PM.

Details

Summary

Currently, when 't1: i1 = setcc t2, t3, cc' followed by 't4: i1 = xor t1, Constant:i1<-1>' is folded into 't5: i1 = setcc t2, t3 !cc', SDLoc of newly created SDValue 't5' follows SDLoc of 't4', not 't1'. However, as the opcode of newly created SDValue is 'setcc', it make more sense to take DebugLoc from 't1' than 't4'. For the code below

extern int bar();
extern int baz();

int foo(int x, int y) {
  if (x != y)
    return bar();
  else
    return baz();
}

, following is the bitcode representation of 'foo' at the end of llvm-ir level optimization:

define i32 @foo(i32 %x, i32 %y) !dbg !4 {
entry:
  tail call void @llvm.dbg.value(metadata i32 %x, i64 0, metadata !9, metadata !11), !dbg !12
  tail call void @llvm.dbg.value(metadata i32 %y, i64 0, metadata !10, metadata !11), !dbg !13
  %cmp = icmp ne i32 %x, %y, !dbg !14
  br i1 %cmp, label %if.then, label %if.else, !dbg !16

if.then:                                          ; preds = %entry
  %call = tail call i32 (...) @bar() #3, !dbg !17
  br label %return, !dbg !18

if.else:                                          ; preds = %entry
  %call1 = tail call i32 (...) @baz() #3, !dbg !19
  br label %return, !dbg !20

return:                                           ; preds = %if.else, %if.then
  %retval.0 = phi i32 [ %call, %if.then ], [ %call1, %if.else ]
  ret i32 %retval.0, !dbg !21
}

!14 = !DILocation(line: 5, column: 9, scope: !15)
!16 = !DILocation(line: 5, column: 7, scope: !4)

As you can see, in 'entry' block, 'icmp' instruction and 'br' instruction have different debug locations. However, with current implementation, there's no distinction between debug locations of these two when they are lowered to asm instructions. This is because 'icmp' and 'br' become 'setcc' 'xor' and 'brcond' in SelectionDAG, where SDLoc of 'setcc' follows the debug location of 'icmp' but SDLOC of 'xor' and 'brcond' follows the debug location of 'br' instruction, and SDLoc of 'xor' overwrites SDLoc of 'setcc' when they are folded. This patch addresses this issue.

Event Timeline

twoh created this revision.Feb 9 2017, 10:53 PM
twoh updated this revision to Diff 89153.Feb 20 2017, 7:55 PM

Some tests need to be updated as new DebugLoc changes instruction ordering.

mkuper added a subscriber: mkuper.Feb 20 2017, 8:21 PM
mkuper added inline comments.
test/CodeGen/X86/avx512-fsel.ll
15

A drive-by comment: please don't manually change auto-generate tests - rather, rerun the script. You shouldn't need CHECK-DAG here anyway, since the scheduling is supposed to be deterministic.

(Also, I'm a bit confused. I can understand SDLoc changes affecting scheduling if we do source-order scheduling, but why would this make a difference for IR input that doesn't have any line info to begin with?)

jlebar added a subscriber: jlebar.Feb 20 2017, 8:21 PM

lgtm for nvptx test change

twoh added inline comments.Feb 20 2017, 9:05 PM
test/CodeGen/X86/avx512-fsel.ll
15

Sorry, I missed the note. Thanks for the comments, and I will update the test again.

Even if there's no line info, SelectionDAGBuilder::SDNodeOrder is still maintained during the process (https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L942), and used to set SDNode::IROrder. For example, here (https://github.com/llvm-mirror/llvm/blob/master/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L1009) getCurSDLoc generates SDLoc that takes SelectionDAGBuilder::SDNodeOrder as its IROrder, and this values eventually passed to the SDNode constructor for IROrder.

twoh updated this revision to Diff 89154.Feb 20 2017, 9:07 PM

Now avx512-fsel.ll is auto-generated from the script.

mkuper added inline comments.Feb 20 2017, 10:24 PM
test/CodeGen/X86/avx512-fsel.ll
15

Ah, ok, that makes sense.

andreadb added inline comments.Feb 21 2017, 5:08 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4700–4703

With this change, we always propagate the debug location of N0.

However, this may be suboptimal if we end up in a situation where N0 has a null DebugLoc, and the DebugLoc of N is not null. In that case, shouldn't we propagate SDLoc(N) instead?

When constructing a new SetCC/SelectCC node, what if we give higher priority to SDLoc(N) over SDLoc(N0) based on whether N0->getDebugLoc() returns a null location?

In order to guarantee a deterministic schedule, we could always call SelectionDAG::UpdadeSDLocOnMergedSDNode() on the newly created dag node, with the goal to set the new IROrder equal to the minimum node order between N0 and N.

Not sure if that makes sense.
I hope this helps.

aprantl added inline comments.Feb 21 2017, 9:07 AM
test/CodeGen/X86/xor-combine-debugloc.ll
2

Please add a comment that explains what is being checked here.
It might be more robust to run llc -stop-after=.. and check the MIR right after ISEL.

twoh added inline comments.Feb 21 2017, 10:53 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4700–4703

@andreadb Thanks for the comments! In this particular case I think it is not correct to take SDLoc from N, and it is better to let the new SDLoc have null DebugLoc if N0 has a null DebugLoc. If we can't have a precise DebugLoc, I think it is better with no DebugLoc than imprecise DebugLoc.

Regarding scheduling order, It seems more natural to me that the new SDValue takes IROrder from N0 if it takes SDLoc from N0. A deterministic schedule is a desirable property, but not sure if it needs to be forced here.

twoh updated this revision to Diff 89270.Feb 21 2017, 1:15 PM

Addressing @aprantl's comments. Thanks!

andreadb added inline comments.Feb 22 2017, 4:44 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4700–4703

I have to admit, it is not particularly obvious to me why a null DebugLoc is better than a potentially imprecise DebugLoc from N. That said, I am not worried about that particular case.
I think that overall the patch is good - but please wait for Adrian.

twoh added inline comments.Feb 22 2017, 8:42 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4700–4703

For debugger there might be not much difference. However, for sample-profile based optimization, while null DebugLoc is simply ignored by the optimizer, imprecise DebugLoc may end up with "wrong" optimization. You can find more detailed discussion on it from comments under D25742.

aprantl added inline comments.Feb 22 2017, 9:14 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4700–4703

Ok, if I understand this correctly:
`
t1: i1 = setcc t2, t3, cc, LOC0
t4: i1 = xor t1, Constant:i1<-1>, LOC1
->
t5: i1 = setcc t2, t3 !cc, LOC0
`

There are two reasonable ways to approach this:

  • "We are folding the xor into the setcc", so the location from the setcc wins / remains untouched. This is what this patch implements.
  • "We are merging two instructions", so we should merge the two locations. Then we should use the getMergedDebugLoc API.

I'm fine with both of the above variants. I don't think we should get into the business of "prefer Loc1, but take Loc2 if Loc1 isn't available". If we are concerned about loosing fidelity, we should investigate why the setcc doesn't have a location instead and we will benefit in more places from that.

(I'm aware that this is more a personal opinion than a sound technical argument :-)

andreadb added inline comments.Feb 22 2017, 9:19 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4700–4703

D25742 is a completely different case where common code from two basic blocks is merged into a common tail. In that case, we don't want that common code still refers to its original debugloc.

In your test case, you are combining instructions which are from the same basic block. I don't see how this can affect the sample profile; in particular, it cannot negatively impact the computation of block frequency. Essentially, the problem is only when you are counting sample hits against the wrong basic block.

twoh added inline comments.Feb 22 2017, 11:03 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4700–4703

@andreadb Oh, I just wanted to bring up the example of null debugloc is better than imprecise debugloc for sample profile. Sorry for the confusion.

@aprantl In general I think using getMergedDebugLoc API is the right way to go, but for this specific case it is pretty clear that xor is a part of what setcc supposed to do (i.e. setting predicate), so it makes more sense to take SDLoc from setcc.

In particular, I believe that SelectionDAGBuilder::visitSwitchCase (https://reviews.llvm.org/diffusion/L/browse/llvm/trunk/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp;295856$1872) is one major source generating the pattern of setcc followed by xor with constant -1, and when xor is created at that point it should have take SDLoc from setcc not from following branch instruction. Unfortunately, xor instruction sets its SDLoc to the value returned from getCurSDLoc() call, but CurInst is already a branch instruction. Fixing SDLoc for xor there would involve non-trivial amount of work because it will require SelectionDAGBuilder to maintain not only current instruction but also previous instruction, so I prefer to address it here.

aprantl edited edge metadata.Feb 22 2017, 2:38 PM

In general I think using getMergedDebugLoc API is the right way to go, but for this specific case it is pretty clear that xor is a part of what setcc supposed to do (i.e. setting predicate), so it makes more sense to take SDLoc from setcc.

Yes, and as I said I can follow that reasoning. As far as I am concerned this LGTM.

twoh added a comment.Mar 1 2017, 8:30 PM

Ping. thanks!

andreadb accepted this revision.Mar 2 2017, 4:43 AM

LGTM too!

This revision is now accepted and ready to land.Mar 2 2017, 4:43 AM
twoh updated this revision to Diff 90388.Mar 2 2017, 2:02 PM

Rebase. Fix test/CodeGen/X86/and-sink.ll that added 9days ago.

This revision was automatically updated to reflect the committed changes.