This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Do the Simple Early Return in block-placement pass to optimize the blocks
ClosedPublic

Authored by ZhangKang on Jun 29 2019, 1:45 AM.

Details

Summary

In block-placement pass, it will create some patterns for unconditional we can do the simple early retrun.
But the early-ret pass is before block-placement, we don't want to run it again.
This patch is to do the simple early return to optimize the blocks at the last of block-placement.

When ChainBB is unconditional branch to the TBB, and TBB has no
fallthrough predecessor and fallthrough successor, try to merge
ChainBB and TBB. This is legal under the one of following conditions:

  1. ChainBB is empty except for an unconditional branch.
  2. TBB has only one predecessor.

Below is an example

BB:                   | BB:
   XOR 3, 3, 4        |   XOR 3, 3, 4
   B TBB              |   B ChainBB
...                   | ...
ChainBB:              | ChainBB:
   B TBB              |   ADD 3, 3, 4
...                   |   BLR
TBB:                  |
   ADD 3, 3, 4        |
   BLR                |

Diff Detail

Event Timeline

ZhangKang created this revision.Jun 29 2019, 1:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 29 2019, 1:45 AM
ZhangKang updated this revision to Diff 207192.Jun 29 2019, 2:30 AM

Modify the comments.

This seems like simple tail duplication, which the target-independent taildup pass should handle. Can you give an example which taildup doesn't handle?

This seems like simple tail duplication, which the target-independent taildup pass should handle. Can you give an example which taildup doesn't handle?

@efriedma , I am sorry for late reply. I spend some time to find the case and investigate the reason.
The pass tail duplication is for unconditional branch Tail BB. Most of cases I write manually can be optimized by the tail duplication pass.
But there are still some cases can't be optimized by the tail duplication pass, because the pattern will be created after tail duplication pass.

Below case is from SPEC, I have narrow down the origin case, also SPEC has many other cases can trigger the early-ret I wrote after running the tail duplication pass:

; ModuleID = 'HashXMLCh.ll'
source_filename = "HashXMLCh.cpp"
target datalayout = "e-m:e-i64:64-n32:64"
target triple = "powerpc64le-unknown-linux-gnu"

%"class.xercesc_2_7::HashXMLCh" = type { %"class.xercesc_2_7::HashBase" }
%"class.xercesc_2_7::HashBase" = type { i32 (...)** }

; Function Attrs: norecurse nounwind readonly
define dso_local zeroext i1 @_ZN11xercesc_2_79HashXMLCh6equalsEPKvS2_(%"class.xercesc_2_7::HashXMLCh"* nocapture readnone %this, i8* readonly %key1, i8* readonly %key2) unnamed_addr #0 align 2 {
entry:
  %0 = bitcast i8* %key1 to i16*
  %1 = bitcast i8* %key2 to i16*
  %cmp.i = icmp eq i8* %key1, null
  %cmp1.i = icmp eq i8* %key2, null
  %or.cond.i = or i1 %cmp.i, %cmp1.i
  br i1 %or.cond.i, label %if.then.i, label %while.cond.preheader.i

while.cond.preheader.i:                           ; preds = %entry
  %2 = load i16, i16* %0, align 2, !tbaa !2
  %3 = load i16, i16* %1, align 2, !tbaa !2
  %cmp926.i = icmp eq i16 %2, %3
  br i1 %cmp926.i, label %while.body.i, label %_ZN11xercesc_2_79XMLString6equalsEPKtS2_.exit

if.then.i:                                        ; preds = %entry
  br i1 %cmp.i, label %lor.lhs.false3.i, label %land.lhs.true.i

land.lhs.true.i:                                  ; preds = %if.then.i
  %4 = load i16, i16* %0, align 2, !tbaa !2
  %tobool.i = icmp eq i16 %4, 0
  br i1 %tobool.i, label %lor.lhs.false3.i, label %_ZN11xercesc_2_79XMLString6equalsEPKtS2_.exit

lor.lhs.false3.i:                                 ; preds = %land.lhs.true.i, %if.then.i
  br i1 %cmp1.i, label %if.else.i, label %land.lhs.true5.i

land.lhs.true5.i:                                 ; preds = %lor.lhs.false3.i
  %5 = load i16, i16* %1, align 2, !tbaa !2
  %tobool6.i = icmp eq i16 %5, 0
  br i1 %tobool6.i, label %if.else.i, label %_ZN11xercesc_2_79XMLString6equalsEPKtS2_.exit

if.else.i:                                        ; preds = %land.lhs.true5.i, %lor.lhs.false3.i
  br label %_ZN11xercesc_2_79XMLString6equalsEPKtS2_.exit

while.body.i:                                     ; preds = %while.cond.preheader.i, %if.end12.i
  %6 = phi i16 [ %7, %if.end12.i ], [ %2, %while.cond.preheader.i ]
  %psz2.028.i = phi i16* [ %incdec.ptr13.i, %if.end12.i ], [ %1, %while.cond.preheader.i ]
  %psz1.027.i = phi i16* [ %incdec.ptr.i, %if.end12.i ], [ %0, %while.cond.preheader.i ]
  %tobool10.i = icmp eq i16 %6, 0
  br i1 %tobool10.i, label %_ZN11xercesc_2_79XMLString6equalsEPKtS2_.exit, label %if.end12.i

if.end12.i:                                       ; preds = %while.body.i
  %incdec.ptr.i = getelementptr inbounds i16, i16* %psz1.027.i, i64 1
  %incdec.ptr13.i = getelementptr inbounds i16, i16* %psz2.028.i, i64 1
  %7 = load i16, i16* %incdec.ptr.i, align 2, !tbaa !2
  %8 = load i16, i16* %incdec.ptr13.i, align 2, !tbaa !2
  %cmp9.i = icmp eq i16 %7, %8
  br i1 %cmp9.i, label %while.body.i, label %_ZN11xercesc_2_79XMLString6equalsEPKtS2_.exit

_ZN11xercesc_2_79XMLString6equalsEPKtS2_.exit:    ; preds = %if.end12.i, %while.body.i, %if.else.i, %land.lhs.true5.i, %land.lhs.true.i, %while.cond.preheader.i
  %retval.0.i = phi i1 [ true, %if.else.i ], [ false, %land.lhs.true.i ], [ false, %land.lhs.true5.i ], [ false, %while.cond.preheader.i ], [ true, %while.body.i ], [ false, %if.end12.i ]
  ret i1 %retval.0.i
}

attributes #0 = { norecurse nounwind readonly "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-cpu"="pwr9" "target-features"="+altivec,+bpermd,+crypto,+direct-move,+extdiv,+htm,+power8-vector,+power9-vector,+vsx,-qpx" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"clang version 9.0.0 (git@github.ibm.com:compiler/llvm-project.git ab758ba128c46ba30cad058b89991852f7be5543)"}
!2 = !{!3, !3, i64 0}
!3 = !{!"short", !4, i64 0}
!4 = !{!"omnipotent char", !5, i64 0}
!5 = !{!"Simple C++ TBAA"}

We will get below assembly without this patch:

1   .text
  2   .abiversion 2
  3   .file "HashXMLCh.cpp"
  4   .globl  _ZN11xercesc_2_79HashXMLCh6equalsEPKvS2_ # -- Begin function _ZN11xercesc_2_79HashXMLCh6equalsEPKvS2_
  5   .p2align  4
  6   .type _ZN11xercesc_2_79HashXMLCh6equalsEPKvS2_,@function
  7 _ZN11xercesc_2_79HashXMLCh6equalsEPKvS2_: # @_ZN11xercesc_2_79HashXMLCh6equalsEPKvS2_
  8 .Lfunc_begin0:
  9 # %bb.0:                                # %entry
 10   cmpdi 1, 4, 0
 11   cmpdi 5, 0
 12   cror 20, 6, 2
 13   bc 4, 20, .LBB0_6
 14 # %bb.1:                                # %if.then.i
 15   bc 12, 6, .LBB0_3
 16 # %bb.2:                                # %land.lhs.true.i
 17   lhz 4, 0(4)
 18   li 3, 0
 19   cmplwi 1, 4, 0
 20   bnelr 1
 21 .LBB0_3:                                # %lor.lhs.false3.i
 22   bc 12, 2, .LBB0_5
 23 # %bb.4:                                # %land.lhs.true5.i
 24   lhz 4, 0(5)
 25   li 3, 0
 26   cmplwi  4, 0
 27   bnelr 0
 28 .LBB0_5:                                # %if.else.i
 29   b .LBB0_10
 30 .LBB0_6:                                # %while.cond.preheader.i
 31   lhz 8, 0(4)
 32   lhz 6, 0(5)
 33   li 3, 0
 34   cmplw 8, 6
 35   bnelr 0
 36 # %bb.7:                                # %while.body.i.preheader
 37   addi 6, 5, 2
 38   addi 7, 4, 2
 39   .p2align  4
 40 .LBB0_8:                                # %while.body.i
 41                                         # =>This Inner Loop Header: Depth=1
 42   andi. 8, 8, 65535
 43   beq 0, .LBB0_10
 44 # %bb.9:                                # %if.end12.i
 45                                         #   in Loop: Header=BB0_8 Depth=1
 46   addi 5, 5, 2
 47   addi 4, 4, 2
 48   lhz 8, 0(4)
 49   lhz 9, 0(5)
 50   addi 6, 6, 2
 51   addi 7, 7, 2
 52   cmplw 8, 9
 53   beq 0, .LBB0_8
 54   blr
 55 .LBB0_10:
 56   li 3, 1
 57   blr
 58   .long 0
 59   .quad 0
 60 .Lfunc_end0:
 61   .size _ZN11xercesc_2_79HashXMLCh6equalsEPKvS2_, .Lfunc_end0-.Lfunc_begin0
 62                                         # -- End function
 63
 64   .ident  "clang version 9.0.0 (git@github.ibm.com:compiler/llvm-project.git ab758ba128c46ba30cad058b89991852f7be5543)"
 65   .section  ".note.GNU-stack","",@progbits

We can see the line 29 b .LBB0_10 is unconditional branch to .LBB0_10,
Below is the .Lbb0_10

55 .LBB0_10:
56     li 3, 1
57   blr

This patch will optimize the line 29 to

29   li 3, 1
 30   blr

The line 29 b .LBB0_10 is created after running the pass branch-folder, The pass branch-folder is after the tail duplication pass.

Improve the conditional branch to early-ret like the line 43 beq 0, .LBB0_10` will be my next patch. I only improve the unconditional branch for this patch.

The line 29 b .LBB0_10 is created after running the pass branch-folder,

I did a quick test with -print-after-all, and it looks like it's actually created by MachineBlockPlacement?

If we're going to do this transform, we should use the existing TailDup code to do it, not reimplement it in PPCEarlyReturn. Would it make sense to run a re-run the entire tail duplication pass after block placement? Or should we try to do something more targeted?

The line 29 b .LBB0_10 is created after running the pass branch-folder,

I did a quick test with -print-after-all, and it looks like it's actually created by MachineBlockPlacement?

If we're going to do this transform, we should use the existing TailDup code to do it, not reimplement it in PPCEarlyReturn. Would it make sense to run a re-run the entire tail duplication pass after block placement? Or should we try to do something more targeted?

@efriedma Yes, you are right. This pattern is created by the pass block-placement. If I re-run tail duplication after the pass block-placement, this case will be what we want:

1   .text
 2   .abiversion 2
 3   .file "HashXMLCh.cpp"
 4   .globl  _ZN11xercesc_2_79HashXMLCh6equalsEPKvS2_ # -- Begin function _ZN11xercesc_2_79HashXMLCh6equalsEPKvS2_
 5   .p2align  4
 6   .type _ZN11xercesc_2_79HashXMLCh6equalsEPKvS2_,@function
 7 _ZN11xercesc_2_79HashXMLCh6equalsEPKvS2_: # @_ZN11xercesc_2_79HashXMLCh6equalsEPKvS2_
 8 .Lfunc_begin0:
 9 # %bb.0:                                # %entry
10   cmpdi 1, 4, 0
11   cmpdi 5, 0
12   cror 20, 6, 2
13   bc 4, 20, .LBB0_6
14 # %bb.1:                                # %if.then.i
15   bc 12, 6, .LBB0_3
16 # %bb.2:                                # %land.lhs.true.i
17   lhz 4, 0(4)
18   li 3, 0
19   cmplwi 1, 4, 0
20   bnelr 1
21 .LBB0_3:                                # %lor.lhs.false3.i
22   bc 12, 2, .LBB0_5
23 # %bb.4:                                # %land.lhs.true5.i
24   lhz 4, 0(5)
25   li 3, 0
26   cmplwi  4, 0
27   bnelr 0
28 .LBB0_5:                                # %if.else.i
29   li 3, 1
30   blr
31 .LBB0_6:                                # %while.cond.preheader.i
32   lhz 8, 0(4)
33   lhz 6, 0(5)
34   li 3, 0
35   cmplw 8, 6
36   bnelr 0
37 # %bb.7:                                # %while.body.i.preheader
38   addi 6, 5, 2
39   addi 7, 4, 2
40   .p2align  4
41 .LBB0_8:                                # %while.body.i
42                                         # =>This Inner Loop Header: Depth=1
43   andi. 8, 8, 65535
44   beq 0, .LBB0_5
45 # %bb.9:                                # %if.end12.i
46                                         #   in Loop: Header=BB0_8 Depth=1
47   addi 5, 5, 2
48   addi 4, 4, 2
49   lhz 8, 0(4)
50   lhz 9, 0(5)
51   addi 6, 6, 2
52   addi 7, 7, 2
53   cmplw 8, 9
54   beq 0, .LBB0_8
55   blr

For the good assembly, the .LBB0_10 has been removed, and .LBB0_5 is reserved. All instruction which branch to .LBB0_10 will branch to .LBB0_5.


Below is to analyze the reason for the bad pattern.
Before run the pass block-placement, the mir is below:

asm
body:             |
  bb.0.entry:
    successors: %bb.5(0x40000000), %bb.1(0x40000000)
    liveins: $x4, $x5

    renamable $cr1 = CMPDI renamable $x4, 0
    renamable $cr0 = CMPDI renamable $x5, 0
    renamable $cr5lt = CROR renamable $cr1eq, renamable $cr0eq
    BC killed renamable $cr5lt, %bb.5

  bb.1.while.cond.preheader.i:
    successors: %bb.2(0x40000000), %bb.11(0x40000000)
    liveins: $x4, $x5

    renamable $r8 = LHZ 0, renamable $x4 :: (load 2 from %ir.1, !tbaa !2)
    renamable $r6 = LHZ 0, renamable $x5 :: (load 2 from %ir.0, !tbaa !2)
    renamable $x3 = LI8 0
    renamable $cr0 = CMPLW renamable $r8, killed renamable $r6
    BCC 68, killed renamable $cr0, %bb.11

  bb.2.while.body.i.preheader:
    successors: %bb.3(0x80000000)
    liveins: $r8, $x3, $x4, $x5

    renamable $x6 = ADDI8 renamable $x5, 2
    renamable $x7 = ADDI8 renamable $x4, 2

  bb.3.while.body.i:
    successors: %bb.4(0x04000000), %bb.10(0x7c000000)
    liveins: $r8, $x3, $x4, $x5, $x6, $x7

    dead renamable $r8 = ANDIo killed renamable $r8, 65535, implicit-def $cr0
    BCC 68, killed renamable $cr0, %bb.10

  bb.4:
    renamable $x3 = LI8 1
    BLR8 implicit $lr8, implicit $rm, implicit killed $x3

  bb.5.if.then.i:
    successors: %bb.7(0x30000000), %bb.6(0x50000000)
    liveins: $cr0, $cr1, $x4, $x5

    BC killed renamable $cr1eq, %bb.7

  bb.6.land.lhs.true.i:
    successors: %bb.7(0x30000000), %bb.11(0x50000000)
    liveins: $cr0, $x4, $x5

    renamable $r4 = LHZ 0, killed renamable $x4 :: (load 2 from %ir.4, !tbaa !2)
    renamable $x3 = LI8 0
    renamable $cr1 = CMPLWI killed renamable $r4, 0
    BCC 68, killed renamable $cr1, %bb.11

  bb.7.lor.lhs.false3.i:
    successors: %bb.9(0x30000000), %bb.8(0x50000000)
    liveins: $cr0, $x5

    BC killed renamable $cr0eq, %bb.9

  bb.8.land.lhs.true5.i:
    successors: %bb.9(0x80000000)
    liveins: $x5

    renamable $r4 = LHZ 0, killed renamable $x5 :: (load 2 from %ir.6, !tbaa !2)
    renamable $x3 = LI8 0
    renamable $cr0 = CMPLWI killed renamable $r4, 0
    BCCLR 68, killed renamable $cr0, implicit $lr, implicit $rm, implicit killed $x3

  bb.9.if.else.i:
    renamable $x3 = LI8 1
    BLR8 implicit $lr8, implicit $rm, implicit killed $x3

  bb.10.if.end12.i:
    successors: %bb.3(0x7c000000), %bb.11(0x04000000)
    liveins: $x3, $x4, $x5, $x6, $x7

    renamable $x5 = ADDI8 killed renamable $x5, 2
    renamable $x4 = ADDI8 killed renamable $x4, 2
    renamable $r8 = LHZ 0, renamable $x4 :: (load 2 from %ir.14, !tbaa !2)
    renamable $r9 = LHZ 0, renamable $x5 :: (load 2 from %ir.12, !tbaa !2)
    renamable $x6 = ADDI8 killed renamable $x6, 2
    renamable $x7 = ADDI8 killed renamable $x7, 2
    renamable $cr0 = CMPLW renamable $r8, killed renamable $r9
    BCC 76, killed renamable $cr0, %bb.3

  bb.11._ZN11xercesc_2_79XMLString6equalsEPKtS2_.exit:
    liveins: $x3

    BLR8 implicit $lr8, implicit $rm, implicit killed $x3

The bb.4 and bb.9 are same.
The pass:block-placement will merge the same BasicBlock, so after running pass block-placement, the bb.9 will branch to the same BasicBlock bb.4, it's like below:

bb.9.if.else.i:
; predecessors: %bb.7, %bb.8
  successors: %bb.4(0x80000000); %bb.4(100.00%)

  B %bb.4

In the file llvm/lib/CodeGen/BranchFolding.cpp:

C++
687   // If both blocks are identical and end in a branch, merge them unless they
 688   // both have a fallthrough predecessor and successor.
 689   // We can only do this after block placement because it depends on whether
 690   // there are fallthroughs, and we don't know until after layout.
 691   if (AfterPlacement && I1 == MBB1->begin() && I2 == MBB2->begin()) {
 692     auto BothFallThrough = [](MachineBasicBlock *MBB) {
 693       if (MBB->succ_size() != 0 && !MBB->canFallThrough())
 694         return false;
 695       MachineFunction::iterator I(MBB);
 696       MachineFunction *MF = MBB->getParent();
 697       return (MBB != &*MF->begin()) && std::prev(I)->canFallThrough();
 698     };
 699     if (!BothFallThrough(MBB1) || !BothFallThrough(MBB2))
 700       return true;
 701   }
 702
 703   // If both blocks have an unconditional branch temporarily stripped out,
 704   // count that as an additional common instruction for the following
 705   // heuristics. This heuristic is only accurate for single-succ blocks, so to
 706   // make sure that during layout merging and duplicating don't crash, we check
 707   // for that when merging during layout.
 708   unsigned EffectiveTailLen = CommonTailLen;
 709   if (SuccBB && MBB1 != PredBB && MBB2 != PredBB &&
 710       (MBB1->succ_size() == 1 || !AfterPlacement) &&
 711       !MBB1->back().isBarrier() &&
 712       !MBB2->back().isBarrier())
 713     ++EffectiveTailLen;
 714
 715   // Check if the common tail is long enough to be worthwhile.
 716   if (EffectiveTailLen >= MinCommonTailLength)
 717     return true;

The bb.9 will be optimized to B %bb.4, because the line 700 return true. Maybe we can add more limitaion like below:

699     if ((!BothFallThrough(MBB1) || !BothFallThrough(MBB2))) {
700        if (!MBB1->back().isReturn() ||
701            !MBB2->back().isReturn() ||
702            CommonTailLen >= MinCommonTailLength)
703          return true;
704     }

When MBB1 is same as MBB2, both are end with return instruction and CommonTailLen < MinCommonTailLength, we won't return true. The default value of MinCommonTailLength is 3.

This new patch will do below optimization in the end of block-placement pass:

bb.1:
  B %bb.11

bb.2:
  renamable $x3 = LI8 1
  BLR8 implicit $lr8, implicit $rm, implicit killed $x3
bb.1:
  renamable $x3 = LI8 1
  BLR8 implicit $lr8, implicit $rm, implicit killed $x3

Above pattern will may be created in block-placement pass which is after the tail duplication pass.

@efriedma , I have updated this patch, to avoid call the tail-duplication again after block-placement pass.
In the end of block-placement I will do the optimization for unconditonal branch, this pattern may created by the block-placement pass.

efriedma added inline comments.Jul 15 2019, 1:31 PM
llvm/lib/CodeGen/MachineBlockPlacement.cpp
2741

If you have a block that ends in an unconditional branch, and the successor block has exactly one predecessor and no fallthrough successors, you can merge the two blocks; that's straightforward. But I don't see how the isSimpleBB check is related to that, and the isSuccessor() check seems redundant.

ZhangKang marked an inline comment as done.Jul 15 2019, 7:09 PM
ZhangKang added inline comments.
llvm/lib/CodeGen/MachineBlockPlacement.cpp
2741

Here, I will move all the instructions of TBB to ChainBB and remove TBB, if TBB->pred_size() > 1, I must ensure that ChainBB has only an unconditonal branch.
For below example:

BB:
   XOR 3, 3, 4
   B TBB
...
ChainBB:
   LI 3, 1
   B TBB
...
TBB:
   ADD 3, 3, 4
   BLR

TBB has two predecessors, I cannot move all the instructions of TBB to ChainBB and remove TBB. Of course, if I reserve TBB, I can also do the tail-duplication, but it will increase the size of program or has other side-effect, for those complex pattern, it need to be done by the tail-duplicaton pass. And here, I only optimize the pattern may be created by the block-placement pass.

But if ChainBB has only uncontional branch, I can do it.

BB:
   XOR 3, 3, 4
   B TBB
...
ChainBB:
   B TBB
...
TBB:
   ADD 3, 3, 4
   BLR

Can be optimized to:

BB:
   XOR 3, 3, 4
   B ChainBB
...
ChainBB:
   ADD 3, 3, 4
   BLR

Also TBB->pred_size() == 1, the only one predcessor is ChainBB, I will not care whether ChainBB has only one branch.

efriedma added inline comments.Jul 16 2019, 11:52 AM
llvm/lib/CodeGen/MachineBlockPlacement.cpp
2741

Oh, I didn't realize you were modifying the predecessors of TBB to branch to ChainBB instead. That's fine, I guess. Please update the comment with a more complete description of the transform, and please include tests which cover both the single-predecessor and multiple-predecessor transforms.

Also, a minor suggestion: instead of std::prev(I)->isSuccessor(TBB), maybe call std::prev(I)->canFallThrough() instead?

ZhangKang marked 2 inline comments as done.Jul 18 2019, 12:23 AM
ZhangKang added inline comments.
llvm/lib/CodeGen/MachineBlockPlacement.cpp
2741

Ok, I will modify the comment and add the test.

ZhangKang marked an inline comment as done.

Modify the comments and add test.

efriedma added inline comments.Jul 23 2019, 12:10 PM
llvm/lib/CodeGen/MachineBlockPlacement.cpp
2738

English grammar here is wrong; the first sentence is a fragment... and even if you fixed that it's hard to understand. Maybe something like the following: "Try to merge ChainBB and TBB. This is legal under the following conditions: 1. "...

llvm/test/CodeGen/PowerPC/block-placement-1.mir
303

I don't like the use of CHECK-NOT here; it's really easy for negative checks to fall out of sync with the actual expected output. Usually you can use "CHECK-NEXT" and "CHECK-EMPTY" more aggressively, instead.

ZhangKang updated this revision to Diff 211490.Jul 24 2019, 6:40 AM
ZhangKang marked an inline comment as done.

Modify the test and update the comments.

ZhangKang marked an inline comment as done.Jul 24 2019, 6:41 AM

Have updated the patch.

efriedma added inline comments.Jul 24 2019, 3:58 PM
llvm/lib/CodeGen/MachineBlockPlacement.cpp
2738

Maybe instead of "has only one unconditional branch", say "is empty except for an unconditional branch"

llvm/test/CodeGen/PowerPC/block-placement-1.mir
302

"CHECK-NET"?

ZhangKang updated this revision to Diff 211656.Jul 24 2019, 6:56 PM
ZhangKang marked 3 inline comments as done.

Modify the comment and fix the typo.

Have updated the patch.

llvm/lib/CodeGen/MachineBlockPlacement.cpp
2738

I will do it.

llvm/test/CodeGen/PowerPC/block-placement-1.mir
302

CHECK-NEXT, I will fix it.

This revision is now accepted and ready to land.Jul 25 2019, 1:27 PM
ZhangKang retitled this revision from [PowerPC] Do the Early Return for the li and unconditional branch to [PowerPC] Do the Simple Early Return in block-placement pass to optimize the blocks.Jul 25 2019, 6:53 PM
ZhangKang edited the summary of this revision. (Show Details)
This revision was automatically updated to reflect the committed changes.
ZhangKang retitled this revision from [PowerPC] Do the Simple Early Return in block-placement pass to optimize the blocks to [CodeGen] Do the Simple Early Return in block-placement pass to optimize the blocks.Jul 27 2019, 1:50 AM
ZhangKang updated this revision to Diff 212048.Jul 27 2019, 2:11 AM

The old patch has the memory leak error.
The new patch fix the memory leak error by using:

FunctionChain.remove(TBB);
BlockToChain.erase(TBB);

instead

F->remove(TBB);
efriedma reopened this revision.Jul 29 2019, 12:39 PM

You probably want F->erase(TBB), which both removes TBB from the list of blocks in the function, and deallocates TBB. I guess the empty block with no predecessors doesn't really matter much, in the long run, but easier to understand if the transform cleans up after itself properly.

This revision is now accepted and ready to land.Jul 29 2019, 12:39 PM

You probably want F->erase(TBB), which both removes TBB from the list of blocks in the function, and deallocates TBB. I guess the empty block with no predecessors doesn't really matter much, in the long run, but easier to understand if the transform cleans up after itself properly.

Yes, the empty block with no predecessors doesn't really matter much. Here, if I use F->erase(TBB), the memory leak error is still existed.
BlockToChain is the map which contain all BB.
In MachineBlockPlacement::maybeTailDuplicateBlock(), there is similar code to remove the BB.

2933         // Remove from the Chain and Chain Map
2934         if (BlockToChain.count(RemBB)) {
                    ...
2937           Chain->remove(RemBB);
2938           BlockToChain.erase(RemBB);
2939         }
2940

Here, if I use F->erase(TBB), the memory leak error is still existed.

What exactly is leaking? (If you're calling erase(), it isn't the MBB itself.)

Of course, that isn't a substitute for calling FunctionChain.remove etc.

Looking over this patch again, some of the other work involved in updating various data structures isn't complete here; the MachineDominatorTree/MachinePostDominatorTree isn't updated, MachineLoopInfo isn't updated.

Here, if I use F->erase(TBB), the memory leak error is still existed.

What exactly is leaking? (If you're calling erase(), it isn't the MBB itself.)

Of course, that isn't a substitute for calling FunctionChain.remove etc.

Looking over this patch again, some of the other work involved in updating various data structures isn't complete here; the MachineDominatorTree/MachinePostDominatorTree isn't updated, MachineLoopInfo isn't updated.

I have made a mistake to say if I use F->erase(TBB), the memory leak error is still existed..

The memory leak is TBB. I should say that if I use F->remove(), I will get memory leak error, but If use F->erase(TBB), the memory leak error will disappear, but I will get ERROR: AddressSanitizer: use-after-poison.

Analyse the ERROR: AddressSanitizer: use-after-poison when using F->erase(TBB)

2714   for (MachineBasicBlock *ChainBB : FunctionChain) {
2715     Cond.clear();
2716     MachineBasicBlock *TBB = nullptr, *FBB = nullptr; // For AnalyzeBranch.
2717     if (!TII->analyzeBranch(*ChainBB, TBB, FBB, Cond, /*AllowModify*/ true)) {

If I use F->erase(TBB), the FunctionChain still has the element which I have removed, so I will get the ERROR: AddressSanitizer: use-after-poison when call 2717 if (!TII->analyzeBranch(*ChainBB, TBB, FBB, Cond, /*AllowModify*/ true)) {.
I should remember those empty BasicBlocks, and do the clean work after the 2714 for loop exit.

Move the clean EmptyBB work out of the for loop.

Okay, delaying some of the work to sense, since you can't really modify FunctionChain while you're iterating over it.

Any response to my other comment?

Looking over this patch again, some of the other work involved in updating various data structures isn't complete here; the MachineDominatorTree/MachinePostDominatorTree isn't updated, MachineLoopInfo isn't updated.

Okay, delaying some of the work to sense, since you can't really modify FunctionChain while you're iterating over it.

Any response to my other comment?

Looking over this patch again, some of the other work involved in updating various data structures isn't complete here; the MachineDominatorTree/MachinePostDominatorTree isn't updated, MachineLoopInfo isn't updated.

My code is in optimizeBranches() function, after the optimizeBranches() function, the function MachineBlockPlacement::runOnMachineFunction will do some clean work and will never use the MachinePostDominatorTree info, and this pass don't preserve the MachinePostDominatorTree and never use MachineDominatorTree, so I need not update the MachineDominatorTree/MachinePostDominatorTree.

And I will add the code to update the MachineLoopInfo, becausee it will be used in alignBlocks().

ZhangKang updated this revision to Diff 212981.Aug 2 2019, 1:15 AM

Update the MachineLoopInfo.

Can you remove the dead MachineDominatorTree *MDT; declaration?

the function MachineBlockPlacement::runOnMachineFunction will do some clean work and will never use the MachinePostDominatorTree info, and this pass don't preserve the MachinePostDominatorTree

Please add this explanation as an explicit comment in the code.

Can you remove the dead MachineDominatorTree *MDT; declaration?

the function MachineBlockPlacement::runOnMachineFunction will do some clean work and will never use the MachinePostDominatorTree info, and this pass don't preserve the MachinePostDominatorTree

Please add this explanation as an explicit comment in the code.

This pass don't definite and use the MachineDominatorTree, it only defines and uses the MachinePostDominatorTree. I will add the comment about the change of MachinePostDominatorTree.

ZhangKang updated this revision to Diff 213190.Aug 3 2019, 8:23 AM

Add the comment about post-dominator tree.

@efriedma , I have updated the patch, do you have any comments?

ZhangKang updated this revision to Diff 214514.Aug 10 2019, 2:32 AM

The patch [MBP] Disable aggressive loop rotate in plain mode has modified the test case, so I update the test case to sync the test case.

This revision was automatically updated to reflect the committed changes.

Please provide a quick fix or revert the patch causing the crash due to transforming to incorrect IR.

llvm/trunk/lib/CodeGen/MachineBlockPlacement.cpp
2794 ↗(On Diff #214515)

This patches causes a crash because it does not take into account jump tables.

If TBB is a target of jump table you eliminate the machine basic block and jump table points to dead block now resulting in bad IR.

I guess that it was expected that ReplaceUsesOfBlockWith will do all things to replace usage of TBB but I see that it does not deal with jump tables.

jsji added a comment.Aug 27 2019, 8:01 AM

Please provide a quick fix or revert the patch causing the crash due to transforming to incorrect IR.

@ZhangKang is on vacation, contacted him, and reverted the commit on behalf of him in https://reviews.llvm.org/rL370069

jsji reopened this revision.Aug 27 2019, 8:11 AM
This revision is now accepted and ready to land.Aug 27 2019, 8:11 AM
jsji requested changes to this revision.Aug 27 2019, 8:11 AM
This revision now requires changes to proceed.Aug 27 2019, 8:11 AM

This reproducer fails with assertion:

] cat reduced.ll
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128-ni:1"
target triple = "x86_64-unknown-linux-gnu"

define void @Init(i8 addrspace(1)* %this, i8 addrspace(1)* %0) #0 gc "statepoint-example" personality i32* ()* @personality_function !prof !1 {
bci_0:
  %xored = xor i64 0, -1
  %1 = inttoptr i64 %xored to i8*
  %2 = addrspacecast i8* %1 to i8 addrspace(1)*
  %result.i.i = select i1 false, i8 addrspace(1)* null, i8 addrspace(1)* %2
  switch i32 undef, label %mismatch [
    i32 2119, label %not_zero53
    i32 2122, label %not_zero53
    i32 2125, label %not_zero56
    i32 2128, label %not_zero53
  ]
not_zero53:
  fence release
  ret void

not_zero56:                                       ; preds = %bci_0
  %3 = icmp eq i8 addrspace(1)* %result.i.i, null
  br i1 %3, label %exit, label %not_zero59, !make.implicit !11

not_zero59:                                       ; preds = %not_zero56
  store atomic i8 1, i8 addrspace(1)* undef unordered, align 1
  br label %not_zero53

mismatch:                                         ; preds = %bci_0
  %statepoint_token18 = invoke token (i64, i32, i32 (i32, i8 addrspace(1)*)*, i32, i32, ...) @llvm.experimental.gc.statepoint.p0f_i32i32p1i8f(i64 1, i32 16, i32 (i32, i8 addrspace(1)*)* null, i32 2, i32 0, i32 undef, i8 addrspace(1)* %this, i32 0, i32 13, i32 0, i32 0, i32 0, i32 66, i32 0, i32 3, i32 0, i32 0, i8 addrspace(1)* %this, i32 0, i8 addrspace(1)* nonnull %0, i32 7, i8* null)
          to label %"exit" unwind label %exceptional_return46

"exit":
  unreachable
exceptional_return46:
  %lpad_token = landingpad token
          cleanup
  unreachable
}

declare i32* @personality_function() 
declare token @llvm.experimental.gc.statepoint.p0f_i32i32p1i8f(i64 immarg, i32 immarg, i32 (i32, i8 addrspace(1)*)*, i32 immarg, i32 immarg, ...)

!1 = !{!"function_entry_count", i64 32768}
!11 = !{}
] bin/llc -enable-implicit-null-checks reduced.ll
llc: /home/fsergeev/ws/llvm-upstream/llvm-mono/llvm/lib/CodeGen/AsmPrinter/AsmPrinter.cpp:1837: void llvm::AsmPrinter::EmitJumpTableEntry(const llvm::MachineJumpTableInfo*, const llvm::MachineBasicBlock*, unsigned int) const: Assertion `MBB && MBB->getNumber() >= 0 && "Invalid basic block"' failed.
...

It starts passing after the revert.

ZhangKang updated this revision to Diff 218595.Sep 3 2019, 11:29 PM

Fix the error for jump table.

New changes look okay... but maybe someone else should look too, since I've missed multiple serious issues with the CFG updating code.

This revision was not accepted when it landed; it landed in state Needs Review.Sep 6 2019, 1:14 AM
This revision was automatically updated to reflect the committed changes.
rnk added a subscriber: rnk.Oct 4 2019, 3:23 PM

I reverted this in rL373805, it caused https://bugs.llvm.org/show_bug.cgi?id=43566, which has to do with Windows EH. This code in general doesn't seem to consider address-taken blocks, so the problem is probably wider than just Windows EH. See the bug for a test case.