This is an archive of the discontinued LLVM Phabricator instance.

[BPF] Replace BPFMIPeepholeTruncElim by custom logic in isZExtFree()
ClosedPublic

Authored by eddyz87 on Aug 14 2023, 7:07 AM.

Details

Summary

Replace BPFMIPeepholeTruncElim by adding an overload of TargetLowering::isZExtFree() that is aware that zero extension is free for ISD::LOAD.

Short description

The BPFMIPeepholeTruncElim pass handles two patterns:

Pattern #1:

%1 = LDB %0, ...              %1 = LDB %0, ...
%2 = AND_ri %1, 0xff      ->  %2 = MOV_rr %1    <-- (!)

Pattern #2:

bb.1:                         bb.1:
  %a = LDB %0, ...              %a = LDB %0, ...
  br %bb3                       br %bb3
bb.2:                         bb.2:
  %b = LDB %0, ...        ->    %b = LDB %0, ...
  br %bb3                       br %bb3
bb.3:                         bb.3:
  %1 = PHI %a, %b               %1 = PHI %a, %b
  %2 = AND_ri %1, 0xff          %2 = MOV_rr %1  <-- (!)

Plus variations:

  • AND_ri_32 instead of AND_ri
  • SLL/SRL instead of AND_ri (illustrated below)
  • LDH, LDW, LDB32, LDH32, LDW32
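
For illustration, the shift-based variation of pattern #1 looks roughly as follows (a sketch; the shift amount 56 corresponds to LDB, i.e. an 8-bit value zero-extended inside a 64-bit register):

%1 = LDB %0, ...              %1 = LDB %0, ...
%2 = SLL_ri %1, 56        ->  %3 = MOV_rr %1
%3 = SRL_ri %2, 56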

Both patterns can be handled by built-in transformations at the instruction selection phase if a suitable isZExtFree() implementation is provided. The idea is borrowed from ARMTargetLowering::isZExtFree.
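
A condensed sketch of such an overload, modeled on the ARM implementation (written from memory for illustration, not copied verbatim from the patch):

// BPFISelLowering.cpp (sketch): zero extension of a narrow load is free,
// since BPF loads from memory already zero-extend into the 64-bit register.
bool BPFTargetLowering::isZExtFree(SDValue Val, EVT VT2) const {
  EVT VT1 = Val.getValueType();
  if (Val.getOpcode() == ISD::LOAD && VT1.isSimple() && VT2.isSimple()) {
    MVT MT1 = VT1.getSimpleVT().SimpleTy;
    MVT MT2 = VT2.getSimpleVT().SimpleTy;
    if ((MT1 == MVT::i8 || MT1 == MVT::i16 || MT1 == MVT::i32) &&
        (MT2 == MVT::i32 || MT2 == MVT::i64))
      return true;
  }
  return TargetLoweringBase::isZExtFree(Val, VT2);
}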

When evaluated on BPF kernel selftests and the remove_truncate_*.ll LLVM test cases, this revision performs slightly better than BPFMIPeepholeTruncElim; see the "Impact" section below for details.

The commit also adds a few test cases to make sure that the patterns in question are handled.

Long description

Why this works: Pattern #1

Consider the following example:

define i1 @foo(ptr %p) {
entry:
  %a = load i8, ptr %p, align 1
  %cond = icmp eq i8 %a, 0
  ret i1 %cond
}

Log for the llc -mcpu=v2 -mtriple=bpfel -debug-only=isel command:

...
Type-legalized selection DAG: %bb.0 'foo:entry'
SelectionDAG has 13 nodes:
  t0: ch,glue = EntryToken
          t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
      t19: i64 = and t16, Constant:i64<255>
    t17: i64 = setcc t19, Constant:i64<0>, seteq:ch
  t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
  t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Replacing.1 t19: i64 = and t16, Constant:i64<255>
With: t16: i64,ch = load<(load (s8) from %ir.p), anyext from i8> t0, t2, undef:i64
 and 0 other values
...
Optimized type-legalized selection DAG: %bb.0 'foo:entry'
SelectionDAG has 11 nodes:
  t0: ch,glue = EntryToken
        t2: i64,ch = CopyFromReg t0, Register:i64 %0
      t20: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
    t17: i64 = setcc t20, Constant:i64<0>, seteq:ch
  t11: ch,glue = CopyToReg t0, Register:i64 $r0, t17
  t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...

Note:

  • Optimized type-legalized selection DAG:
    • t19 = and t16, 255 has been replaced by t16 (the load).
    • Patterns like (and (load ... i8), 255) are replaced by a zero-extending load in DAGCombiner::BackwardsPropagateMask, called from DAGCombiner::visitAND.
    • Similarly, patterns like (srl (shl ..., 56), 56) are replaced by (and ..., 255) in DAGCombiner::visitSRL (this function is huge; look for the TLI.shouldFoldConstantShiftPairToMask() call). Both rewrites are shown schematically below.
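
Schematically, the two rewrites are:

(and (load ... anyext from i8), 255)  ->  (load ... zext from i8)   ; visitAND
(srl (shl x, 56), 56)                 ->  (and x, 255)              ; visitSRL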

Why this works: Pattern #2

Consider the following example:

define i1 @foo(ptr %p) {
entry:
  %a = load i8, ptr %p, align 1
  br label %next

next:
  %cond = icmp eq i8 %a, 0
  ret i1 %cond
}

Consider the log for the same llc -mcpu=v2 -mtriple=bpfel -debug-only=isel command.
Log for first basic block:

Initial selection DAG: %bb.0 'foo:entry'
SelectionDAG has 9 nodes:
  t0: ch,glue = EntryToken
  t3: i64 = Constant<0>
        t2: i64,ch = CopyFromReg t0, Register:i64 %1
      t5: i8,ch = load<(load (s8) from %ir.p)> t0, t2, undef:i64
    t6: i64 = zero_extend t5
  t8: ch = CopyToReg t0, Register:i64 %0, t6
...
Replacing.1 t6: i64 = zero_extend t5
With: t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
 and 0 other values
...
Optimized lowered selection DAG: %bb.0 'foo:entry'
SelectionDAG has 7 nodes:
  t0: ch,glue = EntryToken
      t2: i64,ch = CopyFromReg t0, Register:i64 %1
    t9: i64,ch = load<(load (s8) from %ir.p), zext from i8> t0, t2, undef:i64
  t8: ch = CopyToReg t0, Register:i64 %0, t9

Note:

  • Initial selection DAG:
    • %a = load ... is lowered as t6 = (zero_extend (load ...)); without the isZExtFree() overload added by this commit it would instead be lowered as t6 = (any_extend (load ...)).
    • The decision to generate zero_extend or any_extend is made in RegsForValue::getCopyToRegs, called from SelectionDAGBuilder::CopyValueToVirtualRegister (see the sketch after these notes):
      • if isZExtFree() for the load returns true, zero_extend is used;
      • otherwise, any_extend is used.
  • Optimized lowered selection DAG:
    • t6 = (zero_extend (load ...)) is replaced by t9 = load ..., zext from i8. This is done by DAGCombiner.cpp:tryToFoldExtOfLoad(), called from DAGCombiner::visitZERO_EXTEND.
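
For reference, here is a condensed sketch of the relevant logic in RegsForValue::getCopyToRegs (an approximation for illustration, not verbatim LLVM code):

// When copying a value into a wider virtual register, prefer zero_extend
// whenever the target reports that the extension is free.
ISD::NodeType ExtendKind = PreferredExtendType;
if (ExtendKind == ISD::ANY_EXTEND && TLI.isZExtFree(Val, RegisterVT))
  ExtendKind = ISD::ZERO_EXTEND;
// ... the chosen extension kind is then used to emit the copy.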

Log for second basic block:

Initial selection DAG: %bb.1 'foo:next'
SelectionDAG has 13 nodes:
  t0: ch,glue = EntryToken
            t2: i64,ch = CopyFromReg t0, Register:i64 %0
          t4: i64 = AssertZext t2, ValueType:ch:i8
        t5: i8 = truncate t4
      t8: i1 = setcc t5, Constant:i8<0>, seteq:ch
    t9: i64 = any_extend t8
  t11: ch,glue = CopyToReg t0, Register:i64 $r0, t9
  t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Replacing.2 t18: i64 = and t4, Constant:i64<255>
With: t4: i64 = AssertZext t2, ValueType:ch:i8
...
Type-legalized selection DAG: %bb.1 'foo:next'
SelectionDAG has 13 nodes:
  t0: ch,glue = EntryToken
          t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t4: i64 = AssertZext t2, ValueType:ch:i8
      t18: i64 = and t4, Constant:i64<255>
    t16: i64 = setcc t18, Constant:i64<0>, seteq:ch
  t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
  t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...
Optimized type-legalized selection DAG: %bb.1 'foo:next'
SelectionDAG has 11 nodes:
  t0: ch,glue = EntryToken
        t2: i64,ch = CopyFromReg t0, Register:i64 %0
      t4: i64 = AssertZext t2, ValueType:ch:i8
    t16: i64 = setcc t4, Constant:i64<0>, seteq:ch
  t11: ch,glue = CopyToReg t0, Register:i64 $r0, t16
  t12: ch = BPFISD::RET_GLUE t11, Register:i64 $r0, t11:1
...

Note:

  • Initial selection DAG:
    • Virtual register %0 is an input value for this basic block; it corresponds to the load instruction (t9) from the first basic block.
    • It is accessed within the basic block via t4 (AssertZext (CopyFromReg t0, ...)).
    • The AssertZext is generated by RegsForValue::getCopyFromRegs called from SelectionDAGBuilder::getCopyFromRegs; it is generated only when LiveOutInfo with a known number of leading zeros is present for %0.
    • Known register bits in LiveOutInfo are computed by SelectionDAG::computeKnownBits called from SelectionDAGISel::ComputeLiveOutVRegInfo.
    • computeKnownBits() generates leading-zeros information for (load ..., zext from ...) but *does not* generate leading-zeros information for (load ..., anyext from ...). This is why the isZExtFree() overload added by this commit is important (see the sketch after these notes).
  • Type-legalized selection DAG:
    • t5 = truncate t4 is replaced by t18 = and t4, 255
  • Optimized type-legalized selection DAG:
    • t18 = and t4, 255 is replaced by t4. This is done by DAGCombiner::SimplifyDemandedBits, called from DAGCombiner::visitAND, which simplifies patterns like (and (assertzext ...)).
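
A condensed sketch of the load handling inside SelectionDAG::computeKnownBits (an approximation for illustration, not verbatim LLVM code):

// Op is the SDValue being analyzed, Known accumulates the result.
if (LoadSDNode *LD = dyn_cast<LoadSDNode>(Op)) {
  if (ISD::isZEXTLoad(LD)) {
    // Bits above the memory width of a zextload are known to be zero,
    // so LiveOutInfo gets leading-zeros information for the vreg.
    Known.Zero.setBitsFrom(LD->getMemoryVT().getScalarSizeInBits());
  }
  // For an any-extending load the high bits stay unknown, so no
  // leading-zeros information is recorded and no AssertZext is emitted.
}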

Impact

This change covers all remove_truncate_*.ll test cases:

  • for -mcpu=v4 there are no changes in the generated code;
  • for -mcpu=v2 the code generated for remove_truncate_7 and remove_truncate_8 improved slightly; for other tests it is unchanged.

For remove_truncate_7:

Before this revision                 After this revision
--------------------                 -------------------
    r1 <<= 0x20                          r1 <<= 0x20
    r1 >>= 0x20                          r1 >>= 0x20
    if r1 == 0x0 goto +0x2 <LBB0_2>      if r1 == 0x0 goto +0x2 <LBB0_2>
    r1 = *(u32 *)(r2 + 0x0)              r0 = *(u32 *)(r2 + 0x0)
    goto +0x1 <LBB0_3>                   goto +0x1 <LBB0_3>
<LBB0_2>:                            <LBB0_2>:
    r1 = *(u32 *)(r2 + 0x4)              r0 = *(u32 *)(r2 + 0x4)
<LBB0_3>:                            <LBB0_3>:
    r0 = r1                              exit
    exit

For remove_truncate_8:

Before this revision                 After this revision
--------------------                 -------------------
    r2 = *(u32 *)(r1 + 0x0)              r2 = *(u32 *)(r1 + 0x0)
    r3 = r2                              r3 = r2
    r3 <<= 0x20                          r3 <<= 0x20
    r4 = r3                              r3 s>>= 0x20
    r4 s>>= 0x20
    if r4 s> 0x2 goto +0x5 <LBB0_3>      if r3 s> 0x2 goto +0x4 <LBB0_3>
    r4 = *(u32 *)(r1 + 0x4)              r3 = *(u32 *)(r1 + 0x4)
    r3 >>= 0x20
    if r3 >= r4 goto +0x2 <LBB0_3>       if r2 >= r3 goto +0x2 <LBB0_3>
    r2 += 0x2                            r2 += 0x2
    *(u32 *)(r1 + 0x0) = r2              *(u32 *)(r1 + 0x0) = r2
<LBB0_3>:                            <LBB0_3>:
    r0 = 0x3                             r0 = 0x3
    exit                                 exit

For kernel BPF selftests the statistics are as follows:

  • For -mcpu=v4: 9 out of 655 object files have differences; in all cases the total number of instructions marginally decreased (-27 instructions).
  • For -mcpu=v2: 21 out of 655 object files have differences:
    • For 19 object files the number of instructions decreased (-129 instructions in total): some redundant rX &= 0xffff operations and register-to-register assignments were removed;
    • For 2 object files the number of instructions increased (+2 instructions in each file).

Both -mcpu=v2 instruction increases reduce to the same example:

define void @foo(ptr %p) {
entry:
  %a = load i32, ptr %p, align 4
  %b = sext i32 %a to i64
  %c = icmp ult i64 1, %b
  br i1 %c, label %next, label %end

next:
  call void inttoptr (i64 62 to ptr)(i32 %a)
  br label %end

end:
  ret void
}

Note that this example uses the value loaded into %a both as sign extended (%b) and as zero extended (%a passed as a parameter). Here is the difference in the final assembly code:

Before this revision          After this revision
--------------------          -------------------
    r1 = *(u32 *)(r1 + 0)         r1 = *(u32 *)(r1 + 0)
    r1 <<= 32                     r1 <<= 32
    r1 s>>= 32                    r1 s>>= 32
    if r1 < 2 goto <LBB0_2>       if r1 < 2 goto <LBB0_2>
                                  r1 <<= 32
                                  r1 >>= 32
    call 62                       call 62
<LBB0_2>:                     <LBB0_2>:
    exit                          exit

Before this commit %a is passed to the call as a sign-extended value; after this commit %a is passed as a zero-extended value. Both are correct, since the 32-bit sub-register is the same either way.

The difference comes from the DAGCombiner's operation on the initial DAG.
Initial selection DAG before this commit:

t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
      t6: i64 = any_extend t5         <--------------------- (1)
    t8: ch = CopyToReg t0, Register:i64 %0, t6
        t9: i64 = sign_extend t5
      t12: i1 = setcc Constant:i64<1>, t9, setult:ch

Initial selection DAG after this commit:

t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
      t6: i64 = zero_extend t5        <--------------------- (2)
    t8: ch = CopyToReg t0, Register:i64 %0, t6
        t9: i64 = sign_extend t5
      t12: i1 = setcc Constant:i64<1>, t9, setult:ch

The node t9 is processed before node t6, and the load instruction is combined into a load with sign extension:

Replacing.1 t9: i64 = sign_extend t5
With: t30: i64,ch = load<(load (s32) from %ir.p), sext from i32> t0, t2, undef:i64
 and 0 other values
Replacing.1 t5: i32,ch = load<(load (s32) from %ir.p)> t0, t2, undef:i64
With: t31: i32 = truncate t30
 and 1 other values

This is done by DAGCombiner.cpp:tryToFoldExtOfLoad(), called from DAGCombiner::visitSIGN_EXTEND. Note that t5 is used by t6, which is any_extend in (1) and zero_extend in (2). tryToFoldExtOfLoad() rewrites such uses of t5 differently:

  • the any_extend is simply removed;
  • the zero_extend is replaced by (and t30, 0xffffffff), which is later converted to a pair of shifts. This pair of shifts survives until the end of translation (illustrated below).
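
Schematically (t31 is the truncate inserted for the remaining uses of t5):

t30: i64,ch = load ..., sext from i32
t31: i32 = truncate t30
(any_extend t31)   ->  t30                     ; high bits are don't-care
(zero_extend t31)  ->  (and t30, 0xffffffff)   ; later becomes the shl/srl pair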

Event Timeline

eddyz87 created this revision. Aug 14 2023, 7:07 AM
Herald added a project: Restricted Project. Aug 14 2023, 7:07 AM
Herald added a subscriber: hiraditya.
eddyz87 updated this revision to Diff 551319. Aug 17 2023, 4:35 PM
eddyz87 edited the summary of this revision.

Rebase, added detailed commit message.

eddyz87 updated this revision to Diff 551320. Aug 17 2023, 4:37 PM
eddyz87 edited the summary of this revision.

Commit message fixes.

eddyz87 edited the summary of this revision. Aug 17 2023, 4:44 PM

Coincidentally, this also fixes the following bug: when LLVM is compiled with LLVM_ENABLE_EXPENSIVE_CHECKS, some BPF kernel selftests fail to build with the following error:

*** Bad machine code: Illegal virtual register for instruction ***
- function:    _dissect
- basic block: %bb.9 if.end10 (0x621000e80d98)
- instruction: %24:gpr32 = MOV_rr %8:gpr32, debug-location !339; progs/bpf_flow.c:120:2
- operand 0:   %24:gpr32
Expected a GPR register, but got a GPR32 register

Using llvm-reduce it is possible to isolate the following test case:

define i1 @foo(ptr %p) {
entry:
  %short = load i16, ptr %p, align 2
  br label %next

next:
  %cond = icmp eq i16 %short, 0
  ret i1 %cond
}

Here is how this code looks before and after the BPFMIPeepholeTruncElim transformation:

  Before BPFMIPeepholeTruncElim             After BPFMIPeepholeTruncElim
  -----------------------------             ----------------------------
bb.0.entry:                               bb.0.entry:
  %1:gpr = COPY $r1                         %1:gpr = COPY $r1
  %0:gpr32 = LDH32 %1:gpr, 0                %0:gpr32 = LDH32 %1:gpr, 0

bb.1.next:                                bb.1.next:
  %2:gpr32 = AND_ri_32 %0:gpr32, 65535      %2:gpr32 = MOV_rr %0:gpr32
             ^^^^^^^^^                                 ^^^^^^
  %4:gpr32 = MOV_ri_32 1                    %4:gpr32 = MOV_ri_32 1
  JEQ_ri_32 %2:gpr32, 0, %bb.3              JEQ_ri_32 %2:gpr32, 0, %bb.3

Note the MOV_rr instruction, introduced by the transformation, used with 32-bit sub-registers; the machine verifier expects MOV_rr to operate on 64-bit GPR registers.
This happens because of the following code:

bool BPFMIPeepholeTruncElim::eliminateTruncSeq() {
  MachineInstr* ToErase = nullptr;
  bool Eliminated = false;

  for (MachineBasicBlock &MBB : *MF) {
    for (MachineInstr &MI : MBB) {
      ...
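      // Note: unconditionally uses 64-bit MOV_rr here, even when DstReg
      // and SrcReg are 32-bit (gpr32) registers -- the source of the
      // "Bad machine code" error above.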
      BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(BPF::MOV_rr), DstReg)
              .addReg(SrcReg);
      ...
    }
  }

  return Eliminated;
}

The basic fix is to select between MOV_rr and MOV_rr_32 in the above snippet; a sketch is given below. This fix leads to some size increase in the object files generated for BPF selftests: out of 655 object files, 30 have more instructions; in total, 107 more instructions are generated.
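
A minimal sketch of such a fix (an assumption for illustration; not the actual committed change, since this revision removes the pass entirely):

// Pick the move opcode matching the register class of the destination.
const TargetRegisterClass *RC = MRI->getRegClass(DstReg);
unsigned MovOp = (RC == &BPF::GPR32RegClass) ? BPF::MOV_rr_32 : BPF::MOV_rr;
BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(MovOp), DstReg)
    .addReg(SrcReg);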

The increase is caused by a difference in BPFMIPreEmitPeephole::eliminateRedundantMov() behavior: it removes operations of the form rA = MOV_rr rA, but does not remove wA = MOV_rr_32 wA (removing the latter would be unsafe, because the 32-bit move also zeroes the upper 32 bits of the 64-bit register). E.g. for the running example:

  Before the basic fix                      After the basic fix
  --------------------                      -------------------
bb.0.entry:                               bb.0.entry:
  $w1 = LDH32 killed $r1, 0                 $w1 = LDH32 killed $r1, 0
  $w1 = MOV_rr killed $w1   <--- (1)        $w1 = MOV_rr_32 killed $w1   <--- (2)
  $w0 = MOV_ri_32 1                         $w0 = MOV_ri_32 1
  JEQ_ri_32 killed $w1, 0, %bb.2            JEQ_ri_32 killed $w1, 0, %bb.2
bb.1.next:                                bb.1.next:
  $w0 = MOV_ri_32 0                         $w0 = MOV_ri_32 0
bb.2.next:                                bb.2.next:
  RET implicit $w0                          RET implicit $w0

eliminateRedundantMov() will remove (1) but won't remove (2).

So, all in all, I think this revision has an advantage over the basic fix.

eddyz87 published this revision for review. Aug 17 2023, 4:52 PM
eddyz87 added a reviewer: yonghong-song.

Hi Yonghong, could you please take a look?
As described here, this fixes one of the selftest compilation bugs that occur when LLVM_ENABLE_EXPENSIVE_CHECKS is on. Sorry for the long description; I tried to explain why instruction selection would always cover the cases that BPFMIPeepholeTruncElim covered.

Herald added a project: Restricted Project. Aug 17 2023, 4:52 PM
eddyz87 edited the summary of this revision. Aug 17 2023, 5:01 PM
yonghong-song accepted this revision. Aug 18 2023, 11:37 AM

Thanks, Eduard. Your change makes sense. I wish I had known about isZExtFree() for load insns much earlier, so we would not need a pass to remove redundant code...

This revision is now accepted and ready to land. Aug 18 2023, 11:37 AM
eddyz87 updated this revision to Diff 552082. Aug 21 2023, 11:04 AM
eddyz87 edited the summary of this revision.

Rebase, want to see a green CI build.